MORE is better: a new regression-based evaluation pipeline for testing and combining motif models

Esther T. Chan [1,2,3,*], Lourdes Pena Castillo [2,3,*], Gwenael Badis [2,3], Shaheynoor Talukder [2,3], Ally Yang [2,3] and Timothy R. Hughes [1,2,3]

[1] Department of Molecular Genetics
[2] Banting and Best Department of Medical Research
[3] Terrence Donnelly Centre for Cellular and Biomolecular Research
University of Toronto, 160 College Street, Toronto, Ontario, Canada, M5S 1A8

Corresponding author: t.hughes@utoronto.ca
Phone: 416-946-7838
Fax: 416-978-8528

Last updated: Sat 19 Jan, 2008

Abstract

The sequence specificity of transcription factors (TFs) is typically represented as a single position weight matrix (PWM). However, TFs often bind sequences that differ from the consensus sequence, and a generalized single PWM representation may be insufficient to capture the full spectrum of sequence-binding specificities of a DNA-binding protein. Here, we consider the problem of identifying a set of binding specificities that best explains the DNA-binding preferences of a given transcription factor. We describe a novel regression-based evaluation method named MORE (Multiple-motif Regression-based Evaluator) and perform a systematic evaluation of the PWMs generated by four well-known motif-finding algorithms (AlignACE, BioProspector, MEME, and MotifSampler) and a new motif-finding algorithm called Kafal (K-mer affinity align). MORE begins with a redundancy reduction step, in which the number of motif models derived by motif-finding algorithms is reduced by assessing pairwise similarity between the motif models. Next, linear regression followed by a cross-validation procedure is applied to model the relationship between motif models and the measured binding data. Finally, the motif model or set of motif models that best explains the observed data is selected.

MORE is a stable (consistent in its selection of motifs) and robust (resistant to noise) evaluation method for selecting the set of binding preferences that best explains DNA-binding data. Our results provide evidence that multiple PWMs are often a better representation of the binding preferences of a transcription factor than the generalized single PWM representation.
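
The three stages described in the abstract (redundancy reduction, cross-validated linear regression, and selection of the best motif set) can be illustrated with a minimal sketch. This is not the MORE implementation. It assumes that each motif model has already been scored against every probe, giving a matrix X (probes by motifs) and a vector y of measured binding values such as z-scores. The function names, the use of correlation between score profiles as a stand-in for pairwise motif similarity, the greedy forward selection, and the thresholds are all assumptions made for illustration.

```python
import numpy as np

def reduce_redundancy(X, threshold=0.9):
    """Keep a motif column only if it is not too similar to one already kept.

    Assumption: similarity is the absolute Pearson correlation of score profiles.
    """
    kept = []
    for j in range(X.shape[1]):
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < threshold for k in kept):
            kept.append(j)
    return kept

def cv_r2(X, y, n_folds=5, seed=0):
    """Mean held-out R^2 of an ordinary least-squares fit with an intercept."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(fold)), X[fold]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        resid = y[fold] - A_test @ coef
        total = y[fold] - y[fold].mean()
        scores.append(1.0 - (resid @ resid) / (total @ total))
    return float(np.mean(scores))

def select_motifs(X, y, threshold=0.9, min_gain=0.01):
    """Greedy forward selection of motif columns by cross-validated R^2.

    A motif is added only if it improves the held-out R^2 by at least min_gain.
    """
    candidates = reduce_redundancy(X, threshold)
    selected, best = [], 0.0
    while candidates:
        score, j = max((cv_r2(X[:, selected + [j]], y), j) for j in candidates)
        if score - best < min_gain:
            break
        selected.append(j)
        candidates.remove(j)
        best = score
    return selected, best
```

Calling select_motifs(X, y) returns the indices of the chosen motif columns and their cross-validated R^2. When more than one index is returned, a combination of motifs explained held-out binding data better than any single motif under these assumptions.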

Supplementary Data:

PBM data
- z-scores [gzip]
Position weight matrices
- Agreement [gzip]
- AlignACE [gzip]
- BioProspector [gzip]
- Kafal [gzip]
- MEME [gzip]
- MotifSampler [gzip]
MORE output
- Agreement A PWMs vs B data [txt]
- Agreement B PWMs vs A data [txt]
- Agreement A entire set of PWMs vs B data [txt]
- Agreement B entire set of PWMs vs A data [txt]
- AlignACE A PWMs vs B data [txt]
- AlignACE B PWMs vs A data [txt]
- AlignACE A entire set of PWMs vs B data [txt]
- AlignACE B entire set of PWMs vs A data [txt]
- BioProspector A PWMs vs B data [txt]
- BioProspector B PWMs vs A data [txt]
- BioProspector A entire set of PWMs vs B data [txt]
- BioProspector B entire set of PWMs vs A data [txt]
- Kafal A PWMs vs B data [txt]
- Kafal B PWMs vs A data [txt]
- Kafal A entire set of PWMs vs B data [txt]
- Kafal B entire set of PWMs vs A data [txt]
- MEME A PWMs vs B data [txt]
- MEME B PWMs vs A data [txt]
- MEME A entire set of PWMs vs B data [txt]
- MEME B entire set of PWMs vs A data [txt]
- MotifSampler A PWMs vs B data [txt]
- MotifSampler B PWMs vs A data [txt]
- MotifSampler A entire set of PWMs vs B data [txt]
- MotifSampler B entire set of PWMs vs A data [txt]
