Web supplement to
"Evaluation of methods for modeling transcription factor sequence specificity"

Matthew T. Weirauch1,2, Atina Cote1, Raquel Norel3, Matti Annala4, Yue Zhao5, Todd J. Riley6, Julio Saez Rodriguez7, Thomas Cokelaer7, Anastasia Vedenko8, Shaheynoor Talukder1, DREAM5 consortium, Harmen J. Bussemaker6, Quaid D. Morris1,11, Martha L. Bulyk8,9,10, Gustavo Stolovitzky3, Timothy R. Hughes*1,11

1 Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada
2 Center for Autoimmune Genomics and Etiology (CAGE) and Divisions of Rheumatology and Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
3 IBM Computational Biology Center, Yorktown Heights, NY, USA
4 Department of Signal Processing, Tampere University of Technology, Tampere, Finland
5 Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
6 Department of Biological Sciences, Columbia University, New York, NY, USA
7 Department of Information Engineering, University of Padova, Padova, Italy
8 Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
9 Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA
10 Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, MA, USA
11 Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada

*To whom correspondence should be addressed:

Abstract

Genomic analyses often involve scanning for potential transcription factor (TF) binding sites using models of the sequence specificity of DNA-binding proteins. Many approaches have been developed to model and learn a protein's DNA-binding specificity, but these methods have not been systematically compared. Here we applied 26 such approaches to in vitro protein binding microarray (PBM) data for 66 mouse TFs belonging to various families. For nine TFs, we also scored the resulting motif models on in vivo data, and found that the best in vitro-derived motifs performed similarly to motifs derived from in vivo data. Our results indicate that simple models based on mononucleotide position weight matrices learned by the best methods perform similarly to more complex models for most TFs examined, but fall short in specific cases (<10% of the TFs examined). In addition, the best-performing motifs typically have relatively low information content, consistent with widespread degeneracy in eukaryotic TF sequence preferences.

Supplementary files

PBM array data: raw signal intensities and 8-mer scores

PBM clones