Predicting the Binding Preference of Transcription Factors to Individual DNA k-mers

Trevis M. Alleyne1, Lourdes Peña-Castillo2, Gwenael Badis1, Shaheynoor Talukder1, Michael F. Berger3,5, Andrew R. Gehrke3, Anthony A. Philippakis3,5,6, Martha L. Bulyk3-5,6, Quaid D. Morris1,2, and Timothy R. Hughes1,2§


1Department of Molecular Genetics, University of Toronto, Toronto, ON M4T 2J4

2Banting and Best Department of Medical Research, University of Toronto, Toronto, ON M4T 2J4

3Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115

4Department of Pathology, Brigham and Women's Hospital and Harvard Medical School, Boston, MA 02115

5Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA 02138

6Harvard/MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115


§Corresponding author


Abstract

Motivation

Recognition of specific DNA sequences is a central mechanism by which transcription factors (TFs) control gene expression. Many TF binding preferences, however, are unknown or poorly characterized, in part due to the difficulty associated with determining their specificity experimentally, and an incomplete understanding of the mechanisms governing sequence specificity. New techniques that estimate the affinity of TFs to all possible k-mers provide a new opportunity to study DNA-protein interaction mechanisms, and may facilitate inference of binding preferences for members of a given TF family when such information is available for other family members.

Results

We used a new data set of the relative preferences of mouse homeodomains for all 8-base DNA sequences to ask how well the binding profiles of homeodomains can be predicted from their protein sequences alone. We evaluated a panel of standard statistical inference techniques, as well as variations in the protein features considered. A nearest-neighbour approach restricted to functionally important residues was among the most effective methods. Our results support the combinatorial code model of TF-DNA recognition, and suggest a rapid and rational approach for future analyses of TF families.
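
As a rough illustration of the nearest-neighbour idea, the sketch below assigns a query homeodomain the 8-mer profile of the characterized homeodomain that it matches best at a chosen set of alignment positions. It is a minimal sketch, not the procedure used in the study: the similarity measure (fraction of identical residues), the names hd_A and hd_B, the toy sequences, the selected positions, and the 8-mer scores are all hypothetical.

```python
from typing import Dict, Sequence, Tuple


def residue_identity(a: str, b: str, positions: Sequence[int]) -> float:
    """Fraction of the selected alignment positions at which a and b match."""
    matches = sum(1 for p in positions if a[p] == b[p])
    return matches / len(positions)


def predict_profile(
    query: str,
    train_seqs: Dict[str, str],
    train_profiles: Dict[str, Dict[str, float]],
    positions: Sequence[int],
) -> Tuple[str, Dict[str, float]]:
    """Return the name and 8-mer profile of the most similar training protein."""
    best = max(
        train_seqs,
        key=lambda name: residue_identity(query, train_seqs[name], positions),
    )
    return best, train_profiles[best]


# Hypothetical pre-aligned sequences (toy length) and made-up 8-mer scores.
train_seqs = {"hd_A": "RRARTAF", "hd_B": "KRGRQTY"}
train_profiles = {
    "hd_A": {"TAATTAAA": 0.48, "ACGTACGT": 0.02},
    "hd_B": {"TAATTAAA": 0.10, "ACGTACGT": 0.31},
}
# Hypothetical indices standing in for functionally important residues.
positions = [0, 2, 4]

neighbour, predicted = predict_profile("RKAKTSW", train_seqs, train_profiles, positions)
print(neighbour, predicted)
```

Restricting the comparison to a subset of positions means that differences elsewhere in the sequence do not influence which neighbour is chosen, which is the property the abstract attributes to the effective methods.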


Protein sequence data
8-mer data
Predicted 8-mer profiles (compressed text files)
Prediction performance
PBM replicate data
Figure data