Web supplement to
"Determination and Inference of Eukaryotic Transcription Factor Sequence Specificity"

Matthew T. Weirauch1,2,*, Ally Yang2,*, Mihai Albu2, Atina Cote2, Alejandro Montenegro-Montero3, Philipp Drewe4, Hamed S. Najafabadi2, Samuel A. Lambert5, Ishminder Mann2, Kate Cook5, Hong Zheng2, Alejandra Goity3, Harm van Bakel6, Jean-Claude Lozano7, Mary Galli8, Mathew Lewsey8,9, Eryong Huang10, Tuhin Mukherjee11, Xiaoting Chen11, John S. Reece-Hoyes12, Sridhar Govindarajan13, Gad Shaulsky10, Albertha J.M. Walhout12, François-Yves Bouget7, Gunnar Ratsch4, Luis F. Larrondo3, Joseph R. Ecker8,9,14, Timothy R. Hughes2,5,#

1 Center for Autoimmune Genomics and Etiology (CAGE) and Divisions of Rheumatology and Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
2 Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada
3 Departamento de Genética Molecular y Microbiología, Facultad de Ciencias Biológicas, Pontificia Universidad Católica de Chile, Santiago, Chile
4 Computational Biology Center, Sloan-Kettering Institute, New York, NY, USA
5 Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
6 Icahn Institute for Genomics and Multiscale Biology, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York City, NY, USA
7 Université Pierre et Marie Curie, CNRS, UMR 7621, Observatoire Océanologique de Banyuls sur Mer, France
8 Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
9 Plant Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA
10 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
11 School of Electronic and Computing Systems, University of Cincinnati, Cincinnati, OH, USA
12 Program in Systems Biology, University of Massachusetts Medical School, Worcester, MA, USA
13 DNA2.0 Inc, Menlo Park, CA, USA
14 Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA

*These authors contributed equally

#To whom correspondence should be addressed:

Abstract

Transcription factor (TF) DNA sequence preferences direct their regulatory activity, but are currently known for only ~1% of all eukaryotic TFs. Broadly sampling DNA-binding domain (DBD) types from multiple eukaryotic clades, we determined DNA sequence preferences for >1,000 TFs encompassing 54 different DBD classes from 131 diverse eukaryotes. We find that closely related DBDs almost always have very similar DNA sequence preferences, enabling inference of motifs for ~34% of the ~170,000 known or predicted eukaryotic TFs. Sequences matching both measured and inferred motifs are enriched in ChIP-seq peaks and upstream of transcription start sites in diverse eukaryotic lineages. SNPs defining expression quantitative trait loci in Arabidopsis promoters are also enriched for predicted TF binding sites. Importantly, our motif "library" (http://cisbp.ccbr.utoronto.ca) can be used to identify specific TFs whose binding may be altered by human disease risk alleles. These data present a powerful resource for mapping transcriptional networks across eukaryotes.

Supplemental Text. Click to Download

Supplemental Figures. Click to Download

Supplemental Tables. Click to Download

Files available in "Additional Data 1". Click to Download

Web Supplemental Files. Click to Download

PBM data are available in the CisBP database (http://cisbp.ccbr.utoronto.ca).