Web supplement to
"Perspectives on Codebook: sequence specificity of uncharacterized human transcription factors"

Arttu Jolma1*, Kaitlin U. Laverty1,2*, Ali Fathi1,3*, Ally W.H. Yang1*, Isaac Yellan1,3*, Ilya E. Vorontsov4*, Sachi Inukai5,6, Judith F. Kribelbauer-Swietek5,6, Antoni J. Gralak5,6, Rozita Razavi1, Mihai Albu1, Alexander Brechalov1, Zain M. Patel1,3, Vladimir Nozdrin7, Georgy Meshcheryakov8, Ivan Kozin8, Sergey Abramov4,9, Alexandr Boytsov4,9, The Codebook Consortium, Oriol Fornes10, Vsevolod J. Makeev4,#, Jan Grau11, Ivo Grosse11, Philipp Bucher12, Bart Deplancke5,6**, Ivan V. Kulakovskiy4,8**, and Timothy R. Hughes1,3**

1Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
2Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
3Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
4Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
5Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
6Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
7Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
8Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
9Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
10Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
11Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
12Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
#Present address: Cancer Research UK National Biomarker Centre, University of Manchester, Manchester, M20 4BX, UK

*These authors contributed equally

**To whom correspondence should be addressed:

Abstract

We describe an effort (“Codebook”) to determine the sequence specificity of 332 putative and largely uncharacterized human transcription factors (TFs), as well as 61 control TFs. Nearly 5,000 independent experiments across multiple in vitro and in vivo assays produced motifs for just over half of the putative TFs analyzed (177, or 53%), of which most are unique to a single TF. The data highlight the extensive contribution of transposable elements to TF evolution, both in cis and trans, and identify tens of thousands of conserved, base-level binding sites in the human genome. The use of multiple assays provides an unprecedented opportunity to benchmark and analyze TF sequence specificity, function, and evolution, as further explored in accompanying manuscripts. 1,421 human TFs are now associated with a DNA binding motif. Extrapolation from the Codebook benchmarking, however, suggests that many of the currently known binding motifs for well-studied TFs may inaccurately describe the TF’s true sequence preferences.

Supplemental Files.