Web supplement to
"GHT-SELEX demonstrates unexpectedly high intrinsic sequence specificity and complex DNA binding of many human transcription factors"

Arttu Jolma1,*, Aldo Hernandez-Corchado2,3,*, Ally W.H. Yang1,*, Ali Fathi4,*, Kaitlin U. Laverty1,5,*, Alexander Brechalov1, Rozita Razavi1, Mihai Albu1, Hong Zheng1, The Codebook Consortium, Ivan Kulakovskiy6,7, Hamed S. Najafabadi2,3,**, and Timothy R. Hughes1,4,**

1Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
2Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
3Victor P. Dahdaleh Institute of Genomic Medicine, Montréal, QC H3A 0G1, Canada
4Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
5Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
6Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia and Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
7Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia

*these authors contributed equally

#To whom correspondence should be addressed:

Abstract

A long-standing challenge in human regulatory genomics is that transcription factor (TF) DNA-binding motifs are short and degenerate, while the genome is large. Motif scans therefore produce many false-positive binding site predictions. By surveying 179 TFs across 25 families using >1,500 cyclic in vitro selection experiments with fragmented, naked, and unmodified genomic DNA – a method we term GHT-SELEX (Genomic HT-SELEX) – we find that many human TFs possess much higher sequence specificity than anticipated. Moreover, genomic binding regions from GHT-SELEX are often surprisingly similar to those obtained in vivo (i.e., ChIP-seq peaks). We find that comparable specificity can also be obtained from motif scans, but performance is highly dependent on how the motifs are derived and used, including whether the scans account for multiple local matches. We also observe alternative engagement of multiple DNA-binding domains (DBDs) within the same protein: long C2H2 zinc finger proteins often utilize modular DNA recognition, engaging different subsets of their DBD arrays to recognize multiple types of distinct target sites, frequently evolving via internal duplication and divergence of one or more DBDs. Thus, contrary to conventional wisdom, it is common for TFs to possess sufficient intrinsic specificity to independently delineate cellular targets.

Figures and documents