Hamed S. Najafabadi*,1, Sanie Mnaimneh*,1, Frank W. Schmitges*,1, Michael Garton1, Kathy N. Lam2,
Ally Yang1, Mihai Albu1, Matthew T. Weirauch3,6, Ernest Radovani2, Philip M. Kim1,2,4, Jack Greenblatt1,2,
Brendan J. Frey1,4-6, and Timothy R. Hughes**,1,2,6
1 Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto M5S 3E1, Canada
2 Department of Molecular Genetics, University of Toronto, Toronto M5S 1A8, Canada
3 Center for Autoimmune Genomics and Etiology (CAGE) and Divisions of Rheumatology and Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, USA
4 Department of Computer Science, University of Toronto, Toronto M5S 2E4, Canada
5 Department of Electrical and Computer Engineering, University of Toronto, Toronto, M5S 3G4, Canada
6 Canadian Institutes For Advanced Research
*These authors made equal contributions to the manuscript.
**To whom correspondence should be addressed:
Abstract
Cys2-His2 zinc finger (C2H2-ZF) proteins represent the largest class of putative human transcription factors (TFs).
Their expansion and diversification in animals, and frequent association with the KRAB domain in vertebrates,
suggest a widespread role in silencing endogenous retroelements (EREs).
However, it is unknown whether most C2H2-ZFs even bind DNA, or what sequences they bind.
We show that most natural C2H2-ZFs bind DNA both in vitro and in vivo, and infer a new DNA recognition code using DNA-binding motifs for thousands of natural C2H2-ZFs.
In vivo binding data for dozens of human C2H2-ZF proteins are generally consistent with our recognition code and indicate that C2H2-ZF
proteins encode the majority of motifs among human TFs.
We show for the first time that most KRAB-C2H2-ZF proteins do bind specific EREs, ranging from currently active to ancient families.
The majority of C2H2-ZF proteins, including KRAB proteins, also show widespread binding to regulatory regions, indicating that humans
contain an extensive and largely unstudied adaptive C2H2-ZF regulatory network that targets a diverse range of genes and pathways.
C2H2 sequences, vectors, and experiment log
Bacterial one-hybrid (B1H) data
- The s-scores for the B1H-filtered set of zinc fingers. Each column represents one experiment, with the bait sequence indicated in the column header: B1H.s_scores (.xlsx 12Mb)
- Motifs derived by regression on B1H s-scores. Each motif is shown as a position weight matrix (PWM) with 4 positions, with each column 'Ni' representing the values assigned to base N at position i of the motifs: B1H.motifs (.xlsx)
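The 'Ni' column layout described for B1H.motifs can be reshaped into a per-position matrix with a few lines of Python. This is a minimal sketch, assuming the columns are literally named `A1`, `C1`, `G1`, `T1`, ..., `T4` and that a row has already been read into a dictionary; the function names and the example values are hypothetical and are not taken from the file.

```python
BASES = "ACGT"


def row_to_pwm(row, n_positions=4):
    """Reshape a flat {'A1': v, 'C1': v, ...} mapping into one dict per position.

    Assumes column names of the form '<base><position>' with 1-based positions.
    """
    return [
        {base: float(row[f"{base}{i}"]) for base in BASES}
        for i in range(1, n_positions + 1)
    ]


def consensus(pwm):
    """Return the highest-valued base at each position of the matrix."""
    return "".join(max(column, key=column.get) for column in pwm)


# Made-up values for illustration only; not real B1H data.
example_row = {
    "A1": 0.1, "C1": 0.2, "G1": 0.6, "T1": 0.1,
    "A2": 0.7, "C2": 0.1, "G2": 0.1, "T2": 0.1,
    "A3": 0.1, "C3": 0.1, "G3": 0.1, "T3": 0.7,
    "A4": 0.2, "C4": 0.5, "G4": 0.2, "T4": 0.1,
}

pwm = row_to_pwm(example_row)
print(consensus(pwm))
```

The sketch only picks the top base per position because the page does not state whether the matrix values are probabilities, regression coefficients, or log-odds scores; any scoring of sequences against the matrix would depend on that.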
Protein binding microarray (PBM) data
- Metadata associated with zinc fingers that were examined by PBM, as well as assessment of PBM experiments: Supplementary Table S1 (.xlsx)
- The z-scores from PBM experiments for all non-redundant 8-mers. Each file contains the z-scores calculated from one experiment, with the OLS ID of the construct used in the experiment as well as the assay ID indicated in the file name: PBM.z_scores (.tar.gz 54Mb)
- Motifs derived from PBM experiments. Zinc fingers are represented by their OLS IDs: PBM.motifs (.xlsx)
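When working with the 8-mer z-score files, it can help to map any 8-mer to a single canonical form. The sketch below assumes that "non-redundant" means each 8-mer is merged with its reverse complement, which is the usual convention for PBM data but is not stated explicitly here. The choice of the lexicographically smaller sequence as the representative is also an assumption, and the files may use a different representative. Under this convention there are (4^8 + 4^4) / 2 = 32,896 distinct 8-mers, since the 4^4 palindromic 8-mers are their own reverse complements.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a k-mer and its reverse complement.

    Assumption: the lexicographically smaller of the two is the representative.
    """
    return min(kmer, reverse_complement(kmer))


def non_redundant_kmers(k):
    """Enumerate all k-mers after merging reverse-complement pairs."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


kmers = non_redundant_kmers(8)
print(len(kmers))
```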
Gold standard C2H2-ZF motifs
- A non-redundant set of gold standard (GS) motifs, compiled from previously published motifs, for C2H2-ZF proteins: GSTD.pfm (.txt)
- The protein sequences corresponding to each of the GS motifs: GSTD.fasta (.fasta)
- Alignments of GS motifs with B1H-RC predicted motifs. In each panel, the top logo (with black borders) corresponds to the B1H-RC motif, whereas the bottom logo represents the GS motif: GSTD.vs.B1hRC.align (.pdf)
ChIP-seq data
- Metadata associated with 71 C2H2-ZF proteins examined by ChIP-seq, including 39 proteins with centrally enriched motifs: ChIPSeq.info (.xlsx)
- BAM files and BAM index files for the ChIP-seq experiments. Each experiment is identified by the Ensembl Gene ID of the target protein: bam files (.zip)
- Peaks identified by ChIP-seq analysis of 39 human C2H2-ZF proteins. Proteins are identified by their Ensembl Gene IDs. Peak coordinates and peak summit coordinates are included in separate files for each protein: ChIP_seq.peaks (.tar.gz)
- Motifs identified for these 39 proteins. Four sets of motifs are provided: (i) motifs predicted by B1H-RC, (ii) the top de novo motifs identified by MEME, (iii) de novo motifs identified by MEME that are most similar to the B1H-RC motifs, and (iv) the trimmed version of the latter group, in which the motifs are trimmed based on their alignment with the B1H-RC motifs: ChIPSeq.motifs (.tar.gz)
- Genomic hits, for the above motifs, that overlap the ChIP-seq peaks: ChIPSeq.motifHits.hg19 (.tar.gz)
- Summary of GREAT analysis for the protein-binding sites: ChIPSeq.GREAT.summary (.xlsx)
- Summary table of ChIP-seq results and downstream analysis for 39 human C2H2-ZF proteins: ChIPSeq.summary.figure (.pdf)
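The relationship between the peak files and the motif-hit files can be illustrated with a simple interval-overlap check. This is only a sketch under stated assumptions: the tuple layouts, the half-open (BED-like) coordinates, and the function name `hits_in_peaks` are hypothetical and do not describe the actual file formats or the pipeline used to produce ChIPSeq.motifHits.hg19. The offset of each hit from the peak summit is included because central enrichment is judged relative to the summit.

```python
def hits_in_peaks(hits, peaks):
    """Return motif hits that overlap a peak, with their offset from the summit.

    hits:  iterable of (chrom, start, end)
    peaks: iterable of (chrom, start, end, summit)
    Coordinates are assumed to be half-open: start inclusive, end exclusive.
    """
    results = []
    for hit in hits:
        h_chrom, h_start, h_end = hit
        for peak in peaks:
            p_chrom, p_start, p_end, summit = peak
            # Two half-open intervals overlap when each starts before the other ends.
            if h_chrom == p_chrom and h_start < p_end and p_start < h_end:
                midpoint = (h_start + h_end) // 2
                results.append((hit, peak, midpoint - summit))
    return results


# Made-up coordinates for illustration only.
example_peaks = [("chr1", 100, 200, 150)]
example_hits = [("chr1", 140, 150), ("chr1", 300, 310), ("chr2", 140, 150)]

overlapping = hits_in_peaks(example_hits, example_peaks)
for hit, peak, offset in overlapping:
    print(hit, peak, offset)
```

The nested loop is quadratic and is meant only for clarity; for genome-scale files, a sorted sweep or an interval-tree approach would be the more practical choice.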
B1H-based recognition code
- B1H-RC: Use the online prediction tool at http://zifrc.ccbr.utoronto.ca/. The source code can also be downloaded and compiled locally for predicting motifs: ZifRC (.zip)
- Motifs predicted for human C2H2-ZF domains using B1H-RC. Each C2H2-ZF domain is identified by its Ensembl Gene ID followed by the index of that domain in the protein. The alpha helix sequence of the domain is also indicated: hs.Ensembl.RF (.xlsx)