C2H2_B1H

Abstract

Cys2-His2 zinc finger (C2H2-ZF) proteins represent the largest class of putative human transcription factors (TFs). Their expansion and diversification in animals, and frequent association with the KRAB domain in vertebrates, suggest a widespread role in silencing endogenous retroelements (EREs). However, it is unknown whether most C2H2-ZFs even bind DNA, or what sequences they bind. We show that most natural C2H2-ZFs bind DNA both in vitro and in vivo, and infer a new DNA recognition code using DNA-binding motifs for thousands of natural C2H2-ZFs. In vivo binding data for dozens of human C2H2-ZF proteins are generally consistent with our recognition code and indicate that C2H2-ZF proteins encode the majority of motifs among human TFs. We show for the first time that most KRAB-C2H2-ZF proteins do bind specific EREs, ranging from currently active to ancient families. The majority of C2H2-ZF proteins, including KRAB proteins, also show widespread binding to regulatory regions, indicating that humans contain an extensive and largely unstudied adaptive C2H2-ZF regulatory network that targets a diverse range of genes and pathways.

C2H2 sequences, vectors, and experiment log

Sequences and source proteins of zinc fingers screened in the bacterial one-hybrid (B1H) assays: OLS.info (.xlsx file, 4.6Mb)

Bacterial one-hybrid (B1H) data

The s-scores for the B1H-filtered set of zinc fingers. Each column represents one experiment, with the bait sequence indicated at the column header: B1H.s_scores (.xlsx 12Mb)
Motifs derived by regression on B1H s-scores. Each motif is given as a position weight matrix (PWM) with 4 positions, where each column 'N_i' holds the value assigned to base N at position i of the motif: B1H.motifs (.xlsx)
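
As a rough illustration of the 'N_i' column layout described above, the sketch below regroups one flat row of values into a 4-position matrix and reads off the highest-scoring base at each position. The exact header spelling (`A_1`, `C_1`, ...), the `ID` column, and the sample values are assumptions made for illustration and are not taken from B1H.motifs.

```python
import re

BASES = "ACGT"

def row_to_pwm(row, n_positions=4):
    """Regroup a flat {'A_1': value, ...} mapping into one dict per position."""
    pwm = [dict() for _ in range(n_positions)]
    for name, value in row.items():
        match = re.fullmatch(r"([ACGT])_(\d+)", name)
        if match is None:
            continue  # ignore non-motif columns such as an identifier
        base, pos = match.group(1), int(match.group(2))
        if 1 <= pos <= n_positions:
            pwm[pos - 1][base] = float(value)
    return pwm

def best_bases(pwm):
    """Return the highest-scoring base at each position as a string."""
    return "".join(max(col, key=col.get) for col in pwm)

# Hypothetical row: position i favours the i-th base of "GATC".
example_row = {"ID": "hypothetical_zf"}
for i, favoured in enumerate("GATC", start=1):
    for b in BASES:
        example_row[f"{b}_{i}"] = 1.0 if b == favoured else -0.5

print(best_bases(row_to_pwm(example_row)))
```

A row read from the spreadsheet with any tabular reader could be passed to `row_to_pwm` in the same way, provided its headers follow the assumed pattern.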

Protein binding microarray (PBM) data

Metadata associated with zinc fingers that were examined by PBM, as well as assessment of PBM experiments: Supplementary Table S1 (.xlsx)
The z-scores from PBM experiments for all non-redundant 8-mers. Each file contains the z-scores calculated from one experiment, with the OLS ID of the construct used in the experiment as well as the assay ID indicated in the file name: PBM.z_scores (.tar.gz 54Mb)
Motifs derived from PBM experiments. Zinc fingers are represented by their OLS IDs: PBM.motifs (.xlsx)
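
For orientation on what "non-redundant 8-mers" can mean, the sketch below enumerates 8-mers while treating a sequence and its reverse complement as a single entry, a common convention for double-stranded PBM probes. That this is the exact definition used for PBM.z_scores is an assumption, and the function names are illustrative only.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]

def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))

def non_redundant_kmers(k=8):
    """List all k-mers, merging each one with its reverse complement."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})

print(len(non_redundant_kmers(8)))
```

Under this convention there are (4^8 + 4^4) / 2 = 32,896 distinct 8-mers, since the 256 palindromic 8-mers are their own reverse complements.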

Gold standard C2H2-ZF motifs

A non-redundant set of gold standard (GS) motifs, compiled from previously published motifs, for C2H2-ZF proteins: GSTD.pfm (.txt)
The protein sequences corresponding to each of the GS motifs: GSTD.fasta (.fasta)
Alignments of GS motifs with B1H-RC predicted motifs. In each panel, the top logo (with black borders) corresponds to the B1H-RC motif, whereas the bottom logo represents the GS motif: GSTD.vs.B1hRC.align (.pdf)

ChIP-seq data

Metadata associated with 71 C2H2-ZF proteins examined by ChIP-seq, including 39 proteins with centrally enriched motifs: ChIPSeq.info (.xlsx)
BAM files and BAM index files for the ChIP-seq experiments. Each experiment is identified by the Ensembl Gene ID of the target protein: bam files (.zip)
Peaks identified by ChIP-seq analysis of 39 human C2H2-ZF proteins. Proteins are identified by their Ensembl Gene IDs. Peak coordinates and peak summit coordinates are included in separate files for each protein: ChIP_seq.peaks (.tar.gz)
Motifs identified for these 39 proteins. Four sets of motifs are provided: (i) motifs predicted by B1H-RC, (ii) the top de novo motifs identified by MEME, (iii) de novo motifs identified by MEME that are most similar to the B1H-RC motifs, and (iv) the trimmed version of the latter group, in which the motifs are trimmed based on their alignment with the B1H-RC motifs: ChIPSeq.motifs (.tar.gz)
Genomic hits, for the above motifs, that overlap the ChIP-seq peaks: ChIPSeq.motifHits.hg19 (.tar.gz)
Summary of GREAT analysis for the protein-binding sites: ChIPSeq.GREAT.summary (.xlsx)
Summary table of ChIP-seq results and downstream analysis for 39 human C2H2-ZF proteins: ChIPSeq.summary.figure (.pdf)

B1H-based recognition code

B1H-RC: Use the online prediction tool at http://zifrc.ccbr.utoronto.ca/. The source code can also be downloaded and compiled locally for predicting motifs: ZifRC (.zip)
Motifs predicted for human C2H2-ZF domains using B1H-RC. Each C2H2-ZF domain is identified by its Ensembl Gene ID followed by the index of that domain in the protein. The alpha helix sequence of the domain is also indicated: hs.Ensembl.RF (.xlsx)

Web supplement to "C2H2 zinc finger proteins greatly expand the human regulatory lexicon"