Web supplement to
"GHT-SELEX demonstrates unexpectedly high intrinsic sequence specificity and complex DNA binding of many human transcription factors"
Arttu Jolma1,*,
Aldo Hernandez-Corchado2,3*,
Ally W.H. Yang1,*,
Ali Fathi4,*,
Kaitlin U. Laverty1,5*,
Alexander Brechalov1,
Rozita Razavi1,
Mihai Albu1,
Hong Zheng1,
The Codebook Consortium,
Ivan Kulakovskiy6,7,
Hamed S. Najafabadi2,3**,
and Timothy R. Hughes1,4,**
1Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
2Department of Human Genetics, McGill University, Montréal, QC H3A 0C7, Canada
3Victor P. Dahdaleh Institute of Genomic Medicine, Montréal, QC H3A 0G1, Canada
4Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
5Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
6Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia and Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
7Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
*these authors contributed equally
#To whom correspondance should be addressed:
Abstract
A long-standing challenge in human regulatory genomics is that transcription
factor (TF) DNA-binding motifs are short and degenerate, while the genome is
large. Motif scans therefore produce many false-positive binding site predictions.
By surveying 179 TFs across 25 families using >1,500 cyclic in vitro selection
experiments with fragmented, naked, and unmodified genomic DNA – a method
we term GHT-SELEX (Genomic HT-SELEX) – we find that many human TFs
possess much higher sequence specificity than anticipated. Moreover, genomic
binding regions from GHT-SELEX are often surprisingly similar to those obtained
in vivo (i.e. ChIP-seq peaks). We find that comparable specificity can also be
obtained from motif scans, but performance is highly dependent on derivation
and use of the motifs, including accounting for multiple local matches in the
scans. We also observe alternative engagement of multiple DNA-binding domains
within the same protein: long C2H2 zinc finger proteins often utilize modular DNA
recognition, engaging different subsets of their DNA binding domain (DBD) arrays
to recognize multiple types of distinct target sites, frequently evolving via internal
duplication and divergence of one or more DBDs. Thus, contrary to conventional
wisdom, it is common for TFs to possess sufficient intrinsic specificity to
independently delineate cellular targets.
Figures and documents
-
Figure 1. Overview of GHT-SELEX.
- A.
Schematic of GHT-SELEX, showing parallels with HT-SELEX.
- B.
Example of read accumulation over a TF motif match for NFKB1.
- C.
Genomic binding for four positive control TFs on a genomic region showing (top to bottom) PWM scanning scores (moving average of affinity scores, from MOODS56 scan in linear domain, using a window of size 200bp) for literature (Ref) PWMs and Codebook PWMs, followed by read coverage signal observed in GHT-SELEX and ChIP-seq.
-
Figure 2. MAGIX method for interpretation of GHT-SELEX data. (.csv)
- A.
A brief overview of the statistical framework of the generative model of MAGIX. Open circles, closed circles, and the diamonds represent latent variables, observed variables, and deterministic computations, respectively. si: library size for sample i; xi: vector of sample-level variables for sample i, including an intercept term and a term for the SELEX cycle, in addition to other terms for batch and background effects; βj: vector of model coefficients for interval j; mij: number of observed reads mapping to interval j in sample i. See Methods for description of other variables.
- B.
Example of actual read count data for CTCF over five replicates of four cycles, illustrating enrichment patterns, fitted coefficients (right), and estimated library sizes (bottom).
- C.
Distribution of PWM hits for the top-ranked TF PWM (highest AUROC on GHT-peaks as determined by PMID: 39605530) within the 5,000 highest scoring MAGIX peaks. PWM hits were identified with MOODS56 (P < 0.0001). Solid red lines represent the mean PWM hit position within MAGIX peaks and dashed lines represent one standard deviation about the mean.
-
Figure 3. Analysis of 331 Codebook proteins and 61 control TFs using GHT-SELEX. (.csv)
- A.
Venn diagram of success among GHT-SELEX, HT-SELEX, and ChIP-seq, vs. all proteins tested in GHT- and HT-SELEX, displayed separately for Codebook proteins (left) and positive control TFs (right).
- B.
Bar chart of protein-level success in GHT-SELEX, categorized based by main DBD type. C2H2-zf proteins and those with an unknown DBD (at the beginning of the project) are inset due to large numbers.
-
Figure 4. Correspondence between GHT-SELEX and ChIP-seq peaks.
-
A. Enrichment of ChIP-seq peaks and PWM hits within MAGIX peaks, for two example control TFs. The top 75,000 MAGIX peaks are sorted by their MAGIX enrichment coefficient (purple, left y-axis). Orange line shows the proportion of peaks (in a sliding window of 500 peaks over the ranked peaks, with a step size of 50) that overlap with a ChIP-seq peak (at MACS threshold P < 0.001). Black line shows the AUROC for PWM affinity scores (calculated by AffiMx52>) of MAGIX peaks in the same window vs. 500 random genomic sites.
(.csv)
(.tar.gz)
-
B. Illustration of peak number optimization (for CTCF as an example).
(.csv)
-
C. Histogram of the optimal values of N (peak count) for the 137 TFs that have both GHT-SELEX and ChIP-seq peaks.
(.xlsx)
-
D. Histogram of optimal Jaccard values, compared to the maximum Jaccard for mismatched TFs (i.e. between GHT-SELEX for one TF and ChIP-seq for a randomly selected TF).
(.xlsx)
-
Figure 5. High quality PWMs often predict in vivo binding sites as effectively as GHT-SELEX peaks
-
A. Scatter plot of optimal Jaccard value between GHT-SELEX peaks and ChIP-seq peaks (x-axis) vs. optimal Jaccard value between PWM-predicted sites and ChIP-seq peaks (y-axis), for all 137 TFs (dots).
(.xlsx)
-
B. Scatter plot of optimal N (peak number) for the same peak set comparisons shown in (A).
(.xlsx)
-
C. Scatter plot showing optimal Jaccard value between PWM-predicted sites and ChIP-seq peaks, for maximum-affinity PWM scoring and sum-of-affinities PWM scoring. Points (TFs) are scaled based on the optimal number of peaks (in the sum scoring), and the color reflects the fraction of binding sites comprised of multiple PWM hits.
(.xlsx)
-
D. Scatter plot of the improvement in the optimal Jaccard value associated with sum-of-affinities PWM scoring vs. information content of the PWM. Points’ size and color are the same as panel (C).
(.xlsx)
-
E. Examples of four TFs with multiple motif matches within a single ChIP-seq peak.
-
Figure 6. Alternative engagement of individual C2H2-zf domains at genomic binding sites inferred from the recognition code.
-
A. RCADEEM applied to CTCF. Middle panel displays the top 2,000 nonrepetitive GHT-SELEX peaks. White vertical bars indicate the region that is expected to contact the DNA based on the assumption that each of the C2H2-zf domains define three contiguous bases. Left panel indicates which C2H2-zf domains are inferred to engage each DNA sequence, which is used to determine the row order in the figure. Right panel shows motifs for the major sub-sites, derived from base frequencies in the sequence alignment.
(.zip)
-
B-F. Top 2,000 non-repeat peak sequences, as in (A), for representative TFs with different binding modes, as described in the main text. Above each is shown the sequence logo for the single representative Codebook PWM (top) and a motif generated by RCADEEM that represents all of the observed sequences (bottom).
(.zip)
G. Number of occurrences of each category among all 86 C2H2-zf proteins for which RCADEEM yielded a significant outcome; note that a TF might appear in multiple or no categories.
Figure 7. Evolution of C2H2-zf protein DNA-binding specificities through internal duplication of DBDs and DBD arrays
-
A. RCADEEM results for the pool of top 500 peaks from full-length ZNF721 and two DBD constructs, after removing peaks that overlap repeats. The construct in which the peak was observed is indicated on the left.
-
B. Similarity of C2H2-zf domains of ZNF721 (based on the number of mismatches in the global alignment). Apparent duplicated arrays are encircled by blue border (i.e. syntenic duplications), while single pairs that may also be duplicates are circled in red.
-
C. Scatterplots of the HT-SELEX k-mer scores61 (relative counts) across the three ZNF721 constructs.
-
D. Comparison of average per-base similarity (correlation of nucleotide frequency) in PWMs predicted by the recognition code, for those present in duplicated arrays vs. those duplicated as individual C2H2-zf domains, with duplicated C2H2-zf domains taken as pairs that are separated from each other by 5 or less edits. DBD pairs have been filtered to contain only combinations where both of the DBDs are likely to have retained their ability to bind DNA (have DNA binding functionality score59 > 0.5). Error bars show standard error of the mean.
Table S1. HT and GHT-SELEX ligand sequences and descriptions. Table lists the oligonucleotide sequences used in the assay and describes how they anneal with each other on the synthesis and amplification steps.
Table S2. Experimental batch specific protocol details. Table lists the reagents and experimental conditions that varied between different experimental batches.
Table S3 GHT-SELEX-experiment metadata. Table lists all GHT- and HT-SELEX experiments performed in this study indicating: unique experiment identifier; human readable identifier; plasmid identifiers; HNGC symbol; experimental batch; construct type; protein production approach, position in the 96-well; sequencing strategy; number of selection cycles; and whether the experiment was approved or not. Note that GFP control experiments (i.e. empty plasmids) are also included in the table (5 GHT-SELEX and 7 HT-SELEX).
Table S4: Genomic region overlap of GHT-SELEX and ChIP-seq peaks and PWM-predicted target regions. Table shows the overlap of optimal ChIP-seq peaks with GHT-SELEX/MAGIX and PWM based predictions for each of the TFs where both datasets were available. Columns show the highest Jaccard coefficient between each pair of datasets and the number of peaks that yielded it.
Table S5: C2H2-zf protein DNA-binding mode annotation. Table lists the 86 C2H2 TFs for which RCADEEM result was obtained (out of 120 total C2H2-zf TFs with GHT-SELEX data available) with information of: Total number of C2H2 zinc finger domains; amino acid gaps between these DBDs; number of distinct motifs bound by the TF; modular binding activity annotated for it; whether the protein is likely to contain zinc fingers obtained from internal duplications and whether data was obtained from experiments that expressed different subsets of the TFs C2H2-zf domains.
Table S6: Intra-protein C2H2-zf domain duplication dataset. Table displays all pairs of human C2H2 DBDs that are separated from each other by five or less edits.
Document S1.
Motif centrality and enrichment in GHT-SELEX/MAGIX peaks and its correspondence with ChIP-seq peaks.
Same plots as in Figure 2C and Figure 4A for all the TFs and DBD constructs in this study with approved GHT-SELEX experiments. Top, top-ranked TF PWM (highest AUROC on GHT-peaks as determined by 43). Middle, Distribution of PWM hits within the 5,000 highest scoring MAGIX peaks. Solid red lines represent the mean PWM hit position within MAGIX peaks and dashed lines represent one standard deviation about the mean. Bottom, Enrichment of ChIP-seq peaks and PWM hits within MAGIX peaks. Orange line shows the proportion of peaks (in a sliding window of 500 peaks over the ranked peaks, with a step size of 50) that overlap with a ChIP-seq peak (at MACS threshold P < 0.001). Black line shows the AUROC for PWM affinity scores of MAGIX peaks in the same window vs. 500 random genomic sites.
Document S2.
PFMs of C2H2-zf proteins with alternative binding modes. PFMs representing the different binding modes of C2H2-zf proteins.