Web supplement to
"Extensive binding of uncharacterized human transcription factors to genomic dark matter"
Rozita Razavi1*,
Ali Fathi1*,
Isaac Yellan1*,
Alexander Brechalov1*,
Kaitlin U. Laverty1,2,
Arttu Jolma1,
Aldo Hernandez Corchado3,
Hong Zheng1,
Ally Yang1,
Marjan Barazandeh1,
Chun Hu1,
Ilya Vorontsov4,
Zain Patel1,
The Codebook Consortium,
Ivan Kulakovskiy5,
Philipp Bucher6,
Quaid Morris2,
Hamed S. Najafabadi3,7,
and Timothy R. Hughes1**
1Donnelly Centre and Department of Molecular Genetics, 160 College Street, Toronto, ON M5S 3E1 CANADA
2Memorial Sloan Kettering Cancer Center, Rockefeller Research Laboratories, New York, NY 10065, USA
3Victor P. Dahdaleh Institute of Genomic Medicine, 740 Dr. Penfield Avenue, Room 7202, Montréal, Québec, H3A 0G1, Canada
4Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
5Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
6Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
7Department of Human Genetics, McGill University, Montréal, Québec, H3A 0C7, Canada
*these authors contributed equally
**To whom correspondence should be addressed:
Abstract
Most of the human genome is thought to be non-functional, and includes large segments often referred to as “dark matter” DNA. The genome also encodes hundreds of putative and poorly characterized transcription factors (TFs). We determined genomic binding locations of 166 uncharacterized human TFs in living cells. Nearly half of them associated strongly with known regulatory regions such as promoters and enhancers, often at conserved motif matches and co-localizing with each other. Surprisingly, the other half often associated with genomic dark matter, at largely unique sites, via intrinsic sequence recognition. Dozens of these, which we term “Dark TFs”, mainly bind within regions of closed chromatin. Dark TF binding sites are enriched for transposable elements, and are rarely under purifying selection. Some Dark TFs are KZNFs, which contain the repressive KRAB domain, but many are not: the Dark TFs also include known or potential pioneer TFs. Compiled literature information supports that the Dark TFs exert diverse functions ranging from early development to tumor suppression. Thus, our results shed light on a large fraction of previously uncharacterized human TFs and their unappreciated activities within the dark matter genome.
Supplemental Files.
-
Figure 1.
-
A. Overview of the TF categories assayed in this study.
(.csv)
-
B. A schematic of the experimental pipeline for production of 369 inducible EGFP-labelled TF cell lines used in ChIP experiments and deriving TF binding sites.
-
C. Samples of representative motifs obtained from different families of control TFs.
-
Figure 2.
-
A. Fraction of ChIP-seq peaks in protein-coding promoters (x-axis) and HEK293 enhancers (y-axis). Point sizes are proportional to the number of peaks for each TF (log scale).
(.xlsx)
-
B. Bottom (square) heatmap: Jaccard similarity coefficient between ChIP-seq peaks of all TF pairs. Top heatmap: Fraction of ChIP-seq peaks falling within genomic regions, as indicated, and other properties of the TFs. Fractions are scaled to the [min, max] range across the TFs for better visualization, as indicated on the right. TF ordering is determined by hierarchical clustering with Ward linkage and Euclidean distance, using the tracks 'H3K4me3', 'ATAC-seq', 'B compartment', 'Empty' + 'Heterochromatin', 'Repeats', 'CpG', 'Protein-coding promoters', 'H3K27ac' (the last three not shown), along with the one-hot encoded 'TF type' to aid in illustration.
(Jaccard (.xlsx),
Annotations (.xlsx))
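As a rough illustration of the Jaccard similarity coefficient used for the bottom heatmap, the sketch below computes it for two peak sets. It assumes base-pair-level overlap on a single chromosome with half-open (start, end) intervals; whether the study measured overlap in base pairs or in peak counts is not stated here, and the function names are hypothetical.

```python
def merge(intervals):
    """Merge overlapping or abutting half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(intervals):
    """Total number of bases covered by a set of intervals."""
    return sum(end - start for start, end in merge(intervals))


def jaccard(peaks_a, peaks_b):
    """Jaccard coefficient = covered intersection / covered union (assumed bp-level)."""
    len_a = covered(peaks_a)
    len_b = covered(peaks_b)
    union = covered(list(peaks_a) + list(peaks_b))
    intersection = len_a + len_b - union  # inclusion-exclusion
    return intersection / union if union else 0.0


# Hypothetical peak sets for two TFs on one chromosome.
tf1 = [(0, 100), (200, 300)]
tf2 = [(50, 150), (400, 500)]
print(jaccard(tf1, tf2))
```

Applying this to every TF pair would fill a symmetric matrix like the one provided in the Jaccard (.xlsx) file.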
-
Figure 3.
-
A. Fraction (A) and absolute number (B) of peaks with direct binding (i.e. TOP sites) for Promoter TFs and Dark TFs. TFs are sorted to compare distributions. The denominator for (A) is the total number of ChIP peaks at the same optimized threshold.
(.xlsx)
-
B. Fraction (A) and absolute number (B) of peaks with direct binding (i.e. TOP sites) for Promoter TFs and Dark TFs. TFs are sorted to compare distributions. The denominator for (A) is the total number of ChIP peaks at the same optimized threshold.
(.xlsx)
-
C. Fraction of GHT-SELEX (x-axis) and ChIP-seq (y-axis) peaks falling in the specified genomic regions (protein-coding promoters, repeats, and empty or heterochromatin), using the peaks at the universal threshold. Dashed lines show the expected fraction if peaks were distributed at random.
(.xlsx)
-
D. Fraction of GHT-SELEX (x-axis) and ChIP-seq (y-axis) peaks falling in the specified genomic regions (protein-coding promoters, repeats, and empty or heterochromatin), using the peaks at the universal threshold. Dashed lines show the expected fraction if peaks were distributed at random.
(.xlsx)
-
E. Fraction of GHT-SELEX (x-axis) and ChIP-seq (y-axis) peaks falling in the specified genomic regions (protein-coding promoters, repeats, and empty or heterochromatin), using the peaks at the universal threshold. Dashed lines show the expected fraction if peaks were distributed at random.
(.xlsx)
-
F. Density of GHT-SELEX signal (left), TOP sites (middle), and CTOP sites (right) by position relative to TSS of protein-coding promoters, for 29 Promoter TFs that have available GHT-SELEX data. Intensities of the heatmaps for TOPs (middle) and CTOPs (right) have been normalized by the total number of PWM hits (of TOPs and CTOPs, respectively) in promoters (shown at the right of each heatmap).
(.zip)
-
Figure 4.
-
A. Heatmaps of FDR-corrected phyloP scores across the TOP sites (rows), split into top and bottom segments that contain conserved and unconserved sites. Bars to the right indicate which tests of conservation are satisfied (Likelihood-ratio, Correlation, Wilcoxon), along with overlaps with promoters (P) and specific repeat families if applicable. 100 bp segments are shown with the PWM hit in the middle. Blue/positive phyloP indicates purifying selection, and red/negative phyloP values represent diversifying selection.
(.tar.gz)
-
B. Fraction (B) and absolute number (C) of TOPs that are conserved, for Promoter TFs and Dark TFs, sorted to compare distributions.
(.xlsx)
-
C. Fraction (B) and absolute number (C) of TOPs that are conserved, for Promoter TFs and Dark TFs, sorted to compare distributions.
(.xlsx)
-
D-H. (D,E,F,G,H) Genome track displays of CTOP sites for ZNF407 (D), ZNF131 and YY1 (E), ZBTB40 at a hAT/Charlie (MER58A) element (F) and its most-conserved TOP (at the PRKACA promoter) (G), ZNF689 at an L1M5 element (H). The Dfam100 repeat model sequence logo is also shown for MER58A (F) and L1M1 (H).
(website)
-
Figure 5.
-
A. Heatmap of –log10 p-values for TFs (x-axis) that are enriched for binding specific TE families (y-axis). Labels show superfamily/family.
(.gz)
-
B. Binding of paralogous TFs, ZNF836 and ZNF841, to a homologous region in the two related LTR families, MSTA-int and THE1-int. Bottom plot shows the average ChIP-seq and GHT-SELEX signal (i.e. read count) across all the instances of MST-int and THE1-int aligned to their consensus.
-
C. Fraction of TOP sites in various repeat elements for two poly-A binding TFs ZNF362 and ZNF384.
(.tsv)
-
D. An example of the Promoter TF ZNF676 binding site targeting an unconserved LTR12C sequence.
(.tsv)
-
Figure 6.
-
A. Heatmap showing the fraction of TOP sites for each TF dating to different mammalian clades in the human lineage, along with information about the TF category, median age of TOP sites and TFs (million years ago, MYA), and log10 of total TOP sites.
(.gz)
-
B. Sorted median age of the TOP sites (B) and the age of the TFs (C) are compared for Dark TFs and Promoter TFs.
(.xlsx)
-
C. Sorted median age of the TOP sites (B) and the age of the TFs (C) are compared for Dark TFs and Promoter TFs.
(.xlsx)
-
Figure 7. Compiled protein-protein interactions (PPIs)60, mostly supported by two independent lines of evidence and grouped into three categories (TRIM28/33/39 interactions, zinc-finger (ZF) protein interactions, and CBX/HP1 interactions), are shown at left. Median binding site age, calculated for TOP sites only for the TFs with available GHT-SELEX data, is shown along with the age of the TF. The fractions of ChIP-seq peaks (using the universal threshold) overlapping with H3K9me3 and H3K27me3 histone marks and with the ChromHMM “empty” state (None) are shown in the middle. For repeats, the enrichment score (-log(p-value), hypergeometric test) of the most enriched repeat element within each superclass is plotted as a heatmap, and the most enriched repeat subtype across all superfamilies is listed beside it. The expert-curated sequence logos are displayed to the right (except for ZNF280D and SCML4, which did not produce any approved PWM), along with the corresponding phenotype for any TF with a known biological function identified through literature review (in the same block).
(.gz)
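The repeat enrichment score in this figure is described as -log(p-value) from a hypergeometric test. A minimal sketch of such a score follows, assuming a one-sided (upper-tail) test and log base 10; the choice of background set and the parameter names (N, K, n, k) are illustrative assumptions, not the study's documented settings.

```python
from math import comb, log10


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n sites from N, of which K lie in the repeat element."""
    denom = comb(N, n)
    numer = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return numer / denom


def enrichment_score(k, N, K, n):
    """-log10 of the upper-tail hypergeometric p-value (assumed definition)."""
    p = hypergeom_upper_tail(k, N, K, n)
    return -log10(p) if p > 0 else float("inf")


# Hypothetical counts: 1000 background sites, 50 in a repeat element,
# 40 sites bound by a TF, 10 of which fall in that element.
print(enrichment_score(k=10, N=1000, K=50, n=40))
```

Larger scores indicate that the observed overlap is less likely under random sampling from the assumed background.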
-
Figure S1.
-
A. Distribution of peak overlap between ChIP-seq replicates, for approved experiments (i.e., produced a motif), not approved experiments (i.e., did not produce a motif), and mismatch replicates (i.e., TF identities permuted), calculated by the Kulczynski II similarity metric (i.e., average of overlaps). The dotted line indicates the threshold at which pairs of not approved experiments were considered successful and thus could be included in downstream analyses.
Biological replicates(.csv),
Mismatch replicates(.gz)
-
B. Distribution of the uniqueness of peaks for different categories of TFs, measured as the fraction of ChIP-seq peaks (at the universal threshold) not overlapping with any peak from any other TF in this study.
(.xlsx)
-
C. Distribution of the Kulczynski II similarity metric between ChIP-seq replicates (as in (A)), restricted to those classified as Dark TFs or Others.
(.csv)
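The Kulczynski II metric used in panels A and C is the average of the two directional overlap fractions between a pair of peak sets. The sketch below is one way to compute it, assuming that a peak counts as overlapping when it shares at least one base with any peak in the other replicate; the overlap criterion and function names are assumptions for illustration.

```python
def overlaps_any(peak, others):
    """True if a half-open (start, end) peak shares at least one base with any other peak."""
    start, end = peak
    return any(start < o_end and o_start < end for o_start, o_end in others)


def kulczynski2(peaks_a, peaks_b):
    """Average of (fraction of A overlapping B) and (fraction of B overlapping A)."""
    if not peaks_a or not peaks_b:
        return 0.0
    frac_a = sum(overlaps_any(p, peaks_b) for p in peaks_a) / len(peaks_a)
    frac_b = sum(overlaps_any(p, peaks_a) for p in peaks_b) / len(peaks_b)
    return (frac_a + frac_b) / 2


# Hypothetical replicates: the smaller replicate is fully recovered by the larger one.
rep1 = [(0, 100), (200, 300), (500, 600), (900, 1000)]
rep2 = [(50, 150), (250, 350)]
print(kulczynski2(rep1, rep2))
```

Unlike the Jaccard coefficient, this average is not heavily penalized when one replicate has many more peaks than the other, which may be why it suits replicate comparison.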
-
Figure S2.
-
A. Fraction of ChIP-seq peaks overlapping with GeneHancer annotated enhancers (x-axis) and HEK293 enhancers (defined by H3K4me1-positive regions from ChromHMM; y-axis). Points (TFs) are scaled based on their number of peaks. Colors also display the expression of TFs in HEK293 cells.
(.xlsx)
-
B. Characterization of the states of a ChromHMM model with 10 states trained on various HEK293 chromatin data (i.e., H3K9me3, H3K27me3, H3K4me1, H3K4me3, H3K36me3, and H3K27ac from ENCODE, and ATAC-seq and CTCF peaks from this study). Based on the correspondence between emissions and the chromatin marks and genome annotations, the states were assigned to Gene body, TES, Open Promoter/Enhancer, Promoter NFR (nucleosome-free regions), Promoter flanking, Enhancer, CTCF Insulator, Empty (of histone marks), Constitutive Heterochromatin, and Facultative Heterochromatin.
(website - Ancillary Data)
-
Figure S3. A detailed version of Figure 2 including additional tracks, such as gene expression in HEK293 cells (FPKM), number of total ChIP-seq peaks (at the universal threshold of MACS2 P-value ≤ 10^-10), TF age, fraction of human protein-coding promoters (out of 20,052) covered by TF peaks, fraction of ChIP-seq peaks falling within: CpG islands, H3K4me3-positive regions, facultative heterochromatin, and constitutive heterochromatin, with the main repeat class bound by the TFs included. The upper triangle in the bottom square is the same as Figure 2, however, the lower triangle here is the similarity between PWMs for each pair of TFs, calculated by MoSBAT101. Gray stripes correspond to the TFs without a selected PWM in the Codebook set.
Biological replicates(.csv),
Mismatch replicates(.gz)
-
Figure S4. Heatmap is from Figure 6, with expanded labels for specific elements enriched in TOPs of each TF.
(.gz)
-
Figure S5. Plots showing the fraction of each TF’s TOPs that are conserved (i.e. ‘CTOPs’) and overlap a major class of transposable elements or non-TE repeats. The proportion of TOPs that are conserved and overlap a repeat class is shown on the y-axis, and the log10 count of these sites is shown on the x-axis. Each TF is coloured according to its classification as a Dark TF, Promoter TF, Enhancer TF, or Other TF. Only proteins with a fraction greater than 0.1 of conserved TOPs that fall in a repeat class are labeled. TFs discussed in the main text are also labeled.
(.gz)
-
Figure S6. Heatmaps show the proportion of each TF’s TOPs (rows) inferred to be a certain age, as in Figure 5, but with each panel utilizing a different scheme. Top row: Age of each TOP site inferred as that of oldest ancestral genome with a gapless alignment to the human TOP site and minimum 75% identity (left) or 100% identity (right). (Figure 5 shows this same analysis with a 0% identity threshold). Middle row: Age of each TOP site inferred as that of oldest species with a gapless alignment to the human TOP site and minimum 0% identity (left) or 100% identity (right). Bottom row: Age of each TOP site inferred as that of the oldest clade where 60% of the species have a gapless alignment to the human TOP with a minimum 0% identity (left) or 100% identity (right).
(.gz)
-
Figure S7. Conserved binding sites for ZNF518B (red) located in the promoter of ZNF518B itself, and in a predicted enhancer-region ~4kb upstream of its promoter. Binding sites for ZBTB41, KDM2A, TET3, and CXXC4 are also present in this region.
(website Merged_BigWigs)
-
Document S1. Heatmaps of conservation/phyloP score across TOPs for 137 TFs. As in Figure 4A, but for all TFs of the study: heatmaps of phyloP scores in PWM hits (middle column) and in the flanking sequences of TOPs are displayed. Bars to the right indicate which tests of conservation are satisfied (Likelihood-ratio, Correlation, Wilcoxon), along with overlaps with promoters (P) and specific repeat families if applicable.