Web supplement to
"Perspectives on Codebook: sequence specificity of uncharacterized human transcription factors"
Arttu Jolma1*,
Kaitlin U. Laverty1,2*,
Ali Fathi1,3*,
Ally W.H. Yang1*,
Isaac Yellan1,3*,
Ilya E. Vorontsov4*,
Sachi Inukai5,6,
Judith F. Kribelbauer-Swietek5,6,
Antoni J. Gralak5,6,
Rozita Razavi1,
Mihai Albu1,
Alexander Brechalov1,
Zain M. Patel1,3,
Vladimir Nozdrin7,
Georgy Meshcheryakov8,
Ivan Kozin8,
Sergey Abramov4,9,
Alexandr Boytsov4,9,
The Codebook Consortium,
Oriol Fornes10,
Vsevolod J. Makeev4,#,
Jan Grau11,
Ivo Grosse11,
Philipp Bucher12,
Bart Deplancke5,6**,
Ivan V. Kulakovskiy4,8**,
and Timothy R. Hughes1,3**
1Donnelly Centre, University of Toronto, Toronto, ON M5S 3E1, Canada
2Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
3Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 1A8, Canada
4Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991, Moscow, Russia
5Laboratory of Systems Biology and Genetics, Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, 1015, Lausanne, Switzerland
6Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
7Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119991, Moscow, Russia
8Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Russia
9Altius Institute for Biomedical Sciences, Seattle, WA 98121, USA
10Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, BC V5Z 4H4, Canada
11Institute of Computer Science, Martin Luther University Halle-Wittenberg, 06099, Halle, Germany
12Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
#Present address: Cancer Research UK National Biomarker Centre, University of Manchester, Manchester, M20 4BX, UK
*these authors contributed equally
**To whom correspondence should be addressed:
Abstract
We describe an effort (“Codebook”) to determine the sequence specificity of 332 putative and largely uncharacterized human transcription factors (TFs), as well as 61 control TFs. Nearly 5,000 independent experiments across multiple in vitro and in vivo assays produced motifs for just over half of the putative TFs analyzed (177, or 53%), of which most are unique to a single TF. The data highlight the extensive contribution of transposable elements to TF evolution, both in cis and trans, and identify tens of thousands of conserved, base-level binding sites in the human genome. The use of multiple assays provides an unprecedented opportunity to benchmark and analyze TF sequence specificity, function, and evolution, as further explored in accompanying manuscripts. 1,421 human TFs are now associated with a DNA binding motif. Extrapolation from the Codebook benchmarking, however, suggests that many of the currently known binding motifs for well-studied TFs may inaccurately describe the TF’s true sequence preferences.
Supplemental Files.
-
Figure 1.
-
Top. Categories of 393 TFs assayed and their associated constructs.
(.csv)
-
Middle. Graphical summary of assays employed.
-
Bottom-left. Example of performance (as AUROC) of the best performing PWM for TPRX1, for each combination of experiment type (one for motif derivation, and one for motif testing).
(.csv)
-
Bottom-right. Depiction of the approval process for each individual experiment, including comparison of motifs and/or binding sites between replicates, evaluation of motifs across experiments, and motif similarity between related TFs (see Experiment Evaluation). Heatmap shows the number of approved experiments for all 393 TFs across all experiment types.
(.csv)
-
Figure 2.
-
Symmetric heatmap displaying the similarity between expert-curated PWMs for each pair of Codebook TFs, clustered by using Pearson correlation with average linkage. The PWM similarity metric is the correlation between pairwise affinities to 200,000 random sequences of length 50, as calculated by MoSBAT24. Pullouts and labels illustrate specific points in the main text.
(.csv)
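As a rough illustration of the similarity metric described for Figure 2, the sketch below scores two PWMs against the same set of random sequences and correlates the two score vectors. It is not MoSBAT itself: the best-window log-odds score, the single-strand scan, the list-of-dicts PWM layout, and the helper names (`make_pwm`, `best_window_score`, `pwm_similarity`) are assumptions made for illustration, and the default number of sequences is reduced from 200,000 to keep the example fast.

```python
import math
import random

BASES = "ACGT"

def make_pwm(consensus, match=1.0, mismatch=-1.0):
    """Toy log-odds PWM (hypothetical): one dict per position, base -> score."""
    return [{b: (match if b == c else mismatch) for b in BASES} for c in consensus]

def best_window_score(pwm, seq):
    """Highest summed log-odds score over all windows of seq (forward strand only)."""
    width = len(pwm)
    return max(
        sum(pwm[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )

def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def pwm_similarity(pwm_a, pwm_b, n_seqs=2000, seq_len=50, seed=0):
    """Correlate the scores two PWMs assign to the same random sequences."""
    rng = random.Random(seed)
    seqs = ["".join(rng.choice(BASES) for _ in range(seq_len)) for _ in range(n_seqs)]
    scores_a = [best_window_score(pwm_a, s) for s in seqs]
    scores_b = [best_window_score(pwm_b, s) for s in seqs]
    return pearson(scores_a, scores_b)
```

Computing this value for every pair of PWMs would give a symmetric matrix of the kind shown in the heatmap, which could then be clustered.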
-
Figure 3.
-
A-Top. Number of DACH1 and DACH2 orthologs (union of one-to-one and one-to-many) across Ensembl v111 vertebrates and selected invertebrates. Species order reflects the Ensembl species tree.
(.csv)
-
A-Bottom. AlphaFold3-predicted structure of the DACH1 SKI/SNO/DAC region (residues 130–390) bound to an HT-SELEX ligand sequence with a high-scoring PWM hit.
(.cif)
-
B-Top. Sequence logos and sequence relationships of human C-Clamp domains.
-
B-Bottom. AlphaFold3-predicted structure of two full-length SLC2A4RG proteins bound to a CTOP sequence with flanking sequences (chr17:48,048,369-48,048,401), and four Zn2+ ions (grey). The remainder of the proteins (beyond the C-clamp and C2H2-zf domains) are hidden, for visual simplicity.
(.cif)
-
C-Left. Sequence logos of human TFs that are derived from the domestication of Tigger and Pogo DNA transposon DBDs, and have known DNA binding motifs. The tree shown is a maximum-likelihood phylogram from FastTree88, using DBD sequence alignment with MAFFT L-INS-I89, rooted on POGK, which is derived from an older family of Tigger-like elements90,91. The sequence logos shown are Codebook-derived, except for CENPB92.
-
C-Right. Average per-base read count over Tigger15a TOPs in the human genome, for JRK ChIP-seq (orange) and GHT-SELEX (purple), with sequences aligned to the Tigger15a consensus sequence. JRK PWM scores at each base of the Tigger15a consensus sequence are shown in black (plus strand) and grey (minus strand).
-
Figure 4.
-
A. Heatmaps of phyloP scores over the PWM hit and 50 bp flanking for TOP sites for four TFs (two controls and two Codebook TFs). Statistical test results (see main text and Methods) are indicated at right.
(.tar.gz)
-
B.
Left: Donut plot showing the proportion and number of clusters of conserved TOP (CTOP) sites that overlap the genomic features indicated.
(.tar.gz)
Right: Bar plot showing the mean number of individual CTOPs contained within clusters that overlap the examined genomic regions.
(.tar.gz)
-
C. A 1,420-base, CpG-island-overlapping CTOP cluster (chr12:120368293-120369713). Zoonomia 241-mammal phyloP scores and Multiz 471 Mammal alignment PhastCons Conserved Elements are shown.
(CTOPs.bed; all annotations are available from UCSC)
-
D. Bar plot of the frequency of TFs with CTOPs that occur most frequently in CTOP clusters that overlap CpG and non-CpG protein coding promoters, respectively.
(.tar.gz)
-
E. CTOP cluster overlapping the non-CpG promoter at chr12:57,745,278-57,745,396.
-
F. CTOP site for the KRAB-C2H2-zf protein ZNF689, overlapping an L1ME4a located at chr16:25,403,631-25,403,717.
-
Figure 5. (.xlsx)
-
A. Scheme of the analysis: identification of allele-specific binding sites from Codebook ChIP-Seq and GHT-SELEX data and annotation of allele-specific chromatin accessibility variants with the Codebook motifs.
-
B. Distribution of PWM score (log-odds) fold changes between alleles for non-ASB SNPs, ASBs in peaks, and ASBs in TOPs. Left, 32 positive control TFs, Right, 85 Codebook TFs. P-values: Mann-Whitney U test.
-
C. An example ASV for ZNF70, in chr12:6,763,200-6,765,850, approximately 1 kb upstream of the PTMS gene. Inset shows the exact location of the ASV (with A/G alleles) together with the corresponding PWM hit. Allelic read counts for three available ATAC- and DNase-seq samples are shown on the side.
-
D. The ratio of concordant-to-discordant PWM hits for non-ASVs (red), all ASVs (yellow), ASVs overlapping with peaks (blue), and ASVs in TOPs (green). P-values: Fisher's exact test.
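For readers who want to reproduce this kind of comparison, the following is a minimal sketch of a Fisher's exact test on a 2x2 table of concordant and discordant counts. The two-sided form, the table layout, and the function names are assumptions; the legend does not state which alternative was used, and the counts in the usage note are invented for illustration.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio of the table [[a, b], [c, d]]; infinite if b*c == 0."""
    return (a * d) / (b * c) if b * c else float("inf")

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)
```

For example, with hypothetical rows (ASVs, non-ASVs) and columns (concordant, discordant), `fisher_exact_two_sided(30, 10, 20, 20)` tests whether the concordant-to-discordant ratio differs between the two groups.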
-
E. Left, Fraction of ASVs overlapping with PWM hits for four example TFs, using 4 different thresholds on ASV significance: all SNPs (blue), 25% FDR ASVs (yellow), 10% FDR ASVs (orange), and 5% FDR ASVs (red). Right, Fraction of ASVs at each location within the genome-wide PWM hits of the representative TFs using four thresholds (the same colors as in bar plots). SNP: single-nucleotide polymorphism, ASB: allele-specific binding, ASV: allele-specific chromatin accessibility variant.
-
Figure 6. TFs are categorized into structural classes based on Lambert et al.1. See Table S10 for underlying information.
-
Figure S1.
-
A. Cases in which the external PWM matches that of a well-studied TF that is a frequent “contaminant” motif in ChIP-seq93. In each example, the top sequence logo represents the external PWM, and the bottom sequence logo represents a highly-similar CisBP PWM.
-
B. Cases in which the external PWM (top in each example) is consistent with the Codebook PWM for the same TF (bottom in each example).
-
C. External PWM sequence logos that cannot be explained as known contaminants or artifacts, some of which are supported by multiple lines of evidence, and thus appear accurate.
-
Figure S2.
-
A. Histogram displays the maximum information content (IC) for any position within the expert-curated PWM for all Codebook and control TFs. Logos are shown for TFs at various maximum positional IC values, for illustration. Red dashed line indicates an IC of 1.4.
(Column C .csv)
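The maximum positional information content plotted in panel A can be sketched as follows. This assumes a uniform background (0.25 per base) and a PWM given as per-position base frequencies; the function names are hypothetical, and the 1.4 value is simply the reference line drawn in the histogram.

```python
import math

def position_ic(freqs, background=0.25):
    """Information content (bits) of one position from its base frequencies."""
    return sum(p * math.log2(p / background) for p in freqs if p > 0)

def max_positional_ic(frequency_matrix):
    """Largest per-position information content across the motif."""
    return max(position_ic(column) for column in frequency_matrix)

# Toy motif: one invariant position, one partly degenerate, one uninformative.
toy_motif = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
]
print(max_positional_ic(toy_motif) >= 1.4)  # compare against the dashed line
```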
-
B. AUROC scores for original vs. IC-increased PWMs, discriminating ChIP-seq or GHT-SELEX peaks vs. random genomic background loci.
(Perspectives_FigureS2_data.csv, columns I-L)
-
C. Maximum Jaccard index for ChIP-seq or GHT-SELEX peak sets, using the approach described for optimized TOPs in Methods, for original vs. IC-increased PWMs.
(Perspectives_FigureS2_data.csv, columns E-H)
-
Figure S3.
-
A-D. Histograms of Jaccard indices measuring the overlap between two ChIP-seq peak sets for the same TF: A: Codebook ChIP-seq replicates; B, C, D: Codebook ChIP-seq vs. external ChIP-seq performed in HEK293 cells (B), HepG2 cells (C), or K562 cells (D).
(.csv)
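A Jaccard index between two peak sets can be computed in several ways. The sketch below uses base-pair coverage on a single chromosome with half-open (start, end) intervals, which is one common convention and an assumption here; the exact definition used for these histograms is given in the paper's Methods, not in this sketch.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge(intervals))

def jaccard(peaks_a, peaks_b):
    """Covered bases shared by both sets divided by bases covered by either."""
    union = covered_bases(list(peaks_a) + list(peaks_b))
    if union == 0:
        return 0.0
    intersection = covered_bases(peaks_a) + covered_bases(peaks_b) - union
    return intersection / union
```

For peak sets spanning several chromosomes, the intersection and union sizes would be summed across chromosomes before dividing.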
-
E. AUROC scores for expert curated Codebook PWMs (columns), discriminating ChIP-seq peaks vs. random genomic background loci. Rows show different cell types.
(.csv)
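The AUROC values reported here measure how well PWM scores separate peaks from background loci. A minimal sketch of that calculation, assuming each locus has already been reduced to a single PWM score (how that score is obtained is not covered here), uses the pairwise-ranking definition of AUROC; the function name is hypothetical.

```python
def auroc(peak_scores, background_scores):
    """Probability that a random peak outscores a random background locus.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in peak_scores:
        for b in background_scores:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5
    return wins / (len(peak_scores) * len(background_scores))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation. The quadratic loop is fine for a small example; a rank-based implementation would be preferable for full peak sets.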
-
F-G. Comparison of AUROC scores at discriminating ChIP-seq peak sets (as in E), for the 19 TFs that have a Codebook peak set (CP), a Codebook motif (CM), an external peak set (EP), and an external motif (EM), for Codebook ChIP-seq data (F) and external ChIP-seq data (G). The seven TFs with an AUROC of < 0.55 on either axis of either plot are highlighted.
(.csv)
-
H. Sequence logos for the seven TFs highlighted in F and G. All Codebook PWMs shown are supported by ChIP-seq, GHT-SELEX, and HT-SELEX. Asterisk indicates that the Codebook PWM is additionally supported by SMiLE-seq data.
(.csv)
-
Figure S4. Bar graph shows number of individual TOP sites obtained for each TF. Heatmaps below indicate other properties of each TF and its TOP sites.
(.gz)
-
Figure S5. (.xlsx)
-
A. Codebook ASB calling workflow: SNP calling with bcftools, mapping bias correction with WASP, background allelic dosage reconstruction with BABACHI, statistical scoring of the allelic imbalance with MIXALIME, and motif annotation with PERFECTOS-APE.
-
B. Motif concordance of Codebook ASBs. X-axis: ASB significance (i.e. allelic preference; log10 FDR, minus side: preference for Ref, plus side: preference for Alt). Y-axis: log2 PWM score fold-change between Alt vs. Ref. The plot shows only strongly concordant and strongly discordant sites with |log2(Fold Change)| ≥ 1.
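The concordance classification plotted in panel B can be sketched as below. The inputs mirror the two axes (a signed log10 FDR where positive means preference for Alt, and the log2 PWM score fold-change of Alt vs. Ref). The function name, the "weak" label for sites under the |log2(Fold Change)| ≥ 1 cutoff, and the handling of a zero sign are assumptions for illustration.

```python
def classify_asb(signed_log10_fdr, log2_fold_change, min_abs_fc=1.0):
    """Label an ASB by whether the motif change agrees with the allelic preference.

    signed_log10_fdr > 0: preference for Alt; < 0: preference for Ref.
    log2_fold_change > 0: the PWM scores Alt higher than Ref.
    """
    if abs(log2_fold_change) < min_abs_fc or signed_log10_fdr == 0:
        return "weak"  # not plotted as strongly concordant or discordant
    same_direction = (signed_log10_fdr > 0) == (log2_fold_change > 0)
    return "concordant" if same_direction else "discordant"

def concordance_ratio(sites):
    """Concordant-to-discordant ratio over (signed_log10_fdr, log2_fc) pairs."""
    labels = [classify_asb(fdr, fc) for fdr, fc in sites]
    discordant = labels.count("discordant")
    return labels.count("concordant") / discordant if discordant else float("inf")
```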
-
C. Fraction of Codebook ASBs (combined) coinciding with GTEx eQTLs and ADASTRA known ASBs at different FDR thresholds for ASB calling. Fisher's exact test odds ratios (OR) and P-values for ASBs at 5% FDR (covering 16,724 SNPs, dashed line) are labeled on the plot.
-
D. Workflow for detection of TFs involved in allele-specific chromatin accessibility. UDACHA DNase-seq and ATAC-seq ASVs across different cell types were annotated with Codebook motifs, followed by motif enrichment and motif concordance analysis, combining the resulting P-values across the cell types, and FDR correction for multiple tested motifs. Central call-outs: details of the motif enrichment and motif concordance test using SP140 motif for illustration. SNPs (rs946245, rs77238721, rs11771930, rs2838028, rs2562353, rs12112389, rs147176938, rs6798390) illustrating the cells of the 2x2 contingency tables are actual UDACHA ASVs with or without motif hits of selected TFs.
-
E. Scatterplot of Median Odds Ratios of PWM scores within the ASVs enriched in and concordant with the PWM matches. Black (gray): motifs significant for both (either) DNase-seq and ATAC-seq. The asterisk denotes TFs that exhibit significant enrichment considering peaks-supported PWM hits only.
-
F. Bar plots: Fraction of ASVs overlapping with PWM hits for 13 TFs, using 4 different thresholds on ASV significance: all SNPs (blue), 25% FDR ASVs (yellow), 10% FDR ASVs (orange), and 5% FDR ASVs (red). Line plots: Fraction of ASVs at each location within the genome-wide PWM hits of the representative TFs (P-value < 0.001) using four thresholds (the same colors as in bar plots). SNP: single-nucleotide polymorphism, ASB: allele-specific binding, ASV: allele-specific chromatin accessibility variant.