1. Removed GFPs from Batch 2: Experiments THC_0115 and THC_0169, which correspond to GFP, are "Trash". Rozy has marked them with "Exclude": "The GFPs that were removed were prepared from cells obtained from GB lab and had turned out to be something else … so we excluded them and made our own GFP lines." Files:
    Plate_1_H10_S80_R1_001.fastq.gz
    Plate_1_H10_S80_R2_001.fastq.gz
    Plate_2_H9_S168_R1_001.fastq.gz
    Plate_2_H9_S168_R2_001.fastq.gz

2. Michelle and Diana conundrum! Batch 4 rows appear twice (two rows per single THC ID). At Memorial Sloan Kettering Cancer Center (Quad's lab), two technicians named "Michelle" and "Diana" did the Batch 3 and 4 sequencing for us; however, they have since disappeared! Batch 3 was split between them (no overlap between samples), but both of them sequenced all the Batch 4 samples. Therefore, we have two files for each Batch 4 THC experiment. We treated them as sequencing replicates and kept both files. The intermediate files have either "M" or "D" in their UID to indicate this.

3. Wrong sample ID for some YY1 experiments: The sample ID is not very important, but for THC_0080, THC_0095, THC_0177, THC_0253, and THC_0261, which are YY1, the sample ID used to be pTH13796.1.1, which is wrong. It has been replaced with pTH15832.1.1 or pTH15832.1.2.

4. Bad SNAI1 in Batch 0: The SI0375 sample for SNAI1 is marked as bad (red) because of this note: "The file name does not match. Please remove from data set".

5. Missing experiments in Batch 3: In Batch 3, the THC_0272 (NCOA2) and THC_0277 (ZBED5) records are marked red because their files are missing. The sequencing machine probably failed on these samples (no sequence was returned for them).
Rozy's email (2023-05-11):
Thank you for this update. I am responding to the last section of your email about the ChIP samples names and the removed samples (THC_0272 & THC_277). The gene names have been fixed and I will upload the corrected version along with the new samples section.
The two missing samples, from our end, were traced back to samples that were submitted for sequencing but never yielded a sequencing file. Maybe sequencing was not successfully done. We were not sure, so we removed it from the file.

6. RBP in samples: THC_0281 (MZF1) is partially marked red because it is an RBP and not a TF. Although it is marked, we have all of its files (it was not removed from the table).

7. Batch 5 missing files: In Batch 5, THC_0578 (ZNF578) and THC_0712 (MYRFL) are marked as "No matching FQ". There are no files for them, and they are marked red.

8. Removing THC_0894: The plasmid ID is pTH13667 (ZNF208), but the gene ID is ZNF226. I think the plasmid ID should be pTH13669 (ZNF226). The original plasmid ID was pTH13667, which is marked as "ZNF208" in the plasmids sheet (I don't know why, because it doesn't appear anywhere else, not even in the experimental_info sheet). However, the protein ID is "ZNF226" in the ChIP sheet, and the plasmid for "ZNF226" is probably pTH13669. I did a peak similarity check, and it looks more like "ZNF226" than "ZNF208", so I changed the last digit of the plasmid ID from 7 to 9. (Even if pTH13667 truly exists, the protein is ZNF226 and should be changed in the plasmids sheet.) However, as the similarity to ZNF226 is only slightly higher than to ZNF208 (Jaccard 0.080 vs 0.071), I have decided to remove it.

9. Wrong plasmid or protein ID: The naming of some proteins was wrong or bad.
These are the changes (old --> new):
    C11orf95 --> ZFTA (in THC_0189 and THC_0197)
    ZUFSP --> ZUP1 (in THC_0287, THC_0415, and THC_0433, for both Diana and Michelle records)
    OCT4/POU5F1 --> POU5F1 (in THC_0530 and THC_0626)
    cJUN --> JUN (in THC_0869 and THC_0871)
    ZNF88 --> ZNF788P (in THC_0650, THC_0696, and THC_0792)
    ZNF382 --> ZNF362 (in THC_0910 and THC_0916): The plasmid for these experiments is pTH13685, which is marked as "ZNF362" in the plasmids sheet and also in THC_0411, THC_0364, GHT01491, GHT01493, and multiple HT-SELEX experiments (double-checked against the online "Metadata"(!!!) sheet). The protein ID was entered as "ZNF382" in the ChIP sheet, which is unlikely to be true. I also did a peak similarity check, which confirmed this.
    EGFP_OLD --> GFP (in THC_0479, THC_0845, THC_0874, and THC_0879) for pTH16502.
    (No change): pTH13682 doesn't appear in the plasmids sheet. It is assigned to ZNF326 in THC_0332, THC_0388, and THC_0422 (both D and M). Note that it exists in the original Experimental_Info sheet.
    EGFP --> GFP (in THC_0553, THC_0747, and THC_0875) for pTH13195, to be consistent with GHT.

10. Submission of BigWig files: Our complete set of BigWigs is "Zain_BigWig", but they were actually made from the "Hamed_BAMs_sorted" files. For that, bamCoverage from deepTools was used, as follows:
    submitjob -w 99 -c 6 -m 10 -N node_x bamCoverage -p 6 --minMappingQuality 30 --binSize 1 --normalizeUsing RPKM --effectiveGenomeSize 2913022398 --skipNonCoveredRegions -of bigwig --bam bam_file --outFileName bw_file

11. Submission of Trimmed files: For the trimmed ChIP-seq files, we trimmed the reads again using this command (assuming all of the samples have the "original" Illumina ChIP-seq adaptors):
Paired-end (Batches 2, 3, 4 M/D, and 5):
    submitjob -w 23 -m 2 -c 1 cutadapt --quiet --cores 1 -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -B AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ${output_file_R1} -p ${output_file_R2} ${input_file_R1} ${input_file_R2}
Single-end (Batch 1): [the same code!]
    submitjob -w 99 -m 4 -c 16 cutadapt --quiet --cores 4 -b AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -B AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ${output_file_R1} -p ${output_file_R2} ${input_file_R1} ${input_file_R2}
Batch 0 is ignored, as we are supposed to provide the "Schmitges et al." link for it instead of uploading the files ourselves.
Note that the forward adaptor is:
    AGATCGGAAGAGC ACACGTCTGAACTCCAGTCA (shared + unique)
and the reverse adaptor is:
    AGATCGGAAGAGC GTCGTGTAGGGAAAGAGTGT (shared + unique)

12. About MACS outputs: Each peak file has a ".narrowPeak" file as well, containing these columns:
    1. chromosome
    2. chromStart
    3. chromEnd
    4. Name
    5. Score
    6. Strand
    7. Signal Value (overall enrichment of the region)
    8. pValue (-log10)
    9. qValue (FDR; -log10)
    10. Peak (point-source called for this peak; 0-based offset from chromStart)

13. Batch 0 samples: In Batch 0 (old Six samples from the Schmitges et al. paper), there are 39 samples with two read files each (corresponding to two different flow cell lanes, like L003 and L004). They are all single-end reads. In the sheet, the second read file is put in the "Reads - R2" columns and coloured blue; do not confuse it with read 2 of a paired-end run. For the later steps (trimming, mapping, peak calling, etc.), these read files were merged into a single sample.
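
The notes do not record how the two lane files of a Batch 0 sample (item 13) were merged. One common way is to concatenate the gzipped FASTQ files directly, since concatenated gzip streams are still a valid gzip file. The sketch below shows that approach only; it is not necessarily what was run here, and the function name merge_lanes and the file names in the example are hypothetical.

```bash
# Sketch (assumption): merge the per-lane FASTQ files of one single-end sample.
# Concatenated gzip streams form a valid gzip file, so plain `cat` is enough
# and nothing has to be decompressed.

merge_lanes() {
    # usage: merge_lanes MERGED.fastq.gz LANE_A.fastq.gz [LANE_B.fastq.gz ...]
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_lanes OUT.fastq.gz IN.fastq.gz [IN.fastq.gz ...]" >&2
        return 1
    fi
    local merged="$1"
    shift
    local lane
    for lane in "$@"; do
        # Fail early rather than silently writing a partial merge.
        [ -r "$lane" ] || { echo "cannot read: $lane" >&2; return 1; }
    done
    cat "$@" > "$merged"
}

# Example with hypothetical file names:
# merge_lanes SAMPLE_merged.fastq.gz SAMPLE_L003_R1_001.fastq.gz SAMPLE_L004_R1_001.fastq.gz
```

Merging before trimming and mapping keeps one file per sample in every later step, which matches the single-sample treatment described in item 13.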