Web supplement to
"Most Dark Matter Transcripts Are Associated With Known Genes"

Harm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*

1 Banting and Best Department of Medical Research, 2 Department of Molecular Genetics, Terrence Donnelly Centre for Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, Canada, M5S 3E1. *To whom correspondence should be addressed:

Abstract

A series of reports over the last few years have indicated that a much larger portion of the mammalian genome is transcribed than can be accounted for by currently annotated genes, but the quantity and nature of these additional transcripts remain unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess the quantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seq identifies many fewer transcribed regions (“seqfrags”) outside known exons and ncRNAs. Most nonexonic seqfrags are in introns, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority of intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sites identified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads that display characteristics of random sampling from a low-level background or several thousand small transcripts (median length = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions with open chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance are generally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.

Human splice junction tracks

Track name | UCSC | Bed file
Human Brain splice junctions (PE reads) | View in genome browser | View/Download
Human UHR splice junctions (PE reads) | View in genome browser | View/Download
Human Adipose splice junctions (SE reads) | View in genome browser | View/Download
Human Brain (HCT168) splice junctions (SE reads) | View in genome browser | View/Download
Human Brain (s1368) splice junctions (SE reads) | View in genome browser | View/Download
Human Colon splice junctions (SE reads) | View in genome browser | View/Download
Human Heart splice junctions (SE reads) | View in genome browser | View/Download
Human Liver splice junctions (SE reads) | View in genome browser | View/Download
Human Lymph Node splice junctions (SE reads) | View in genome browser | View/Download
Human Skeletal Muscle splice junctions (SE reads) | View in genome browser | View/Download
Human Testes splice junctions (SE reads) | View in genome browser | View/Download

Reconstructed human transcript units (TU)

Track name | UCSC | Bed file
All TUs | View in genome browser | View/Download
TUs that share exons with annotated genes | View in genome browser | View/Download
TUs that overlap annotated genes (no shared exons) | View in genome browser | View/Download
Antisense TUs | View in genome browser | View/Download
Intergenic TUs | View in genome browser | View/Download

Intergenic Seqfrags and Seqfrag clusters

Track name | UCSC | Bed file
Human intergenic seqfrags | View in genome browser | View/Download
Human intergenic seqfrag clusters | View in genome browser | View/Download
Mouse intergenic seqfrags | View in genome browser | View/Download
Mouse intergenic seqfrag clusters | View in genome browser | View/Download
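
For readers who download the Bed files above, the following is a minimal sketch of how feature lengths (such as the median seqfrag length quoted in the abstract) could be summarized. It assumes only the standard BED convention of 0-based, half-open coordinates in the first three columns; the function names and the example records are hypothetical and are not taken from the supplement's files.

```python
import statistics
from typing import Iterable, Iterator, List, Tuple


def parse_bed(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) for each BED record.

    Skips blank lines, comments, and UCSC 'track'/'browser' header lines.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        yield fields[0], int(fields[1]), int(fields[2])


def feature_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length in bp of each feature.

    BED intervals are 0-based and half-open, so length = end - start.
    """
    return [end - start for _, start, end in parse_bed(lines)]


# Hypothetical records for illustration only (not real supplement data).
example = [
    'track name="example" description="hypothetical"',
    "chr1\t100\t211\tfrag1",
    "chr1\t500\t560\tfrag2",
    "chr2\t1000\t1300\tfrag3",
]

lengths = feature_lengths(example)
print(lengths, statistics.median(lengths))
```

To apply this to a downloaded file, the same functions can be given an open file handle in place of the example list, for instance `feature_lengths(open(path))`, where `path` is wherever the Bed file was saved.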

Paired-end RNA-Seq data

Solexa FastQ files for the human brain and UHR samples: