Web supplement to
"Known sequence features can explain half of all human gene ends"

Aleksei Shkurin1 and Timothy R. Hughes*,1,2

1 Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto M5S 3E1, Canada
2 Department of Molecular Genetics, University of Toronto, Toronto M5S 1A8, Canada

*To whom correspondence should be addressed:

Abstract

Cleavage and polyadenylation (CPA) sites define eukaryotic gene ends. CPA sites are associated with five key sequence recognition elements: the upstream UGUA, the polyadenylation signal (PAS), and U-rich sequences; the CA/UA dinucleotide where cleavage occurs; and GU-rich downstream elements (DSEs). Currently, it is not clear whether these sequences are sufficient to delineate CPA sites. Additionally, numerous other sequences and factors have been described, often in the context of promoting alternative CPA sites and preventing cryptic CPA site usage. Here, we dissect the contributions of individual sequence features to CPA using classical discriminative approaches. We show that models composed only of the five primary CPA sequence features give the highest scores to constitutive CPA sites at the ends of coding genes, relative to the entire pre-mRNA sequence, for 41% of all human genes. U1-hybridizing sequences provide a small boost in performance. Addition of all known RBP RNA-binding motifs to the model, however, increases this figure to 49% and suggests involvement of both known and suspected CPA regulators, as well as potential new factors, in delineating constitutive CPA sites. To our knowledge, this high effectiveness of established features in predicting human gene ends has not previously been documented.

BED files: directory contains the .bed files necessary to recreate the dataset used to train the baseline and cryptic models

FASTA files: directory contains the full negative datasets used to train and test the baseline model

Heatmaps

PWMs and hexamers

Correction