Web supplement to
"G+C content dominates intrinsic nucleosome occupancy"

Desiree Tillo1 and Timothy R. Hughes1,2*

1Department of Molecular Genetics, 2Banting and Best Department of Medical Research, Terrence Donnelly Centre for Biomolecular Research, University of Toronto, 160 College Street, Toronto, ON, Canada, M5S 3E1. *To whom correspondence should be addressed:

Abstract

Background

The relative preference of nucleosomes to form on individual DNA sequences plays a major role in genome packaging. A wide variety of DNA sequence features are believed to influence nucleosome formation, including periodic dinucleotide signals, poly-A stretches and other short motifs, and sequence properties that influence DNA structure, including base content. It was recently shown by Kaplan et al. that a probabilistic model using composition of all 5-mers within a nucleosome-sized tiling window accurately predicts intrinsic nucleosome occupancy across an entire genome in vitro. However, the model is complicated, and it is not clear which specific DNA sequence properties are most important for intrinsic nucleosome-forming preferences.

Results

We find that a simple linear combination of only 14 simple DNA sequence attributes (G+C content, two transformations of dinucleotide composition, and the frequency of eleven 4-bp sequences) explains nucleosome occupancy in vitro and in vivo in a manner comparable to the Kaplan model. G+C content and frequency of AAAA are the most important features. G+C content is dominant, alone explaining ~50% of the variation in nucleosome occupancy in vitro.

Conclusions

Our findings provide a dramatically simplified means to predict and understand intrinsic nucleosome occupancy. G+C content may dominate because it both reduces frequency of poly-A-like stretches and correlates with many other DNA structural characteristics. Since G+C content is enriched or depleted at many types of features in diverse eukaryotic genomes, our results suggest that variation in nucleotide composition may have a widespread and direct influence on chromatin structure.

Link to paper

Lasso model predictions

Tab-delimited files in the format: chromosome <tab> genomic coordinate (midpoint of 150-base window, 0-based) <tab> Lasso model score
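
As an illustration of the format above, the following is a minimal sketch of a reader for these files. It is not part of the distributed source code. The names LassoScore and read_lasso_scores are hypothetical, the sample rows are invented for demonstration, and the sketch assumes that the files have no header line and that coordinates are written as integers.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class LassoScore(NamedTuple):
    """One row of a prediction file (hypothetical container name)."""
    chromosome: str
    midpoint: int   # 0-based midpoint of the 150-base window
    score: float    # Lasso model score


def read_lasso_scores(handle: TextIO) -> Iterator[LassoScore]:
    """Yield one LassoScore per non-empty, tab-delimited line.

    Assumptions (not stated on this page): no header line, and
    integer-valued coordinates.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chromosome, midpoint, score = row[:3]
        yield LassoScore(chromosome, int(midpoint), float(score))


# Invented sample rows, for illustration only (not real predictions).
example = "chr1\t75\t0.42\nchr1\t76\t-0.13\n"
records = list(read_lasso_scores(io.StringIO(example)))
for record in records:
    print(record.chromosome, record.midpoint, record.score)
```

For a real file, pass an open file handle instead of the io.StringIO object, for example open(path, newline=""), where path is the location of a downloaded prediction file.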

Source code