Steps in Grouping of Sequences (“Clustering”):
- We constructed an ordered build of the mouse genome by piecing together the genomic contigs using the map downloaded from NCBI.
- The cDNA sequences from MGI were compiled from the MRK_sequences.rpt file downloaded from the MGI website (available from our website). That file links MGI marker IDs to GenBank and RefSeq IDs. For each MGI marker, we kept the accession ID and sequence of the non-XM RefSeq cDNA containing the longest coding sequence. If no such cDNA was available, we performed the same search in GenBank. Finally, if no GenBank record was associated with the MGI marker, we kept the RefSeq XM cDNA with the longest coding sequence.
- All cDNA sequences from all databases (see Table 1) are mapped to build 32 of the mouse genome using BLAT.
- For each match of a cDNA to the genome, a BLAT score is calculated (Score = number of base matches - number of base mismatches).
- Matches that achieve a score of less than 95% of the length of the cDNA are removed.
- From the remaining hits, only those that achieve the maximum score are retained.
- Using the cDNA-to-genome matches described above, sequences are first clustered into groups according to the overlap of their start and stop positions (cDNAs must hit the same strand to be put into the same group).
- All sequences within this initial cluster are re-evaluated according to the individual exon boundaries.
- Starting with an anchor (preference initially given to the more curated datasets, see Table 1), all sequences are compared to it. Those sequences with significant similarity (overlap of 50% of the exons or greater) are retained in the cluster. Those without significant similarity are split into separate clusters. As the iteration through the cluster progresses, the “anchor” can change to another sequence if that sequence has a larger open reading frame (ORF).
- For each cluster, a single sequence is selected. This is done in two different ways: Anchor and Representative. For Anchor, the anchor sequence is selected. For Representative, a sequence from MGI or RefSeq is selected from the cluster if one exists. Otherwise, the sequence with the longest ORF is taken.
- Sequences with short ORFs are removed. Those that are predictions must have ORFs longer than 180 amino acids.
- FASTA-formatted files of the cDNA and protein sequences are created for both the Anchor and Representative lists.
Notes:
- Some sequences match genome sequence that is not in the ordered build we created. For example, they match genomic contigs that are not “localized”. In the case of RefSeq, we appended these sequences if they had open reading frames longer than 200 amino acids.
Table 1. Mouse cDNA databases.

| cDNA Database | Preference Rank |
|---|---|
| mgi | 1 |
| ensembl | 2 |
| refseq | 3 |
| fantom | 4 |
| unigene | 5 |
| ensembl-abinitio | 6 |
| geneid | 7 |
| SGP | 8 |
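
As a rough illustration of how the preference ranks in Table 1 might feed into the Anchor and Representative choices, here is a minimal Python sketch. The `Seq` record and the functions `initial_anchor` and `representative` are hypothetical names. Breaking ties by ORF length, and choosing the longest ORF among several MGI or RefSeq candidates, are assumptions that the text does not specify.

```python
from dataclasses import dataclass

# Preference ranks copied from Table 1 (lower rank = more curated source).
PREFERENCE_RANK = {
    "mgi": 1, "ensembl": 2, "refseq": 3, "fantom": 4, "unigene": 5,
    "ensembl-abinitio": 6, "geneid": 7, "SGP": 8,
}


@dataclass(frozen=True)
class Seq:
    """One clustered sequence (hypothetical record)."""
    seq_id: str
    source: str   # a key of PREFERENCE_RANK
    orf_len: int  # ORF length in amino acids


def initial_anchor(cluster):
    """Pick the starting anchor from the best-ranked source.
    Ties are broken by the longer ORF (an assumption)."""
    return min(cluster, key=lambda s: (PREFERENCE_RANK[s.source], -s.orf_len))


def representative(cluster):
    """Prefer an MGI or RefSeq sequence if one exists, otherwise take the
    longest ORF. Among several curated candidates the longest ORF is
    chosen (an assumption)."""
    curated = [s for s in cluster if s.source in ("mgi", "refseq")]
    pool = curated or cluster
    return max(pool, key=lambda s: s.orf_len)
```

Note that this only picks the starting anchor; per the steps above, the anchor can later move to a sequence with a larger ORF as the cluster is iterated.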