Steps in Grouping of Sequences (“Clustering”):

  1. We constructed an ordered build of the mouse genome by piecing together the genomic contigs using the map downloaded from NCBI.
  2. The cDNA sequences from MGI were compiled from the MRK_sequences.rpt file downloaded from the MGI website (available from our website). That file links MGI marker IDs to GenBank and RefSeq IDs. For each MGI marker, we kept the accession ID and sequence of the cDNA containing the longest coding sequence among the non-XM RefSeq records. If no such cDNA was available, we performed the same search in GenBank. Finally, if no GenBank record was associated with the MGI marker, we kept the RefSeq XM cDNA with the longest coding sequence.
  3. All cDNA sequences from all databases (see Table 1) are mapped to build 32 of the mouse genome using BLAT.
  4. For each match of a cDNA to the genome, a BLAT score is calculated (Score = Number of base matches – Number of base mismatches).
  5. Matches that achieve a score of less than 95% of the length of the cDNA are removed.
  6. From the remaining hits, only those that achieve the maximum score for a given cDNA are retained.
  7. Using the cDNA-to-genome matches described above, sequences are first clustered into groups according to the overlap of their start and stop positions (cDNAs must hit the same strand to be put into the same group).
  8. All sequences within this initial cluster are re-evaluated according to the individual exon boundaries.
  9. Starting with an anchor (preference initially given to the more curated datasets, see Table 1), all sequences are compared to it.  Those sequences with significant similarity (overlap of 50% of the exons or greater) are retained in the cluster.  Those without significant similarity are split into separate clusters.  As the iteration through the cluster progresses, the “anchor” can change to another sequence if it has a larger open reading frame (ORF).
  10. For each cluster, a single sequence is selected.  This is done in two different ways: Anchor and Representative.  For Anchor, the anchor sequence is selected.  For Representative, a sequence from MGI or RefSeq is selected from the cluster if one exists.  Otherwise, the sequence with the longest ORF is taken.
  11. Sequences with short ORFs are removed.  Those that are predictions must have ORFs greater than 180 amino acids.
  12. FASTA-formatted files are created for both the cDNA and the protein sequences of each of the Anchor and Representative lists.
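
Steps 4 through 7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to build the clusters: the `Hit` record and its field names are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and "overlap of start and stop" is assumed to mean that the genomic spans of two hits on the same chromosome and strand intersect.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One cDNA-to-genome match (hypothetical record; field names are illustrative)."""
    cdna_id: str
    cdna_len: int
    chrom: str
    strand: str
    start: int       # genomic start, 0-based (assumption)
    end: int         # genomic end, exclusive (assumption)
    matches: int
    mismatches: int


def blat_score(hit: Hit) -> int:
    # Step 4: score = number of base matches - number of base mismatches.
    return hit.matches - hit.mismatches


def best_hits(hits: list[Hit], min_fraction: float = 0.95) -> list[Hit]:
    # Step 5: remove hits scoring less than 95% of the cDNA length.
    kept = [h for h in hits if blat_score(h) >= min_fraction * h.cdna_len]
    # Step 6: for each cDNA, keep only the hit(s) with the maximum score.
    top: dict[str, int] = {}
    for h in kept:
        top[h.cdna_id] = max(top.get(h.cdna_id, blat_score(h)), blat_score(h))
    return [h for h in kept if blat_score(h) == top[h.cdna_id]]


def initial_clusters(hits: list[Hit]) -> list[list[str]]:
    # Step 7: group hits whose genomic spans overlap on the same strand.
    by_strand: dict[tuple[str, str], list[Hit]] = defaultdict(list)
    for h in hits:
        by_strand[(h.chrom, h.strand)].append(h)

    clusters: list[list[str]] = []
    for key in sorted(by_strand):
        current: list[str] = []
        current_end = 0
        for h in sorted(by_strand[key], key=lambda x: (x.start, x.end)):
            if current and h.start < current_end:
                # Overlaps the running span: extend the current cluster.
                current.append(h.cdna_id)
                current_end = max(current_end, h.end)
            else:
                # No overlap: close the current cluster and start a new one.
                if current:
                    clusters.append(current)
                current, current_end = [h.cdna_id], h.end
        if current:
            clusters.append(current)
    return clusters
```

Calling `initial_clusters(best_hits(hits))` gives the coarse groups that steps 8 and 9 would then refine using exon boundaries; that refinement is not covered by this sketch.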

Notes:

- Some sequences match genomic sequence that is not in the ordered build that we created.  For example, they match genomic contigs that are not “localized”.  In the case of RefSeq, we appended these sequences if they had open reading frames greater than 200 amino acids.
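
The ORF thresholds used in this procedure (greater than 180 amino acids for predictions, greater than 200 for unlocalized RefSeq sequences) depend on how ORF length is measured, which is not spelled out here. The following Python sketch shows one possible measurement. It assumes an ORF runs from an ATG to the first in-frame stop codon, searches only the three forward frames, counts the initiating methionine, and ignores ORFs with no stop codon. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_aa(cdna: str) -> int:
    """Length in amino acids of the longest ATG-to-stop ORF (forward frames only)."""
    seq = cdna.upper()
    best = 0
    for frame in range(3):
        start = None  # position of the first ATG of the ORF currently open
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Codons from the ATG up to, but not including, the stop.
                best = max(best, (i - start) // 3)
                start = None
    return best


def passes_orf_filter(cdna: str, min_aa: int = 180) -> bool:
    # Strictly greater than the threshold, as in "greater than 180 amino acids".
    return longest_orf_aa(cdna) > min_aa
```

Under these assumptions, `passes_orf_filter(seq, min_aa=200)` would express the check applied to unlocalized RefSeq sequences.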

Table 1.  Mouse cDNA databases.

cDNA Database       Preference Rank
mgi                 1
ensembl             2
refseq              3
fantom              4
unigene             5
ensembl-abinitio    6
geneid              7
SGP                 8
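
The preference ranks in Table 1 can be used to pick the initial anchor of a cluster, and the Representative rule can be expressed in the same terms. The Python sketch below is illustrative only. The `Member` record is hypothetical, and two tie-breaking choices are assumptions not stated in the procedure: sequences with equal rank are ordered by ORF length, and when a cluster contains both an MGI and a RefSeq sequence, the Table 1 rank decides between them.

```python
from typing import NamedTuple

# Preference ranks copied from Table 1 (lower rank = more preferred).
PREFERENCE_RANK = {
    "mgi": 1,
    "ensembl": 2,
    "refseq": 3,
    "fantom": 4,
    "unigene": 5,
    "ensembl-abinitio": 6,
    "geneid": 7,
    "SGP": 8,
}


class Member(NamedTuple):
    """One sequence in a cluster (hypothetical record for illustration)."""
    seq_id: str
    database: str
    orf_aa: int   # ORF length in amino acids


def initial_anchor(members: list[Member]) -> Member:
    # Prefer the most curated database; break ties by longest ORF (assumption).
    return min(members, key=lambda m: (PREFERENCE_RANK[m.database], -m.orf_aa))


def representative(members: list[Member]) -> Member:
    # Take an MGI or RefSeq sequence if the cluster has one.
    curated = [m for m in members if m.database in ("mgi", "refseq")]
    if curated:
        return min(curated, key=lambda m: (PREFERENCE_RANK[m.database], -m.orf_aa))
    # Otherwise take the sequence with the longest ORF.
    return max(members, key=lambda m: m.orf_aa)
```

This sketch covers only the starting anchor; the later replacement of the anchor by a sequence with a larger ORF during iteration is not modeled.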