MouseFunc I

A critical assessment of quantitative gene function assignment from genomic datasets in M. musculus

(Jul 17 - Oct 13, 2006)

Organizers

Introduction

Why predict function? Determination of gene function is the central goal in the field of functional genomics. Genomics experiments have proven valuable in suggesting hypotheses that can be tested by follow-up experimentation. Computational predictions of gene function can serve as a statistically sound form of triage, focusing experimental resources on the hypotheses (predictions) that are more likely to be true. Among strong predictions, the most interesting can be chosen by individual investigators with intuition and specialized knowledge.

The role of prediction in gene function databases. Model organism databases in the Gene Ontology (GO) Consortium (e.g., SGD, FlyBase, and MGI) track the types of evidence that support function annotations. A substantial fraction of annotated genes are annotated solely by virtue of predictions (ISS or IEA evidence codes). In 2004, this fraction was 5% in S. cerevisiae, 50% in D. melanogaster, and 72% in M. musculus. Although predictions have a substantial role in gene function databases, they are not typically assigned measures of confidence.

The need for measures of confidence in prediction. Biologists browsing a gene function annotation database react negatively to annotations that are presented as fact but are not confidently known to be true, since this is misleading. When a prediction of gene function is placed alongside conclusions derived from direct experimentation, it should either be high confidence or be labeled clearly as a prediction (or both). To address this issue, model organism databases have developed evidence codes to label annotations based on prediction or uncritical transfer from other sources. Unfortunately, this provides no guidance as to which predictions are confident and which are weak, and the user is prone to “throw the baby out with the bathwater” by ignoring all predictions. Furthermore, the tolerance of researchers for false positives depends on a complex tradeoff between the importance of the biological question and the cost of follow-up experiments. Thus, to achieve their full potential value, predictions should be provided with interpretable levels of confidence, e.g., an estimated probability that the prediction is correct.

Why compare? Assessment of the performance of different methods on a standardized data set, according to standardized performance criteria, is the only way to draw meaningful conclusions about the strengths and weaknesses of the algorithms employed. Just as the fields of protein structure prediction, machine learning, and natural language processing have benefited from competitions, we hope that an organized comparison will motivate bioinformatics groups to think deeply about an important problem, and that it will provide a focus around which ideas can be exchanged between diverse groups. Furthermore, we expect that the simple act of sharing prediction results in a common format will make it possible for these results to be compiled and shared with experimental biologists in a transparent and useful way, perhaps via model organism databases. There will be a period of comment and discussion on the data set; on the procedures used to share data, methods, and results; and on the measures used to evaluate gene function predictions.

Process and Timeline

Step 1: Organization.

A period of invitation, comment and discussion (via MouseFunc@googlegroups.com), and commitment to participate (ending July 14).

Step 2: Release of the training data (July 17).

Briefly, the training data consists of a set of per-gene data matrices.*

* For simplicity, properties of proteins encoded by a given gene will be mapped to that gene ID.

** Anonymization precludes sequence-based prediction methods beyond presence/absence of protein sequence patterns. (Participants agree on their honor to not attempt decoding of the IDs.)

A more complete description of the training data can be found here

Step 3. Submit methods and predictions (September 29; extended to Friday, Oct 13, 2006!)

3a: Code sharing. Each participant posts all code used to generate and apply predictive models, together with relevant parameters and the resulting model. It should be possible to generate the final score matrix from input data by running a single script/executable. If the code uses random initializations then the random seeds should be included.
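
For instance, the submitted code might expose a single entry point along the following lines (a minimal sketch in Python with a placeholder random "model"; the file names, gene IDs, and GO term IDs are illustrative, not prescribed):

    # run_all.py -- regenerate the final score matrix from the input data in one step.
    import numpy as np

    SEED = 20060717  # fixed random seed, reported with the submission
    rng = np.random.default_rng(SEED)

    # Placeholder predictor: random scores in [0, 1] for a few genes and GO terms,
    # standing in for a real model trained on the released data.
    genes = ["gene_0001", "gene_0002", "gene_0003"]
    go_terms = ["GO:0008150", "GO:0005575"]
    scores = rng.random((len(genes), len(go_terms)))

    # Write one tab-delimited line of scores per gene.
    with open("result.txt", "w") as out:
        for gene, row in zip(genes, scores):
            out.write(gene + "\t" + "\t".join(f"{s:.6f}" for s in row) + "\n")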

3b: Predictions. Each participant submits a matrix of scores (ranging from 0 to 1) for each gene and GO term to be predicted.

Step 4. Performance assessment (Oct 2 - Oct 16; extended to Oct 16 - Oct 30)

All predictions will be deanonymized, and performance will be assessed on both: a) the held-out collection of genes, and b) novel predictions for all genes.

4a. Predictions on held-out genes. A variety of performance measures will be applied: area under the ROC curve (AUC), precision at 1% recall (P01R), precision at 10% recall (P10R), precision at 50% recall (P50R), and precision at 80% recall (P80R). These measures will be applied to each GO term individually, and median performance values will be calculated for 12 categories of GO terms (with the indicated # of GO terms in each):

                    GO Branch
Specificity***    BP     CC     MF
3-10             952    151    475
11-30            435     97    142
31-100           239     48    111
101-300          100     30     35

*** Here, specificity is defined as the number of genes in the training set assigned to a particular GO term.

Median performance values will also be calculated for the GO terms in each row and column of the above table (i.e., for GO terms of a given specificity or in a given branch).
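
For concreteness, here is a minimal sketch (in Python; illustrative only, not the evaluation code used by the organizers) of how AUC and precision at a fixed recall level can be computed for a single GO term, given a vector of prediction scores and a binary annotation vector:

    import numpy as np

    def auc(scores, labels):
        # AUC via the Mann-Whitney rank-sum statistic (ties ignored for brevity).
        order = np.argsort(scores)
        ranks = np.empty(len(scores))
        ranks[order] = np.arange(1, len(scores) + 1)
        pos = labels.astype(bool)
        n_pos, n_neg = pos.sum(), (~pos).sum()
        return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    def precision_at_recall(scores, labels, level):
        # Precision at the first rank where recall reaches `level` (e.g., 0.1 for P10R).
        order = np.argsort(-scores)        # genes ranked by descending score
        hits = labels[order].astype(bool)
        tp = np.cumsum(hits)
        recall = tp / hits.sum()
        precision = tp / np.arange(1, len(scores) + 1)
        return precision[np.searchsorted(recall, level)]

    # Toy example: 6 genes, 2 of them annotated to the GO term.
    scores = np.array([0.9, 0.8, 0.3, 0.7, 0.2, 0.1])
    labels = np.array([1, 0, 0, 1, 0, 0])
    print(auc(scores, labels))                       # 0.875
    print(precision_at_recall(scores, labels, 0.5))  # 1.0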

4b. Novel predictions. Predictions for gene/GO term combinations not annotated in the training set will be judged on the basis of new GO term annotations in the ~8 months since data set assembly. The same measures described above will be applied.

Step 5. Publish the results (by Dec 1).

Each participant provides a description of their approach (ideally brief enough to also permit publication elsewhere) together with references. We will write a paper summarizing performance and submit it to Nature Biotechnology (interest previously expressed by editor G. Taroncher).

The Data Set

All files are tab-delimited. Gene IDs have been anonymized.
All matrices can be downloaded here in a tar.gz file (33 MB).






Participants

Below are the participants who successfully submitted their predictions. Congratulations!

Submitted

  1. Yanjun Qi1, Judith Klein-Seetharaman1,2 & Ziv Bar-Joseph1 (1Carnegie Mellon University, 2University of Pittsburgh)
  2. Sara Mostafavi, David Warde-Farley, Chris Grouios & Quaid Morris (University of Toronto)
  3. Guillaume Obozinski, Charles Grant, Gert Lanckriet, Jian Qiu, Michael Jordan1 & William Stafford Noble (University of Washington, 1University of California - Berkeley)
  4. Murat Tasan, Weidong Tian, Frank Gibbons & Fritz Roth (Harvard Medical School)
  5. Hyunju Lee, Minghua Deng, Ting Chen & Fengzhu Sun (University of Southern California)
  6. Yuanfang Guan, Chad L. Myers & Olga G. Troyanskaya (Princeton University)
  7. Michele Leone & Andrea Pagnani (Institute for Scientific Interchange, Turin, Italy)
  8. Trupti Joshi, Chao Zhang, Guan Ning Lin & Dong Xu (University of Missouri-Columbia)
  9. Wan Kyu Kim, Chase Krumpelman & Edward Marcotte (University of Texas, Austin)

Discussion Forum

Issues arising before and during the competition will be discussed on a Google Discussion Group.
NOTE: The discussion group is no longer active.


Software


Submission Instructions

Please follow the submission instructions below exactly.

The submission files should follow the filename scheme below:

For the score matrix:
    FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"result".txt.gz.zip
example: JP-FD-result.txt.gz.zip

For the code:
    FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"code".tar.zip
example: JP-FD-code.tar.zip

The score matrix file should contain the tab-delimited score values for each gene (one per line) and each GO term to be predicted. The IDs of the GO terms and genes should be exactly the same as in the files provided. The score matrix should contain a line for each gene ID in the file "GenesIDs_and_Summary.txt.gz". The score values are the output of the model and should be in the 0 to 1 range; the higher the score, the higher the probability of the gene having the corresponding function.
The score values should contain only digits and at most one decimal point ("."). Score values may also be given in scientific notation, e.g., 1.23456e-04. The file “sample_scoreMatrix.txt.gz” shows what the result file could look like. In addition, the Perl script “checkFormat_scoreMatrix.pl”, available here, can be used to verify the format of the score matrix before submission.
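
As an illustration of the expected layout, the following sketch (in Python; a hypothetical check, not a substitute for checkFormat_scoreMatrix.pl) reads a gzipped score matrix and validates each value, assuming a first column of gene IDs followed by one tab-delimited score per GO term and no header row:

    import gzip

    def check_score_matrix(path, expected_gene_ids, n_go_terms):
        # Illustrative format check; assumes gene ID in column 1 and no header row.
        seen = set()
        with gzip.open(path, "rt") as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                gene_id, values = fields[0], fields[1:]
                if len(values) != n_go_terms:
                    raise ValueError(f"{gene_id}: expected {n_go_terms} scores, got {len(values)}")
                for v in values:
                    score = float(v)  # accepts both "0.5" and "1.23456e-04"
                    if not 0.0 <= score <= 1.0:
                        raise ValueError(f"{gene_id}: score {v} outside [0, 1]")
                seen.add(gene_id)
        if seen != set(expected_gene_ids):
            raise ValueError("gene IDs do not match GenesIDs_and_Summary.txt.gz")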

Please include a README file (see sample here) in the code submission file with indications on how to compile (if necessary) and run the code, and on which systems the code has been tested. In particular, if you use standard programs or require libraries, please indicate where they can be obtained and which versions you used. Please ensure that all parameters used in running the code have been provided, as well as random seeds if any randomization has been used. If you have makefiles, please include them as well.

The submission deadline is September 29, 2006 (any time zone); extended to Friday, Oct 13, 2006!

Submissions should be made by uploading the files here

Only the last submission before the deadline will be evaluated and all other submissions will be discarded.

Methods section

A more complete description of the methods, suitable for inclusion as Supplementary Information in the resulting manuscript, should be submitted by Friday, October 20, 2006 (extended!). The description should include brief comparisons and references to prior work. You can also refer to “(unpublished results)” if you think you may publish this work separately outside of the competition summary paper. Note that there will be an opportunity to revise this section later, but you are encouraged to submit a draft while your memory of the methods is still fresh.

The submission file should follow the filename scheme below:

FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"methods".*
example: JP-FD-methods.*

* The file format can be .pdf, .txt.gz, or .doc.

Please upload your methods file here.



The unified set of predictions

To simplify subsequent analyses for ourselves and other investigators, we derived a single set of prediction scores from the set of submitted scores. We unified the independent submissions by adopting, for each evaluation category, the scores from the submission with the best precision at 20% recall (P20R) value for that category (evaluated using held-out genes). The combined predictions averaged 41% precision at 20% recall, with 26% of GO terms having a P20R value greater than 90%.
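
In outline, the unification procedure can be sketched as follows (in Python; the dictionaries of held-out P20R values and submitted score matrices are hypothetical stand-ins for the actual data structures used):

    def unify_submissions(p20r, score_matrices):
        # p20r:           {submission: {category: P20R on held-out genes}}
        # score_matrices: {submission: {category: matrix of scores}}
        # For each evaluation category, adopt the scores from the submission
        # with the best held-out P20R in that category.
        categories = next(iter(p20r.values())).keys()
        unified = {}
        for cat in categories:
            best = max(p20r, key=lambda sub: p20r[sub][cat])
            unified[cat] = score_matrices[best][cat]
        return unified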


Set of predictions from individual groups


GO annotations for the held-out set and prospective evaluation


Last updated: Wed 3 Sep, 2008