The RIKEN Hayashizaki group has been working on the mouse full-length cDNA encyclopedia project since 1995 (see the Genome Exploration Research Group web site). In Phase I of the project, we have focused on the collection and sequencing of more than one million mouse cDNAs. In Phase II, we have re-arrayed the non-redundant clones and produced full-length sequences for those clones. Functional annotation of the full-length mouse cDNAs, and deposition of their sequence data together with the annotation in the public databases, will contribute to the progress of science.
In order to assign functional annotation to uncharacterized cDNAs, we have been developing a semi-automatic annotation tool which refers to the results from the following:
1. homology search, including searches against ortholog databases (human, rat, Drosophila, C. elegans, yeast,
2. search for well-known protein motifs using Pfam and PROSITE
3. other data, such as expression data and protein-protein interaction data, as applicable.
We use the term "functional annotation of genes" to refer to the assignment of attributes to genes. The attributes include Gene Ontology terms, loci on chromosomes, related disorders and so on. Gene Ontology terms are drawn from the controlled vocabularies authorized by the Gene Ontology Consortium and are classified into three categories:
- molecular function,
- biological process and
- cellular component
However, our semi-automatic methods, like the similar methods used for other databases such as UniGene, have limits. For example, curation by biologists is always necessary when annotating genes for which BLAST searches return only low-similarity matches, as judged by E-value.
Based on these issues, we believe we should discuss what is necessary for the functional annotation of the mouse full-length cDNAs. Points that need to be discussed include what biologists need to curate and the rules of functional annotation. We then want to annotate the mouse full-length cDNAs as adequately as possible, together with experts in bioinformatics, genome science, biology and other fields, during the proposed meeting.
Therefore, we held a meeting to annotate our mouse full-length cDNAs, named the FANTOM (Functional ANnoTation Of Mouse) meeting.
Prior to the meeting, we developed a web-based system named FANTOM+ to facilitate the functional annotation of clones by human curators. FANTOM+ includes not only the gene function itself but also many other data describing functional information. FANTOM+ allowed users to view pre-computed sequence similarity and motif search results, to launch additional searches, and to transfer the annotation from any of these to the FANTOM database.
A variety of tools, including BLASTN, BLASTX (http://www.ncbi.nlm.nih.gov/BLAST/), FASTA/FASTY (ftp://ftp.virginia.edu/pub/fasta/), DECODER, EST-WISE (http://www.sanger.ac.uk/Software/Wise2/), and HMMER (http://hmmer.wustl.edu/), were used to search a large number of databases including NCBI-nr, Locus Link, SwissProt, SwissProt TrEMBL, TIGR nraa, PFAM, TIGR-FAM, UniGene, the TIGR Gene Indices, the UTR db and UTR site, and a number of species-specific databases. Additional analyses were performed using the bioSCOUT® program from LION Bioscience. Protein domain analyses were conducted by EBI using InterPro.
The Functional Annotation Of Mouse (FANTOM) Meeting was held at the RIKEN Institute in Tsukuba City, Japan from August 28 to September 8, 2000. The main purpose of the meeting was to functionally annotate 21,076 fully sequenced mouse cDNA clones prepared from full-length enriched libraries at the RIKEN Institute, as part of the Mouse Encyclopedia Project. An international group of researchers with a wide range of scientific backgrounds participated in the meeting and contributed to the annotation, which focused on assigning putative function to the RIKEN clones using various computational procedures, including sequence comparison, domain analysis and automated mapping to Gene Ontology terms.
During the meeting, strategies for annotating the sequences of the 21,076 cDNAs were discussed by bioinformaticians and biologists.
There was significant redundancy in the cDNA set. Duplication may have resulted from a number of factors, including mistakes made when samples were regridded, internal initiation of reverse transcription, incomplete or variable splicing and differences in polyadenylation site usage. To cluster redundant clones, we compared all sequences pairwise using FLAST, a sequence comparison program based on DDS (PubMed), grouped them on the basis of sequence using CAP3 (http://genome.cs.mtu.edu/cap/cap3.html), aligned them using CLUSTALW (ftp://ftp.ebi.ac.uk/pub/software/unix/clustalw/), and visually inspected the alignments.
This placed 8,207 clones into 2,957 clusters, reducing the size of the cDNA clone set to 15,826 unique genes and the MGI-confirmed set (see below) to 2,921 unique genes. Further analysis of RIKEN clones in the MGI-confirmed set revealed some instances where non-overlapping clones could be added to existing clusters or grouped together based on curatorial association with the same MGI gene. Therefore, the actual number of genes in the MGI set was reduced from 2,921 to 2,390, and the total number of genes represented by the whole RIKEN set was reduced to 15,295.
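The counts in the paragraph above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the function and variable names are illustrative and are not part of any FANTOM software.

```python
# Figures quoted in the text.
TOTAL_CLONES = 21_076      # fully sequenced clones annotated at the meeting
CLUSTERED_CLONES = 8_207   # clones that were placed into a cluster
CLUSTERS = 2_957           # number of clusters formed


def unique_genes(total, clustered, clusters):
    """Each cluster collapses its member clones into a single gene."""
    return total - clustered + clusters


initial_estimate = unique_genes(TOTAL_CLONES, CLUSTERED_CLONES, CLUSTERS)

# Curatorial merging within the MGI-confirmed set: 2,921 -> 2,390 genes.
MGI_BEFORE, MGI_AFTER = 2_921, 2_390
final_estimate = initial_estimate - (MGI_BEFORE - MGI_AFTER)

print(initial_estimate)  # 15826
print(final_estimate)    # 15295
```

Both results agree with the 15,826 and 15,295 genes reported above.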
For novel genes represented by RIKEN clusters, nomenclature will be taken from the Clone Identifiers of the representative clones for each cluster.
Prior to the FANTOM meeting, a set of RIKEN clones with significant similarity to mouse genes represented in the Mouse Genome Database was selected for annotation by FANTOM participants from the Mouse Genome Informatics (MGI) group of The Jackson Laboratory, to assure proper gene nomenclature assignments. The RIKEN sequences were compared to a reference data set of mouse sequences, and a similarity threshold (E-value < e-50) was used to designate clones for this MGI clone set. RIKEN clones found by human curation to be identical to mouse genes in MGI constituted the MGI-confirmed clone set.
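The threshold step described above amounts to a simple filter over the best hit for each clone. The following is a minimal sketch, assuming that "e-50" means 1e-50 and that the best E-value per clone against the mouse reference set is already available as a dictionary. The clone identifiers and E-values are made up for illustration, and the function name is hypothetical.

```python
MGI_EVALUE_THRESHOLD = 1e-50  # assumption: "E-value < e-50" read as 1e-50


def select_mgi_candidates(best_evalues):
    """Return clone IDs whose best E-value is below the threshold.

    best_evalues maps a clone ID to its best E-value against the
    mouse reference data set (hypothetical input format).
    """
    return sorted(
        clone_id
        for clone_id, evalue in best_evalues.items()
        if evalue < MGI_EVALUE_THRESHOLD
    )


# Illustrative data only; these are not real RIKEN clones.
example_hits = {
    "clone_A": 1e-120,
    "clone_B": 1e-50,   # equal to the threshold, so not selected
    "clone_C": 3e-7,
    "clone_D": 0.0,
}
print(select_mgi_candidates(example_hits))  # ['clone_A', 'clone_D']
```

Clones passing this filter would only be candidates; as the text notes, membership in the MGI-confirmed set was decided by human curation.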
The aim of the annotation in the meeting was to assign each RIKEN clone a RIKEN definition (riken_def) to indicate its most likely function and/or status on the basis of similarity to known genes.
A supplementary RIKEN definition line (riken_def_suppl) was available in the interface for additional pertinent annotation. Annotation of RIKEN clones with significant similarity to known sequences was guided by the gene/gene product descriptors of the reference sequences to which the RIKEN clones were most similar. In general, the riken_def was derived from the gene descriptor of the reference sequence that had the highest similarity to the RIKEN clone sequence. When the RIKEN clone was highly similar to several genes, an annotation hierarchy was used to choose the riken_def, based on the species of origin and descriptor content for the candidate reference sequences.
Priority was given to reference sequence descriptors from which some functional information could be inferred for the RIKEN clones, even if sequences with less informative descriptors were more similar to the clones. Annotations from highly curated databases (MGI and SwissProt) were preferred and provided convenient entry points into the Gene Ontology vocabularies. Informative descriptors from mouse genes identical to RIKEN clones were the first choice for annotation. Official gene nomenclature was used preferentially for RIKEN clones found to be identical to mouse genes in the Mouse Genome Informatics (MGI) databases (the "MGI-confirmed" set). For RIKEN clones identical to mouse genes not represented in MGI, or with non-identical similarity to known genes, riken_defs were derived from informative gene descriptors according to the following species priority: identical mouse > non-identical mouse > non-mouse mammal > non-mammal. Controlled vocabulary prefix terms "similar to", "homolog to" or "related to" were used in the riken_def line to indicate that a gene descriptor was derived from non-identical mouse, non-mouse mammal, or non-mammal sources, respectively.
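The species priority and prefix rules above can be sketched as a small selection function. This is an illustration under stated assumptions, not the procedure used at the meeting: the `Hit` record, the `informative` flag and the use of E-value as a tie-breaker within a category are all hypothetical, and the special handling of official nomenclature for the MGI-confirmed set is not modelled.

```python
from dataclasses import dataclass

# Source categories in decreasing priority, with the prefix given in the text.
PRIORITY = ["identical mouse", "non-identical mouse", "non-mouse mammal", "non-mammal"]
PREFIX = {
    "identical mouse": "",
    "non-identical mouse": "similar to ",
    "non-mouse mammal": "homolog to ",
    "non-mammal": "related to ",
}


@dataclass
class Hit:
    descriptor: str    # gene descriptor of the reference sequence
    source: str        # one of PRIORITY
    informative: bool  # hypothetical flag: does the descriptor imply a function?
    evalue: float


def choose_riken_def(hits):
    """Pick a riken_def from a list of highly similar hits, or None."""
    candidates = [h for h in hits if h.informative]
    if not candidates:
        return None
    # Species priority first; E-value as tie-breaker is an assumption.
    best = min(candidates, key=lambda h: (PRIORITY.index(h.source), h.evalue))
    return PREFIX[best.source] + best.descriptor


# Made-up descriptors for illustration.
hits = [
    Hit("hypothetical protein", "identical mouse", False, 1e-200),
    Hit("example kinase", "non-mouse mammal", True, 1e-90),
    Hit("example kinase-like protein", "non-mammal", True, 1e-150),
]
print(choose_riken_def(hits))  # homolog to example kinase
```

Here the uninformative identical-mouse descriptor is passed over even though it is the most similar hit, and the mammalian descriptor outranks the non-mammalian one despite its weaker E-value, mirroring the priority described in the text.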
RIKEN clones with no significant sequence similarity to known genes were named based on coding potential, protein motif signature and representation in mouse, human or rat EST databases. RIKEN clones with no significant similarity to known sequences, but with predicted protein motifs found in Pfam and/or InterPro were named "<motif name> containing protein". Clones with no known sequence similarity or domain hits, but with coding potential equal to or greater than 100 amino acids and EST representation were named "hypothetical protein". Clones belonging to none of the above groups, but with matches to ESTs were referred to as "unclassifiable transcript". Clones with no EST matches were called "unclassifiable".
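The naming rules for clones without significant similarity to known genes form a short decision cascade. The sketch below encodes them in the order given in the text; the function name and parameter names are illustrative, and the inputs (motif name, predicted coding length, EST match) are assumed to have been computed elsewhere.

```python
def name_uncharacterized_clone(motif_name, coding_length_aa, has_est_match):
    """Name a clone that has no significant similarity to known genes.

    motif_name: name of a predicted Pfam/InterPro motif, or None
    coding_length_aa: predicted coding potential in amino acids
    has_est_match: True if the clone matches mouse, human or rat ESTs
    """
    if motif_name:
        return f"{motif_name} containing protein"
    if coding_length_aa >= 100 and has_est_match:
        return "hypothetical protein"
    if has_est_match:
        return "unclassifiable transcript"
    return "unclassifiable"


# Illustrative inputs; "zinc finger" is just an example motif name.
print(name_uncharacterized_clone("zinc finger", 50, False))
print(name_uncharacterized_clone(None, 150, True))
print(name_uncharacterized_clone(None, 99, True))
print(name_uncharacterized_clone(None, 300, False))
```

Note that under this reading a clone with long coding potential but no EST match falls through to "unclassifiable", since the text requires both conditions for "hypothetical protein".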
New mouse genes discovered in the RIKEN clone set will be assigned official nomenclature in MGI that follows a defined syntax: Gene Symbol = <Riken Clone Identifier> "Rik", Gene Name = "Riken cDNA" <Riken Clone Identifier> "gene" (e.g. 2610307C23Rik, Riken cDNA 2610307C23 gene). Information about RIKEN clones and genes is available through the Mouse Genome Informatics web site.
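The nomenclature syntax above is a simple string template. As a minimal sketch (the function name is hypothetical), it can be applied to a clone identifier as follows, using the example identifier from the text:

```python
def mgi_nomenclature(clone_id):
    """Build the gene symbol and gene name for a new RIKEN gene."""
    symbol = f"{clone_id}Rik"
    name = f"Riken cDNA {clone_id} gene"
    return symbol, name


symbol, name = mgi_nomenclature("2610307C23")
print(symbol)  # 2610307C23Rik
print(name)    # Riken cDNA 2610307C23 gene
```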