GOstat by Tim Beißbarth

Find statistically overrepresented GO terms within a group of genes


The Gene Ontology database (GO: http://www.geneontology.org) provides a useful tool to annotate and analyze the function of large numbers of genes. Modern experimental techniques, such as DNA microarrays, often result in long lists of genes. To learn about the biology behind this kind of data, it is desirable to find functional annotations or Gene Ontology groups that are highly represented in the data. This program (GOstat) helps in the analysis of such lists: it provides statistics about the GO terms contained in the data and sorts the GO annotations so that the most representative GO terms come first.

Go to search form

Search Form

Input: The program requires a list of gene identifiers that specify the group of genes of interest. Depending on the GO gene-associations database used, the type of gene identifier may differ (e.g. MGI numbers for mouse, Swissprot accession numbers for human, etc.). The program searches several synonyms for most organisms; for example, links for Genbank and EST accession numbers are usually provided based on the Unigene clustering. For the Affymetrix arrays available in the common list, Affymetrix identifiers may be used. A second list of gene identifiers can also be passed to the program. This is useful if the data in the first list comes from a limited number of observations, e.g. a measurement on a small microarray that does not contain all possibly annotated genes. In this case the second list is used as a reference to search for GO terms that are significantly more represented in the first list than in the second. If the second list is left blank, then the complete set of annotated genes in the GO gene-associations database is used as the reference. Alternatively, a predefined set of genes from a set of commonly used microarrays can be selected as the second input.
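The choice of reference list described above can be sketched in a few lines of Python. This is only an illustration, not GOstat's actual code: the annotation table, the gene names, and the helper names choose_reference and count_terms are hypothetical, and dropping unannotated identifiers from the second list is an assumption.

```python
from collections import Counter

# Hypothetical annotation table: gene identifier -> set of GO term IDs.
# The gene names are placeholders; the assignments are for illustration only.
ANNOTATIONS = {
    "geneA": {"GO:0006915", "GO:0008219"},
    "geneB": {"GO:0006915"},
    "geneC": {"GO:0007049"},
    "geneD": {"GO:0007049", "GO:0008219"},
    "geneE": {"GO:0006915"},
}

def choose_reference(second_list, annotations):
    """Use the second list if one is given, otherwise every annotated gene."""
    if second_list:
        # Assumption: identifiers without any annotation are ignored.
        return {gene for gene in second_list if gene in annotations}
    return set(annotations)

def count_terms(genes, annotations):
    """Count, for each GO term, how many of the given genes carry it."""
    counts = Counter()
    for gene in set(genes):  # each gene is counted once
        counts.update(annotations.get(gene, ()))
    return counts

group = ["geneA", "geneB"]
reference = choose_reference([], ANNOTATIONS)  # blank second list
group_counts = count_terms(group, ANNOTATIONS)
reference_counts = count_terms(reference, ANNOTATIONS)
```

With a blank second list the reference falls back to all five annotated genes; passing a non-empty list restricts it to the annotated genes in that list.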

Algorithm: For all the genes analyzed, the program determines all annotated GO terms and all GO terms that are associated with these (i.e. lower in the hierarchy). It then counts the number of appearances of each GO term for the genes inside the group and for the reference genes. Fisher's Exact Test (or alternatively a Chi-Square test if the counts are high) is performed to judge whether the observed difference is significant or not. A multiple testing correction is applied to get a more realistic idea of the p-value: the Benjamini and Hochberg correction controls the "false discovery rate" but assumes independence; the Benjamini and Yekutieli correction drops the assumption of independence but is more conservative; the Holm method controls the "family wise error rate" and is the most conservative. This results in a p-value for each GO term, expressing how likely it is that the observed counts could have been due to chance. Frequently, the most significant GO terms all represent the same subset of genes, as each gene may have several similar GO annotations. To make the resulting group-specific GO terms more interpretable, the results can be clustered. The clustering groups those GO terms that appear in similar subsets of genes within the list.
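As a rough illustration of the statistics described above, the following Python sketch computes a two-sided Fisher's Exact Test for one GO term and applies the Benjamini and Hochberg correction to a list of p-values. It is a minimal sketch under stated assumptions, not GOstat's implementation: the function names are hypothetical, the group and the reference are treated as two separate samples, and the Chi-Square alternative and the other two corrections are not shown.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    a: group genes with the term,      b: group genes without it
    c: reference genes with the term,  d: reference genes without it
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, row1)

    def prob(x):
        # Hypergeometric probability of x term-carrying genes in the group.
        return comb(col1, x) * comb(total - col1, row1 - x) / denom

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum all tables that are no more probable than the observed one.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= p_observed * (1 + 1e-9))
    return min(1.0, p)

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In use, one table is built per GO term from the counts in the group and in the reference, and the resulting p-values are passed together to benjamini_hochberg so that the correction accounts for the number of terms tested.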

Output: The program produces a list of p-values that state how specific the associated GO terms (Splits) are for the list of genes provided. The output list of GO terms is sorted by p-value and can be limited by the number of terms as well as by a p-value cutoff. P-values of GO terms that are overrepresented in the dataset are shown in green; p-values of underrepresented GO terms are shown in red. It is possible to display only the over- or underrepresented terms. GO terms corresponding to similar subsets of genes can be grouped. Further, a list of the associations of the annotated GO terms (i.e. the highest level in the GO hierarchy) to the unique genes in the input list is printed. The GO terms in the output are linked to a visualization tool for the GO hierarchy (AmiGO). The output can be downloaded as tabular text rather than as an HTML file.
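The sorting and filtering options described above might look roughly like the following Python sketch. The record layout, the function names, and the rule for deciding over- versus underrepresentation (comparing the term's frequency in the two lists) are assumptions made for illustration, and the numbers are made-up toy values rather than real results.

```python
def classify(row):
    """Label a term 'over' or 'under' by comparing its frequency in both lists."""
    group_fraction = row["group_count"] / row["group_size"]
    reference_fraction = row["ref_count"] / row["ref_size"]
    # Assumption: equal frequencies are labeled 'under'.
    return "over" if group_fraction > reference_fraction else "under"

def select_terms(rows, max_terms=None, p_cutoff=None, direction="both"):
    """Filter by p-value cutoff and direction, sort by p-value, limit the count."""
    kept = [r for r in rows if p_cutoff is None or r["p_value"] <= p_cutoff]
    if direction != "both":
        kept = [r for r in kept if classify(r) == direction]
    kept.sort(key=lambda r: r["p_value"])
    return kept[:max_terms]  # a slice up to None keeps every row

def to_tsv(rows):
    """Render the selected rows as tab-separated text."""
    lines = ["GO term\tp-value\tdirection"]
    lines += [f'{r["term"]}\t{r["p_value"]:.3g}\t{classify(r)}' for r in rows]
    return "\n".join(lines)

# Toy records with invented counts and p-values, for illustration only.
rows = [
    {"term": "GO:0007049", "p_value": 0.2, "group_count": 1,
     "group_size": 10, "ref_count": 50, "ref_size": 1000},
    {"term": "GO:0006915", "p_value": 0.001, "group_count": 6,
     "group_size": 10, "ref_count": 40, "ref_size": 1000},
    {"term": "GO:0008219", "p_value": 0.03, "group_count": 0,
     "group_size": 10, "ref_count": 300, "ref_size": 1000},
]

significant = select_terms(rows, p_cutoff=0.05)
table = to_tsv(significant)
```

Here significant holds the two terms that pass the cutoff, most significant first, and table is the corresponding tab-separated text.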


In case of problems contact beissbarth@wehi.edu.au.