GOstat2 by Tim Beißbarth
beissbarth@wehi.edu.au
To submit Group IDs either upload a text file or paste into the text area:
File:
Text:


You must submit a set of IDs to check against (unlike in normal GOstat, this set does not have to be a superset and will be considered an independent sample):
File:
Text:
Available GO gene-association databases &
commonly used gene collections

Details


Minimal length of considered GO paths:
e.g. "biological_process"=1, "biological_process%behavior"=2


Subset of GO hierarchy:
Limit search to subset of GO hierarchy that contains a keyword, e.g. "biological_process", "molecular_function", "cellular_component"


Maximal p-value in GO output list:


Maximum number of GOs/groups to display:


Show Over-/Underrepresented GOs:


Cluster GOs:
-1 => do not cluster.
Merge GOs if their associated gene lists are inclusions of one another or differ by fewer than # genes


Display Format:


Correct for multiple testing:

Find statistically overrepresented GO terms within a group of genes

The Gene-Ontology database (GO: http://www.geneontology.org) provides a useful tool to annotate and analyze the function of large numbers of genes. Modern experimental techniques, such as DNA microarrays, often result in long lists of genes. To learn about the biology in this kind of data, it is desirable to find functional annotations or Gene-Ontology groups that are highly represented in the data. This program (GOstat) helps in the analysis of such lists: it provides statistics about the GO terms contained in the data and sorts the GO annotations so that the most representative GO terms come first.

Input: The program requires a list of gene identifiers that specifies the group of genes of interest. The type of gene identifier depends on the GO gene-associations database used (e.g. MGI numbers for mouse, Swissprot accession numbers for human, RGD for rat, gene symbols, etc.). The program searches several synonyms for most organisms; for example, links for Genbank and EST accession numbers are usually provided based on the Unigene clustering (i.e. only for human, mouse, rat and Drosophila). A second list of gene identifiers can also be passed to the program. This is useful if the data in the first list comes from a limited number of observations, e.g. a measurement on a small microarray that does not contain all possibly annotated genes. In this case the second list is used as a reference to search for GO terms that are significantly more represented in the first list than in the second. If the second list is left blank, then the complete set of annotated genes in the GO gene-associations database is used as the reference. Alternatively, a predefined set of genes from a collection of commonly used microarrays can be selected as the second input.

Algorithm: For all the genes analyzed, the program determines all annotated GO terms and all GO terms that are associated with these (i.e. that lie in their path). It then counts the number of appearances of each GO term for the genes inside the group and for the reference genes. Fisher's Exact Test is performed to judge whether the observed difference is significant or not. This yields a p-value for each GO category, indicating how likely it is that the observed counts could have been due to chance. Frequently, the most significant GO terms all represent the same subset of genes, as each gene may have several similar GO annotations. To make the resulting group-specific GO terms more interpretable, the results can be clustered. The clustering groups those GOs that appear in similar subsets of genes within the list.
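
The counting and testing steps can be illustrated with a minimal Python sketch. It is not the GOstat source code: the function names, the toy gene identifiers and the toy hierarchy are made up for illustration, and the layout of the 2x2 table (group versus an independent reference sample, with versus without the term) is an assumption about how the counts are compared.

```python
from math import comb


def expand_terms(terms, parents):
    """Return the given GO terms plus all terms on their paths to the root."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def count_terms(genes, annotations, parents):
    """Count, for each GO term, how many of the genes carry it (directly or via a path)."""
    counts = {}
    for gene in genes:
        for term in expand_terms(annotations.get(gene, ()), parents):
            counts[term] = counts.get(term, 0) + 1
    return counts


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a/b = group genes with/without the term,
    c/d = reference genes with/without the term.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x annotated genes in the group.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Toy data (hypothetical identifiers, not taken from any real database).
parents = {"behavior": ["biological_process"], "metabolism": ["biological_process"]}
annotations = {
    "g1": {"behavior"}, "g2": {"behavior"}, "g3": {"behavior"}, "g4": {"metabolism"},
    "r1": {"behavior"}, "r2": {"metabolism"}, "r3": {"metabolism"}, "r4": {"metabolism"},
}
group = ["g1", "g2", "g3", "g4"]
reference = ["r1", "r2", "r3", "r4"]

group_counts = count_terms(group, annotations, parents)
ref_counts = count_terms(reference, annotations, parents)

p_values = {}
for term, k in group_counts.items():
    K = ref_counts.get(term, 0)
    p_values[term] = fisher_exact_two_sided(k, len(group) - k, K, len(reference) - K)
```

With only four genes per list, no term in this toy example reaches a small p-value; real input lists are much longer. Note that the root term is counted for every gene, which is why a minimal path length and a hierarchy subset can be useful filters.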

Output: The program produces a list of p-values that state how specific the associated GO terms (Splits) are for the list of genes provided. The output list of GO terms is sorted by p-value and can be limited by the number of terms as well as by a p-value cutoff. P-values of GO terms that are overrepresented in the dataset are typeset in green; p-values of underrepresented GO terms are colored red. It is possible to display only the over- or underrepresented terms. GOs corresponding to similar subsets of genes can be grouped. Further, a list of the associations of the annotated GO terms (i.e. the highest level in the GO hierarchy) to the unique genes in the input list is printed. The GO terms in the output are linked to a visualization tool for the GO hierarchy (AmiGO). It is possible to download the output as tabular text rather than as an HTML file. It is also possible to store the GOstat result file and use http://gostat.wehi.edu.au/cgi-bin/goStatDisplay.pl later to view the output.
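
The sorting and limiting of the output list can be sketched in the same way. This is again only an illustration under stated assumptions: the function name `select_terms`, its default values and the example p-values are hypothetical, and deciding over- versus underrepresentation by comparing the fraction of annotated genes in the group with that in the reference is an assumption, not a documented detail of GOstat.

```python
def select_terms(results, max_p=0.1, max_terms=30, show="both"):
    """Sort GO results by p-value and apply the display limits.

    results: iterable of (term, p_value, group_fraction, reference_fraction).
    show: "over", "under", or "both".
    """
    rows = []
    for term, p, group_frac, ref_frac in results:
        # Assumption: a term counts as overrepresented when it is more
        # frequent in the group than in the reference, else underrepresented.
        direction = "over" if group_frac > ref_frac else "under"
        if p <= max_p and show in ("both", direction):
            rows.append((term, p, direction))
    rows.sort(key=lambda row: row[1])  # most significant terms first
    return rows[:max_terms]


# Made-up example values, for illustration only.
results = [
    ("metabolism", 0.04, 0.25, 0.75),
    ("biological_process", 1.0, 1.0, 1.0),
    ("behavior", 0.02, 0.75, 0.25),
]

both = select_terms(results, max_p=0.05)
only_over = select_terms(results, max_p=0.05, show="over")
```

Here `both` contains the two terms below the cutoff, ordered by p-value, and `only_over` keeps just the term that is more frequent in the group than in the reference.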