To submit Group IDs either upload a text file or paste into the text area:
File:
Text:
You must submit a set of IDs to check against (unlike in normal GOstat, this set does not have to be a superset and will be considered an independent sample):
File:
Text:
Available GO gene-association databases & commonly used gene collections
Details
Minimal length of considered GO paths:
e.g. "biological_process"=1, "biological_process%behavior"=2
Subset of GO hierarchy:
Limit search to subset of GO hierarchy that contains a keyword,
e.g. "biological_process", "molecular_function", "cellular_component"
Maximal p-value in GO output list:
Maximum number of GOs/groups to display:
Show Over-/Underrepresented GOs:
Cluster GOs:
-1 => do not cluster. Otherwise, merge GOs if their associated gene lists are subsets of one another or differ by fewer than # genes
Display Format:
Correct for multiple testing:
Find statistically overrepresented GO terms within a group of genes
The Gene-Ontology database (GO: http://www.geneontology.org)
provides a useful tool to annotate and analyze the function of large numbers of genes. Modern experimental
techniques, such as DNA microarrays, often result in long lists of genes. To learn about the biology in
this kind of data, it is desirable to find functional annotations or Gene-Ontology groups that are
highly represented in the data. This program (GOstat) helps in the analysis of such lists: it
provides statistics about the GO terms contained in the data and sorts the GO annotations, giving the most
representative GO terms first.
Input: The program requires a list of gene identifiers that specify the group of genes of interest.
Depending on the GO gene-associations database used, the type of gene identifiers may differ (e.g. MGI
numbers for mouse, Swissprot accession numbers for human, RGD for rat, gene symbols, etc.). The program searches
several synonyms for most organisms; e.g. links for Genbank and EST accession numbers are usually provided based
on the Unigene clustering (i.e. only for human, mouse, rat and
drosophila). Further, a second list
of gene identifiers can be passed to the program. This is useful if the data in the first list
comes from a limited number of observations, e.g. a measurement on a small microarray that does not
contain all possibly annotated genes. In this case the second list is used as a reference to search for
GO terms that are significantly more represented in the first list than in the second. If the
second list is left blank, then the complete set of annotated genes in the GO gene-associations
database is used as a reference. Alternatively, it is also possible to select a predefined set of genes
as the second input from a set of commonly used microarrays.
Algorithm: The program determines all annotated GO terms and all GO terms that are
associated (i.e. in the path) with these for all the genes analyzed. It then counts
the number of appearances of each GO term for the genes inside the group and for the reference genes.
Fisher's Exact Test is performed to judge whether the observed difference is significant or not. This
yields, for each GO category, a p-value: the probability that the observed counts could have arisen by chance.
Frequently, the most significant GO terms all represent the same subset of genes, as the genes may
each have several GO annotations that are similar. To make the resulting group-specific GO terms more
interpretable, the results can be clustered. Clustering groups those GOs that appear in similar
subsets of genes within the list.
Output: The program produces a list of p-values that indicate how specific the associated GO terms
(Splits) are for the list of genes provided. The output list of GO terms is sorted by p-value
and can be limited by the number of terms as well as by a p-value cutoff. P-values of GO terms that
are overrepresented in the dataset are shown in green; p-values of underrepresented GO terms are shown in
red. It is possible to display only the over- or underrepresented terms. GOs corresponding to similar
subsets of genes can be grouped. Further, a list of the associations of the annotated GO terms
(i.e. the highest level in the GO hierarchy) to the unique genes in the input list is printed. The GO terms
in the output are linked to a visualization tool for the GO hierarchy
(AmiGO). It is possible to download the output as tabular text rather than
as an HTML file. It is also possible to store the GOstat result file and use http://gostat.wehi.edu.au/cgi-bin/goStatDisplay.pl
later to view the output.