Find statistically overrepresented GO terms within a group of genes
Description:
The Gene-Ontology database (GO: http://www.geneontology.org)
provides a useful tool to annotate and analyze the function of large numbers of genes. Modern experimental
techniques, such as DNA microarrays, often result in long lists of genes. To learn about the biology in
this kind of data, it is desirable to find functional annotations or Gene-Ontology groups that are
highly represented in the data. This program (GOstat) helps in the analysis of such lists: it
provides statistics about the GO terms contained in the data and sorts the GO annotations, giving the most
representative GO terms first.

Input: The program requires a list of gene identifiers that specify the group of genes of
interest. Depending on the GO gene-associations database used, the type of gene identifier may differ
(e.g. MGI numbers for mouse, Swissprot accession numbers for human, etc.). The program searches several
synonyms for most organisms; e.g. links for Genbank and EST accession numbers are usually provided based on
the Unigene clustering. For the Affymetrix arrays
available in the common list, Affymetrix identifiers may be used. Further, a second list of gene identifiers
can be passed to the program. This is useful if the data in the first list comes from a limited number
of observations, e.g. a measurement on a small microarray that does not contain all possibly annotated
genes. In this case the second list is used as a reference to search for GO terms that are significantly
more represented in the first list than in the second. If the second list is left blank, then the complete
set of annotated genes in the GO gene-associations database is used as a reference. Alternatively, it is
possible to select a predefined set of genes as the second input from a set of commonly used
microarrays.
Algorithm: The program determines all annotated GO terms and all GO terms that are associated
(i.e. lower in the hierarchy) with these for all the genes analyzed. It then counts the number of
appearances of each GO term for the genes inside the group and for the reference genes. Fisher's Exact Test
(or alternatively a Chi-Square test if counts are high) is performed to judge whether the observed difference
is significant or not. Multiple testing correction is applied in order to get a more realistic idea of the
p-value. The Benjamini and Hochberg correction controls the "false discovery rate" but assumes independence;
the Benjamini and Yekutieli correction drops the assumption of independence but is more conservative; the
Holm method controls the "family-wise error rate" and is the most conservative. This results in a p-value
for each GO term, indicating how likely it is that the observed counts are due to chance. Frequently, the
most significant GO terms all represent the same subset of genes, as the genes may each have several GO
annotations that are similar. To make the resulting group-specific GO terms more interpretable, the results
can be clustered. Clustering groups those GO terms that appear in similar subsets of genes within the list.
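
The counting, testing and correction steps described above can be illustrated with a minimal Python sketch. This is not GOstat's actual code: the function names, the shape of the `annotations` mapping (gene identifier to a set of GO terms, with associated terms assumed to be already expanded) and the choice to treat the group and the reference as the two rows of a 2x2 table are all assumptions made for illustration. Only Fisher's Exact Test and the Benjamini and Hochberg correction are shown; the Chi-Square alternative, the other corrections and the clustering are omitted.

```python
from math import comb


def term_counts(genes, annotations):
    """Count how many of the given genes carry each GO term.

    `annotations` is a hypothetical dict: gene id -> set of GO terms.
    """
    counts = {}
    for gene in set(genes):
        for term in annotations.get(gene, ()):
            counts[term] = counts.get(term, 0) + 1
    return counts


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one.
    """
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(x):
        return comb(r1, x) * comb(r2, c1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so ties with the observed table are included.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def enrichment(group, reference, annotations):
    """Return (term, raw p, adjusted p) rows sorted by adjusted p-value."""
    n_group, n_ref = len(set(group)), len(set(reference))
    group_counts = term_counts(group, annotations)
    ref_counts = term_counts(reference, annotations)
    terms = sorted(set(group_counts) | set(ref_counts))
    raw = [
        fisher_exact_two_sided(
            group_counts.get(t, 0), n_group - group_counts.get(t, 0),
            ref_counts.get(t, 0), n_ref - ref_counts.get(t, 0),
        )
        for t in terms
    ]
    adjusted = benjamini_hochberg(raw)
    return sorted(zip(terms, raw, adjusted), key=lambda row: row[2])
```

Calling `enrichment(group, reference, annotations)` with two lists of gene identifiers yields one row per GO term. Exact enumeration is adequate for small counts; for large counts an approximation such as the Chi-Square test mentioned above would be cheaper.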
Output: The program produces a list of p-values that state how specific the associated GO terms
(Splits) are for the list of genes provided. The output list of GO terms is sorted by p-value
and can be limited by the number of terms as well as by a p-value cutoff. P-values of GO terms that
are overrepresented in the dataset are typeset in green; p-values of underrepresented GO terms are colored
red. It is possible to display only the over- or underrepresented terms. GO terms corresponding to similar
subsets of genes can be grouped. Further, a list of the associations of the annotated GO terms
(i.e. the highest level in the GO hierarchy) to the unique genes in the input list is printed. The GO terms
in the output are linked to a visualization tool for the GO hierarchy
(AmiGO). It is possible to download the output as tabular text rather than
as an HTML file.
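
The sorting, cutoff and over/under filtering of the output can be sketched as follows. This is a rough illustration rather than GOstat's implementation: the row layout, the function names `select_terms` and `to_tabular`, the default cutoff and the rule that a term counts as overrepresented when its fraction in the group exceeds its fraction in the reference are all assumptions.

```python
def select_terms(rows, p_cutoff=0.1, max_terms=None, direction="both"):
    """Filter and sort result rows.

    Each row is a hypothetical tuple:
    (term, p_value, count_in_group, group_size, count_in_reference, reference_size).
    `direction` is "over", "under" or "both".
    """
    selected = []
    for term, p, k, n, K, N in rows:
        # Assumed rule: overrepresented if the group fraction is strictly
        # larger than the reference fraction; everything else is "under".
        over = k / n > K / N
        if p > p_cutoff:
            continue
        if direction == "over" and not over:
            continue
        if direction == "under" and over:
            continue
        selected.append((term, p, "over" if over else "under"))
    selected.sort(key=lambda row: row[1])  # most significant first
    return selected if max_terms is None else selected[:max_terms]


def to_tabular(selected):
    """Render the selected rows as tab-separated text with a header line."""
    lines = ["term\tp_value\tdirection"]
    lines += [f"{term}\t{p:.3g}\t{d}" for term, p, d in selected]
    return "\n".join(lines)
```

For example, `to_tabular(select_terms(rows, p_cutoff=0.05, direction="over"))` would give a tab-separated table containing only the overrepresented terms below the cutoff.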

In case of problems contact
beissbarth@wehi.edu.au.