Finding Which Functional Groups are Enriched in Clusters or Sets
Given two collections of gene (experiment) sets, Genomica allows you to identify pairs of gene (experiment) sets that have a statistically significant overlap between their member genes (experiments). Such an analysis can be useful in many scenarios. For instance, a common method for analyzing expression data is clustering, which results in a partitioning of the set of genes into clusters of similarly expressed genes. You can use Genomica to identify the biological processes represented by the genes in each cluster, by creating a gene set collection from the clusters, and checking for enrichment of this collection against a collection of gene sets that represent biological processes, such as the Gene Ontology (GO) database of functional annotations.
The procedure is simple: given two collections of gene sets, every pair of gene sets is compared using a statistical test based on the hypergeometric distribution, and significant overlaps are reported. To find significant overlaps between two collections of gene (experiment) sets, follow these steps.
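As background before the steps, the pairwise comparison described above can be sketched in Python. This is a minimal illustration of a one-sided hypergeometric overlap test, not Genomica's actual implementation; the function names `hypergeom_pvalue` and `compare_collections`, the dictionary layout of the collections, and the example gene names are all assumptions made for this sketch.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, other_size, total_genes):
    """Probability of seeing at least `overlap` shared genes by chance.

    Models drawing `set_size` genes from `total_genes`, of which
    `other_size` belong to the other gene set (hypothetical helper).
    """
    upper = min(set_size, other_size)
    denominator = comb(total_genes, set_size)
    tail = sum(
        comb(other_size, k) * comb(total_genes - other_size, set_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / denominator


def compare_collections(collection_a, collection_b, total_genes):
    """Test every pair of sets, one from each collection.

    Each collection is assumed to be a dict: set name -> set of gene names.
    Returns a list of (name_a, name_b, overlap_size, p_value) tuples.
    """
    results = []
    for name_a, genes_a in collection_a.items():
        for name_b, genes_b in collection_b.items():
            overlap = len(genes_a & genes_b)
            p = hypergeom_pvalue(overlap, len(genes_a), len(genes_b), total_genes)
            results.append((name_a, name_b, overlap, p))
    return results


# Hypothetical example: one cluster compared against two annotation sets
# in a dataset of 10 genes.
clusters = {"cluster1": {"g1", "g2", "g3"}}
annotations = {"procA": {"g1", "g2", "g3"}, "procB": {"g8", "g9"}}
pairs = compare_collections(clusters, annotations, total_genes=10)
```

A small p-value means an overlap of that size would rarely arise if the two sets were drawn independently from the dataset.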
Step 1: Load gene (experiment) sets
The first step is to load the gene (experiment) sets that you want to compare. Details on how to load sets and on the Genomica file formats for gene and experiment sets are given here. In this tutorial we assume that you load the sample gene sets. Other gene sets are available here. Note that you can also create your own gene sets, and that they can come from any source or organism.
Step 2: Find enriched gene sets
Next, open the analysis dialog box from the Analyze -> Gene Sets menu. In the Sets to Analyze panel, select the input sets that you want to analyze, which in our case will be the 'Sample Gene Sets'. Note: in this tutorial we compare a single collection of gene sets against itself, but in general you can compare different collections (e.g., gene sets that represent clusters versus gene sets that represent functional annotations). You can also assign a name to the resulting analysis, which will create an entity for this analysis. For now, we will leave the name at its default. The resulting dialog box should look similar to:
Next, we will select the sets that we want to find enrichment for and set the parameters of the statistical tests. The default parameters are suitable for most applications. However, in this tutorial, since the analysis files are small, we will loosen the comparison criteria. From the Sets to Find Enrichment For panel, select the Sample Gene Sets, change the minimum number of genes in a gene set required in an enrichment to 2, set the maximum p-value to 0.8, and select 'No Correction' in the multiple hypothesis correction box. The resulting dialog box should look similar to:
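To make the effect of these parameters concrete, here is a rough Python sketch of how such thresholds might be applied to a list of pairwise results. It is not Genomica's code. The function `filter_enrichments` is hypothetical, the minimum-genes setting is interpreted here as a minimum overlap size (an assumption), and Bonferroni is used only as one familiar example of a multiple hypothesis correction; the tutorial does not say which corrections Genomica offers besides 'No Correction'.

```python
def filter_enrichments(results, min_genes=2, max_pvalue=0.8, correction=None):
    """Keep only the overlaps that pass the chosen thresholds.

    `results` is assumed to be a list of
    (set_a, set_b, overlap_size, p_value) tuples.
    `correction` is None (no correction) or "bonferroni" (illustrative only).
    """
    n_tests = len(results)
    kept = []
    for set_a, set_b, overlap, p in results:
        if correction == "bonferroni":
            # Multiply each p-value by the number of tests, capped at 1.
            p = min(1.0, p * n_tests)
        elif correction is not None:
            raise ValueError(f"unknown correction: {correction}")
        if overlap >= min_genes and p <= max_pvalue:
            kept.append((set_a, set_b, overlap, p))
    return kept


# Hypothetical pairwise results for illustration.
example_results = [
    ("A", "B", 3, 0.01),
    ("A", "C", 1, 0.50),  # dropped: overlap below the minimum of 2
    ("A", "D", 2, 0.90),  # dropped: p-value above the maximum of 0.8
]
loose = filter_enrichments(example_results)
strict = filter_enrichments(example_results, max_pvalue=0.05, correction="bonferroni")
```

With the loose tutorial settings and no correction, almost any overlap of two or more genes is reported, which is why these values suit only small demonstration files.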
Finally, you can display the results in both a graphical and a spreadsheet-like display. From the Display Options panel, select both the graphical and spreadsheet display, resulting in a dialog similar to:
You can now run the analysis by pressing the Analyze button. The graphical display should look similar to:
The graphical view is a matrix of the two collections of gene sets, where each colored entry indicates that the two gene sets have a statistically significant overlap, and the intensity of each colored spot represents the fraction of genes in the overlap. For instance, gene set 1 and gene set 2 have a significant overlap, and of course every gene set has a significant overlap with itself, since we used the same gene sets for both sides of the comparison. (In a general application you will compare two different collections of gene sets.) Note that you can cluster the analysis using the controls in the left control panel, and that you can save the graphical display as an image or as a text file. The full details of the statistical tests are given in the spreadsheet view, which should look similar to:
Each row corresponds to a significant overlap between the member genes of two gene sets, where the columns indicate the enriched set names, the p-value of the enrichment, the number of genes in the overlap, the total number of genes in the gene set, the percent of genes in the overlap, the total number of genes in the gene set, the total number of genes in the dataset, and the total percent of genes that belong to the gene set. In general, the larger the difference between the percent of genes in the overlap and the total percent of genes in the gene set in the dataset, the lower, and therefore more significant, the p-value.
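The count and percentage columns can be illustrated with a short Python sketch. This is one possible reading of the column list, not a description of Genomica's output format: it assumes that the first set size refers to the analyzed set and the second to the enriched set, and the function name `enrichment_row`, the dictionary keys, and the gene names are all hypothetical. The p-value column is omitted here.

```python
def enrichment_row(set_genes, enriched_genes, all_genes):
    """Compute count and percent columns for one pair of gene sets.

    Assumed interpretation: `set_genes` is the analyzed set (e.g. a cluster),
    `enriched_genes` is the set tested for enrichment (e.g. a GO category),
    and `all_genes` is every gene in the dataset.
    """
    overlap = set_genes & enriched_genes
    return {
        "overlap_genes": len(overlap),
        "set_size": len(set_genes),
        # Share of the analyzed set that falls in the enriched set.
        "percent_in_overlap": 100.0 * len(overlap) / len(set_genes),
        "enriched_set_size": len(enriched_genes),
        "dataset_size": len(all_genes),
        # Background share of the enriched set in the whole dataset.
        "percent_in_dataset": 100.0 * len(enriched_genes) / len(all_genes),
    }


# Hypothetical example: a 4-gene cluster, a 3-gene category, 20 genes overall.
dataset = {f"g{i}" for i in range(1, 21)}
cluster = {"g1", "g2", "g3", "g4"}
category = {"g1", "g2", "g9"}
row = enrichment_row(cluster, category, dataset)
```

In this example half of the cluster belongs to the category, while the category covers only 15 percent of the dataset; a gap of this kind is what drives the p-value down.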