Create a Module Map: Characterize Expression Data Using Gene Sets
By creating a module map, you can characterize an expression dataset by the gene sets and experiment sets that significantly change in it, thereby arriving at a higher level view of the data. You can work with any organism, expression data, gene sets, and experiment sets of your choice. The following tutorial takes you through the three steps you need to take in order to create your own module map. This tool was used to create a module map of cancer, and the full details of the method used to create module maps are available here.
As a general overview, the procedure for creating a module map starts with an expression dataset and the gene and experiment sets of interest. In the first step, for each gene set, the procedure identifies all the arrays in which the gene set is significantly up- (down-) regulated, using a statistical test based on the hypergeometric distribution. In the second step, the set of arrays in which each gene set is significantly up- (down-) regulated is tested for enrichment relative to each of the available experiment sets, resulting in a map of gene sets versus the experiment sets in which they are significantly up- (down-) regulated. A map can also be created without experiment sets, in which case the resulting map will consist of gene sets versus the individual experiments in which they are significantly up- (down-) regulated. There is also the option of automatically merging gene sets with similar expression patterns, and of refining the gene sets into modules that include only the genes that are significantly consistent with the arrays in which a gene set significantly changes. Finally, once a module map is constructed, you can return to the gene level and examine detailed views of the original data from which the significant associations were derived.
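The two steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Genomica's implementation: the toy expression matrix, the gene set "A", the experiment sets "treated" and "control", the cutoff of 1.0, and the p-value thresholds are all hypothetical, and only up-regulation is handled (down-regulation would be symmetric).

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n items from N, of which K are marked."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

def significant_arrays(expr, gene_sets, n_arrays, cutoff, max_p):
    """Step 1: for each gene set, the arrays in which it is up-regulated."""
    n_genes = len(expr)
    result = {}
    for name, members in gene_sets.items():
        members = members & set(expr)
        arrays = set()
        for a in range(n_arrays):
            up = {g for g, row in expr.items() if row[a] >= cutoff}
            k = len(up & members)
            if k and hypergeom_tail(k, len(members), len(up), n_genes) <= max_p:
                arrays.add(a)
        result[name] = arrays
    return result

def enriched_experiment_sets(sig_arrays, experiment_sets, n_arrays, max_p):
    """Step 2: test each gene set's arrays against each experiment set."""
    result = {}
    for gs, arrays in sig_arrays.items():
        for es, members in experiment_sets.items():
            k = len(arrays & members)
            if k:
                p = hypergeom_tail(k, len(arrays), len(members), n_arrays)
                if p <= max_p:
                    result[(gs, es)] = p
    return result

# Hypothetical toy data: 8 genes measured on 4 arrays.
expr = {
    "g1": [2.0, 1.5, 0.0, 0.1], "g2": [1.8, 2.2, -0.2, 0.0],
    "g3": [2.5, 1.1, 0.3, -0.1], "g4": [1.2, 1.9, 0.1, 0.2],
    "g5": [0.0, 0.1, 1.6, 0.0], "g6": [-0.3, 0.2, 2.0, 0.1],
    "g7": [0.1, -0.1, 0.0, 0.0], "g8": [0.0, 0.0, 0.2, -0.2],
}
gene_sets = {"A": {"g1", "g2", "g3", "g4"}}
experiment_sets = {"treated": {0, 1}, "control": {2, 3}}

sig_arrays = significant_arrays(expr, gene_sets, 4, cutoff=1.0, max_p=0.05)
enrichments = enriched_experiment_sets(sig_arrays, experiment_sets, 4, max_p=0.2)
print(sig_arrays)
print(enrichments)
```

In this toy data, gene set "A" is up-regulated in arrays 0 and 1, and those two arrays are exactly the "treated" experiment set, so the map would contain a single gene set vs. experiment set entry.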
To create a module map, follow the steps below. The required instructions are highlighted in bold, and the surrounding text provides the full details of the various parameters that you can control in other applications.
Step 1: Load expression data
The first step is loading the expression data for which you want to construct a module map. Details on how to load expression data, and on the Genomica file formats for expression data, are given here. In this tutorial we assume that you load the sample expression data. Other expression data are available here.
Step 2: Load gene and experiment sets
Next, you need to load the gene and experiment sets that will be used to construct the module map. Details on how to load sets, and on the Genomica file formats for gene and experiment sets, are given here. In this tutorial we assume that you load the sample gene sets and sample experiment sets. Other gene sets are available here. Note that you can also create a module map using only gene sets.
Step 3: Create the module map
You are now ready to create the module map. Open the dialog box from the Algorithms -> Create a Module Map menu. From the Gene Sets panel, select the 'Sample gene sets' group of gene sets (this is the set of gene sets you loaded in Step 2 above). Note that in general, you can also filter the gene sets used to create a module map by the minimum and maximum number of member genes they contain. The dialog box should look similar to the following:
Next, move to the Experiments panel. Here you can control the set of experiments from which to create the map and the cutoff for up- and down-regulation. You can also filter the enrichments of a gene set in an experiment by a minimum number of genes, a maximum p-value, and a minimum number of experiments per gene set or minimum number of gene sets per experiment, and you can control the statistical correction for multiple hypotheses. The default parameters are suitable for many applications. For now, however, since the sample files we use are small, we will loosen the enrichment p-value criterion. From the Experiments panel, change the max p-value to 0.5 and select 'No correction' in the multiple hypothesis correction box. The dialog box should look similar to the following:
The next step is to set up the enrichments for the experiment attributes. Although experiment attributes are not required in general, in large compendiums of expression data they can be very useful for understanding what properties are common to the set of experiments in which gene sets are found to be enriched. The 'Experiment Attributes' panel allows you to select the experiment sets that will be tested for enrichment, to filter the enrichments by a maximum p-value and by the size of the experiment sets in terms of their member experiments, and to control the statistical correction for multiple hypotheses. The default parameters are suitable for many applications. However, due to the small sample files we use in this tutorial, we will now loosen these criteria. From the Experiment Attributes panel, select the Sample Experiment Sets group (the one you loaded in Step 2 above), change the minimum number of enriched experiments to 2, change the maximum p-value to 0.5, change the minimum number of experiments per experiment set to 2, and select 'No Correction' in the multiple hypothesis correction box. The resulting dialog box should look similar to the following:
In large collections of gene sets, some gene sets may be redundant with each other and some gene sets may contain spurious genes that are not related to the studied phenomenon. Thus, our general method allows automatic merging of gene sets with similar signatures across the expression data, through hierarchical clustering of the gene sets and automated identification of (potentially overlapping) clusters from the resulting clustering tree. This results in the construction of modules, and in Genomica the resulting module collection may also be given a name. The Modules panel allows you to control the parameters for creating these modules, as well as the number of iterations for which this refinement process is repeated. The full details of this automatic refinement procedure are given in the Methods section here. For now, we will use the original gene sets and will not refine them using modules. Thus, from the Modules panel, enter '0' for the maximum number of times to iterate the analysis. The resulting dialog box should look similar to:
Finally, you can select the way in which you want the results to be displayed, including a graphical display and a spreadsheet-like display, where the latter will include the full details of the statistical analyses performed. For this tutorial, from the Display panel, select both the graphical and spreadsheet displays. The resulting dialog box should look similar to:
You can now create the module map by pressing the Run button. The graphical display should look similar to:
The upper left graphical panel is a matrix of gene sets vs. arrays, where a colored entry indicates that the genes in the gene set are significantly changing in a coordinated fashion in the respective array (defined through a statistical test based on the hypergeometric distribution, see Methods for full details). In the above example, gene set 1 is coordinately up-regulated in arrays 2, 4, and 5, gene set 2 is up-regulated in arrays 2-4, and gene set 3 is down-regulated in arrays 3 and 6. You can trace back these significant enrichments by looking at the original data. For instance, in array 2, genes 2, 3, 5, 6, and 7 are up-regulated. Gene set 2 includes genes 4, 5, 6, and 7, so 3 of its 4 genes are among the 5 up-regulated genes in array 2. Gene set 2 is therefore reported as significantly up-regulated in array 2 at the loosened maximum p-value of 0.5 (the hypergeometric p-value in this case is 0.5, see the spreadsheet below).
The center graphical panel shows the experiment sets that each array belongs to. This does not include any computation, and just shows the raw experiment set association data. Finally, the bottom graphical panel shows the experiment sets that are significantly enriched in the set of up-regulated or down-regulated experiments that each gene set is enriched in. For instance, the green colored entry of gene set 3 vs. array set 3 indicates that the set of arrays in which gene set 3 was found to be down-regulated is significantly enriched for arrays that belong to array set 3. You can trace this association back by examining the upper graphical panels, where you will see that gene set 3 is significantly down-regulated in array 3 and array 6, and that both of these arrays, as well as array 5, belong to array set 3. Thus, 2 of the 2 arrays in which gene set 3 is significantly down-regulated belong to array set 3, out of a total of 3 arrays that belong to this experiment set, and we may conclude that this association is significant (the hypergeometric p-value in this case is 0.14, see the spreadsheet below).
If you selected the Spreadsheet display in the Display panel when creating the module map, then you will also get two spreadsheet-like tables that show the details of the statistical tests that were performed. For the gene set vs. array enrichment tests, the spreadsheet should look similar to:
Each row in the above table represents a significant gene set vs. experiment association, and lists the array in which the gene set is significantly expressed, the direction of significant expression (Up/Down regulation), the name of the enriched gene set, the p-value of the enrichment, the number of genes from the gene set that were found to be up- (down-) regulated in this experiment, the total number of genes in the gene set, the percent of genes up- (down-) regulated in the gene set, the total number of genes in the experiment that are up- (down-) regulated, the total number of genes in the experiment, and the total percent of genes in the experiment that are up- (down-) regulated. In general, the more the percent of up- (down-) regulated genes in the gene set exceeds the percent of up- (down-) regulated genes in the experiment as a whole, the lower and more significant the p-value will be. You will also get a similar spreadsheet-like view for the enrichments of experiment sets in the set of arrays in which a gene set is significantly expressed, which should look similar to the following:
Each row in the above table represents a significant experiment set vs. gene set association, indicating that the set of arrays in which the respective gene set is found to be significantly expressed is significantly enriched with arrays that belong to the respective experiment set. The columns in each row correspond to the experiment attribute significantly enriched, the gene set whose arrays are enriched, the p-value of the enrichment, the number of up- (down-) regulated arrays from the gene set that are associated with the respective experiment set, the total number of up- (down-) regulated arrays in the gene set, the percent of up- (down-) regulated arrays in the gene set that belong to the respective experiment set, the total number of arrays that belong to the experiment set, the total number of arrays in the data, and the total percent of experiments from the data that belong to the respective experiment set.
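The two worked examples from the graphical display (gene set 2 in array 2, and gene set 3 vs. array set 3) can be checked numerically with the short Python sketch below, which also derives the count and percent columns described for the spreadsheets. The tutorial does not state the size of the sample data, so the totals of 8 genes and 7 arrays are assumptions, chosen because they reproduce the reported p-values of 0.5 and 0.14 with a one-sided hypergeometric test. The function and variable names are illustrative only.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when drawing n items from N, of which K are marked."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / total

# Gene set 2 in array 2 (gene numbers taken from the tutorial text).
array2_up = {2, 3, 5, 6, 7}   # genes up-regulated in array 2
gene_set_2 = {4, 5, 6, 7}     # members of gene set 2
total_genes = 8               # ASSUMPTION: not stated in the tutorial

hits = len(array2_up & gene_set_2)
p_gene = hypergeom_tail(hits, len(gene_set_2), len(array2_up), total_genes)
pct_in_set = 100 * hits / len(gene_set_2)
pct_in_array = 100 * len(array2_up) / total_genes

# Gene set 3 vs. array set 3 (array numbers taken from the tutorial text).
down_arrays = {3, 6}          # arrays where gene set 3 is down-regulated
array_set_3 = {3, 5, 6}       # members of array set 3
total_arrays = 7              # ASSUMPTION: not stated in the tutorial

array_hits = len(down_arrays & array_set_3)
p_array = hypergeom_tail(array_hits, len(down_arrays),
                         len(array_set_3), total_arrays)

print(f"gene set 2 / array 2: {hits} of {len(gene_set_2)} genes "
      f"({pct_in_set:.1f}% vs {pct_in_array:.1f}% overall), p = {p_gene:.2f}")
print(f"gene set 3 / array set 3: {array_hits} of {len(down_arrays)} arrays, "
      f"p = {p_array:.2f}")
```

Under these assumptions, 75% of gene set 2 is up-regulated in array 2 compared with 62.5% of all genes, which is why the p-value is only borderline at 0.5.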
Back to the gene level: Viewing the set of genes that significantly change
A module map characterizes an expression dataset relative to the gene and experiment sets that significantly change in it, thereby providing an informative high level view of the expression data. However, in many cases, we would like to examine some of the significant associations shown in the map in more detail. Genomica allows you to do so by going back to a gene level view for a particular gene set or group of gene sets of interest. We will now demonstrate how this can be done for two of the gene sets in the above example. If you followed the steps in the tutorial above, then go to the graphical view of the module map, and with your mouse, select an invisible region around the labels for gene set 1 and gene set 2. The two gene sets should now be highlighted in red and you should see a view similar to:
From the left control panel, click on the View Gene Hits button, which brings up a dialog box for controlling which genes will be viewed from these two gene sets. The default parameters are useful for most applications, but in this tutorial, due to the small sample files, we will loosen some of the criteria. For the max p-value, select 0.5, select 'No Correction' in the multiple hypothesis correction box, and choose the Sample Experiment Sets at the bottom to display the significant associations of the arrays of the gene sets with these experiment sets. The dialog box should look similar to:
After clicking the Analyze button in the above dialog, you should get a view of the original data that represents those genes that contributed to the overall significance of the gene set in the respective arrays, which should be similar to:
This view shows the raw expression data of the genes that belong to the selected gene sets (gene set 1 and gene set 2 in this case) and that are significantly consistent with the overall pattern of expression of the gene set in those arrays in which the gene set is significantly expressed (in this case, where this association has a p-value of at most 0.5, based on our selection above). The arrays displayed are the union of the arrays in which the selected gene sets are significantly expressed. The panel in the center lists the original expression data, the panel on the bottom lists the raw data association of the genes with the gene sets, and the panel on the upper right shows the raw data association of arrays with experiment sets, as well as the arrays in which each of the gene sets selected for display was found to be significantly expressed (this latter part is essentially a sub-matrix of the module map that includes the selected gene sets).
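As a rough illustration of how this view is assembled, the Python sketch below computes which arrays would be displayed for gene sets 1 and 2, and which members of gene set 2 count as hits in array 2, using only the numbers quoted earlier in this tutorial. Treating a "gene hit" as a member gene that changes in the same direction as its gene set is a simplifying assumption made here, not a description of Genomica's exact criterion, and the variable names are hypothetical.

```python
# Arrays in which each selected gene set is significantly up-regulated
# (taken from the module map example in this tutorial).
significant_up_arrays = {
    "gene set 1": {2, 4, 5},
    "gene set 2": {2, 3, 4},
}
selected = ["gene set 1", "gene set 2"]

# The displayed arrays are the union over the selected gene sets.
displayed_arrays = sorted(
    set().union(*(significant_up_arrays[name] for name in selected))
)

# Gene hits for gene set 2 in array 2: members that are up-regulated there.
gene_set_2 = {4, 5, 6, 7}
array2_up = {2, 3, 5, 6, 7}
gene_hits = sorted(gene_set_2 & array2_up)

print("arrays displayed:", displayed_arrays)
print("gene set 2 hits in array 2:", gene_hits)
```

Gene 4 belongs to gene set 2 but is not up-regulated in array 2, so under this simplified criterion it would not be counted as a hit for that array.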