g. genes encoding products in the same metabolic pathway) at the top or bottom of a ranked list of genes L. Candidate genes are ranked by their differential expression between two phenotypes. The enrichment statistic is a weighted Kolmogorov-Smirnov-like statistic, and its significance is calculated using an empirical permutation test [13]. Here we applied an extended version of conventional GSEA to produce an enrichment score for a single sample, as we have done previously [14]. Such a score is necessary if one is to make a predictive call on single samples without reference to a larger group of samples. In this approach, the genes are ordered based on either absolute
expression (as in the yellow fever vaccine study) or the relative changes with respect to the baseline level (as in the influenza TIV vaccine
study). In this study, we used the C2 collection from the Molecular Signatures Database (MSigDB). MSigDB is a publicly available database of annotated gene sets hosted at the Broad Institute (http://www.broadinstitute.org/gsea/msigdb/index.jsp) [11]. Currently, there are six major collections, C1 to C6; C2 is a special collection of gene sets carefully curated from online pathway databases, publications in PubMed, and the knowledge of domain experts. Each of the ∼3000 gene sets in the C2 collection is well described on the MSigDB website, including its source, annotation, and other useful information, which facilitates the interpretation of its biological meaning.
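As an illustration of the single-sample enrichment score described earlier in this section, the following is a minimal sketch of a weighted Kolmogorov-Smirnov-like running sum computed on one sample. It is not the exact procedure of [14], whose details are not given in this text: the function name `enrichment_score`, the `weight` exponent, and the choice of the maximum deviation from zero as the final score are assumptions made for illustration.

```python
from typing import Dict, Set


def enrichment_score(expression: Dict[str, float],
                     gene_set: Set[str],
                     weight: float = 1.0) -> float:
    """Weighted KS-like running-sum score for one sample (illustrative sketch).

    `expression` maps gene -> value for a single sample; the value may be
    absolute expression or the change relative to a baseline sample.
    """
    # Order genes from the highest to the lowest value within this sample.
    ranked = sorted(expression, key=expression.get, reverse=True)
    hits = [gene in gene_set for gene in ranked]
    n_miss = len(ranked) - sum(hits)

    # Hit steps are proportional to |value|**weight and normalised to sum to 1.
    hit_total = sum(abs(expression[g]) ** weight
                    for g, is_hit in zip(ranked, hits) if is_hit)
    if hit_total == 0 or n_miss == 0:
        return 0.0  # score is undefined here; 0.0 is an arbitrary convention

    running, best = 0.0, 0.0
    for gene, is_hit in zip(ranked, hits):
        if is_hit:
            running += abs(expression[gene]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss  # uniform penalty for genes outside the set
        # Keep the signed deviation with the largest magnitude (assumption).
        if abs(running) > abs(best):
            best = running
    return best
```

A gene set concentrated at the top of the ranked list yields a score near +1, one concentrated at the bottom yields a score near -1, and a set scattered evenly through the list yields a score near 0.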
To detect gene sets whose enrichment scores are highly correlated with phenotypes, we used a normalized mutual information (NMI) score (Eq. (3)) to evaluate the association between phenotypes (day 7 versus day 0 in the yellow fever vaccine study, or high versus low HAI antibody response in the influenza TIV vaccine study) and gene set enrichment scores.
The constellation plot is designed to visualize, and thus to elucidate, groups of gene sets enriched in a phenotype of interest (e.g. vaccine response) that correspond to distinct biological processes. We reasoned that gene sets that (i) demonstrate high mutual information with respect to the phenotype; (ii) demonstrate high mutual information with respect to each other; and (iii) share overlapping member genes would be likely to reflect similar biological processes. We estimated the similarity between each pair of the N gene sets using an NMI score and transformed it into a dissimilarity score, d = 1 – NMI. Previous studies [29] have shown that this dissimilarity has all the properties of a true mathematical distance (metric), allowing us to represent the association of gene sets with a proper distance matrix D. We visualized this distance matrix D as a radial plot in which the angle between two gene sets represents the distance d between them, and their proximity to the center reflects their differential enrichment with respect to the phenotype (1 – NMI).
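The NMI score and the derived distance d = 1 – NMI can be sketched as follows. This is a hedged illustration, not the authors' implementation. It assumes the normalization NMI = I(X;Y) / H(X,Y), one form for which 1 – NMI is known to be a metric, and it assumes the continuous enrichment scores are discretized into equal-width bins. The estimator and normalization actually used are not specified in this text, and the helper names `entropy`, `nmi`, `discretize`, and `distance_matrix` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence) -> float:
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(x: Sequence, y: Sequence) -> float:
    """Normalized mutual information, assumed here to be I(X;Y) / H(X,Y)."""
    h_xy = entropy(list(zip(x, y)))
    if h_xy == 0:
        return 1.0  # both variables are constant; treated as identical by convention
    mutual_info = entropy(x) + entropy(y) - h_xy
    return max(0.0, mutual_info / h_xy)  # clip tiny negative rounding errors


def discretize(scores: Sequence[float], bins: int = 3) -> List[int]:
    """Equal-width binning of continuous scores (an assumed estimator)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0] * len(scores)
    return [min(int((s - lo) / (hi - lo) * bins), bins - 1) for s in scores]


def distance_matrix(score_rows: Sequence[Sequence[float]],
                    bins: int = 3) -> List[List[float]]:
    """Pairwise d = 1 - NMI between gene sets (rows of per-sample scores)."""
    binned = [discretize(row, bins) for row in score_rows]
    return [[1.0 - nmi(a, b) for b in binned] for a in binned]
```

In this sketch, the association of one gene set with the phenotype would be `nmi(phenotype_labels, discretize(scores))`, and `distance_matrix` applied to the enrichment scores of all N gene sets gives an N × N matrix playing the role of D.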