In the following descriptions, "dataname" is a placeholder for one of the three datasets: "moran", "scherzer" and "zhang".

0. Install the package pipeGS from CRAN. (Optional if you load the provided pval_comparison_dataname.rda files directly.)
1. Place the files dataname_all.RData and computeCorr.R in the same folder, change directory into that folder, and run "Rscript ./computeCorr.R dataname" from the shell, where dataname is moran, scherzer or zhang.
2. The first run of the script for each dataset takes about 20 minutes to compute phat.saddle1, phat1, phat2 and phat3 for all 6180 gene sets. The script caches all the p-values in the file pval_comparison_dataname.rda after the first run, so later runs automatically load the cached file. At the end, it prints the correlation table (Table 3 in the paper) and the timing table (Table 4 in the paper) for the Moran and Scherzer datasets.

Remark 1. dataname_all.RData contains the following objects:
- X: the binary indicator for treatment and control groups, with length n
- Y: the gene measurement matrix for all samples, of size n x p; column names are gene symbols
- gs: definitions of gene sets with entrez.id
- entrezids.map: the mapping from gene symbol to entrez.id
- phat.mc: the Monte Carlo estimates for the 6180 gene sets, as described in the paper

Remark 2. The gene set definitions were downloaded from
http://software.broadinstitute.org/gsea/msigdb/collections.jsp
We used c2.all.v5.1.entrez.gmt and c5.all.v5.1.entrez.gmt. The current version is v6.1.

Remark 3. The latest "moran", "scherzer" and "zhang" data can be obtained from the following URLs, respectively.
"https://ftp.ncbi.nlm.nih.gov/geo/series/GSE8nnn/GSE8397/matrix/GSE8397-GPL96_series_matrix.txt.gz", "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE6nnn/GSE6613/matrix/GSE6613_series_matrix.txt.gz", "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE20nnn/GSE20292/matrix/GSE20292_series_matrix.txt.gz" These files, once downloaded can be read into R with "getGEO(filename = file.path)" - getGEO is a function in bioconductor package GEOquery. These data have changed since we did our first analysis - more control cases have been added- hence we provide the old data in the format in the above mentioned RData files.