PGA: Post-GWAS Analysis

Genome-wide association studies (GWAS) serve as a powerful tool to identify statistically significant, disease-associated variants. However, for many of the disease-associated variants reported by GWAS, the biological mechanisms underlying their association remain unclear. Thus, it is necessary to identify genes potentially affected by these reported variants as possible sources of the disease-association signals. Identifying the candidate disease genes to associate with a variant can be difficult, as many variants lie in non-coding (intronic or intergenic) regions of the genome and therefore share no obvious connection with any gene.

Currently, candidate disease genes are typically determined to be those genes in closest proximity to the variants in question—a method that does not effectively handle variants found in gene deserts, or in close proximity to multiple genes. Furthermore, this method overlooks the possibility that reported variants may be contained in regulatory elements and therefore affect even distant genes.

Here, we present a Java and Perl-based application with a command line interface for post-GWAS analysis. The application incorporates gene regulatory information in producing a list of scored candidate disease genes from a set of GWAS-reported variants, where candidate scores indicate the strength of a gene’s relationship to the disease in question. Thus, our program can identify risk genes both proximal and distal to GWAS-reported variants and determine those risk genes most likely to be the sources of disease-association signals.


PGA: post-GWAS analysis for disease gene identification. Lin JR, Jaroslawicz D, Cai Y, Zhang Q, Wang Z, Zhang ZD (2017) Bioinformatics (In press)


» The stand-alone executable file

» Alzheimer's disease: a case study

User guide

» System requirements

  • Internet access: the program communicates with our server for population genotype data and regulatory information during the execution process.
  • Memory: ~4 GB RAM for a large dataset and 2 GB for a smaller data set.
  • Software: Java 7 or later; VCFtools v0.1.13 or later; Tabix; Perl modules: List::Util, LWP::Simple, File::Spec::Functions

» Running PGA

  • Synopsis: java -cp src edu.yu.einstein.zdzlab.pga.GWASAnalyzer disease population [--c score-cutoff] [--f ]
  • edu.yu.einstein.zdzlab.pga: the package name
  • the main method
  • disease: a mandatory 3-letter disease code (e.g., SCZ, ALZ, etc.)
  • population: a mandatory 3-letter 1KG reference panel population code. For example, 'EUR', 'EAS', and 'AFR' represent European, East Asian, and African populations, respectively.
  • --c score-cutoff: optional flag followed by a score-cutoff (a positive number) to indicate that the user would like to incorporate custom annotation data in the process of identifying distal candidate genes. Only those regulatory elements with a score >= score-cutoff will be incorporated. The custom annotation data must follow a 5-columns BED format. See README.txt for details.
  • --f: optional flag to indicate that predictive features for the disease in question have already been generated in a previous execution with the same training genes and need not be computed again.

  • Note:
  • PGA must be run from the PGA directory (so must use 'java -cp'). If Java throws an 'out of memory error', run PGA with increased heapspace using the -Xmx option (e.g. 'java -Xmx4g -cp src ...').
  • Please see README.txt for the details of other options.

» Input files from the user

  • query_vars.bed: a text file containing a list of input SNPs in the BED format with four tab-delimited columns (chromosome, start position, end position, and SNP ID). There should be no header line. The file must be named: "query_vars.bed" and located in 'PGA/gwascge/input/' directory.
  • [---]_training_genes.txt: [---] represents the 3-letter disease-code (e.g. SCZ, ALZ, etc.) for convenience to differentiate the seed-gene files of different diseases. This training gene file should have > 30 genes to ensure effective candidate gene scoring, and a training gene set containing > 50 genes is recommended. The file must be located in 'PGA/gwascge/input' directory.

» Output files

  • candidate_gene_scores_timestamp.txt: a text file that contains candidate risk gene symbols and their scores. A gene without a symbol is not listed. Genes for which no score could be produced based on the given network/GO data are marked with '--' in the score column. At the end of the file the high scoring cutoff is listed, indicating the score cutoff for which every gene with a score greater than this number should be considered high scoring. The threshold is the value that achieves a prediction precision ≥ 0.8 (See 'Evaluating scores' section).
  • PGA_results_timestamp.txt: a text file that contains risk regions, associated SNPs, and gene symbols (& score). Additionally, each gene symbol may be preceded by a '*', indicating a distal gene. At the end of the file the high scoring cutoff is listed, indicating the score cutoff for which every gene with a score greater than this number should be considered high scoring. The method of obtaining this threshold is explained in the supplement to the program's application note.

    Note: a large number of intermediate output files are produced within the PGA directory. See README.txt in program distribution for details.


  1. Prepare a list of disease GWAS SNPs

    List your query GWAS SNPs in the file 'query_vars.bed' located in 'PGA/gwascge/input' directory in a 4-column tab-delimited format (Chr#    Start    End    SNP ID). An example is shown below:
    chr4	182101688	182101689	rs10018288
    chr2 127852020 127852021 rs10207628
    chr12 106108718 106108719 rs10219670
    chr7 146897402 146897403 rs10273775
    chr18 56752053 56752054 rs1037757
    chr14 92926951 92926952 rs10498633
    0-based and 1-based indexing are both accepted. The file for the case study of Alzheimer's disease can be downloaded here).

  2. Prepare a list of disease seed genes

    List disease seed genes (> 30 genes) in the file 'ALZ_training_genes.txt' located in 'PGA/gwascge/input' directory. Use either gene symbols or Entrez IDs but do not use both in the same seed-gene file. See an example (for Alzheimer's disease).

  3. Run PGA

    Execute the following command under the directory 'PGA':

    java -Xmx4g -cp src edu.yu.einstein.zdzlab.pga.GWASAnalyzer ALZ EUR

    • -Xmx4g specifies 4GB memory to run the program.
    • ALZ is the disease code in this example used to identify the training gene file.
    • EUR is used to calculate linkage disequilibrium (LD) based on the genotype data of Eupopean population. The other two options are EAS (East Asian) and AFR (African).


Our post-GWAS analysis application performs two distinct operations: it identifies candidate disease genes given a set of GWAS-reported variants, and it scores these candidate genes to prioritize those most likely to be the sources of the disease-association signals.

» Identifying risk regions

As GWAS only detect statistical associations from among a pre-select subset of variants, it is necessary to identify all unexamined variants that are in strong linkage disequilibrium (LD) with the GWAS-reported SNPs/indels as potential alternative sources of the disease-association signal. Using VCFtools (Danecek, et al., 2011) and an 1,000 Genomes Project (1KG) reference panel (Genomes Project, et al., 2012), we calculate the LD between each GWAS-reported variant and every 1KG variant within a 400-kb range. An LD block is then formed from all within-range SNPs with r2 > 0.5 and is indexed by the corresponding GWAS-reported variant. We can then use these LD blocks to identify both proximal and distal risk genes.

» Identifying proximal risk genes

We merge all overlapping or adjacent (within 250 kb) LD blocks to form genomic risk regions. Proximal risk genes are identified as those genes that—after extending their ranges by 20 kb on each end—overlap with these genomic risk regions.

» Identifying distal risk genes

In order to incorporate gene regulatory information, we collected lists of expression quantitative trait loci (eQTL) and transcriptional regulatory elements (TREs) with their target genes from ENCODE (Thurman, et al., 2012), FANTOM5 (Andersson, et al., 2014), and GTEx (Consortium, 2013) data. Distal risk genes are determined to be those genes affected by an eQTL or TRE that contains any of the variants found in an LD block (including the GWAS-reported variant itself).

» Scoring risk genes

Analyzing GWAS-reported variants can produce a large number of candidate disease genes—particularly when gene regulatory information is incorporated as well—and it is therefore useful to identify the most likely candidates among these risk genes. Our application employs a statistical method to score the disease-relatedness of risk gene candidates with predictive features extracted from gene networks and annotation based on a set of training genes—genes known to play a role in the etiology of the disease in question.

Given this set of known disease genes D and the set of known genes G (from GENCODE v19), we obtain the set of background genes B = G – D. Then, from the set of known disease genes D we extract our predictive features—the frequent combinations of the neighbors of genes in D in the gene network we employ (need updated FLN), or of the Gene Ontology (GO) terms associated with the genes in D. Our application uses the FP-growth algorithm for frequent itemset mining (HAN et al. 2000) with a support value of ⌈(|D|)/10⌉. Additionally, GO terms of genes in D include both annotated GO terms and their ancestor GO terms along the path of the 'is a' relationship in the GO hierarchy structure. The predictive features generated are limited to 3-itemsets to avoid redundancy and intensive computation.

After extracting our predictive feature set from D, we assign a score to each feature f in the set based on the frequency of its association with genes in D and B:


in which FD is the frequency with which f occurs in D and ND is the number of genes in D. FB and NB have similar meanings. Next, for each candidate disease gene, we identify all the predictive features with which it is associated and assign it the highest score of these features. In the event that a risk gene candidate is a training gene as well, the score Sf of each predictive feature it contains must be adjusted:


As network and annotation scores are treated separately, each gene now has two different scores that must be combined to produce a final gene score:


in which Sf(n) and Sf(a) are the network and annotation-based scores of the gene in question, respectively, and α is a coefficient controlling the amount of influence these two scores have, relative to each other, on the final gene score. α set at 0.4 will yield the best predictive power according to the result of our evaluation (the data is not shown). Every candidate gene is scored using gene sets B and D excluding the candidate gene itself to avoid biased scoring.

Notably, our gene scoring method is limited by information available about known disease genes and is based on the hypothesis that novel disease genes will be involved in the same pathways and mechanisms as known disease genes.

» Evaluating scores

Our application produces a high scoring cutoff threshold in order to give the scores generated some context. The threshold is the value that achieves a prediction precision ≥ 0.8.

» Application output and efficiency

Our post-GWAS analysis application produces a list of candidate disease genes based on user-input, GWAS-reported variants, a list of scored candidate disease genes based on the scoring method outlined above, and intermediate files such as lists of network and annotation-based predictive features that can be reused with the same training gene set in order to avoid redundant computation. The first operation performed by the application, generating risk gene candidates, requires between 15 and 90 seconds per input variant, dependent on the size of the 1KG reference panel in use. The next operation, scoring these candidate genes, can take less than one to several hours, dependent on the size of the network/annotation data in use and the relationship between this data and the training gene set used to generate predictive features. As the frequent itemset mining required to generate predictive features takes a majority of the time in this operation, using pre-generated predictive feature sets significantly reduces the time necessary to score candidate genes.


Andersson, R., et al. An atlas of active enhancers across human cell types and tissues. Nature 2014;507(7493):455-461.

Consortium, G.T. The Genotype-Tissue Expression (GTEx) project. Nature genetics 2013;45(6):580-585.

Danecek, P., et al. The variant call format and VCFtools. Bioinformatics 2011;27(15):2156-2158.

Genomes Project, C., et al. An integrated map of genetic variation from 1,092 human genomes. Nature 2012;491(7422):56-65.

Jiawei, H., et al. Mining frequent patterns without candidate generation. In, SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data. Dallas, Texas, USA.

Thurman, R.E., et al. The accessible chromatin landscape of the human genome. Nature 2012;489(7414):75-82.