Tutorial

1. Input data

(Note: The GWAS data uploaded by users will be deleted automatically as soon as the analysis is finished)

1.1 Upload GWAS SNP P-value file (required)

The data should be a plain-text file containing exactly two tab-separated columns and no header line. The first column is the SNP ID and the second column is the P-value.

rs3934834	0.2743
rs3737728	0.7365
rs6687776	0.2313
rs4970405	0.1973

Note: Our analysis uses -log₁₀(P-value). If your data have already been transformed, please un-tick the option "P-value -> -log₁₀(P-value)".
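As a rough illustration of the file format and the transformation described above, the following Python sketch parses such a file and optionally applies -log₁₀. The function name `parse_pvalue_file` and the `transform` flag are hypothetical and only mirror the tick box; this is not the server's own code.

```python
import math

def parse_pvalue_file(text, transform=True):
    """Parse 'SNP<TAB>P-value' lines (no header) into a dict.

    transform=True converts each P-value to -log10(P); transform=False
    assumes the second column is already transformed and keeps it as-is.
    """
    scores = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 tab-separated columns")
        snp_id, value = fields[0].strip(), float(fields[1])
        if transform:
            if not 0.0 < value <= 1.0:
                raise ValueError(f"line {line_no}: P-value must be in (0, 1]")
            value = -math.log10(value)
        scores[snp_id] = value
    return scores

# The four example rows shown above
example = "rs3934834\t0.2743\nrs3737728\t0.7365\nrs6687776\t0.2313\nrs4970405\t0.1973\n"
scores = parse_pvalue_file(example)
for snp_id, t in scores.items():
    print(snp_id, round(t, 4))
```

Checking the column count up front catches files that are space-separated rather than tab-separated, which is a common cause of upload problems.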

1.2 Specify the most significant SNPs

The most significant SNPs are the SNPs in the GWAS SNP P-value file whose P-value is below a given threshold. They are used to search their LD neighborhoods for functional SNPs (e.g. non-synonymous SNPs). The default threshold is P-value < 10^-5, as used by the NHGRI GWAS Catalog. Users may set their own threshold, or instead input/upload the rs-IDs of the most significant SNPs directly, one per line, in the following format:

rs3761847
rs2476601
rs660895
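The threshold rule can be sketched in a few lines of Python. The SNP IDs and P-values below are made up for illustration, and `select_top_snps` is a hypothetical helper, not part of the server.

```python
def select_top_snps(pvalues, threshold=1e-5):
    """Return the IDs of SNPs whose raw P-value is strictly below threshold."""
    return sorted(snp for snp, p in pvalues.items() if p < threshold)

# Illustrative values only (not real GWAS results)
pvalues = {"snp_a": 2e-7, "snp_b": 4e-9, "snp_c": 9e-6, "snp_d": 0.2743}
top = select_top_snps(pvalues)
print("\n".join(top))  # one ID per line, as in the upload format
```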

2. Options

2.1 Options for searching LD neighborhoods of the most significant SNPs

HapMap population

Select the HapMap population used to search for the linkage disequilibrium (LD) neighborhoods of the most significant SNPs. The choices are:

CEU (C): Utah residents with Northern and Western European ancestry from the CEPH collection
ASW (A): African ancestry in Southwest USA
CHB (H): Han Chinese in Beijing, China
CHD (D): Chinese in Metropolitan Denver, Colorado
GIH (G): Gujarati Indians in Houston, Texas
JPT (J): Japanese in Tokyo, Japan
LWK (L): Luhya in Webuye, Kenya
MEX (M): Mexican ancestry in Los Angeles, California
MKK (K): Maasai in Kinyawa, Kenya
TSI (T): Toscans in Italy
YRI (Y): Yoruba in Ibadan, Nigeria (West Africa)

Cutoff

Choose the LD measure (r² or D') and the cutoff on that measure for defining LD neighborhoods.

Distance (up to 200kb)

The maximum distance from a most significant SNP within which its LD neighborhood is searched (up to 200 kb).
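To show how the measure, cutoff, and distance options combine, here is a minimal Python sketch that filters precomputed LD records. The `LDPair` layout, the default cutoff of 0.8, and the example values are assumptions made for illustration; the server's actual HapMap lookup is not described here.

```python
from typing import NamedTuple

class LDPair(NamedTuple):
    """One precomputed LD record (hypothetical layout)."""
    index_snp: str
    neighbor_snp: str
    r2: float
    d_prime: float
    distance_bp: int

def ld_neighbors(pairs, index_snp, measure="r2", cutoff=0.8, max_distance_bp=200_000):
    """Return neighbors of index_snp passing the LD cutoff and distance limit."""
    if measure not in ("r2", "d_prime"):
        raise ValueError("measure must be 'r2' or 'd_prime'")
    if max_distance_bp > 200_000:
        raise ValueError("distance is limited to 200 kb")
    return [
        p.neighbor_snp
        for p in pairs
        if p.index_snp == index_snp
        and getattr(p, measure) >= cutoff
        and p.distance_bp <= max_distance_bp
    ]

# Illustrative records only
pairs = [
    LDPair("snp_a", "snp_x", 0.95, 1.00, 12_000),
    LDPair("snp_a", "snp_y", 0.40, 0.90, 30_000),
    LDPair("snp_a", "snp_z", 0.85, 0.97, 350_000),
]
print(ld_neighbors(pairs, "snp_a"))
print(ld_neighbors(pairs, "snp_a", measure="d_prime", cutoff=0.9))
```

In this toy data, `snp_z` is excluded under either measure because it lies beyond the 200 kb limit, even though its LD values are high.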

2.2 Options for pathway based analysis (PBA)

Rules of mapping SNPs to genes

There are several rules for mapping SNPs to genes: "functional SNPs", "within gene", "5 kb upstream and downstream range of gene", "20 kb upstream and downstream range of gene", "100 kb upstream and downstream range of gene" and "500 kb upstream and downstream range of gene".
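The distance-based rules can be pictured as a window test around each gene. The Python sketch below assumes genes are given as (symbol, chromosome, start, end) tuples with made-up coordinates; `map_snp_to_genes` is a hypothetical helper, and the "functional SNPs" rule, which depends on SNP annotation, is not covered.

```python
def map_snp_to_genes(chrom, position, genes, flank_bp=0):
    """Return symbols of genes whose span, extended by flank_bp on each side,
    contains the SNP. flank_bp=0 corresponds to "within gene"; 5_000, 20_000,
    100_000 and 500_000 correspond to the upstream/downstream range rules.
    """
    return [
        symbol
        for symbol, g_chrom, start, end in genes
        if g_chrom == chrom and start - flank_bp <= position <= end + flank_bp
    ]

# Made-up coordinates for illustration
genes = [("GENE_A", "1", 100_000, 150_000), ("GENE_B", "1", 180_000, 220_000)]
for flank in (0, 5_000, 20_000):
    print(flank, map_snp_to_genes("1", 160_000, genes, flank_bp=flank))
```

With a wider window, a single SNP can fall within range of more than one gene, as the 20 kb case shows here.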

Pathways/gene set databases

A pathway/gene set is a group of genes involved in the same pathway. Pathway-based analysis searches a collection of gene sets to identify the pathways associated with traits.

KEGG

Pathways/gene sets from KEGG (http://www.genome.jp/kegg/pathway.html).

BioCarta

Pathways/gene sets from BioCarta (http://www.biocarta.com/genes/index.asp).

GO biological process

Level 4 GO (Gene Ontology) terms of the biological process domain, with curation from MSigDB (http://www.broadinstitute.org/gsea/msigdb/index.jsp) v3.0.

GO molecular function

Level 4 GO (Gene Ontology) terms of the molecular function domain, with curation from MSigDB (http://www.broadinstitute.org/gsea/msigdb/index.jsp) v3.0.

Upload your own gene sets file

Users can upload their own gene sets (up to 100 gene sets). The format requirements for the gene set file are:

1) a text file without a header line;
2) one gene set per line, with tab-separated fields;
3) the first column is the gene set ID, the second column is the gene set description (use "na" or leave it blank if not available), and the remaining columns are gene HUGO symbols (http://www.genenames.org/), e.g.:

GO0045726	na	NOX1	P61812	Q9Y5S8	TGFB2
GO0016045	na	CD1D	NLRC4	NOD1	NOD2	O75594	P15813	PARG
GO0048536	na	BCL3	JARID2	NFKB2	NKX3-2	P20749	P31314	P78367
GO0010460	na	ADRA1A	ADRA1B	ADRB1	B1N7G2	B1N7G7	CHRNA7
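Before uploading, a gene set file can be checked against these requirements with a short script. The following Python sketch is one possible reading of the rules; `parse_gene_sets` is a hypothetical helper, and the server may validate files differently.

```python
def parse_gene_sets(text, max_sets=100):
    """Parse 'ID<TAB>description<TAB>gene1<TAB>gene2...' lines (no header).

    Returns {set_id: (description, [genes])}. A description of "na" or an
    empty second column is stored as an empty string.
    """
    gene_sets = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: need ID, description and at least one gene")
        set_id = fields[0].strip()
        description = fields[1].strip()
        if description.lower() == "na":
            description = ""
        genes = [g.strip() for g in fields[2:] if g.strip()]
        gene_sets[set_id] = (description, genes)
    if len(gene_sets) > max_sets:
        raise ValueError(f"at most {max_sets} gene sets are allowed")
    return gene_sets

# First example line from above
example = "GO0045726\tna\tNOX1\tP61812\tQ9Y5S8\tTGFB2\n"
parsed = parse_gene_sets(example)
print(parsed)
```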

Number of genes in gene set

The size of gene sets can be restricted to avoid overly narrow or overly broad functional categories. The default minimum and maximum numbers of genes in a gene set are 5 and 100, respectively.
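The size restriction amounts to a simple filter, sketched below in Python. It assumes that the bounds are inclusive and that size is counted over distinct gene symbols; neither detail is stated in this tutorial, and `filter_by_size` is a hypothetical helper.

```python
def filter_by_size(gene_sets, min_genes=5, max_genes=100):
    """Keep gene sets whose number of distinct genes lies in [min_genes, max_genes]."""
    return {
        set_id: genes
        for set_id, genes in gene_sets.items()
        if min_genes <= len(set(genes)) <= max_genes
    }

# Toy gene sets with placeholder gene names
gene_sets = {
    "too_small": ["G1", "G2", "G3"],
    "ok": ["G1", "G2", "G3", "G4", "G5"],
    "too_big": [f"G{i}" for i in range(150)],
}
print(sorted(filter_by_size(gene_sets)))
```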

FDR cutoff for PBA

False discovery rate (FDR) cutoff used to correct for multiple testing in pathway-based analysis. The default FDR value is 0.05.

2.3 Result web link and e-mail notice

While the job is running, the running page shows a web link that you can bookmark to return to the result page. You can also use the e-mail function to be notified when your job is finished.

3. Output and display (example of result page)

The output interface shows a list of hypotheses derived from the ICSNPathway analysis and two tables. Each hypothesis is expressed as [SNP (functional class) -> gene -> pathway(s)]. One table lists the candidate causal SNPs and the other lists the candidate causal pathways, each with detailed information and URLs to external databases. For candidate causal SNPs, the external URLs include web links to dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), Ensembl (http://www.ensembl.org/), and SNPedia (http://www.snpedia.com/; this resource is blank for some SNPs). For candidate causal pathways, the external URLs include web links to the original database of the pathways/gene sets and to the detailed information on the genes and SNPs in the pathway. The result provides evidence on what the candidate causal SNPs are and how they affect the trait, i.e. through which biological mechanism.

The output interface also contains a download link, from which a summary of the results can be downloaded.

4. Pathway-based analysis (PBA)

We implement a PBA algorithm, named i-GSEA (improved gene set enrichment analysis), on the full list of GWAS SNP P-values to detect pathways associated with traits. Briefly:

(a) Each SNP is mapped to its nearest gene according to the SNP and gene locations in the Ensembl 61 database (http://www.ensembl.org/biomart/martview), and the maximum t = -log(P-value) of the SNPs mapped to a gene is assigned to represent the gene. All genes are then ranked in decreasing order of their representative values t.
(b) For each pathway S, an ES (enrichment score, i.e. a Kolmogorov-Smirnov-like running-sum statistic with weight 1) is calculated, which measures the tendency of the genes of a pathway to be located at the top of the ranked gene list.
(c) ES is converted to SPES (significant proportion based enrichment score) by multiplying it by m₁/m₂, where m₁ is the proportion of significant genes (defined as genes mapped with at least one of the 5% most significant SNPs of the GWAS) in pathway S, and m₂ is the proportion of significant genes among all the genes in the GWAS.
(d) SNP label permutation and normalization are employed to generate the distribution of SPES and to correct for gene variation (the bias due to different genes having different numbers of mapped SNPs) and pathway variation (the bias due to different pathways consisting of different numbers of genes).
(e) Based on the distributions of SPES generated by permutation, a nominal P-value is calculated and the false discovery rate (FDR) is computed for multiple testing correction.
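Steps (a) to (c) can be illustrated with a small Python sketch. It is a simplified reading of the description above, not the i-GSEA implementation: it assumes the SNP-to-gene mapping is already given, takes ES as the maximum positive deviation of the running sum, and omits the permutation, normalization, and FDR steps (d) and (e). All function names and the toy values are hypothetical.

```python
def gene_scores(snp_t, snp_to_gene):
    """Step (a): represent each gene by the maximum t of its mapped SNPs."""
    scores = {}
    for snp, t in snp_t.items():
        gene = snp_to_gene.get(snp)
        if gene is not None:
            scores[gene] = max(t, scores.get(gene, float("-inf")))
    return scores

def enrichment_score(scores, pathway):
    """Step (b): weighted running sum over genes ranked by decreasing t."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = [g for g in ranked if g in pathway]
    n_miss = len(ranked) - len(hits)
    hit_weight = sum(scores[g] for g in hits)
    if not hits or n_miss == 0 or hit_weight == 0:
        return 0.0
    running = best = 0.0
    for g in ranked:
        if g in pathway:
            running += scores[g] / hit_weight  # weight 1: step size proportional to t
        else:
            running -= 1.0 / n_miss
        best = max(best, running)
    return best

def significant_genes(snp_t, snp_to_gene, top_fraction=0.05):
    """Genes mapped with at least one of the top 5% most significant SNPs."""
    n_top = max(1, int(len(snp_t) * top_fraction))
    top_snps = sorted(snp_t, key=snp_t.get, reverse=True)[:n_top]
    return {snp_to_gene[s] for s in top_snps if s in snp_to_gene}

def spes(snp_t, snp_to_gene, pathway):
    """Step (c): SPES = ES * m1 / m2."""
    scores = gene_scores(snp_t, snp_to_gene)
    sig = significant_genes(snp_t, snp_to_gene)
    in_pathway = [g for g in scores if g in pathway]
    if not in_pathway or not sig:
        return 0.0
    m1 = sum(g in sig for g in in_pathway) / len(in_pathway)  # within pathway S
    m2 = len(sig) / len(scores)                               # among all genes
    return enrichment_score(scores, pathway) * m1 / m2

# Toy data: t values are already -log(P-value); names are placeholders
snp_t = {"s1": 6.0, "s2": 1.0, "s3": 4.0, "s4": 0.5, "s5": 2.0}
snp_to_gene = {"s1": "G1", "s2": "G1", "s3": "G2", "s4": "G3", "s5": "G4"}
pathway = {"G1", "G2"}
print(enrichment_score(gene_scores(snp_t, snp_to_gene), pathway))
print(spes(snp_t, snp_to_gene, pathway))
```

In this toy case both pathway genes sit at the top of the ranking, and half of the pathway genes are significant compared with a quarter of all genes, so the ratio m₁/m₂ = 2 doubles the enrichment score.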

A full description of i-GSEA is given in:

i-GSEA4GWAS: a web server for identification of pathways/gene sets associated with traits by applying an improved gene set enrichment analysis to genome-wide association study. Kunlin Zhang, Sijia Cui, Suhua Chang, Liuyan Zhang, Jing Wang, Nucleic Acids Research (2010) 38 (suppl 2): W90-W95.

Abstract
http://nar.oxfordjournals.org/cgi/content/abstract/gkq324?ijkey=rpopfGT8xvkjfO5&keytype=ref

Full Text
http://nar.oxfordjournals.org/cgi/content/full/gkq324?ijkey=rpopfGT8xvkjfO5&keytype=ref