Welcome to BIGA. Despite being in the early stages of development, our platform is equipped with a diverse range of tools and methods for genetic and epidemiological analysis. We recognize the dynamic and continuous development in these fields. Therefore, our team is committed to updating and expanding our platform to incorporate more advanced and useful tools. We are always interested in hearing back from you. If you have recommendations for additional tools or methods that could enrich our platform, please feel free to share them with us. Your insights and expertise will undoubtedly contribute to the development of our platform and benefit other users worldwide. This is an ongoing project for us, so stay tuned for more exciting updates and advancements in the future. We appreciate your patience and your support as we work diligently to implement more methods and improve your experience on our site.
BIGA GWAS Team
This guide is created as a reference for navigating our platform's diverse features with ease. In this manual, you will find an outline of the data specifications needed to get started, a walkthrough of the platform's backend configurations, and an explanation of all the functionalities.
Index | Column Name | Description | Type |
1 | chr | Indicates the chromosome where the single nucleotide polymorphism (SNP) is located | Require either chr,position or snp column. |
2 | position | Refers to the specific base pair position of the SNP on the chromosome | |
3 | snp | Identifier for the SNP (e.g.,rsID) | |
4 | a1 | Effect allele (allele associated with the risk of disease) | Mandatory |
5 | a2 | Other allele (non-effect allele) | Mandatory |
6 | beta | Refers to the estimated effect size of a particular genetic variant on a trait or disease. (beta > 0 means that effect allele (a1) increases disease risk). | Require either beta or odds_ratio column |
6 | odds_ratio | Odds ratio (1 indicates no effect; above 1 indicates effect allele (a1) increases disease risk) | |
7 | se | Standard error, a smaller standard error indicates a more reliable estimate of the beta/odds_ratio coefficient | Mandatory |
8 | eaf | Effect allele frequency, the frequency of the effect allele (a1) in the study population | Optional |
9 | pvalue | Indicates the statistical significance of the association between the SNP and the trait being studied | Mandatory |
10 | n | Sample size, the number of individuals or samples included in the study | Mandatory |
11 | z | Z-score (0 means no association between a genetic variant and disease risk; above than 0 indicates an increased risk associated with the variant) | Optional |
chr
position
snp
pvalue
a1
a2
n
se
odds_ratio
beta
z
eaf
snp
column for rsID and instead includes chromosome and position columns.liftOver
Python package with the
chainfile provided by UCSC (https://hgdownload.cse.ucsc.edu/goldenpath/hg19/liftOver/, hg19ToHg38.over.chain.gz). Rows that cannot be aligned to a new position in GRCh38 will be removed.position
field in the uploaded file uses a 1-based coordinate system.
Rows containing chromosome X, Y, or M will also be excluded from the process. The pipeline currently
supports human genome version GRCh37
and GRCh38
. If your data is created based on an older
version of the human genome e.g. NCBI36, you will need to update the genomic positions using the liftOver
tool from UCSC. However, if your data includes snp
column, position update is not
required. BIGA will automatically update the chromosome and position information to align with the latest
dbSNP build 156 version.
In analyses like LDSC, LAVA, POPCORN, and SumHer, the effect_allele
and
other_allele
columns are interpreted as A1
and A2
, respectively. If
both odds_ratio
and beta
are present, LDSC uses beta
as the signed
summary statistics.
snp
, chr
,
position
, pvalue
, eaf
, a1
, a2
,
n
, odds_ratio
or beta
, and se
. The harmonized file from
the GWAS Catalog might present diverse column names to represent harmonized columns. For instance, beta
could be labeled as hm_beta
or beta
. BIGA first checks for hm_beta
;
if absent, it defaults to the beta
column. Queries will be rejected if the proportion of
missing values in any harmonized column exceeds 50% or if essential fields are missing.
ID
as snp
, CHROM
as
chr
, POS
as position
, LP
as negative log p-value, AF1
as eaf
, ALT
as a1
, REF
as a2
,
ES
as beta
(If the trait is binary, logOR
will be interpreted as beta
), and SE
as se
. Given that IEU OpenGWAS data might
not be harmonized, BIGA performs the harmonization step for any data acquired from this source. Since the
original file includes a snp
column, specifying the human genome build version is unnecessary.
Additionally, data missing the required columns or with more than 80% missing values will not be
successfully queried.bmi
, into the search field. This action will display all traits containing
bmi
. Select the trait that suits your research needs. Given that this data is already
harmonized and stored locally within BIGA, querying data from Neale Lab is most time-saving among the
three external resources.hm_code
appended at the last column in your harmonized summary statistics.
Each row has a harmonization code. Please note that chromosome X, Y and MT are not included in our harmonized summary statistics.Code | Description of Code |
1 | Palindromic; Infer strand; Forward strand; Alleles correct |
2 | Palindromic; Infer strand; Forward strand; Flipped alleles |
3 | Palindromic; Infer strand; Reverse strand; Alleles correct |
4 | Palindromic; Infer strand; Reverse strand; Flipped alleles |
5 | Palindromic; Assume forward strand; Alleles correct |
6 | Palindromic; Assume forward strand; Flipped alleles |
7 | Palindromic; Assume reverse strand; Alleles correct |
8 | Palindromic; Assume reverse strand; Flipped alleles |
9 | Palindromic; Drop palindromic; Not orientated |
10 | Forward strand; Alleles correct |
11 | Forward strand; Flipped alleles |
12 | Reverse strand; Alleles correct |
13 | Reverse strand; Flipped alleles |
14 | Required fields are not known; Not orientated |
15 | No matching variants in reference VCF; Not orientated |
16 | Multiple matching variants in reference VCF; Not orientated |
17 | Palindromic; Infer strand; EAF or reference VCF AF not known; Not orientated |
18 | Palindromic; Infer strand; EAF < specified minor allele frequency threshold; Not orientated |
To simplify and clarify the steps for querying summary statistics data from the GWAS Catalog, we can break down the process into three steps:
• Go to the GWAS Catalog's search panel.
• Enter the name or description of the phenotype or disease you're interested in.
• Navigate to the traits page and then find the trait you are interested in.
• The first instruction figure below gives an example of finding traits related to stroke.
• Navigate to the study page for your chosen trait, which will be labeled with "Study: [Trait ID]". This page will provide important details like sample size and population, which are necessary for your query.
• On the studies page, look for the "Full Summary Statistics" section.
• If there's an "FTP Download" link, this indicates that the original data is available.
• Click this link to check if the data has been harmonized. A harmonized dataset will have a "harmonised" folder on the FTP site, indicating it's ready for querying with BIGA.
• The second instruction figure below will help you identify the "FTP Download" link and the "harmonised" folder.
• In the table below, we list all available summary statistics that is query-able by BIGA.
• Search this table using the Trait ID from Step 1.
• If your trait is listed, it confirms that the data can be queried using BIGA.
Note: This table is updated daily at 5:30am EST.
To simplify and clarify the steps for querying summary statistics data from the IEU OpenGWAS, we can break down the process into two steps:
• Go to the IEU OpenGWAS search panel as shown in the first instruction figure below.
• At the section of Trait contain, enter the name or description of the phenotype or disease you're interested in.
• Navigate to the data page for your chosen trait, which will be labeled with "Phenotype Name and Dataset: [Trait ID]". This page will provide important details like sample size, population and human genome build version, which are necessary for your query.
• The data page is similar to the second instruction page below.
• If the "Download VCF" and "Download Index" blue buttons appear, the summary statistics data is query-able by BIGA.
• Or you can search trait ID or trait description directly using the table below, we list all available summary statistics that is query-able by BIGA.
• To proceed, enter the trait ID, population, and sample size into the query panel, and select the appropriate human genome build version.
Note: The table is updated monthly at 1:30 AM EST on the 11th day of each month.
The Neale Lab datasets have undergone data harmonization. More details about our data processing can be found at UK Biobank (Neale Lab).
The table below lists all the Neale Lab summary statistics data available for querying.
To find the data you are interested in, you can directly search by phenotype name or trait ID. In the panel of querying,
simply entering the trait ID is sufficient.
For more detailed information about the traits, please refer to Neale Lab's website.
CHR
, SNP
, POS
,
A1
, A2
, N
, AF1
, BETA
, SE
,
P
where position is based on the GRCh37 version. These columns are reinterpreted in BIGA's
format as chr
, snp
, position
, a1
, a2
,
n
, eaf
, beta
, se
, pvalue
.
Since the snp
column is provided, BIGA maps chromosomes and positions by referring to the latest
dbSNP build 156 version file when the snp
is the same. Next, BIGA infers the
orientation (forward or reverse) of palindromic variants by randomly selecting 5% of variants for each
chromosome separately, as shown in the log file.
Chromosome 1: Forward strand rate: 0.998836, Reverse strand rate: 0.001164
Chromosome 2: Forward strand rate: 0.999861, Reverse strand rate: 0.000139
Chromosome 3: Forward strand rate: 0.999919, Reverse strand rate: 8.1e-05
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 0.996204, Reverse strand rate: 0.003796
Chromosome 7: Forward strand rate: 0.998482, Reverse strand rate: 0.001518
Chromosome 8: Forward strand rate: 0.999525, Reverse strand rate: 0.000475
Chromosome 9: Forward strand rate: 0.998611, Reverse strand rate: 0.001389
Chromosome 10: Forward strand rate: 0.994752, Reverse strand rate: 0.005248
Chromosome 11: Forward strand rate: 0.994062, Reverse strand rate: 0.005938
Chromosome 12: Forward strand rate: 0.999817, Reverse strand rate: 0.000183
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 0.993571, Reverse strand rate: 0.006429
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 0.99989, Reverse strand rate: 0.00011
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 22: Forward strand rate: 1.0, Reverse strand rate: 0.0
Overall Forward strand rate: 0.998702, Reverse strand rate:0.001298
The script above is the log file of ukb_phase1to3_roi_volume101_june_2020_pheno_9
trait (one
trait from the Regional brain volumes group), it's observed that the forward strand rate is above 99%. Therefore,
palindromic variants are inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 380854, 'Forward strand; Flipped orientation; Requires harmonisation': 86197, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 67744, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 15806, 'Reverse strand; Correct orientation; Already harmonised': 1193, 'Reverse strand; Flipped orientation; Requires harmonisation': 234})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds
ratio
, z-score
if provided.beta
, odds ratio
, z-score
if provided.beta
,
odds ratio
, z-score
if provided.chr
and position
column. To
accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer for a human genome version of
GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
. Then,
we get the unified format of input file in BIGA.
CHR
, BP
,
A1
, A2
, N
, EAF
, BETA
, SE
,
P
where position is based on the GRCh37 version. These columns are reinterpreted in BIGA's
format as chr
,position
, a1
, a2
, n
, eaf
,
beta
, se
, pvalue
. Without a snp
column provided, the
snp
column is retrieved during the harmonization step. BIGA maps GRCh37
to GRCh38
locations using a liftover chainfile from UCSC, as the reference file dbSNP build 156 version is built on
GRCh38
version. Next, BIGA infers the orientation (forward or reverse) of palindromic variants
by randomly selecting 5% of variants for each chromosome separately, as shown in the log file.
Chromosome 1: Forward strand rate: 0.999728, Reverse strand rate: 0.000272
Chromosome 2: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 3: Forward strand rate: 0.999973, Reverse strand rate: 2.7e-05
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 0.999323, Reverse strand rate: 0.000677
Chromosome 7: Forward strand rate: 0.9998, Reverse strand rate: 0.0002
Chromosome 8: Forward strand rate: 0.999858, Reverse strand rate: 0.000142
Chromosome 9: Forward strand rate: 0.999733, Reverse strand rate: 0.000267
Chromosome 10: Forward strand rate: 0.999042, Reverse strand rate: 0.000958
Chromosome 11: Forward strand rate: 0.999109, Reverse strand rate: 0.000891
Chromosome 12: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 0.999096, Reverse strand rate: 0.000904
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 0.99986, Reverse strand rate: 0.00014
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 22: Forward strand rate: 1.0, Reverse strand rate: 0.0
Overall Forward strand rate: 0.999778, Reverse strand rate:0.000222
The script above is the log file of Z47.gwas.imputed_v3.both_sexes
trait (one trait from the
Neale Lab group), it can be observed that the forward strand rate is above 99%. Therefore, palindromic variants
are inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 684193, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 121370, 'Forward strand; Flipped orientation; Requires harmonisation': 1043, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 216, 'Reverse strand; Correct orientation; Already harmonised': 174, 'Reverse strand; Flipped orientation; Requires harmonisation': 143})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds
ratio
, z-score
if provided.beta
, odds ratio
, z-score
if
provided.beta
,
odds ratio
, z-score
if provided.chr
and position
column. To
accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer, which all require a human genome version of
GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
. Then,
we get the unified format of input file in BIGA.CHR
,SNP
, POS
,
A1
, A2
, N
, AF1
, BETA
, SE
,
P
where position is based on the GRCh37 version. These columns are reinterpreted in BIGA's
format as chr
, snp
, position
, a1
, a2
,
n
, eaf
, beta
, se
, pvalue
. With the snp
column provided, BIGA maps snp
to chromosome and position by referencing the latest reference
file. Next, BIGA infers the orientation (forward or reverse) of palindromic variants by randomly selecting 5%
of variants for each chromosome separately, as shown in the log file.
Chromosome 1: Forward strand rate: 0.998876, Reverse strand rate: 0.001124
Chromosome 2: Forward strand rate: 0.999965, Reverse strand rate: 3.5e-05
Chromosome 3: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 0.996888, Reverse strand rate: 0.003112
Chromosome 7: Forward strand rate: 0.99783, Reverse strand rate: 0.00217
Chromosome 8: Forward strand rate: 0.99926, Reverse strand rate: 0.00074
Chromosome 9: Forward strand rate: 0.998916, Reverse strand rate: 0.001084
Chromosome 10: Forward strand rate: 0.995177, Reverse strand rate: 0.004823
Chromosome 11: Forward strand rate: 0.994628, Reverse strand rate: 0.005372
Chromosome 12: Forward strand rate: 0.999821, Reverse strand rate: 0.000179
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 0.994196, Reverse strand rate: 0.005804
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 22: Forward strand rate: 0.999575, Reverse strand rate: 0.000425
Overall Forward strand rate: 0.998788, Reverse strand rate:0.001212
The script above is the log file of ukb_smith_dMRI_2_imputed_pheno9
trait (one trait from the
dMRI group of UKB Oxford), it's observed that the forward strand rate is above 99%. Therefore, palindromic
variants are inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 380854, 'Forward strand; Flipped orientation; Requires harmonisation': 86199, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 67743, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 15807, 'Reverse strand; Correct orientation; Already harmonised': 1193, 'Reverse strand; Flipped orientation; Requires harmonisation': 234})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds
ratio
, z-score
if provided.beta
, odds ratio
, z-score
if provided.beta
,
odds ratio
, z-score
if provided.chr
and position
column.
To accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer, which all require a human genome version
of GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
.
Then, we get the unified format of input file in BIGA.
CHR
,SNPID
, POS
,
Allele1
, Allele2
, N
, AF_Allele2
, BETA
,
SE
, p.value
where position is based on the GRCh37 version. Please note that Allele2
is the effect allele. These columns are reinterpreted in BIGA's format as chr
, snp
,
position
, a2
, a1
, n
, eaf
, beta
,
se
, pvalue
. With the snp
column provided, BIGA maps snp
to chromosome and position by referencing the latest reference file. Next, BIGA infers the
orientation (forward or reverse) of palindromic variants by randomly selecting 5% of variants for each
chromosome separately, as shown in the log file.
Chromosome 1: Forward strand rate: 0.999125, Reverse strand rate: 0.000875
Chromosome 2: Forward strand rate: 0.999894, Reverse strand rate: 0.000106
Chromosome 3: Forward strand rate: 0.99997, Reverse strand rate: 3e-05
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 0.997778, Reverse strand rate: 0.002222
Chromosome 7: Forward strand rate: 0.998396, Reverse strand rate: 0.001604
Chromosome 8: Forward strand rate: 0.999838, Reverse strand rate: 0.000162
Chromosome 9: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 10: Forward strand rate: 0.997203, Reverse strand rate: 0.002797
Chromosome 11: Forward strand rate: 0.998023, Reverse strand rate: 0.001977
Chromosome 12: Forward strand rate: 0.999909, Reverse strand rate: 9.1e-05
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 0.99764, Reverse strand rate: 0.00236
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 22: Forward strand rate: 0.999809, Reverse strand rate: 0.000191
Overall Forward strand rate: 0.999342, Reverse strand rate:0.000658
The script above is the log file of GWASsummary_Zoster_Japanese_SakaueKanai2020
trait (one
trait from the Biobank Japan), it's observed that the forward strand rate is above 99%. Therefore,
palindromic variants are inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 628517, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 103469, 'Reverse strand; Correct orientation; Already harmonised': 1256, 'Forward strand; Flipped orientation; Requires harmonisation': 779, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 356, 'Reverse strand; Flipped orientation; Requires harmonisation': 25})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds ratio
,
z-score
if provided.beta
, odds ratio
, z-score
if provided.beta
,
odds ratio
, z-score
if provided.chr
and position
column.
To accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer, which all require a human genome version
of GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
.
Then, we get the unified format of input file in BIGA.
CHROM
,POS
, ALLELE0
,
ALLELE1
, N
, A1FREQ
, BETA
, SE
,
P
where position is based on the GRCh37 version. Please note that ALLELE1
is the
effect allele. These columns are reinterpreted in BIGA's format as chr
, position
,
a2
, a1
, n
, eaf
, beta
, se
,
pvalue
. Without a snp
column provided, the snp
column is retrieved
during the harmonization step. BIGA maps hg19 to hg38 locations using a liftover chainfile from UCSC, as the
reference file dbSNP build 156 version is built on GRCh38
version. Next, BIGA infers the
orientation (forward or reverse) of palindromic variants by randomly selecting 5% of variants for each
chromosome separately, as shown in the log file.
Chromosome 1: Forward strand rate: 0.999762, Reverse strand rate: 0.000238
Chromosome 2: Forward strand rate: 0.99998, Reverse strand rate: 2e-05
Chromosome 3: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 0.999208, Reverse strand rate: 0.000792
Chromosome 7: Forward strand rate: 0.999884, Reverse strand rate: 0.000116
Chromosome 8: Forward strand rate: 0.999969, Reverse strand rate: 3.1e-05
Chromosome 9: Forward strand rate: 0.999528, Reverse strand rate: 0.000472
Chromosome 10: Forward strand rate: 0.998169, Reverse strand rate: 0.001831
Chromosome 11: Forward strand rate: 0.999345, Reverse strand rate: 0.000655
Chromosome 12: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 0.999345, Reverse strand rate: 0.000655
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 22: Forward strand rate: 1.0, Reverse strand rate: 0.0
Overall Forward strand rate: 0.999754, Reverse strand rate:0.000246
The script above is the log file of ZPR1_O75312_OID30801_v1_Neurology_II
trait (one trait from
the UKB-PPP), it is observed that the forward strand rate is above 99%. Therefore, palindromic variants are
inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 784340, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 139553, 'Forward strand; Flipped orientation; Requires harmonisation': 932, 'Reverse strand; Correct orientation; Already harmonised': 205, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 187, 'Reverse strand; Flipped orientation; Requires harmonisation': 178})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds ratio
,
z-score
if provided.beta
, odds ratio
, z-score
if provided.beta
, odds ratio
, z-score
if provided.chr
and position
column.
To accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer, which all require a human genome version
of GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
.
Then, we get the unified format of input file in BIGA.
CHR
,SNP
, POS
,
effect_allele
, other_allele
, N
, effect_allele_freqency
,
beta
, se
, P
where position is based on the GRCh37 version. These
columns are reinterpreted in BIGA's format as chr
, snp
, position
,
a1
, a2
, n
, eaf
, beta
, se
,
pvalue
. With the snp
column provided, BIGA maps snp
to chromosome and
position by referencing the latest reference file. Next, BIGA infers the orientation (forward or reverse) of
palindromic variants by randomly selecting 5% of variants for each chromosome separately, as shown in the
log file.
Chromosome 1: Forward strand rate: 0.999709, Reverse strand rate: 0.000291
Chromosome 2: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 3: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 7: Forward strand rate: 0.999907, Reverse strand rate: 9.3e-05
Chromosome 8: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 9: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 10: Forward strand rate: 0.999918, Reverse strand rate: 8.2e-05
Chromosome 11: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 12: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 0.999802, Reverse strand rate: 0.000198
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 22: Forward strand rate: 0.999707, Reverse strand rate: 0.000293
Overall Forward strand rate: 0.999958, Reverse strand rate:4.2e-05
The script above is the log file of finngen_R9_Z21_PROCREATIVE_MANAG_GRCh37_CrossMap
trait (one
trait from the FinnGen), it's observed that the forward strand rate is above 99%. Therefore, palindromic
variants are inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 945824, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 166105, 'Forward strand; Flipped orientation; Requires harmonisation': 45, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 6, 'Reverse strand; Correct orientation; Already harmonised': 5})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds ratio
,
z-score
if provided.beta
, odds ratio
, z-score
if provided.chr
and position
column.
To accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer, which all require a human genome version
of GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
.
Then, we get the unified format of input file in BIGA.
snp
, chr
, position
, pvalue
, eaf
,
a1
, a2
, n
, odds_ratio
or beta
, and
se
. The harmonized file from the GWAS Catalog might present diverse column names to represent
harmonized columns. For instance, beta
could be labeled as `hm_beta` or `beta`. BIGA first
checks for hm_beta
; if absent, it defaults to the beta
column.GRCh37
, BIGA converts the position from GRCh38
to GRCh37
. Then, we
get the unified format of input file in BIGA.
chr
,pos
, Other_allele
, Effect_allele
, af_Effect_allele
, beta
, se
, -log10P
, P
, N
where position is based on the GRCh37 version. pvalue
is calculated from -log10P
. These columns are reinterpreted in BIGA's format as chr
, position
, a2
, a1
, eaf
, beta
, se
, pvalue
, n
. Without a snp
column provided, the snp
column is retrieved during the harmonization step. BIGA maps hg19 to hg38 locations using a liftover chainfile from UCSC, as the reference file dbSNP build 156 version is built on GRCh38
version. Next, BIGA infers the orientation (forward or reverse) of palindromic variants by randomly selecting 5% of variants for each chromosome separately, as shown in the log file.
Chromosome 1: Forward strand rate: 0.999758, Reverse strand rate: 0.000242
Chromosome 2: Forward strand rate: 0.99998, Reverse strand rate: 2e-05
Chromosome 3: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 4: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 5: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 6: Forward strand rate: 0.999614, Reverse strand rate: 0.000386
Chromosome 7: Forward strand rate: 0.999405, Reverse strand rate: 0.000595
Chromosome 8: Forward strand rate: 0.999934, Reverse strand rate: 6.6e-05
Chromosome 9: Forward strand rate: 0.999876, Reverse strand rate: 0.000124
Chromosome 10: Forward strand rate: 0.998454, Reverse strand rate: 0.001546
Chromosome 11: Forward strand rate: 0.99927, Reverse strand rate: 0.00073
Chromosome 12: Forward strand rate: 0.999929, Reverse strand rate: 7.1e-05
Chromosome 13: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 14: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 15: Forward strand rate: 0.998746, Reverse strand rate: 0.001254
Chromosome 16: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 17: Forward strand rate: 0.999937, Reverse strand rate: 6.3e-05
Chromosome 18: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 19: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 20: Forward strand rate: 1.0, Reverse strand rate: 0.0
Chromosome 21: Forward strand rate: 0.999867, Reverse strand rate: 0.000133
Chromosome 22: Forward strand rate: 0.999744, Reverse strand rate: 0.000256
Overall Forward strand rate: 0.999748, Reverse strand rate:0.000252
The script above is the log file of R_subiculum_head
trait (one
trait from the CHIMGEN), it's observed that the forward strand rate is above 99%. Therefore, palindromic
variants are inferred to be on the forward strand.
Chromosome 6, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 775133, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 144108, 'Forward strand; Flipped orientation; Requires harmonisation': 1035, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 216, 'Reverse strand; Correct orientation; Already harmonised': 197, 'Reverse strand; Flipped orientation; Requires harmonisation': 171})
The above is the log file that summarizes the result of chromosome 6 during the harmonization.beta
, odds ratio
,
z-score
if provided.beta
, odds ratio
, z-score
if provided.beta
,
odds ratio
, z-score
if provided.chr
and position
column.
To accommodate the requirements of LDSC, LAVA, Popcorn, and SumHer, which all require a human genome version
of GRCh37
, BIGA converts the position data from GRCh38
to GRCh37
.
Then, we get the unified format of input file in BIGA.
munge_sumstats.py
from the LDSC software. The script parameters differ based on columns provided in the input file:
beta
and odds_ratio
are present, BIGA prioritizes beta. The
followings are the BIGA's default running script:python2 $ldsc_path/munge_sumstats.py --sumstats $summary_statistics1 --out $output --merge-alleles $ldsc_path/w_hm3.snplist --chunksize 100000 --ignore 'odds_ratio'
If only one signed summary statistic is present:
python2 $ldsc_path/munge_sumstats.py --sumstats $summary_statistics1 --out $output --merge-alleles $ldsc_path/w_hm3.snplist --chunksize 100000
BIGA's format does not require an INFO column. LDSC used the HapMap3 variants (with the --merge-alleles
flag) and the variants in the major histocompatibility complex region were removed. GWAS summary statistics datasets have the sample size column, making the --N
flag
unnecessary.--w-ld-chr
and --ref-ld-chr
flags) were from the 1000 Genomes data and were downloaded from LDSC. Specifically:
--w-ld-chr eur_w_ld_chr/ --ref-ld-chr eur_w_ld_chr/
--w-ld-chr eas_ldscores/ --ref-ld-chr eas_ldscores/
python2 $ldsc_path/ldsc.py --ref-ld-chr $ref_chr --w-ld-chr $w_chr --out $ldsc_path/log/$job_id/one --rg $summary_statistics1,$summary_statistics2'ukb_phase1to3_ica_node_may_2022_imputed_pheno10_harmo.sumstats.gz',$summary_statistics2'ukb_phase1to3_ica_node_may_2022_imputed_pheno41_harmo.sumstats.gz',$summary_statistics2'ukb_phase1to3_ica_node_may_2022_imputed_pheno29_harmo.sumstats.gz',...
cd $ldsc_path/log/$job_id/
mkdir less
for file in *log
do
sed -n -e '/Summary of Genetic Correlation Results/,/Analysis finished/ p' ${file} > less/less_${file}
awk 'NR>2 && NR<79{print}' less/less_${file} > less/temp.txt ; mv less/temp.txt less/less_${file}
done
ref-ld-chr $ref_chr
and --w-ld-chr $w_chr
specify the reference LD scores and their weights, which vary by the population. --out
$ldsc_path/log/$job_id/one
determines the output directory and filename prefix for the LDSC results
for each job. --rg
is used to instruct LDSC to calculate genetic correlation. $summary_statistics1
is the path to the input file. $summary_statistics2
is the path prefix of the curated dataset.
Each file in this dataset represents different phenotypes, and LDSC will calculate the genetic correlations
between the input file and the curated datasets.
sed -n -e '/Summary of Genetic Correlation
Results/,/Analysis finished/ p' ${file}
: This command extracts a specific section from each log file,
starting from where it finds the phrase "Summary of Genetic Correlation Results" and ending at "Analysis
finished". This section is then redirected to a new file in the less directory. awk 'NR>2 &&
NR<79{print}' less/less_${file}
: This awk command further processes each extracted section, selecting
only specific lines (from line 3 to 78), which presumably contain the relevant summarized data. The result
is temporarily saved in less/temp.txt. Afterwards, BIGA will save all the results into the database.
chr
(chromosome),
position
, a1
(allele 1), a2
(allele 2), n
(sample size),
and statistical measures like beta
or odds_ratio
and se
(standard
error) are extracted from this universal format. These are then transformed into a format suitable for
SumHer: chr:position
is labeled as Predictor
, a1
as A1
,
a2
as A2
, and n
remains. The statistical measures are converted into
Z statistics, either using beta/se
or log(odds_ratio)/se
. Additionally, the snp
column filters out SNPs that are not part of the HapMap3 dataset (approximately 1.2 million SNPs). After preparing
these columns, BIGA removes any duplicate Predictors to retain only the first occurrence and keeps rows
with single-character alleles (A, C, G, T).--allow-ambiguous
YES
. --check-sums NO
to avoid related errors.
(arr=("finngen_R9_O15_PREG_OTHER_MAT_DISORD_GRCh37_CrossMap_harmo.txt",...,"finngen_R9_PAIN_GRCh37_CrossMap_harmo.txt")
for trait2 in "${arr[@]}"; do
out=$sumHer_path/log/$job_id/$trait2
$sumHer_path/ldak5.2.$envir --summary $summary_statistics1 --summary2 $summary_statistics2/$trait2 --sum-cors $out --tagfile $sumHer_path/ldak.thin.hapmap.$popu.tagging --check-sums NO --allow-ambiguous YES
done) &
(arr=("finngen_R9_M13_ARTHROSIS_KNEE_GRCh37_CrossMap_harmo.txt" ,...,"finngen_R9_E4_LIPOPROT_GRCh37_CrossMap_harmo.txt")
for trait2 in "${arr[@]}"; do
out=$sumHer_path/log/$job_id/$trait2
$sumHer_path/ldak5.2.$envir --summary $summary_statistics1 --summary2 $summary_statistics2/$trait2 --sum-cors $out --tagfile $sumHer_path/ldak.thin.hapmap.$popu.tagging --check-sums NO --allow-ambiguous YES
done) &
wait
$summary_statistics1
represents the input summary data, either uploaded or queried by the user.
$summary_statistics2
is the directory where BIGA's curated data is stored. $trait2
refers to specific trait filenames from the FinnGen dataset in this case. $popu
is a variable
indicating the population-specific GWAS dataset, which can be either 'gbr' for the European population or
'eas' for East Asian. The script employs a parallel processing approach, indicated by &
at the
end of each command. Two arrays of filenames (`arr`) are iterated over, and for each file, a SumHer analysis
is conducted using the ldak5.2.$envir
command. $envir
can either be
.mac
or .linux
depending on the operation system. This command processes both the
user's input data and BIGA's curated data, applying the necessary configuration settings, like tagging file
selection (--tagfile
), allowing ambiguous alleles (--allow-ambiguous YES
), and
bypassing SNP checks (--check-sums NO
). The wait
command at the end ensures that
all parallel processes are completed before moving on.
snp
, a1
,a2
,n
,beta
or odds_ratio
,pvalue
are extracted from this universal format. These are then
transformed into a format suitable for LAVA: snp
is labeled as SNP
,
a1
as A1
and a2
and A2
, n
as N
,
beta
as BETA
, pvalue
as P
, odds_ratio
as
OR
. Furthermore, the SNPs in the input file are filtered using reference data from Phase 3 of
the 1000 Genomes Project. LAVA loci can be defined using genomic coordinates or a list of SNPs, and BIGA
uses the same loci definition as LAVA, utilizing the 1000 Genomes Project data (Phase 3, build
GRCh37/hg19). Sample overlap file is set to NULL.
testPath = trait1_path
testName = t1_name
infoFile = paste('data/',job_id,'/','input.info.txt',sep = "")
sink(infoFile)
cat("phenotype cases controls filename\n")
temp = paste(testName,"NA","NA",testPath,'\n',sep=" ")
cat(temp,append = FALSE)
sink(NULL)
trait1 = process.input(input.info.file=infoFile,
sample.overlap.file=NULL,
ref.prefix="vignettes/data/ldsc_g1000_eur",phenos=c(t1_name))
In this segment, the script prepares the environment for LAVA analysis. It sets up the path of input summary
data and filename for the trait under study (trait1_path
and t1_name
). The infoFile
is a configuration file that details the input file, including cases and controls set to NA, along with
their respective filenames. The process.input
function is then called to process this input
data, considering the reference file from phase 3 of the 1000 Genomes Project (ref.prefix
).
if (n.run>0){
for (i in 1:n.run) {
start_time <- Sys.time()
# run all lava for each run
for (j in 1:batch_size) {
if (j == 1){
input = readRDS(paste(lava_summary,'/',trait2_names[(i-1)*batch_size+j],sep = ""))
input = appendEnv(trait1,input)
}else{
tmp = readRDS(paste(lava_summary,'/',trait2_names[(i-1)*batch_size+j],sep = ""))
input = appendEnv(input,tmp)
}
}
data_name = paste0((i-1)*batch_size+1,'_',i*batch_size,sep="")
# Read in locus info file
loci = read.loci("support_data/blocks_s2500_m25_f1_w200.GRCh37_hg19.locfile")
n.loc = nrow(loci)
### Analyse
progress = ceiling(quantile(1:n.loc, seq(.05,1,.05)))
b=u=list()
for (i in 1:n.loc) {
if (i %in% progress)print(paste("..",names(progress[which(progress==i)])))
try({locus = process.locus(loci[i,], input)})
if(!is.null(locus)){
# extract some general locus info for the output
loc.info = data.frame(locus = locus$id, chr = locus$chr, start = locus$start, stop = locus$stop, n.snps = locus$n.snps, n.pcs = locus$K)
if (t1_name %in% locus$phenos){
if(length(locus$phenos)>1) {
# run and bivariate tests
out = run.bivar(locus, target=t1_name)
if(!is.null(out$bivar)) {
b[[i]] = cbind(loc.info, out$bivar)}}}}}}}
This portion of the script is dedicated to running bivariate analysis using LAVA. It iterates through
multiple runs (n.run
), each corresponding to a different trait. For each trait, the script
reads loci information and processes each locus. The run.bivar
function performs bivariate
analysis on each locus, comparing it against the target phenotype (t1_name
). The results are
accumulated in the variable b
, which holds the analysis outcomes.
dataframes = [pd.read_csv(file) for file in file_paths]
df = pd.concat(dataframes, ignore_index=True)
This part reads multiple output files (specified in file_paths
) into Pandas DataFrames and then
concatenates them into a single DataFrame (df
). Each file is expected to be in CSV format.
dfBand = pd.read_csv(lava_path+'/vignettes/data/name/cytoBand.csv', header=0)
bands1, bands2, bands3 = [], [], []
#df = df.reset_index()
for index, row in df.iterrows():
chrIndex = "chr"+str(row["chr"])
start = int(row["start"])
end = int(row["stop"])
band = dfBand[(dfBand["chr"] == chrIndex) & (dfBand["start"] <= start) & (dfBand["end"] >= end)]
if band.empty:
bands = dfBand[(dfBand["chr"] == chrIndex) & (dfBand["start"] <= end) & (dfBand["end"] >= start)]
names = list(bands["name"])
if bands.empty:
bands1.append("NA")
bands2.append("NA")
bands3.append("NA")
elif len(bands) <= 2:
if (int(list(bands["start"])[1]) - start) >= (end - int(list(bands["start"])[1])):
bands1.append(names[0])
bands2.append(names[1])
bands3.append("NA")
else:
bands1.append(names[1])
bands2.append(names[0])
bands3.append("NA")
else:
if (int(list(bands["start"])[1]) - start) >= (end - int(list(bands["start"])[2])):
bands1.append(names[1])
bands2.append(names[0])
bands3.append(names[2])
else:
bands1.append(bands["name"][1])
bands2.append(bands["name"][2])
bands3.append(bands["name"][0])
else:
name = list(band["name"])
bands1.append(name[0])
bands2.append("NA")
bands3.append("NA")
df["bands1"] = bands1
df["bands2"] = bands2
df["bands3"] = bands3
This section annotates the genomic data with chromosomal bands. It reads a file containing information about
chromosomal bands (cytoBand.csv
) and iterates through each row of the main DataFrame
(df
). For each row, it determines the chromosomal bands (bands1
,
bands2
, bands3
) that overlap with the genomic region specified by
start
and stop
in the row. This is done by comparing the row's genomic region with
those in the chromosomal band data (dfBand
).
significant.csv
),
which contains only significant results. These results will be displayed on the BIGA result table page.
snp
,
a1
,a2
,n
,beta
, se
,
odds_ratio
,pvalue
,eaf
are extracted from this universal format. These
are then transformed into a format suitable for Popcorn: snp
is labeled as SNP
,
a1
and a2
remains, n
as N
, beta
remains,
se
as SE
, pvalue
as p-value
, odds_ratio
as
OR
. Additionally, Popcorn only requires that the effect allele to be the same column in the file
for both populations (e.g. a1
as effect in both input files from pop1 and pop2). The parameters
AF
in Popcorn can be calculated by 1-eaf
since AF
is the allele
frequency of a2
and eaf
is the frequency of a1
Fit mode
in Popcorn. The default estimator is
regression, following Popcorn's recommendation due to the relative instability of the maximum likelihood
estimation (MLE) methods. You can select between Regression
and MLE
for the
estimator. Another consideration in Popcorn is whether to compute the genetic effect correlation or the
genetic impact correlation. BIGA uses --gen_effet
as default since Popcorn believes that it's a more
realistic model. BIGA uses MAF threshold of 0.05 by default since the larger the cutoff, the closer the
genetic effect and the genetic impact results will be.
arr=("GWASsummary_BrC_Japanese_SakaueKanai2020.auto_harmo.txt",...,"GWASsummary_SAS_Japanese_SakaueKanai2020.auto_harmo.txt")
if [ $use_mle == "yes" -a $use_gen_effect == "yes" ]; then
for trait2 in "${arr[@]}"; do
out=$popcorn_path/log/$job_id/$trait2
python -m popcorn fit -v 1 --use_mle --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --gen_effect --sfile1 $summary_statistics1 --sfile2 $biobankjapan_dir/$trait2 $out
done
elif [ $use_mle == "yes" -a $use_gen_effect == "no" ]; then
for trait2 in "${arr[@]}"; do
out=$popcorn_path/log/$job_id/$trait2
python -m popcorn fit -v 1 --use_mle --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --sfile1 $summary_statistics1 --sfile2 $biobankjapan_dir/$trait2 $out
done
elif [ $use_mle == "no" -a $use_gen_effect == "yes" ]; then
for trait2 in "${arr[@]}"; do
out=$popcorn_path/log/$job_id/$trait2
python -m popcorn fit -v 1 --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --gen_effect --sfile1 $summary_statistics1 --sfile2 $biobankjapan_dir/$trait2 $out
done
else
for trait2 in "${arr[@]}"; do
out=$popcorn_path/log/$job_id/$trait2
python -m popcorn fit -v 1 --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --sfile1 $summary_statistics1 --sfile2 $biobankjapan_dir/$trait2 $out
done
fi
$summary_statistics1
represents the input summary data, either uploaded or queried by the user.
$biobankjapan_dir
is the directory where the curated data is stored. $trait2
refers to specific trait filenames from the Biobank Japan dataset. EUR_EAS_all_gen_imp.cscore
is the pre-calculated score. The script executes a series of conditional operations based on user-specified
parameters (use_mle
, use_gen_effect
). It iterates through an array of genetic
traits (listed in arr
), performing the specified analysis for each trait. The if
,
elif
, and else
statements direct the flow of execution.
if [ $samp_prev == "None" ]; then
python2 $ldsc_path/ldsc.py --rg $summary_statistics1,$summary_statistics2 --ref-ld-chr $ref_chr --w-ld-chr $w_chr --out $out
else
python2 $ldsc_path/ldsc.py --rg $summary_statistics1,$summary_statistics2 --ref-ld-chr $ref_chr --w-ld-chr $w_chr --out $out --samp-prev $samp_prev --pop-prev $pop_prev
fi
Here, $summary_statistics1
and $summary_statistics2
represent the file paths for
the first and second traits, respectively. When you opt for conversion to the Liability scale, additional
flags --samp-prev
and --pop-prev
are included in the command to incorporate the
prevalence parameters.
$sumHer_path/ldak5.2.$envir --summary $summary_statistics1 --summary2 $summary_statistics2 --sum-cors $out --tagfile $sumHer_path/ldak.thin.hapmap.$popu.tagging --check-sums NO --allow-ambiguous YES --cutoff $cutoff --genomic-control $geno_control
$summary_statistics1
represents the first input summary data, either uploaded or queried by the
user. $summary_statistics2
represents the second input summary data. $popu
is a
variable indicating the population-specific GWAS dataset, which can be either 'gbr' for the European
population or 'eas' for East Asian. A SumHer analysis is conducted using the ldak5.2.$envir
command. $envir
can either be .mac or .linux
depending on the
operating system. The parameter $cutoff
is employed to determine the threshold percentage for
excluding loci with significant effects. Finally, the option --genomic-control
is set to either
YES or NO to decide on the allowance for multiplicative inflation in the test statistics.
run.univ.bivar(locus, univ.thresh=uinv_thresh,
p.values=p_values, CIs=ci, param.lim=parlim)
Rscript $lava_path/shell/2traits.R $lava_path $summary_statistics1 $summary_statistics2 $job_id $t1_name $t2_name $univ_thresh $p_values $ci $parlim --no-restore --no-save
This is an example script of LAVA pairwise analysis, which is the command used to run R scripts from the
command line.$summary_statistics1
represents the first input summary data, either uploaded or
queried by the user. $summary_statistics2
represents the second input summary data. job_id
is your job running id. t1_name,t2_name
are the traits' names. univ_thresh
is the
p-value threshold for the univariate test. If ci
is set to 'YES', then calculate confidence
intervals. If p_values
is set to `YES`, then calculate p-values. parlim
is the +-
threshold for estimated parameters.
if [ $use_mle == "use_mle" -a $use_gen_effect == "use_gen_effect" ]; then
out=$popcorn_path/2traits/log/$job_id/mle_gen_effect
python -m popcorn fit -v 1 --use_mle --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --gen_effect --sfile1 $summary_statistics1 --sfile2 $summary_statistics2 $out
elif [ $use_mle == "use_mle" -a $use_gen_effect == "use_gen_impact" ]; then
out=$popcorn_path/2traits/log/$job_id/mle_gen_imp
python -m popcorn fit -v 1 --use_mle --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --sfile1 $summary_statistics1 --sfile2 $summary_statistics2 $out
elif [ $use_mle == "use_regression" -a $use_gen_effect == "use_gen_effect" ]; then
out=$popcorn_path/2traits/log/$job_id/mle_reg_effect
python -m popcorn fit -v 1 --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --gen_effect --sfile1 $summary_statistics1 --sfile2 $summary_statistics2 $out
else
out=$popcorn_path/2traits/log/$job_id/mle_reg_imp
python -m popcorn fit -v 1 --cfile $popcorn_path/ref/EUR_EAS_all_gen_imp.cscore --sfile1 $summary_statistics1 --sfile2 $summary_statistics2 $out
fi
$summary_statistics1
represents the first input summary data, either uploaded or queried by the
user. $summary_statistics2
represents the second input summary data. EUR_EAS_all_gen_imp.cscore
is the pre-calculated score. The script executes a series of conditional operations based on user-defined
parameters (use_mle
, use_gen_effect
). The `if`, `elif`, and `else` statements
direct the flow of execution.
ieu-b-30
from IEU OpenGWAS and then running Massive
LDSC Analysis with PGC dataset.
************************************************************************
* BigaGWAS WorkShop
* version 1.1.0
* Yujue Li and Bingxin Zhao (2023-2024)
* Purdue University and University of Pennsylvania
* GNU General Public License v3
************************************************************************
This section is the masthead. It contains basic information about the software, including its name, version,
authors, the associated institutions, and the license under which it is distributed.
************************************************************************
******************* Task Type: Massive LDSC Analysis *******************
************************************************************************
Massive Analysis: Method=massive_ldsc Job_type=group_pgc Job_id=995.
************************************************************************
******************** Task ID and Job Configurations ********************
************************************************************************
Task running ID: 9091bd78-1232-4391-9fed-24c5cf66c694.
Current Job Settings:{'title': 'GCST90000060', 'username': 'admin', 'trait1_name': 'GCST90000060', 'job_id': '995', 'email_to': 'li4476@purdue.edu', 'need_harmonize': False, 'dropna': True, 'header': None, 'snp': False, 'query_error': False, 'chunksize_map_rsid': 200000, 'chunksize_harmo': 200000, 'gwas_catalog_chunksize': 10000, 'chunksize': 100000, 'convert_back_to_hg19': True, 'data_dir': '/Users/yujue/Documents/githubPackage/ldsc/data/995', 'log_dir': '/Users/yujue/Documents/githubPackage/ldsc/log/995', 'summary1_path': '/Users/yujue/Documents/githubPackage/ldsc/data/995/995_ldsc.sumstats.gz', 'compression': 'txt', 'job_type': 'massive_ldsc', 'choose_type': 'group_pgc', 'value': 'pgc', 'LDSC_PATH': '/Users/yujue/Documents/githubPackage/ldsc', 'population': 'european', 'group': 'PGC_and_Brain_Disorders', 'add_prev': False, 'samp_prev': 'nan', 'pop_prev': 'nan', 'chr': None, 'position': None, 'rsid': None, 'n': None, 'a1': None, 'a2': None, 'beta': None, 'eaf': None, 'pvalue': None, 'se': None, 'odds_ratio': None, 'zscore': None, 'hg_version': None, 'hg19_final_data': '/Users/yujue/Documents/githubPackage/ldsc/data/995/995.txt', 'hg19_final_data_gz': '/Users/yujue/Documents/githubPackage/ldsc/data/995/995.txt.gz', 'link1': 'http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics', 'sample_size_catalog': '55134', 'query_party': 'gwas_catalog', 'select_trait1': 'querying', 'query_id': 'GCST90000060', 'celery_id': '9091bd78-1232-4391-9fed-24c5cf66c694'}.
In this section, the log file provides an overview of the current analysis being run. It includes the method
used (`massive_ldsc`), the type of job (`group_pgc`), and a unique job ID (`758`). There's also a task
running ID, which is likely a unique identifier for this specific instance of the task.
************************************************************************
********************** Querying from IEU OpenGWAS **********************
************************************************************************
Task running script:
sh /Users/yujue/Documents/githubPackage/bigagwas/static/shell/query.sh ebi-a-GCST007517 /Users/yujue/Documents/githubPackage/ldsc/data/988 https://gwas.mrcieu.ac.uk/files /opt/homebrew/anaconda3/etc/profile.d/conda.sh ldsc /Users/yujue/Documents/githubPackage/query 298957
ebi-a-GCST007517
/Users/yujue/Documents/githubPackage/ldsc/data/988
https://gwas.mrcieu.ac.uk/files
/opt/homebrew/anaconda3/etc/profile.d/conda.sh
ldsc
/Users/yujue/Documents/githubPackage/query
298957
Read VCF File.
Finish reading VCF file.
Starting to convert into text file for harmonization.
OpenGWAS file process complete. Running time: 1.3004319667816162
....
Queried Data Download Finished. Time consuming: 9.6406888961792 seconds.
************************************************************************
******************* Checking Input data from OpenGWAS ******************
************************************************************************
Queried data includes columns: ['CHR', 'POS', 'SNPID', 'A2', 'A1', 'ebi-a-GCST007517', 'BETA', 'SE', 'LP', 'AF1', 'P_value', 'N']
Input data check Finished.
This part details the querying process from IEU OpenGWAS, including script execution and file processing. It
provides information on reading the downloaded VCF files and the time taken to complete these processes.
Here, it is focused on converting data into the format for harmonization.
*********** Starting to map rsid to chromosome and position ************
Mapping dbSNP156_GRCh38_chr_12-3.txt : 443.08141112327576 seconds
Mapping dbSNP156_GRCh38_chr_10-3.txt : 480.74914479255676 seconds
Mapping dbSNP156_GRCh38_chr_14-1.txt : 582.9732220172882 seconds
...
************************************************************************
********************* Harmonization: Mapping rsid *********************
************************************************************************
Mapping dbSNP156_GRCh38_chr_10-2.txt : 23.715790033340454 seconds
Mapping dbSNP156_GRCh38_chr_10-1.txt : 26.09787106513977 seconds
Mapping dbSNP156_GRCh38_chr_14-2.txt : 27.18862295150757 seconds
Mapping dbSNP156_GRCh38_chr_14-1.txt : 28.28748393058777 seconds
...
Overall Forward strand rate: 1.0, Reverse strand rate:0.0.
************************************************************************
************ Harmonization: Harmonizing data by chromosome *************
************************************************************************
Chromosome 5, Strand Inference Summary:
Counter({'Forward strand variant': 5421, 'Palindormic variant': 696})
Chromosome 5, Harmonization summary:
Counter({'Forward strand; Correct orientation; Already harmonised': 5419, 'Palindromic; Assume forward strand; Correct orientation; Already harmonised': 695, 'Forward strand; Flipped orientation; Requires harmonisation': 2, 'Palindromic; Assume forward strand; Flipped orientation; Requires harmonisation': 1})
Chromosome 7, Strand Inference Summary:
...
Harmonization ends. Time consuming: 182.41979694366455 seconds.
This section covers the mapping of rsid to chromosome and position aligned with GRCh38 and the time taken
for each chromosome. It then goes into harmonizing the summary statistics data, inferring the direction of
the strand, providing detailed summaries and counts for each chromosome. The section ends by noting the
total time spent on harmonization.
************************************************************************
*************** Combining and Sorting Harmonized data ******************
************************************************************************
Combining and Sorting ends. Time consuming: 0.9061808586120605 seconds.
This section indicates that the harmonized data has been combined and sorted by chromosomes. The time taken
for this process is noted, allowing for performance tracking.
************************************************************************
************ Starting to convert back to GRCh37 version ****************
************************************************************************
0 positions did not align with the GRCh37 reference.
Liftover to GRCh37 Finished. Time Consuming: 4.547966957092285 seconds.
Here, the log file records the process of converting the data back to the GRCh37 for analysis because LDSC
needs position built on GRCh37 version. It specifically notes that 1366 positions did not align with the
GRCh37 reference file. The duration of this liftover process is also recorded.
************************************************************************
********* Converting into sumstats.gz for use in LDSC analysis *********
************************************************************************
/Users/yujue/Documents/githubPackage/ldsc/data/988/988.txt.gz
/Users/yujue/Documents/githubPackage/ldsc/data/988/988_ldsc
/opt/homebrew/anaconda3/etc/profile.d/conda.sh
py27
/Users/yujue/Documents/githubPackage/ldsc
no
*********************************************************************
* LD Score Regression (LDSC)
* Version 1.0.1
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
This part of the log indicates the final conversion of the data into `sumstats.gz` format, which is used in
LDSC analysis. The time taken for this conversion is provided.
************************************************************************
************************* Running LDSC Software ************************
************************************************************************
Starting to run 24 PGC psychiatric disorders traits.
Command line:
sh /Users/yujue/Documents/githubPackage/bigagwas/static/shell/pgc_24.sh 988 /Users/yujue/Documents/githubPackage/ldsc/data/988/988_ldsc.sumstats.gz /Users/yujue/Documents/githubPackage/ldsc /Volumes/sandisk/summary_statistics/harmo/ldsc/data/pgc/hg19/ /opt/homebrew/anaconda3/etc/profile.d/conda.sh py27 /eur_w_ld_chr/
This section shows the initiation of the LDSC software to analyze 24 PGC psychiatric disorders and other brain diseases traits. It
includes the command line used to run the analysis.
*********************************************************************
* LD Score Regression (LDSC)
* version 1.1.0
* (C) 2014-2019 Brendan Bulik-Sullivan and Hilary Finucane
* Broad Institute of MIT and Harvard / MIT Department of Mathematics
* GNU General Public License v3
*********************************************************************
Call:
./ldsc.py \
--ref-ld-chr /Users/yujue/Documents/githubPackage/ldsc/eur_w_ld_chr/ \
--out /Users/yujue/Documents/githubPackage/ldsc/log/758/phenoone \
--rg /Users/yujue/Documents/githubPackage/ldsc/data/758/758_ldsc.sumstats.gz,/Volumes/sandisk/summary_statistics/harmo/ldsc/data/pgc/hg19/8_ocd_aug2017.sumstats.gz
...
Finally, the log file gives the output of the LDSC software being used. More information about this section is
provided by LDSC software itself.