[Linux] GWAS with genetic statistics software PLINK

Introduction

Recently, I started using PLINK, a genetic statistics software. Learn how to use PLINK with reference to this book.

Practice from scratch genetic statistics seminar

This book is written for Windows, so I'll write a reminder of how to do it on a Mac. This time, we will perform a genome-wide association study (GWAS) with PLINK.

reference

The basic usage of PLINK has been posted before [Linux] I tried using the genetic statistics software PLINK

Data to use

bed|bim|fam format file

Phenotype file

This file is a case control data file, with case 2 and control 1, each with a family ID and sample ID, and the family ID and sample ID of the ** SNP.fam ** file. Is tied to.

Store these files in your working directory.

Launch PLINK

Specify the working directory

Specifying the working directory


$ cd /Working directory path/

Move ** PLINK ** (PLINK executable file) to the working directory and start it.

Start PLINK


$ ./plink

SNP filtering

--bfileso,SNPFile (bed|bim|fam形式)を読み込んso,--make-bedData after filtering with new bed|bim|Created as a fam format file. Specify the name of the file to be output with --out in ** SNP_QC **.

This time, SNPs with the following conditions are excluded. --maf 0.05: Minor allele frequency is 5% or less (1% or 0.5% or less is common in normal GWAS) --hwe 0.000001: Hardy-Weinberg equilibrium test with * p * value less than or equal to 10 ^ -6 --indep-pairwise 100 5 0.8: Linkage disequilibrium r2 value is 0.8 or more

Removal of SNP


$ ./plink --bfile SNP --out SNP_QC --maf 0.05 --hwe 0.000001 --indep-pairwise 100 5 0.8 --make-bed

Execution result


PLINK v1.90b6.16 64-bit (19 Feb 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to SNP_QC.log.
Options in effect:
  --bfile SNP
  --hwe 0.000001
  --indep-pairwise 100 5 0.8
  --maf 0.05
  --make-bed
  --out SNP_QC

16384 MB RAM detected; reserving 8192 MB for main workspace.
8830185 variants loaded from .bim file.
381 people (178 males, 203 females) loaded from .fam.
381 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 381 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--hwe: 14399 variants removed due to Hardy-Weinberg exact test.
2692226 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
6123560 variants and 381 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 381 are controls.
--make-bed to SNP_QC.bed + SNP_QC.bim + SNP_QC.fam ... done.
Pruned 366094 variants from chromosome 1, leaving 103214.
Pruned 403320 variants from chromosome 2, leaving 106161.
Pruned 339719 variants from chromosome 3, leaving 85861.
Pruned 353504 variants from chromosome 4, leaving 86916.
Pruned 310443 variants from chromosome 5, leaving 79003.
Pruned 322217 variants from chromosome 6, leaving 81854.
Pruned 285601 variants from chromosome 7, leaving 83237.
Pruned 264127 variants from chromosome 8, leaving 71155.
Pruned 208275 variants from chromosome 9, leaving 76275.
Pruned 245016 variants from chromosome 10, leaving 67270.
Pruned 241219 variants from chromosome 11, leaving 63223.
Pruned 226971 variants from chromosome 12, leaving 62436.
Pruned 177571 variants from chromosome 13, leaving 45339.
Pruned 154730 variants from chromosome 14, leaving 44160.
Pruned 134104 variants from chromosome 15, leaving 45299.
Pruned 142197 variants from chromosome 16, leaving 50213.
Pruned 124358 variants from chromosome 17, leaving 41954.
Pruned 134574 variants from chromosome 18, leaving 38435.
Pruned 105678 variants from chromosome 19, leaving 37131.
Pruned 102672 variants from chromosome 20, leaving 31013.
Pruned 68848 variants from chromosome 21, leaving 23361.
Pruned 63204 variants from chromosome 22, leaving 25608.
Pruning complete.  4774442 of 6123560 variants removed.
Marker lists written to SNP_QC.prune.in and SNP_QC.prune.out .

Check that the following files are output to the working directory.

Genome-wide association study (GWAS)

Perform GWAS analysis using the following command. --pheno: Enter the phenotype file used for GWAS (phenotype1.txt this time) --logistic: Perform logistic regression --ci 0.95: Output 95% confidence interval

GWAS (Logistic Regression Analysis)


$ ./plink --bfile SNP_QC --out SNP_QC_Pheno1 --pheno phenotype1.txt --logistic --ci 0.95

GWAS execution result


PLINK v1.90b6.16 64-bit (19 Feb 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to SNP_QC_Pheno1.log.
Options in effect:
  --bfile SNP_QC
  --ci 0.95
  --logistic
  --out SNP_QC_Pheno1
  --pheno phenotype1.txt

16384 MB RAM detected; reserving 8192 MB for main workspace.
6123560 variants loaded from .bim file.
381 people (178 males, 203 females) loaded from .fam.
381 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 381 founders and 0 nonfounders present.
Calculating allele frequencies... done.
6123560 variants and 381 people pass filters and QC.
Among remaining phenotypes, 188 are cases and 193 are controls.
Writing logistic model association results to SNP_QC_Pheno1.assoc.logistic ...
done.

Confirm that the file named ** SNP_QC_Pheno1.assoc.logistic ** is output to the working directory, and open it with a text editor. The first column is the chromosome number, the second column is the SNP ID, the third column is the chromosome position, and the twelfth column is the * p * value.

Extraction of elements by AWK

From the GWAS result, use the AWK command to extract the columns of "chromosome number", "SNP ID", "chromosome position", and "* p * value". Use the AWK command to set the input file as ** SNP_QC_Pheno1.assoc.logistic ** and the output file as a text file ** SNP_QC_Pheno1.assoc.logistic.P.txt *. In AWK, separate them with ‘’ and write commands in them to execute them. By {print $ 2 "\ t" $ 1 "\ t" $ 3 "\ t" $ 12} The data frame is "2nd column [SNP ID] 1st column [chromosome number] 3rd column [chromosome position] 4th column [ p * value]". The command that "\ t" is separated by tabs. Output as a text file specified by > .

Extract elements from GWAS results with AWK command and output text file


$ awk '{print $2"\t"$1"\t"$3"\t"$12}' SNP_QC_Pheno1.assoc.logistic > SNP_QC_Pheno1.assoc.logistic.P.txt

The output file can also be a CSV file.

Extract elements from GWAS results with AWK command and output CSV file


$ awk '{print $2"\t"$1"\t"$3"\t"$12}' SNP_QC_Pheno1.assoc.logistic > SNP_QC_Pheno1.assoc.logistic.P.csv

Draw a Manhattan plot using this GWAS result. I have already posted how to write the Manhattan plot, so I will omit it this time.

reference

How to write a Manhattan plot in R has also been posted before [R] I drew a Manhattan plot with GWAS results [R] I drew a Manhattan plot with GWAS results 2

Recommended Posts

[Linux] GWAS with genetic statistics software PLINK
[Linux] eQTL analysis with genetic statistics software PLINK
[Linux] I tried using the genetic statistics software PLINK
Statistics with python
Linux (Lubuntu) with OneMix3S
linux with termux app