[PYTHON] [GWAS] Plot the results of principal component analysis (PCA) by PLINK

About this article

-- I wrote a script that plots, on a two-dimensional plane, the results of a principal component analysis (PCA) run with the statistical genetics software PLINK.
-- This article introduces the script's input and output files and how to run it.
-- The script is here (link to GitHub)

Input file preparation

1. Preparation of principal component data

File format

Prepare a file with the family ID in the first column, the individual ID in the second column, and each individual's principal component values (PC1, PC2, ...) in the third and subsequent columns. A file in this format is produced when you run a principal component analysis with PLINK.

#1 FamID
#2 Individual ID
#3 PC1
#4 PC2
...

Principal component analysis by PLINK

Principal component analysis can be performed with the statistical genetics software PLINK. PCA is a dimensionality reduction method based on the eigendecomposition of a variance-covariance matrix or correlation matrix. In GWAS, the leading principal components are commonly used as covariates to adjust for confounding, for example by population structure.
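As a generic illustration of the eigendecomposition described above, here is a minimal NumPy sketch. It is not PLINK's internal computation; the function name `pca_by_eigendecomposition` and the random toy matrix are assumptions made only for this example.

```python
import numpy as np


def pca_by_eigendecomposition(X, n_components=2):
    """Return (scores, eigenvalues) for the top principal components of X.

    X is a (samples x features) array. Hypothetical helper for illustration.
    """
    X = np.asarray(X, dtype=float)
    # Center each feature so the covariance matrix is computed around the mean.
    centered = X - X.mean(axis=0)
    # Variance-covariance matrix of the features (uses ddof=1).
    cov = np.cov(centered, rowvar=False)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # eigh returns ascending eigenvalues; keep the largest n_components.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Project the centered data onto the selected eigenvectors.
    scores = centered @ eigenvectors
    return scores, eigenvalues


# Toy data only: 50 samples, 5 features drawn at random.
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 5))
scores, eigenvalues = pca_by_eigendecomposition(toy)
```

The variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues are read as "variance explained" by each component.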

$ plink --bfile ${bfile_name} --out ${outfile_name} --pca

PLINK writes the PCA results to ${outfile_name}.eigenvec and ${outfile_name}.eigenval. The plots use ${outfile_name}.eigenvec, which holds each individual's value on each principal component.
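A file in this layout can be read with pandas roughly as follows. This is a sketch that assumes a whitespace-delimited file without a header line, matching the column layout described earlier (some PLINK versions write a header line, which this sketch does not handle). The helper name `load_eigenvec` and the three toy rows are made up for illustration.

```python
import io

import pandas as pd


def load_eigenvec(source):
    """Read a headerless, whitespace-delimited .eigenvec file into a DataFrame.

    Hypothetical helper: `source` may be a file path or a file-like object.
    """
    # Read IDs as strings so numeric-looking IDs are not converted to integers.
    df = pd.read_csv(source, sep=r"\s+", header=None, dtype={0: str, 1: str})
    n_pcs = df.shape[1] - 2
    df.columns = ["FID", "IID"] + [f"PC{i}" for i in range(1, n_pcs + 1)]
    return df


# Toy content standing in for ${outfile_name}.eigenvec (values are invented).
toy_eigenvec = io.StringIO(
    "FAM1 IND1 0.012 -0.034 0.005\n"
    "FAM2 IND2 -0.021 0.008 0.017\n"
    "FAM3 IND3 0.005 0.027 -0.011\n"
)
pcs = load_eigenvec(toy_eigenvec)
```

In real use you would pass the path to the .eigenvec file instead of the `io.StringIO` object.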

2. Preparation of group label data

File format

Prepare a file with the family ID in the first column, the individual ID in the second column, and the group label (population, ethnicity, etc.) in the third column. This article calls it populations.txt.

#1 FamID
#2 Individual ID
#3 Group
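The two input files share the family ID and individual ID columns, so the group labels can be attached to the principal component data with a join on those two keys. The following is a sketch with pandas, assuming both files are headerless and whitespace-delimited; the column names and toy rows are chosen for this example and are not taken from the script.

```python
import io

import pandas as pd

# Toy stand-ins for the two input files (all values are invented).
eigenvec_text = (
    "FAM1 IND1 0.012 -0.034\n"
    "FAM2 IND2 -0.021 0.008\n"
    "FAM3 IND3 0.005 0.027\n"
)
population_text = (
    "FAM1 IND1 GROUP1\n"
    "FAM2 IND2 GROUP2\n"
    "FAM3 IND3 GROUP1\n"
)

pcs = pd.read_csv(
    io.StringIO(eigenvec_text), sep=r"\s+", header=None,
    names=["FID", "IID", "PC1", "PC2"], dtype={"FID": str, "IID": str},
)
labels = pd.read_csv(
    io.StringIO(population_text), sep=r"\s+", header=None,
    names=["FID", "IID", "Group"], dtype=str,
)

# Left join keeps every individual in the PCA output;
# validate raises an error if an (FID, IID) pair is duplicated in either file.
merged = pcs.merge(labels, on=["FID", "IID"], how="left", validate="one_to_one")

# Individuals without a label end up with a missing Group value.
unlabeled = merged[merged["Group"].isna()]
```

Checking `unlabeled` before plotting is a cheap way to catch ID mismatches between the two files.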

How to execute the script

The script requires Python 3 with pandas and matplotlib installed. Run it with the following options:
-- -e: the ${outfile_name}.eigenvec file
-- -p: the populations.txt file
-- -o: the output directory

$ python plot_pca_gwas.py -e ${outfile_name}.eigenvec -p populations.txt -o ${output_directory}/
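For readers writing a similar tool, the three options above could be declared with argparse along the following lines. This is only a sketch of one way to do it, not the code of plot_pca_gwas.py; the `dest` names, help strings, and the `build_parser` function are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the -e, -p and -o options (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Plot PLINK PCA results on a two-dimensional plane."
    )
    parser.add_argument("-e", dest="eigenvec", required=True,
                        help="path to the .eigenvec file written by PLINK")
    parser.add_argument("-p", dest="populations", required=True,
                        help="path to the group label file")
    parser.add_argument("-o", dest="out_dir", required=True,
                        help="directory in which to write the PNG files")
    return parser


# Parsing an explicit list here; a real script would call parse_args()
# with no arguments to read the command line.
args = build_parser().parse_args(
    ["-e", "example.eigenvec", "-p", "populations.txt", "-o", "out/"]
)
```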

Check the output result

The script outputs the following images:
-- pca.png: plot of the entire population
-- pca_{group}.png: plot for each group
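A minimal matplotlib sketch that writes files with these names is shown below. It assumes a DataFrame with columns PC1, PC2 and Group, and it makes one design guess: each per-group plot highlights that group against the other individuals in gray. The actual script may draw its plots differently; `plot_pca` and the toy data are hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd


def plot_pca(df, out_dir):
    """Write pca.png and one pca_{group}.png per group; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    written = []

    # Whole population, one color per group.
    fig, ax = plt.subplots()
    for group, sub in df.groupby("Group"):
        ax.scatter(sub["PC1"], sub["PC2"], s=12, label=str(group))
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.legend()
    path = os.path.join(out_dir, "pca.png")
    fig.savefig(path, dpi=150)
    plt.close(fig)
    written.append(path)

    # One plot per group: everyone in gray, the group of interest on top.
    for group, sub in df.groupby("Group"):
        fig, ax = plt.subplots()
        ax.scatter(df["PC1"], df["PC2"], s=12, color="lightgray")
        ax.scatter(sub["PC1"], sub["PC2"], s=12, color="tab:red", label=str(group))
        ax.set_xlabel("PC1")
        ax.set_ylabel("PC2")
        ax.legend()
        path = os.path.join(out_dir, f"pca_{group}.png")
        fig.savefig(path, dpi=150)
        plt.close(fig)
        written.append(path)
    return written


# Toy data only (values are invented).
toy = pd.DataFrame({
    "PC1": [0.012, -0.021, 0.005, 0.030],
    "PC2": [-0.034, 0.008, 0.027, -0.002],
    "Group": ["GROUP1", "GROUP2", "GROUP1", "GROUP2"],
})
out_dir = tempfile.mkdtemp()
written = plot_pca(toy, out_dir)
```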

Execution example

Running the script with example.eigenvec and example_population.txt (https://github.com/t-yui/bioinformatics_scripts/blob/master/gwas_tools/plinkPCA/plot_examples/example_data/example_population.txt) as input files produces the following images.

1. pca.png

2-1) pca_GROUP1.png

2-2) pca_GROUP2.png

2-3) pca_GROUP3.png
