(2017.02.02 OSX El Capitan)
Using gene expression data taken from various human tissues, let's train a classifier to infer which human tissue each sample is derived from. I worked through this with reference to Python Machine Learning Programming.
The GTEx Portal publishes gene expression data from various human tissues. The latest GTEx Analysis V6p Release provides gene expression data for a total of 8555 samples from 53 human tissues and 544 donors.
By registering an account on the site, you can download the gene expression data. The download links are on the page reached from the site's navigation bar via Datasets -> Download.
The RPKM (Reads Per Kilobase of transcript per Million mapped reads) normalized gene expression data file is GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz.
It is provided in GCT (Gene Cluster Text) format: the columns are samples (column names are sample IDs) and the rows are features (row names are gene names).
The annotation (label information) for each sample is provided in the GTEx Analysis V6 Release as GTEx_Data_V6_Annotations_SampleAttributesDS.txt. The sample ID is in the first column, and the 6th and 7th columns describe which tissue each sample came from.
Click the download link for each of these two files, then double-click the .gz file to unzip it.
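If you prefer not to double-click, the file can also be decompressed with a few lines of Python using only the standard library (a minimal sketch):
import gzip
import shutil
# Write the decompressed .gct next to the downloaded .gz file.
with gzip.open('GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz', 'rb') as fin, \
     open('GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct', 'wb') as fout:
    shutil.copyfileobj(fin, fout)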
Use Linux commands to look at the first 10 rows and first 3 columns of the gene expression data file. The first two lines show that the file contains expression levels (features) for 56238 genes across 8555 samples. The gene names in the first column show that Ensembl gene IDs are used. The expression data itself starts at the 3rd column of the 4th row.
cut -f1,2,3 GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct | head
#1.2
56238 8555
Name Description GTEX-111CU-1826-SM-5GZYN
ENSG00000223972.4 DDX11L1 0
ENSG00000227232.4 WASH7P 6.50895977020264
ENSG00000243485.2 MIR1302-11 0
ENSG00000237613.2 FAM138A 0
ENSG00000268020.2 OR4G4P 0
ENSG00000240361.1 OR4G11P 0
ENSG00000186092.4 OR4F5 0
Use read_csv from Python's pandas package to read the data.
Since the file is tab-delimited, set sep='\t', and since the first two lines should be skipped, set skiprows=2.
The next row read is then treated as the header (column names).
To read the data for all 8555 samples (up to the 8557th column) while skipping the Description in the 2nd column, set usecols=[0] + list(range(2, 8557)).
Finally, set index_col=0 to use the first column as the row names.
import pandas as pd
usecols = [0] + list(range(2,8557))
df1 = pd.read_csv('GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct', sep='\t', skiprows=2, usecols=usecols, index_col=0)
>>> df1.index
Index(['ENSG00000223972.4', 'ENSG00000227232.4', 'ENSG00000243485.2',
'ENSG00000237613.2', 'ENSG00000268020.2', 'ENSG00000240361.1',
'ENSG00000186092.4', 'ENSG00000238009.2', 'ENSG00000233750.3',
'ENSG00000237683.5',
...
'ENSG00000198886.2', 'ENSG00000210176.1', 'ENSG00000210184.1',
'ENSG00000210191.1', 'ENSG00000198786.2', 'ENSG00000198695.2',
'ENSG00000210194.1', 'ENSG00000198727.2', 'ENSG00000210195.2',
'ENSG00000210196.2'],
dtype='object', name='Name', length=56238)
>>> df1.columns
Index(['GTEX-111CU-1826-SM-5GZYN', 'GTEX-111FC-0226-SM-5N9B8',
'GTEX-111VG-2326-SM-5N9BK', 'GTEX-111YS-2426-SM-5GZZQ',
'GTEX-1122O-2026-SM-5NQ91', 'GTEX-1128S-2126-SM-5H12U',
'GTEX-113IC-0226-SM-5HL5C', 'GTEX-117YX-2226-SM-5EGJJ',
'GTEX-11DXW-0326-SM-5H11W', 'GTEX-11DXX-2326-SM-5Q5A2',
...
'GTEX-ZVE2-0006-SM-51MRW', 'GTEX-ZVP2-0005-SM-51MRK',
'GTEX-ZVT2-0005-SM-57WBW', 'GTEX-ZVT3-0006-SM-51MT9',
'GTEX-ZVT4-0006-SM-57WB8', 'GTEX-ZVTK-0006-SM-57WBK',
'GTEX-ZVZP-0006-SM-51MSW', 'GTEX-ZVZQ-0006-SM-51MR8',
'GTEX-ZXES-0005-SM-57WCB', 'GTEX-ZXG5-0005-SM-57WCN'],
dtype='object', length=8555)
Add a pseudocount of 1 to the gene expression levels (RPKM) and take log2.
import numpy as np
df1pc = df1 + 1
df1log = np.log2(df1pc)
Randomly select 10 samples and plot the density distribution of gene expression. Use pandas DataFrame.plot.density for the plot.
import matplotlib.pyplot as plt
random_cols = np.random.choice(8555, 10)
df1log.iloc[:, random_cols].plot.density(fontsize=10)
plt.savefig('gtex_log_rpkm_density_random10sample.png', dpi=150)
plt.show()
Since the distribution varies considerably from sample to sample, it may be better to compute a reference distribution from the training dataset and apply quantile normalization or a similar correction.
From GTEx_Data_V6_Annotations_SampleAttributesDS.txt, which contains the tissue information for each sample, read the sample ID in the 1st column and the tissue information in the 7th column.
usecols = [0, 6]
df2 = pd.read_csv('GTEx_Data_V6_Annotations_SampleAttributesDS.txt', sep='\t', usecols=usecols, index_col=0)
Transpose the rows and columns of the gene expression data and join it with the sample tissue information DataFrame to attach the tissue labels. Use pandas concat.
df = pd.concat([df1log.T, df2], axis=1, join_axes=[df1log.T.index])
Following Python Machine Learning Programming 4.2.2 (P99): since the tissue information is hard to handle as strings, convert the strings to integers and replace them. Because the gene expression data was transposed before joining, the rows are now samples and the columns are the features plus the tissue information.
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
label = class_le.fit_transform(df['SMTSD'].values)
label_r = class_le.inverse_transform(label)
df['SMTSD'] = label
Following Python Machine Learning Programming 4.3 (P102), split the labeled gene expression data into features (X) and class labels (y).
smtsd_index = len(df.columns) - 1
X, y = df.iloc[:, :smtsd_index].values, df.iloc[:, smtsd_index].values
Then, divide it into training data and test data.
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.3, random_state=0)
Let's try training and prediction with the k-nearest neighbors method. Use sklearn.neighbors.KNeighborsClassifier.
Before training, delete variables that are no longer needed to free up memory.
del df
del df1
del df1pc
del df1log
del df2
Use k = 3 and run with 4 parallel jobs (n_jobs=4). The calculation took about 10 minutes on a 4 GHz Intel Core i7.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3, n_jobs=4)
knn.fit(X_train, y_train)
print('Training accuracy: ', knn.score(X_train, y_train))
print('Test accuracy: ', knn.score(X_test, y_test))
As shown below, the accuracy on the test data was about 91%.
>>> print('Training accuracy: ', knn.score(X_train, y_train))
Training accuracy: 0.95641282565130259
>>> print('Test accuracy: ', knn.score(X_test, y_test))
Test accuracy: 0.90962212699649392
Next, perform feature selection: genes whose total expression level does not exceed 500 are removed. Referring to Python Machine Learning Programming 4.5.2 (P113), Sequential feature selection, and the scikit-learn source code, create a class for this feature selection.
from sklearn.base import TransformerMixin
import numpy as np
class GexSumFS(TransformerMixin):
    """
    Feature selection based on the sum of gene expression values
    """
    def __init__(self, sum):
        self.sum = sum
    def fit(self, X, y=None):
        # Keep features (columns) whose sum over the training samples exceeds the threshold.
        self.indices_ = np.where(np.sum(X, axis=0) > self.sum)[0]
        return self
    def transform(self, X, y=None):
        return X[:, self.indices_]
Use the above class to filter by total expression level.
sumfs = GexSumFS(sum = 500)
X_train_sumfs = sumfs.fit_transform(X_train)
X_test_sumfs = sumfs.transform(X_test)
The number of genes (features) has decreased to 27754.
>>> X_train_sumfs.shape[1]
27754
Let's train and predict with the feature-selected data.
knn.fit(X_train_sumfs, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs, y_test))
The accuracy was about 91%, almost the same as before feature selection. Although the accuracy decreased very slightly, feature selection has the advantage of somewhat reducing computation time.
>>> print('Training accuracy: ', knn.score(X_train_sumfs, y_train))
Training accuracy: 0.955577822311
>>> print('Test accuracy: ', knn.score(X_test_sumfs, y_test))
Test accuracy: 0.908843007402
When gene expression levels are standardized (mean 0, variance 1), genes with low expression tend to become noisy, so feature selection based on expression level should be more effective in that case.
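As a toy illustration (made-up numbers, not part of the analysis): standardizing a nearly constant, lowly expressed gene turns negligible fluctuations into values of order 1.
import numpy as np
low = np.array([0.00, 0.01, 0.00, 0.02, 0.01])  # hypothetical log2(RPKM + 1) values
z = (low - low.mean()) / low.std()
print(z)  # approx. [-1.07, 0.27, -1.07, 1.60, 0.27]: what was noise now spans about +-1.5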
Quantile normalization
The density distribution plot showed that the expression level distribution differs considerably from sample to sample. Let's apply quantile normalization and then try training & prediction. Create a class that performs quantile normalization as shown below.
class QuantileNormalization(TransformerMixin):
    """
    Quantile normalization
    """
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # Reference distribution: for each rank position, the mean of the values
        # that take that rank within their own sample (row).
        self.ref_dist_ = np.zeros(X.shape[1])
        X_rank = pd.DataFrame(X).rank(axis=1, method='min').astype(int)
        for i in range(X.shape[1]):
            indices = np.where(X_rank == i+1)
            if len(indices[0]) > 0:
                self.ref_dist_[i] = np.mean(X[indices])
            else:
                # Rank missing because of ties: reuse the previous reference value.
                self.ref_dist_[i] = self.ref_dist_[i-1]
        return self
    def transform(self, X, y=None):
        # Replace each value with the reference value for its within-sample rank.
        X_norm = pd.DataFrame(X).rank(axis=1, method='min').values
        for i in range(X.shape[1]):
            X_norm[X_norm == i+1] = self.ref_dist_[i]
        return X_norm
Use the above class to perform quantile normalization. It took nearly a whole day to compute, probably because the implementation is naive (a faster vectorized sketch follows the code below).
q_norm = QuantileNormalization()
X_train_sumfs_qnorm = q_norm.fit_transform(X_train_sumfs)
X_test_sumfs_qnorm = q_norm.transform(X_test_sumfs)
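The per-rank Python loops above are the likely bottleneck. Below is a vectorized sketch (the class name is my own and I have not benchmarked or validated it against the results in this post; it should match the class above up to tie handling). Indexing by rank also avoids the in-place replacement in transform above, where a substituted reference value that happened to equal a later integer rank could be overwritten again.
from sklearn.base import TransformerMixin
import numpy as np
import pandas as pd
class FastQuantileNormalization(TransformerMixin):
    """
    Vectorized quantile normalization (a sketch; matches the class above
    up to tie handling with method='min')
    """
    def fit(self, X, y=None):
        # Reference distribution: mean over samples of the sorted values.
        self.ref_dist_ = np.sort(X, axis=1).mean(axis=0)
        return self
    def transform(self, X, y=None):
        # Map each value to the reference value at its within-sample rank.
        ranks = pd.DataFrame(X).rank(axis=1, method='min').astype(int).values
        return self.ref_dist_[ranks - 1]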
Look at the density distribution of the normalized gene expression profiles and confirm that the distributions now coincide across samples.
random_rows = np.random.choice(range(X_train_sumfs_qnorm.shape[0]), 10)
pd.DataFrame(X_train_sumfs_qnorm[random_rows, :]).T.plot.density(fontsize=10)
plt.savefig('gtex_qnorm_density_random10sample.png', dpi=150)
plt.show()
Let's train and predict with the quantile-normalized data.
knn.fit(X_train_sumfs_qnorm, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs_qnorm, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs_qnorm, y_test))
Test accuracy increased by about 0.6%.
>>> print('Training accuracy: ', knn.score(X_train_sumfs_qnorm, y_train))
Training accuracy: 0.956078824315
>>> print('Test accuracy: ', knn.score(X_test_sumfs_qnorm, y_test))
Test accuracy: 0.915855083755
Next, let's standardize the quantile-normalized data before training & prediction. Refer to Python Machine Learning Programming 4.4, Aligning Feature Scales (P106).
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X_train_sumfs_qnorm_std = stdsc.fit_transform(X_train_sumfs_qnorm)
X_test_sumfs_qnorm_std = stdsc.transform(X_test_sumfs_qnorm)
knn.fit(X_train_sumfs_qnorm_std, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_std, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_std, y_test))
>>> print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_std, y_train))
Training accuracy: 0.951736806947
>>> print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_std, y_test))
Test accuracy: 0.906505648617
In this case, standardization seems to have reduced the accuracy. Standardizing genes with very small variance can blow tiny fluctuations up into large values, which may be the reason for the drop.
Instead of standardization, try normalization (scaling to the range 0 to 1) before training & prediction. Refer to Python Machine Learning Programming 4.4, Aligning Feature Scales (P105).
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train_sumfs_qnorm_norm = mms.fit_transform(X_train_sumfs_qnorm)
X_test_sumfs_qnorm_norm = mms.transform(X_test_sumfs_qnorm)
knn.fit(X_train_sumfs_qnorm_norm, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_norm, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_norm, y_test))
>>> print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_norm, y_train))
Training accuracy: 0.952571810287
>>> print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_norm, y_test))
Test accuracy: 0.910011686794
In this case, normalization also seems to have reduced the accuracy, perhaps because it amplifies the influence of genes with little variation.
Removing low-variance genes in advance may make standardization more effective. Create a class that removes features with small variance, as shown below.
from sklearn.base import TransformerMixin
import numpy as np
class GexVarFS(TransformerMixin):
"""
Feature selection based on the variance of gene expression values
"""
def __init__(self, var):
self.var = var
def fit(self, X, y=None):
self.indices_ = np.where(np.var(X, axis=0) > self.var)[0]
return self
def transform(self, X, y=None):
return X[:, self.indices_]
Using the above class, try training & prediction after feature selection (genes with variance of 1 or less are removed).
varfs = GexVarFS(var = 1)
X_train_sumfs_qnorm_varfs = varfs.fit_transform(X_train_sumfs_qnorm)
X_test_sumfs_qnorm_varfs = varfs.transform(X_test_sumfs_qnorm)
knn.fit(X_train_sumfs_qnorm_varfs, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_varfs, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_varfs, y_test))
The number of genes decreased and the computation time dropped significantly, but as shown below, this feature selection slightly reduced the accuracy.
>>> print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_varfs, y_train))
Training accuracy: 0.957915831663
>>> print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_varfs, y_test))
Test accuracy: 0.913128165173
Next, standardize the variance-filtered data and train & predict.
X_train_sumfs_qnorm_varfs_std = stdsc.fit_transform(X_train_sumfs_qnorm_varfs)
X_test_sumfs_qnorm_varfs_std = stdsc.transform(X_test_sumfs_qnorm_varfs)
knn.fit(X_train_sumfs_qnorm_varfs_std, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_varfs_std, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_varfs_std, y_test))
In this case, standardization increased the accuracy. Moreover, with feature selection plus standardization, the accuracy is slightly higher than before feature selection.
>>> print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_varfs_std, y_train))
Training accuracy: 0.958750835003
>>> print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_varfs_std, y_test))
Test accuracy: 0.918971562135
In the end, the test accuracy is about 92%, an improvement of about 1% over using the log-transformed RPKM values as-is. Better results might be obtained with dimensionality reduction by principal component analysis, or by switching the learning method to something like an SVM.
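As an untested sketch of that suggestion (the number of components and the SVM parameters below are arbitrary, untuned choices):
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
# PCA for dimensionality reduction, then an RBF-kernel SVM.
pipe = Pipeline([
    ('pca', PCA(n_components=100)),
    ('svc', SVC(kernel='rbf', C=1.0)),
])
pipe.fit(X_train_sumfs_qnorm_varfs_std, y_train)
print('Test accuracy: ', pipe.score(X_test_sumfs_qnorm_varfs_std, y_test))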
Finally, let's check the accuracy for each tissue.
y_pred = knn.predict(X_test_sumfs_qnorm_varfs_std)
inv_dict = dict(zip(y, class_le.inverse_transform(y)))
for i in np.unique(label):
accuracy_score = np.sum((y_pred[y_test == i] == i).astype(int)) / len(y_pred[y_test == i])
print(inv_dict[i] + ': ' + str(accuracy_score))
Adipose - Subcutaneous: 1.0
Adipose - Visceral (Omentum): 0.984126984127
Adrenal Gland: 1.0
Artery - Aorta: 0.985294117647
Artery - Coronary: 0.911764705882
Artery - Tibial: 0.988636363636
Bladder: 1.0
Brain - Amygdala: 0.95
Brain - Anterior cingulate cortex (BA24): 0.76
Brain - Caudate (basal ganglia): 0.709677419355
Brain - Cerebellar Hemisphere: 0.838709677419
Brain - Cerebellum: 0.883720930233
Brain - Cortex: 0.864864864865
Brain - Frontal Cortex (BA9): 0.53125
Brain - Hippocampus: 0.8
Brain - Hypothalamus: 0.964285714286
Brain - Nucleus accumbens (basal ganglia): 0.705882352941
Brain - Putamen (basal ganglia): 0.344827586207
Brain - Spinal cord (cervical c-1): 0.928571428571
Brain - Substantia nigra: 0.666666666667
Breast - Mammary Tissue: 0.823529411765
Cells - EBV-transformed lymphocytes: 1.0
Cells - Transformed fibroblasts: 1.0
Cervix - Ectocervix: 1.0
Cervix - Endocervix: 0.0
Colon - Sigmoid: 0.725
Colon - Transverse: 0.796875
Esophagus - Gastroesophageal Junction: 0.509803921569
Esophagus - Mucosa: 1.0
Esophagus - Muscularis: 0.876543209877
Fallopian Tube: 0.5
Heart - Atrial Appendage: 0.98275862069
Heart - Left Ventricle: 1.0
Kidney - Cortex: 1.0
Liver: 1.0
Lung: 1.0
Minor Salivary Gland: 1.0
Muscle - Skeletal: 1.0
Nerve - Tibial: 1.0
Ovary: 1.0
Pancreas: 1.0
Pituitary: 1.0
Prostate: 1.0
Skin - Not Sun Exposed (Suprapubic): 0.807692307692
Skin - Sun Exposed (Lower leg): 0.925619834711
Small Intestine - Terminal Ileum: 0.96
Spleen: 1.0
Stomach: 0.903225806452
Testis: 1.0
Thyroid: 1.0
Uterus: 0.95
Vagina: 0.909090909091
Whole Blood: 1.0
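Incidentally, the per-tissue values above are per-class recall, so scikit-learn's classification_report should produce an equivalent summary in one call:
from sklearn.metrics import classification_report
# The recall column corresponds to the per-tissue accuracy computed above.
print(classification_report(y_test, y_pred, target_names=class_le.classes_))
For reference, the full script is shown below.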
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.cross_validation import train_test_split
from sklearn.base import TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
### read gene expression matrix
usecols = [0] + list(range(2,8557))
df1 = pd.read_csv('GTEx_Analysis_v6p_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct', sep='\t', skiprows=2, usecols=usecols, index_col=0)
df1pc = df1 + 1
df1log = np.log2(df1pc)
### plot gene expression profiles
random_cols = np.random.choice(8555, 10)
df1log.iloc[:, random_cols].plot.density(fontsize=10)
plt.savefig('gtex_log_rpkm_density_random10sample.png', dpi=150)
#plt.show()
### read class labels
usecols = [0, 6]
df2 = pd.read_csv('GTEx_Data_V6_Annotations_SampleAttributesDS.txt', sep='\t', usecols=usecols, index_col=0)
### attach class labels
df = pd.concat([df1log.T, df2], axis=1, join_axes=[df1log.T.index])
### convert string labels
class_le = LabelEncoder()
label = class_le.fit_transform(df['SMTSD'].values)
label_r = class_le.inverse_transform(label)
df['SMTSD'] = label
### split into training and test data
smtsd_index = len(df.columns) - 1
X, y = df.iloc[:, :smtsd_index].values, df.iloc[:, smtsd_index].values
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.3, random_state=0)
### save memory usage
del df
del df1
del df1pc
del df1log
del df2
### feature selection based on gene expression values
class GexSumFS(TransformerMixin):
"""
Feature selection based on the sum of gene expression values
"""
def __init__(self, sum):
self.sum = sum
def fit(self, X, y=None):
self.indices_ = np.where(np.sum(X, axis=0) > self.sum)[0]
return self
def transform(self, X, y=None):
return X[:, self.indices_]
sumfs = GexSumFS(sum = 500)
X_train_sumfs = sumfs.fit_transform(X_train)
X_test_sumfs = sumfs.transform(X_test)
### quantile normalization
class QuantileNormalization(TransformerMixin):
"""
Quantile normalization
"""
def __init__(self):
pass
def fit(self, X, y=None):
self.ref_dist_ = np.zeros(X.shape[1])
X_rank = pd.DataFrame(X).rank(axis=1, method='min').astype(int)
for i in range(X.shape[1]):
indices = np.where(X_rank == i+1)
if len(indices[0]) > 0:
self.ref_dist_[i] = np.mean(X[indices])
else:
self.ref_dist_[i] = self.ref_dist_[i-1]
return self
def transform(self, X, y=None):
X_norm = pd.DataFrame(X).rank(axis=1, method='min').values
for i in range(X.shape[1]):
X_norm[X_norm == i+1] = self.ref_dist_[i]
return X_norm
q_norm = QuantileNormalization()
X_train_sumfs_qnorm = q_norm.fit_transform(X_train_sumfs)
X_test_sumfs_qnorm = q_norm.transform(X_test_sumfs)
random_rows = np.random.choice(range(X_train_sumfs_qnorm.shape[0]), 10)
pd.DataFrame(X_train_sumfs_qnorm[random_rows, :]).T.plot.density(fontsize=10)
plt.savefig('gtex_qnorm_density_random10sample.png', dpi=150)
#plt.show()
### feature selection based on gene expression variance
class GexVarFS(TransformerMixin):
"""
Feature selection based on the variance of gene expression values
"""
def __init__(self, var):
self.var = var
def fit(self, X, y=None):
self.indices_ = np.where(np.var(X, axis=0) > self.var)[0]
return self
def transform(self, X, y=None):
return X[:, self.indices_]
varfs = GexVarFS(var = 1)
X_train_sumfs_qnorm_varfs = varfs.fit_transform(X_train_sumfs_qnorm)
X_test_sumfs_qnorm_varfs = varfs.transform(X_test_sumfs_qnorm)
### standardization
stdsc = StandardScaler()
X_train_sumfs_qnorm_varfs_std = stdsc.fit_transform(X_train_sumfs_qnorm_varfs)
X_test_sumfs_qnorm_varfs_std = stdsc.transform(X_test_sumfs_qnorm_varfs)
### learning and prediction
knn = KNeighborsClassifier(n_neighbors=3, n_jobs=4)
knn.fit(X_train_sumfs_qnorm_varfs_std, y_train)
print('Training accuracy: ', knn.score(X_train_sumfs_qnorm_varfs_std, y_train))
print('Test accuracy: ', knn.score(X_test_sumfs_qnorm_varfs_std, y_test))
### accuracy score for each class
y_pred = knn.predict(X_test_sumfs_qnorm_varfs_std)
inv_dict = dict(zip(y, class_le.inverse_transform(y)))
for i in np.unique(label):
accuracy_score = np.sum((y_pred[y_test == i] == i).astype(int)) / len(y_pred[y_test == i])
print(inv_dict[i] + ': ' + str(accuracy_score))