Learning Python with ChemTHEATER 03

Part 3 Principal component analysis

Chap.0 Overall flow

In Part 3, principal component analysis is performed. Principal component analysis summarizes many explanatory variables 1 into a smaller number of synthetic variables 2 that serve as indicators. In other words, by statistically summarizing the data, it becomes possible to grasp the overall tendency of multidimensional data that are difficult to understand intuitively. For example, height and weight are two variables that indicate body shape, and they can be combined into the single variable BMI.
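
To make the BMI analogy concrete, here is a minimal sketch (with made-up height and weight values) of collapsing two variables into a single index:

```python
import pandas as pd

# Hypothetical body-shape data: two explanatory variables per person
body = pd.DataFrame({"height_m": [1.60, 1.75, 1.82],
                     "weight_kg": [52.0, 68.0, 90.0]})

# BMI = weight (kg) / height (m)^2 : two variables combined into one index
body["BMI"] = body["weight_kg"] / body["height_m"] ** 2
print(body)
```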

In the case of ChemTHEATER data, the concentrations of several chemical substances are often measured for one sample. That is, there are multiple explanatory variables for each sample. This time, we aim to get an overview of the data by performing principal component analysis so as not to damage the information of each explanatory variable as much as possible.

img03.png

Chap.1 Loading libraries

%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import sklearn
from sklearn.decomposition import PCA

First, load the required libraries. The first line is, as usual, a magic command that displays graphs drawn by matplotlib inside the Jupyter Notebook.
The second and subsequent lines import the required libraries, all of which are installed with Anaconda.
scikit-learn, which is used for the principal component analysis this time, is an open-source 3 machine learning library that also implements various other algorithms such as classification 4, regression 4, and clustering 4.

| Library | Overview | Purpose of use this time | Official URL |
|---|---|---|---|
| NumPy | Numerical calculation library | Used for numerical calculation in statistical processing | https://www.numpy.org |
| pandas | Data analysis library | Used for data reading and formatting | https://pandas.pydata.org |
| Matplotlib | Graph drawing library | Used for data visualization | https://matplotlib.org |
| scikit-learn | Machine learning library | Used to implement principal component analysis (PCA) | https://scikit-learn.org |

Chap.2 Reading data

Now that the necessary libraries have been loaded in Chap.1, prepare the data to be handled this time. The data handled this time cover three species: finless porpoise ( Neophocaena phocaenoides ), striped dolphin ( Stenella coeruleoalba ), and melon-headed whale ( Peponocephala electra ). For each species, download the measured data and samples from ChemTHEATER's Sample Search and load them in the same way as in Part 1 and Part 2.

  1. Select "Sample Search" from the menu bar of ChemTHEATER.
  2. Select "Biotic --Mammals --Marine mammals" in "Sample Type".
  3. Select "Neophocaena phocaenoides" from "Scientific Name".
  4. Click the "Search" button to output a list of samples that meet the conditions.
  5. "Export samples" outputs the sample information, and "Export measured data" outputs the measured values of the target chemical substance as a tab-delimited text file.
  6. Repeat steps 3 to 5 for the other two species.

After successfully reading each of the three files, combine the three measureddata files and the three samples files so that they can be used in later processing. Pandas' concat function is convenient here. If the parameter ignore_index is set to True, the row index of the concatenated Pandas DataFrame 5 is renumbered.

data_file1 = "measureddata_20191004033314.tsv"    #Change the string entered in the variable to the tsv file name of your measured data
data_file2 = "measureddata_20191004033325.tsv"    #Change the string entered in the variable to the tsv file name of your measured data
data_file3 = "measureddata_20191004033338.tsv"    #Change the string entered in the variable to the tsv file name of your measured data

data1 = pd.read_csv(data_file1, delimiter="\t")
data2 = pd.read_csv(data_file2, delimiter="\t")
data3 = pd.read_csv(data_file3, delimiter="\t")

data = pd.concat([data1, data2, data3], ignore_index=True)    #Combine with Pandas concat function
data
MeasuredID ProjectID SampleID ScientificName ChemicalID ChemicalName ExperimentID MeasuredValue AlternativeData ...
0 80 PRA000002 SAA000087 Neophocaena phocaenoides CH0000154 TBT EXA000001 170.0000 NaN ...
1 81 PRA000002 SAA000087 Neophocaena phocaenoides CH0000155 DBT EXA000001 220.3591 NaN ...
2 82 PRA000002 SAA000087 Neophocaena phocaenoides CH0000156 MBT EXA000001 44.5445 NaN ...
3 83 PRA000002 SAA000087 Neophocaena phocaenoides CH0000157 ΣBTs EXA000001 434.9036 NaN ...
4 84 PRA000002 SAA000087 Neophocaena phocaenoides CH0000158 TPT EXA000001 12.9220 NaN ...
... ... ... ... ... ... ... ... ... ... ...
9774 24272 PRA000036 SAA002163 Peponocephala electra CH0000151 cis-nonachlor EXA000001 170.0000 NaN ...
9775 24056 PRA000036 SAA002163 Peponocephala electra CH0000152 ΣCHLs EXA000001 1100.0000 NaN ...
9776 24376 PRA000036 SAA002163 Peponocephala electra CH0000153 HCB EXA000001 120.0000 NaN ...
9777 24646 PRA000036 SAA002163 Peponocephala electra CH0000533 TCPMe EXA000001 6.1000 NaN ...
9778 24700 PRA000036 SAA002163 Peponocephala electra CH0000534 TCPMOH EXA000001 16.0000 NaN ...

9779 rows × 13 columns

sample_file1 = "samples_20191004033311.tsv"    #Change the string entered in the variable to the tsv file name of your samples
sample_file2 = "samples_20191004033323.tsv"    #Change the string entered in the variable to the tsv file name of your samples
sample_file3 = "samples_20191004033334.tsv"    #Change the string entered in the variable to the tsv file name of your samples

sample1 = pd.read_csv(sample_file1, delimiter="\t")    
sample2 = pd.read_csv(sample_file2, delimiter="\t")
sample3 = pd.read_csv(sample_file3, delimiter="\t")

sample = pd.concat([sample1, sample2, sample3], ignore_index=True)    #Combine with Pandas concat function
sample
ProjectID SampleID SampleType TaxonomyID UniqCodeType UniqCode SampleName ScientificName CommonName CollectionYear ...
0 PRA000002 SAA000087 ST004 34892 es-BANK EW00884 NaN Neophocaena phocaenoides Finless porpoises 1996 ...
1 PRA000002 SAA000088 ST004 34892 es-BANK EW00812 NaN Neophocaena phocaenoides Finless porpoises 1999 ...
2 PRA000002 SAA000089 ST004 34892 es-BANK EW00873 NaN Neophocaena phocaenoides Finless porpoises 1995 ...
3 PRA000002 SAA000090 ST004 34892 es-BANK EW04787 NaN Neophocaena phocaenoides Finless porpoises 2000 ...
4 PRA000002 SAA000091 ST004 34892 es-BANK EW00867 NaN Neophocaena phocaenoides Finless porpoises 1998 ...
... ... ... ... ... ... ... ... ... ... ... ...
312 PRA000036 SAA002159 ST004 103596 es-BANK EW04779 060301-1 Peponocephala electra Melon-headed whale 2006 ...
313 PRA000036 SAA002160 ST004 103596 es-BANK EW00115 M32625 Peponocephala electra Melon-headed whale 2001 ...
314 PRA000036 SAA002161 ST004 103596 es-BANK EW00122 M32633 Peponocephala electra Melon-headed whale 2001 ...
315 PRA000036 SAA002162 ST004 103596 es-BANK EW00116 M32626 Peponocephala electra Melon-headed whale 2001 ...
316 PRA000036 SAA002163 ST004 103596 es-BANK EW00117 M32627 Peponocephala electra Melon-headed whale 2001 ...

317 rows × 66 columns

Chap.3 Data preparation

After reading the data, the next step is to prepare it for analysis. First, extract the data to be used this time from measureddata.
This time, we will use five chemical substance groups (ΣPCBs, ΣDDTs, ΣPBDEs, ΣCHLs, ΣHCHs) measured in the three species: finless porpoise, striped dolphin, and melon-headed whale.

First, extract the data whose unit is ng/g lipid. This is because, as mentioned in Part 1, data with different units cannot simply be compared.
Next, extract the data whose chemical substance name is one of the above five groups, and use the reset_index method to renumber the rows.

data_lipid = data[(data["Unit"] == "ng/g lipid")]
data_lipid = data_lipid[(data_lipid["ChemicalName"] == "ΣPCBs") | (data_lipid["ChemicalName"] == "ΣDDTs") |
                        (data_lipid["ChemicalName"] == "ΣPBDEs") |  (data_lipid["ChemicalName"] == "ΣCHLs") |
                        (data_lipid["ChemicalName"] == "ΣHCHs")]
data_lipid = data_lipid.reset_index(drop=True)
data_lipid[data_lipid["SampleID"] == "SAA001941"]    #Data list with Sample ID SAA001941
MeasuredID ProjectID SampleID ScientificName ChemicalID ChemicalName ExperimentID MeasuredValue AlternativeData ...
115 25112 PRA000030 SAA001941 Neophocaena phocaenoides CH0000033 ΣDDTs EXA000001 9400.0 NaN ...
116 25098 PRA000030 SAA001941 Neophocaena phocaenoides CH0000096 ΣPCBs EXA000001 1100.0 NaN ...
117 25140 PRA000030 SAA001941 Neophocaena phocaenoides CH0000146 ΣHCHs EXA000001 41.0 NaN ...
118 25126 PRA000030 SAA001941 Neophocaena phocaenoides CH0000152 ΣCHLs EXA000001 290.0 NaN ...

After extracting the necessary measureddata records, the next step is to merge the samples and measureddata DataFrames.
In Part 1 and Part 2, sample information (scientific name, collection site, etc.) was attached to the measured values of each chemical substance. This time, in contrast, the measured values of the chemical substances are attached to each sample, so note that the processing differs from before.

img08.png

#Create a DataFrame with only SampleID and ScientificName
df = sample[["SampleID", "ScientificName"]]
for chem in ["ΣCHLs", "ΣDDTs", "ΣHCHs", "ΣPBDEs", "ΣPCBs"]: #Loop over the chemical substance names
    #Create a DataFrame containing only the current chemical
    data_lipid_chem = data_lipid[data_lipid["ChemicalName"] == chem]
    #Keep only SampleID and MeasuredValue, and rename the "MeasuredValue" column to the chemical substance name
    data_lipid_chem = data_lipid_chem[["SampleID", "MeasuredValue"]].rename(columns={"MeasuredValue": chem})
    data_lipid_chem = data_lipid_chem.drop_duplicates(subset='SampleID') #Drop duplicate SampleIDs
    #Merge df (SampleID + scientific name) with data_lipid_chem (SampleID + measured value)
    df = pd.merge(df, data_lipid_chem, on="SampleID")
df = df.dropna(how="any") #Remove rows containing NaN
df = df.drop("SampleID", axis=1) #Drop the SampleID column, which is no longer needed
df
df
ScientificName ΣCHLs ΣDDTs ΣHCHs ΣPBDEs ΣPCBs
0 Neophocaena phocaenoides 770.0 68000.0 1100.0 170.0 5700.0
1 Neophocaena phocaenoides 1200.0 140000.0 3500.0 120.0 6500.0
2 Neophocaena phocaenoides 1000.0 140000.0 5500.0 86.0 5500.0
3 Neophocaena phocaenoides 950.0 130000.0 6600.0 100.0 5800.0
4 Neophocaena phocaenoides 1400.0 280000.0 6100.0 140.0 11000.0
... ... ... ... ... ... ...
89 Peponocephala electra 3000.0 15000.0 170.0 260.0 12000.0
90 Peponocephala electra 5100.0 23000.0 380.0 490.0 19000.0
91 Peponocephala electra 5700.0 33000.0 240.0 300.0 25000.0
92 Peponocephala electra 2800.0 12000.0 220.0 230.0 9300.0
93 Peponocephala electra 5700.0 27000.0 240.0 180.0 21000.0

94 rows × 6 columns

As shown above, in the current state the scientific name is stored in the DataFrame as a string. Since the species name will be used for visualization later, prepare a numeric variable representing each scientific name for convenience. First, create a dictionary that associates each scientific name with a value (finless porpoise: 0, striped dolphin: 1, melon-headed whale: 2).

species = df["ScientificName"].unique().tolist()    #Output a list of scientific names in df
class_dic = {}
for i in range(len(species)):
    class_dic[species[i]] = i
class_dic
{'Neophocaena phocaenoides': 0,
 'Stenella coeruleoalba': 1,
 'Peponocephala electra': 2}

Next, add a Class column to df and fill in each row of the Class column with the value corresponding to the scientific name.

```python
df["Class"] = 0
for irow in range(len(df)):
    df.at[irow, "Class"] = class_dic[df.at[irow, "ScientificName"]] #Look up the class value for each row's scientific name
df = df.loc[:, ["Class", "ScientificName", "ΣPCBs", "ΣDDTs", "ΣPBDEs", "ΣCHLs", "ΣHCHs"]] #Reorder the columns
df
```

Class ScientificName ΣPCBs ΣDDTs ΣPBDEs ΣCHLs ΣHCHs
0 0 Neophocaena phocaenoides 5700.0 68000.0 170.0 770.0 1100.0
1 0 Neophocaena phocaenoides 6500.0 140000.0 120.0 1200.0 3500.0
2 0 Neophocaena phocaenoides 5500.0 140000.0 86.0 1000.0 5500.0
3 0 Neophocaena phocaenoides 5800.0 130000.0 100.0 950.0 6600.0
4 0 Neophocaena phocaenoides 11000.0 280000.0 140.0 1400.0 6100.0
... ... ... ... ... ... ... ...
89 2 Peponocephala electra 12000.0 15000.0 260.0 3000.0 170.0
90 2 Peponocephala electra 19000.0 23000.0 490.0 5100.0 380.0
91 2 Peponocephala electra 25000.0 33000.0 300.0 5700.0 240.0
92 2 Peponocephala electra 9300.0 12000.0 230.0 2800.0 220.0
93 2 Peponocephala electra 21000.0 27000.0 180.0 5700.0 240.0

94 rows × 7 columns

Since a pivot table turns out to be easier to work with, rebuild the DataFrame from data_lipid as follows.

df = pd.pivot_table(data_lipid, index=['ScientificName', 'SampleID'], columns='ChemicalName', values=['MeasuredValue'])
df = df.dropna(how="any") #Remove rows containing NaN
df
MeasuredValue
ChemicalName ΣCHLs ΣDDTs ΣHCHs ΣPBDEs ΣPCBs
ScientificName SampleID
Neophocaena phocaenoides SAA001903 770.0 68000.0 1100.0 170.0 5700.0
SAA001904 1200.0 140000.0 3500.0 120.0 6500.0
SAA001905 1000.0 140000.0 5500.0 86.0 5500.0
SAA001906 950.0 130000.0 6600.0 100.0 5800.0
SAA001907 1400.0 280000.0 6100.0 140.0 11000.0
... ... ... ... ... ... ...
Stenella coeruleoalba SAA001984 730.0 3500.0 40.0 84.0 3200.0
SAA001985 7300.0 44000.0 300.0 530.0 26000.0
SAA001986 12000.0 88000.0 490.0 850.0 42000.0
SAA001987 9100.0 56000.0 380.0 650.0 32000.0
SAA001988 11000.0 67000.0 520.0 610.0 36000.0

94 rows × 5 columns

From this point onward, the analysis is performed using this pivot-table DataFrame.

Chap.4 Visualization of data overview

Now that the data preparation is complete, let's first visualize and check the data as a whole. The data handled this time are five-dimensional (there are five explanatory variables). Since humans cannot perceive five-dimensional space, it is impossible to visualize the distribution of the five-dimensional data as it is. Here, scatter plots for all ${}_5 C_2 = 10$ pairs of variables are generated.
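
Incidentally, the number of pairwise plots comes from the number of 2-variable combinations; a quick check with Python's standard library:

```python
from math import comb

print(comb(5, 2))    # 10 scatter plots for 5 variables taken 2 at a time
```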

pandas' pandas.plotting.scatter_matrix() function is convenient for visualizing scatter plots of multiple variables. In the scatter_matrix() function, the data to be visualized are specified with the frame parameter; this time, all rows of the "MeasuredValue" columns of df are passed. The points are colored by the values in column 0 6 of those data, given to the c parameter.

pd.plotting.scatter_matrix(
    frame=df["MeasuredValue"],
    c=list(df["MeasuredValue"].iloc[:, 0]),
    alpha=0.5)
plt.show()

output_03_01.png

Chap.5 Principal component analysis

In Chap.4, scatter plots of all ${}_5 C_2 = 10$ variable pairs were generated as an overview of the entire dataset. However, these only show the distribution of the data for each combination of two chemical substances; in other words, they do not give an overview of the whole five-dimensional dataset. Therefore, the five-dimensional data are summarized and visualized by principal component analysis.

First, since each variable currently has a different scale, standardize 7 each variable so that all five axes have the same scale.

dfs = df.apply(lambda x: (x-x.mean())/x.std(), axis=0)
dfs
MeasuredValue
ChemicalName ΣCHLs ΣDDTs ΣHCHs ΣPBDEs ΣPCBs
ScientificName SampleID
Neophocaena phocaenoides SAA001903 -0.743053 0.320011 0.179064 -0.308268 -0.768856
SAA001904 -0.588580 1.574027 1.297534 -0.528241 -0.693375
SAA001905 -0.660428 1.574027 2.229592 -0.677822 -0.787727
SAA001906 -0.678390 1.399859 2.742224 -0.616230 -0.759421
SAA001907 -0.516732 4.012392 2.509209 -0.440252 -0.268792
... ... ... ... ... ... ...
Stenella coeruleoalba SAA001984 -0.757422 -0.803378 -0.314927 -0.686621 -1.004736
SAA001985 1.602782 -0.097994 -0.193759 1.275535 1.146484
SAA001986 3.291208 0.668349 -0.105213 2.683360 2.656112
SAA001987 2.249413 0.111009 -0.156477 1.803469 1.712595
SAA001988 2.931969 0.302594 -0.091233 1.627491 2.090001

94 rows × 5 columns

Alternatively, you can use scikit-learn's StandardScaler for standardization.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df) #Learn the mean and standard deviation of each column
dfs = pd.DataFrame(scaler.transform(df), columns=df.columns) #Apply the standardization
dfs
MeasuredValue
ChemicalName ΣCHLs ΣDDTs ΣHCHs ΣPBDEs ΣPCBs
0 -0.747037 0.321727 0.180024 -0.309921 -0.772979
1 -0.591736 1.582467 1.304491 -0.531073 -0.697093
2 -0.663969 1.582467 2.241547 -0.681457 -0.791950
3 -0.682027 1.407365 2.756927 -0.619534 -0.763493
4 -0.519503 4.033907 2.522663 -0.442612 -0.270233
... ... ... ... ... ...
89 -0.761484 -0.807686 -0.316615 -0.690303 -1.010123
90 1.611376 -0.098520 -0.194798 1.282374 1.152631
91 3.308856 0.671933 -0.105778 2.697748 2.670354
92 2.261475 0.111604 -0.157316 1.813139 1.721777
93 2.947690 0.304217 -0.091722 1.636218 2.101208

94 rows × 5 columns

After standardization, perform the principal component analysis. Here, the StandardScaler result is used.
For the principal component analysis, scikit-learn's sklearn.decomposition.PCA() is used.
In principal component analysis, a synthetic variable (the first principal component) is first created so that its axis points in the direction that maximizes the variance of the data. Then the second, third, and subsequent principal components are created as further synthetic variables that capture, in turn, the variation not explained by the preceding components. After the data have been transferred into the space spanned by these principal components, all that remains is interpretation.

pca = PCA()
pca.fit(dfs)    #Principal component analysis (fit)
feature = pca.transform(dfs)   #Transfer the data into the principal component space

pd.DataFrame(feature, columns=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
PC1 PC2 PC3 PC4 PC5
0 -0.931614 0.574985 0.348535 -0.188477 -0.178934
1 -0.525203 2.177125 0.038871 -0.242288 -0.498098
2 -0.700197 2.892482 -0.248495 0.251040 -0.282242
3 -0.720222 3.155379 -0.366880 0.634978 -0.017083
4 0.554487 4.585518 0.039662 -0.952484 -0.965309
... ... ... ... ... ...
89 -1.607963 -0.471519 0.077579 0.103413 -0.068820
90 2.190969 -0.704970 0.009171 0.563020 0.053168
91 4.974287 -0.699630 0.023300 0.765578 0.013150
92 3.218248 -0.753396 -0.005256 0.688830 0.065009
93 3.804695 -0.750210 -0.513720 0.727646 -0.224109

94 rows × 5 columns

Let's also find the eigenvectors. An eigenvector can be thought of as the "force that tilts the axes" applied when the original data are transferred into the principal component space.

In Python, the eigenvectors are available as the components_ attribute of scikit-learn's PCA.

pd.DataFrame(pca.components_, columns=df.columns, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))]) #df.columns now holds exactly the five chemical columns
MeasuredValue
ChemicalName ΣCHLs ΣDDTs ΣHCHs ΣPBDEs ΣPCBs
PC1 0.572482 0.307088 -0.005288 0.488078 0.582848
PC2 -0.212874 0.641425 0.730929 -0.048824 -0.081347
PC3 -0.328297 0.180897 -0.249982 0.782267 -0.430192
PC4 0.427788 -0.529630 0.566847 0.279604 -0.370130
PC5 -0.579799 -0.425488 0.286195 0.263203 0.575856
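
To make the "force that tilts the axes" image concrete, scikit-learn's transform() is simply a projection: it subtracts the fitted mean and multiplies by the transposed eigenvector matrix. A minimal sketch, assuming the dfs and feature objects created above, confirms this:

```python
import numpy as np

# transform() projects the centered data onto the eigenvector axes
projected = (dfs.values - pca.mean_) @ pca.components_.T
print(np.allclose(projected, feature))    # expected: True
```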

Now that we have the eigenvectors, let's find the eigenvalue of each principal component. An eigenvalue is a coefficient associated with its eigenvector and is an index of the magnitude of each principal component axis.

In Python, the eigenvalues can be obtained from the explained_variance_ attribute of scikit-learn's PCA.

pd.DataFrame(pca.explained_variance_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
0
PC1 2.301828
PC2 1.526633
PC3 0.681585
PC4 0.450884
PC5 0.092834

Next, find the contribution rate. The contribution rate is an index of how much of the information in the data each principal component axis explains. In other words, a principal component with a contribution rate of 60% explains 60% of the whole dataset, while 40% of the information is lost.

In Python, the contribution rate can be obtained from the explained_variance_ratio_ attribute of scikit-learn's PCA.

pd.DataFrame(pca.explained_variance_ratio_, index=["PC{}".format(x + 1) for x in range(len(dfs.columns))])
0
PC1 0.455468
PC2 0.302078
PC3 0.134867
PC4 0.089218
PC5 0.018369
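
As a cross-check, since all five components are retained here, each contribution rate is simply the corresponding eigenvalue divided by the sum of all eigenvalues; the following sketch should reproduce the table above:

```python
# Eigenvalues normalized by their sum give the contribution rates
print(pca.explained_variance_ / pca.explained_variance_.sum())
```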

Here, the cumulative contribution rate is also calculated. The cumulative contribution rate is needed to see what percentage of the information in the data is lost. In other words, by adopting principal components up to the point where the cumulative contribution rate exceeds a certain level, the number of dimensions can be reduced while keeping an overview of the entire dataset.

Here, the cumulative sum is calculated using NumPy's cumsum function.

plt.plot([0] + list( np.cumsum(pca.explained_variance_ratio_)), "-o")
plt.xlabel("Number of principal components")
plt.ylabel("Cumulative contribution rate")
plt.grid()
plt.show()

output_03_02.png

It is not obvious at first glance what each generated principal component axis means. However, by visualizing how much each variable contributes to each principal component, the meaning of the components becomes apparent. Here, the contributions of the variables to the first and second principal components are plotted on a scatter plot.

```python
plt.figure()
for x, y, name in zip(pca.components_[0], pca.components_[1], df["MeasuredValue"].columns[:]):
    plt.text(x, y, name)    #Label each point with its chemical substance name
plt.scatter(pca.components_[0], pca.components_[1], alpha=0.8)
plt.grid()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```

output_03_03.png

Finally, visualize the result of the principal component analysis. In general, principal components with eigenvalues of 1 or more, or components up to a cumulative contribution rate of 80% or more, are adopted, and the distribution is examined for combinations of the adopted components. Here, assume that components up to the second principal component are adopted and visualize them in a graph. As with an ordinary scatter plot, use matplotlib's scatter function and specify the first principal component (column 0 of feature) and the second principal component (column 1 of feature) as the X and Y parameters, respectively.

plt.figure()
plt.scatter(x=feature[:, 0], y=feature[:, 1], alpha=0.8, c=list(df.iloc[:, 0])) #Color coding of the plot is specified via c
plt.grid()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

output_33_0.png
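
Note that c=list(df.iloc[:, 0]) in the code above colors each point by the value of the first measured column (ΣCHLs). To color by species instead, as with the pre-pivot DataFrame, one possible sketch (assuming the class_dic dictionary from Chap.3 is still available) is:

```python
# Map each sample's scientific name (the first level of the pivot table's row index) to 0/1/2
species_codes = df.index.get_level_values("ScientificName").map(class_dic)

plt.figure()
plt.scatter(x=feature[:, 0], y=feature[:, 1], alpha=0.8, c=list(species_codes))
plt.grid()
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```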

Footnotes

1 An explanatory variable is a variable that explains the objective variable, which is the result of interest. In other words, for an objective variable that represents a result, the variables that cause it are the explanatory variables. Also called an independent variable.

2 A variable created by combining multiple variables. In principal component analysis, each principal component is a synthetic variable.

3 Software or libraries whose source code, the blueprint of the program, is widely available to the public. As long as use stays within the range permitted by the license, anyone can use them free of charge. Also, since anyone can take part in development, updates from the user's perspective and fixes for bugs that only a few people notice tend to happen easily.
Python, Jupyter Notebook, and Anaconda are all open-source software.

4 All are methods used in machine learning. Classification is used for tasks such as image classification, regression for predicting future values such as stock prices, and clustering for grouping similar data based on trends in the data.

5 A data type that stores tabular data in Pandas. Manipulate data by specifying cells, rows, and columns as if you read a CSV or TSV format file in Excel.

6 In Python, numbering starts from 0.

7 The operation of subtracting the mean from each measured value and dividing by the standard deviation: $z = (x - \bar{x}) / s$.
