Learning Python with ChemTHEATER 01

Part1 Basic statistics and their visualization

Overall flow

In Part 1, using the finless porpoise data recorded in ChemTHEATER (https://chem-theatre.com/ ), we analyze the basic statistics that show the basic characteristics of the data, and boxplot. Visualize using figures. img01.png

Loading the Chap.1 library

%matplotlib inline
import numpy as np # as np,As pd etc. set the abbreviation name when calling it programmatically
import pandas as pd #Since it is troublesome to enter numpy etc. each time, it is described like this as a convention
import matplotlib.pyplot as plt
import seaborn as sns

The first line is the setting 1 that the graph output at the time of visualization is displayed in Jupyter Notebook. On the second and subsequent lines, load the library required by this code. Both are libraries installed as standard in Anaconda.
The specific contents of the library are as follows.

Library Overview Purpose of use this time Official URL
NumPy Numerical calculation library Used for numerical calculation in statistical processing https://www.numpy.org
pandas Data analysis library Used for data reading and formatting https://pandas.pydata.org
Matplotlib Graph drawing library Used for data visualization https://matplotlib.org
seaborn Statistical data visualization library Used for data visualization https://seaborn.pydata.org

Reading Chap.2 data

As mentioned above, this time we will deal with finless porpoise ( Neophocaena phocaenoides ) data. Download data related to finless porpoise from ChemTHEATER's Sample Search in TSV format 2 . There are two types of sample search data for ChemTHEATER, measured data and samples, but this time we will use both.

The ChemTHEATER operation manual can be obtained from the ChemTHEATER Wiki site .
The simple operation guide is as follows.

  1. Click "Sample Search" on the menu bar from the top page of ChemTEHATRE.
  2. Select "Biotic --Mammals --Marine mammals" from the "Sample Type" pull-down menu.
  3. Select the scientific name of finless porpoise "Neophocaena phocaenoides" from the pull-down menu of "Scientific Name".
  4. Click the "Search" button.
  5. Click the "Export samples (TSV)" button to export the sample list, and the "Export measured data (TSV)" button to export the measured value list as tab-delimited text files.
  6. Save the exported file in any directory and use it for subsequent analysis.
data_file = "measureddata_20190826061813.tsv"    #Change the string entered in the variable to the tsv file name of your measured data
chem = pd.read_csv(data_file, delimiter="\t")
chem = chem.drop(["ProjectID", "ScientificName", "RegisterDate", "UpdateDate"], axis=1)    #Remove duplicate columns when joining with samples later
sample_file = "samples_20190826061810.tsv"    #Change the string entered in the variable to the tsv file name of your samples
sample = pd.read_csv(sample_file, delimiter="\t")

The downloaded TSV file is read by the pandas function (read_csv). At this time, it is necessary to change the path of each file ("data_file" and "sample_file" in the upper cell) to a file name suitable for each environment. In the sample code, the data file is written in the same directory as the notebook. ## Chap.3 Data preparation

The data handled this time is a compilation of measurement results of various chemical substances. Furthermore, the measurement results and the sample data are scattered. In other words, it is difficult to handle as it is, so it is necessary to shape it into a shape that is easy to handle. Specifically, the following operations are required.

    Combine sample with
  1. chem.
  2. Extract only the measurement data of the chemical substances handled this time.
  3. Delete columns that are not needed for visualization.

Sec.3-1 Extraction of necessary data

First, process 1 and 2 shown above.

Bonding is the work of adding a sample data frame (sample) of a sample of each chemical substance to the data frame (chem) of each chemical substance. Here, the sample data corresponding to the Sample ID of each chem data is attached to the right side of the chem.

df = pd.merge(chem, sample, on="SampleID")

In the data handled this time, each chemical substance is classified in a fairly detailed manner. You may visualize the measurement results for each chemical substance, but this time we want to visualize the outline of the measurement results so that you can see at a glance, so we will extract the data of the total value of each group when the chemical substances are roughly classified. ..

In addition, the data handled this time contains a mixture of two different units. Since it is meaningless to compare different units, this time we will divide the data for each unit. In the code below, Unit is specified in the first line of each, and the lipid and wet conversions are extracted, and then only the variables starting with Σ are extracted. In the second line, variables that start with Σ but are not the total value of chemical concentration are actually deleted from the data.

data_lipid = df[(df["Unit"] == "ng/g lipid") & df["ChemicalName"].str.startswith("Σ")]
data_lipid = data_lipid[(data_lipid["ChemicalName"] != "ΣOH-penta-PCB") & (data_lipid["ChemicalName"] != "ΣOH-hexa-PCB")
                        & (data_lipid["ChemicalName"] != "ΣOH-hepta-PCB") &  (data_lipid["ChemicalName"] != "ΣOH-octa-PCB")]
data_wet = df[(df["Unit"] == "ng/g wet") & df["ChemicalName"].str.startswith("Σ")]
data_wet = data_wet[(data_wet["ChemicalName"] != "ΣOH-penta-PCB") & (data_wet["ChemicalName"] != "ΣOH-hexa-PCB")
                    & (data_wet["ChemicalName"] != "ΣOH-hepta-PCB") &  (data_wet["ChemicalName"] != "ΣOH-octa-PCB")]

Data outline

This is the result of processing with Sec.3-1 (only for those whose unit is ng / g lipid). By deleting unnecessary data, only 131 data for 6 chemical substances could be extracted. ```python data_lipid ```

MeasuredID SampleID ChemicalID ChemicalName ExperimentID MeasuredValue AlternativeData Unit Remarks_x ProjectID ...
4371 24782 SAA001903 CH0000033 ΣDDTs EXA000001 68000.0 NaN ng/g lipid NaN PRA000030 ...
4372 24768 SAA001903 CH0000096 ΣPCBs EXA000001 5700.0 NaN ng/g lipid NaN PRA000030 ...
4381 24754 SAA001903 CH0000138 ΣPBDEs EXA000001 170.0 NaN ng/g lipid NaN PRA000030 ...
4385 24902 SAA001903 CH0000142 ΣHBCDs EXA000001 5.6 NaN ng/g lipid NaN PRA000030 ...
4389 24810 SAA001903 CH0000146 ΣHCHs EXA000001 1100.0 NaN ng/g lipid NaN PRA000030 ...
... ... ... ... ... ... ... ... ... ... ... ...
4812 25125 SAA001940 CH0000152 ΣCHLs EXA000001 160.0 NaN ng/g lipid NaN PRA000030 ...
4816 25112 SAA001941 CH0000033 ΣDDTs EXA000001 9400.0 NaN ng/g lipid NaN PRA000030 ...
4817 25098 SAA001941 CH0000096 ΣPCBs EXA000001 1100.0 NaN ng/g lipid NaN PRA000030 ...
4818 25140 SAA001941 CH0000146 ΣHCHs EXA000001 41.0 NaN ng/g lipid NaN PRA000030 ...
4819 25126 SAA001941 CH0000152 ΣCHLs EXA000001 290.0 NaN ng/g lipid NaN PRA000030 ...

131 rows × 74 columns

Sec.3-2 Delete unnecessary columns

Next, delete unnecessary columns. In general, if the amount of information is the same, it is better to handle a small amount of data, so it is better to delete unnecessary data. In the data frame handled this time, there are many blanks without data. This is treated as N / A (floating point number) on python, and although there is no information, it consumes a certain amount of space. Therefore, all N / A columns (columns without information) are deleted from the data frame. ```python data_lipid = data_lipid.dropna(how='all', axis=1) data_wet = data_wet.dropna(how='all', axis=1) ```

Data outline

This is the result of processing Sec.3-2. It can be seen that the number of columns (the value of columns appearing at the bottom of the table) is considerably smaller than the result of Sec.3-1. ```python data_lipid ```

MeasuredID SampleID ChemicalID ChemicalName ExperimentID MeasuredValue Unit ProjectID SampleType TaxonomyID ...
4371 24782 SAA001903 CH0000033 ΣDDTs EXA000001 68000.0 ng/g lipid PRA000030 ST004 34892 ...
4372 24768 SAA001903 CH0000096 ΣPCBs EXA000001 5700.0 ng/g lipid PRA000030 ST004 34892 ...
4381 24754 SAA001903 CH0000138 ΣPBDEs EXA000001 170.0 ng/g lipid PRA000030 ST004 34892 ...
4385 24902 SAA001903 CH0000142 ΣHBCDs EXA000001 5.6 ng/g lipid PRA000030 ST004 34892 ...
4389 24810 SAA001903 CH0000146 ΣHCHs EXA000001 1100.0 ng/g lipid PRA000030 ST004 34892 ...
... ... ... ... ... ... ... ... ... ... ... ...
4812 25125 SAA001940 CH0000152 ΣCHLs EXA000001 160.0 ng/g lipid PRA000030 ST004 34892 ...
4816 25112 SAA001941 CH0000033 ΣDDTs EXA000001 9400.0 ng/g lipid PRA000030 ST004 34892 ...
4817 25098 SAA001941 CH0000096 ΣPCBs EXA000001 1100.0 ng/g lipid PRA000030 ST004 34892 ...
4818 25140 SAA001941 CH0000146 ΣHCHs EXA000001 41.0 ng/g lipid PRA000030 ST004 34892 ...
4819 25126 SAA001941 CH0000152 ΣCHLs EXA000001 290.0 ng/g lipid PRA000030 ST004 34892 ...

131 rows × 37 columns

Visualization of Chap.4 ΣDDTs

By processing in

Chap.3, we have prepared data for each of the two types of chemical substances. Here, we first visualize the degree of data distribution for ng / g lipids of ΣDDTs. </ p>

Sec.4-1 Data preparation

First, for visualization, ΣDDTs data is extracted from data_lipid. ```python data_ddt = data_lipid[data_lipid["ChemicalName"] == "ΣDDTs"] ddt_vals = data_ddt.loc[:, "MeasuredValue"] data_ddt ```

MeasuredID SampleID ChemicalID ChemicalName ExperimentID MeasuredValue Unit ProjectID SampleType TaxonomyID ...
4371 24782 SAA001903 CH0000033 ΣDDTs EXA000001 68000.0 ng/g lipid PRA000030 ST004 34892 ...
4401 24783 SAA001904 CH0000033 ΣDDTs EXA000001 140000.0 ng/g lipid PRA000030 ST004 34892 ...
4431 24784 SAA001905 CH0000033 ΣDDTs EXA000001 140000.0 ng/g lipid PRA000030 ST004 34892 ...
4461 24785 SAA001906 CH0000033 ΣDDTs EXA000001 130000.0 ng/g lipid PRA000030 ST004 34892 ...
4491 24786 SAA001907 CH0000033 ΣDDTs EXA000001 280000.0 ng/g lipid PRA000030 ST004 34892 ...

5 rows × 37 columns

Sec.4-2 Basic statistics

Since the data was extracted in Sec.4-1, the basic statistics are calculated in order to grasp the characteristics of this data set. Basic statistics are some statistics that summarize a dataset. This time, the total number of data, mean, standard deviation, 1st quartile, median, 3rd quartile, minimum value, and maximum value are calculated. ```python count = ddt_vals.count() mean = ddt_vals.mean() std = ddt_vals.std() q1 = ddt_vals.quantile(0.25) med = ddt_vals.median() q3 = ddt_vals.quantile(0.75) min = ddt_vals.min() max = ddt_vals.max() count, mean, std, q1, med, q3, min, max ```

(24, 102137.5, 81917.43357743236, 41750.0, 70500.0, 140000.0, 9400.0, 280000.0)

In the upper cell, the basic statistics are calculated individually, but pandas has a function that outputs the basic statistics together.

avgs = ddt_vals.describe()
avgs
count        24.000000
mean     102137.500000
std       81917.433577
min        9400.000000
25%       41750.000000
50%       70500.000000
75%      140000.000000
max      280000.000000
Name: MeasuredValue, dtype: float64

Sec.4-3 jitter plot

There is a scatter plot that plots all the data as a way to visualize how the data is scattered (Eguchi: Is this a jitter plot, not a scatter plot?). In Python, you can easily draw a jitter plot by using seaborn's stripplot function. Jitter plot is one of the good visualization methods that can easily describe each heaven by adding fluctuations to the dots even when the dots overlap.

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
sns.stripplot(x="ChemicalName", y="MeasuredValue", data=data_ddt, color='black', ax=ax)  #Drawing of jitter plot
ax.set_ylabel("ng/g lipid")
plt.show()

output_28_0.png

Sec.4-4 Box plot

Unlike Sec.4-3, the boxplot plots a summary of the dataset with basic statistics. Box plots can be drawn with seaborn's boxplot function. Here, by overwriting the boxplot on the jitter plot mentioned earlier, a diagram that makes it easier to understand the distribution of data is output.

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
sns.stripplot(x="ChemicalName", y="MeasuredValue", data=data_lipid[data_lipid["ChemicalName"] == "ΣDDTs"], color='black', ax=ax) #Drawing of jitter plot
sns.boxplot(x="ChemicalName", y="MeasuredValue", data=data_lipid[data_lipid["ChemicalName"] == "ΣDDTs"], ax=ax) #Drawing a box plot
ax.set_ylabel("ng/g lipid")
plt.show()

output_30_0.png

Visualization of each chemical data of finless porpoise ( Neophocaena phocaenoides </ i>)

In Chap.4, only the values of ΣDDTs were visualized. Apply it to visualize data_lipid and data_wet data. ### Sec.5-1 Box plot (application of Sec.4-4)

First, draw a graph for the data of data_lipid in the same way as Sec.4-4. Since the stripplot / boxplot function has a function to automatically classify the data by the column name to be assigned to the parameter x, the collected data can be directly assigned to the parameter data. ```python fig = plt.figure() ax = fig.add_subplot(1,1,1) sns.stripplot(x="ChemicalName", y="MeasuredValue", data=data_lipid, color='black', ax=ax) #Drawing of jitter plot sns.boxplot(x="ChemicalName", y="MeasuredValue", data=data_lipid, ax=ax) #Drawing a box plot ax.set_ylabel("ng/g lipid") plt.show() ```

output_33_0.png

Sec.5-2 Improvement of Sec.5-1

In the graph above, the ΣDDTs vary widely, so the y-axis spacing is wide. Therefore, the remaining 5 types of graphs are crushed and difficult to see. Therefore, the graph is made easier to see by taking the logarithm of the data values.

First, when taking the logarithm, the logarithm of 0 cannot be taken, so check if the data set contains 0. By using the in statement, the presence or absence of the corresponding value is returned as TRUE / FALSE.

0 in data_lipid.loc[:, "MeasuredValue"].values
False

Since we found that the dataset does not contain 0 in the upper cell, we can take the logarithmic value as it is. ```python data_lipid.loc[:, "MeasuredValue"] = data_lipid.loc[:, "MeasuredValue"].apply(np.log10) ```

C:\Users\masah\Anaconda3\lib\site-packages\pandas\core\indexing.py:543: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
sns.stripplot(x="ChemicalName", y="MeasuredValue", data=data_lipid, color='black', ax=ax) #Drawing of jitter plot
sns.boxplot(x="ChemicalName", y="MeasuredValue", data=data_lipid, ax=ax) #Drawing a box plot
ax.set_ylabel("log(ng/g) lipid")
plt.show()

output_38_0.png

Visualize

data_wet in the same way. First, if the dataset contains 0s, substitute 1/2 of the minimum value instead. Then take the logarithm and visualize it. </ p>

if 0 in data_wet.loc[:, "MeasuredValue"].values:
    data_wet["MeasuredValue"].replace(0, data_wet[data_wet["MeasuredValue"] != 0].loc[:, "MeasuredValue"].values.min() / 2) #Minimum value 1/Substitute 2
else:
    pass
data_lipid.loc[:, "MeasuredValue"] = data_lipid["MeasuredValue"].apply(np.log10)
data_lipid.loc[:, "Unit"] = "log(ng/g lipid)"
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
sns.stripplot(x="ChemicalName", y="MeasuredValue", data=data_lipid, color='black', ax=ax)
sns.boxplot(x="ChemicalName", y="MeasuredValue", data=data_lipid, ax=ax)
ax.set_ylabel("log(ng/g lipid)")
plt.show()

output_41_0.png

footnote

1 It is a function of Juyter Notebook called "magic command", and it should be noted that it cannot be used in general python code.

2 Abbreviation for tab-separated values. It is a file format in which data is separated by TAB characters (at the left end of the keyboard). Suitable for holding table-shaped data. A similar file format is CSV (abbreviation for comma-separated values, separated by ", (comma)").

Recommended Posts