Learning Python with ChemTHEATER 02

Part 2 Statistical inference

Chap.0 Overall flow

Part 2 deals with statistical inference. In the natural sciences, it is rarely possible to survey an entire target (nature) directly, so sample surveys are the norm.
Statistical inference is the process of estimating the distribution of the unknown population as a whole from the obtained sample (a part of that population). ![img02.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/261241/9644934b-ac72-2959-f1fc-9d09f0d8c031.png)

Chap.1 Loading the libraries

%matplotlib inline
import numpy as np
from scipy import stats
import math
import pandas as pd
from matplotlib import pyplot as plt

First, load the required libraries. The first line is a magic command that, as in Part 1, displays matplotlib plots inline in Jupyter Notebook.

The second and subsequent lines import the libraries used this time. Of these, math is part of the Python standard library; the others come preinstalled with Anaconda.

Incidentally, math and NumPy overlap in functionality, but their roles differ: math is a module built into standard Python, while NumPy is an extension module that makes complex numerical computation efficient.
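As a minimal sketch of that difference (the input values here are arbitrary):

math.sqrt(2.0)    # math works on single scalar values
np.sqrt(np.array([1.0, 4.0, 9.0]))    # NumPy applies the operation element-wise to a whole array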

| Library | Overview | Purpose this time | Official URL |
|---|---|---|---|
| NumPy | Numerical computation library | Numerical calculations for statistical processing | https://www.numpy.org |
| SciPy | Scientific computation library | Computing statistical inferences | https://www.scipy.org |
| math | Standard numerical library | Simple calculations such as square roots | https://docs.python.org/ja/3/library/math.html |
| pandas | Data analysis library | Reading and formatting data | https://pandas.pydata.org |
| Matplotlib | Graph plotting library | Visualizing the data | https://matplotlib.org |

Chap.2 Reading the data

This time, we use data on skipjack tuna (Katsuwonus pelamis). Referring to Chap.4 of Part 0, download the skipjack tuna sample data and measured data from ChemTHEATER's Sample Search.
Once downloaded, move the measureddata and samples files into the folder containing this notebook. Then, after restarting Anaconda, read both the measured data and the sample data with pandas' read_csv function, as in Part 1.

data_file = "measureddata_20190930045953.tsv"    #Change the string entered in the variable to the tsv file name of your measured data
chem = pd.read_csv(data_file, delimiter="\t")
chem = chem.drop(["ProjectID", "ScientificName", "RegisterDate", "UpdateDate"], axis=1)    #Remove duplicate columns when joining with samples later

sample_file = "samples_20190930045950.tsv"    #Change the string entered in the variable to the tsv file name of your samples
sample = pd.read_csv(sample_file, delimiter="\t")

When reading or processing files programmatically, as in Python, it is good practice to check that each file was read as expected. Conveniently, in Jupyter Notebook, entering just a variable name displays that variable's contents in the Out field.

chem
MeasuredID SampleID ChemicalID ChemicalName ExperimentID MeasuredValue AlternativeData Unit Remarks
0 1 SAA000001 CH0000096 ΣPCBs EXA000001 6.659795 NaN ng/g wet NaN
1 2 SAA000002 CH0000096 ΣPCBs EXA000001 9.778107 NaN ng/g wet NaN
2 3 SAA000003 CH0000096 ΣPCBs EXA000001 5.494933 NaN ng/g wet NaN
3 4 SAA000004 CH0000096 ΣPCBs EXA000001 7.354636 NaN ng/g wet NaN
4 5 SAA000005 CH0000096 ΣPCBs EXA000001 9.390950 NaN ng/g wet NaN
... ... ... ... ... ... ... ... ... ...
74 75 SAA000082 CH0000096 ΣPCBs EXA000001 3.321208 NaN ng/g wet NaN
75 76 SAA000083 CH0000096 ΣPCBs EXA000001 3.285111 NaN ng/g wet NaN
76 77 SAA000084 CH0000096 ΣPCBs EXA000001 0.454249 NaN ng/g wet NaN
77 78 SAA000085 CH0000096 ΣPCBs EXA000001 0.100000 <1.00E-1 ng/g wet NaN
78 79 SAA000086 CH0000096 ΣPCBs EXA000001 0.702224 NaN ng/g wet NaN

79 rows × 9 columns

sample
ProjectID SampleID SampleType TaxonomyID UniqCodeType UniqCode SampleName ScientificName CommonName CollectionYear ...
0 PRA000001 SAA000001 ST008 8226 es-BANK EF00564 NaN Katsuwonus pelamis Skipjack tuna 1998 ...
1 PRA000001 SAA000002 ST008 8226 es-BANK EF00565 NaN Katsuwonus pelamis Skipjack tuna 1998 ...
2 PRA000001 SAA000003 ST008 8226 es-BANK EF00566 NaN Katsuwonus pelamis Skipjack tuna 1998 ...
3 PRA000001 SAA000004 ST008 8226 es-BANK EF00567 NaN Katsuwonus pelamis Skipjack tuna 1998 ...
4 PRA000001 SAA000005 ST008 8226 es-BANK EF00568 NaN Katsuwonus pelamis Skipjack tuna 1998 ...
... ... ... ... ... ... ... ... ... ... ... ...
74 PRA000001 SAA000082 ST008 8226 es-BANK EF00616 NaN Katsuwonus pelamis Skipjack tuna 1999 ...
75 PRA000001 SAA000083 ST008 8226 es-BANK EF00617 NaN Katsuwonus pelamis Skipjack tuna 1999 ...
76 PRA000001 SAA000084 ST008 8226 es-BANK EF00619 NaN Katsuwonus pelamis Skipjack tuna 1999 ...
77 PRA000001 SAA000085 ST008 8226 es-BANK EF00620 NaN Katsuwonus pelamis Skipjack tuna 1999 ...
78 PRA000001 SAA000086 ST008 8226 es-BANK EF00621 NaN Katsuwonus pelamis Skipjack tuna 1999 ...

79 rows × 66 columns

Chap.3 Data preparation

After reading the data, the next step is to prepare it.
First, the two separate tables (chem and sample) are merged, and only the necessary rows are extracted. Since we want the skipjack tuna ΣPCBs data, only rows whose "ChemicalName" column value is "ΣPCBs" are extracted.

df = pd.merge(chem, sample, on="SampleID")
data = df[df["ChemicalName"] == "ΣPCBs"]

Next, check whether the measured values are expressed in more than one unit. As in Part 1, data with different units cannot simply be compared or combined.

data["Unit"].unique()
array(['ng/g wet'], dtype=object)

pandas' unique method lists the distinct values contained in a dataframe column. Outputting the values of the "Unit" column shows only "ng/g wet", so this time there is no need to split the data by unit.

Finally, columns that contain only NaN are dropped, and the data preparation is complete.

data = data.dropna(how='all', axis=1)
data
MeasuredID SampleID ChemicalID ChemicalName ExperimentID MeasuredValue AlternativeData Unit ProjectID SampleType ...
0 1 SAA000001 CH0000096 ΣPCBs EXA000001 6.659795 NaN ng/g wet PRA000001 ST008 ...
1 2 SAA000002 CH0000096 ΣPCBs EXA000001 9.778107 NaN ng/g wet PRA000001 ST008 ...
2 3 SAA000003 CH0000096 ΣPCBs EXA000001 5.494933 NaN ng/g wet PRA000001 ST008 ...
3 4 SAA000004 CH0000096 ΣPCBs EXA000001 7.354636 NaN ng/g wet PRA000001 ST008 ...
4 5 SAA000005 CH0000096 ΣPCBs EXA000001 9.390950 NaN ng/g wet PRA000001 ST008 ...
... ... ... ... ... ... ... ... ... ... ... ...
74 75 SAA000082 CH0000096 ΣPCBs EXA000001 3.321208 NaN ng/g wet PRA000001 ST008 ...
75 76 SAA000083 CH0000096 ΣPCBs EXA000001 3.285111 NaN ng/g wet PRA000001 ST008 ...
76 77 SAA000084 CH0000096 ΣPCBs EXA000001 0.454249 NaN ng/g wet PRA000001 ST008 ...
77 78 SAA000085 CH0000096 ΣPCBs EXA000001 0.100000 <1.00E-1 ng/g wet PRA000001 ST008 ...
78 79 SAA000086 CH0000096 ΣPCBs EXA000001 0.702224 NaN ng/g wet PRA000001 ST008 ...

79 rows × 35 columns

Chap.4 Point estimation

Point estimation is the kind of statistical inference that estimates a population parameter as a single value. Here, we estimate the ΣPCBs concentration of the population (all individuals in the collection area) from the ΣPCBs concentrations measured in the skipjack tuna sample.

First, to calculate year by year and follow the change over time, check which collection years the data covers.

data['CollectionYear'].unique()
array([1998, 1997, 1999, 2001], dtype=int64)

The unique method above shows that the dataset contains data for four years: 1997–1999 and 2001. First, let's take out the 1997 data. At this point, convert the extracted data to NumPy's ndarray¹ format so that later calculations are easier.

pcb_1997 = np.array(data[data['CollectionYear']==1997]["MeasuredValue"]) #Extract only 1997 measurements
pcb_1997
array([ 10.72603788,   9.22208078,   7.59790835,  30.95079465,
        15.27462553,  14.15719633,  13.28955903,  14.87712806,
         9.86650189,  18.26554514,   3.39951845,   6.58172781,
        12.43564814,   6.1948639 ,   6.41605666,   4.98827291,
        12.36669815,  31.17955551,   8.16184346,   4.60893266,
        36.85826409,  52.99841724,  39.22500351,  53.92302702,
        69.4308048 ,  73.97686479, 125.3887794 ,  45.39974771,
        54.12726127,  39.77794045, 101.2736126 ,  38.06220403,
       126.8301693 ,  70.25308435,  31.24246301,  21.3958656 ,
        41.85726522,  30.91112132,  81.12597135,  10.76755148,
        24.20442213,  24.57497594,  14.84353549,  59.53687389,
        52.78443082,   8.4644697 ,   4.15293758,   3.31957452,
         4.51832675,   6.98373973])

Similarly, the data for 1998 and 1999 are also extracted.

pcb_1998 = np.array(data[data['CollectionYear']==1998]["MeasuredValue"]) #Extract only 1998 measurements
pcb_1999 = np.array(data[data['CollectionYear']==1999]["MeasuredValue"]) #Extract only 1999 measurements

Here, the unbiased estimators of the mean and variance are calculated.
First, the unbiased estimator of the mean ($\hat{\mu}$), which uses the fact that the expected value of the sample mean ($\overline{X}$) equals the population mean (see the formula below).
$$ E \left(\overline X \right) = E\left(\frac{1}{n} \sum_{i=1}^{n} \left(x_i\right)\right) = \frac{1}{n}\sum_{i=1}^{n} E\left(x_i\right) = \frac{1}{n} \times n\mu = \mu \\ \therefore \hat\mu = \overline{x} $$

s_mean_1997 = np.mean(pcb_1997)
s_mean_1997
31.775384007760003
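As a quick illustration of this unbiasedness, a small simulation with a made-up population (normal with mean 10 and standard deviation 3; the seed is arbitrary) shows that the average of many sample means lands close to the population mean:

rng = np.random.default_rng(0)    # arbitrary seed for reproducibility
sample_means = [rng.normal(loc=10, scale=3, size=50).mean() for _ in range(10000)]    # 10,000 samples of n=50
np.mean(sample_means)    # ≈ 10, the population mean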

Similarly, calculate the unbiased estimator of the variance.
Note that the expected value of the sample variance ($S^2$) does not equal the population variance ($\sigma^2$); the unbiased variance ($s^2$) must be used instead.
$$\hat\sigma^2 \neq S^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \overline X \right)^2 \\ \hat\sigma^2 = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \overline X \right)^2$$

NumPy's var function can calculate either variance: the parameter ddof=1 outputs the unbiased variance, while the default, ddof=0, outputs the sample variance, so take care.
$$\mathrm{np.var}\left(x_1 \ldots x_n, \mathrm{ddof=0}\right): S^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \overline X \right)^2 \\ \mathrm{np.var}\left(x_1 \ldots x_n, \mathrm{ddof=1}\right): \hat\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \overline X \right)^2$$

u_var_1997 = np.var(pcb_1997, ddof=1)
u_var_1997 
942.8421749786518
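The relationship $S^2 = \frac{n-1}{n} s^2$ between the two can be checked directly on this dataset:

np.var(pcb_1997, ddof=0)    # sample variance S^2 (the default, ddof=0)
u_var_1997 * (len(pcb_1997) - 1) / len(pcb_1997)    # the same value, rescaled from the unbiased variance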

Similarly, calculate the unbiased estimators of the mean and variance for 1998 and 1999.

s_mean_1998, s_mean_1999 = np.mean(pcb_1998), np.mean(pcb_1999)
u_var_1998, u_var_1999 = np.var(pcb_1998, ddof=1), np.var(pcb_1999, ddof=1)

The representative values obtained so far are summarized below.

s_mean_1997, s_mean_1998, s_mean_1999
(31.775384007760003, 17.493267312533337, 30.583242522000003)
u_var_1997, u_var_1998, u_var_1999
(942.8421749786518, 240.2211176248311, 1386.7753819003349)
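For readability, the same estimates can also be gathered into a small DataFrame (a sketch; the variable name summary is arbitrary):

summary = pd.DataFrame({"mean": [s_mean_1997, s_mean_1998, s_mean_1999],
                        "unbiased variance": [u_var_1997, u_var_1998, u_var_1999]},
                       index=[1997, 1998, 1999])
summary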

Chap.5 Interval estimation and confidence interval

In Chap.5, unlike the point estimation of Chap.4, interval estimation is performed: the population mean and population variance are estimated as ranges with a stated statistical confidence.

Sec.5-1 Interval estimation of the population mean

First, before estimating the intervals, check the number of data points in each year's dataset. The count can be obtained with Python's built-in len function.

n_1997 = len(pcb_1997)
n_1997
50
n_1998, n_1999 = len(pcb_1998), len(pcb_1999)
n_1998, n_1999
(15, 13)

The above gives the number of data points in each year's dataset from 1997 to 1999.

Of these, the 1997 dataset is a large sample ($n = 50$), while the 1998 and 1999 datasets are small samples, with $n = 15$ (1998) and $n = 13$ (1999).
Note, therefore, that the subsequent interval estimation procedure differs slightly between them.

First, the interval of the population mean is estimated from the 1997 dataset. Since the population variance is unknown and the sample is large ($n > 30$), the sample mean ($\overline X$) can be approximated by the normal distribution $N\left(\mu, \frac{s^2}{n}\right)$. Estimating the population mean at confidence level $\alpha$ then gives the following confidence interval: $\overline X - z_\frac{\alpha}{2} \sqrt{\frac{s^2}{n}} < \mu < \overline X + z_\frac{\alpha}{2} \sqrt{\frac{s^2}{n}} $

In Python, SciPy's stats.norm.interval() returns the range, centered on the median, that covers alpha × 100% of a normal distribution with mean loc and standard deviation scale. (Note that in newer SciPy versions this first parameter is named confidence rather than alpha.)
Here, the confidence interval is calculated at confidence level $\alpha = 0.95$.

m_interval_1997 = stats.norm.interval(alpha=0.95, loc=s_mean_1997, scale=math.sqrt(pcb_1997.var(ddof=1)/n_1997))
m_interval_1997
(23.26434483549182, 40.28642318002819)
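As a cross-check, the same interval can be computed by hand from the standard normal percentage point; stats.norm.ppf(0.975) is the upper 2.5% point ($\approx 1.96$):

z = stats.norm.ppf(0.975)    # upper 2.5% point of the standard normal distribution
se = math.sqrt(u_var_1997 / n_1997)    # standard error of the sample mean
(s_mean_1997 - z * se, s_mean_1997 + z * se)    # matches m_interval_1997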

Next, we estimate the interval of the population mean for the 1998 and 1999 datasets. These are small samples ($n \leq 30$) with unknown population variance. In this case, instead of the normal distribution $N\left(\mu, \frac{s^2}{n}\right)$, the t distribution with $n-1$ degrees of freedom is used. Estimating the population mean at confidence level $\alpha$ then gives the following confidence interval: $\overline X - t_\frac{\alpha}{2}\left(n-1\right)\sqrt{\frac{s^2}{n}} < \mu < \overline X + t_\frac{\alpha}{2}\left(n-1\right)\sqrt{\frac{s^2}{n}} $

In Python, stats.t.interval() returns the range, centered on the median, that covers alpha × 100% of a t distribution with mean loc, standard deviation scale, and degrees of freedom df.
Here too, the confidence interval is calculated at confidence level $\alpha = 0.95$.

m_interval_1998 = stats.t.interval(alpha=0.95, df=n_1998-1, loc=s_mean_1998, scale=math.sqrt(pcb_1998.var(ddof=1)/n_1998))
m_interval_1999 = stats.t.interval(alpha=0.95, df=n_1999-1, loc=s_mean_1999, scale=math.sqrt(pcb_1999.var(ddof=1)/n_1999))
m_interval_1997, m_interval_1998, m_interval_1999
((23.26434483549182, 40.28642318002819),
 (8.910169386248537, 26.076365238818138),
 (8.079678286109523, 53.086806757890486))
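These small-sample intervals can be verified the same way with the t percentage point (a sketch for 1998):

t = stats.t.ppf(0.975, df=n_1998 - 1)    # upper 2.5% point of the t distribution with n-1 degrees of freedom
se = math.sqrt(u_var_1998 / n_1998)    # standard error of the sample mean
(s_mean_1998 - t * se, s_mean_1998 + t * se)    # matches m_interval_1998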

A 95% confidence interval means that the interval contains the population mean with 95% probability. In other words, lowering the confidence level ($\alpha$) reduces the probability that the confidence interval contains the population mean and, at the same time, narrows the interval.

stats.norm.interval(alpha=0.9, loc=s_mean_1997, scale=math.sqrt(pcb_1997.var(ddof=1)/n_1997))
(24.63269477364296, 38.91807324187704)
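Comparing several confidence levels side by side makes this narrowing visible (a sketch using the 1997 data):

for level in [0.90, 0.95, 0.99]:
    lower, upper = stats.norm.interval(alpha=level, loc=s_mean_1997, scale=math.sqrt(u_var_1997/n_1997))
    print(level, upper - lower)    # the interval widens as the confidence level rises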

Sec.5-2 Interval estimation of the population variance

Next, interval estimation of the population variance is performed. For the population variance ($\sigma^2$), use the fact that $\frac{\left(n-1\right)s^2}{\sigma^2}$ follows the $\chi^2$ distribution with $n-1$ degrees of freedom. $\chi^2_\frac{\alpha}{2}\left(n-1\right) \leq \frac{\left( n-1 \right)s^2}{\sigma^2} \leq \chi^2_{1-\frac{\alpha}{2}}\left(n-1\right)$

First, find the percentage points $\chi^2_\frac{\alpha}{2}\left(n-1\right)$ and $\chi^2_{1-\frac{\alpha}{2}}\left(n-1\right)$ of the $\chi^2$ distribution with $n-1$ degrees of freedom, here at confidence level 0.95.
In Python, SciPy's stats.chi2.interval() returns the range covering alpha × 100% of the $\chi^2$ distribution with df degrees of freedom.

chi_025_1997, chi_975_1997 = stats.chi2.interval(alpha=0.95, df=n_1997-1)
chi_025_1997, chi_975_1997
(31.554916462667137, 70.22241356643451)

Next, find the confidence interval, derived as in the following formula: $\frac{\left(n-1\right)s^2}{\chi^2_{1-\frac{\alpha}{2}}\left(n-1\right)} \leq \sigma^2 \leq \frac{\left(n-1\right)s^2}{\chi^2_\frac{\alpha}{2}\left(n-1\right)}$

v_interval_1997 = (n_1997 - 1)*np.var(pcb_1997, ddof=1) / chi_975_1997, (n_1997 - 1)*np.var(pcb_1997, ddof=1) / chi_025_1997
v_interval_1997
(657.8991553778869, 1464.0909168183869)
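Here too, the endpoints can be cross-checked against the individual $\chi^2$ percentage points obtained with stats.chi2.ppf():

chi_lower = stats.chi2.ppf(0.025, df=n_1997 - 1)    # lower 2.5% point
chi_upper = stats.chi2.ppf(0.975, df=n_1997 - 1)    # upper 2.5% point
((n_1997 - 1) * u_var_1997 / chi_upper, (n_1997 - 1) * u_var_1997 / chi_lower)    # matches v_interval_1997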

Similarly, estimate the variance intervals for the 1998 and 1999 datasets. Note that, unlike the interval estimation of the population mean, $\frac{\left(n-1\right)s^2}{\sigma^2}$ follows the $\chi^2$ distribution regardless of the sample size.

chi_025_1998, chi_975_1998 = stats.chi2.interval(alpha=0.95, df=n_1998-1)
chi_025_1999, chi_975_1999 = stats.chi2.interval(alpha=0.95, df=n_1999-1)
v_interval_1998 = (n_1998 - 1)*np.var(pcb_1998, ddof=1) / chi_975_1998, (n_1998 - 1)*np.var(pcb_1998, ddof=1) / chi_025_1998
v_interval_1999 = (n_1999 - 1)*np.var(pcb_1999, ddof=1) / chi_975_1999, (n_1999 - 1)*np.var(pcb_1999, ddof=1) / chi_025_1999
v_interval_1997, v_interval_1998, v_interval_1999
((657.8991553778869, 1464.0909168183869),
 (128.76076176378118, 597.4878836139195),
 (713.0969734866349, 3778.8609867211235))
chi_025_1997, chi_975_1997 = stats.chi2.interval(alpha=0.9, df=n_1997-1)
(n_1997 - 1)*np.var(pcb_1997, ddof=1) / chi_975_1997, (n_1997 - 1)*np.var(pcb_1997, ddof=1) / chi_025_1997
(696.4155490924242, 1361.5929987004467)

Chap.6 Visualization of estimation results

Now, let's visualize the population means estimated in Chap.4 and Chap.5 on a graph.

First, the point estimates of the population mean from Chap.4 are arranged in chronological order.

x_list = [1997, 1998, 1999]
y_list = [s_mean_1997, s_mean_1998, s_mean_1999]

Next, the 95% confidence intervals of the population mean estimated in Chap.5 are also arranged in chronological order.

interval_list = []
interval_list.append(m_interval_1997)
interval_list.append(m_interval_1998)
interval_list.append(m_interval_1999)
interval_list
[(23.26434483549182, 40.28642318002819),
 (8.910169386248537, 26.076365238818138),
 (8.079678286109523, 53.086806757890486)]

The 95% confidence intervals cannot be used for visualization as they are, so the half-width of each interval (the distance from the mean to the upper bound) is calculated.

interval_list = np.array(interval_list).T[1] - y_list
x_list, y_list, interval_list
([1997, 1998, 1999],
 [31.775384007760003, 17.493267312533337, 30.583242522000003],
 array([ 8.51103917,  8.58309793, 22.50356424]))
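Since the normal and t intervals used here are symmetric about the mean, the upper half-width alone is sufficient. For asymmetric intervals (for example, after a log transform), matplotlib's errorbar also accepts separate lower and upper errors as a 2 × N array (a sketch, not needed for this data):

bounds = np.array([m_interval_1997, m_interval_1998, m_interval_1999])    # shape (3, 2): (lower, upper) per year
yerr_asym = np.vstack([np.array(y_list) - bounds[:, 0], bounds[:, 1] - np.array(y_list)])    # shape (2, N)
# passing yerr=yerr_asym to errorbar draws each side of the bar with its own length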

Finally, visualize with matplotlib, displaying the confidence intervals as error bars via the errorbar method.
This method takes the X-axis values (here, the years), the Y-axis values (here, the point estimates of the population mean), and the lengths of the error bars (here, the half-widths of the confidence intervals).

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.errorbar(x=x_list, y=y_list, yerr=interval_list, fmt='o-', capsize=4, ecolor='red')
plt.xticks(x_list)
ax.set_title("Katsuwonus pelamis")
ax.set_ylabel("ΣPCBs [ng/g wet]")
plt.show()

(Figure: annual mean ΣPCBs concentrations in skipjack tuna, with the 95% confidence intervals shown as error bars)

Footnote

¹ NumPy's data format for storing n-dimensional arrays.
