[PYTHON] [Introduction to Data Scientists] Descriptive Statistics and Simple Regression Analysis ♬

Last night, I summarized [Introduction to Data Scientists] Basics of scientific calculation, data processing, and how to use the graph drawing libraries. From tonight, I will finally use them to get into the main subject: descriptive statistics and simple regression analysis. I will supplement the explanations in the book. 【Caution】 I will read ["Data Scientist Training Course at the University of Tokyo"](https://www.amazon.co.jp/%E6%9D%B1%E4%BA%AC%E5%A4%A7%E5%AD%A6%E3%81%AE%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E8%82%B2%E6%88%90%E8%AC%9B%E5%BA%A7-Python%E3%81%A7%E6%89%8B%E3%82%92%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%E3%83%87%E2%80%95%E3%82%BF%E5%88%86%E6%9E%90-%E5%A1%9A%E6%9C%AC%E9%82%A6%E5%B0%8A/dp/4839965250/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=&sr=) and summarize the parts that I have doubts about or find useful. Therefore, the outline follows the book, but please read this on the understanding that the content is independent of the book.

Chapter 3 Descriptive Statistics and Simple Regression Analysis

Chapter 3-1 Types of statistical analysis

3-1-1 Descriptive and inferential statistics

Statistical analysis methods are divided into descriptive statistics and inferential statistics.

3-1-1-1 Descriptive statistics

"A method for grasping the characteristics of the collected data and organizing it so that it is easy to understand and easy to see. Descriptive statistics means calculating characteristics of the data such as the mean and standard deviation, classifying the data, and expressing it using figures and graphs."

3-1-1-2 Inferential statistics

"The idea of inferential statistics is to perform a precise analysis from only partial data, using a model based on a probability distribution, and to infer statistics of the whole." "It is also used to predict the future from historical data. This chapter describes simple regression analysis, which is the basis of inferential statistics. More complex inferential statistics will be dealt with in the next four chapters."

3-1-2 Importing the library

import numpy as np
import scipy as sp
import pandas as pd
from pandas import Series, DataFrame

import matplotlib as mpl
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

from sklearn import linear_model

Install scikit-learn as follows.


$ sudo pip3 install scikit-learn

As shown below, it also seems to work on the Raspberry Pi 4.

 $ python3
Python 3.7.3 (default, Jul 25 2020, 13:03:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn import linear_model 
>>>

The package python3-sklearn-doc was not found, but on Debian / Ubuntu it seems that scikit-learn can also be installed as follows.

$ sudo apt-get install python3-sklearn python3-sklearn-lib

Whether simple regression analysis runs without problems is verified later in this article.

Chapter 3-2 Data reading and interaction

(Omitted)

3-2-1-5 Download the data file student.zip from the following site with the program below.
https://archive.ics.uci.edu/ml/machine-learning-databases/00356/student.zip

import io
import zipfile

import requests

# Download the zip archive and extract it into ./chap3,
# the directory that the read_csv calls below read from.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00356/student.zip'
r = requests.get(url)
r.raise_for_status()
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall('./chap3')

The following four files were extracted: student.txt, student-mat.csv, student-merge.R, student-por.csv

3-2-2 Data reading and confirmation

Continuing from the imports above, run the following.

student_data_math = pd.read_csv('./chap3/student-mat.csv')
print(student_data_math.head())

The output shows that the fields are delimited by semicolons (;).

school;sex;age;address;famsize;Pstatus;Medu;Fedu;Mjob;Fjob;reason;guardian;traveltime;studytime;failures;schoolsup;famsup;paid;activities;nursery;higher;internet;romantic;famrel;freetime;goout;Dalc;Walc;health;absences;G1;G2;G3
0  GP;"F";18;"U";"GT3";"A";4;4;"at_home";"teacher...                                                                                                                                                                              
1  GP;"F";17;"U";"GT3";"T";1;1;"at_home";"other";...                                                                                                                                                                              
2  GP;"F";15;"U";"LE3";"T";1;1;"at_home";"other";...                                                                                                                                                                              
3  GP;"F";15;"U";"GT3";"T";4;2;"health";"services...                                                                                                                                                                              
4  GP;"F";16;"U";"GT3";"T";3;3;"other";"other";"h...

Set the separator to ; and reload.

student_data_math = pd.read_csv('./chap3/student-mat.csv', sep =';')
print(student_data_math.head())

Now the data is parsed cleanly into columns.

  school sex  age address famsize Pstatus  Medu  Fedu  ... goout Dalc Walc health  absences  G1  G2  G3
0     GP   F   18       U     GT3       A     4     4  ...     4    1    1      3         6   5   6   6
1     GP   F   17       U     GT3       T     1     1  ...     3    1    1      3         4   5   5   6
2     GP   F   15       U     LE3       T     1     1  ...     2    2    3      3        10   7   8  10
3     GP   F   15       U     GT3       T     4     2  ...     2    1    1      5         2  15  14  15
4     GP   F   16       U     GT3       T     3     3  ...     2    1    2      5         4   6  10  10

[5 rows x 33 columns]
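
The effect of the sep argument can be reproduced without downloading anything. The following is a minimal sketch on a made-up three-row sample (an assumption for illustration, not the real file) that uses only three of the 33 column names.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same semicolon-delimited layout
# as student-mat.csv (only three of the 33 columns are reproduced).
raw = 'school;sex;age\nGP;F;18\nGP;F;17\nGP;M;15\n'

# The default separator is a comma, so each line ends up in a single column.
df_default = pd.read_csv(io.StringIO(raw))

# With sep=';' the fields are split into separate columns.
df_semicolon = pd.read_csv(io.StringIO(raw), sep=';')

print(df_default.shape)    # (3, 1)
print(df_semicolon.shape)  # (3, 3)
```

Checking the shape right after loading is a quick way to notice a wrong delimiter.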

3-2-3 Check the nature of the data

print(student_data_math.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64
 7   Fedu        395 non-null    int64
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64
 13  studytime   395 non-null    int64
 14  failures    395 non-null    int64
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher      395 non-null    object
 21  internet    395 non-null    object
 22  romantic    395 non-null    object
 23  famrel      395 non-null    int64
 24  freetime    395 non-null    int64
 25  goout       395 non-null    int64
 26  Dalc        395 non-null    int64
 27  Walc        395 non-null    int64
 28  health      395 non-null    int64
 29  absences    395 non-null    int64
 30  G1          395 non-null    int64
 31  G2          395 non-null    int64
 32  G3          395 non-null    int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB

The contents of student.txt, shown with cat below, describe the columns of this data.

(The book gives a translation.)

$ cat student.txt
# Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student's school (binary: "GP" - Gabriel Pereira or "MS" - Mousinho da Silveira)
2 sex - student's sex (binary: "F" - female or "M" - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: "U" - urban or "R" - rural)
5 famsize - family size (binary: "LE3" - less or equal to 3 or "GT3" - greater than 3)
6 Pstatus - parent's cohabitation status (binary: "T" - living together or "A" - apart)
7 Medu - mother's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father's education (numeric: 0 - none,  1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
10 Fjob - father's job (nominal: "teacher", "health" care related, civil "services" (e.g. administrative or police), "at_home" or "other")
11 reason - reason to choose this school (nominal: close to "home", school "reputation", "course" preference or "other")
12 guardian - student's guardian (nominal: "mother", "father" or "other")
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

# these grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
31 G2 - second period grade (numeric: from 0 to 20)
32 G3 - final grade (numeric: from 0 to 20, output target)

Additional note: there are several (382) students that belong to both datasets . 
These students can be identified by searching for identical attributes
that characterize each student, as shown in the annexed R file.

3-2-4 Quantitative and qualitative data

・ Quantitative data: data represented by continuous values to which the four arithmetic operations can be applied and whose ratios are meaningful. Examples: number of people, amount of money.
・ Qualitative data: discontinuous data to which the four arithmetic operations cannot be applied, used to express a state. Examples: ranking, category.

Gender is qualitative data

print(student_data_math['sex'].head())
0    F
1    F
2    F
3    F
4    F
Name: sex, dtype: object

The number of absences is quantitative data

print(student_data_math['absences'].head())
0     6
1     4
2    10
3     2
4     4
Name: absences, dtype: int64
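
As a rough first pass, the column dtype can be used to separate the two kinds of data. The sketch below uses a made-up five-row excerpt whose values follow the head() output above; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical five-row excerpt; values follow the head() output shown above.
sample = pd.DataFrame({
    'sex': ['F', 'F', 'F', 'F', 'F'],
    'age': [18, 17, 15, 15, 16],
    'absences': [6, 4, 10, 2, 4],
})

# Numeric columns are candidates for quantitative data, the rest for qualitative data.
quantitative = sample.select_dtypes(include='number').columns.tolist()
qualitative = sample.select_dtypes(exclude='number').columns.tolist()

print(quantitative)  # ['age', 'absences']
print(qualitative)   # ['sex']
```

The dtype is only a hint. Columns such as Medu or traveltime are stored as integers, but according to student.txt they are category codes, so they should be judged from the data description rather than from the dtype alone.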

3-2-4-2 Calculate the mean for each group (axis)

print(student_data_math.groupby('sex')['age'].mean())
sex
F    16.730769
M    16.657754
Name: age, dtype: float64

Female students report more study time on average.

print(student_data_math.groupby('sex')['studytime'].mean())
sex
F    2.278846
M    1.764706
Name: studytime, dtype: float64
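
What groupby(...).mean() computes can be checked by filtering each group by hand. This is a small sketch on made-up data; only the column names are taken from student-mat.csv.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror student-mat.csv.
toy = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M', 'F'],
    'studytime': [2, 1, 3, 2, 4],
})

by_group = toy.groupby('sex')['studytime'].mean()

# The same numbers, computed by selecting each group explicitly.
manual_f = toy.loc[toy['sex'] == 'F', 'studytime'].mean()
manual_m = toy.loc[toy['sex'] == 'M', 'studytime'].mean()

print(by_group['F'], manual_f)  # 3.0 3.0
print(by_group['M'], manual_m)  # 1.5 1.5
```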

Chapter 3-3 Descriptive statistics

3-3-1 Histogram

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 10, range =(0.0,max(y1)))
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()
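
Matplotlib's hist bins the data with numpy.histogram, so the counts behind the bars can be inspected directly. The sketch below uses a made-up sample (not the real absences column) with the same bins and range arguments as the call above.

```python
import numpy as np

# Hypothetical absences sample (not the real dataset).
absences = np.array([0, 0, 2, 4, 4, 6, 8, 10, 75])

# Same bins/range arguments as the ax1.hist call above: 10 equal-width bins
# from 0 to the maximum value, so each bin is 7.5 wide here.
counts, edges = np.histogram(absences, bins=10, range=(0.0, absences.max()))

print(edges)   # 11 bin edges from 0 to 75
print(counts)  # [6 2 0 0 0 0 0 0 0 1]
```

With a single extreme value, most of the data falls into the first one or two bins, which is why the later plots in this section restrict the range to 0 to 30.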

3-3-2 Mean, median, mode

print('Mean {}'.format(student_data_math['absences'].mean()))
print('Median {}'.format(student_data_math['absences'].median()))
print('Mode {}'.format(student_data_math['absences'].mode()))

Mean 5.708860759493671
Median 4.0
Mode 0 0
dtype: int64
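
The three measures can be checked by hand on a tiny sample. The data below is made up, chosen to be skewed to the right like the absences column. Note that mode() returns a Series (there can be ties), which is why the output above shows an index and a dtype line.

```python
import pandas as pd

# Hypothetical small sample, skewed to the right like the absences column.
s = pd.Series([0, 0, 0, 2, 4, 4, 10])

mean = s.mean()          # sum / count = 20 / 7
median = s.median()      # middle value of the sorted data
mode = s.mode().iloc[0]  # mode() returns a Series; take its first entry

print(mean, median, mode)
```

For right-skewed data like this, the mean is pulled above the median by the large values, the same ordering as in the real output (mode 0, median 4, mean about 5.7).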

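Zoom in on the histogram above and mark these values on it.

The markers are shifted by 0.5 along the horizontal axis so that an integer value sits at the center of its bin (each bin is 1 wide).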

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 30, range =(0.0,30))  #,max(y1)
x0 = student_data_math['absences'].mean()
ax1.plot(x0+0.5, 70,  'red', marker = 'o',markersize=10,label ='mean')
x0 = student_data_math['absences'].median()
ax1.plot(x0+0.5, 70, 'blue', marker = 'o',markersize=10,label ='median')
x0 = student_data_math['absences'].mode()
ax1.plot(x0+0.5, 70, 'black', marker = 'o',markersize=10,label ='mode')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()

3-3-3 Variance and standard deviation

Definition of the variance $\sigma^2$

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2

Standard deviation $\sigma$ (std)

\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}

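# ddof=0 divides by n, matching the definition above
print('Variance {}'.format(student_data_math['absences'].var(ddof=0)))
print('Standard deviation {}'.format(student_data_math['absences'].std(ddof=0)))
# var() defaults to ddof=1 (divides by n-1), so this value is slightly larger
print('Standard deviation {}'.format(np.sqrt(student_data_math['absences'].var())))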
Variance 63.887389841371565
Standard deviation 7.99295876640006
Standard deviation 8.00309568710818
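
The two standard deviations above differ only in the divisor (n versus n-1). The following sketch computes both forms by hand on a small sample (the first five absences values from head(), used here purely as an illustration) and then relates the two numbers printed above for n = 395.

```python
import numpy as np

# Small sample: the first five absences values from head() above.
x = np.array([6, 4, 10, 2, 4], dtype=float)
n = len(x)

squared_dev = np.sum((x - x.mean()) ** 2)
var_population = squared_dev / n        # 1/n form, same as ddof=0
var_sample = squared_dev / (n - 1)      # 1/(n-1) form, same as ddof=1

print(var_population, np.var(x, ddof=0))
print(var_sample, np.var(x, ddof=1))

# Relation between the two standard deviations printed above (n = 395 rows).
std_ddof0 = 7.99295876640006
print(std_ddof0 * np.sqrt(395 / 394))   # approximately 8.003096
```

For a few hundred rows the difference is small, but it explains why describe() (which uses ddof=1) reports 8.003096 rather than 7.992959.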

Plot the mean ± standard deviation.

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 30, range =(0.0,30))  #,max(y1)
x0 = student_data_math['absences'].mean()
ax1.plot(x0+0.5, 70,  'red', marker = 'o',markersize=10,label ='mean')
x1 = student_data_math['absences'].std(ddof=0)
ax1.plot(x0+x1+0.5, 70, 'blue', marker = 'o',markersize=10,label ='mean+std')
ax1.plot(x0-x1+0.5, 70, 'black', marker = 'o',markersize=10,label ='mean-std')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()

3-3-4 Summary statistics and percentile values

A percentile value indicates the position of a value when the data are sorted and the total number is scaled to 100. The 25th position is the 25th percentile (first quartile), the 50th is the 50th percentile (median), and the 75th is the 75th percentile (third quartile).

print('Summary statistics', student_data_math['absences'].describe())
Summary statistic count 395.000000
mean       5.708861
std        8.003096
min        0.000000
25%        0.000000
50%        4.000000
75%        8.000000
max       75.000000
Name: absences, dtype: float64

Find the interquartile range

25th percentile: describe()[4]
75th percentile: describe()[6]
Difference: describe()[6] - describe()[4]

print('75-25 Percentile', student_data_math['absences'].describe()[6] - student_data_math['absences'].describe()[4])
75-25 Percentile 8.0
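
As an alternative to positional indexing into describe(), the quartiles can be read by label or computed with quantile(). This is a sketch on a made-up sample, not the real absences column.

```python
import pandas as pd

# Hypothetical absences sample (not the real dataset).
s = pd.Series([0, 0, 2, 4, 4, 6, 8, 10, 75])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# describe() is indexed by label, so the same values can be read by name.
desc = s.describe()
print(q1, q3, iqr)                # 2.0 8.0 6.0
print(desc['75%'] - desc['25%'])  # 6.0
```

Reading by label ('25%', '75%') does not depend on the order of the rows in the describe() output.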

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = student_data_math['absences']
ax1.hist(y1, bins = 30, range =(0.0,30))  #,max(y1)
x0 = student_data_math['absences'].median()
ax1.plot(x0+0.5, 70,  'red', marker = 'o',markersize=10,label ='median')
x1 = student_data_math['absences'].describe()[4]
ax1.plot(x1+0.5, 70, 'blue', marker = 'o',markersize=10,label ='25percentile')
x1 = student_data_math['absences'].describe()[6]
ax1.plot(x1+0.5, 70, 'black', marker = 'o',markersize=10,label ='75percentile')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('absences')
plt.grid(True)
plt.show()

3-3-4-2 describe() for all columns

print('Full column summary statistics', student_data_math.describe())
Full column summary statistics
              age        Medu        Fedu  traveltime  ...    absences          G1          G2          G3
count  395.000000  395.000000  395.000000  395.000000  ...  395.000000  395.000000  395.000000  395.000000
mean    16.696203    2.749367    2.521519    1.448101  ...    5.708861   10.908861   10.713924   10.415190
std      1.276043    1.094735    1.088201    0.697505  ...    8.003096    3.319195    3.761505    4.581443
min     15.000000    0.000000    0.000000    1.000000  ...    0.000000    3.000000    0.000000    0.000000
25%     16.000000    2.000000    2.000000    1.000000  ...    0.000000    8.000000    9.000000    8.000000
50%     17.000000    3.000000    2.000000    1.000000  ...    4.000000   11.000000   11.000000   11.000000
75%     18.000000    4.000000    3.000000    2.000000  ...    8.000000   13.000000   13.000000   14.000000
max     22.000000    4.000000    4.000000    4.000000  ...   75.000000   19.000000   19.000000   20.000000

[8 rows x 16 columns]

3-3-5 Box plot

A box plot represents five values (minimum, first quartile, median, third quartile, maximum) with a box and whiskers, as follows.

fig, (ax1,ax2) = plt.subplots(2, 1, figsize=(8,2*4))
y1 = student_data_math['G1']
ax1.hist(y1, bins = 30, range =(0.0,max(y1)))  #,max(y1)
x0 = student_data_math['G1'].median()
ax1.plot(x0+0.5, 60,  'red', marker = 'o',markersize=10,label ='median')
x1 = student_data_math['G1'].describe()[4]
ax1.plot(x1+0.5, 60, 'blue', marker = 'o',markersize=10,label ='25percentile')
x1 = student_data_math['G1'].describe()[6]
ax1.plot(x1+0.5, 60, 'black', marker = 'o',markersize=10,label ='75percentile')
ax2.boxplot(y1)
ax2.set_xlabel('G1')
ax2.set_ylabel('count')
ax1.legend()
ax1.set_ylabel('count')
ax1.set_xlabel('G1')
plt.grid(True)
plt.show()

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
y1 = [student_data_math['G1'],student_data_math['G2'],student_data_math['G3'],student_data_math['absences']]
ax1.boxplot(y1,labels=['G1', 'G2', 'G3', 'absences'])
ax1.set_xlabel('category')
ax1.set_ylabel('count')
plt.grid(True)
plt.show()
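
In the plots above, some points (especially for absences) are drawn as individual markers beyond the whiskers. With matplotlib's default setting (whis=1.5), the whiskers stop at the most extreme data points within 1.5 times the interquartile range from the box, and anything further out is drawn separately. The following sketch reproduces that rule with NumPy on a made-up sample.

```python
import numpy as np

# Hypothetical absences sample with one extreme value.
data = np.array([0, 0, 2, 4, 4, 6, 8, 10, 75])

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (matplotlib's default whis=1.5).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)             # 2.0 4.0 8.0
print(whisker_low, whisker_high)  # 0 10
print(outliers)                   # [75]
```

So the whisker ends coincide with the minimum and maximum only when no value lies outside the fences.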

3-3-6 Coefficient of variation

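The coefficient of variation CV is the standard deviation $\sigma$ divided by the mean $\bar{x}$. It does not depend on the scale of the data, so the degree of dispersion of different columns can be compared.

# numeric_only=True skips the non-numeric (object) columns
print(student_data_math.std(numeric_only=True)/student_data_math.mean(numeric_only=True))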
age           0.076427
Medu          0.398177
Fedu          0.431565
traveltime    0.481668
studytime     0.412313
failures      2.225319
famrel        0.227330
freetime      0.308725
goout         0.358098
Dalc          0.601441
Walc          0.562121
health        0.391147
absences      1.401873
G1            0.304266
G2            0.351086
G3            0.439881
dtype: float64
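
The scale independence can be confirmed numerically: multiplying every value by a constant multiplies both the standard deviation and the mean by that constant, so their ratio is unchanged. The sketch below uses made-up measurements and a hypothetical helper function. It uses NumPy's std (ddof=0), whereas the pandas call above uses ddof=1; the invariance holds for either choice.

```python
import numpy as np

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population form, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

# Hypothetical measurements, once in metres and once in centimetres.
metres = np.array([1.5, 1.6, 1.7, 1.8])
centimetres = metres * 100

print(coefficient_of_variation(metres))
print(coefficient_of_variation(centimetres))  # same value, up to rounding
```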

3-3-7 Scatter plot and correlation coefficient

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
x = student_data_math['G1']
y = student_data_math['G3']
ax1.plot(x,y, 'o')
ax1.set_xlabel('G1-grade')
ax1.set_ylabel('G3-grade')
plt.grid(True)
plt.show()

Students with a high G1 grade also tend to have a high G3 grade. However, some students have a G3 grade of 0. These are outliers, but there can be various reasons for them, and whether to exclude them is debatable. So, to see how many absences the students with a G3 grade of 0 have, draw the following scatter plot.

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
x = student_data_math['G3']
y = student_data_math['absences']
ax1.plot(x,y, 'o')
ax1.set_xlabel('G3-grade')
ax1.set_ylabel('absences')
plt.grid(True)
plt.show()

The result is that students with a G3 grade of 0 have 0 absences. Something is odd. One can imagine that they dropped out partway and their absences were no longer counted. Furthermore, the correlation between the G1 grade and the number of absences is as follows. To begin with, some students already have a grade of 0 at G2. And looking at the correlation between the G2 and G3 grades, you can see that some students drop to 0, and that the number of such students gradually increases. It also seems that the dropouts come from those with low scores. Therefore, it is important to analyze the data from various angles rather than rushing to a conclusion from a single graph.

3-3-7-1 Covariance

The definition is as follows.

S_{xy}=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})

That is, when $y = x$ this reduces to the variance defined above, which is why the diagonal terms of the covariance matrix are variances. So what does the off-diagonal term mean? According to the reference, if $x$ and $y$ have an inherently linear relationship, the equation of the straight line derived by the least squares method is as follows.
【reference】 Linear regression analysis / least squares means covariance / correlation coefficient

y=\frac{S_{xy}}{\sigma^2_{x}}x + \bar y-\frac{S_{xy}}{\sigma^2_x}\bar x \\
\text{When transformed,} \\
\frac{y-\bar y}{\sigma_y}=\frac{S_{xy}}{\sigma_x\sigma_y}\frac{x-\bar x}{\sigma_x}

That is, the slope of the linear equation standardized by the mean and the standard deviation is as follows.

r_{xy}=\frac{S_{xy}}{\sigma_x\sigma_y}

In other words, it is the covariance standardized by the standard deviations, and this is the definition of the so-called correlation coefficient $r_{xy}$.
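
The relations above can be checked numerically. The sketch below uses made-up paired data and computes the slope, intercept, and correlation coefficient from the covariance in the 1/n form, then compares them with NumPy's own least-squares fit (np.polyfit) and np.corrcoef.

```python
import numpy as np

# Hypothetical paired data (for example, two grade columns).
x = np.array([5, 5, 7, 15, 6, 10, 12, 8], dtype=float)
y = np.array([6, 6, 10, 15, 10, 11, 13, 7], dtype=float)

s_xy = np.mean((x - x.mean()) * (y - y.mean()))  # covariance, 1/n form
sigma_x = x.std()                                # ddof=0, matches the 1/n form
sigma_y = y.std()

slope = s_xy / sigma_x**2
intercept = y.mean() - slope * x.mean()
r_xy = s_xy / (sigma_x * sigma_y)

# Cross-check against NumPy's least-squares fit and correlation matrix.
fit_slope, fit_intercept = np.polyfit(x, y, 1)
print(np.isclose(slope, fit_slope), np.isclose(intercept, fit_intercept))
print(np.isclose(r_xy, np.corrcoef(x, y)[0, 1]))
```

The choice between 1/n and 1/(n-1) does not matter for the slope or for $r_{xy}$, because the factor cancels in both ratios.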

3-3-7-2 Correlation coefficient

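Here, we will find the covariance and the correlation coefficient. In the matrix below, the off-diagonal elements are the covariance of G1 and G3, and the diagonal elements are the variances of G1 and G3. Note that np.cov divides by n-1 by default, unlike the 1/n definition above.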

print(np.cov(student_data_math['G1'],student_data_math['G3']))
[[11.01705327 12.18768232]
 [12.18768232 20.9896164 ]]

In the output below, the first value is the correlation coefficient and the second is the p-value.

The p-value will be covered in a later chapter.

print(sp.stats.pearsonr(student_data_math['G1'],student_data_math['G3']))
(0.801467932017414, 9.001430312277865e-90)

The correlation matrix is calculated below.

print(np.corrcoef(student_data_math['G1'],student_data_math['G3']))
[[1.         0.80146793]
 [0.80146793 1.        ]]
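
The correlation matrix can be recovered from the covariance matrix by dividing each entry by the product of the two standard deviations, which is the definition of $r_{xy}$ applied element by element. The sketch below takes the covariance values printed by np.cov above as input.

```python
import numpy as np

# Covariance matrix of G1 and G3 as printed by np.cov above.
cov = np.array([[11.01705327, 12.18768232],
                [12.18768232, 20.9896164]])

std = np.sqrt(np.diag(cov))      # standard deviations of G1 and G3
corr = cov / np.outer(std, std)  # divide each entry by sigma_i * sigma_j

print(corr)  # off-diagonal entries are approximately 0.8015
```

The off-diagonal value agrees with the output of pearsonr and np.corrcoef above.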

3-3-8 Draw histograms and scatter plots for all variables

Dalc: weekday alcohol consumption
Walc: weekend alcohol consumption
Draw a scatter plot matrix to see whether these are correlated with the G1 and G3 scores.
Result: a correlation seems unlikely.

g = sns.pairplot(student_data_math[['Dalc','Walc','G1','G3']])
g.savefig('seaborn_pairplot_g.png')

【reference】 Create a pair plot diagram (scatter plot matrix) with Python, pandas, seaborn

There is almost no correlation between Walc and the G3 score.

print(np.corrcoef(student_data_math['Walc'],student_data_math['G3']))
[[ 1.         -0.05193932]
 [-0.05193932  1.        ]]

The mean G3 score also differs little between the Walc groups.

print(student_data_math.groupby('Walc')['G3'].mean())
Walc
1    10.735099
2    10.082353
3    10.725000
4     9.686275
5    10.142857
Name: G3, dtype: float64

Exercise 3-1

Summary statistics (describe()) of the Portuguese course data student-por.csv, which has 649 rows:

              age        Medu        Fedu  traveltime  ...    absences          G1          G2          G3
count  649.000000  649.000000  649.000000  649.000000  ...  649.000000  649.000000  649.000000  649.000000
mean    16.744222    2.514638    2.306626    1.568567  ...    3.659476   11.399076   11.570108   11.906009
std      1.218138    1.134552    1.099931    0.748660  ...    4.640759    2.745265    2.913639    3.230656
min     15.000000    0.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000    0.000000
25%     16.000000    2.000000    1.000000    1.000000  ...    0.000000   10.000000   10.000000   10.000000
50%     17.000000    2.000000    2.000000    1.000000  ...    2.000000   11.000000   11.000000   12.000000
75%     18.000000    4.000000    3.000000    2.000000  ...    6.000000   13.000000   13.000000   14.000000
max     22.000000    4.000000    4.000000    4.000000  ...   32.000000   19.000000   19.000000   19.000000

Exercise 3-2

# Columns used to identify the same student in both datasets
keys = ['school','sex','age','address','famsize','Pstatus','Medu','Fedu','Mjob','Fjob','reason','nursery','internet']
df = student_data_math.merge(student_data_por, on=keys, suffixes=('_math', '_por'))
print(df.head())
  school sex  age address famsize Pstatus  ...  Walc_por  health_por absences_por G1_por G2_por G3_por
0     GP   F   18       U     GT3       A  ...         1           3            4      0     11     11
1     GP   F   17       U     GT3       T  ...         1           3            2      9     11     11
2     GP   F   15       U     LE3       T  ...         3           3            6     12     13     12
3     GP   F   15       U     GT3       T  ...         1           5            0     14     14     14
4     GP   F   16       U     GT3       T  ...         2           5            0     11     13     13

[5 rows x 53 columns]
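
How the merge treats key columns and the remaining columns is easier to see on a miniature example. The two tables below are made up; only the column names and suffixes follow the exercise.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
math_df = pd.DataFrame({'school': ['GP', 'GP', 'MS'],
                        'age': [18, 17, 16],
                        'G3': [6, 10, 15]})
por_df = pd.DataFrame({'school': ['GP', 'GP', 'MS'],
                       'age': [18, 17, 19],
                       'G3': [11, 12, 13]})

# Inner join on the key columns; non-key columns that exist in both
# tables receive the suffixes.
merged = math_df.merge(por_df, on=['school', 'age'], suffixes=('_math', '_por'))

print(merged.columns.tolist())  # ['school', 'age', 'G3_math', 'G3_por']
print(len(merged))              # 2
```

Only rows whose key values match in both tables survive the default inner join, and the key columns appear once. This is consistent with the 53 columns above: 13 key columns plus 20 remaining columns from each dataset.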

Exercise 3-3

gm = sns.pairplot(df[['G1_math','G3_math','G1_por','G3_por']])
gm.savefig('seaborn_pairplot_gm.png')

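The math and Portuguese grades appear to be positively correlated, and the variance appears to be smaller in Portuguese than in math. The following results support this: the correlation between G3_math and G3_por is about 0.48, and the variance of G3_por (about 8.7) is well below that of G3_math (about 22.0).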

print(np.corrcoef(df['G1_math'],df['G3_math']))
[[1.        0.8051287]
 [0.8051287 1.       ]]
print(np.corrcoef(df['G3_math'],df['G3_por']))
[[1.         0.48034936]
 [0.48034936 1.        ]]

print(np.cov(df['G1_math'],df['G3_math']))
[[11.2169202  12.63919693]
 [12.63919693 21.9702354 ]]
print(np.cov(df['G3_math'],df['G3_por']))
[[21.9702354   6.63169394]
 [ 6.63169394  8.67560567]]

Chapter 3-4 Simple Regression Analysis

"Following descriptive statistics, let's learn the basics of regression analysis." "Regression analysis is an analysis that predicts numerical values .... Earlier, we graphed the student data. From this scatter plot, we can see that G1 and G3 are likely to be related."

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
ax1.plot(student_data_math['G1'],student_data_math['G3'],'o')
ax1.set_xlabel('G1_Grade')
ax1.set_ylabel('G3_Grade')
ax1.grid(True)
plt.show()

"In a regression problem, we assume a relational expression for the given data and find the coefficients that best fit the data. Specifically, we predict the G3 grade based on the G1 grade, which is known in advance. That is, there is a target variable G3 (called the objective variable), and the variable G1 that explains it (called the explanatory variable) is used for prediction. In regression analysis, there may be one explanatory variable or several. The former is called simple regression analysis and the latter multiple regression analysis. This chapter explains simple regression analysis."

(Slightly paraphrased.)

3-4-1 Linear Simple Regression Analysis

"Here, we will explain how to solve the regression problem by a method called linear simple regression, which assumes that the output and input have a linear relationship in simple regression analysis."

import pandas as pd
from sklearn import linear_model

reg = linear_model.LinearRegression()
student_data_math = pd.read_csv('./chap3/student-mat.csv', sep =';')

x = student_data_math.loc[:,['G1']].values
y = student_data_math['G3'].values
reg.fit(x,y)
print('Regression coefficient;',reg.coef_)
print('Intercept;',reg.intercept_)
Regression coefficient;[1.10625609]
Intercept;-1.6528038288004616

fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
ax1.plot(student_data_math['G1'],student_data_math['G3'],'o')
ax1.plot(x,reg.predict(x))
ax1.set_xlabel('G1_Grade')
ax1.set_ylabel('G3_Grade')
ax1.grid(True)
plt.show()
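
With the coefficients printed by the fit, a prediction is just the straight line G3 = slope × G1 + intercept. The sketch below plugs in the printed values; the function name is hypothetical.

```python
# Coefficients as printed by the fit above.
SLOPE = 1.10625609
INTERCEPT = -1.6528038288004616

def predict_g3(g1):
    """Predicted final grade G3 for a first-period grade G1."""
    return SLOPE * g1 + INTERCEPT

for g1 in (5, 10, 15):
    print(g1, round(predict_g3(g1), 2))
# 5 3.88
# 10 9.41
# 15 14.94
```

These are the same values that reg.predict would return for G1 = 5, 10 and 15, up to the rounding of the printed coefficient.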

3-4-2 Coefficient of determination

R^2 = 1- \frac{\sum_{i=1}^{n}(y_i-f(x_i))^2}{\sum_{i=1}^{n}(y_i-\bar y)^2}

The above quantity is called the coefficient of determination. Its maximum value is $R^2 = 1$, and the closer it is to 1, the better the model fits.

print('Coefficient of determination;',reg.score(x,y))
Coefficient of determination; 0.64235084605227
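
For a least-squares line with an intercept, the coefficient of determination equals the square of the correlation coefficient. The following sketch checks this on made-up data by evaluating the formula directly, and then on the two values printed in this article.

```python
import numpy as np

# Hypothetical paired data.
x = np.array([5, 5, 7, 15, 6, 10, 12, 8], dtype=float)
y = np.array([6, 6, 10, 15, 10, 11, 13, 7], dtype=float)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # numerator of the formula above
ss_tot = np.sum((y - y.mean()) ** 2)  # denominator
r2 = 1 - ss_res / ss_tot

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r2, r ** 2))         # True

# The values printed in this article agree in the same way:
# correlation 0.8014... squared gives the score 0.6423...
print(0.801467932017414 ** 2)
```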

Comprehensive problem 3-2-1 Lorenz Curve

df0 = student_data_math[student_data_math['sex'].isin(['M'])]
df = df0.sort_values(by=['G1'])
df['Ct']=np.arange(1,len(df)+1)

x = df['Ct']
print(x)
y = df['G1'].cumsum()
print(y)
fig, (ax1) = plt.subplots(1, 1, figsize=(8,6))
ax1.plot(x/max(x),y/max(y))
ax1.set_xlabel('peoples')
ax1.set_ylabel('G1_Grade.cumsum')
ax1.grid(True)
plt.show()

248      1
144      2
164      3
161      4
153      5
      ...
113    183
129    184
245    185
42     186
47     187
Name: Ct, Length: 187, dtype: int32
248       3
144       8
164      13
161      18
153      23
       ...
113    2026
129    2044
245    2062
42     2081
47     2100
Name: G1, Length: 187, dtype: int64
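
The same construction can be written as a small function. This is a sketch with a hypothetical helper name and made-up grades: it sorts the values, accumulates them, and normalizes both axes to the range 0 to 1, which corresponds to x/max(x) and y/max(y) in the code above.

```python
import numpy as np

def lorenz_curve(values):
    """Return (population share, cumulative value share) for sorted values."""
    v = np.sort(np.asarray(values, dtype=float))
    cum_share = np.cumsum(v) / v.sum()
    pop_share = np.arange(1, len(v) + 1) / len(v)
    return pop_share, cum_share

# Hypothetical G1 grades.
pop, share = lorenz_curve([3, 5, 5, 8, 10, 12, 15, 19])
print(pop[-1], share[-1])  # 1.0 1.0
```

Because the values are sorted in ascending order, the curve never rises above the diagonal (the line of perfect equality), and the gap between the two shows how unevenly the grades are distributed.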

[Figures: Lorenz curves of cumulative G1 vs. people for M and F with a reference line, followed by separate plots for M and for F]