Python environment construction, SQL execution examples against a DB, and notes on basic statistical processing (2019)

Postscript 20180611

This page assumes a standalone setup on Windows, but I have also written an up-to-date version of the setup using Sakura's VPS. Take your time and have a look! :relaxed: LAMP + WP + Py setup memo 2018

Postscript 20180816

The 2018 latest version (using Visual Studio) of the standalone environment setup on Windows is now available! Take your time and have a look! :relaxed: Notes on Python environment construction and SQL execution example to DB 2018

Postscript 20190201

This page gets a lot of views, so I have reviewed all of its content. As expected, the number of Japanese-language pages on these topics has grown considerably over three years :relaxed:

Reference page

- Install Python3

Install Python3

Installation location


C:\Python

Download (64bit version)

https://www.python.org/downloads/windows/

Install after adding the path to the environment variables

Check "Add Python 3.x to PATH"; the command prompt will then recognize python. I chose "Customize installation" because I wanted to change the installation location.

Verification

Environment variables (super important)

Press the Start button and type "env" (= environment); "Edit the system environment variables" will appear. Make sure the installation location is recorded there.

Command Prompt

Press the Start button and type "cmd" to bring up the "Command Prompt". Because the environment variable is set, python commands can now be run from anywhere.
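To double-check which python.exe the PATH actually resolves to, Windows' standard "where" command lists every match in PATH order (the output below is an example assuming the C:\Python install location chosen above):


C:\Users\yoshi> where python
  C:\Python\python.exe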

Check the Python version


Microsoft Windows [Version 10.0.17134.523]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\yoshi> python -V
  Python 3.7.2

List the currently installed libraries (nothing there yet)


C:\Users\yoshi> pip freeze
C:\Users\yoshi>


Installing the must-have packages

pyodbc is used for the DB connection; pandas is a library that makes reading and analyzing CSV and XLS files easy.

For now, install pandas and pyodbc.


C:\Users\yoshi> python -m pip install --upgrade pip
C:\Users\yoshi> pip install pandas
C:\Users\yoshi> pip install pyodbc

List the installed libraries again


C:\Users\yoshi>pip freeze
  numpy==1.16.1
  pandas==0.24.0
  pyodbc==4.0.25
  python-dateutil==2.7.5
  pytz==2018.9
  six==1.12.0

Moving directories

MS-DOS basics: move to the program's directory before running a Python program.

So I use this directory as my work location:


C:\Users\yoshi> cd /d "D:\OneDrive\document\Project\Python"
D:\OneDrive\document\Project\Python>

How to run a program

Move to the directory that contains the program ("cd C:\xxxx\xxxx..") and run it by typing "python xxxx.py".

main.py


print("Hello world!")

You did it!


D:\OneDrive\document\Project\Python> python main.py
  Hello world!

If you make your own library

A quick summary of Python imports. Say you created a handy library called "PyUtils.py" and stored it in a directory called "lib". You can then call that library (= your user-defined functions) from main.py with the reserved word "import", and there are two ways to write it. Method 2 is the shorter incantation, but which should you use? (A wildcard import like method 2 pulls every name into the current namespace, so the explicit form in method 1 is generally considered the safer habit.)

Current directory status


D:\OneDrive\document\Project\Python> tree /F
  D:.
  │  main.py
  └─ lib
       PyUtils.py

How to import 1

main.py


import lib.PyUtils

print("Hello world!")
instance = lib.PyUtils.HelloClass()
instance.HelloWorldByPyUtils()

lib/PyUtils.py


class HelloClass:
    def HelloWorldByPyUtils(self):
        print("Hello PyUtils")

test


D:\OneDrive\document\Project\Python> python main.py
  Hello world!
  Hello PyUtils

How to import 2

main.py


from lib.PyUtils import *

print("Hello world!")
instance = HelloClass()
instance.HelloWorldByPyUtils()

lib/PyUtils.py


class HelloClass:
    def HelloWorldByPyUtils(self):
        print("Hello PyUtils")

test


D:\OneDrive\document\Project\Python> python main.py
  Hello world!
  Hello PyUtils

Create a CSV to be read as UTF-8

I made the sample data in Visual Studio Code :relaxed: The names look plausible because they came out of a name generator (a handy tool for this kind of thing).

(Here is a copy-and-paste-ready version.)

customer.csv

Affiliation,Full name,Full name (Kana),mail address,Street address,birthday
recruit,Honda Mamoru,Honda protect,[email protected],Hokkaido,2019/2/1
recruit,Kengo Hayasaka,Hayasaka Kengo,[email protected],Aomori Prefecture,2019/2/2
recruit,Shin Tateishi,Proceed,[email protected],Iwate Prefecture,2019/2/3
recruit,Reina Washio,Takayuki Washio,[email protected],Miyagi Prefecture,2019/2/4
recruit,Takuji Guo,Lottery,[email protected],Akita,2019/2/5
recruit,Mikio Ogushi,Mikio Ogushi,[email protected],Yamagata Prefecture,2019/2/6
recruit,Norihiko Shimanuki,Shimanuki Norihiko,[email protected],Fukushima Prefecture,2019/2/7
recruit,Hidetaka Kojo,Hidetaka Kojo,[email protected],Ibaraki Prefecture,2019/2/8
recruit,Takeshi Kurei,Kujirai Kenji,[email protected],Tochigi Prefecture,2019/2/9
recruit,Genta Kurusu,Kurusu Genta,[email protected],Gunma Prefecture,2019/2/10
recruit,Takahiro Shioi,Shioi Takahiro,[email protected],Saitama,2019/2/11
recruit,Kiyoto Moroda,Kiyohito Morota,[email protected],Chiba,2019/2/12
recruit,Jinguji Yoshifumi,Jinguuji Yoshifumi,[email protected],Tokyo,2019/2/13
recruit,Ikemizu Ryo,Satori Ikemizu,[email protected],Kanagawa Prefecture,2019/2/14
recruit,Yusuke Kashii,Yusuke,[email protected],Niigata Prefecture,2019/2/15
recruit,Shin Shugo,Mashugo,[email protected],Toyama Prefecture,2019/2/16
recruit,Hirohiko Nishizono,Nishizono Hirohiko,[email protected],Ishikawa Prefecture,2019/2/17
recruit,Nobuhiro Kuroda,I did it Nobuhiro,[email protected],Fukui prefecture,2019/2/18
recruit,Masaharu Nezuka,Masaharu Nezuka,[email protected],Yamanashi Prefecture,2019/2/19
recruit,Nakamuro Ikki,Kazuki Nakamuro,[email protected],Nagano Prefecture,2019/2/20
recruit,Kenji Mimata,Mitsumata Kenji,[email protected],Gifu Prefecture,2019/2/21
recruit,Chihiro Mezaki,Chihiro Mesaki,[email protected],Shizuoka Prefecture,2019/2/22
recruit,Yugo Takita,Yugo Takida,[email protected],Aichi prefecture,2019/2/23
recruit,Eight truth,Makoto Yamoto,[email protected],Mie Prefecture,2019/2/24
recruit,Akihiro Ori,Orii Akihiro,[email protected],Shiga Prefecture,2019/2/25
recruit,Hirotaka Nomaru,No Umaru Hirotaka,[email protected],Kyoto,2019/2/26
recruit,Shun,Jun Yasushi,[email protected],Osaka,2019/2/27
recruit,Kojima Kiichiro,Kojima Kiichirou,[email protected],Hyogo prefecture,2019/2/28
recruit,Mitsuya Shimotsuura,Shimotsuura Mitsuya,[email protected],Nara Prefecture,2019/3/1
recruit,Takayuki Azuchi,And Takayuki,[email protected],Wakayama Prefecture,2019/3/2
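Save the above as data\customer.csv in UTF-8. As a quick sanity check that pandas can read it (a minimal sketch, assuming the working directory used earlier):


import pandas as pd

df = pd.read_csv(r"data\customer.csv", encoding="utf-8")  # encoding shown explicitly; utf-8 is also the pandas default
print(df.shape)              # (30, 6): 30 rows x 6 columns
print(df.columns.tolist())   # the six header names above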

SQL Server setup

The DB is SQL Server. Note that the SQL Server engine and Management Studio (the client software for operating it) are two different things. Download the SQL Server engine.

SQL Server local connection


Add user

Username: python, Password: python. Postscript: if you will be using Django, also check "db_ddladmin" here (so that CREATE TABLE and similar statements can be issued).

Troubleshooting

No matter how many users named python you add, you cannot log in until you switch the server to SQL Server authentication mode. If you change this setting, don't forget to restart the server.

db creation


Create table

CREATETABLE.sql


CREATE TABLE Customer(
      [Affiliation] NVARCHAR(20) NULL
    , [Full name] NVARCHAR(20) NULL
    , [Full name (Kana)] NVARCHAR(20) NULL
    , [mail address] NVARCHAR(20) NULL
    , [Street address] TEXT NULL
    , [birthday] NVARCHAR(20) NULL
);
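If you would rather issue this DDL from Python than from Management Studio, here is a minimal pyodbc sketch (assuming the python/python login and the database "db" created in the previous steps; CREATETABLE.sql is the file shown above):


import pyodbc

# Read the DDL file shown above (a single statement, so no batch splitting is needed)
with open('CREATETABLE.sql', encoding='utf-8') as f:
    ddl = f.read()

con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
cur = con.cursor()
cur.execute(ddl)   # issue the CREATE TABLE
con.commit()
con.close()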

Here is a very concrete example of operating the DB!

main.py


import pyodbc
import pandas as pd
import lib.PyUtils

# Issue a data-deletion query
SQL_TEMPLATE = "DELETE FROM [dbo].[Customer]"   # SQL template
editSql = SQL_TEMPLATE                          # working copy of the template
lib.PyUtils.ExecuteSQLBySQLServer(editSql)      # issue the DELETE statement

# The CSV holding the data to load into the DB is UTF-8
df = pd.read_csv(r"data\customer.csv")

# Issue the data-insertion queries
SQL_TEMPLATE = "INSERT INTO [dbo].[Customer]([Affiliation],[Full name],[Full name (Kana)],[mail address],[Street address],[birthday]) VALUES('{0}','{1}','{2}','{3}','{4}','{5}')"
for line in df.values:
    editSql = SQL_TEMPLATE                      # working copy of the template
    for i, col in enumerate(line):              # substitute the placeholders
        editSql = editSql.replace('{' + str(i) + '}', str(col))
    lib.PyUtils.ExecuteSQLBySQLServer(editSql)  # issue the INSERT statement

# Issue a select query
SQL_TEMPLATE = "SELECT * FROM [dbo].[Customer]" # SQL template
editSql = SQL_TEMPLATE                          # working copy of the template
df = lib.PyUtils.ReadQueryBySQLServer(editSql)  # issue the SELECT statement
for line in df.values:
    print(','.join(line))                       # format and print the SQL results

lib/PyUtils.py


import pyodbc
import pandas as pd

def ExecuteSQLBySQLServer(sql):
    # Execute a statement that modifies the DB (DELETE/INSERT/UPDATE/DDL)
    con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
    cur = con.cursor()
    cur.execute(sql)
    con.commit()
    con.close()

def ReadQueryBySQLServer(sql):
    # Execute a SELECT and return the result set as a pandas DataFrame
    con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
    df = pd.read_sql(sql, con)
    con.close()
    return df
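One caveat: building SQL by string replacement breaks as soon as a value contains a single quote, and it is the classic SQL injection pattern. pyodbc also supports ? placeholders, so a safer variant of the INSERT loop could look like this (a sketch reusing the table and connection string above; cursor.executemany is standard pyodbc):


import pyodbc
import pandas as pd

df = pd.read_csv(r"data\customer.csv")

con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
cur = con.cursor()
# ? placeholders let the driver quote and escape each value for us
sql = ("INSERT INTO [dbo].[Customer]"
       "([Affiliation],[Full name],[Full name (Kana)],[mail address],[Street address],[birthday]) "
       "VALUES (?,?,?,?,?,?)")
cur.executemany(sql, df.values.tolist())  # bind all 30 rows in one call
con.commit()
con.close()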

Verification

test


D:\OneDrive\document\Project\Python> python main.py
recruit,Honda Mamoru,Honda protect,[email protected],Hokkaido,2019/2/1
recruit,Kengo Hayasaka,Hayasaka Kengo,[email protected],Aomori Prefecture,2019/2/2
recruit,Shin Tateishi,Proceed,[email protected],Iwate Prefecture,2019/2/3
recruit,Reina Washio,Takayuki Washio,[email protected],Miyagi Prefecture,2019/2/4
recruit,Takuji Guo,Lottery,[email protected],Akita,2019/2/5
recruit,Mikio Ogushi,Mikio Ogushi,[email protected],Yamagata Prefecture,2019/2/6
recruit,Norihiko Shimanuki,Shimanuki Norihiko,[email protected],Fukushima Prefecture,2019/2/7
recruit,Hidetaka Kojo,Hidetaka Kojo,[email protected],Ibaraki Prefecture,2019/2/8
recruit,Takeshi Kurei,Kujirai Kenji,[email protected],Tochigi Prefecture,2019/2/9
recruit,Genta Kurusu,Kurusu Genta,[email protected],Gunma Prefecture,2019/2/10
recruit,Takahiro Shioi,Shioi Takahiro,[email protected],Saitama,2019/2/11
recruit,Kiyoto Moroda,Kiyohito Morota,[email protected],Chiba,2019/2/12
recruit,Jinguji Yoshifumi,Jinguuji Yoshifumi,[email protected],Tokyo,2019/2/13
recruit,Ikemizu Ryo,Satori Ikemizu,[email protected],Kanagawa Prefecture,2019/2/14
recruit,Yusuke Kashii,Yusuke,[email protected],Niigata Prefecture,2019/2/15
recruit,Shin Shugo,Mashugo,[email protected],Toyama Prefecture,2019/2/16
recruit,Hirohiko Nishizono,Nishizono Hirohiko,[email protected],Ishikawa Prefecture,2019/2/17
recruit,Nobuhiro Kuroda,I did it Nobuhiro,[email protected],Fukui prefecture,2019/2/18
recruit,Masaharu Nezuka,Masaharu Nezuka,[email protected],Yamanashi Prefecture,2019/2/19
recruit,Nakamuro Ikki,Kazuki Nakamuro,[email protected],Nagano Prefecture,2019/2/20
recruit,Kenji Mimata,Mitsumata Kenji,[email protected],Gifu Prefecture,2019/2/21
recruit,Chihiro Mezaki,Chihiro Mesaki,[email protected],Shizuoka Prefecture,2019/2/22
recruit,Yugo Takita,Yugo Takida,[email protected],Aichi prefecture,2019/2/23
recruit,Eight truth,Makoto Yamoto,[email protected],Mie Prefecture,2019/2/24
recruit,Akihiro Ori,Orii Akihiro,[email protected],Shiga Prefecture,2019/2/25
recruit,Hirotaka Nomaru,No Umaru Hirotaka,[email protected],Kyoto,2019/2/26
recruit,Shun,Jun Yasushi,[email protected],Osaka,2019/2/27
recruit,Kojima Kiichiro,Kojima Kiichirou,[email protected],Hyogo prefecture,2019/2/28
recruit,Mitsuya Shimotsuura,Shimotsuura Mitsuya,[email protected],Nara Prefecture,2019/3/1
recruit,Takayuki Azuchi,And Takayuki,[email protected],Wakayama Prefecture,2019/3/2

D:\OneDrive\document\Project\Python>


pandas is convenient

A DataFrame is, conceptually, "array data that you can manipulate the way you would play with a table in SQL". You can drop columns or replace NaN with 0, so once you get used to it, it is easier to handle than a plain array. Data frames get heavy use in the R language as well. See the pandas official reference.

main.py


import pyodbc
import pandas as pd
import lib.PyUtils

# Issue a data-deletion query
SQL_TEMPLATE = "DELETE FROM [dbo].[Customer]"   # SQL template
editSql = SQL_TEMPLATE                          # working copy of the template
lib.PyUtils.ExecuteSQLBySQLServer(editSql)      # issue the DELETE statement

# The CSV holding the data to load into the DB is UTF-8
df = pd.read_csv(r"data\customer.csv")

##################################
#Now process the data with pandas#
##################################

Data processing Lv1

From here on it is string processing, so I will work in the Python console. For business use, I think there are three DataFrame operations you will use constantly (plus a fourth that comes up less often).

Python3.7Console


Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd

#Prepare sample data
>>> df = pd.DataFrame([np.nan, 2, np.nan, 0])
>>> df
             0
        0  NaN
        1  2.0
        2  NaN
        3  0.0

# No.1 Replace NaN (missing values) with 0
>>> df = df.fillna(0)
>>> df
             0
        0  0.0
        1  2.0
        2  0.0
        3  0.0

# No.2 Convert to string and zero-pad to 5 digits
# astype(str)   ... convert the type to string
# .str          ... accessor to string methods (enables things like .split)
>>> df[0] = df[0].astype(str).str.zfill(5)
>>> df
               0
        0  000.0
        1  002.0
        2  000.0
        3  000.0

# No.3 Convert to a NULL the DB understands
# pandas has no NULL (undefined); NaN (Not Available, a missing value) is not the same thing
# In Python, None corresponds to NULL; e.g. to store NULL in SQLite3, pass None
>>> df = pd.DataFrame([np.nan, 2, np.nan, 0])
>>> df = df.where(pd.notna(df), None)
>>> df
              0
        0  None
        1   2.0
        2  None
        3   0.0

# No.4 Drop a column (axis means "axis"; setting it to 1 means "operate column-wise")
>>> df = df.drop([0], axis=1)
>>> df
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

Data processing Lv2: fill missing values with a value decided per group

In preprocessing statistical data, a common first step is to fill missing values (≈ NULL values) with something like the mean or the median.

For example, Kaggle's stacked-regressions page for home price prediction (https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard) has a part where missing values are filled with a per-group median. This kind of task seems extremely common in business, too. With SQL I can think of two ways to do it:

  1. Create a table or view holding the aggregation
  2. Use a LEFT JOIN to update only the null rows (a pandas sketch of this approach follows the median table below)

LotFrontage is the "width of frontage" (the width of the street-facing front of the lot). What value should we fill in when LotFrontage is missing? One idea is that neighboring houses, i.e. houses with the same 'Neighborhood', should have roughly similar LotFrontage values. So this time the fill value will be "the median LotFrontage of houses in the same Neighborhood".

The table below holds the LotFrontage median per "Neighborhood". Whenever an NA turns up, we want to fill it with the matching group's median.

Neighborhood LotFrontage(median)
Blmngtn 43
Blueste 24
BrDale 21
BrkSide 52
ClearCr 80
CollgCr 70
Crawfor 74
Edwards 64.5
Gilbert 65
IDOTRR 60
MeadowV 21
Mitchel 73
NAmes 73
NoRidge 91
NPkVill 24
NridgHt 88.5
NWAmes 80
OldTown 60
Sawyer 71
SawyerW 66.5
Somerst 73.5
StoneBr 61.5
SWISU 60
Timber 85
Veenker 68
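For reference, approach 2 from the list above (aggregate, then LEFT JOIN and update only the null rows) can be written in pandas roughly like this. A sketch assuming the Kaggle DataFrame `data` with its Neighborhood and LotFrontage columns:


import pandas as pd

# Aggregation step: the per-group medians, i.e. the table above
medians = (data.groupby('Neighborhood')['LotFrontage']
               .median()
               .reset_index()
               .rename(columns={'LotFrontage': 'LotFrontage_med'}))

# LEFT JOIN the medians back onto the detail rows...
data = data.merge(medians, on='Neighborhood', how='left')

# ...then update only the rows where LotFrontage is null, and drop the helper column
data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage_med'])
data = data.drop(columns='LotFrontage_med')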

The mechanism of grouping is split-apply-combine

Fill NAs with the median (but with a different value for each group)


#Take the median LotFrontage of houses in the same Neighborhood and fill with that value
print(data['LotFrontage'].isnull().sum()) # 259 missing values
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].apply(lambda x: x.fillna(x.median()))
print(data['LotFrontage'].isnull().sum()) # 0 missing values

For Python lambdas, see "Python lambda is easy to understand". The mental image: scan the whole column (= process the full amount) from top to bottom, and the moment an NA is found, fill it with the median of whichever group-by bucket that row belongs to.

About lambda


# lambda <formal argument>: <return value>
# the argument used inside this lambda function is named x
# x is the full set of rows of one of the groups produced by the split
# so in the end, every NA is filled with its own group's median
lambda x: x.fillna(x.median())

"Neighborhood" has 25 records starting with Blmngtn and ending with Veenker. When this is grouped, it is internally divided into 25 tables for each group (the total number of details is obtained by adding all the details of 25 tables). Then, the contents of the parentheses of the apply function are applied to each of the 25 tables. ** Well, what you're doing is the same as SQL ** It's written in O'Reilly's diagram that there are few steps ("apply" means "apply") once you understand the point of view. For more information, buy this book properly w It's pretty easy to understand. apply.png

Statistical processing cheat sheet

This cheat sheet reads as if it were simple, but statistics still has a high bar to entry. As with programming in general, copy-and-paste only works after you understand what you are doing. I could not crack statistics on my own, so I took TechAcademy's data science course (¥139,000 for 4 weeks). It basically traces "a new textbook on statistics learned with Python" (https://www.amazon.co.jp/dp/4798155063), but since you also learn how to fight in Kaggle, I can recommend it as online teaching material. I got to ask plenty of questions that are not in the book, including the reverse question "which area of math do I need to go back to in order to solve this?". I had been grinding away on my own and had ended up doctor-shopping for books. Sound familiar? Then you are better off paying for it (and once you take that step, you may find you can suddenly ask sharp questions :muscle:). It was not easy, but even while working I got through all 456 pages (the whole book) in 4 weeks lol

Cheat 1 (the basic flow: data loading / visualization / missing-value inspection / missing-value filling / strong-correlation check / model building / regression line / overall validation)

The point of building a model is to produce a mathematical function (make something like y = x^2, and the computer then does the calculation by pushing numbers through the formula).

Cheat sheet 1 (variable names are not consistent because this is a collection of snippets)


#◆ Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm             # needed below for glm families / qqplot
import statsmodels.formula.api as smf    # needed below for ols / glm
%precision 3 # display 3 decimal places
%matplotlib inline # jupyter only
sns.set()

#◆ Read data
data = pd.read_csv(r'data/kaggle_housing_price_revised.csv')

#◆ Confirmation 1 Data visualization
# 【pairplot】
sns.pairplot(data, hue='Category')  #Hue color-codes each specified group
#[Scatter plot with histogram]
sns.jointplot(x = 'temperature', y = 'beer', data = beer, color = 'black')
#[Box plot]
sns.boxplot(x = 'Line', y = 'Density', data = data, color = 'gray')
#【Scatter plot】
fig, ax = plt.subplots()
ax.scatter(x = data['GrLivArea'], y = data['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

#◆ Confirmation 2: graph the ratio of missing values per column
all_data_na = (data.isnull().sum() / len(data)) * 100
#Drop columns with no missing values (ratio == 0) and sort in descending order
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.ylim([0,100]) #Fixed scale with maximum value set to 100

#◆ Fill in the missing values found in Confirmation 2
#Fill with the string 'None':
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
#0 fill:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
#Median fill:
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].apply(lambda x: x.fillna(x.median()))

#◆ Removal of strongly correlated variables
threshold = 0.8
#Build the correlation matrix from data, taking absolute values so negatives become positive
#(corr() operates on numeric columns only, so corr_matrix.columns shrinks accordingly)
corr_matrix = data.corr().abs()
#Keep only the upper triangle: the diagonal is always 1 and the lower triangle duplicates the upper one
#(np.triu builds a 0/1 mask, astype(bool) makes it True/False, and where() keeps only the True cells)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
#Collect the columns that have at least one correlation above the threshold (SalePrice is then taken off the drop list)
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
to_drop.remove('SalePrice')
data_dropped = data.drop(columns = to_drop)

#◆ Dummy variable
data_dummied = pd.get_dummies(data_dropped)
data_dummied.head()

#◆ Creating a model
#[Least squares method]
lm_model = smf.ols(formula = 'beer ~ temperature', data = beer).fit()
lm_model.summary()
#[Generalized linear model]
glm_model = smf.glm(formula = 'beer ~ temperature', data = data, family=sm.families.Binomial()).fit()
glm_model.summary()
#[AIC inspection] See if AIC is smaller than the NULL model
null_model = smf.ols('beer ~ 1', data = beer).fit()
print('null_model.aic: ', null_model.aic)
print('lm_model.aic: ', lm_model.aic)
print('If the explanatory variable of temperature is included, the AIC will be smaller, so the prediction accuracy will be higher.')

#◆ (Inspection 1) Regression line
sns.lmplot(x = 'temperature', y = 'beer', data = beer, scatter_kws = {'color': 'black'}, line_kws = {'color': 'black'})
#[Check 1] The prediction at temperature 0 should match the intercept
lm_model.predict(pd.DataFrame({'temperature':[0]}))
#[Check 2] The prediction at 20 degrees should match the position of the seaborn regression line
lm_model.predict(pd.DataFrame({'temperature':[20]}))

#◆ (Inspection 2) Check the normality of the residuals with a Q-Q plot of the selected model
sm.qqplot(lm_model.resid, line = 's')
#Predicted value and actual SalePrice can be overlaid on a scatter plot
x = range(len(data))
y = data['SalePrice']
plt.scatter(x, y)
plt.scatter(x, lm_model.predict(data))

Cheat 2 (Random Forest)

Random forest is easy to do, but the interpretation of "what happened" is important.

Cheat sheet 2 (variable names are not consistent because this is a collection of snippets)


#Run Random Forest and compute the MSE on the validation data
#(the train/test split into X_train, X_test, y_train, y_test is assumed to have been done beforehand)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
model_random_forest = RandomForestRegressor(criterion='mse').fit(X_train, y_train)
mean_squared_error(y_test, model_random_forest.predict(X_test))
> 1322565474.5486445
#What is this number!? It makes sense once you check it as "the mean squared difference between the validation data and the predictions"
#◆ Convince yourself by comparing mse (mean squared error) with a hand-rolled version
np.mean((y_test - model_random_forest.predict(X_test))**2)
> 1322565474.5486445
#After fitting the model, arrange y and y-hat in a scatter plot
fig, ax = plt.subplots()
ax.scatter(x = y_test, y = model_random_forest.predict(X_test))
plt.ylabel('predict', fontsize=13)
plt.xlabel('SalePrice', fontsize=13)
plt.show()

◆ Random forest plot: SalePrice and the predicted values line up along a straight line (predictive performance is reasonable).

Cheat 3 (clustering)

Clustering is also easy to do, but the interpretation of "what happened" is important.

Cheat sheet 3 (variable names are not consistent because this is a collection of snippets)


#Module loading
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

#Data read
dataset = pd.read_csv('data/baseball_salary.csv')

#Keep only the 'batting average' and 'RBI' columns
dataset = dataset[['batting average', 'RBI']]

#Normalize with MinMaxScaler
dataset_scaled = MinMaxScaler().fit(dataset).transform(dataset)

#Draw a scatter plot of batting average and RBI after normalization
fig, ax = plt.subplots()
ax.scatter(x = dataset_scaled[:,0], y = dataset_scaled[:,1])
plt.ylabel('points', fontsize=13)
plt.xlabel('batting average', fontsize=13)
plt.show()

#Decide the number of clusters by looking for the elbow (here it looks like 6...)
# WCSS: within-cluster sum of squares (the sum of squared distances from each point to its nearest cluster center)
wcss = []
n_comps = np.arange(1, 10)
for k in n_comps:
    #Initialize KMeans; init defaults to "k-means++"
    km = KMeans(n_clusters=k, init='k-means++').fit(dataset_scaled)
    #the inertia_ attribute holds the WCSS value
    wcss.append(km.inertia_)
fig = plt.figure(figsize=(6, 4))
plt.plot(n_comps, wcss)
plt.xlabel("# of clusters")
plt.ylabel("WCSS")
plt.show()

#Fit and draw the scatter plot (the for loop colors each cluster)
km = KMeans(n_clusters=6, init='k-means++').fit(dataset_scaled)
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
colors = ['dodgerblue', 'orange', 'green', 'tomato', 'blueviolet', 'cornflowerblue']
cluster_ids = list(set(km.labels_))
for k in range(len(cluster_ids)):
    average = dataset_scaled[km.labels_ == k, 0]
    points = dataset_scaled[km.labels_ == k, 1]
    ax.scatter(average, points, c=colors[k], s=3, label=f'cluster = {cluster_ids[k]}')
    ax.scatter(average.mean(), points.mean(), marker='*', c=colors[k], s=150)
ax.set_title('Clustering')
ax.set_xlabel('batting average')
ax.set_ylabel('points')
ax.legend(loc='upper right')

