[PYTHON] Data analysis parts collection


--History

version.cmd


python --version
:: Python 3.7.6
jupyter --version
:: jupyter core     : 4.6.1
:: jupyter-notebook : 6.0.3
:: qtconsole        : 4.6.0
:: ipython          : 7.12.0
:: ipykernel        : 5.1.4
:: jupyter client   : 5.3.4
:: jupyter lab      : 1.2.6
:: nbconvert        : 5.6.1
:: ipywidgets       : 7.5.1
:: nbformat         : 5.0.4
:: traitlets        : 4.3.3

Packages

import.py


import pandas as pd #A library that provides functions to support data analysis
import numpy as np #Numerical calculation extension module
import matplotlib #Package for data visualization
import matplotlib.pyplot as plt #Plotting interface
from datetime import datetime as dt #Module for manipulating dates and times
from sklearn.preprocessing import StandardScaler #Module for data standardization

Handle Japanese in matplotlib

Reference source: Japanese with matplotlib

matplot_japanese.py


from matplotlib import rcParams
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['Hiragino Maru Gothic Pro', 'Yu Gothic', 'Meiryo', 'Takao', 'IPAexGothic', 'IPAPGothic', 'VL PGothic', 'Noto Sans CJK JP']
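
As a quick sanity check (not part of the original snippet; the labels are arbitrary), a plot like the following should render Japanese text without tofu squares once the settings above are applied:


import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.title('売上の推移') #Japanese title used purely as a rendering test
plt.xlabel('月')
plt.ylabel('売上')
plt.show()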

Size of graph or image to display

figsize.py


plt.figure(figsize=(20,2))
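
If every figure should use the same size, a default can also be set once via rcParams instead of passing figsize each time (a small sketch with the same effect):


import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (20, 2) #Default size applied to all subsequent figures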

Data input / output

Read from SQL

#SQL
import pymysql
import sqlalchemy
from sqlalchemy import create_engine

#Connection information
url = 'mysql+pymysql://user:password@hostname:3306/databasename?charset=utf8'
engine = sqlalchemy.create_engine(url, echo=False)
#Run
query = "SELECT * FROM Table"
dataset = pd.read_sql(query,con = engine)
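
Writing a processed frame back to the database works with the same engine; a minimal sketch, assuming a target table named table_out:


dataset.to_sql('table_out', con=engine, if_exists='replace', index=False) #Overwrite table_out with the frame contents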

CSV input / output

csv.py


dataset = pd.read_csv("path of csv", encoding="utf-8")
dataset.to_csv("path of csv", encoding="shift_jis")
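
A few options I often add (a sketch; the file and column names are placeholders):


dataset = pd.read_csv("path of csv", encoding="utf-8", parse_dates=["date_column"]) #Parse a date column on read
dataset.to_csv("path of csv", encoding="shift_jis", index=False) #Drop the index column on write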

Output to clipboard

clipboard.py


!pip install pyperclip
import pyperclip
pyperclip.copy(STR_XXX) #Copy the string variable to the clipboard
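
To copy a whole data frame rather than a plain string, either route below works (a sketch; dataset is assumed to be a pandas DataFrame):


pyperclip.copy(dataset.to_csv(index=False)) #to_csv with no path returns the CSV text
dataset.to_clipboard(index=False) #pandas' built-in equivalent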

Pre-processing

Confirmation of contents

info.py


dataset.info()
dataset.describe() #max, min, mean, std, quartiles, etc.
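
Two more quick checks I tend to run alongside these (a sketch; 'column' is a placeholder name):


dataset.head() #First 5 rows
dataset['column'].value_counts() #Frequency of each value in one column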

Reshaping

index.py


#Indexing data frames
dataset2 = dataset.set_index('StoreCD')
#Type change
dataset['column'] = dataset['column'].astype(int)

Data frame decomposition and combination

--Unpivot cross-tabulated data into simple row-format (long) data

melt_concat.py


#Data frame decomposition
meltDF1 = pd.melt(dataset,id_vars='index_column',var_name='horizon_axis_column_name',value_name='value_column_name')
#Combine data frames
concatDF = pd.concat([meltDF1, meltDF2]) #meltDF2 is another frame melted the same way
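
As a concrete illustration (toy data, not from the original), melting a small cross-tab of monthly sales per store looks like this:


wideDF = pd.DataFrame({'StoreCD': ['A', 'B'], 'Jan': [100, 200], 'Feb': [110, 210]})
longDF = pd.melt(wideDF, id_vars='StoreCD', var_name='month', value_name='sales')
#  StoreCD month  sales
#0       A   Jan    100
#1       B   Jan    200
#2       A   Feb    110
#3       B   Feb    210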

Handling null values

null.py


dataset.fillna(0, inplace=True) #Fill nulls with 0
dataset.isnull() #Check across the whole data frame
dataset.isnull().any() #Check by column
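
Two related steps that often follow (a sketch; 'column' is a placeholder name):


dataset.isnull().sum() #Number of nulls per column
dataset['column'] = dataset['column'].fillna(dataset['column'].mean()) #Fill one numeric column with its mean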

Dummy variables

dummy.py


target_col = 'a'
str_columns = ['b','c','d'] #Non-numeric columns
dummy_cols = ['b'] #Non-numeric columns to convert to dummy variables
exclude_cols = [col for col in str_columns if col not in dummy_cols] #Non-numeric columns left un-dummied (dropped from the features)
#Create dummy variables
df = pd.get_dummies(data=df, columns=dummy_cols)
#Pick the columns to use as features from the columns remaining after dummy conversion
feature_cols = [col for col in df.columns if col not in exclude_cols]
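
A toy example of what get_dummies does to a single categorical column (made-up data, not from the original):


df_small = pd.DataFrame({'a': [1, 0, 1], 'b': ['x', 'y', 'x']})
pd.get_dummies(data=df_small, columns=['b'])
#   a  b_x  b_y
#0  1    1    0
#1  0    0    1
#2  1    1    0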
 

Visualization and exploration

Compute the correlation coefficients between columns

corr.py


corrDF = df[feature_cols].corr()
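
To see which features move most with the target, the same matrix can be sliced and sorted (a sketch, reusing target_col from above and assuming it is among feature_cols):


corrDF[target_col].sort_values(ascending=False) #Correlation of every feature with the target, strongest first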

Heat map

heatmap.py


import seaborn as sns
sns.heatmap(corrDF, annot=False) #annot=True displays the value in each cell

(Output image: correlation heatmap)

Overlaid histograms

distplot.py


import seaborn as sns
g=sns.FacetGrid(df,hue="target_column",height=3)
g.map(sns.distplot,"feature_column",kde=False)
g.add_legend()

(Output image: histograms of the feature overlaid by target class)
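
Note that distplot is deprecated in seaborn 0.11 and later; on newer versions an equivalent sketch uses histplot instead:


g = sns.FacetGrid(df, hue="target_column", height=3)
g.map(sns.histplot, "feature_column")
g.add_legend()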

seaborn color settings

--You can set a color palette that is applied consistently across plots

color.py


flatui = ['#969696', '#DA5019']
sns.set_palette(flatui)

(Output image: plot drawn with the custom palette)

Comment

--I put this together because it is a hassle to dig these snippets out of past code and copy-paste them every time.
