Python data analysis template

When working on Kaggle, you need to analyze the data and engineer your own features. Most of that analysis is done by looking at plots. In this article, I post template snippets for the plots I use in data analysis. The snippets assume a pandas DataFrame (called df or data1 below) with columns such as Age and Survived.

Libraries used

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

Observation of correlation

Scatter plot between all variables

With pandas, you can get scatter plots for every pair of variables in one call. On the diagonal, where a variable would be plotted against itself, a histogram is drawn instead, because a scatter plot of a variable against itself is just a straight line.

from pandas.plotting import scatter_matrix
scatter_matrix(df)
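As a minimal sketch of the call above, the following builds a small made-up DataFrame (the column names and values are assumptions, not real data) and passes a few commonly used options. The non-interactive backend line is only there so the snippet runs without a display; drop it in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy data standing in for df
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Age": rng.integers(1, 80, size=50),
    "Fare": rng.uniform(5, 100, size=50),
    "Sex": rng.choice(["male", "female"], size=50),
})

# Select the numeric columns explicitly so it is clear what gets plotted
numeric = toy.select_dtypes(include="number")

# alpha makes overlapping points visible; diagonal="hist" draws histograms on the diagonal
axes = scatter_matrix(numeric, alpha=0.5, figsize=(6, 6), diagonal="hist")
print(axes.shape)  # one Axes per pair of numeric columns
```

The function returns a 2-D array of Axes, so individual panels can be adjusted afterwards if needed.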

Scatter plot

In addition, a scatter plot between specific variables can be easily created as follows.

df.plot(kind='scatter',x='Age',y='Survived',alpha=0.1,figsize=(4,3))

Calculation of correlation coefficient

Correlation coefficient

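Pearson's correlation coefficient for every pair of numeric columns can be computed in one call with corr(). Very convenient. Note that in pandas 2.0 and later, corr() raises an error if the DataFrame contains non-numeric columns, unless you pass numeric_only=True or select the numeric columns first.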

data1.corr()
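A small sketch of what corr() returns, using a made-up DataFrame (the values are assumptions for illustration). Selecting the numeric columns first is one way to keep the call working regardless of pandas version; the Spearman line is an optional rank-based alternative, not something the template above uses.

```python
import pandas as pd

# Hypothetical toy data: Fare is exactly 2 * Age, Name is non-numeric
toy = pd.DataFrame({
    "Age": [10, 20, 30, 40],
    "Fare": [20, 40, 60, 80],
    "Survived": [1, 1, 0, 0],
    "Name": ["a", "b", "c", "d"],
})

# Keep only numeric columns before computing correlations
numeric = toy.select_dtypes(include="number")

pearson = numeric.corr()                     # Pearson is the default method
spearman = numeric.corr(method="spearman")   # rank-based alternative

print(pearson.round(2))
```

Because Fare is a linear function of Age here, their Pearson coefficient is 1, while Age and Survived are negatively correlated.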

Correlation coefficient heat map

def correlation_heatmap(df):
    _ , ax = plt.subplots(figsize =(14, 12))
    colormap = sns.diverging_palette(220, 10, as_cmap = True)
    
    _ = sns.heatmap(
        df.corr(), 
        cmap = colormap,
        square=True, 
        cbar_kws={'shrink':.9 }, 
        ax=ax,
        annot=True, 
        linewidths=0.1,vmax=1.0, linecolor='white',
        annot_kws={'fontsize':12 }
    )
    
    plt.title('Pearson Correlation of Features', y=1.05, size=15)

correlation_heatmap(data1)

Correlation coefficient for the objective variable

corr_matrix = data1.corr()
fig,ax=plt.subplots(figsize=(15,6))
y=pd.DataFrame(corr_matrix['Survived'].sort_values(ascending=False))
sns.barplot(x = y.index,y='Survived',data=y)
plt.tick_params(labelsize=10)
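The bar plot above includes the target itself (always 1.0) and sorts by signed value. A possible variation, sketched here on made-up data, drops the target and ranks features by the absolute value of their correlation, which can be easier to read when strong negative correlations matter as much as positive ones. The DataFrame and column values are assumptions.

```python
import pandas as pd

# Hypothetical toy data with a binary target
toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 1, 0],
    "Fare": [80, 70, 10, 20, 60, 15],
    "Age": [22, 30, 40, 55, 28, 35],
    "Noise": [1, 2, 1, 2, 2, 1],
})

target = "Survived"

corr_with_target = (
    toy.corr()[target]                 # correlation of every column with the target
    .drop(target)                      # the target's correlation with itself is always 1
    .sort_values(key=lambda s: s.abs(), ascending=False)  # rank by strength, keep the sign
)

print(corr_with_target)
```

The resulting Series can be passed to sns.barplot in the same way as in the snippet above.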

Histogram

Histogram of all variables

You can get a histogram of every numeric column in one call with hist().

df.hist()

Overlay the histogram

plt.figure(figsize=[8,6])

plt.subplot(222)
plt.hist(x = [data1[data1['Survived']==1]['Age'], data1[data1['Survived']==0]['Age']], stacked=True, color = ['g','r'],label = ['Survived','Dead'])
plt.title('Age Histogram by Survival')
plt.xlabel('Age (Years)')
plt.ylabel('# of Passengers')
plt.legend()
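The snippet above stacks the two groups on top of each other (stacked=True). If you would rather have the two distributions literally overlaid, one option is to draw two semi-transparent histograms on shared bin edges. This is a sketch on made-up data; the values and the bin width are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data1
toy = pd.DataFrame({
    "Age": [5, 22, 30, 41, 60, 8, 25, 33, 47, 70],
    "Survived": [1, 1, 1, 0, 0, 1, 0, 0, 0, 0],
})

bins = np.arange(0, 81, 10)  # shared bin edges so the two histograms line up

survived_age = toy.loc[toy["Survived"] == 1, "Age"]
dead_age = toy.loc[toy["Survived"] == 0, "Age"]

fig, ax = plt.subplots(figsize=(8, 6))
# alpha < 1 keeps both distributions visible where they overlap
counts_survived, _, _ = ax.hist(survived_age, bins=bins, alpha=0.5, color="g", label="Survived")
counts_dead, _, _ = ax.hist(dead_age, bins=bins, alpha=0.5, color="r", label="Dead")
ax.set_title("Age Histogram by Survival")
ax.set_xlabel("Age (Years)")
ax.set_ylabel("# of Passengers")
ax.legend()
```

Stacking shows the total per bin, while overlaying makes it easier to compare the shapes of the two groups.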

Explanation of variable distribution

With include='all', non-numeric features are summarized as well.

data1.describe(include = 'all')
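As a small illustration on made-up data (the values are assumptions), include='all' adds rows such as unique, top and freq for non-numeric columns next to the usual numeric statistics. Counting missing values alongside it is a common companion check.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

summary = toy.describe(include="all")
print(summary)

# count in describe() excludes missing values; isnull().sum() shows how many are missing
missing = toy.isnull().sum()
print(missing)
```

Statistics that do not apply to a column type, such as mean for Sex, are shown as NaN.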

Quartile

plt.figure(figsize=[8,6])

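"""
Points drawn as 'o' are treated as outliers.
From bottom to top, the box plot shows:
minimum (excluding outliers)
25th percentile: first quartile
50th percentile: second quartile (median)
75th percentile: third quartile
maximum (excluding outliers)
"""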

plt.subplot(221)
plt.boxplot(data1['Age'], showmeans = True, meanline = True)
plt.title('Age Boxplot')
plt.ylabel('Age (Years)')

You can look at a box plot to see whether there are any outliers. This also helps when deciding how to fill in missing values. When there are outliers or the distribution is skewed, it is better to fill with the median rather than the mean, because the mean is pulled toward the extreme values. On the other hand, if the distribution is roughly symmetric, the mean may be the better choice.
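The reasoning above can be sketched numerically. The example below uses a made-up Age column (the values are assumptions) with one extreme value and two missing entries. It flags outliers with the 1.5 × IQR rule, which is the default whisker rule of matplotlib's boxplot, and compares the mean and the median as fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one extreme value (95) and two missing entries
age = pd.Series([22, 25, 27, 30, 31, 95, np.nan, np.nan], name="Age")

# 1.5 * IQR rule: points beyond the whiskers are drawn as outliers
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
outliers = age[(age < q1 - 1.5 * iqr) | (age > q3 + 1.5 * iqr)]

print("outliers:", outliers.tolist())
print("mean:", round(age.mean(), 2), "median:", age.median())

# The single outlier pulls the mean upward, so the median is the safer fill value here
filled_median = age.fillna(age.median())
filled_mean = age.fillna(age.mean())
```

Here the mean is noticeably larger than the median because of the single extreme value, which is exactly the situation where filling with the median is preferable.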
