[PYTHON] SIGNATE Quest ① From data reading to preprocessing

Background using SIGNATE Quest

While building a BI environment (Talend, Snowflake, QuickSight) as part of my work, machine learning was a field I wanted to learn but had never managed to get into. When I learned that a "Beginner Limited Competition" would be held at SIGNATE, I decided to take this opportunity to challenge myself. I didn't know where to start, but part of "Efficiency of telemarketing in financial institutions" was released for free, so I am recording what I learned from it here as my own learning memo.

Basics

Import pandas

import pandas as pd

Data reading

df = pd.read_csv('data.csv', index_col='id')

Basic statistics for numeric data

print( df.describe() )

Basic statistics for string data

print( df.describe(include=['O']) )

Displaying the type of variable df

print( type(df) )

Display of column y

print( df['y'] )

Display of data type from which only column y is extracted

print( type(df['y']) )

Display of data with columns age, job, y extracted in this order

print( df[['age', 'job', 'y']] )

Display of data with index values 2 and 3 and the columns age, job, y extracted in this order

print( df.loc[[2, 3], ['age', 'job', 'y']] )

Display of data with column y removed

print( df.drop('y', axis=1) )

Simple tabulation

Display of element type and appearance number of column poutcome

print( df['poutcome'].value_counts() )

Display of element types and number of occurrences in column y

print( df['y'].value_counts() )

By using the value_counts function, the elements existing in the column and the number of their occurrences are displayed.


Variable correlation

1. Check numerically: the correlation coefficient
2. Visualize and check

Correlation coefficient

The corr function, one of the functions in the pandas library, shows the correlations between the numeric columns as a matrix of correlation coefficients.

print( df.corr() )

Visualization

Scatter plots and box plots are often used as methods of expressing correlation.

Scatter plot

Used when checking the correlation between two quantitative variables.
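
As a minimal sketch (my own example, assuming the quantitative columns age and duration from data.csv), a scatter plot can be drawn with matplotlib as follows.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data.csv', index_col='id')

# Scatter plot of two quantitative columns (age vs. duration, as an example)
plt.scatter(df['age'], df['duration'])
plt.xlabel('age')
plt.ylabel('duration')
plt.title('age vs duration')
plt.show()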

Box plot

Used to check the correlation between quantitative and qualitative data.

How to make a box plot

1. Create a table of the qualitative and quantitative data.
2. For each category of the qualitative variable, create a box plot from the corresponding quantitative values (for example, for a category a whose values range from 1 to 4, the box plot has a minimum of 1 and a maximum of 4). A code sketch follows below.
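
A minimal sketch using seaborn's boxplot function (my own example, assuming the qualitative column y and the quantitative column duration from data.csv):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv', index_col='id')

# One box per category of y (0 / 1), showing the distribution of duration
sns.boxplot(x=df['y'], y=df['duration'])
plt.title('duration by y')
plt.show()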

Analysis method

Let's form hypotheses, by looking at the basic statistics and graphs, about which variables affect whether or not a fixed deposit application is made. Rather than looking at the data blindly, you can proceed with the analysis efficiently by making hypotheses and verifying them. For example, the following can be considered.

Hypothesis 1. People who applied during the previous campaign may be more likely to apply again (provided they were satisfied with the product).
Hypothesis 2. Since a time deposit is a product that cannot be withdrawn freely, people with a large amount of surplus funds may find it easier to apply.
Hypothesis 3. People who have been in contact with the sales staff for a long time may be those who feel unsure about applying (depending on the skill of the sales staff).

Cross tabulation

To verify Hypothesis 1, calculate the application rate for each category of the column poutcome, using the variable cross to which the cross-tabulation result created in the previous step is assigned, and assign the result to the variable rate.

Perform crosstabs

import pandas as pd
df = pd.read_csv('data.csv', index_col='id')
cross = pd.crosstab(df['poutcome'], df['y'], margins=True)

Calculation of application rate

rate = cross[1] / cross["All"]

Added application rate cvr to cross tabulation results

cross["cvr"] = rate print( cross )

Extract and display only the indexes 'success' and 'failure' and the column 'cvr' from the cross-tabulation table

print( cross.loc[["success", "failure"], "cvr"])

Heat map display

Even if the correlations are computed as a matrix, it is hard to tell from a list of numbers which variables are highly correlated, so visualize the matrix to check.

Library import

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Data reading

df = pd.read_csv('data.csv', index_col='id')

Calculation of correlation coefficient of quantitative data

corr_matrix = df.corr()

Create heatmap

sns.heatmap(corr_matrix, cmap="Reds")

Add title to graph

plt.title('Correlation')

Show graph

plt.show()


Data filtering

In pandas, you can perform the same processing as Excel's filter function by writing DataFrame variable[conditional expression]. If the conditional expression is DataFrame variable['column name'] == value, you can extract only the rows where that column has that value. Typical conditional expressions are as follows.

Data equal to the specified value: DataFrame variable['column name'] == value
Data different from the specified value: DataFrame variable['column name'] != value
Data larger than the specified value: DataFrame variable['column name'] > value
Data greater than or equal to the specified value: DataFrame variable['column name'] >= value

For example, when the variable to which DataFrame is assigned is X, to filter the data on the condition that the value of column A is 0, write as follows.

X[X['Column A'] == 0]

If you want to select only a specific column B of the filtered data, write as follows.

X[X['Column A'] == 0]['Column B']

Using this, you can extract only the column duration (last contact time) for rows where column y (presence / absence of a fixed deposit application; 1: yes, 0: no) is 1.

Import pandas

import pandas as pd

Data reading

df = pd.read_csv('data.csv', index_col='id')

Display data with a y value of 1

print( df[df['y']==1] )

Shows the duration of data with a y value of 1

print( df[df['y']==1]['duration'] )

Histogram drawing

A histogram is a visualization method for checking the distribution of numeric data. With a histogram, you can see the overall range of the data and the range where values appear most frequently.

Next, for the column duration (last contact time), let's draw two histograms on one graph, one for rows where column y (presence / absence of a fixed deposit application; 1: yes, 0: no) is 0 and one for rows where it is 1, and compare the distributions. To create a histogram, use seaborn's distplot function and write:

seaborn.distplot(variable to which the Series is assigned)

To superimpose the two sets of data, simply call the distplot function twice. If you want to add a legend to the graph, specify a label name in the distplot options and then call matplotlib's legend function.

seaborn.distplot(variable to which the Series is assigned, label="label name")
matplotlib.pyplot.legend()

matplotlib provides many functions to improve the appearance of your graph. For example, to name the x-axis and y-axis, use the xlabel and ylabel functions.

matplotlib.pyplot.xlabel(name of the x-axis)
matplotlib.pyplot.ylabel(name of the y-axis)

Also, use the xlim function to specify the x-axis display range. (You can specify the display range for the y-axis with the ylim function as well.)

matplotlib.pyplot.xlim(x-axis lower limit, x-axis upper limit)

Import pandas

import pandas as pd

Data reading

df = pd.read_csv('data.csv', index_col='id')

Import matplotlib.pyplot

import matplotlib.pyplot as plt

seaborn import

import seaborn as sns

Extraction of duration

duration_0 = df[df['y']==0]['duration']
duration_1 = df[df['y']==1]['duration']

Histogram creation

sns.distplot(duration_0, label='y=0')
sns.distplot(duration_1, label='y=1')

Add title to graph

plt.title('duration histogram')

Add name to x-axis of graph

plt.xlabel('duration')

Add name to y-axis of graph

plt.ylabel('frequency')

Specifying the x-axis display range

plt.xlim(0, 2000)

Add legend to graph

plt.legend()

Show graph

plt.show()


Generation and processing of features

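As a rough sketch of what this step typically involves (an assumption on my part, not the quest's exact code): categorical columns such as poutcome are converted into numeric dummy variables so that a model can use them.

import pandas as pd

df = pd.read_csv('data.csv', index_col='id')

# Convert a categorical column into 0/1 dummy columns (one-hot encoding)
dummies = pd.get_dummies(df['poutcome'], prefix='poutcome')

# Attach the dummy columns and drop the original categorical column
df = pd.concat([df, dummies], axis=1).drop('poutcome', axis=1)
print( df.head() )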
