Background using SIGNATE Quest

While building a BI environment (Talend, Snowflake, Quicksight) as part of my work, machine learning was a field that I wanted to learn but couldn't get into. I learned that "Beginner Limited Competition" will be held at SIGNATE, so I decided to take this opportunity to challenge myself. I didn't know where to start, but a part of "Efficiency of telemarketing in financial institutions" was released for free, so I recorded what I learned here as my own learning memo. We are creating this article to keep it in mind.

Basics

Import pandas

import pandas as pd

Data reading

df = pd.read_csv('data.csv', index_col='id')

Basic statistics for numeric data

print( df.describe() )

Basic statistics for string data

print( df.describe(include=['O']) )

It is not a zero, but an uppercase letter O.

Displaying the type of variable df

print( type(df) )

Display of column y

print( df['y'] )

Display of data type from which only column y is extracted

print( type(df['y']) )

Display of data with columns age, job, y extracted in this order

print( df[['age', 'job', 'y']] )

Display of data with indexes 2 and 3 columns age, job, y extracted in this order

print( df.loc[[2, 3], ['age', 'job', 'y']] )

Display of data with column y removed

print( df.drop('y', axis=1) )

Simple tabulation

Display of element type and appearance number of column poutcome

print( df['poutcome'].value_counts() )

Display of element types and number of occurrences in column y

print( df['y'].value_counts() )

By using the value_counts function, the elements existing in the column and the number of their occurrences are displayed.

print( df['poutcome'].value_counts() )

Variable correlation

1. Confirm with numerical value --Correlation coefficient
Visualize and check

Scatter plot --Box plot

Correlation coefficient

The corr function, which is one of the functions in the Pandas library. Correlation is shown by cross tabulation of numerical data. print(df.corr())

Visualization

Scatter plots and box plots are often used as methods of expressing correlation.

Scatter plot

Used when checking the correlation between quantitative data

Box plot

Used to confirm the correlation between quantitative and qualitative data

How to make a box plot

1. Create a table of qualitative and quantitative data
Create a box plot based on each quantitative data for each qualitative data (In the example below, in the case of a, create a boxplot with a maximum value of 4 and a minimum value of 1.)

Analysis method

Let's make a hypothesis by looking at the basic statistics and graphs as to what variables affect the presence or absence of fixed deposit applications. Instead of looking at the data in the dark clouds, you can proceed with the analysis efficiently by making a hypothesis and verifying it. For example, the following can be considered.

Hypothesis 1. People who have applied for the previous campaign may be more likely to repeat (only if the product is satisfactory). Hypothesis 2. Since time deposits are products that cannot be withdrawn freely, it may be easier for people with a large amount of surplus funds to apply. Hypothesis 3. People who have been in contact with sales for a long time may be more likely to feel uneasy about applying (depending on the skill of the sales staff).

Cross tabulation

In order to prove Hypothesis 1, (1) Calculate the application rate for each element of column poutcome using the variable cross to which the cross tabulation result created in the previous Operation is assigned, and assign it to the variable rate.

Perform crosstabs

import pandas as pd df = pd.read_csv('data.csv', index_col='id') cross = pd.crosstab(df['poutcome'], df['y'], margins=True)

Calculation of application rate

rate = cross[1] / cross["All"]

Added application rate cvr to cross tabulation results

cross["cvr"] = rate print( cross )

Extract and display only indexes'success',' failure', and column'cvr' from the crosstabulation table

print( cross.loc[["success", "failure"], "cvr"])

Heat map display

Even if the correlation is created as a matrix table, it is difficult to understand what is highly correlated in the list of numerical values, so visualize it and check it.

Library import

import pandas as pd import matplotlib.pyplot as plt import seaborn as sns

Data reading

df = pd.read_csv('data.csv', index_col='id')

Calculation of correlation coefficient of quantitative data

corr_matrix = df.corr()

Create heatmap

sns.heatmap(corr_matrix, cmap="Reds")

Add title to graph

plt.title('Correlation')

Show graph

plt.show()

Data filtering

In Pandas, you can perform the same processing as the filter function of Excel by describing it as a variable [conditional expression] to which DataFrame is assigned. If you give the conditional expression the variable ['column name'] == value to which DataFrame is assigned, you can extract only the rows for which the column has that value. Typical conditional expressions are as follows.

Data equal to the specified value: Variable to which DataFrame is assigned ['column name'] == value Data different from the specified value: Variable to which DataFrame is assigned ['column name']! = Value Data larger than the specified value: Variable to which DataFrame is assigned ['column name']> value Data greater than or equal to the specified value: Variable to which DataFrame is assigned ['column name']> = value

For example, when the variable to which DataFrame is assigned is X, to filter the data on the condition that the value of column A is 0, write as follows.

X [X ['Column A'] == 0]

If you want to select only a specific column B of the filtered data, write as follows.

X [X ['Column A'] == 0] ['Column B']

By using this function, you can extract only the column duration (last contact time) when column y (presence / absence of fixed deposit application (1: yes, 0: no)) is '1'.

Import pandas

import pandas as pd

Data reading

df = pd.read_csv('data.csv', index_col='id')

Display data with a y value of 1

print( df[df['y']==1] )

Shows the duration of data with a y value of 1

print( df[df['y']==1]['duration'] )

Histogram drawing

Histogram is a visualization method to check the data distribution of numerical data. By using the histogram, you can check the range of numerical data and the range of the most frequent numerical data.

Then, regarding the column duration (final contact time), two types of histograms divided into the case where the column y (presence / absence of fixed deposit application (1: yes, 0: no)) is 0 and the case where it is 1 are displayed on one graph. Let's draw and compare the distribution. To create a histogram, use seaborn's distplot function and write:

seaborn.distplot (variable assigned Series)

To superimpose two types of data, you can draw by writing the distplot function twice. If you want to add a legend to the graph, specify the label name in the distplot option, and then write the legend function of matplotlib.

seaborn.distplot (variable assigned Series, label = "label name") matplotlib.pyplot.legend()

matplotlib provides many functions to improve the appearance of your graph. For example, to name the x-axis and y-axis, use the xlabel and ylabel functions.

matplotlib.pyplot.xlabel (name of x-axis) matplotlib.pyplot.ylabel (y-axis name)

Also, use the xlim function to specify the x-axis display range. (You can specify the display range for the y-axis with the ylim function as well.)

matplotlib.pyplot.xlim (x-axis lower limit, x-axis upper limit)

Import pandas

import pandas as pd

Data reading

df = pd.read_csv('data.csv', index_col='id')

Import matplotlib.pyplot

import matplotlib.pyplot as plt

seaborn import

import seaborn as sns

Extraction of duration

duration_0 = df[df['y']==0]['duration'] duration_1 = df[df['y']==1]['duration']

Histogram creation

sns.distplot(duration_0, label='y=0') sns.distplot(duration_1, label='y=1')

Add title to graph

plt.title('duration histgram')

Add name to x-axis of graph

plt.xlabel('duration')

Add name to y-axis of graph

plt.ylabel('frequency')

Specifying the x-axis display range

plt.xlim(0, 2000)

Add legend to graph

plt.legend()

Show graph

plt.show()

Generation and processing of features

[PYTHON] SIGNATE Quest ① From data reading to preprocessing

Background using SIGNATE Quest

Basics

Import pandas

Data reading

Basic statistics for numeric data

Basic statistics for string data

Displaying the type of variable df

Display of column y

Display of data type from which only column y is extracted

Display of data with columns age, job, y extracted in this order

Display of data with indexes 2 and 3 columns age, job, y extracted in this order

Display of data with column y removed

Simple tabulation

Display of element type and appearance number of column poutcome

Display of element types and number of occurrences in column y

By using the value_counts function, the elements existing in the column and the number of their occurrences are displayed.

Variable correlation

Correlation coefficient

Visualization

Scatter plots and box plots are often used as methods of expressing correlation.

Scatter plot

Box plot

How to make a box plot

Analysis method

Cross tabulation

Perform crosstabs

Calculation of application rate

Added application rate cvr to cross tabulation results

Extract and display only indexes'success',' failure', and column'cvr' from the crosstabulation table

Heat map display

Library import

Data reading

Calculation of correlation coefficient of quantitative data

Create heatmap

Add title to graph

Show graph

Data filtering

Import pandas

Data reading

Display data with a y value of 1

Shows the duration of data with a y value of 1

Histogram drawing

Import pandas

Data reading

Import matplotlib.pyplot

seaborn import

Extraction of duration

Histogram creation

Add title to graph

Add name to x-axis of graph

Add name to y-axis of graph

Specifying the x-axis display range

Add legend to graph

Show graph

Generation and processing of features