While building a BI environment (Talend, Snowflake, QuickSight) as part of my work, machine learning was a field I wanted to learn but had never managed to get into. When I learned that a beginner-limited competition would be held at SIGNATE, I decided to take the opportunity to challenge myself. I didn't know where to start, but part of the tutorial "Improving the efficiency of telemarketing at a financial institution" was available for free, so I am writing this article as my own learning memo of what I studied there.
import pandas as pd
df = pd.read_csv('data.csv', index_col='id')
print(df.describe())
print(df.describe(include=['O']))
print(type(df))
print(df['y'])
print(type(df['y']))
print(df[['age', 'job', 'y']])
print(df.loc[[2, 3], ['age', 'job', 'y']])
print(df.drop('y', axis=1))
print(df['poutcome'].value_counts())
print(df['y'].value_counts())
The corr function is one of the functions in the pandas library; it computes the correlation coefficients between the numeric columns and returns them as a matrix: print(df.corr())
corr is used when checking the correlation between quantitative variables.
Cross tabulation (crosstab) is used when checking the relationship between quantitative and qualitative data.
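To make the two cases concrete, here is a minimal, runnable sketch. The small inline DataFrame is a hypothetical stand-in for data.csv (its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for data.csv, just for illustration
df = pd.DataFrame({
    'age':      [30, 45, 22, 60, 35, 50],
    'duration': [100, 300, 80, 500, 250, 400],
    'poutcome': ['success', 'failure', 'unknown', 'success', 'failure', 'success'],
    'y':        [1, 1, 0, 1, 0, 1],
})

# Quantitative x quantitative: correlation matrix of the numeric columns
corr_matrix = df[['age', 'duration', 'y']].corr()
print(corr_matrix)

# Qualitative x qualitative: cross tabulation of poutcome against y
cross = pd.crosstab(df['poutcome'], df['y'])
print(cross)
```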
Let's form hypotheses, by looking at the basic statistics and graphs, about which variables affect whether or not a fixed deposit is applied for. Instead of examining the data blindly, you can proceed with the analysis efficiently by making a hypothesis and then verifying it. For example, the following can be considered.
Hypothesis 1: People who applied during the previous campaign may be more likely to repeat (provided they were satisfied with the product).
Hypothesis 2: Since a time deposit is a product that cannot be withdrawn freely, people with a large amount of surplus funds may find it easier to apply.
Hypothesis 3: People who were in contact with sales for a long time may be more likely to apply (depending on the skill of the sales staff).
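As an illustration of how Hypothesis 2 could be checked, the sketch below compares the average account balance between applicants and non-applicants. The column name balance and the inline values are assumptions for illustration, not taken from the released exercise:

```python
import pandas as pd

# Toy stand-in for data.csv; the 'balance' column is an assumption here
df = pd.DataFrame({
    'balance': [100, 5000, 200, 8000, 300, 6000],
    'y':       [0, 1, 0, 1, 0, 1],
})

# Hypothesis 2: do applicants (y=1) hold more surplus funds on average?
mean_balance = df.groupby('y')['balance'].mean()
print(mean_balance)
```

If the mean balance of the y=1 group is clearly larger, that is (weak) evidence in favor of the hypothesis; on the real data you would also want to look at the full distributions, not just the means.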
To verify Hypothesis 1: (1) using the variable cross, to which the cross tabulation result created in the previous exercise is assigned, calculate the application rate for each element of the poutcome column and assign it to the variable rate.
import pandas as pd
df = pd.read_csv('data.csv', index_col='id')
cross = pd.crosstab(df['poutcome'], df['y'], margins=True)
rate = cross[1] / cross["All"]
cross["cvr"] = rate
print(cross)
print(cross.loc[["success", "failure"], "cvr"])
Even when the correlations are laid out as a matrix, it is difficult to see which pairs are highly correlated in a list of raw numbers, so visualize the matrix and check it instead.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data.csv', index_col='id')
corr_matrix = df.corr(numeric_only=True)  # numeric_only=True skips non-numeric columns (required on pandas >= 2.0)
sns.heatmap(corr_matrix, cmap="Reds")
plt.title('Correlation')
plt.show()
In pandas, you can perform the same processing as Excel's filter function by writing df[conditional expression], where df is the variable to which the DataFrame is assigned. If the conditional expression is df['column name'] == value, only the rows in which that column equals the value are extracted. Typical conditional expressions are as follows.
Data equal to the specified value: df['column name'] == value
Data different from the specified value: df['column name'] != value
Data larger than the specified value: df['column name'] > value
Data greater than or equal to the specified value: df['column name'] >= value
For example, when the variable to which DataFrame is assigned is X, to filter the data on the condition that the value of column A is 0, write as follows.
X[X['Column A'] == 0]
If you want to select only a specific column B of the filtered data, write as follows.
X[X['Column A'] == 0]['Column B']
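The same pattern works with every comparison operator listed above. A small sketch with a hypothetical DataFrame X:

```python
import pandas as pd

# Hypothetical DataFrame for demonstrating each filter
X = pd.DataFrame({'A': [0, 1, 2, 3], 'B': ['w', 'x', 'y', 'z']})

eq = X[X['A'] == 0]        # rows where column A equals 0
ne = X[X['A'] != 0]        # rows where column A differs from 0
gt = X[X['A'] > 1]         # rows where column A is greater than 1
ge = X[X['A'] >= 1]['B']   # column B of rows where A is at least 1
print(len(eq), len(ne), len(gt), list(ge))
```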
By using this, you can extract the column duration (last contact time) only for the rows where column y (whether a fixed deposit was applied for; 1: yes, 0: no) is 1.
import pandas as pd
df = pd.read_csv('data.csv', index_col='id')
print( df[df['y']==1] )
print( df[df['y']==1]['duration'] )
A histogram is a visualization method for checking the distribution of numerical data. With a histogram, you can see the range the values fall in and the range where values occur most frequently.
Next, for the column duration (last contact time), let's draw two histograms on a single graph, one for the rows where column y (whether a fixed deposit was applied for; 1: yes, 0: no) is 0 and one for the rows where it is 1, and compare the distributions. To create a histogram, use seaborn's distplot function and write:
seaborn.distplot(variable to which the Series is assigned)
To superimpose two sets of data, simply call the distplot function twice. To add a legend to the graph, specify the label name in the distplot option and then call matplotlib's legend function.
seaborn.distplot(variable to which the Series is assigned, label="label name")
matplotlib.pyplot.legend()
matplotlib provides many functions for improving the appearance of a graph. For example, to name the x-axis and y-axis, use the xlabel and ylabel functions.
matplotlib.pyplot.xlabel(name of the x-axis)
matplotlib.pyplot.ylabel(name of the y-axis)
Also, use the xlim function to specify the display range of the x-axis. (You can likewise specify the display range of the y-axis with the ylim function.)
matplotlib.pyplot.xlim(x-axis lower limit, x-axis upper limit)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('data.csv', index_col='id')
duration_0 = df[df['y']==0]['duration']
duration_1 = df[df['y']==1]['duration']
sns.distplot(duration_0, label='y=0')
sns.distplot(duration_1, label='y=1')
plt.title('duration histogram')
plt.xlabel('duration')
plt.ylabel('frequency')
plt.xlim(0, 2000)
plt.legend()
plt.show()