This post updates our series on past Kaggle competitions. Here we pick up the data introduction for Predicting Red Hat Business Value and the prominent discussions in the forum. The winners' code is summarized, together with a summary and discussion, in Kaggle Summary: Red Hat (Part 2). (Currently under construction)
This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook (adjust %matplotlib inline as appropriate for your environment). If you find any errors when running the sample scripts, a comment would be appreciated.
Like many companies, Red Hat can collect customer information in chronological order. Red Hat is researching how to predict, from customer behavior, which individuals to approach and how to approach them. In this competition, Kagglers are asked to build a classification algorithm that predicts potential business value for Red Hat from customer characteristics and activities.
The notable points of this competition are as follows.
The evaluation metric is the area under the ROC curve (AUC). ([Japanese Wikipedia](https://ja.wikipedia.org/wiki/%E5%8F%97%E4%BF%A1%E8%80%85%E6%93%8D%E4%BD%9C%E7%89%B9%E6%80%A7))
ROC/AUC is one of the most standard metrics for evaluating binary classification problems. Detailed explanations are available on many sites, so search for "ROC" and "F-measure" if you want more background.
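As a quick illustration (not part of the competition data), AUC can be computed with scikit-learn's roc_auc_score on made-up labels and scores:
from sklearn.metrics import roc_auc_score

# Toy example: true binary labels and predicted probabilities (made-up values)
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(roc_auc_score(y_true, y_score))  # area under the ROC curve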
The submission file is a CSV that maps each activity id to the predicted probability of the outcome:
activity_id,outcome
act1_1,0
act1_100006,0
act1_100050,0
etc.
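For reference, a minimal sketch of writing a submission file in this format with pandas (the activity ids are taken from the sample above and the probabilities are placeholders):
import pandas as pd

# Placeholder predictions; in practice these come from the trained model
submission = pd.DataFrame({
    'activity_id': ['act1_1', 'act1_100006', 'act1_100050'],
    'outcome': [0.0, 0.0, 0.0],
}, columns=['activity_id', 'outcome'])
submission.to_csv('submission.csv', index=False)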
The data consists of two kinds of files (a people file and activity files: act_train, act_test). The people file contains personal information keyed by people_id. The activity files contain each customer's activity history, keyed by people_id, together with the result (outcome) of each activity.
The people file holds customer attributes (char_*). All of these features except char_38 are anonymized categorical variables; char_38 is a continuous value, not a categorical one.
The outcome column of the activity file indicates whether the customer achieved a particular goal within a period of time. The activity file also has an activity_category column, which tells which set of characteristics (char_*) each row carries. For example, type 1 activities and type 2-7 activities are described by different sets of characteristics.
The goal of this competition is to merge these two files on people_id and predict which customers will generate business value.
act_test.csv
activity_id date activity_category char_1 char_2 char_3 char_4 char_5 char_6 char_7 char_8 char_9 char_10
people_id
ppl_100004 act1_249281 2022-07-20 type 1 type 5 type 10 type 5 type 1 type 6 type 1 type 1 type 7 type 4 NaN
ppl_100004 act2_230855 2022-07-20 type 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN type 682
ppl_10001 act1_240724 2022-10-14 type 1 type 12 type 1 type 5 type 4 type 6 type 1 type 1 type 13 type 10 NaN
people.csv
char_1 group_1 char_2 date char_3 char_4 char_5 char_6 char_7 char_8 ... char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36 char_37 char_38
people_id
ppl_100 type 2 group 17304 type 2 2021-06-29 type 5 type 5 type 5 type 3 type 11 type 2 ... False True True False False True True True False 36
ppl_100002 type 2 group 8688 type 3 2021-01-06 type 28 type 9 type 5 type 3 type 11 type 2 ... False True True True True True True True False 76
ppl_100003 type 2 group 33592 type 3 2022-06-10 type 4 type 8 type 5 type 2 type 5 type 2 ... False False True True True True False True True 99
4.1. Exploration of the date features
First, import the libraries and the data.
import_data.py
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('../input/act_train.csv', parse_dates=['date'])
test = pd.read_csv('../input/act_test.csv', parse_dates=['date'])
ppl = pd.read_csv('../input/people.csv', parse_dates=['date'])
df_train = pd.merge(train, ppl, on='people_id')
df_test = pd.merge(test, ppl, on='people_id')
del train, test, ppl
Let's look at the contents of the data.
show_day.py
for d in ['date_x', 'date_y']:
    print('Start of ' + d + ': ' + str(df_train[d].min().date()))
    print('  End of ' + d + ': ' + str(df_train[d].max().date()))
    print('Range of ' + d + ': ' + str(df_train[d].max() - df_train[d].min()) + '\n')
The execution result is as follows.
Start of date_x: 2022-07-17
End of date_x: 2023-08-31
Range of date_x: 410 days 00:00:00
Start of date_y: 2020-05-18
End of date_y: 2023-08-31
Range of date_y: 1200 days 00:00:00
You can see that the data spans several years. These dates are actually anonymized, but here we will treat them as if they really cover several years. date_x spans a little over one year, while date_y spans more than three years. The end date is the same for both date_x and date_y.
After grouping by date, visualize the probability of outcome.
feature_structure.py
date_x = pd.DataFrame()
date_x['Class probability'] = df_train.groupby('date_x')['outcome'].mean()
date_x['Frequency'] = df_train.groupby('date_x')['outcome'].size()
date_x.plot(secondary_y='Frequency', figsize=(20, 10))
Looking at the graph, we can see that there are fewer activities on weekends and that the probability of the outcome being 1 also drops there. The average outcome is stable on weekdays but falls to roughly 0.3-0.4 on weekends; a quick numeric check of this is sketched below. After that, let's look at date_y as well.
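As a rough check of the weekday/weekend effect (a sketch assuming the df_train built above), aggregate outcome by day of week:
# 0 = Monday, ..., 5/6 = Saturday/Sunday
dow = df_train['date_x'].dt.dayofweek
print(df_train.groupby(dow)['outcome'].mean())   # mean outcome per day of week
print(df_train.groupby(dow)['outcome'].size())   # number of activities per day of week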
show_day_y.py
date_y = pd.DataFrame()
date_y['Class probability'] = df_train.groupby('date_y')['outcome'].mean()
date_y['Frequency'] = df_train.groupby('date_y')['outcome'].size()
# We need to split it into multiple graphs since the time-scale is too long to show well on one graph
i = int(len(date_y) / 3)
date_y[:i].plot(secondary_y='Frequency', figsize=(20, 5), title='date_y Year 1')
date_y[i:2*i].plot(secondary_y='Frequency', figsize=(20, 5), title='date_y Year 2')
date_y[2*i:].plot(secondary_y='Frequency', figsize=(20, 5), title='date_y Year 3')
Here is the result.
As with date_x, you can see the difference between weekdays and weekends.
4.1.2. Test set
In the analysis so far, we looked at the relationship between outcome and date. Let's check whether the same relationship can be seen in the test data. Of course, the outcome of the test data is not available, so we can only compare the distribution of the samples.
show_test.py
date_x_freq = pd.DataFrame()
date_x_freq['Training set'] = df_train.groupby('date_x')['activity_id'].count()
date_x_freq['Testing set'] = df_test.groupby('date_x')['activity_id'].count()
date_x_freq.plot(secondary_y='Testing set', figsize=(20, 8),
                 title='Comparison of date_x distribution between training/testing set')
date_y_freq = pd.DataFrame()
date_y_freq['Training set'] = df_train.groupby('date_y')['activity_id'].count()
date_y_freq['Testing set'] = df_test.groupby('date_y')['activity_id'].count()
date_y_freq[:i].plot(secondary_y='Testing set', figsize=(20, 8),
                     title='Comparison of date_y distribution between training/testing set (first year)')
date_y_freq[2*i:].plot(secondary_y='Testing set', figsize=(20, 8),
                       title='Comparison of date_y distribution between training/testing set (last year)')
The result is as follows.
Check the similarity between the train and test data using the correlation coefficient.
correlation.py
print('Correlation of date_x distribution in training/testing sets: ' + str(np.corrcoef(date_x_freq.T)[0,1]))
print('Correlation of date_y distribution in training/testing sets: ' + str(np.corrcoef(date_y_freq.fillna(0).T)[0,1]))
Correlation of date_x distribution in training/testing sets: 0.853430807691
Correlation of date_y distribution in training/testing sets: 0.709589035055
For date_x, the training and testing data show a similar structure. This suggests that the train/test split was made by people, not by time or other factors. The same characteristic pattern around September and October can be seen in both sets.
You can see that the correlation is lower for date_y. The test data contains many spikes in the first year, and the degree to which the spikes match seems to change from year to year. Let's look at the correlation by year.
correlation_date_y.py
print('date_y correlation in year 1: ' + str(np.corrcoef(date_y_freq[:i].fillna(0).T)[0,1]))
print('date_y correlation in year 2: ' + str(np.corrcoef(date_y_freq[i:2*i].fillna(0).T)[0,1]))
print('date_y correlation in year 3: ' + str(np.corrcoef(date_y_freq[2*i:].fillna(0).T)[0,1]))
date_y correlation in year 1: 0.237056344324
date_y correlation in year 2: 0.682344221229
date_y correlation in year 3: 0.807207224857
You can see that the correlation in the third year is the highest.
4.1.3. Probability features
Let's generate the per-date outcome probabilities as features.
probability_features.py
from sklearn.metrics import roc_auc_score
features = pd.DataFrame()
features['date_x_prob'] = df_train.groupby('date_x')['outcome'].transform('mean')
features['date_y_prob'] = df_train.groupby('date_y')['outcome'].transform('mean')
features['date_x_count'] = df_train.groupby('date_x')['outcome'].transform('count')
features['date_y_count'] = df_train.groupby('date_y')['outcome'].transform('count')
_=[print(f.ljust(12) + ' AUC: ' + str(round(roc_auc_score(df_train['outcome'], features[f]), 6))) for f in features.columns]
date_x_prob AUC: 0.626182
date_y_prob AUC: 0.720296
date_x_count AUC: 0.465697
date_y_count AUC: 0.475916
4.2. Group_1 date trick
In this competition, a [magic feature published in the kernels](https://www.kaggle.com/ijkilchenko/predicting-red-hat-business-value/python-ver-of-group-1-and-date-trick/code) was used to achieve an ROC AUC of over 0.90. Here we walk through the kernel that explains this magic feature.
First, import the library.
import pandas as pd
import numpy as np
import datetime
from itertools import product
from scipy import interpolate ## For other interpolation functions.
Next, read the data and encode the boolean columns as 0/1. Also convert the date column to datetime type.
# Load and transform people data.
ppl = pd.read_csv('../input/people.csv')
# Convert booleans to integers.
p_logi = ppl.select_dtypes(include=['bool']).columns
ppl[p_logi] = ppl[p_logi].astype('int')
del p_logi
# Transform date.
ppl['date'] = pd.to_datetime(ppl['date'])
Do the same for the activity files. Fill the outcome column of the test set with NaN and concatenate train and test.
# Load activities.
# Read and combine.
activs = pd.read_csv('../input/act_train.csv')
TestActivs = pd.read_csv('../input/act_test.csv')
TestActivs['outcome'] = np.nan ## Add the missing column to the test set.
activs = pd.concat([activs, TestActivs], axis=0) ## Append train and test sets.
del TestActivs
The activity files contain many variables, but we will use only people_id, outcome, activity_id, and date. The extracted activs are then merged with the people data (ppl) on people_id.
# Extract only required variables.
activs = activs[['people_id', 'outcome', 'activity_id', 'date']] ## Let's look at these columns only.
# Merge people data into activities.
## This keeps all the rows from activities.
d1 = pd.merge(activs, ppl, on='people_id', how='right')
## These are the indices of the rows from the test set.
testset = ppl[ppl['people_id'].isin(d1[d1['outcome'].isnull()]['people_id'])].index
d1['activdate'] = pd.to_datetime(d1['date_x'])
del activs
First, let's visualize the data. The original write-up does not include sample scripts for this part, but the analysis itself is very simple. The aim here is threefold: the distribution of char_38, the outcome bias of each customer, and how well char_38 alone predicts the outcome.
Let's look at char_38, the only continuous feature provided. Below is the distribution of char_38 in the training data, split by outcome.
It has a fairly characteristic distribution. Next, compare the distribution of char_38 between train and test.
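The original write-up does not include a script for these plots; a rough sketch using the df_train / df_test built in section 4.1 (char_38 comes from the merged people data) might look like this:
import matplotlib.pyplot as plt

# char_38 in train, split by outcome
plt.hist(df_train.loc[df_train['outcome'] == 0, 'char_38'].dropna(), bins=50, alpha=0.5, label='outcome 0')
plt.hist(df_train.loc[df_train['outcome'] == 1, 'char_38'].dropna(), bins=50, alpha=0.5, label='outcome 1')
plt.legend(); plt.title('char_38 by outcome'); plt.show()

# char_38 in train vs test (the sets have different sizes, so compare shapes rather than counts)
plt.hist(df_train['char_38'].dropna(), bins=50, alpha=0.5, label='train')
plt.hist(df_test['char_38'].dropna(), bins=50, alpha=0.5, label='test')
plt.legend(); plt.title('char_38: train vs test'); plt.show()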
You can see that they have almost the same distribution. Next, let's look at the relationship between the people data and outcome. We plot, in a bar graph, the customers whose outcomes are all 0, all 1, and those who have both 0s and 1s.
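No script is given here either; a sketch of the same idea using df_train is to take the mean outcome per person and bucket people into all-0, all-1, and mixed:
per_person = df_train.groupby('people_id')['outcome'].mean()
counts = pd.Series({
    'all 0': (per_person == 0).sum(),
    'all 1': (per_person == 1).sum(),
    'mixed': ((per_person > 0) & (per_person < 1)).sum(),
})
print(counts)
counts.plot(kind='bar', title='Customers grouped by outcome pattern')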
You can see that almost all customers are biased toward either 0 or 1. Finally, we visualize the relationship between char_38 and ROC. Here we compare the prediction result when using only the customers whose outcomes are all 0 or all 1 with the result for customers that include both. dmi3kno's write-up does not specify which algorithm was used, but it is probably XGBoost.
From these results, we can see that the outcome of almost all customers can be predicted with fairly high accuracy using char_38 alone. On the other hand, the estimates are weak for customers whose outcome changed along the way.
First, let's look, in chronological order, at the status of six customers whose outcome changed partway through the observation period.
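A possible sketch of this plot, using the merged d1 from above: it simply picks six people whose outcome is not constant and plots outcome against activity date.
import matplotlib.pyplot as plt

d1_train = d1[d1['outcome'].notnull()]                       # training rows only
per_person = d1_train.groupby('people_id')['outcome'].mean()
changed = per_person[(per_person > 0) & (per_person < 1)].index[:6]  # six people with mixed outcomes

fig, axes = plt.subplots(6, 1, figsize=(12, 12), sharex=True)
for ax, pid in zip(axes, changed):
    sub = d1_train[d1_train['people_id'] == pid].sort_values('activdate')
    ax.plot(sub['activdate'], sub['outcome'], marker='.')
    ax.set_ylabel(pid)
plt.show()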
You can see that most of these customers change their outcome only once during the observation period. The problem thus becomes a time-series analysis of predicting when this outcome changes. Next, let's look at when a particular cluster of customers changed its outcome, using a variable with few missing values: group_1. We draw the same kind of graph as before for six randomly selected group_1 values.
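The same kind of sketch works per group by replacing people_id with group_1 (again assuming d1_train, numpy, and matplotlib from the previous sketch):
per_group = d1_train.groupby('group_1')['outcome'].mean()
mixed_groups = per_group[(per_group > 0) & (per_group < 1)].index
sample = np.random.choice(mixed_groups.values, 6, replace=False)   # six random ambivalent groups

fig, axes = plt.subplots(6, 1, figsize=(12, 12), sharex=True)
for ax, g in zip(axes, sample):
    sub = d1_train[d1_train['group_1'] == g].sort_values('activdate')
    ax.plot(sub['activdate'], sub['outcome'], marker='.')
    ax.set_ylabel(g)
plt.show()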
You can see that the changes look exactly the same as when graphing by people_id. In other words, the goal of this competition comes down to predicting the change point for each group_1.
These "intermediate elements" need to be brought to their respective groups (0 or 1).
The analysis so far raises some questions.
Looking into these questions gives the following.
Here, ambivalent means a group containing both 0s and 1s, uniform means a group containing only 0s or only 1s, and unknown means a group that appears only in the test set. Next, let's see how many activities each type of group contains. For example, if you predicted these at random (without any clear signal), you could estimate the best achievable score in terms of entropy. Using XGBoost and char_38 gives a clue for these groups.
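This is not the original script, but one way to count the groups and activities in each of these three categories from d1 might look like this:
# Classify each group_1 by the outcomes seen in its training rows
train_rows = d1[d1['outcome'].notnull()]
n_unique = train_rows.groupby('group_1')['outcome'].nunique()    # 1 = uniform, 2 = ambivalent

category = pd.Series('unknown', index=d1['group_1'].unique())    # default: test-only groups
category[n_unique[n_unique == 1].index] = 'uniform'
category[n_unique[n_unique == 2].index] = 'ambivalent'

print(category.value_counts())                      # number of groups per category
print(d1['group_1'].map(category).value_counts())   # number of activities per category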
Next, for the ambivalent groups, let's look at how many times the outcome changed.
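One way to count these switches is sketched below (sorting each group's activities by date and counting changes of outcome between consecutive training rows; a sketch under these assumptions, not the original script):
def n_changes(s):
    # s: outcomes of one group in date order; NaN rows (test set) are dropped
    vals = s.dropna().values
    return int((vals[1:] != vals[:-1]).sum())

changes = (d1.sort_values('activdate')
             .groupby('group_1')['outcome']
             .apply(n_changes))
print(changes.value_counts())   # how many groups changed 0, 1, 2, ... times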
bouncing refers to groups in which changes occur in both directions, from 0 to 1 and from 1 to 0. Among these bouncing groups, we will look at some that changed more than once.
##
## 0 1 2 3
## 25646 3687 565 1
Finally, we visualize some of these groups whose outcomes have changed multiple times.