This post updates our series on past Kaggle competitions. Here we pick up the data introduction for Predicting Red Hat Business Value and the prominent discussions in the forum. The winners' code is summarized, together with a summary and discussion, in Kaggle Summary: Red Hat (Part 2). (Currently under construction)
This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work in Jupyter Notebook (adjust %matplotlib inline as appropriate for your environment). If you find any errors when running the sample scripts, a comment would be appreciated.
Like many companies, Red Hat can collect customer information in chronological order. Red Hat is researching how to predict, from customer behavior, which individuals to approach and how to approach them. In this competition, Kagglers are asked to build a classification algorithm that predicts potential business value for Red Hat from customer characteristics and activities.
The notable points of this competition are as follows.
The evaluation metric is the area under the ROC curve (AUC). ([Japanese Wikipedia](https://ja.wikipedia.org/wiki/%E5%8F%97%E4%BF%A1%E8%80%85%E6%93%8D%E4%BD%9C%E7%89%B9%E6%80%A7))
ROC/AUC is one of the most standard metrics for evaluating binary classification problems. Detailed explanations are available on many sites, so search for "ROC" and "F-measure" if you want more background.
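As a quick illustration (not part of the competition data), AUC can be computed with scikit-learn's roc_auc_score on made-up labels and scores:
from sklearn.metrics import roc_auc_score

# Toy example: true binary labels and predicted probabilities (made-up values)
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
print(roc_auc_score(y_true, y_score))  # area under the ROC curve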
The submission file is a CSV that maps each activity id to the predicted probability of the outcome:
activity_id,outcome
act1_1,0
act1_100006,0
act1_100050,0
etc.
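For reference, a minimal sketch of writing a submission file in this format with pandas (the activity ids are taken from the sample above and the probabilities are placeholders):
import pandas as pd

# Placeholder predictions; in practice these come from the trained model
submission = pd.DataFrame({
    'activity_id': ['act1_1', 'act1_100006', 'act1_100050'],
    'outcome': [0.0, 0.0, 0.0],
}, columns=['activity_id', 'outcome'])
submission.to_csv('submission.csv', index=False)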
The data consists of two kinds of files (a people file and activity files: act_train, act_test). The people file contains personal information keyed by people_id. The activity files contain each customer's activity history, keyed by people_id, together with the result (outcome) of each activity.
The people file holds customer attributes (char_*). All of these features except char_38 are anonymized categorical variables; char_38 is a continuous value, not a categorical one.
The outcome column of the activity file indicates whether the customer achieved a particular goal within a period of time. The activity file also has an activity_category column, which tells which set of characteristics (char_*) each row carries. For example, type 1 activities and type 2-7 activities are described by different sets of characteristics.
The goal of this competition is to merge these two files on people_id and predict which customers will generate business value.
act_test.csv
activity_id date activity_category char_1 char_2 char_3 char_4 char_5 char_6 char_7 char_8 char_9 char_10
people_id
ppl_100004 act1_249281 2022-07-20 type 1 type 5 type 10 type 5 type 1 type 6 type 1 type 1 type 7 type 4 NaN
ppl_100004 act2_230855 2022-07-20 type 5 NaN NaN NaN NaN NaN NaN NaN NaN NaN type 682
ppl_10001 act1_240724 2022-10-14 type 1 type 12 type 1 type 5 type 4 type 6 type 1 type 1 type 13 type 10 NaN
people.csv
char_1 group_1 char_2 date char_3 char_4 char_5 char_6 char_7 char_8 ... char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36 char_37 char_38
people_id
ppl_100 type 2 group 17304 type 2 2021-06-29 type 5 type 5 type 5 type 3 type 11 type 2 ... False True True False False True True True False 36
ppl_100002 type 2 group 8688 type 3 2021-01-06 type 28 type 9 type 5 type 3 type 11 type 2 ... False True True True True True True True False 76
ppl_100003 type 2 group 33592 type 3 2022-06-10 type 4 type 8 type 5 type 2 type 5 type 2 ... False False True True True True False True True 99
4.1. Exploration of the date features
First, import the libraries and the data.
import_data.py
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline
train = pd.read_csv('../input/act_train.csv', parse_dates=['date'])
test = pd.read_csv('../input/act_test.csv', parse_dates=['date'])
ppl = pd.read_csv('../input/people.csv', parse_dates=['date'])
df_train = pd.merge(train, ppl, on='people_id')
df_test = pd.merge(test, ppl, on='people_id')
del train, test, ppl
Let's look at the contents of the data.
show_day.py
for d in ['date_x', 'date_y']:
    print('Start of ' + d + ': ' + str(df_train[d].min().date()))
    print('  End of ' + d + ': ' + str(df_train[d].max().date()))
    print('Range of ' + d + ': ' + str(df_train[d].max() - df_train[d].min()) + '\n')
The execution result is as follows.
Start of date_x: 2022-07-17
End of date_x: 2023-08-31
Range of date_x: 410 days 00:00:00
Start of date_y: 2020-05-18
End of date_y: 2023-08-31
Range of date_y: 1200 days 00:00:00
You can see that the data spans several years. These dates are actually anonymized, but here we will treat them as if they really cover several years. date_x spans a little over one year, while date_y spans more than three years. The end date is the same for both date_x and date_y.
After grouping by date, visualize the probability of outcome.
feature_structure.py
date_x = pd.DataFrame()
date_x['Class probability'] = df_train.groupby('date_x')['outcome'].mean()
date_x['Frequency'] = df_train.groupby('date_x')['outcome'].size()
date_x.plot(secondary_y='Frequency', figsize=(20, 10))
Looking at the graph, we can see that there are fewer activities on weekends and that the probability of the outcome being 1 also drops there. The average outcome is stable on weekdays but falls to roughly 0.3-0.4 on weekends; a quick numeric check of this is sketched below. After that, let's look at date_y as well.
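As a rough check of the weekday/weekend effect (a sketch assuming the df_train built above), aggregate outcome by day of week:
# 0 = Monday, ..., 5/6 = Saturday/Sunday
dow = df_train['date_x'].dt.dayofweek
print(df_train.groupby(dow)['outcome'].mean())   # mean outcome per day of week
print(df_train.groupby(dow)['outcome'].size())   # number of activities per day of week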
show_day_y.py
date_y = pd.DataFrame()
date_y['Class probability'] = df_train.groupby('date_y')['outcome'].mean()
date_y['Frequency'] = df_train.groupby('date_y')['outcome'].size()
# We need to split it into multiple graphs since the time-scale is too long to show well on one graph
i = int(len(date_y) / 3)
date_y[:i].plot(secondary_y='Frequency', figsize=(20, 5), title='date_y Year 1')
date_y[i:2*i].plot(secondary_y='Frequency', figsize=(20, 5), title='date_y Year 2')
date_y[2*i:].plot(secondary_y='Frequency', figsize=(20, 5), title='date_y Year 3')
Here is the result.
As with date_x, you can see the difference between weekdays and weekends.
4.1.2. Test set
In the analysis so far, we looked at the relationship between outcome and date. Let's check whether the same relationship can be seen in the test data. Of course, the outcome of the test data is not available, so we can only compare the distribution of the samples.
show_test.py
date_x_freq = pd.DataFrame()
date_x_freq['Training set'] = df_train.groupby('date_x')['activity_id'].count()
date_x_freq['Testing set'] = df_test.groupby('date_x')['activity_id'].count()
date_x_freq.plot(secondary_y='Testing set', figsize=(20, 8),
                 title='Comparison of date_x distribution between training/testing set')
date_y_freq = pd.DataFrame()
date_y_freq['Training set'] = df_train.groupby('date_y')['activity_id'].count()
date_y_freq['Testing set'] = df_test.groupby('date_y')['activity_id'].count()
date_y_freq[:i].plot(secondary_y='Testing set', figsize=(20, 8),
                     title='Comparison of date_y distribution between training/testing set (first year)')
date_y_freq[2*i:].plot(secondary_y='Testing set', figsize=(20, 8),
                       title='Comparison of date_y distribution between training/testing set (last year)')
The result is as follows.
Check the similarity between the train and test data using the correlation coefficient.
correlation.py
print('Correlation of date_x distribution in training/testing sets: ' + str(np.corrcoef(date_x_freq.T)[0,1]))
print('Correlation of date_y distribution in training/testing sets: ' + str(np.corrcoef(date_y_freq.fillna(0).T)[0,1]))
Correlation of date_x distribution in training/testing sets: 0.853430807691
Correlation of date_y distribution in training/testing sets: 0.709589035055
For date_x, the training and testing data show a similar structure. This suggests that the train/test split was made by people, not by time or other factors. The same characteristic pattern around September and October can be seen in both sets.
You can see that the correlation is lower for date_y. The test data contains many spikes in the first year, and the degree to which the spikes match seems to change from year to year. Let's look at the correlation by year.
correlation_date_y.py
print('date_y correlation in year 1: ' + str(np.corrcoef(date_y_freq[:i].fillna(0).T)[0,1]))
print('date_y correlation in year 2: ' + str(np.corrcoef(date_y_freq[i:2*i].fillna(0).T)[0,1]))
print('date_y correlation in year 3: ' + str(np.corrcoef(date_y_freq[2*i:].fillna(0).T)[0,1]))
date_y correlation in year 1: 0.237056344324
date_y correlation in year 2: 0.682344221229
date_y correlation in year 3: 0.807207224857
You can see that the correlation in the third year is the highest.
4.1.3. Probability features
Let's generate the per-date outcome probabilities as features.
probability_features.py
from sklearn.metrics import roc_auc_score
features = pd.DataFrame()
features['date_x_prob'] = df_train.groupby('date_x')['outcome'].transform('mean')
features['date_y_prob'] = df_train.groupby('date_y')['outcome'].transform('mean')
features['date_x_count'] = df_train.groupby('date_x')['outcome'].transform('count')
features['date_y_count'] = df_train.groupby('date_y')['outcome'].transform('count')
_=[print(f.ljust(12) + ' AUC: ' + str(round(roc_auc_score(df_train['outcome'], features[f]), 6))) for f in features.columns]
date_x_prob AUC: 0.626182
date_y_prob AUC: 0.720296
date_x_count AUC: 0.465697
date_y_count AUC: 0.475916
4.2. Group_1 date trick
In this competition, a [magic feature published in the kernels](https://www.kaggle.com/ijkilchenko/predicting-red-hat-business-value/python-ver-of-group-1-and-date-trick/code) was used to achieve an ROC AUC of over 0.90. Here we walk through the kernel that explains this magic feature.
First, import the library.
import pandas as pd
import numpy as np
import datetime
from itertools import product
from scipy import interpolate ## For other interpolation functions.
Next, read the data and encode the boolean columns as 0/1. Also convert the date column to datetime type.
# Load and transform people data.
ppl = pd.read_csv('../input/people.csv')
# Convert booleans to integers.
p_logi = ppl.select_dtypes(include=['bool']).columns
ppl[p_logi] = ppl[p_logi].astype('int')
del p_logi
# Transform date.
ppl['date'] = pd.to_datetime(ppl['date'])
Do the same for the activity files. Fill the outcome column of the test set with NaN and concatenate train and test.
# Load activities.
# Read and combine.
activs = pd.read_csv('../input/act_train.csv')
TestActivs = pd.read_csv('../input/act_test.csv')
TestActivs['outcome'] = np.nan ## Add the missing column to the test set.
activs = pd.concat([activs, TestActivs], axis=0) ## Append train and test sets.
del TestActivs
The activity files contain many variables, but we will use only people_id, outcome, activity_id, and date. The extracted activs are then merged with the people data (ppl) on people_id.
# Extract only required variables.
activs = activs[['people_id', 'outcome', 'activity_id', 'date']] ## Let's look at these columns only.
# Merge people data into activities.
## This keeps all the rows from activities.
d1 = pd.merge(activs, ppl, on='people_id', how='right')
## These are the indices of the rows from the test set.
testset = ppl[ppl['people_id'].isin(d1[d1['outcome'].isnull()]['people_id'])].index
d1['activdate'] = pd.to_datetime(d1['date_x'])
del activs
First, let's visualize the data. The original write-up does not include sample scripts for this part, but the analysis itself is very simple. The aim here is threefold: the distribution of char_38, the outcome bias of each customer, and how well char_38 alone predicts the outcome.
Let's look at char_38, the only continuous feature provided. Below is the distribution of char_38 in the training data, split by outcome.
It has a fairly characteristic distribution. Next, compare the distribution of char_38 between train and test.
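The original write-up does not include a script for these plots; a rough sketch using the df_train / df_test built in section 4.1 (char_38 comes from the merged people data) might look like this:
import matplotlib.pyplot as plt

# char_38 in train, split by outcome
plt.hist(df_train.loc[df_train['outcome'] == 0, 'char_38'].dropna(), bins=50, alpha=0.5, label='outcome 0')
plt.hist(df_train.loc[df_train['outcome'] == 1, 'char_38'].dropna(), bins=50, alpha=0.5, label='outcome 1')
plt.legend(); plt.title('char_38 by outcome'); plt.show()

# char_38 in train vs test (the sets have different sizes, so compare shapes rather than counts)
plt.hist(df_train['char_38'].dropna(), bins=50, alpha=0.5, label='train')
plt.hist(df_test['char_38'].dropna(), bins=50, alpha=0.5, label='test')
plt.legend(); plt.title('char_38: train vs test'); plt.show()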
You can see that they have almost the same distribution. Next, let's look at the relationship between the people data and outcome. We plot, in a bar graph, the customers whose outcomes are all 0, all 1, and those who have both 0s and 1s.
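No script is given here either; a sketch of the same idea using df_train is to take the mean outcome per person and bucket people into all-0, all-1, and mixed:
per_person = df_train.groupby('people_id')['outcome'].mean()
counts = pd.Series({
    'all 0': (per_person == 0).sum(),
    'all 1': (per_person == 1).sum(),
    'mixed': ((per_person > 0) & (per_person < 1)).sum(),
})
print(counts)
counts.plot(kind='bar', title='Customers grouped by outcome pattern')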
You can see that almost all customers are biased toward either 0 or 1. Finally, we visualize the relationship between char_38 and ROC. Here we compare the prediction result when using only the customers whose outcomes are all 0 or all 1 with the result for customers that include both. dmi3kno's write-up does not specify which algorithm was used, but it is probably XGBoost.
From these results, we can see that the outcome of almost all customers can be predicted with fairly high accuracy using char_38 alone. On the other hand, the estimates are weak for customers whose outcome changed along the way.
First, let's look, in chronological order, at the status of six customers whose outcome changed partway through the observation period.
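A possible sketch of this plot, using the merged d1 from above: it simply picks six people whose outcome is not constant and plots outcome against activity date.
import matplotlib.pyplot as plt

d1_train = d1[d1['outcome'].notnull()]                       # training rows only
per_person = d1_train.groupby('people_id')['outcome'].mean()
changed = per_person[(per_person > 0) & (per_person < 1)].index[:6]  # six people with mixed outcomes

fig, axes = plt.subplots(6, 1, figsize=(12, 12), sharex=True)
for ax, pid in zip(axes, changed):
    sub = d1_train[d1_train['people_id'] == pid].sort_values('activdate')
    ax.plot(sub['activdate'], sub['outcome'], marker='.')
    ax.set_ylabel(pid)
plt.show()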
You can see that most of these customers change their outcome only once during the observation period. The problem thus becomes a time-series analysis of predicting when this outcome changes. Next, let's look at when a particular cluster of customers changed its outcome, using a variable with few missing values: group_1. We draw the same kind of graph as before for six randomly selected group_1 values.
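The same kind of sketch works per group by replacing people_id with group_1 (again assuming d1_train, numpy, and matplotlib from the previous sketch):
per_group = d1_train.groupby('group_1')['outcome'].mean()
mixed_groups = per_group[(per_group > 0) & (per_group < 1)].index
sample = np.random.choice(mixed_groups.values, 6, replace=False)   # six random ambivalent groups

fig, axes = plt.subplots(6, 1, figsize=(12, 12), sharex=True)
for ax, g in zip(axes, sample):
    sub = d1_train[d1_train['group_1'] == g].sort_values('activdate')
    ax.plot(sub['activdate'], sub['outcome'], marker='.')
    ax.set_ylabel(g)
plt.show()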
You can see that the changes look exactly the same as when graphing by people_id. In other words, the goal of this competition comes down to predicting the change point for each group_1.
These "intermediate elements" need to be brought to their respective groups (0 or 1).
The analysis so far raises some questions.
Looking into these questions gives the following.
Here, ambivalent means a group containing both 0s and 1s, uniform means a group containing only 0s or only 1s, and unknown means a group that appears only in the test set. Next, let's see how many activities each type of group contains. For example, if you predicted these at random (without any clear signal), you could estimate the best achievable score in terms of entropy. Using XGBoost and char_38 gives a clue for these groups.
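This is not the original script, but one way to count the groups and activities in each of these three categories from d1 might look like this:
# Classify each group_1 by the outcomes seen in its training rows
train_rows = d1[d1['outcome'].notnull()]
n_unique = train_rows.groupby('group_1')['outcome'].nunique()    # 1 = uniform, 2 = ambivalent

category = pd.Series('unknown', index=d1['group_1'].unique())    # default: test-only groups
category[n_unique[n_unique == 1].index] = 'uniform'
category[n_unique[n_unique == 2].index] = 'ambivalent'

print(category.value_counts())                      # number of groups per category
print(d1['group_1'].map(category).value_counts())   # number of activities per category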
Next, for the ambivalent groups, let's look at how many times the outcome changed.
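One way to count these switches is sketched below (sorting each group's activities by date and counting changes of outcome between consecutive training rows; a sketch under these assumptions, not the original script):
def n_changes(s):
    # s: outcomes of one group in date order; NaN rows (test set) are dropped
    vals = s.dropna().values
    return int((vals[1:] != vals[:-1]).sum())

changes = (d1.sort_values('activdate')
             .groupby('group_1')['outcome']
             .apply(n_changes))
print(changes.value_counts())   # how many groups changed 0, 1, 2, ... times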
bouncing refers to groups in which changes occur in both directions, from 0 to 1 and from 1 to 0. Among these bouncing groups, we will look at some that changed more than once.
##
## 0 1 2 3
## 25646 3687 565 1
Finally, we visualize some of these groups whose outcomes have changed multiple times.