[PYTHON] Kaggle Summary: Outbrain # 1

Introduction

This is one of a series of articles revisiting past Kaggle competitions. Here we introduce the data for Outbrain Click Prediction and pick up the prominent discussions from the forum. The winner's code is summarized in Kaggle Summary: Outbrain (Part 2), together with a summary of each data analysis. (Currently under construction)

This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. It has been confirmed to work on Jupyter Notebook (adjust %matplotlib inline as appropriate for your environment). If you find any errors when running the sample scripts, comments would be appreciated.

Table of contents

  1. Overview
  2. Evaluation index
  3. Introduction of data
  4. Kernels

Background

front_page.png

On the Internet, users act based on their communities and experiences, discover new articles, and head to their next destination based on what they find. Outbrain provides a content discovery platform that guides users to content matching their preferences, serving 250 billion personalized recommendations across thousands of sites each month. In this competition, Kagglers are asked to predict user behavior from click history, with the aim of building a recommendation algorithm that serves stories users find more satisfying than ever.

The characteristic points of this competition are as follows.

2. Evaluation index

The evaluation metric for this competition is Mean Average Precision @ 12 (MAP@12).

Screen Shot 2017-01-12 at 7.42.08.png

Details are given in the linked description, but it is a variant of MAP (mean average precision); the precision here is the same Precision that appears in the F-measure and ROC contexts. Since thorough explanations are available in Japanese and elsewhere, a detailed explanation is omitted here.
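For reference, below is a minimal sketch of MAP@12 for this competition (a hypothetical helper, not the official evaluation code). Since exactly one ad per display_id is clicked, the average precision for a display reduces to 1/rank of the clicked ad.

def map_at_12(predicted_lists, clicked_ads):
    # predicted_lists: one ranked list of ad_ids per display_id
    # clicked_ads: the single clicked ad_id for each display_id
    total = 0.0
    for preds, clicked in zip(predicted_lists, clicked_ads):
        preds = preds[:12]  # only the top 12 predictions count
        if clicked in preds:
            total += 1.0 / (preds.index(clicked) + 1)
    return total / len(clicked_ads)

# Toy example: clicked ad ranked 2nd in the first display, 1st in the second
print(map_at_12([[10, 20, 30], [5, 6]], [20, 5]))  # (1/2 + 1/1) / 2 = 0.75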

The submission file is a CSV that maps each display_id to its ad_ids, where the ad_id field is a space-delimited list ordered by predicted likelihood of being clicked.

display_id,ad_id
16874594,66758 150083 162754 170392 172888 180797
16874595,8846 30609 143982
16874596,11430 57197 132820 153260 173005 288385 289122 289915
etc.

3. Introduction of data

This dataset consists of time-series and categorical data split across a large number of files. It covers users' page views and click history over roughly two weeks, from June 14 to 28, 2016, across multiple sites in the United States. Details are as follows.

The data contains the recommendations shown to each user in a particular context, where a context is the set of recommendations displayed together. Each context is represented by a display_id, and for each display_id at least one of the recommendations was clicked. In this competition, the goal is to rank the recommendations within each display_id in descending order of predicted click probability.

Note that this is very large relational data. Some of the tables are small enough to fit in memory, but the page view data alone is over 200 million rows, about 100GB uncompressed.
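Because the full page view file does not fit comfortably in memory, one option is to process it in chunks. The following is a minimal sketch, assuming the full page_views.csv is available and that only the uuid column is needed; the chunk size is arbitrary.

import pandas as pd

# Count page views per user without loading the whole file into memory
uuid_counts = None
for chunk in pd.read_csv('../input/page_views.csv', usecols=['uuid'],
                         chunksize=10000000):
    counts = chunk['uuid'].value_counts()
    uuid_counts = counts if uuid_counts is None else uuid_counts.add(counts, fill_value=0)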

Data field

In the data, each user is represented by a unique uuid, and each piece of content viewed is represented by a document_id. Each document displays advertisements identified by ad_id, and each ad belongs to a campaign_id run by an advertiser_id. There is also metadata for each document_id: the entities it mentions, its category classification, its topics, and its provider.
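As a small illustration of these relationships, here is a sketch (column names are assumed from the data description) that joins each ad to the metadata of the document it promotes:

import pandas as pd

promoted = pd.read_csv('../input/promoted_content.csv')  # ad_id, document_id, campaign_id, advertiser_id (assumed columns)
meta = pd.read_csv('../input/documents_meta.csv')         # per-document metadata
ads_with_meta = promoted.merge(meta, on='document_id', how='left')
print(ads_with_meta.head())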

File details

Next, each file is explained. There are six types of files in this competition. Naturally, each dataset is anonymized so that individual users cannot be identified.

1. page_views.csv: The log of pages (documents) visited by users. To save space, timestamps are stored as offsets relative to the start of the dataset; to recover the actual epoch time in milliseconds, add 1465876799998 to the timestamp (a small sketch of this appears after the file list).

2. clicks_train.csv This is the history of clicks on ads.

clicks_test.csv is provided in the same format, but without the clicked column, i.e. without the ground truth. The goal of the competition is to build a model that predicts clicked for this file. Exactly one ad is clicked per display_id (this is checked in the sketch after the file list). The display_ids in the test data span the entire timeframe of the dataset.

3. events.csv: Provides the context for each display_id. This file covers both train and test.

4. promoted_content.csv: Provides details of the advertisements.

5. documents_meta.csv: Detailed information on each document.

6. documents_topics.csv, documents_entities.csv, documents_categories.csv: Describe the topics, entities, and categories associated with each document, each with a confidence level. Detailed information is as follows.

page_view.png
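Below is a small sketch (not from the original article; column names are assumed) of two points mentioned above: recovering real timestamps by adding 1465876799998, and checking that each display_id in clicks_train has exactly one clicked ad.

import pandas as pd

# Recover real datetimes from the offset timestamps in the page view sample
views = pd.read_csv('../input/page_views_sample.csv', nrows=5)
print(pd.to_datetime(views['timestamp'] + 1465876799998, unit='ms'))

# Check that exactly one ad is clicked per display_id in clicks_train.csv
clicks = pd.read_csv('../input/clicks_train.csv', usecols=['display_id', 'clicked'])
print(clicks.groupby('display_id')['clicked'].sum().value_counts().head())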

4. Kernels

4.1. Analysis of main data (click information)

Let's analyze the main data. First, check the file sizes.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import gc # We're gonna be clearing memory a lot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

p = sns.color_palette()

print('# File sizes')
for f in os.listdir('../input'):
    if 'zip' not in f:
        print(f.ljust(30) + str(round(os.path.getsize('../input/' + f) / 1000000, 2)) + 'MB')
# File sizes
page_views_sample.csv         454.35MB
documents_meta.csv            89.38MB
documents_categories.csv      118.02MB
events.csv                    1208.55MB
clicks_test.csv               506.95MB
promoted_content.csv          13.89MB
documents_topics.csv          339.47MB
documents_entities.csv        324.1MB
sample_submission.csv         273.14MB
clicks_train.csv              1486.73MB

You can see that there are many CSV files. Next, we will focus on clicks_train and clicks_test for analysis.

df_train = pd.read_csv('../input/clicks_train.csv')
df_test = pd.read_csv('../input/clicks_test.csv')
sizes_train = df_train.groupby('display_id')['ad_id'].count().value_counts()
sizes_test = df_test.groupby('display_id')['ad_id'].count().value_counts()
sizes_train = sizes_train / np.sum(sizes_train)
sizes_test = sizes_test / np.sum(sizes_test)

plt.figure(figsize=(12,4))
sns.barplot(sizes_train.index, sizes_train.values, alpha=0.8, color=p[0], label='train')
sns.barplot(sizes_test.index, sizes_test.values, alpha=0.6, color=p[1], label='test')
plt.legend()
plt.xlabel('Number of Ads in display', fontsize=12)
plt.ylabel('Proportion of set', fontsize=12)

The execution result is as follows.

__results___4_1.png

This visualizes the distribution of the number of ads included in each display. Next, let's look at a histogram of how many times each ad appears in total.

ad_usage_train = df_train.groupby('ad_id')['ad_id'].count()

for i in [2, 10, 50, 100, 1000]:
    print('Ads that appear less than {} times: {}%'.format(i, round((ad_usage_train < i).mean() * 100, 2)))

plt.figure(figsize=(12, 6))
plt.hist(ad_usage_train.values, bins=50, log=True)
plt.xlabel('Number of times ad appeared', fontsize=12)
plt.ylabel('log(Count of displays with ad)', fontsize=12)
plt.show()
Ads that appear less than 2 times: 22.69%
Ads that appear less than 10 times: 61.74%
Ads that appear less than 50 times: 86.93%
Ads that appear less than 100 times: 91.04%
Ads that appear less than 1000 times: 97.45%

__results___6_1.png

We found that the majority of ads were displayed only a small number of times, while a relatively small number of ads were displayed many times. This means that for most ads we have to estimate click behavior from very few prior observations.
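One common way to handle ads with so few impressions is to smooth each ad's click-through rate toward the global average. The following is a sketch of additive (Laplace-style) smoothing, an approach assumed here rather than taken from the original article; the prior strength alpha is arbitrary.

alpha = 10.0
global_ctr = df_train['clicked'].mean()

ad_stats = df_train.groupby('ad_id')['clicked'].agg(['sum', 'count'])
ad_stats['smoothed_ctr'] = (ad_stats['sum'] + alpha * global_ctr) / (ad_stats['count'] + alpha)
print(ad_stats.sort_values('smoothed_ctr', ascending=False).head())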

Finally, check what proportion of the ads in the test set also appear in the training set.

ad_prop = len(set(df_test.ad_id.unique()).intersection(df_train.ad_id.unique())) / len(df_test.ad_id.unique())
print('Proportion of test ads in test that are in training: {}%'.format(round(ad_prop * 100, 2)))
Proportion of test ads in test that are in training: 82.87%

We found that about 83% of the test ads also appear in the training data. It may therefore be useful to build a prediction model based on per-ad information; a simple sketch follows.
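As a concrete example, here is a minimal sketch of such a model: an illustrative baseline (not the winning solution) that ranks each test display's ads by their training-set click-through rate and writes a submission in the format shown in section 2.

# Historical CTR per ad from the training clicks
ad_ctr = df_train.groupby('ad_id')['clicked'].mean()

# Attach the CTR to the test rows; ads unseen in training fall back to 0
test_scored = df_test.copy()
test_scored['ctr'] = test_scored['ad_id'].map(ad_ctr).fillna(0)

# For each display_id, list ad_ids in descending CTR order (space-delimited)
sub = (test_scored.sort_values(['display_id', 'ctr'], ascending=[True, False])
                  .groupby('display_id')['ad_id']
                  .apply(lambda ids: ' '.join(str(i) for i in ids))
                  .reset_index())
sub.to_csv('ctr_baseline_submission.csv', index=False)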

Events

Timestamps and geographic information are analyzed in detail in later sections. Here, we analyze events.csv.

try:
    del df_train, df_test  # Being nice to Azure: free memory before the next load
except NameError:
    pass
gc.collect()

events = pd.read_csv('../input/events.csv')
print('Shape:', events.shape)
print('Columns', events.columns.tolist())
events.head()
display_id	uuid	document_id	timestamp	platform	geo_location
0	1	cb8c55702adb93	379743	61	3	US>SC>519
1	2	79a85fa78311b9	1794259	81	2	US>CA>807
2	3	822932ce3d8757	1179111	182	2	US>MI>505
3	4	85281d0a49f7ac	1777797	234	2	US>WV>564
4	5	8d0daef4bf5b56	252458	338	2	SG>00

This gives a sense of the contents of the data. Next, let's look at the platform column.

plat = events.platform.value_counts()

print(plat)
print('\nUnique values of platform:', events.platform.unique())
2     10684579
1      8747280
3      3032907
2       291699
1       279988
3        83668
\N           5
Name: platform, dtype: int64

Unique values of platform: [3 2 1 '2' '1' '3' '\\N']

From the results, we can see that the platform values 1, 2, and 3 each appear both as numeric and as string types. Let's look at the platform column further.

events.platform = events.platform.astype(str)
plat = events.platform.value_counts()

plt.figure(figsize=(12,4))
sns.barplot(plat.index, plat.values, alpha=0.8, color=p[2])
plt.xlabel('Platform', fontsize=12)
plt.ylabel('Occurence count', fontsize=12)

__results___13_1.png

It is not clear what platforms 1, 2, and 3 indicate, but they are probably desktop, phone, and tablet. The '\N' values and the string-typed numbers look like anomalies introduced by some logging or export error; it is hard to say more than that here.
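One possible cleanup, assumed here rather than taken from the original article, is to coerce platform to a numeric type so the duplicated numeric/string values merge, and to set the '\N' rows aside:

platform_numeric = pd.to_numeric(events['platform'], errors='coerce')
print('Rows with unparseable platform:', platform_numeric.isnull().sum())  # the '\N' rows
events_clean = events[platform_numeric.notnull()]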

Next, we will dig deeper into UUIDs.

uuid_counts = events.groupby('uuid')['uuid'].count().sort_values()

print(uuid_counts.tail())

for i in [2, 5, 10]:
    print('Users that appear less than {} times: {}%'.format(i, round((uuid_counts < i).mean() * 100, 2)))
    
plt.figure(figsize=(12, 4))
plt.hist(uuid_counts.values, bins=50, log=True)
plt.xlabel('Number of times user appeared in set', fontsize=12)
plt.ylabel('log(Count of users)', fontsize=12)
plt.show()
uuid
ef7761dd22277c    38
45d23867dbe3b3    38
c0bd502c7a479f    42
2759b057797f02    46
b88553e3a2aa29    49
Name: uuid, dtype: int64
Users that appear less than 2 times: 88.42%
Users that appear less than 5 times: 99.51%
Users that appear less than 10 times: 99.96%

__results___15_1.png

A single user appears at most a few dozen times, and the vast majority of users appear only once or twice, so it seems difficult to build a purely user-based prediction model.

Also, approaches like collaborative filtering that consider all user-item combinations are likely to consume a large amount of memory.

Categories

Outbrain provides pre-clustered results. We will visualize how frequently the topic and category data appear.

try:
    del events  # Free memory before the next load
except NameError:
    pass
gc.collect()

topics = pd.read_csv('../input/documents_topics.csv')
print('Columns:',topics.columns.tolist())
print('Number of unique topics:', len(topics.topic_id.unique()))

topics.head()
document_id	topic_id	confidence_level
0	1595802	140	0.073113
1	1595802	16	0.059416
2	1595802	143	0.045421
3	1595802	170	0.038867
4	1524246	113	0.196450

Next, count how often each topic appears.

topic_ids = topics.groupby('topic_id')['confidence_level'].count().sort_values()

for i in [10000, 50000, 100000, 200000]:
    print('Number of topics that appear more than {} times: {}'
          .format(i, (topic_ids > i).sum()))

plt.figure(figsize=(12, 4))
sns.barplot(topic_ids.index, topic_ids.values, order=topic_ids.index, alpha=1, color=p[5])
plt.xlabel('Document Topics', fontsize=12)
plt.ylabel('Total occurences', fontsize=12)
plt.show()
Number of topics that appear more than 10000 times: 201
Number of topics that appear more than 50000 times: 86
Number of topics that appear more than 100000 times: 22
Number of topics that appear more than 200000 times: 3

__results___18_1.png

You can see that the topic counts spread widely, from near zero up to around 300,000.

cat = pd.read_csv('../input/documents_categories.csv')
print('Columns:', cat.columns.tolist())
print('Number of unique categories:', len(cat.category_id.unique()))

cat_ids = cat.groupby('category_id')['confidence_level'].count().sort_values()

for i in [1000, 10000, 50000, 100000]:
    print('Number of categories that appear more than {} times: {}'
          .format(i, (cat_ids > i).sum()))

plt.figure(figsize=(12, 4))
sns.barplot(cat_ids.index, cat_ids.values, order=cat_ids.index, alpha=1, color=p[3])
plt.xlabel('Document Categories', fontsize=12)
plt.ylabel('Total occurences', fontsize=12)
plt.show()
Columns: ['document_id', 'category_id', 'confidence_level']
Number of unique categories: 97
Number of categories that appear more than 1000 times: 88
Number of categories that appear more than 10000 times: 65
Number of categories that appear more than 50000 times: 35
Number of categories that appear more than 100000 times: 15

__results___19_1.png

The category data can be visualized in the same way.

4.2. Analysis of geographic data

We will analyze each country and focus on the United States. Read the data.

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

page_views_sample_df = pd.read_csv("../input/page_views_sample.csv", usecols=['uuid', 'geo_location'])
# Drop NAs
page_views_sample_df.dropna(inplace=True)
# Drop EU code
page_views_sample_df = page_views_sample_df.loc[~page_views_sample_df.geo_location.isin(['EU', '--']), :]
# Drop duplicates
page_views_sample_df = page_views_sample_df.drop_duplicates('uuid', keep='first')

Next, count by country.

country = page_views_sample_df.copy()
country.columns = ['uuid', 'Country']
country.Country = country.Country.str[:2]
country.loc[:, 'UserCount'] = country.groupby('Country')['Country'].transform('count')
country = country.loc[:, ['Country', 'UserCount']].drop_duplicates('Country', keep='first')
country.sort_values('UserCount', ascending=False, inplace=True)
country.head(10)

To plot this on a map, map the country codes to country names by hand.

countryCode2Name = {u'BD': u'Bangladesh', u'BE': u'Belgium', u'BF': u'Burkina Faso', u'BG': u'Bulgaria', u'BA': u'Bosnia and Herzegovina', u'BB': u'Barbados', u'WF': u'Wallis and Futuna', u'BL': u'Saint Barth\xe9lemy', u'BM': u'Bermuda', u'BN': u'Brunei Darussalam', u'BO': u'Bolivia, Plurinational State of', u'BH': u'Bahrain', u'BI': u'Burundi', u'BJ': u'Benin', u'BT': u'Bhutan', u'JM': u'Jamaica', u'BV': u'Bouvet Island', u'BW': u'Botswana', u'WS': u'Samoa', u'BQ': u'Bonaire, Sint Eustatius and Saba', u'BR': u'Brazil', u'BS': u'Bahamas', u'JE': u'Jersey', u'BY': u'Belarus', u'BZ': u'Belize', u'RU': u'Russian Federation', u'RW': u'Rwanda', u'RS': u'Serbia', u'TL': u'Timor-Leste', u'RE': u'R\xe9union', u'TM': u'Turkmenistan', u'TJ': u'Tajikistan', u'RO': u'Romania', u'TK': u'Tokelau', u'GW': u'Guinea-Bissau', u'GU': u'Guam', u'GT': u'Guatemala', u'GS': u'South Georgia and the South Sandwich Islands', u'GR': u'Greece', u'GQ': u'Equatorial Guinea', u'GP': u'Guadeloupe', u'JP': u'Japan', u'GY': u'Guyana', u'GG': u'Guernsey', u'GF': u'French Guiana', u'GE': u'Georgia', u'GD': u'Grenada', u'GB': u'United Kingdom', u'GA': u'Gabon', u'GN': u'Guinea', u'GM': u'Gambia', u'GL': u'Greenland', u'GI': u'Gibraltar', u'GH': u'Ghana', u'OM': u'Oman', u'TN': u'Tunisia', u'JO': u'Jordan', u'HR': u'Croatia', u'HT': u'Haiti', u'HU': u'Hungary', u'HK': u'Hong Kong', u'HN': u'Honduras', u'HM': u'Heard Island and McDonald Islands', u'VE': u'Venezuela, Bolivarian Republic of', u'PR': u'Puerto Rico', u'PS': u'Palestine, State of', u'PW': u'Palau', u'PT': u'Portugal', u'KN': u'Saint Kitts and Nevis', u'PY': u'Paraguay', u'IQ': u'Iraq', u'PA': u'Panama', u'PF': u'French Polynesia', u'PG': u'Papua New Guinea', u'PE': u'Peru', u'PK': u'Pakistan', u'PH': u'Philippines', u'PN': u'Pitcairn', u'PL': u'Poland', u'PM': u'Saint Pierre and Miquelon', u'ZM': u'Zambia', u'EH': u'Western Sahara', u'EE': u'Estonia', u'EG': u'Egypt', u'ZA': u'South Africa', u'EC': u'Ecuador', u'IT': u'Italy', u'VN': u'Viet Nam', u'SB': u'Solomon Islands', u'ET': u'Ethiopia', u'SO': u'Somalia', u'ZW': u'Zimbabwe', u'SA': u'Saudi Arabia', u'ES': u'Spain', u'ER': u'Eritrea', u'ME': u'Montenegro', u'MD': u'Moldova, Republic of', u'MG': u'Madagascar', u'MF': u'Saint Martin (French part)', u'MA': u'Morocco', u'MC': u'Monaco', u'UZ': u'Uzbekistan', u'MM': u'Myanmar', u'ML': u'Mali', u'MO': u'Macao', u'MN': u'Mongolia', u'MH': u'Marshall Islands', u'MK': u'Macedonia, Republic of', u'MU': u'Mauritius', u'MT': u'Malta', u'MW': u'Malawi', u'MV': u'Maldives', u'MQ': u'Martinique', u'MP': u'Northern Mariana Islands', u'MS': u'Montserrat', u'MR': u'Mauritania', u'IM': u'Isle of Man', u'UG': u'Uganda', u'TZ': u'Tanzania, United Republic of', u'MY': u'Malaysia', u'MX': u'Mexico', u'IL': u'Israel', u'FR': u'France', u'AW': u'Aruba', u'SH': u'Saint Helena, Ascension and Tristan da Cunha', u'SJ': u'Svalbard and Jan Mayen', u'FI': u'Finland', u'FJ': u'Fiji', u'FK': u'Falkland Islands (Malvinas)', u'FM': u'Micronesia, Federated States of', u'FO': u'Faroe Islands', u'NI': u'Nicaragua', u'NL': u'Netherlands', u'NO': u'Norway', u'NA': u'Namibia', u'VU': u'Vanuatu', u'NC': u'New Caledonia', u'NE': u'Niger', u'NF': u'Norfolk Island', u'NG': u'Nigeria', u'NZ': u'New Zealand', u'NP': u'Nepal', u'NR': u'Nauru', u'NU': u'Niue', u'CK': u'Cook Islands', u'CI': u"C\xf4te d'Ivoire", u'CH': u'Switzerland', u'CO': u'Colombia', u'CN': u'China', u'CM': u'Cameroon', u'CL': u'Chile', u'CC': u'Cocos (Keeling) Islands', u'CA': u'Canada', u'CG': u'Congo', u'CF': u'Central African Republic', 
u'CD': u'Congo, The Democratic Republic of the', u'CZ': u'Czech Republic', u'CY': u'Cyprus', u'CX': u'Christmas Island', u'CR': u'Costa Rica', u'CW': u'Cura\xe7ao', u'CV': u'Cape Verde', u'CU': u'Cuba', u'SZ': u'Swaziland', u'SY': u'Syrian Arab Republic', u'SX': u'Sint Maarten (Dutch part)', u'KG': u'Kyrgyzstan', u'KE': u'Kenya', u'SS': u'South Sudan', u'SR': u'Suriname', u'KI': u'Kiribati', u'KH': u'Cambodia', u'SV': u'El Salvador', u'KM': u'Comoros', u'ST': u'Sao Tome and Principe', u'SK': u'Slovakia', u'KR': u'Korea, Republic of', u'SI': u'Slovenia', u'KP': u"Korea, Democratic People's Republic of", u'KW': u'Kuwait', u'SN': u'Senegal', u'SM': u'San Marino', u'SL': u'Sierra Leone', u'SC': u'Seychelles', u'KZ': u'Kazakhstan', u'KY': u'Cayman Islands', u'SG': u'Singapore', u'SE': u'Sweden', u'SD': u'Sudan', u'DO': u'Dominican Republic', u'DM': u'Dominica', u'DJ': u'Djibouti', u'DK': u'Denmark', u'DE': u'Germany', u'YE': u'Yemen', u'DZ': u'Algeria', u'US': u'United States', u'UY': u'Uruguay', u'YT': u'Mayotte', u'UM': u'United States Minor Outlying Islands', u'LB': u'Lebanon', u'LC': u'Saint Lucia', u'LA': u"Lao People's Democratic Republic", u'TV': u'Tuvalu', u'TW': u'Taiwan, Province of China', u'TT': u'Trinidad and Tobago', u'TR': u'Turkey', u'LK': u'Sri Lanka', u'LI': u'Liechtenstein', u'LV': u'Latvia', u'TO': u'Tonga', u'LT': u'Lithuania', u'LU': u'Luxembourg', u'LR': u'Liberia', u'LS': u'Lesotho', u'TH': u'Thailand', u'TF': u'French Southern Territories', u'TG': u'Togo', u'TD': u'Chad', u'TC': u'Turks and Caicos Islands', u'LY': u'Libya', u'VA': u'Holy See (Vatican City State)', u'VC': u'Saint Vincent and the Grenadines', u'AE': u'United Arab Emirates', u'AD': u'Andorra', u'AG': u'Antigua and Barbuda', u'AF': u'Afghanistan', u'AI': u'Anguilla', u'IS': u'Iceland', u'IR': u'Iran, Islamic Republic of', u'AM': u'Armenia', u'AL': u'Albania', u'AO': u'Angola', u'AQ': u'Antarctica', u'AS': u'American Samoa', u'AR': u'Argentina', u'AU': u'Australia', u'AT': u'Austria', u'IO': u'British Indian Ocean Territory', u'IN': u'India', u'AX': u'\xc5land Islands', u'AZ': u'Azerbaijan', u'IE': u'Ireland', u'ID': u'Indonesia', u'UA': u'Ukraine', u'QA': u'Qatar', u'MZ': u'Mozambique', u'FX': u'France, Metropolitan', u'AN': u'Netherlands Antilles', u'A1': u'Anguilla'}
country['CountryName'] = country['Country'].map(countryCode2Name)

# Drop NAs
country.dropna(inplace=True)

country['CumulativePercentage'] = 100 * country.UserCount.cumsum()/country.UserCount.sum()
country.reset_index(drop=True, inplace=True)
country[['CountryName', 'UserCount', 'CumulativePercentage']].head(10)

The country names have now been filled in.

CountryName	UserCount	CumulativePercentage
0	United States	7503199	81.582245
1	Canada	372772	85.635393
2	United Kingdom	280972	88.690400
3	Australia	133158	90.138226
4	India	90453	91.121721
5	Philippines	55822	91.728674
6	Germany	46026	92.229114
7	South Africa	34046	92.599296
8	France	29169	92.916450
9	Singapore	29037	93.232170

Let's view the counts interactively on a world map.

import plotly.offline as py
py.offline.init_notebook_mode()

data = [ dict(
        type = 'choropleth',
        locations = country['CountryName'],
        z = country['UserCount'],
        locationmode = 'country names',
        text = country['CountryName'],
        colorscale = [[0,"rgb(153, 241, 243)"],[0.005,"rgb(16, 64, 143)"],[1,"rgb(0, 0, 0)"]],
        autocolorscale = False,
        marker = dict(
            line = dict(color = 'rgb(58,100,69)', width = 0.6)),
            colorbar = dict(autotick = True, tickprefix = '', title = '# of Users')
            )
       ]

layout = dict(
    title = 'Total number of users by country',
    geo = dict(
        showframe = False,
        showcoastlines = True,
        projection = dict(
        type = 'equirectangular'
        ),
    margin = dict(b = 0, t = 0, l = 0, r = 0)
            )
    )

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='worldmap')

newplot.png

If you run this in a notebook, you can hover the cursor over each country to check its user count.

Focusing on the USA

First, organize the data.

usa = page_views_sample_df.loc[page_views_sample_df.geo_location.str[:2] == 'US', :]
usa.columns = ['uuid', 'State']

usa.State = usa.State.str[3:5]

# Drop Data with missing state info
usa = usa.loc[usa.State != '', :]

usa.loc[:, 'UserCount'] = usa.groupby('State')['State'].transform('count')
usa = usa[['State', 'UserCount']].drop_duplicates('State', keep='first')
usa.sort_values('UserCount', ascending=False, inplace=True)

Map the state codes to state names.

stateCode2Name = {'AK': 'Alaska', 'AL': 'Alabama', 'AR': 'Arkansas', 'AS': 'American Samoa', 'AZ': 'Arizona', 'CA': 'California', 'CO': 'Colorado', 'CT': 'Connecticut', 'DC': 'District of Columbia', 'DE': 'Delaware', 'FL': 'Florida', 'GA': 'Georgia', 'GU': 'Guam', 'HI': 'Hawaii', 'IA': 'Iowa', 'ID': 'Idaho', 'IL': 'Illinois', 'IN': 'Indiana', 'KS': 'Kansas', 'KY': 'Kentucky', 'LA': 'Louisiana', 'MA': 'Massachusetts', 'MD': 'Maryland', 'ME': 'Maine', 'MI': 'Michigan', 'MN':'Minnesota', 'MO': 'Missouri', 'MP': 'Northern Mariana Islands', 'MS': 'Mississippi', 'MT': 'Montana', 'NA': 'National', 'NC': 'North Carolina', 'ND': 'North Dakota', 'NE':'Nebraska', 'NH': 'New Hampshire', 'NJ': 'New Jersey', 'NM': 'New Mexico', 'NV': 'Nevada', 'NY': 'New York', 'OH': 'Ohio', 'OK': 'Oklahoma', 'OR': 'Oregon', 'PA': 'Pennsylvania', 'PR': 'Puerto Rico', 'RI': 'Rhode Island', 'SC': 'South Carolina', 'SD': 'South Dakota', 'TN': 'Tennessee', 'TX': 'Texas', 'UT': 'Utah', 'VA': 'Virginia', 'VI': 'Virgin Islands', 'VT': 'Vermont', 'WA': 'Washington', 'WI': 'Wisconsin', 'WV': 'West Virginia', 'WY': 'Wyoming'}
usa['StateName'] = usa['State'].map(stateCode2Name)
# Drop NAs
usa.dropna(inplace=True)

usa['CumulativePercentage'] = 100 * usa.UserCount.cumsum()/usa.UserCount.sum()
usa.reset_index(drop=True, inplace=True)
usa[['StateName', 'UserCount', 'CumulativePercentage']].head(50)

Visualize the count in the United States.

import plotly.offline as py
py.offline.init_notebook_mode()

data = [ dict(
        type = 'choropleth',
        locations = usa['State'],
        z = usa['UserCount'],
        locationmode = 'USA-states',
        text = usa['StateName'],
        colorscale = [[0,"rgb(153, 241, 243)"],[0.33,"rgb(16, 64, 143)"],[1,"rgb(0, 0, 0)"]],
        autocolorscale = False,
        marker = dict(
            line = dict(color = 'rgb(58,100,69)', width = 0.6)),
            colorbar = dict(autotick = True, tickprefix = '', title = '# of Users')
            )
       ]

layout = dict(
    title = 'Total number of users by state',
    geo = dict(
        scope='usa',
        projection=dict( type='albers usa' ),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)'),
    )

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='USmap')

newplot (1).png

4.3. Timestamp analysis

Analyze the timestamp.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

events = pd.read_csv("../input/events.csv", dtype=np.int32, index_col=0, usecols=[0,3])
events.head()
      timestamp 
display_id	
1	61
2	81
3	182
4	234
5	338

There is a timestamp for each display_id. As explained in the data introduction, it is in milliseconds and is stored as an offset from 1465876799998 ms after the Unix epoch (1970-01-01). In other words, timestamp zero corresponds to roughly 04:00 UTC on June 14, 2016.
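As a quick sanity check (a sketch, not in the original article), the offset constant can be converted back to a UTC datetime:

import datetime

# 1465876799998 ms since the Unix epoch should land just before 04:00 UTC on 2016-06-14
print(datetime.datetime.utcfromtimestamp(1465876799998 / 1000.0))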

train = pd.merge(pd.read_csv("../input/clicks_train.csv", dtype=np.int32, index_col=0).sample(frac=0.1),
                 events, left_index=True, right_index=True)
test = pd.merge(pd.read_csv("../input/clicks_test.csv", dtype=np.int32, index_col=0).sample(frac=0.1),
                events, left_index=True, right_index=True)

Having read a sample of the clicks data merged with the event timestamps, we derive the hour and day from each timestamp.

test["hour"] = (test.timestamp // (3600 * 1000)) % 24
test["day"] = test.timestamp // (3600 * 24 * 1000)

train["hour"] = (train.timestamp // (3600 * 1000)) % 24
train["day"] = train.timestamp // (3600 * 24 * 1000)

plt.figure(figsize=(12,4))
train.hour.hist(bins=np.linspace(-0.5, 23.5, 25), label="train", alpha=0.7, normed=True)
test.hour.hist(bins=np.linspace(-0.5, 23.5, 25), label="test", alpha=0.7, normed=True)
plt.xlim(-0.5, 23.5)
plt.legend(loc="best")
plt.xlabel("Hour of Day")
plt.ylabel("Fraction of Events")

The hourly distribution is plotted below.

__results___4_1 (1).png

The hourly distributions look broadly consistent, with only a slight difference between train and test.

plt.figure(figsize=(12,4))
train.day.hist(bins=np.linspace(-.5, 14.5, 16), label="train", alpha=0.7, normed=True)
test.day.hist(bins=np.linspace(-.5, 14.5, 16), label="test", alpha=0.7, normed=True)
plt.xlim(-0.5, 14.5)
plt.legend(loc="best")
plt.xlabel("Days since June 14")
plt.ylabel("Fraction of Events")

__results___6_1.png

By day, however, train and test differ: more than half of the test data was sampled from days 13 and 14, and the rest was sampled from the same days as the training data.
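Given this split by day, a natural validation strategy (an assumption on our part, with an arbitrary cutoff) is to hold out the last days of the training period as a validation set:

# Hold out the last two days of the training period for validation
cutoff = train['day'].max() - 1
train_part = train[train['day'] < cutoff]
valid_part = train[train['day'] >= cutoff]
print(len(train_part), len(valid_part))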

To deepen our understanding, let's visualize the training data as a heat map by day and hour.

plt.figure(figsize=(12,6))
hour_day_counts = train.groupby(["hour", "day"]).count().ad_id.values.reshape(24,-1)
# plot 2d hist in days and hours, with each day normalised to 1 
plt.imshow((hour_day_counts / hour_day_counts.sum(axis=0)).T,
           interpolation="none", cmap="rainbow")
plt.xlabel("Hour of Day")
plt.ylabel("Days since June 14")

__results___8_1.png

The vertical axis is the day and the horizontal axis is the hour of day. Let's visualize the test set in the same way.

# for completeness, the test set too:
plt.figure(figsize=(12,6))
hour_day_counts = test.groupby(["hour", "day"]).count().ad_id.values.reshape(24,-1)
# plot 2d hist in days and hours, with each day normalised to 1 
plt.imshow((hour_day_counts / hour_day_counts.sum(axis=0)).T,
           interpolation="none", cmap="rainbow")
plt.xlabel("Hour of Day")
plt.ylabel("Days since June 14")

__results___10_1.png
