[PYTHON] Kaggle Competition Hands-on: Real or Not? NLP with Disaster Tweets ~ EDA / Preprocessing ~

The big picture of the article

This article is in two parts.

・ Part 1: EDA / Preprocessing ← You are here
・ Part 2: Vectorization / Modeling (LSTM, BERT) (under construction, coming soon!)

Since the second part uses the data preprocessed in the first part, the bottom of this article includes code that runs all of the preprocessing in one batch. If you want to skip the first part, run that code and proceed to the second part. (Please refer to this article for how to obtain the data!)

Overview of the competition

Kaggle : Real or Not? NLP with Disaster Tweets is an introductory competition for natural language processing. The task is to classify tweets into two categories: "disaster tweets" and "non-disaster tweets".

Recently, Twitter has been used to request rescue in the event of a disaster, and disaster relief organizations and news agencies are increasingly interested in automatically monitoring tweets during a disaster. However, it is difficult to determine mechanically whether a tweet actually describes a disaster. For example, the word "burning", which can explicitly describe a disaster, is also used metaphorically, as in "the sky is burning." In this competition, we use a dataset of 10,000 tweets to build a machine learning model that predicts whether each tweet is a "disaster tweet" or a "non-disaster tweet".

Hands-on!

We will proceed with hands-on using Jupyter Notebook.

0. Obtaining data

From Kaggle : Real or Not? NLP with Disaster Tweets - Data, download the two files test.csv and train.csv. Then place these csv files in the same directory as the notebook used for the analysis.

The contents of each column in the csv file are as follows.

Column      Description
id          Unique identifier for each tweet
text        Body text of the tweet
location    Location the tweet was sent from (may be blank)
keyword     A particular keyword from the tweet (may be blank)
target      Tweet label (disaster tweet = 1, non-disaster tweet = 0)

Details and missing-value information are covered in the next section. What the competition asks you to predict is the target column for test.csv.
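For reference, a submission to this competition is simply the id column of test.csv paired with a predicted target value. The following is a minimal sketch only (the all-zeros prediction and the output file name are placeholders of my own, not part of this hands-on):

#Minimal sketch of a submission file; the all-zeros prediction is a placeholder
import pandas as pd

df_test = pd.read_csv('test.csv')
submission = pd.DataFrame({'id': df_test['id'], 'target': 0})  #replace 0 with real model predictions
submission.to_csv('submission.csv', index=False)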

1. Load the library

Import the libraries used in this hands-on up front.

#Data analysis
import pandas as pd
import numpy as np

#Visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#Natural language processing
import string #to get the list of symbols (punctuation)
import re
import contractions #to expand contractions
from wordcloud import STOPWORDS #to get the list of stop words
from collections import defaultdict #used when creating n-grams

2. Check the contents of the data

Let's check the contents of the data right away. First, read the data into data frames, display the number of rows and columns of each, and randomly sample 10 rows.

#Reading training data and test data
df_train = pd.read_csv('train.csv', dtype={'id': np.int16, 'target': np.int8})
df_test = pd.read_csv('test.csv', dtype={'id': np.int16})

#Display the number of rows and columns of training data and test data
print('Training Set Shape = {}'.format(df_train.shape))
print('Test Set Shape = {}'.format(df_test.shape))

#Randomly extract 10 rows from the training data
df_train.sample(n=10, random_state=28)

Part of the execution result


Training Set Shape = (7613, 5)
Test Set Shape = (3263, 4)

Below is the continuation of the execution result (the contents of the randomly extracted rows).

The keyword and location columns appear to contain missing values (NaN).

I would like to investigate the missing values in detail.

#Calculate the missing value rate for each column of training data and test data
print("missing-value ratio of training data(%)")
print(df_train.isnull().sum()/df_train.shape[0]*100)
print("\nmissing-value ratio of test data(%)")
print(df_test.isnull().sum()/df_test.shape[0]*100)

Execution result


missing-value ratio of training data(%)
id           0.000000
keyword      0.801261
location    33.272035
text         0.000000
target       0.000000
dtype: float64

missing-value ratio of test data(%)
id           0.000000
keyword      0.796813
location    33.864542
text         0.000000
dtype: float64

It can be seen that the missing values are 0.8% for keyword and 33 to 34% for location in both training data and test data.
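Although this hands-on does not fill these missing values, a common option is to replace them with a placeholder string so that later string operations do not break. A minimal sketch (the placeholder labels no_keyword and no_location are arbitrary names of my own, not a competition convention):

#Sketch: fill missing keyword/location with arbitrary placeholder strings
for df in [df_train, df_test]:
    df['keyword'] = df['keyword'].fillna('no_keyword')
    df['location'] = df['location'].fillna('no_location')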

Next, let's look at the distribution of target in the training data.

#Plot the target values and their counts
target_vals = df_train.target.value_counts()
sns.barplot(x=target_vals.index, y=target_vals.values)
plt.gca().set_ylabel('samples')

value_counts.png

You can see that the dataset contains more non-disaster tweets (= 0) than disaster tweets (= 1).
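To quantify this imbalance, the class ratio can also be printed directly; a small sketch using value_counts:

#Show the class ratio as percentages (non-disaster vs. disaster)
print(df_train['target'].value_counts(normalize=True) * 100)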

Next, I would like to find out the number of unique elements in each column of text, keyword, and location.

#Shows the number of unique elements in text, keyword, location
print(f'Number of unique values in text = {df_train["text"].nunique()} (Training) - {df_test["text"].nunique()} (Test)')
print(f'Number of unique values in keyword = {df_train["keyword"].nunique()} (Training) - {df_test["keyword"].nunique()} (Test)')
print(f'Number of unique values in location = {df_train["location"].nunique()} (Training) - {df_test["location"].nunique()} (Test)')

Execution result


Number of unique values in text = 7503 (Training) - 3243 (Test)
Number of unique values in keyword = 221 (Training) - 221 (Test)
Number of unique values in location = 3342 (Training) - 1603 (Test)

This shows that text and location are free-form input. On the other hand, keyword takes its value from a predefined set of 221 keywords extracted from the text.
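Since the training and test data each have 221 unique keywords, it is natural to check whether the two sets are actually the same; a quick sketch:

#Check whether the keyword sets in the training and test data coincide
train_keywords = set(df_train['keyword'].dropna())
test_keywords = set(df_test['keyword'].dropna())
print(train_keywords == test_keywords)  #True if the two sets are identical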

3. Exploratory Data Analysis (EDA)

Let's do some analysis to understand the characteristics of the data. First, we roughly grasp and compare features of the text. The features are word count, unique word count, stop word count, URL count, mean word length, character count, punctuation count (*), hashtag count, and mention count. One note: "punctuation" (*) here does not refer to punctuation marks in the grammatical sense, but to ASCII characters other than alphanumerics; roughly speaking, think of the "symbols" defined in string.punctuation. These features are added to the data frames as meta features. We then compare the distributions of the 9 features between disaster tweets (= 1) and non-disaster tweets (= 0) in the training data, and between the training data and the test data. Because disaster tweets, non-disaster tweets, the training data, and the test data contain different numbers of samples, visualizing the distributions with kernel density estimation aligns the scales and makes the comparison intuitive. The default argument of seaborn's distplot is kde=True, but this time we write kde=True explicitly.

#Number of words
df_train['word_count'] = df_train['text'].apply(lambda x: len(str(x).split()))
df_test['word_count'] = df_test['text'].apply(lambda x: len(str(x).split()))

#Unique number of words
df_train['unique_word_count'] = df_train['text'].apply(lambda x: len(set(str(x).split())))
df_test['unique_word_count'] = df_test['text'].apply(lambda x: len(set(str(x).split())))

#Number of stop words
df_train['stop_word_count'] = df_train['text'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))
df_test['stop_word_count'] = df_test['text'].apply(lambda x: len([w for w in str(x).lower().split() if w in STOPWORDS]))

#Number of URLs
df_train['url_count'] = df_train['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w or 'https' in w]))
df_test['url_count'] = df_test['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w or 'https' in w]))

#Average word length
df_train['mean_word_length'] = df_train['text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))
df_test['mean_word_length'] = df_test['text'].apply(lambda x: np.mean([len(w) for w in str(x).split()]))

#Number of characters
df_train['char_count'] = df_train['text'].apply(lambda x: len(str(x)))
df_test['char_count'] = df_test['text'].apply(lambda x: len(str(x)))

#Number of punctuation marks
df_train['punctuation_count'] = df_train['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))
df_test['punctuation_count'] = df_test['text'].apply(lambda x: len([c for c in str(x) if c in string.punctuation]))

#Number of hashtags
df_train['hashtag_count'] = df_train['text'].apply(lambda x: len([c for c in str(x) if c == '#']))
df_test['hashtag_count'] = df_test['text'].apply(lambda x: len([c for c in str(x) if c == '#']))

#Number of mentions
df_train['mention_count'] = df_train['text'].apply(lambda x: len([c for c in str(x) if c == '@']))
df_test['mention_count'] = df_test['text'].apply(lambda x: len([c for c in str(x) if c == '@']))

#Compare the distributions of the 9 features: disaster tweets (=1) vs. non-disaster tweets (=0), and training data vs. test data
METAFEATURES = ['word_count', 'unique_word_count', 'stop_word_count', 'url_count', 'mean_word_length',
                'char_count', 'punctuation_count', 'hashtag_count', 'mention_count']
DISASTER_TWEETS = df_train['target'] == 1

fig, axes = plt.subplots(ncols=2, nrows=len(METAFEATURES), figsize=(20, 50), dpi=100)

for i, feature in enumerate(METAFEATURES):
    #Compare the distributions of disaster tweets (=1) and non-disaster tweets (=0) (kernel density estimation)
    sns.distplot(df_train.loc[~DISASTER_TWEETS][feature], label='Not Disaster', ax=axes[i][0], color='green', kde=True)
    sns.distplot(df_train.loc[DISASTER_TWEETS][feature], label='Disaster', ax=axes[i][0], color='red', kde=True)
    
    #Compare the distributions of the training data and the test data (kernel density estimation)
    sns.distplot(df_train[feature], label='Training', ax=axes[i][1], kde=True)
    sns.distplot(df_test[feature], label='Test', ax=axes[i][1], kde=True)

    for j in range(2):
        axes[i][j].set_xlabel('')
        axes[i][j].tick_params(axis='x', labelsize=12)
        axes[i][j].tick_params(axis='y', labelsize=12)
        axes[i][j].legend()

    axes[i][0].set_title(f'{feature} Target Distribution in Training Set', fontsize=13)
    axes[i][1].set_title(f'{feature} Training & Test Set Distribution', fontsize=13)

plt.show()

meta_feature.png

You can see that there is no big difference in the distributions between disaster and non-disaster tweets, or between the training and test data. You can also see that tweets containing URLs, hashtags, and mentions are about as common as, or more common than, tweets without them. Notations such as URLs, hashtags, and mentions are unlikely to carry the information needed to distinguish disaster tweets from non-disaster tweets, so it may be better to remove them during preprocessing.
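For reference, the meta features created above make it easy to put numbers on this kind of observation, for example by comparing the average counts per class; a small sketch:

#Compare average URL / hashtag / mention counts between the two classes
print(df_train.groupby('target')[['url_count', 'hashtag_count', 'mention_count']].mean())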

Next, let's find out which keyword words appear most often in disaster tweets and which appear most often in non-disaster tweets. Since target is the integer 1 for disaster tweets and 0 for non-disaster tweets, taking the average value of target for each keyword tells us whether a word tends to appear in disaster tweets (average close to 1) or in non-disaster tweets (average close to 0). Use pandas' groupby to compute the average target for each keyword and add that value as a column of the training data. Then plot the label counts for each keyword, ordered starting from the keywords most associated with disaster tweets.

#For each keyword word, find the average value of target and add that value to the entire training data.
df_train['target_mean'] = df_train.groupby('keyword')['target'].transform('mean')

fig = plt.figure(figsize=(8, 72), dpi=100)

#Check the label distribution for each keyword
sns.countplot(y=df_train.sort_values(by='target_mean', ascending=False)['keyword'],
             hue=df_train.sort_values(by='target_mean', ascending=False)['target'])

plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=12)
plt.legend(loc=1)
plt.title('Target Distribution in Keywords')

plt.show()

#Drop the target_mean column since it is no longer needed
df_train.drop(columns=['target_mean'], inplace=True)

The output is as follows. Displaying the label distribution of all 221 keywords here would make the article very long, so only the top keywords that tend to appear in disaster tweets and the top keywords that tend to appear in non-disaster tweets are shown (in practice you get one tall plot of the full label distribution). target_dist_head_foot.png

Nouns that describe specific disaster-related situations, such as derailment, debris, and wreckage, tend to appear in disaster tweets. On the other hand, aftershock, body bags, and ruin are seemingly disaster-related words that tend not to appear in disaster tweets. This is probably because they are also used as metaphorical expressions.
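A compact way to see the same tendency without the tall plot is to list the keywords with the highest and lowest average target directly; a sketch reusing the same groupby as above:

#Average target per keyword, sorted from most to least disaster-related
keyword_target_mean = df_train.groupby('keyword')['target'].mean().sort_values(ascending=False)
print(keyword_target_mean.head(10))  #keywords that tend to appear in disaster tweets
print(keyword_target_mean.tail(10))  #keywords that tend to appear in non-disaster tweets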

Next, let's check the frequently used words with n-grams. This time we check the frequent unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3) for disaster tweets (= 1) and non-disaster tweets (= 0) separately. First, define a function that generates a list of n-grams. A more detailed explanation of this function is available in a separate article if you are interested.

def generate_ngrams(text, n_gram=1):
    #Tokenize only words that are not on the list of stop words
    token = [token for token in text.lower().split(' ') if token != '' if token not in STOPWORDS]
    #Create n-gram tuples; zip(*) extracts elements with the same index from each of the shifted lists
    ngrams = zip(*[token[i:] for i in range(n_gram)])
    return [' '.join(ngram) for ngram in ngrams]
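As a quick illustration of the behavior, here is an example call on a made-up sentence (the exact output depends on the contents of the wordcloud STOPWORDS list):

#Example: stop words such as "the" and "is" are dropped before building the bigrams
print(generate_ngrams('the sky is burning tonight', n_gram=2))
#Expected output (assuming "the" and "is" are in STOPWORDS): ['sky burning', 'burning tonight']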

Then use this function to compute the unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3) and their frequencies.

#Unigram
disaster_unigrams = defaultdict(int)
nondisaster_unigrams = defaultdict(int)

#Create unigrams of the disaster tweets in df_train
for tweet in df_train[DISASTER_TWEETS]['text']:
    for word in generate_ngrams(tweet):
        disaster_unigrams[word] += 1

#Create unigrams of the non-disaster tweets in df_train
for tweet in df_train[~DISASTER_TWEETS]['text']:
    for word in generate_ngrams(tweet):
        nondisaster_unigrams[word] += 1

#Sort by frequency of occurrence
df_disaster_unigrams = pd.DataFrame(sorted(disaster_unigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_unigrams = pd.DataFrame(sorted(nondisaster_unigrams.items(), key=lambda x: x[1])[::-1])

#Bigram
disaster_bigrams = defaultdict(int)
nondisaster_bigrams = defaultdict(int)

for tweet in df_train[DISASTER_TWEETS]['text']:
    for word in generate_ngrams(tweet, n_gram=2):
        disaster_bigrams[word] += 1
        
for tweet in df_train[~DISASTER_TWEETS]['text']:
    for word in generate_ngrams(tweet, n_gram=2):
        nondisaster_bigrams[word] += 1
        
df_disaster_bigrams = pd.DataFrame(sorted(disaster_bigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_bigrams = pd.DataFrame(sorted(nondisaster_bigrams.items(), key=lambda x: x[1])[::-1])

#Trigram
disaster_trigrams = defaultdict(int)
nondisaster_trigrams = defaultdict(int)

for tweet in df_train[DISASTER_TWEETS]['text']:
    for word in generate_ngrams(tweet, n_gram=3):
        disaster_trigrams[word] += 1
        
for tweet in df_train[~DISASTER_TWEETS]['text']:
    for word in generate_ngrams(tweet, n_gram=3):
        nondisaster_trigrams[word] += 1
        
df_disaster_trigrams = pd.DataFrame(sorted(disaster_trigrams.items(), key=lambda x: x[1])[::-1])
df_nondisaster_trigrams = pd.DataFrame(sorted(nondisaster_trigrams.items(), key=lambda x: x[1])[::-1])

First, let's take a look at the 30 most frequently occurring unigrams.

N = 30 #Show only the top 30 unigrams

fig, axes = plt.subplots(ncols=2, figsize=(15, 15), dpi=100)
plt.tight_layout()

sns.barplot(y=df_disaster_unigrams[0].values[:N], x=df_disaster_unigrams[1].values[:N], ax=axes[0], color='red')
sns.barplot(y=df_nondisaster_unigrams[0].values[:N], x=df_nondisaster_unigrams[1].values[:N], ax=axes[1], color='green')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=13)
    axes[i].tick_params(axis='y', labelsize=13)

axes[0].set_title(f'Top {N} most common unigrams in Disaster Tweets', fontsize=15)
axes[1].set_title(f'Top {N} most common unigrams in Non-disaster Tweets', fontsize=15)

plt.show()

unigrams.png

You can see that many of the most frequent unigrams are symbols, stop words that could not be removed, and numbers, for both disaster and non-disaster tweets. These unigrams are not useful for predicting target and should be removed before modeling.

You can also see that the unigrams that frequently appear in disaster tweets provide specific information about disasters. On the other hand, many of the unigrams that frequently appear in non-disaster tweets are verbs. This is probably because non-disaster tweets tend to be about the users themselves or what they are doing.
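One way to act on this observation later (a sketch of my own, separate from the preprocessing in the next section) would be to additionally drop tokens that consist only of symbols or digits when building n-grams; the helper name is hypothetical:

#Sketch: filter out tokens that are purely symbols or purely digits
def is_informative(token):
    stripped = token.strip(string.punctuation)
    return stripped != '' and not stripped.isdigit()

print([t for t in 'there is 1 fire - really'.split() if is_informative(t)])
#-> ['there', 'is', 'fire', 'really']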

Let's also look at bigrams and trigrams.

#Bigram
fig, axes = plt.subplots(ncols=2, figsize=(20, 15), dpi=100)
plt.subplots_adjust(wspace=0.4, hspace=0.6)

sns.barplot(y=df_disaster_bigrams[0].values[:N], x=df_disaster_bigrams[1].values[:N], ax=axes[0], color='red')
sns.barplot(y=df_nondisaster_bigrams[0].values[:N], x=df_nondisaster_bigrams[1].values[:N], ax=axes[1], color='green')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=20)
    axes[i].tick_params(axis='y', labelsize=20)

axes[0].set_title(f'Top {N} most common bigrams in Disaster Tweets', fontsize=20)
axes[1].set_title(f'Top {N} most common bigrams in Non-disaster Tweets', fontsize=20)

plt.show()

#Trigram
fig, axes = plt.subplots(ncols=2, figsize=(20, 15), dpi=100)
plt.subplots_adjust(wspace=0.7, hspace=0.6)

sns.barplot(y=df_disaster_trigrams[0].values[:N], x=df_disaster_trigrams[1].values[:N], ax=axes[0], color='red')
sns.barplot(y=df_nondisaster_trigrams[0].values[:N], x=df_nondisaster_trigrams[1].values[:N], ax=axes[1], color='green')

for i in range(2):
    axes[i].spines['right'].set_visible(False)
    axes[i].set_xlabel('')
    axes[i].set_ylabel('')
    axes[i].tick_params(axis='x', labelsize=20)
    axes[i].tick_params(axis='y', labelsize=20)

axes[0].set_title(f'Top {N} most common trigrams in Disaster Tweets', fontsize=20)
axes[1].set_title(f'Top {N} most common trigrams in Non-disaster Tweets', fontsize=20)

plt.show()

bigrams.png

trigrams.png

It can be seen that, for both bigrams and trigrams, disaster tweets commonly contain specific disaster-related content. You can also see that, unlike the unigrams, the frequent bigrams and trigrams in disaster tweets rarely feature symbols, stop words, or numbers. In non-disaster tweets, on the other hand, delimiters and stop words still appear, along with many words such as reddit and youtube.

4. Data preprocessing

Exploratory data analysis revealed that tweets needed to be stripped of information that wasn't needed to build the model. Before building the model, some preprocessing is performed on the training data and test data.

First, convert contracted forms such as "I'm" and "we've" back to "I am" and "we have". The contractions module imported earlier provides contractions.fix(text), which expands contractions back to their original form.

def fix_contractions(text):
    return contractions.fix(text)

#Example tweet before applying the function
print("tweet before contractions fix : ", df_train.iloc[1055]["text"])

#Apply function
df_train['text']=df_train['text'].apply(lambda x : fix_contractions(x))
df_test['text']=df_test['text'].apply(lambda x : fix_contractions(x))

#Tweet example after applying the function
print("tweet after contractions fix : ", df_train.iloc[1055]["text"])

Execution result


tweet before contractions fix :  @asymbina @tithenai I'm hampered by only liking cross-body bags. I really like Ella Vickers bags: machine washable. http://t.co/YsFYEahpVg
tweet after contractions fix :  @asymbina @tithenai I am hampered by only liking cross-body bags. I really like Ella Vickers bags: machine washable. http://t.co/YsFYEahpVg

Next, remove only the URLs from tweets that contain them, using a regular expression.

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

#Example tweet before applying the function
print("tweet before URL removal : ", df_train.iloc[1055]["text"])

#Apply function
df_train['text']=df_train['text'].apply(lambda x : remove_URL(x))
df_test['text']=df_test['text'].apply(lambda x : remove_URL(x))

#Tweet example after applying the function
print("tweet after URL removal : ", df_train.iloc[1055]["text"])

Execution result


tweet before URL removal :  @asymbina @tithenai I am hampered by only liking cross-body bags. I really like Ella Vickers bags: machine washable. http://t.co/YsFYEahpVg
tweet after URL removal :  @asymbina @tithenai I am hampered by only liking cross-body bags. I really like Ella Vickers bags: machine washable. 

Next, let's remove the symbols. This removes the characters !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~, including the hashtag (#) and mention (@) symbols. The list of symbols to remove can be obtained with string.punctuation.

def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

#Example tweet before applying the function
print("tweet before punctuation removal : ", df_train.iloc[1055]["text"])

#Apply function
df_train['text']=df_train['text'].apply(lambda x : remove_punct(x))
df_test['text']=df_test['text'].apply(lambda x : remove_punct(x))

#Tweet example after applying the function
print("tweet after punctuation removal : ", df_train.iloc[1055]["text"])

Execution result


tweet before punctuation removal :  @asymbina @tithenai I am hampered by only liking cross-body bags. I really like Ella Vickers bags: machine washable. 
tweet after punctuation removal :  asymbina tithenai I am hampered by only liking crossbody bags I really like Ella Vickers bags machine washable 

The tweet body has been cleaned up by the above three steps. In the next part, I would like to vectorize and build models using this cleaned text!
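If you prefer not to re-run the batch code at the end of this article before Part 2, one option is to save the cleaned data frames to csv; a sketch (the file names are my own choice, not required by this hands-on):

#Sketch: persist the cleaned data for reuse in the second part; file names are arbitrary
df_train.to_csv('train_cleaned.csv', index=False)
df_test.to_csv('test_cleaned.csv', index=False)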

An article on vectorization and modeling is currently being written! Please wait!

Referenced notebook

Competition notebook referred to (or used) in this article

・ https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert (EDA, data preprocessing, vectorization, BERT)

・ https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove (EDA, data preprocessing, GloVe)

Pre-processing to be performed before proceeding to the next chapter

Please proceed to the next chapter with the following code executed

#Data analysis
import pandas as pd
import numpy as np

#Visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

#Natural language processing
import string #to get the list of symbols (punctuation)
import re
import contractions #to expand contractions
from wordcloud import STOPWORDS #to get the list of stop words
from collections import defaultdict #used when creating n-grams

#Reading training data and test data
df_train = pd.read_csv('train.csv', dtype={'id': np.int16, 'target': np.int8})
df_test = pd.read_csv('test.csv', dtype={'id': np.int16})

#Expand contractions
def fix_contractions(text):
    return contractions.fix(text)

#Delete URL
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

#Remove symbols (punctuation)
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

#Apply function
df_train['text']=df_train['text'].apply(lambda x : fix_contractions(x))
df_test['text']=df_test['text'].apply(lambda x : fix_contractions(x))

df_train['text']=df_train['text'].apply(lambda x : remove_URL(x))
df_test['text']=df_test['text'].apply(lambda x : remove_URL(x))

df_train['text']=df_train['text'].apply(lambda x : remove_punct(x))
df_test['text']=df_test['text'].apply(lambda x : remove_punct(x))

