[PYTHON] Kaggle Memorandum ~ NLP with Disaster Tweets Part 1 ~

Taking on a Kaggle competition

I came back to Kaggle after a long break.

The competition I took on is here ↓ Real or Not? NLP with Disaster Tweets https://www.kaggle.com/c/nlp-getting-started

First, load the dataset files into DataFrames.

import os
import pandas as pd

# Load every CSV in the competition input directory into a DataFrame
# named after the file (train_df, test_df, sample_submission_df)
for dirname, _, filenames in os.walk('../input/nlp-getting-started'):
    for filename in filenames:
        path = os.path.join(dirname, filename)
        exec("{0}_df = pd.read_csv(path)".format(filename.replace(".csv", "")))
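For reference, here is a sketch of a more explicit way to do the same loading without exec, assuming the standard competition files (train.csv, test.csv, sample_submission.csv) are in the input directory.

import os
import pandas as pd

INPUT_DIR = '../input/nlp-getting-started'

# Read each CSV into a dictionary keyed by the file name (no exec needed)
dataframes = {}
for dirname, _, filenames in os.walk(INPUT_DIR):
    for filename in filenames:
        if filename.endswith('.csv'):
            name = filename.replace('.csv', '')
            dataframes[name] = pd.read_csv(os.path.join(dirname, filename))

# Assumes the standard competition files are present
train_df = dataframes['train']
test_df = dataframes['test']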

Thinking that certain words might correlate with whether a tweet reports an actual disaster, I wrote the following code.

# Split each tweet into words and stack them into a single DataFrame
words_df = pd.DataFrame([], columns=['words', 'target_count'])
for index, item in train_df[['text', 'target']].iterrows():
    word_df = pd.DataFrame([], columns=['words', 'target_count'])
    word_df['words'] = item['text'].split(' ')
    word_df['target_count'] = item['target']
    words_df = pd.concat([words_df, word_df])

# Narrow down to words with more than 5 letters to exclude stop words
long_words_df = words_df[words_df['words'].str.len() > 5]
# Group by the same word and display the aggregated result
long_words_df.groupby(['words']).sum().sort_values("target_count", ascending=False)

The result is as follows. It is interesting that the word Hiroshima makes it into the top results.

words        target_count
California   86
killed       86
people       83
suicide      71
disaster     59
Hiroshima    58
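For reference, the same aggregation can also be written without the iterrows loop, using pandas' str.split and explode (a sketch under the same column names; explode requires pandas 0.25 or later).

# Split each tweet into words and expand to one row per word
exploded = (
    train_df[['text', 'target']]
    .assign(words=train_df['text'].str.split(' '))
    .explode('words')
)

# Keep words with more than 5 letters and sum the target flag per word
(
    exploded[exploded['words'].str.len() > 5]
    .groupby('words')['target']
    .sum()
    .sort_values(ascending=False)
    .head(10)
)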
