I'm tired of Python, so I tried analyzing the data with nehan (I want to live positively even amid the corona crisis - Part 1)

Greetings

Hello, this is sunfish. This is the second installment in the **"Twitter x Corona"** series. Last time we only went as far as counting tweets, but this time I will dig a little deeper. If you are worn out by natural language processing chores, such as installing MeCab and setting up an environment, please take a look.

Search for upward- and downward-trending words in Twitter data

More than half a year has passed since the coronavirus became a social problem. Let's use tweets to track which words are on the rise among people and which are being forgotten. In this first part, we carry out morphological analysis and select the words to analyze.

Data

We use the data after the previous preprocessing, i.e., data consisting of the tweet date and the tweet content.

Duplicate deletion process

In this data, the same tweet content actually appears across multiple records and multiple days (because retweets are included). This time we analyze one record per tweet content, removing the retweet bias.

from collections import Counter
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import statsmodels.api as sm
import re
import MeCab
import dask.dataframe as dd
from multiprocessing import cpu_count

# For each tweet text, keep the record with the earliest Created_At (1 tweet = 1 record)
port_12['Created_At'] = pd.to_datetime(port_12['Created_At'])
port_13 = port_12.groupby(['Text']).apply(
    lambda grp: grp.nsmallest(n=1, columns='Created_At', keep='first'))
port_13['Created_At'] = port_13['Created_At'].map(lambda x: x.date())
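
For reference, the same deduplication can be written more directly in plain pandas (a minimal sketch, equivalent to the nehan-exported code above; `port_13_alt` is just an illustrative name):

# Sort by Created_At and keep the earliest row for each tweet text
port_13_alt = (port_12.sort_values('Created_At')
                      .drop_duplicates(subset='Text', keep='first'))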


Morphological analysis

Now for the heart of the language processing. Tagging every part of speech would make the results hard to read, so this time we analyze only **"general nouns"**.

def tokenizer(text, pos, only_surface):
    # Extract either the surface form or the base form of a morpheme,
    # replacing any whitespace with underscores
    def _extract():
        if only_surface:
            return re.sub(r'[\s ]+', '_', feature[0])
        else:
            return re.sub(r'[\s ]+', '_', feature[2])
    _tagger = MeCab.Tagger(
        '-Ochasen -d {}'.format("/var/lib/mecab/dic/mecab-ipadic-neologd"))
    try:
        result = []
        for feature in _tagger.parse(text).split('\n')[:-2]:
            feature = feature.split('\t')
            if pos:
                # Keep only morphemes whose part of speech is listed in pos
                if feature[3] in pos:
                    result.append(_extract())
            else:
                result.append(_extract())
        return ' '.join(result)
    except UnicodeEncodeError:
        return ''
    except NotImplementedError:
        return ''

port_14 = port_13.copy()
port_14['Text_morpheme'] = port_14['Text'].fillna('')
ddf = dd.from_pandas(port_14, npartitions=cpu_count()-1)
target_cols = ['Text_morpheme']
pos = ['名詞-一般']  # "general nouns" -- the POS tag MeCab actually outputs
for target_col in target_cols:
    ddf[target_col] = ddf[target_col].apply(
        tokenizer, pos=pos, only_surface=True, meta=(f'{target_col}', 'object'))
port_14 = ddf.compute(scheduler='processes')

↓ nehan's morphological analysis joins the extracted morphemes with spaces and inserts them into a new column.
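
For illustration, here is roughly what the tokenizer returns for a single sentence (the sample text is hypothetical, and the exact segmentation depends on your mecab-ipadic-neologd version):

# Hypothetical example input; output depends on the dictionary version
print(tokenizer('今日はマスクを買った', pos=['名詞-一般'], only_surface=True))
# => roughly 'マスク' (今日 is tagged 名詞-副詞可能, so it is excluded)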


Note that tweets containing no general nouns end up with an empty morphological-analysis result, so we delete them using missing-value handling.

port_15 = port_14.copy()
# Drop rows whose morphological-analysis result is missing
port_15 = port_15.dropna(subset=None, how='any')
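
One caveat if you run the exported code outside nehan: the tokenizer above returns an empty string rather than a missing value, so the empty results may need to be converted to NaN first (a minimal sketch, assuming plain pandas):

import numpy as np

# Outside nehan, convert empty tokenizer output to NaN before dropping
port_15['Text_morpheme'] = port_15['Text_morpheme'].replace('', np.nan)
port_15 = port_15.dropna(subset=['Text_morpheme'], how='any')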


Select words that appear frequently (*)

Since there is little point in targeting words that rarely appear, we analyze only those that appeared 1,500 times or more over the entire period.

# Aggregate word frequency over the whole period
port_18 = port_15.copy()
flat_words = list(chain.from_iterable(port_18['Text_morpheme'].str.split(' ')))
c = Counter(flat_words)
res = pd.DataFrame.from_dict(c, orient='index').reset_index()
res.columns = ['word', 'count']
port_18 = res

# Row filter by condition: keep words appearing 1,500 times or more
port_20 = port_18[(port_18['count'] >= 1500.0)]

# Column selection
port_21 = port_20[['word']]


↓ The appearance counts of the 27 selected words look like this.

↓ As a bonus, here is a word cloud before the selection.
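
The word cloud is drawn inside nehan, but something similar can be reproduced with the wordcloud package (a minimal sketch; the font path is an assumption and must point at a Japanese font installed on your machine):

from wordcloud import WordCloud

# font_path is a placeholder; point it at any installed Japanese font
wc = WordCloud(font_path='/usr/share/fonts/truetype/fonts-japanese-gothic.ttf',
               background_color='white', width=800, height=400)
wc.generate_from_frequencies(dict(zip(port_18['word'], port_18['count'])))
wc.to_file('wordcloud.png')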

Aggregate daily word counts and narrow them down to the frequent words

Having narrowed down the target words in the previous step, we next create the daily data: aggregate word frequency using Created_At as the key column.

port_16 = port_15.copy()
target_col = 'Text_morpheme'
groupby_cols = ['Created_At']
tmp = port_16[groupby_cols+[target_col]]
# Concatenate all tokenized tweets of each day into one document
tmp = tmp.groupby(groupby_cols)[target_col].apply(lambda x: ' '.join(x))
# Count word occurrences per day, then melt into (Created_At, word, count) rows
vec_counter = CountVectorizer(tokenizer=lambda x: x.split(' '))
X = vec_counter.fit_transform(tmp)
res = pd.DataFrame(X.toarray(), columns=vec_counter.get_feature_names(), index=tmp.index
                   ).reset_index().melt(id_vars=groupby_cols, var_name='word', value_name='count')
port_16 = res.sort_values(groupby_cols).reset_index(drop=True)



Here we join the daily data with the data from (*), narrowing it down to the words to be analyzed, and the work of the first part is complete.

# Inner join: keep daily counts only for the selected words
port_22 = pd.merge(port_21, port_16, how='inner',
                   left_on=['word'], right_on=['word'])


↓ As a trial, we filter the resulting data and visualize the number of occurrences of a word per day. A smile is better than a crying face.
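
The chart above is drawn in nehan, but the same check can be sketched in plain Python (the word '笑' is a hypothetical example; substitute any of the 27 selected words):

import matplotlib.pyplot as plt

# Filter to one word and plot its daily occurrence counts
word_df = port_22[port_22['word'] == '笑'].sort_values('Created_At')
plt.plot(word_df['Created_At'], word_df['count'])
plt.xlabel('Created_At')
plt.ylabel('count')
plt.title('Daily occurrences of 笑')
plt.show()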

Summary

It took a while, but we obtained the daily counts of the frequently appearing words. We valued simplicity over rigor. Analyses like this tend to require long and difficult code, but nehan does it with 10 nodes (the number of green circles). Of course, I didn't write any programs. I hope this makes you at least a little curious about nehan.
