I'm tired of Python, so I tried to analyze the data with nehan (I want to go to live shows even during the corona crisis - Part 2)

Greeting

Hello, this is sunfish. We are continuing the **"Twitter × Corona"** analysis. In Part 1, we morphologically analyzed the tweet text and got as far as counting the daily occurrences of frequent words.

↓ The 27 selected frequent words (image: スクリーンショット 2020-10-05 17.53.26.png)

Searching Twitter data for upward- and downward-trending words

More than half a year has passed since the coronavirus became a social problem. Let's use the tweets to trace which topics are gaining attention and which are being forgotten. In this second part, we use regression analysis to find words with upward or downward trends.

Data

We will use the per-day, per-word occurrence counts created in Part 1. ↓ Data (image: スクリーンショット 2020-10-12 18.37.23.png) ↓ Visualized (image: スクリーンショット 2020-10-05 17.56.28.png)

Preparing for regression analysis

As the days go by, we would like to find which words appear more and which appear less. In other words, we want to fit

y (number of tweets for a specific word) = a × x (number of days elapsed) + b

We derive such a regression equation for each word and observe the slope `a` and the correlation coefficient. As a data operation, we need to compute the "days elapsed" from the date data. The approach: within each word, assign serial numbers to the dates in ascending order.

```python
import pandas as pd

port_23 = port_22.copy()
model_params = {'method': 'first', 'ascending': True}
port_23['Created_At'] = pd.to_datetime(port_23['Created_At'])
# Rank the dates within each word: the rank is the 1-based "days elapsed" index
port_23['index'] = port_23.groupby('word')['Created_At'].rank(**model_params)
port_23['Created_At'] = port_23['Created_At'].dt.date
```
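To see what the rank-based numbering does, here is a minimal sketch on toy data (the words and counts below are hypothetical, not from the actual dataset):

```python
import pandas as pd

# Toy data (hypothetical): daily counts for two words
df = pd.DataFrame({
    'word': ['live', 'live', 'live', 'government', 'government'],
    'Created_At': pd.to_datetime(['2020-04-01', '2020-04-02', '2020-04-03',
                                  '2020-04-01', '2020-04-02']),
    'count': [3, 5, 8, 9, 6],
})

# Same numbering as above: date ranks restart from 1 within each word
df['index'] = df.groupby('word')['Created_At'].rank(method='first', ascending=True)
print(df['index'].tolist())  # [1.0, 2.0, 3.0, 1.0, 2.0]
```

Each word's dates are numbered independently, which is exactly the "days elapsed" variable the regression needs.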

スクリーンショット 2020-10-12 18.53.53.png

↓ Pay attention to the x-axis. Now that we have the number of days elapsed, we are ready for regression analysis. スクリーンショット 2020-10-12 18.57.13.png

Performing regression analysis, word by word

From the regression results, we will observe how each of the 24 selected words changes with the number of days elapsed. Written in plain Python, this requires a somewhat tedious loop over each word:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.spatial.distance import cdist

group_keys = ['word']
X_columns = ['index']
Y_column = 'count'
groups = port_23.groupby(group_keys)
models = {}
summaries = {}

def corr_xy(X, y):
    """Correlation coefficient between the objective variable and each explanatory variable"""
    X_label = X.columns.tolist()
    X = X.T.values
    y = y.values.reshape(1, -1)
    # cdist's 'correlation' metric returns 1 - r, so convert back to r
    corr = 1 - cdist(X, y, metric='correlation')
    corr = pd.DataFrame(corr, columns=['Correlation coefficient with the objective variable'])
    corr['Explanatory variable'] = X_label
    return corr

for i, g in groups:
    X = g[X_columns]
    Y = g[Y_column].squeeze()
    corr = corr_xy(X, Y)
    try:
        model = sm.OLS(Y, sm.add_constant(X, has_constant='add')).fit()
        model.X = X.columns
        models[i] = model
        summary = pd.DataFrame(
            {
                # model.params includes the added 'const' term as well as 'index'
                'Explanatory variable': model.params.index,
                'coefficient': np.round(model.params, 5),
                'standard deviation': np.round(model.bse, 5),
                't value': np.round(model.tvalues, 5),
                'Pr(>|t|)': np.round(model.pvalues, 5)
            },
            columns=['Explanatory variable', 'coefficient', 'standard deviation',
                     't value', 'Pr(>|t|)'])
        summary = summary.merge(corr, on='Explanatory variable', how='left')
        summaries[i] = summary
    except Exception:
        continue

res = []
for key, value in summaries.items():
    value[group_keys] = key
    res.append(value)

concat_summary = pd.concat(res, ignore_index=True)
port_24 = models
port_25 = concat_summary
```

↓ In nehan, the troublesome loop can be avoided with the "Create model for each group" option. (image: スクリーンショット 2020-10-12 20.04.14.png)

And we have the regression results for each word. The row whose explanatory variable is const holds the intercept. (image: スクリーンショット 2020-10-12 19.13.16.png)

Focusing on upward- and downward-trending words

There are various possible interpretations, but here we treat a word as trending if its correlation coefficient with elapsed days is at least 0.4 in absolute value, and extract those words.

```python
# Keep rows whose correlation with elapsed days is <= -0.4 or >= 0.4
port_27 = port_25[(port_25['Correlation coefficient with the objective variable'] <= -0.4) |
                  (port_25['Correlation coefficient with the objective variable'] >= 0.4)]
```
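The same two-sided condition can be written more compactly with `.abs()`. A toy version (the words and correlation values below are hypothetical, only mimicking the shape of `port_25`):

```python
import pandas as pd

# Toy summary table (hypothetical values) mimicking port_25
col = 'Correlation coefficient with the objective variable'
summary = pd.DataFrame({'word': ['live', 'mask', 'government'],
                        col: [0.62, 0.10, -0.55]})

# Equivalent to the two-sided comparison: keep |r| >= 0.4
trending = summary[summary[col].abs() >= 0.4]
print(trending['word'].tolist())  # ['live', 'government']
```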

スクリーンショット 2020-10-12 19.14.51.png

Let's take a closer look at these words. (image: スクリーンショット 2020-10-12 19.28.43.png)

Observe the results

Uptrend words

Downtrend words

↓ **Live**, counts by days elapsed (image: スクリーンショット 2020-10-12 19.29.58.png) ↓ **Government**, counts by days elapsed (image: スクリーンショット 2020-10-12 19.30.49.png)
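The up/down direction seen in these charts is just the sign of the fitted slope. For a single word's series, that can be sketched without the full pipeline using a plain least-squares fit (the daily counts below are hypothetical):

```python
import numpy as np

# Hypothetical daily counts for one word over 10 days (upward trend)
days = np.arange(1, 11)
counts = np.array([2, 3, 3, 5, 4, 6, 7, 6, 8, 9])

# Least-squares line y = a*x + b, the same model form as the regression above
a, b = np.polyfit(days, counts, deg=1)
print(a > 0)  # True: an upward-trending word
```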

Summary

The threat of corona has not gone away, but we can see that words we often saw in the news at the height of the crisis have declined, while words strongly affected by self-restraint, such as events and live shows, have risen. Of course, this alone does not prove that **"everyone wants to go to live shows!"**, but I would like to close this theme with that reading of the data so far.

We hope this has conveyed the appeal of nehan, a programming-free analysis tool that takes you from preprocessed data through a variety of analyses and visualizations.
