Tweet analysis with Python, MeCab and CaboCha

Introduction

This article is the day 22 entry of Python Advent Calendar 2014 on Qiita.

Last week, I wrote a blog post called "BigQuery x Perfume x tweet analysis" for my company's Advent Calendar. There, I analyzed tweets about Perfume collected over the week from 12/12 (Friday) to 12/18 (Thursday) by loading them into BigQuery.

This time, as a follow-up, I will analyze the tweets with natural language processing using MeCab and CaboCha m (_ _) m

References

So I looked into tweets about the House of Representatives election (Qiita)
https://github.com/mima3/stream_twitter

Rather than just a reference, most of what I do here is borrowed directly from mima_ita's work ...

Environment

・ Mac OS X 10.9.5
・ Python 2.7.8

Twitter data collection

This time, I used a service called mention. With mention, you can easily pull data from social networks using keywords set on the management screen. Export the collected data to CSV from the management screen, and you are ready to go.

The conditions specified this time are as follows.

Search keywords: Perfume || prfm || perfume_um (prfm and perfume_um are hashtags used in posts about Perfume)

Negative keywords: RT

Target SNS: Twitter

Language: Japanese

Period: 12/12 (Friday) - 12/18 (Thursday)

Install MeCab and CaboCha

MeCab: http://salinger.github.io/blog/2013/01/17/1/
CaboCha: http://qiita.com/ShingoOikawa/items/ef4ac2929ec19599a3cf

Following these articles, I had no problems (`・ ω ・ ´) ゞ

Code used for collection and analysis

https://github.com/mima3/stream_twitter

However, that repository uses the Streaming API to collect Twitter data and store it in the DB. This time, by contrast, the data collected by mention and exported as CSV has to be loaded into the DB.

So I store it in SQLite with the following script.

create_database.py


#!/usr/bin/python
# -*- coding: utf-8 -*-
import sqlite3
import csv

if __name__ == '__main__':
    con = sqlite3.connect("twitter_stream.sqlite")

    c = con.cursor()
    # Create table
    c.execute('''CREATE TABLE "twitte" ("id" INTEGER NOT NULL PRIMARY KEY, "createAt" DATETIME NOT NULL, "idStr" VARCHAR(255) NOT NULL, "contents" VARCHAR(255) NOT NULL);
    ''')
    c.execute('''CREATE INDEX "twitte_createAt" ON "twitte" ("createAt");''')
    c.execute('''CREATE INDEX "twitte_idStr" ON "twitte" ("idStr")''')

    # Insert data: CSV column 0 -> idStr, 1 -> contents, 4 -> createAt
    data = []
    # Python 2's csv module expects the file opened in binary mode
    with open("./perfume_tweet.csv", "rb") as f:
        for i, row in enumerate(csv.reader(f), 1):
            idStr = unicode(row[0], 'utf-8')
            contents = unicode(row[1], 'utf-8')
            data.append((i, row[4], idStr, contents))
    con.executemany(u"insert into twitte values(?,?,?,?)", data)

    # Save (commit) the changes
    con.commit()

    # Close the connection now that we are done with it
    con.close()

Running this creates twitter_stream.sqlite in the current directory.
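To confirm that the import worked, a quick check along these lines can help. This is only a minimal sketch, written for Python 3 rather than the Python 2.7 used in this article. The function name summarize_tweets is made up for illustration; only the table and column names come from the script above.

```python
import sqlite3


def summarize_tweets(con):
    """Return (row count, earliest createAt, latest createAt) for the twitte table."""
    cur = con.execute(
        'SELECT COUNT(*), MIN("createAt"), MAX("createAt") FROM "twitte"'
    )
    return cur.fetchone()
```

With the database created above, calling summarize_tweets(sqlite3.connect("twitter_stream.sqlite")) should report how many rows were loaded and the time range they cover.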

Histogram by time

Here, I pick 12/17 (Wednesday) and count the number of tweets per hour over that day.

python twitter_db_hist.py "2014/12/16 15:00" "2014/12/17 15:00" 3600

figure_1.png

Looking at this, we can make the following observations.

・ The hour from 05:00 to 06:00 has the fewest tweets. Counts are low all the way from 02:00 to 06:00, so most people are probably asleep then.
・ The hour from 23:00 to 24:00 has the most tweets. Counts are generally high from 18:00 to 24:00, peaking in the last hour of the day. Perhaps many people tweet about Perfume and then go to sleep?

It's a rough reading, but the overall pattern comes through.
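The hourly counting itself is simple to sketch. The following is not the code of twitter_db_hist.py, just a minimal Python 3 illustration of bucketing timestamps into fixed-width intervals (3600 seconds in the command above). The function name bucket_counts and its arguments are my own assumptions.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def bucket_counts(timestamps, start, end, interval_sec):
    """Count timestamps in [start, end) per fixed-width bucket.

    Returns a list of (bucket_start, count) pairs, including empty buckets.
    """
    counts = Counter()
    for ts in timestamps:
        if start <= ts < end:
            # Index of the bucket this timestamp falls into
            index = int((ts - start).total_seconds() // interval_sec)
            counts[index] += 1
    n_buckets = math.ceil((end - start).total_seconds() / interval_sec)
    width = timedelta(seconds=interval_sec)
    return [(start + i * width, counts[i]) for i in range(n_buckets)]
```

Passing the createAt values parsed into datetime objects, a one-day range, and 3600 gives 24 buckets that can be plotted as a bar chart.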

Frequent word extraction

Next, we will perform morphological analysis using MeCab. Here, we process the collected tweets for the whole week.

python twitter_db_mecab.py "2014/12/11 15:00" "2014/12/17 15:00" > mecab.txt
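For readers who have not seen MeCab output before, here is a rough idea of what such a count involves. This is a sketch under assumptions, not the contents of twitter_db_mecab.py: it takes MeCab's default text output (one "surface<TAB>features" line per token, with "EOS" ending each sentence) and counts surface forms. Restricting the count to nouns is my own choice for the illustration, and count_nouns is a made-up name. It is written for Python 3.

```python
from collections import Counter


def count_nouns(mecab_output):
    """Count noun surface forms in MeCab's default text output.

    Each token line looks like "surface<TAB>pos,subpos,...";
    a line containing only "EOS" ends a sentence.
    """
    counts = Counter()
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, features = line.split("\t", 1)
        pos = features.split(",")[0]
        if pos == "名詞":  # "noun" in the IPA dictionary's tag set
            counts[surface] += 1
    return counts
```

Calling count_nouns(text).most_common(100) on the concatenated MeCab output would give a top-100 list in the same spirit as the one in this section.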

Below is a list of the top 100.

word count
Perfume 10935
prfm 2739
perfume 2136
Follow 1553
1478
Chiru 1462
Noru 1448
Regular 1410
- 1347
Like 1256
Video 1218
Chan 1204
Perfume 1056
Man 996
YouTube 945
Mutual 889
During ~ 850
Summary 837
sougofollow 775
live 737
um 726
Teru 568
Tame 555
www 552
Breaking news 552
male 551
Oomoto 544
Ayano 544
Circle 543
marriage 540
En 537
lose the temper 535
General 534
493
bid 480
storm 433
Absent 431
Day 420
Ku 419
Yahoo auction 413
Year 408
you 401
Current 398
price 377
Time 374
Date and time 374
Song 364
View 356
Cling 350
One 345
Black 342
End 341
Eye 341
Pafukura 335
number 332
listen 331
of 330
thing 324
Please 318
yauc 308
DVD 308
Limited 307
Board 300
ticket 295
I love You 285
Sheet 282
love 276
First time 271
Yuka 270
natural 263
Month 263
:-, 262
Sa 261
Sakanaction 254
Give me 250
!: 250
Peaches 247
Hope 240
nowplaying 240
Music 239
FC 237
Rank 230
mask 229
Chowder 228
love 219
soil 217
come 217
217
Apple 214
Player 213
To be 210
Sekaowa 209
Kashi 207
mp 206
source 205
dance 201
Explosive sound 201
so 201
Shiina 200
200

The top-ranked words come from bot tweets, so they are not meaningful data. As a rule of thumb, meaningful words start to appear among those with a count of 500 or less.

During this data collection period, there were many tweets about the event "Masked Chowder ~ YAJIO CRAZY ~ Chowder University International Collagen High School", which was held on Saturday, December 20th.

Event name related: "Mask" "Chowder"
Ticket information related: "Bid" "Yahoo auction" "Price" "yauc" "Ticket" "Sheet" (I wasn't familiar with yauc, but it seems to be a Yahoo Auction hashtag)

Beyond that, I could roughly pick out the following groups.

Perfume song title related: "Cling" (Cling Cling) "Natural" "Love" (In love with Natural)
Singer related: "Arashi" "Momo" "Kuro" "Sakanaction" "Sekaowa" "Ringo" "Shiina" (in the table above, Arashi, Momo, Kuro and Ringo appear under their literal translations "storm", "Peaches", "Black" and "Apple")
Perfume Music Player related: "Listen" "Music" "Player" "mp"

Dependency analysis

Next, let's use CaboCha to aggregate the dependency relationships of clauses. Again, we are analyzing a week's worth of tweets.

python twitter_db_cabocha.py "2014/12/11 15:00" "2014/12/17 15:00" > cabocha.txt
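As a rough picture of what aggregating dependencies means, the sketch below counts (modifier chunk, head chunk) pairs from CaboCha's lattice output (the format produced with -f1), where each chunk starts with a header line such as "* 0 1D 0/1 0.000000" and a head index of -1 marks the root. It is not the code of twitter_db_cabocha.py; the function name count_dependency_pairs is an assumption, and it is written for Python 3.

```python
from collections import Counter


def count_dependency_pairs(lattice_output):
    """Count (modifier chunk, head chunk) pairs in CaboCha lattice (-f1) output."""
    pairs = Counter()
    chunks, links = [], []  # per-sentence chunk texts and head indices
    for line in lattice_output.splitlines():
        if line.startswith("* "):
            # Chunk header: "* <id> <head>D <head/func> <score>"
            fields = line.split(" ")
            links.append(int(fields[2].rstrip("D")))
            chunks.append("")
        elif line == "EOS":
            # Sentence finished: emit one pair per chunk that has a head
            for text, head in zip(chunks, links):
                if head >= 0:  # -1 marks the root chunk
                    pairs[(text, chunks[head])] += 1
            chunks, links = [], []
        elif "\t" in line and chunks:
            # Token line: append its surface form to the current chunk
            chunks[-1] += line.split("\t", 1)[0]
    return pairs
```

The most_common() method of the returned Counter then yields the pairs in descending order of count.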

Let's look at the top 100 in the same way.

phrase1 phrase2 count
Perfume Noru 582
During live Furious www http://t 535
Noru Furious www http://t 535
●● Furious www http://t 535
~Chan During live 535
Ayano Omoto General male marriage 532
This Ayano Omoto General male 532
PerfumeMusicPlayer listen 138
RT#Mr. Pafukura Connect 137
Absent http://t 131
Kuu http://t 127
Mr. Saito http://t 127
- Mr. Saito 127
Connect #Pafukura http://t 125
___VAMPS ___One Ok 97
Black ___L'Arc 97
___glee ___UVERworld 97
___UVERworld ___VAMPS 97
___Peaches Black 97
___Bz ___One Ok 97
___L'Arc ___Bz 97
___Ringo Sheena___GLAY ___glee 93
『[BGM for work]perfumemix』 http://t 91
Love Keru 90
#Perfume#I like Perfume Connect 89
[Diffusion hope] ● Explosive event schedule ● Details___http ://t 89
Nino Much 87
"secret Arashi-chan 86
Just just Keru 86
radio Keru 86
thing Kariru 86
"Dengeki Marriage~perfumeoflove~___Episode 1" http://t 84
~Chan Keru 83
Ste ___20131115』http://t 77
___MUSIC STATION 77
STATION Ste 77
FC2 video: Keru 75
___Perfume ___One Ok 74
natural I miss you 68
Hama Okamoto (OKAMOTO'S) 63
Like thing 62
One song Vote 60
this year One song 60
word 『Perfume』 58
- -#prfm#perfume_um 56
Ah ~Chan 55
Back story Talk http://t 50
12/20 (Sat) Osaka Castle Hall ticket 50
2008-4-5 GAME release Back story Talk 50
[3 princesses Back story Talk 50
Pleasant Back story Talk 50
『PerfumeTalk 2008-4-5 GAME release 50
PerfumeMusicPlayer listen 48
Kashino Yuka To inform 47
~Chan To inform 47
Perfume ~Chan 47
Like Man 45
Noru To inform 45
natural I love you 45
hand connect 43
Sekaowa Momokuro 43
nice to meet you please 43
thing is there 43
Like One 41
Perfume Kashino Yuka 40
member Draw 38
now Check http://t 36
soon Check http://t 36
Perfume(Perfume) ticket 36
Fit One 36
co/ 1IoZn9U583 35
Momokuro Perfume 35
Qi Become 34
Perfume GLAY 33
co/ 7CRGN21Brf) 33
One Follow me 33
winter Era 33
___http ://t 32
(#Perfume Kuu 32
Guess Extreme Maiden 32
H ___GLAY\720 31
Noru #prfm 31
/ Sandaime J Soul Brothers 31
Kariru ___GLAY\720 31
Sandaime J Soul Brothers ___GLAY\720 31
(Watts Inn) 2015 31
Two persons Is 31
2015 January issue[magazine]http://t 31
___TEAM H 31
Feel free To follow 30
heart Sports 29
12/20 (Sat) Details of Osaka Castle Hall Here 29
you Email 28
Perfume member 28
word "Perfume Cosplay" 28
Draw ☆ Ultimate work ☆(> 28
[Price or less] ★ Transfer ★ Masked Chowder YAJIO CRAZY Chowder University International Collagen High School 12/20 (Sat) Osaka Castle Hall 28
Fit Man 28
thing Karu 27
Like Artist 27

Again, the dependencies with a count of 500 or more are bot noise, so let's skip them. Incidentally, the bots were posting the following two tweets.

Perfume's Nocchi A-chan During the live ●● and rage www http://t.co/4Q0fmhel2l
Perfume "Nocchi" Ayano Omoto Married to a general man? http://t.co/k6cFhLZwnZ

The dependency table is hard to read as it is, so I grepped it for the names of the three members. Among the dependency pairs, those that include "A-chan", "Kashiyuka", or "Nocchi" are shown below. (* Only pairs whose phrases were cleanly segmented are extracted)
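Bot tweets like these tend to repeat the same text with only the shortened URL changing. One possible way to spot them before analysis, which I did not do in this article, is to strip URLs and count identical texts. The sketch below is a hypothetical Python 3 helper; the name repeated_texts and the threshold argument are assumptions.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")


def repeated_texts(tweets, min_count):
    """Return (normalized text, count) for texts seen at least min_count times."""
    # Remove URLs so that tweets differing only in their link compare equal
    normalized = (URL_PATTERN.sub("", t).strip() for t in tweets)
    counts = Counter(normalized)
    return [(text, n) for text, n in counts.most_common() if n >= min_count]
```

Texts returned with a high count are candidates for exclusion, though legitimate tweets that quote the same headline would be caught too, so the threshold needs some care.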

phrase1 phrase2 count
Kashino Yuka finger 2
Kashino Yuka hair 2
Kashino Yuka Shake 2
Kashino Yuka cute 1
Kashino Yuka cute 1
Kashino Yuka Head 1
Kashino Yuka skirt 1
divine Kashino Yuka 1
Kashino Yuka left hand 1
Kashino Yuka voice 1
Nocchi Beautiful 2
A-chan angel 2
A-chan One piece / collar 1
A-chan's Dumplings 1
A-chan's Smile 1

As you can see, many of the dependencies involve Kashiyuka (Kashino Yuka)! Looking at these dependencies, the characteristics of each of the three members come through, which is wonderful ...
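The grep step can also be done in Python. The following Python 3 sketch assumes that each line of cabocha.txt is laid out as "phrase1<TAB>phrase2<TAB>count"; that layout and both function names are assumptions for illustration, so adjust the parsing to the actual file.

```python
def parse_rows(text):
    """Parse lines of "phrase1<TAB>phrase2<TAB>count" into tuples (assumed layout)."""
    rows = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and parts[2].isdigit():
            rows.append((parts[0], parts[1], int(parts[2])))
    return rows


def filter_pairs(rows, names):
    """Keep (phrase1, phrase2, count) rows where either phrase contains any name."""
    return [
        row for row in rows
        if any(name in row[0] or name in row[1] for name in names)
    ]
```

Passing the three members' names as they appear in the original Japanese tweets picks out the same kind of rows as the grep above; the hand-picking of cleanly segmented phrases still has to be done by eye.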

Finally

As described above, with the help of mima_ita's work, I analyzed the tweets about Perfume. Looking back, it's a pity that the collection conditions I used this time pulled in a large number of bot tweets that are useless for analysis ...

I still have a lot to learn about data analysis and Python, so I will buy this book, which is being released next week, and study it (`・ ω ・ ´) ゞ
Introduction to programming in Python language: World Standard MIT textbook
