[PYTHON] Search for Pokemon sighting information from Twitter

Now that Pokemon GO has been released, I want a way to sift through Twitter information efficiently. Let's try word2vec with Python 3 and mine the word-of-mouth data to reach the information we want.

Advance preparation

Click here for installing Python, and here for installing MeCab + neologd.

  1. Install MeCab
  2. Confirm that python3 works on the command line
  3. Add add-on to Eclipse (STS)
  4. Module installation with pip3
  5. HelloWorld
  6. Get information on Twitter
  7. Training model creation and data extraction

Let's try it!

2. Confirm that python3 works on the command line

[murotanimari]$  python3 --version
Python 3.5.2
[murotanimari]$ pip3 --version
pip 8.1.2 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (python 3.5)

3. Add add-on to Eclipse (STS)

Install PyDev from http://pydev.org/updates/; once it is installed, you can create a PyDev Project, so create a new one.

4. Module installation with pip3

pip3 install gensim
pip3 install argparse
pip3 install prettyprint

pip3 install word2vec
pip3 install print
pip3 install pp
pip3 install nltk # not needed for Japanese-only processing
pip3 install tweepy
pip3 install scipy

# for Japanese
brew install mecab
brew install mecab-ipadic
pip3 install mecab-python3
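
Before moving on to Eclipse, it may be worth a quick sanity check that mecab-python3 imports and parses correctly. A minimal sketch; the sample sentence is just the classic MeCab test phrase:

import MeCab

# If the install succeeded, this prints one morpheme per line in ChaSen format
tagger = MeCab.Tagger("-Ochasen")
print(tagger.parse("すもももももももものうち"))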
5. HelloWorld

HelloWorld.py


import argparse
import nltk
from gensim.models import word2vec

# One-time download of the NLTK corpora (not needed for Japanese-only use)
nltk.download('all')

print("Hello, World!")

6. Get information on Twitter

ParseJP.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys
import tweepy
import json
import datetime
import MeCab

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
access_token = "*****************"
access_token_secret = "*****************"
consumer_key = "*****************"
consumer_secret = "*****************"

#Create the MeCab tagger once instead of per tweet
tagger = MeCab.Tagger("-Ochasen")

#This is a basic listener that extracts nouns and verbs from received tweets
#and appends them, space-separated, to a dated file under /tmp.
class StdOutListener(StreamListener):
    def on_data(self, data):
        jsondata = json.loads(data)
        sentence = jsondata["text"]
        found = ""

        try:
            tagged = tagger.parse(sentence)
            out = ""
            for item in tagged.split('\n'):
                item = str(item).strip()
                if item == '':
                    continue

                fields = item.split("\t")
                found = ""
                if 'EOS' not in item:
                    #-Ochasen output: fields[2] is the base form, fields[3] the part of speech
                    if "名詞" in fields[3]:  # noun
                        found = fields[2]
                    if "動詞" in fields[3]:  # verb
                        if "助動詞" not in fields[3]:  # ...but not auxiliary verb
                            found = fields[2]

                if "//" not in str(found).lower():
                    if found.lower() not in ["rt", "@", "sex", "fuck", "https", "http", "#", ".", ",", "/"]:
                        if len(found.strip()) != 0:
                            out += found + " "

            #Append the extracted tokens to a per-day corpus file
            today = datetime.date.today()
            with open("/tmp/JP" + today.isoformat() + ".txt", "a") as f:
                f.write(out + "\n")

            return True
        except:
            print("Unexpected error:", found, sys.exc_info()[0])
            return True

    def on_error(self, status):
        print(status)

#### main method
if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters the Twitter stream to capture Japanese tweets tagged with the Pokemon GO hashtags
    stream.filter(track=['#Pokémon', '#pokemongo', '#PokemonGo', '#PokémonGo', '#Pokemon', '#pokemon'], languages=["ja"])
    #stream.filter(track=['#pokemon'], languages=["en"])
    

7. Training model creation and data extraction

For the time being, let's check if the data can be obtained properly on the command line.

python3


>>> # !/usr/bin/env python
... # -*- coding:utf-8 -*-
... from gensim.models import word2vec
>>>
>>> data = word2vec.Text8Corpus('/tmp/JP2016-07-23.txt')
>>> model = word2vec.Word2Vec(data, size=200)
>>> model.most_similar(positive=u'Pokemon')
[('Pokémon', 0.49616560339927673), ('ND', 0.47942256927490234), ('Yo-Kai Watch', 0.4783376455307007), ('I', 0.44967448711395264), ('9', 0.4415249824523926), ('j', 0.4309641122817993), ('B', 0.4284788966178894), ('CX', 0.42728638648986816), ('l', 0.42639225721359253), ('bvRxC', 0.41929835081100464)]
>>>
>>> model.most_similar(positive=u'Pikachu')
[('SolderingArt', 0.7791135311126709), ('61', 0.7604312896728516), ('Pokemon', 0.7314165830612183), ('suki', 0.7087007761001587), ('Chu', 0.6967192888259888), ('docchi', 0.6937340497970581), ('Latte art', 0.6864794492721558), ('EjPbfZEhIS', 0.6781727075576782), ('Soldering', 0.6571916341781616), ('latteart', 0.6411304473876953)]
>>>
>>> model.most_similar(positive=u'Pikachu')
[('tobacco', 0.9689614176750183), ('Create', 0.9548219442367554), ('Shibuya', 0.9207605123519897), ('EXCJ', 0.9159889221191406), ('Littering', 0.8906601667404175), ('Get trash', 0.7719830274581909), ('There is there', 0.6942187547683716), ('Thank you', 0.6873651742935181), ('Please', 0.6714405417442322), ('GET', 0.6686745285987854)]
>>>
>>> model.most_similar(positive=u'Rare Pokemon')
[('table', 0.8076062202453613), ('Hayami', 0.8065655827522278), ('Habitat', 0.7529213428497314), ('obtain', 0.7382372617721558), ('latest', 0.7039971351623535), ('Japanese version', 0.6925774216651917), ('base', 0.6455932855606079), ('300', 0.6433809995651245), ('YosukeYou', 0.6330702900886536), ('Enoshima', 0.6322115659713745)]
>>>
>>> model.most_similar(positive=u'Mass generation')
[('Area', 0.9162761569023132), ('chaos', 0.8581807613372803), ('Sakuragicho Station', 0.7103563547134399), ('EjPbfZEhIS', 0.702730655670166), ('Okura', 0.6720583438873291), ('Tonomachi', 0.6632444858551025), ('Imai Shoten', 0.6514744758605957), ('丿', 0.6451742649078369), ('Paris', 0.6437439918518066), ('entrance', 0.640221893787384)]

The "base" and "Enoshima" hits for rare Pokemon are intriguing! And for the mass outbreaks, what about Sakuragicho Station, Okura, Tonomachi, Imai Shoten, and the rest?
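
Once the interactive check looks good, the same steps can be wrapped in a small script so the model does not have to be retrained every time. A minimal sketch assuming the same tweet file; the model path /tmp/pokemon_jp.model is just an illustrative choice, the query word ポケモン ("Pokemon") assumes the tokens were stored in Japanese, and save/load are standard gensim model methods:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from gensim.models import word2vec

# Train on the space-separated tokens collected by ParseJP.py
data = word2vec.Text8Corpus('/tmp/JP2016-07-23.txt')
model = word2vec.Word2Vec(data, size=200)

# Save once, then reload later without retraining
model.save('/tmp/pokemon_jp.model')
model = word2vec.Word2Vec.load('/tmp/pokemon_jp.model')

for word, score in model.most_similar(positive=u'ポケモン', topn=10):
    print(word, score)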

Postscript: Bonus

I deployed it to EC2 and started processing the data. Since I have no money, I can't publish it as an API, lol.
Note: the accuracy is still low, so please verify the facts for yourself!!

▼ Pokemon "Spot" word-of-mouth keyword ranking by twitter& word2vec
1.Kinshi Park
2.Aichi prefecture
3. gamespark 
4.Nagoya
5.park
6.Shopping street
7.Three places
8.Ohori Park
▼ Pokemon "mass outbreak" word-of-mouth keyword ranking by twitter& word2vec
1.Pokemon event collaboration
2.Sakuragicho Station
3.Okura
4.Nishi-Shinjuku
5.Shopping street
6.Paris
7.Central park
8.Fukushima
9.Imai Shoten
▼ Pokemon "Rare Pokemon" Review Keyword Ranking by twitter& word2vec
1.Legend
2.False rumor
3.Habitat
4.Midnight
5.Private house
6.east
7.Mewtwo
8.Hoax information
9.update
10.Evaluation
11.Mamizukamachi, Isesaki City, Gunma Prefecture
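
For reference, rankings like the ones above can be generated by querying the trained model and numbering the top hits. A minimal sketch; the model path and the Japanese query words (スポット, 大量発生, レアポケモン) are my guesses at what was actually used:

from gensim.models import word2vec

model = word2vec.Word2Vec.load('/tmp/pokemon_jp.model')

for query in [u'スポット', u'大量発生', u'レアポケモン']:
    print(u'▼ ranking for ' + query)
    # topn controls how many nearest neighbors most_similar returns
    for rank, (word, score) in enumerate(model.most_similar(positive=query, topn=10), start=1):
        print(u'%d. %s (%.3f)' % (rank, word, score))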

Postscript: neologd

https://github.com/neologd/mecab-ipadic-neologd If you read it carefully, it turns out you can install the latest version with install-mecab-ipadic-neologd, as shown below.

git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n
echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
# => /usr/local/lib/mecab/dic/mecab-ipadic-neologd

# point MeCab at the neologd dictionary in /usr/local/etc/mecabrc
vi /usr/local/etc/mecabrc
dicdir = /usr/local/lib/mecab/dic/mecab-ipadic-neologd
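
To confirm that MeCab is actually picking up the neologd dictionary, parse a word that only neologd knows as a single token. A minimal sketch; with the plain ipadic dictionary, "ポケモンGO" would be split into several morphemes instead of one noun:

import MeCab

# With dicdir pointing at mecab-ipadic-neologd, "ポケモンGO" comes out as a single noun
tagger = MeCab.Tagger("-Ochasen")
print(tagger.parse("ポケモンGOが配信開始"))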

Postscript: Addition of user dictionary

Add a user dictionary by referring to here, registering the station name list, the parks in Tokyo, and the Pokemon names.

cd /usr/local/lib/mecab/dic/ipadic
# add pokemon list
/usr/local/libexec/mecab/mecab-dict-index -u pokemon.dic -f utf-8 -t utf-8 /mnt/s3/resources/pokemons.csv
# add station list
/usr/local/libexec/mecab/mecab-dict-index -u station.dic -f utf-8 -t utf-8 /mnt/s3/resources/stations.csv
/usr/local/libexec/mecab/mecab-dict-index -u park.dic -f utf-8 -t utf-8 /mnt/s3/resources/park.csv

# copy into dict folder
cp pokemon.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/ 
cp station.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
cp park.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
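
The user dictionaries can also be loaded explicitly with MeCab's -u option (comma-separated) to check that the new entries are recognized. A minimal sketch; the paths match the copy commands above and the test sentence is illustrative:

import MeCab

# Load the user dictionaries built above alongside the system dictionary
tagger = MeCab.Tagger(
    "-Ochasen"
    " -u /usr/local/lib/mecab/dic/mecab-ipadic-neologd/pokemon.dic,"
    "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/station.dic,"
    "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/park.dic")
print(tagger.parse("桜木町駅でピカチュウが大量発生"))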

Bonus

Here is the result of trying the same thing with a data model built from the 2018 Wikipedia dump.

A woman's life is "love".
Adding "marriage" to a woman's life gives "affair".
Subtracting "marriage" from a woman's life gives "wisdom".
That's the answer from the venerable Wikipedia data model.
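
These additions and subtractions are the standard word2vec analogy arithmetic, done through the positive/negative arguments of most_similar. A minimal sketch; the model path is hypothetical and the query tokens (女, 人生, 結婚) are my guess at how the original Japanese queries were tokenized:

from gensim.models import word2vec

# Hypothetical path to a model trained on a 2018 Wikipedia dump
model = word2vec.Word2Vec.load('/tmp/wiki2018.model')

# "A woman's life" plus "marriage": positive vectors are added together
print(model.most_similar(positive=[u'女', u'人生', u'結婚'], topn=5))

# "A woman's life" minus "marriage": negative vectors are subtracted
print(model.most_similar(positive=[u'女', u'人生'], negative=[u'結婚'], topn=5))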


By the way, if you search for "job hunting", "success", or "case", you end up at the Rokkasho reprocessing plant ...
