[PYTHON] Try rudimentary sentiment analysis on Twitter Stream API data.

Sentiment analysis is one of the first things that comes to mind when analyzing Twitter data. There are many approaches, but I want to start with the simplest possible example and (hopefully) make it more sophisticated step by step.

The data to analyze is once again Twitter. Until now it was acquired through the Twitter REST APIs, but this time it is acquired through the Twitter Streaming API. I import the tweets, quantify the sentiment analysis result for each one, and store everything in the database.

Please refer to the previous article for how to store Twitter data in mongodb.

1. Get data from Twitter Stream API and store it in mongoDB.

1-1. Preparation

First of all, the preparation: import the various libraries, declare the utility functions, and connect to the DB.

from requests_oauthlib import OAuth1Session
from requests.exceptions import ConnectionError, ReadTimeout, SSLError
import json, time, exceptions, sys, datetime, pytz, re, unicodedata, pymongo
import oauth2 as oauth
import urllib2 as urllib
import MeCab as mc
from collections import defaultdict
from pymongo import MongoClient
from httplib import IncompleteRead
import numpy as np

import logging
from logging import FileHandler, Formatter
import logging.config

connect = MongoClient('localhost', 27017)
db = connect.word_info
posi_nega_dict = db.posi_nega_dict
db2 = connect.twitter
streamdata = db2.streamdata

def str_to_date_jp(str_date):
    dts = datetime.datetime.strptime(str_date,'%a %b %d %H:%M:%S +0000 %Y')
    return pytz.utc.localize(dts).astimezone(pytz.timezone('Asia/Tokyo'))

def mecab_analysis(sentence):
    t = mc.Tagger('-Ochasen -d /usr/local/Cellar/mecab/0.996/lib/mecab/dic/mecab-ipadic-neologd/')
    sentence = sentence.replace('\n', ' ')
    text = sentence.encode('utf-8') 
    node = t.parseToNode(text) 
    result_dict = defaultdict(list)
    for i in range(140):  #Since it is a tweet, MAX 140 characters
        if node.surface != "":  #Exclude headers and footers
            word_type = node.feature.split(",")[0]
            if word_type in ["adjective", "verb","noun", "adverb"]:
                plain_word = node.feature.split(",")[6]
                if plain_word !="*":
                    result_dict[word_type.decode('utf-8')].append(plain_word.decode('utf-8'))
        node = node.next
        if node is None:
            break
    return result_dict

def logger_setting():
    import logging
    from logging import FileHandler, Formatter
    import logging.config

    logging.config.fileConfig('logging_tw.conf')
    logger = logging.getLogger('filelogger')
    return logger

logger = logger_setting()

KEYS = { #List the keys you got with your account below
        'consumer_key':'**********',
        'consumer_secret':'**********',
        'access_token':'**********',
        'access_secret':'**********',
       }

This time, the log is written to a file using a logger. The logging configuration file is as follows.

logging_tw.conf


# logging_tw.conf

[loggers]
keys=root, filelogger

[handlers]
keys=fileHandler

[formatters]
keys=logFormatter

[logger_root]
level=DEBUG
handlers=fileHandler

[logger_filelogger]
level=DEBUG
handlers=fileHandler
qualname=filelogger
propagate=0

[handler_fileHandler]
class=handlers.RotatingFileHandler
level=DEBUG
formatter=logFormatter
args=('logging_tw.log',)

[formatter_logFormatter]
format=%(asctime)s - %(name)s - %(levelname)s - %(message)s
datefmt=

1-2. Downloading and persisting the Japanese Evaluation Polarity Dictionary

Emotions are quantified using the Japanese Evaluation Polarity Dictionary created by the laboratory of Professor Inui and Professor Okazaki at Tohoku University.

First, download it from here and put it in the same folder as the .py script.

For the Japanese Evaluation Polarity Dictionary (words), positive terms are quantified as 1 and negative terms as -1 before being imported into mongodb. For the Japanese Evaluation Polarity Dictionary (nouns), terms labeled p are quantified as 1, terms labeled e as 0, and terms labeled n as -1, and then imported into mongodb. The code is below.

#Importing positive and negative dictionaries of words into mongoDB

#Import the Japanese Evaluation Polarity Dictionary (words) ver.1.0 (December 2008 version) into mongodb
#Positive terms are quantified as 1, negative terms as -1
with open("wago.121808.pn.txt", 'r') as f:
    for l in f.readlines():
        l = l.split('\t')
        l[1] = l[1].replace(" ","").replace('\n','')
        value = 1 if l[0].split('（')[0]=="ポジ" else -1  #labels in the file are Japanese: "ポジ" (positive) / "ネガ" (negative)
        posi_nega_dict.insert({"word":l[1].decode('utf-8'),"value":value})
        

#Import the Japanese Evaluation Polarity Dictionary (nouns) ver.1.0 (December 2008 version) into mongodb
#Terms labeled p are quantified as 1, e as 0, n as -1
with open("pn.csv.m3.120408.trim", 'r') as f:
    for l in f.readlines():
        l = l.split('\t')
        
        if l[1]=="p":
            value = 1
        elif l[1]=="e":
            value = 0
        elif l[1]=="n":
            value = -1
            
        posi_nega_dict.insert({"word":l[0].decode('utf-8'),"value":value})  

1-3. Processing to quantify emotions

Since 1-2 gave us a database of sentiment values for each word, we now add processing that converts a sentence into a sentiment value. However, since this is the "rudimentary" version, nothing elaborate is done:

  1. Simply check whether each word contained in the sentence exists in the Japanese Evaluation Polarity Dictionary.
  2. If it exists, use its value.
  3. Calculate the sentiment value of the sentence with the formula below ($ x_i $ is the sentiment value of a word found in the Japanese Evaluation Polarity Dictionary, $ n $ is the number of such words).

$$
{\rm sentiment\, value\, of\, the\, sentence} \, = \, \frac{1}{n}\sum_{i=1}^{n} x_i
$$

This way, the sentiment value always falls between -1 and 1 regardless of how many words the sentence contains, so sentences can be compared with each other.
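
For example, if a tweet contains three dictionary words with values $ +1 $, $ +1 $ and $ -1 $, its sentiment value is $ (1 + 1 - 1)/3 \approx 0.33 $, i.e. mildly positive.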


#Sentiment value setup (load into a dict object so that lookups are fast hash searches)
pn_dict = {data['word']: data['value'] for data in posi_nega_dict.find({},{'word':1,'value':1})}

def isexist_and_get_data(data, key):
    return data[key] if key in data else None

#Returns the sentiment value of a given sentence (word list) in the range -1 to 1 (1: most positive, -1: most negative)
def get_sentiment(word_list):
    val = 0
    score = 0
    word_count = 0
    val_list = []
    for word in word_list:
        val = isexist_and_get_data(pn_dict, word)
        val_list.append(val)
        if val is not None and val != 0: #If found, add the scores and count the words
            score += val
            word_count += 1
    
    logger.debug(','.join(word_list).encode('utf-8'))       
    logger.debug(val_list)
    return score/float(word_count) if word_count != 0. else 0.
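
As a quick sanity check of get_sentiment(), you could call it with a hand-made word list (hypothetical example; whether each word is found, and therefore the printed value, depends on what was loaded into pn_dict):

print get_sentiment([u'嬉しい', u'楽しい', u'悲しい'])  #e.g. 0.333... if the three words are found with values +1, +1, -1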

1-4. Downloading data from Twitter Stream API

This is the code that downloads tweet data from the Twitter Streaming API. While downloading tweets, it runs morphological analysis with MeCab and splits the words into lists of nouns, verbs, adjectives, and adverbs, a so-called Bag of Words.

Then, the sentiment value for those words is derived with the get_sentiment() function defined earlier and stored in mongodb together with the tweet.

# -----Stream data import---------#
consumer = oauth.Consumer(key=KEYS['consumer_key'], secret=KEYS['consumer_secret'])
token = oauth.Token(key=KEYS['access_token'], secret=KEYS['access_secret'])

url = 'https://stream.twitter.com/1.1/statuses/sample.json'
params = {}

request = oauth.Request.from_consumer_and_token(consumer, token, http_url=url, parameters=params)
request.sign_request(oauth.SignatureMethod_HMAC_SHA1(), consumer, token)
res = urllib.urlopen(request.to_url())

def get_list_from_dict(result, key):
    if key in result.keys():
        result_list = result[key]
    else:
        result_list = []
    return result_list

cnt = 1
try:
    for r in res:
        data = json.loads(r)
        if 'delete' in data.keys():
            pass
        else:    
            if data['lang'] in ['ja']: #['ja','en','und']:
                result = mecab_analysis(data['text'].replace('\n',''))

                noun_list      = get_list_from_dict(result, u'名詞')    #noun
                verb_list      = get_list_from_dict(result, u'動詞')    #verb
                adjective_list = get_list_from_dict(result, u'形容詞')  #adjective
                adverb_list    = get_list_from_dict(result, u'副詞')    #adverb

                item = {'id':data['id'], 'screen_name': data['user']['screen_name'], 
                        'text':data['text'].replace('\n',''), 'created_datetime':str_to_date_jp(data['created_at']),\
                       'verb':verb_list, 'adjective':adjective_list, 'noun': noun_list, 'adverb':adverb_list}
                if 'lang' in data.keys():
                    item['lang'] = data['lang']
                else:
                    item['lang'] = None
                
                #Added sentiment analysis results####################
                word_list = [word for k in result.keys() for word in result[k] ]
                item['sentiment'] = get_sentiment(word_list)
                
                streamdata.insert(item)
                if cnt%1000==0:
                    logger.info("%d, "%cnt)
                cnt += 1
except IncompleteRead as e:
    logger.error( '===error contents===')
    logger.error(  'type:' + str(type(e)))
    logger.error(  'args:' + str(e.args))
    logger.error(  'message:' + str(e.message))
    logger.error(  'e self:' + str(e))
    try:
        if type(e) == exceptions.KeyError:
            logger.error( data.keys())
    except:
        pass
except Exception as e:
    logger.error( '===error contents===')
    logger.error( 'type:' + str(type(e)))
    logger.error( 'args:' + str(e.args))
    logger.error( 'message:' + str(e.message))
    logger.error( 'e self:' + str(e))
    try:
        if type(e) == exceptions.KeyError:
            logger.error( data.keys())
    except:
        pass 
except:
    logger.error( "error.")

logger.info( "finished.")

Up to this point, the analysis has been a simple method that just assigns a sentiment value to each word and averages them. As for future development, spam filtering will be an issue as further preprocessing, and handling the relationships between words is an issue in the analysis itself. In particular, a phrase like "not cute" is split into "cute" and "not"; since "not" negates "cute", the positive +1.0 of "cute" should be cancelled and flipped to -1.0. At present, however, only "cute" is scored and the phrase comes out as +1.0, the opposite of the correct result.
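
As a rough sketch of the kind of correction that is needed (this is not the method used in this article, and note that the word list built above is grouped by part of speech, so word order is not even preserved there), a naive fix might flip the sign of a dictionary word whenever the next token is a negation word. Proper handling requires the dependency analysis introduced below; NEGATION_WORDS and the helper are hypothetical.

#Hypothetical sketch only: flip the polarity of a dictionary word when the
#following token is a negation word. Real handling needs dependency analysis.
NEGATION_WORDS = set([u'ない', u'ぬ', u'ず'])  #assumed negation tokens

def get_sentiment_with_naive_negation(word_list):
    score = 0
    word_count = 0
    for i, word in enumerate(word_list):
        val = pn_dict.get(word)
        if val is None or val == 0:
            continue
        #If the next token negates this word, invert its polarity
        if i + 1 < len(word_list) and word_list[i + 1] in NEGATION_WORDS:
            val = -val
        score += val
        word_count += 1
    return score/float(word_count) if word_count != 0 else 0.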

To handle this correctly, we need "dependency analysis", which determines which word "not" modifies and interprets them together. In the next section, I explain how to install the dependency analysis library CaboCha.

2. Dependency analysis

2-1. Installation of dependency analysis library CaboCha

So here I cover installing the dependency analysis library CaboCha (http://taku910.github.io/cabocha/) on a Mac. The installation took me quite a while, so I hope this helps.

** Download CaboCha ** https://drive.google.com/folderview?id=0B4y35FiV1wh7cGRCUUJHVTNJRnM&usp=sharing#list

A library called CRF++ is required to install CaboCha. ** CRF++ page ** http://taku910.github.io/crfpp/#install

** CRF++ Download ** https://drive.google.com/folderview?id=0B4y35FiV1wh7fngteFhHQUN2Y1B5eUJBNHZUemJYQV9VWlBUb3JlX0xBdWVZTWtSbVBneU0&usp=drive_web#list

After downloading, unpack it and run make and make install. Some environment variables and extra libraries are also needed, so those settings are included below.

tar zxfv CRF++-0.58.tar
cd CRF++-0.58
./configure 
make
sudo make install

export LIBRARY_PATH="/usr/local/include:/usr/local/lib:"
export CPLUS_INCLUDE_PATH="/usr/local/include:/opt/local/include"
export OBJC_INCLUDE_PATH="/usr/local/include:/opt/local/lib"

brew tap homebrew/dupes
brew install libxml2 libxslt libiconv
brew link --force libxml2
brew link --force libxslt
brew link libiconv --force

tar zxf cabocha-0.69.tar.bz2
cd cabocha-0.69
./configure --with-mecab-config=`which mecab-config` --with-charset=UTF8
make
make check
sudo make install

#[output: install information]
#.././install-sh -c -d '/usr/local/share/man/man1'
#/usr/bin/install -c -m 644 cabocha.1 '/usr/local/share/man/man1'
#./install-sh -c -d '/usr/local/bin'
#/usr/bin/install -c cabocha-config '/usr/local/bin'
#./install-sh -c -d '/usr/local/etc'
#/usr/bin/install -c -m 644 cabocharc '/usr/local/etc'

cd cabocha-0.69/python
python setup.py install

cp build/lib.macosx-10.10-intel-2.7/_CaboCha.so /Library/Python/2.7/site-packages
cp build/lib.macosx-10.10-intel-2.7/CaboCha.py /Library/Python/2.7/site-packages

The above installation steps were put together with reference to the following sites.

Sites referenced for installing CaboCha:

http://qiita.com/nezuq/items/f481f07fc0576b38e81d#1-10
http://hotolab.net/blog/mac_mecab_cabocha/
http://qiita.com/t_732_twit/items/a7956a170b1694f7ffc2
http://blog.goo.ne.jp/inubuyo-tools/e/db7b43bbcfdc23a9ff2ad2f37a2c72df

2-2. CaboCha trial

Let's try dependency analysis on a test sentence.

import CaboCha

c = CaboCha.Parser()

sentence = "Soseki handed this book to the woman who saw Ryunosuke."

tree =  c.parse(sentence)

print tree.toString(CaboCha.FORMAT_TREE)
print tree.toString(CaboCha.FORMAT_LATTICE)

The result of executing this code is as follows.

output


Soseki-----------D
this-D       |
Book---D   |
Ryunosuke-D   |
saw-D |
To women-D
I handed it over.
EOS

* 0 6D 0/1 -2.475106
Soseki noun,Proper noun,Personal name,Name,*,*,Soseki,SO SEKI,Soseki
Is a particle,Particle,*,*,*,*,Is,C,Wow
* 1 2D 0/0 1.488413
This adnominal adjective,*,*,*,*,*,this,this,this
* 2 4D 0/1 0.091699
Book noun,General,*,*,*,*,Book,Hong,Hong
Particles,Case particles,General,*,*,*,To,Wo,Wo
* 3 4D 0/1 2.266675
Ryunosuke noun,Proper noun,Personal name,Name,*,*,Ryunosuke,Ryunosuke,Ryunosuke
Particles,Case particles,General,*,*,*,To,Wo,Wo
* 4 5D 0/1 1.416783
Verb,Independence,*,*,One step,Continuous form,to see,Mi,Mi
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
* 5 6D 0/1 -2.475106
Feminine noun,General,*,*,*,*,Female,Josei,Josei
Particles,Case particles,General,*,*,*,To,D,D
* 6 -1D 0/1 0.000000
Passing verb,Independence,*,*,Godan / Sa line,Continuous form,hand over,I,I
Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta
.. symbol,Kuten,*,*,*,*,。,。,。
EOS

The line with "*" is the analysis result, and some words following it are the clauses.

Next to * is the "phrase number". Next is the clause number of the contact, which is -1 if there is no contact. It seems that you don't have to worry about "D".

The next two numbers are the positions of heads / function words

The last number indicates the degree of engagement score. Generally, the larger the value, the easier it is to engage.

So, the first phrase is 0 "Soseki is", and the person in charge is 6D, so it is "passed".
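
The same information can also be read programmatically instead of from the lattice text. Below is a minimal sketch using the chunk attributes exposed by the standard CaboCha Python binding (chunk_size, link, score, token_pos, token_size); it assumes the same example sentence as above, and the exact attribute names may differ depending on the CaboCha version installed in 2-1.

import CaboCha

c = CaboCha.Parser()
sentence = "漱石はこの本を龍之介を見た女性に渡した。"
tree = c.parse(sentence)

#For each chunk print: its index, the index of the chunk it depends on (link, -1 for the root) and the dependency score
for i in range(tree.chunk_size()):
    chunk = tree.chunk(i)
    #Join the surface forms of the tokens that belong to this chunk
    surface = ''.join(tree.token(j).surface
                      for j in range(chunk.token_pos, chunk.token_pos + chunk.token_size))
    print '%d -> %d (score %.4f): %s' % (i, chunk.link, chunk.score, surface)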

In this article, I got as far as installing CaboCha, the dependency analysis library. In the next article, I will apply it to the tweet data.

References, etc.

Japanese Evaluation Polarity Dictionary, Inui-Okazaki Laboratory, Tohoku University.
Nozomi Kobayashi, Kentaro Inui, Yuji Matsumoto, Kenji Tateishi, Shunichi Fukushima. Collection of evaluation expressions for extracting opinions. Natural Language Processing, Vol. 12, No. 3, pp. 203-222, 2005.
Masahiko Higashiyama, Kentaro Inui, Yuji Matsumoto. Acquisition of Noun Evaluation Polarity Focusing on Predicate Selectional Preference. Proceedings of the 14th Annual Meeting of the Language Processing Society, pp. 584-587, 2008.
