This is the 4th installment of the Starbucks Twitter series. This time, I would like to process the location information contained in the tweet data!
Part 1: Import data with Twitter REST APIs and import it into mongoDB http://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2
Part 2: Separation of spam from the acquired Twitter data http://qiita.com/kenmatsu4/items/8d88e0992ca6e443f446
Part 3: Why did the number of tweets increase after one day? http://qiita.com/kenmatsu4/items/02034e5688cc186f224b
Part 4: Visualization of location information hidden in Twitter (this time) http://qiita.com/kenmatsu4/items/114f3cff815aa5037535
** <<< Data to be analyzed >>> **
** Schematic diagram of this content **
This time as well, we will analyze tweets whose text contains "Starbucks". In addition to the latitude and longitude attached to the tweet itself, MeCab is used to extract place names from the tweet body, and the Yahoo! Geocoder API then converts those place names to latitude/longitude, which we also examine. The first half covers the data-processing code, and the second half visualizes the results, so if you just want to see what is going on pictorially, please jump to [the second half](http://qiita.com/kenmatsu4/items/114f3cff815aa5037535#2-Visualization of location information) of the page.
First of all, import the libraries to be used and establish a connection to mongoDB.
%matplotlib inline
import numpy as np
import json, requests, pymongo, re
from pymongo import Connection  # pymongo 2.x API; in pymongo 3+ use MongoClient instead
from collections import defaultdict
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
connect = Connection('localhost', 27017)
db = connect.starbucks
tweetdata = db.tweetdata
location_dict = db.location
The tweet data itself contains a field called "coordinates"; if a tweet was posted with location information such as GPS, its latitude and longitude are stored here. First, let's see how many people tweet with location information attached.
num_not_geo = tweetdata.find({'coordinates':None,'spam':None,'retweeted_status': None},{'_id':1, 'coordinates':1}).count()
num_geo = tweetdata.find({'coordinates':{"$ne":None},'spam':None,'retweeted_status': None},{'_id':1, 'coordinates':1}).count()
print "num_not_geo",num_not_geo
print "num_geo", num_geo
print "%.3f"%(num_geo / float(num_geo+num_not_geo) * 100),"%"
** <<< Result >>> **
According to p. 24 of @arieee0's "Introduction of SNS user location estimation methods from text and usage examples", the proportion of tweets with location information is typically 0.3%, so Starbucks fans may be somewhat more eager to share their location w (although I can't tell without testing whether the difference is significant).
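Whether such a difference is significant could be checked with a two-proportion z-test. Below is a minimal sketch: the function itself is standard, but the counts fed into it are hypothetical stand-ins for illustration (the real values are the num_geo / num_not_geo results of the query above and the cited 0.3% baseline, whose sample size is not given).

```python
from math import sqrt, erf

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions (normal approximation)."""
    p1, p2 = x1 / float(n1), x2 / float(n2)
    p = (x1 + x2) / float(n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1.0 / n1 + 1.0 / n2))  # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)
    return z, p_value

# Hypothetical counts: 5,000 geotagged out of 500,000 "Starbucks" tweets (1.0%)
# versus a 0.3% baseline assumed to come from a 100,000-tweet sample.
z, p = two_proportion_ztest(5000, 500000, 300, 100000)
print("z = %.2f, p = %.4g" % (z, p))
```

With counts of this magnitude the z statistic is large and the difference would be highly significant; with the article's actual numbers the conclusion could differ.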
I was wondering how to pull geographic information out of the tweet body, and it turns out MeCab can extract place names out of the box, so I will use that. How convenient! Below is an example of morphological analysis with MeCab on a sentence mentioning Roppongi and Shibuya; both are tagged as "proper noun, area" (固有名詞, 地域), so they can easily be extracted :satisfied:
今日	名詞,副詞可能,*,*,*,*,今日,キョウ,キョー
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
六本木	名詞,固有名詞,地域,一般,*,*,六本木,ロッポンギ,ロッポンギ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
行く	動詞,自立,*,*,五段・カ行促音便,基本形,行く,イク,イク
けど	助詞,接続助詞,*,*,*,*,けど,ケド,ケド
、	記号,読点,*,*,*,*,、,、,、
その	連体詞,*,*,*,*,*,その,ソノ,ソノ
前	名詞,副詞可能,*,*,*,*,前,マエ,マエ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
渋谷	名詞,固有名詞,地域,一般,*,*,渋谷,シブヤ,シブヤ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
行き	動詞,自立,*,*,五段・カ行促音便,連用形,行く,イキ,イキ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。	記号,句点,*,*,*,*,。,。,。
Since the nouns were already extracted with MeCab and stored in the DB, the place names are now picked out and stored in a separate field.
# Extract area names from the text with MeCab and set them in the location_name field
import MeCab as mc

def location_name_mecab(sentence):
    t = mc.Tagger('-Ochasen -d /usr/local/Cellar/mecab/0.996/lib/mecab/dic/mecab-ipadic-neologd/')
    sentence = sentence.replace('\n', ' ')
    text = sentence.encode('utf-8')
    node = t.parseToNode(text)
    result_dict = defaultdict(list)
    for i in range(140):  # a tweet is at most 140 characters, which bounds the loop
        if node.surface != "":  # skip the BOS/EOS nodes
            # select tokens tagged as proper noun (固有名詞), area (地域)
            if (node.feature.split(",")[1] == "固有名詞") and (node.feature.split(",")[2] == "地域"):
                plain_word = node.feature.split(",")[6]
                if plain_word != "*":
                    result_dict[u'Area name'].append(plain_word.decode('utf-8'))
        node = node.next
        if node is None:
            break
    return result_dict
for d in tweetdata.find({'spam':None},{'_id':1, 'text':1}):
    ret = location_name_mecab(d['text'])
    tweetdata.update({'_id' : d['_id']},{'$push': {'location_name':{'$each':ret[u'Area name']}}})
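The filtering logic inside location_name_mecab can be exercised without MeCab installed by mocking the node.surface / node.feature pairs. This is a sketch with hand-written ipadic-style feature strings; the helper name extract_area_names is mine, not the article's.

```python
from collections import defaultdict

# Mocked (surface, feature) pairs in MeCab/ipadic format -- stand-ins for
# node.surface and node.feature, so the filter can be tested without MeCab.
parsed = [
    (u'今日',   u'名詞,副詞可能,*,*,*,*,今日,キョウ,キョー'),
    (u'六本木', u'名詞,固有名詞,地域,一般,*,*,六本木,ロッポンギ,ロッポンギ'),
    (u'に',     u'助詞,格助詞,一般,*,*,*,に,ニ,ニ'),
    (u'渋谷',   u'名詞,固有名詞,地域,一般,*,*,渋谷,シブヤ,シブヤ'),
]

def extract_area_names(tokens):
    """Keep only tokens whose features mark them as proper noun (固有名詞), area (地域)."""
    result = defaultdict(list)
    for surface, feature in tokens:
        f = feature.split(',')
        if f[1] == u'固有名詞' and f[2] == u'地域':
            if f[6] != u'*':           # skip entries with no base form
                result[u'area_name'].append(f[6])
    return result

print(extract_area_names(parsed)[u'area_name'])
```

Only 六本木 (Roppongi) and 渋谷 (Shibuya) survive the filter, matching the tagged output shown above.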
Now that the place names have been extracted, latitude and longitude will be obtained for them. I use the Yahoo! Geocoder API, but every request counts against the quota and I kept hitting the access limit, so I first collect the unique place names to be converted and store a table of place name and latitude/longitude pairs in mongoDB.
First, create a list of place names for which you want latitude and longitude information.
# Make the tweets' location_name values unique and aggregate them into the dictionary object "loc_name_dict"
loc_name_dict = defaultdict(int)
for d in tweetdata.find({'spam':None},{'_id':1, 'location_name':1}):
    for name in d['location_name']:
        loc_name_dict[name] += 1
Throw the aggregated set of place names at the Yahoo! Geocoder API to get latitude and longitude. An appid is required to use the geocoder API, so create an account on the Yahoo! Developer Network, obtain an appid, and set it below.
# Geocode the place names extracted from the tweets, for import into mongoDB
def get_coordinate_from_location(location_name):
    payload = {'appid': '<Set Yahoo appid>', 'output':'json'}  # set the appid to that of your own account!
    payload['query'] = location_name  # e.g. u'Roppongi'
    url = "http://geo.search.olp.yahooapis.jp/OpenLocalPlatform/V1/geoCoder"
    r = requests.get(url, params=payload)
    if r.status_code == 200:
        jdata = json.loads(r.content)
        # average the list of locations returned for the query and use that as the place name's latitude/longitude
        try:
            ret = np.array([map(float, j['Geometry']['Coordinates'].split(',')) for j in jdata['Feature']])
        except KeyError, e:
            print "KeyError(%s)" % str(e)
            return []
        return np.average(ret, axis=0)
    else:
        print "%d: error." % r.status_code
        return []
# Store the place name / latitude-longitude pairs in the "location" collection
for name in loc_name_dict.keys():
    loc = get_coordinate_from_location(name)
    if len(loc) > 0:
        location_dict.insert({"word":name, "latitude":loc[1], "longitude":loc[0]})
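The Coordinates parsing and averaging can be illustrated in isolation. The strings below are made-up stand-ins for what the geocoder returns (one "longitude,latitude" string per Feature); the averaging mirrors the np.average call in the function above.

```python
import numpy as np

# Made-up stand-ins for jdata['Feature'][i]['Geometry']['Coordinates'],
# which the Yahoo! geocoder returns as "longitude,latitude" strings.
coordinate_strings = ['139.7297,35.6627', '139.7315,35.6641']

# Split each string into floats and average over all hits for the query.
points = np.array([[float(v) for v in s.split(',')] for s in coordinate_strings])
lon, lat = np.average(points, axis=0)
print('lon=%.4f lat=%.4f' % (lon, lat))
```

Averaging over multiple geocoder hits gives a single representative point per place name, which is what gets stored in the "location" collection.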
Now that place names are linked to latitude/longitude, we apply this to the tweet data. Place names written only in katakana usually denote countries and rarely indicated the tweeter's own location, so they are excluded. There is also an area called Shinkaihotsu (新開発) in Imizu City, Toyama Prefecture, but since that word was mostly used in its ordinary sense of "new development", it was excluded as well (it is a rare place name). "Japan" (日本) is far too vague, so I exclude it too.
# Add location information inferred from the text to the tweet data
# Extract place names and latitude/longitude from the DB and keep them in a dictionary object
loc_dict = {loc['word']:[loc['longitude'], loc['latitude']] for loc in location_dict.find({})}

def get_coord(loc_name):
    # exclude katakana-only place names (mostly country names, unlikely to be the tweeter's location)
    regex = u'^[ア-ン]*$'
    match = re.search(regex, loc_name, re.U)
    if match:
        return 0
    # excluded words (新開発 is mostly used in its ordinary sense, and 日本 is too vague but frequent)
    if loc_name in [u'新開発', u'日本']:
        return 0
    if loc_name in loc_dict:
        # if present, return the location
        return (loc_dict[loc_name][0], loc_dict[loc_name][1])
    else:
        # if not, return zero
        return 0

def exist_check(word):
    return word in loc_dict
for d in tweetdata.find({'coordinates':None,'spam':None},{'_id':1, 'location_name':1}):
    if len(d['location_name']) > 0:
        name_list = np.array(d['location_name'])
        # boolean array: True where the place name has location information
        ind = np.array(map(exist_check, name_list))
        # count of True entries
        T_num = len(ind[ind==True])
        # process only tweets that contain a known place name
        if T_num > 0:
            coordRet = map(get_coord, name_list[ind])  # name_list[ind] keeps only names with location information
            [coordRet.remove(0) for i in range(coordRet.count(0))]  # remove the zeros
            if len(coordRet) == 0:
                continue
            # adopt the first place name (a tweet may contain several, but the first tends to matter most)
            lon, lat = coordRet[0]
            # reflect in the DB
            tweetdata.update({'_id' : d['_id']},
                             {'$set' : {'text_coord' : {'longitude':lon, 'latitude': lat}}})
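The exclusion rules in get_coord can be tested in isolation with a stand-in loc_dict. The coordinates below are illustrative values, not actual geocoder output, and the katakana range is the standard reading of the filter described above.

```python
import re

# Stand-in for the loc_dict pulled from mongoDB: place name -> [longitude, latitude].
loc_dict = {u'六本木': [139.7297, 35.6627], u'渋谷': [139.7016, 35.6580]}

KATAKANA_ONLY = re.compile(u'^[ア-ン]*$')  # katakana-only names are mostly countries
EXCLUDED = [u'新開発', u'日本']            # ambiguous / overly broad place names

def get_coord(loc_name):
    """Return (longitude, latitude) for a usable place name, else 0."""
    if KATAKANA_ONLY.match(loc_name) or loc_name in EXCLUDED:
        return 0
    if loc_name in loc_dict:
        return tuple(loc_dict[loc_name])
    return 0

print(get_coord(u'六本木'))    # known place name -> its coordinates
print(get_coord(u'アメリカ'))  # katakana-only -> excluded
print(get_coord(u'日本'))      # explicitly excluded
```

In the main loop only the first non-zero result per tweet is kept, on the assumption that the first place name in a tweet matters most.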
Now that we have all the data, I would like to visualize it. First, let's simply plot the raw points without a map.
# Retrieve the latitude/longitude attached to the tweets themselves
loc_data = np.array([[d['coordinates']['coordinates'][1], d['coordinates']['coordinates'][0]]\
              for d in tweetdata.find({'coordinates':{"$ne":None},'spam':None},{'_id':1, 'coordinates':1})])
# Extract the list of locations inferred from the tweet text from the DB
text_coord = np.array([[d['text_coord']['latitude'], d['text_coord']['longitude']] for d in tweetdata.find({'text_coord':{'$ne':None}},{'_id':1, 'text_coord':1})])
lat1 = loc_data[:,0]    # latitude from GPS
lon1 = loc_data[:,1]    # longitude from GPS
lat2 = text_coord[:,0]  # latitude from text
lon2 = text_coord[:,1]  # longitude from text
xlim_min = [np.min(lon1)*.9, 120, 139]
xlim_max = [np.max(lon1)*1.1, 150, 140.5]
ylim_min = [np.min(lat1)*.9, 20, 35.1]
ylim_max = [np.max(lat1)*1.1, 50, 36.1]
for x1,x2,y1,y2 in zip(xlim_min, xlim_max, ylim_min, ylim_max):
    plt.figure(figsize=(10,10))
    plt.xlim(x1,x2)
    plt.ylim(y1,y2)
    plt.scatter(lon1, lat1, s=20, alpha=0.4, c='b')

for x1,x2,y1,y2 in zip(xlim_min, xlim_max, ylim_min, ylim_max):
    plt.figure(figsize=(10,10))
    plt.xlim(x1,x2)
    plt.ylim(y1,y2)
    plt.scatter(lon2, lat2, s=20, alpha=0.4, c='g')
Let's start with the Tweet data that originally contains latitude and longitude.
How is it? ...Hard to tell, isn't it.
But as you may have noticed, there is a cluster of dots in the upper right.
Let's zoom in on it a little.
It's Japan! : smile:
Since the search word is "Starbucks" this is only natural, but the fact that the Japanese archipelago can be made out from roughly 1% of the data, about 5,000 tweets, suggests that people tweeting about "Starbucks" are spread fairly evenly across the country.
Now, I would like to put this data on the map and see it more clearly.
We will use a matplotlib toolkit called Basemap, so install it by referring to this link.
# Plot on the map
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

ar = np.arange
enlarge = [1,2,4,8,16,32]
w_list = [15000000./(i) for i in enlarge]
h_list = [9000000./(i) for i in enlarge]

xlim_min = [-142,  80, 120, 135, 139]#[3:5]
xlim_max = [ 192, 160, 150, 142, 141]#[3:5]
ylim_min = [ -55,   0,  20,  33,  35]#[3:5]
ylim_max = [  75,  50,  50,  37,  36.2]#[3:5]
ss       = [ 0.7, 0.3, 0.1, 0.03, 0.005]#[3:5]

for lon, lat in zip([lon1,lon2],[lat1,lat2]):
    for i, s in zip(ar(len(xlim_min)), ss):
        m = Basemap(projection='merc', llcrnrlat=ylim_min[i], urcrnrlat=ylim_max[i],\
                    llcrnrlon=xlim_min[i], urcrnrlon=xlim_max[i], lat_ts=20, resolution='c')
        plt.figure(figsize=(13,13))
        m.bluemarble()
        if i > 2:
            m.drawcoastlines(linewidth=0.25)
        for x, y in zip(lon, lat):
            m.tissot(x, y, s, 100, facecolor='red', zorder=100, alpha=0.4)
        plt.savefig('plot_map_%s.png' % (str(i)))  # save before show(), or the saved figure is blank
        plt.show()
Well, this is the result.
Put on a map, you can see at a glance which areas the tweets come from.
Since I searched for "Starbucks" there are naturally many in Japan, but surprisingly, "Starbucks" tweets are also coming in from Europe, the United States, Southeast Asia, and other regions!
Let's zoom in.
Japan is packed w You can also see tweets from Taiwan, China, South Korea, and Southeast Asia.
Zooming in further.
Dots are scattered all over, but the Tokyo-Nagoya-Osaka belt (Higashi-Meihan) is still especially dense.
As expected, tweets concentrate in urban areas with large populations, and almost none come from mountainous areas.
This is the highest-magnification view, focused on the metropolitan area. The whitish parts are the plains, and most tweets come from there; almost none come from the green mountainous parts. I think this matches intuition.
The latitude/longitude information that could be inferred from the text amounts to **50,310** points, nearly ten times the GPS-based data above. The code above already plots the coordinates estimated from the tweet body, so let's look at the map again.
I'm looking forward to seeing what the plot will look like from the place name in the body of the tweet.
This time the plot is focused entirely on Japan. Since the place names are extracted with MeCab against a Japanese dictionary and katakana-only names are excluded as described above, I think this result is as expected.
Enlarge.
It is much denser than before! Hokkaido is sparser, but Honshu, Shikoku, and Kyushu are far more densely covered. When converting from place name to latitude/longitude, some mysterious place names appeared and some points landed in the middle of the sea, so in terms of accuracy this cannot beat GPS information. Also, a place name in a tweet body does not always indicate the current position; accuracy should improve with a method that infers from the surrounding words how the place name is being used, so I will leave that as future work.
Enlarge.
Even so, the points are scattered in a pleasing way!
Finally, the metropolitan area again.
The number of dots appears smaller than with GPS information because coordinates obtained from a place name collapse onto identical points. Where you can see dark orange dots, many points are stacked in the same place.
So, this time, I took out the location information from the Tweet data and visualized it.
What are people overseas tweeting about "Starbucks"? There are things I am curious about, but this has grown long, so I will write about that analysis in the next installment.
The full code can be found at Gist.