This is the 4th installment of the Starbucks Twitter series. This time, I would like to process the location information contained in the tweet data!
Part 1: Import data with Twitter REST APIs and import it into mongoDB http://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2
Part 2: Separation of spam from the acquired Twitter data http://qiita.com/kenmatsu4/items/8d88e0992ca6e443f446
Part 3: Why did the number of tweets increase after one day? http://qiita.com/kenmatsu4/items/02034e5688cc186f224b
Part 4: Visualization of location information hidden in Twitter (this time) http://qiita.com/kenmatsu4/items/114f3cff815aa5037535
** <<< Data to be analyzed >>> **
** Schematic diagram of this content **
This time as well, we will analyze tweets whose text contains "Starbucks". In addition to the latitude and longitude attached to the tweet itself, MeCab is used to extract place names from the tweet body, and the Yahoo! Geocoder API then converts those place names to latitude/longitude, which we also examine. The first half covers the data-processing code, and the second half visualizes the results, so if you just want to see what is going on pictorially, please jump to [the second half](http://qiita.com/kenmatsu4/items/114f3cff815aa5037535#2-Visualization of location information) of the page.
First of all, import the libraries to be used and establish a connection to mongoDB.
%matplotlib inline
import numpy as np
import json, requests, pymongo, re
from pymongo import Connection  # pymongo 2.x API; in pymongo 3+ use MongoClient instead
from collections import defaultdict
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
connect = Connection('localhost', 27017)
db = connect.starbucks
tweetdata = db.tweetdata
location_dict = db.location
The tweet data itself contains a field called "coordinates"; if a tweet was posted with location information such as GPS, its latitude and longitude are stored here. First, let's see how many people tweet with location information attached.
num_not_geo = tweetdata.find({'coordinates':None,'spam':None,'retweeted_status': None},{'_id':1, 'coordinates':1}).count()
num_geo = tweetdata.find({'coordinates':{"$ne":None},'spam':None,'retweeted_status': None},{'_id':1, 'coordinates':1}).count()
print "num_not_geo",num_not_geo
print "num_geo", num_geo
print "%.3f"%(num_geo / float(num_geo+num_not_geo) * 100),"%"
** <<< Result >>> **
According to p. 24 of @arieee0's "Introduction of SNS user location estimation methods from text and usage examples", the proportion of tweets with location information is typically 0.3%, so Starbucks fans may be somewhat more eager to share their location w (although I can't tell without testing whether the difference is significant).
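Whether such a difference is significant could be checked with a two-proportion z-test. Below is a minimal sketch: the function itself is standard, but the counts fed into it are hypothetical stand-ins for illustration (the real values are the num_geo / num_not_geo results of the query above and the cited 0.3% baseline, whose sample size is not given).

```python
from math import sqrt, erf

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference of two proportions (normal approximation)."""
    p1, p2 = x1 / float(n1), x2 / float(n2)
    p = (x1 + x2) / float(n1 + n2)                 # pooled proportion
    se = sqrt(p * (1 - p) * (1.0 / n1 + 1.0 / n2))  # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)
    return z, p_value

# Hypothetical counts: 5,000 geotagged out of 500,000 "Starbucks" tweets (1.0%)
# versus a 0.3% baseline assumed to come from a 100,000-tweet sample.
z, p = two_proportion_ztest(5000, 500000, 300, 100000)
print("z = %.2f, p = %.4g" % (z, p))
```

With counts of this magnitude the z statistic is large and the difference would be highly significant; with the article's actual numbers the conclusion could differ.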
I was wondering how to pull geographic information out of the tweet body, and it turns out MeCab can extract place names out of the box, so I will use that. How convenient! Below is an example of morphological analysis with MeCab on a sentence mentioning Roppongi and Shibuya; both are tagged as "proper noun, area" (固有名詞, 地域), so they can easily be extracted :satisfied:
今日	名詞,副詞可能,*,*,*,*,今日,キョウ,キョー
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
六本木	名詞,固有名詞,地域,一般,*,*,六本木,ロッポンギ,ロッポンギ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
行く	動詞,自立,*,*,五段・カ行促音便,基本形,行く,イク,イク
けど	助詞,接続助詞,*,*,*,*,けど,ケド,ケド
、	記号,読点,*,*,*,*,、,、,、
その	連体詞,*,*,*,*,*,その,ソノ,ソノ
前	名詞,副詞可能,*,*,*,*,前,マエ,マエ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
渋谷	名詞,固有名詞,地域,一般,*,*,渋谷,シブヤ,シブヤ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
行き	動詞,自立,*,*,五段・カ行促音便,連用形,行く,イキ,イキ
たい	助動詞,*,*,*,特殊・タイ,基本形,たい,タイ,タイ
。	記号,句点,*,*,*,*,。,。,。
Since the nouns were already extracted with MeCab and stored in the DB, the place names are now picked out and stored in a separate field.
# Extract area names from the text with MeCab and set them in the location_name field
import MeCab as mc

def location_name_mecab(sentence):
    t = mc.Tagger('-Ochasen -d /usr/local/Cellar/mecab/0.996/lib/mecab/dic/mecab-ipadic-neologd/')
    sentence = sentence.replace('\n', ' ')
    text = sentence.encode('utf-8')
    node = t.parseToNode(text)
    result_dict = defaultdict(list)
    for i in range(140):  # a tweet is at most 140 characters, which bounds the loop
        if node.surface != "":  # skip the BOS/EOS nodes
            # select tokens tagged as proper noun (固有名詞), area (地域)
            if (node.feature.split(",")[1] == "固有名詞") and (node.feature.split(",")[2] == "地域"):
                plain_word = node.feature.split(",")[6]
                if plain_word != "*":
                    result_dict[u'Area name'].append(plain_word.decode('utf-8'))
        node = node.next
        if node is None:
            break
    return result_dict
for d in tweetdata.find({'spam':None},{'_id':1, 'text':1}):
    ret = location_name_mecab(d['text'])
    tweetdata.update({'_id' : d['_id']},{'$push': {'location_name':{'$each':ret[u'Area name']}}})
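The filtering logic inside location_name_mecab can be exercised without MeCab installed by mocking the node.surface / node.feature pairs. This is a sketch with hand-written ipadic-style feature strings; the helper name extract_area_names is mine, not the article's.

```python
from collections import defaultdict

# Mocked (surface, feature) pairs in MeCab/ipadic format -- stand-ins for
# node.surface and node.feature, so the filter can be tested without MeCab.
parsed = [
    (u'今日',   u'名詞,副詞可能,*,*,*,*,今日,キョウ,キョー'),
    (u'六本木', u'名詞,固有名詞,地域,一般,*,*,六本木,ロッポンギ,ロッポンギ'),
    (u'に',     u'助詞,格助詞,一般,*,*,*,に,ニ,ニ'),
    (u'渋谷',   u'名詞,固有名詞,地域,一般,*,*,渋谷,シブヤ,シブヤ'),
]

def extract_area_names(tokens):
    """Keep only tokens whose features mark them as proper noun (固有名詞), area (地域)."""
    result = defaultdict(list)
    for surface, feature in tokens:
        f = feature.split(',')
        if f[1] == u'固有名詞' and f[2] == u'地域':
            if f[6] != u'*':           # skip entries with no base form
                result[u'area_name'].append(f[6])
    return result

print(extract_area_names(parsed)[u'area_name'])
```

Only 六本木 (Roppongi) and 渋谷 (Shibuya) survive the filter, matching the tagged output shown above.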
Now that the place names have been extracted, latitude and longitude will be obtained for them. I use the Yahoo! Geocoder API, but every request counts against the quota and I kept hitting the access limit, so I first collect the unique place names to be converted and store a table of place name and latitude/longitude pairs in mongoDB.
First, create a list of place names for which you want latitude and longitude information.
# Make the tweets' location_name values unique and aggregate them into the dictionary object "loc_name_dict"
loc_name_dict = defaultdict(int)
for d in tweetdata.find({'spam':None},{'_id':1, 'location_name':1}):
    for name in d['location_name']:
        loc_name_dict[name] += 1
Throw the aggregated set of place names at the Yahoo! Geocoder API to get latitude and longitude. An appid is required to use the geocoder API, so create an account on the Yahoo! Developer Network, obtain an appid, and set it below.
# Geocode the place names extracted from the tweets, for import into mongoDB
def get_coordinate_from_location(location_name):
    payload = {'appid': '<Set Yahoo appid>', 'output':'json'}  # set the appid to that of your own account!
    payload['query'] = location_name  # e.g. u'Roppongi'
    url = "http://geo.search.olp.yahooapis.jp/OpenLocalPlatform/V1/geoCoder"
    r = requests.get(url, params=payload)
    if r.status_code == 200:
        jdata = json.loads(r.content)
        # average the list of locations returned for the query and use that as the place name's latitude/longitude
        try:
            ret = np.array([map(float, j['Geometry']['Coordinates'].split(',')) for j in jdata['Feature']])
        except KeyError, e:
            print "KeyError(%s)" % str(e)
            return []
        return np.average(ret, axis=0)
    else:
        print "%d: error." % r.status_code
        return []
# Store the place name / latitude-longitude pairs in the "location" collection
for name in loc_name_dict.keys():
    loc = get_coordinate_from_location(name)
    if len(loc) > 0:
        location_dict.insert({"word":name, "latitude":loc[1], "longitude":loc[0]})
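The Coordinates parsing and averaging can be illustrated in isolation. The strings below are made-up stand-ins for what the geocoder returns (one "longitude,latitude" string per Feature); the averaging mirrors the np.average call in the function above.

```python
import numpy as np

# Made-up stand-ins for jdata['Feature'][i]['Geometry']['Coordinates'],
# which the Yahoo! geocoder returns as "longitude,latitude" strings.
coordinate_strings = ['139.7297,35.6627', '139.7315,35.6641']

# Split each string into floats and average over all hits for the query.
points = np.array([[float(v) for v in s.split(',')] for s in coordinate_strings])
lon, lat = np.average(points, axis=0)
print('lon=%.4f lat=%.4f' % (lon, lat))
```

Averaging over multiple geocoder hits gives a single representative point per place name, which is what gets stored in the "location" collection.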
Now that place names are linked to latitude/longitude, we apply this to the tweet data. Place names written only in katakana usually denote countries and rarely indicated the tweeter's own location, so they are excluded. There is also an area called Shinkaihotsu (新開発) in Imizu City, Toyama Prefecture, but since that word was mostly used in its ordinary sense of "new development", it was excluded as well (it is a rare place name). "Japan" (日本) is far too vague, so I exclude it too.
# Add location information inferred from the text to the tweet data
# Extract place names and latitude/longitude from the DB and keep them in a dictionary object
loc_dict = {loc['word']:[loc['longitude'], loc['latitude']] for loc in location_dict.find({})}

def get_coord(loc_name):
    # exclude katakana-only place names (mostly country names, unlikely to be the tweeter's location)
    regex = u'^[ア-ン]*$'
    match = re.search(regex, loc_name, re.U)
    if match:
        return 0
    # excluded words (新開発 is mostly used in its ordinary sense, and 日本 is too vague but frequent)
    if loc_name in [u'新開発', u'日本']:
        return 0
    if loc_name in loc_dict:
        # if present, return the location
        return (loc_dict[loc_name][0], loc_dict[loc_name][1])
    else:
        # if not, return zero
        return 0

def exist_check(word):
    return word in loc_dict
for d in tweetdata.find({'coordinates':None,'spam':None},{'_id':1, 'location_name':1}):
    if len(d['location_name']) > 0:
        name_list = np.array(d['location_name'])
        # boolean array: True where the place name has location information
        ind = np.array(map(exist_check, name_list))
        # count of True entries
        T_num = len(ind[ind==True])
        # process only tweets that contain a known place name
        if T_num > 0:
            coordRet = map(get_coord, name_list[ind])  # name_list[ind] keeps only names with location information
            [coordRet.remove(0) for i in range(coordRet.count(0))]  # remove the zeros
            if len(coordRet) == 0:
                continue
            # adopt the first place name (a tweet may contain several, but the first tends to matter most)
            lon, lat = coordRet[0]
            # reflect in the DB
            tweetdata.update({'_id' : d['_id']},
                             {'$set' : {'text_coord' : {'longitude':lon, 'latitude': lat}}})
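The exclusion rules in get_coord can be tested in isolation with a stand-in loc_dict. The coordinates below are illustrative values, not actual geocoder output, and the katakana range is the standard reading of the filter described above.

```python
import re

# Stand-in for the loc_dict pulled from mongoDB: place name -> [longitude, latitude].
loc_dict = {u'六本木': [139.7297, 35.6627], u'渋谷': [139.7016, 35.6580]}

KATAKANA_ONLY = re.compile(u'^[ア-ン]*$')  # katakana-only names are mostly countries
EXCLUDED = [u'新開発', u'日本']            # ambiguous / overly broad place names

def get_coord(loc_name):
    """Return (longitude, latitude) for a usable place name, else 0."""
    if KATAKANA_ONLY.match(loc_name) or loc_name in EXCLUDED:
        return 0
    if loc_name in loc_dict:
        return tuple(loc_dict[loc_name])
    return 0

print(get_coord(u'六本木'))    # known place name -> its coordinates
print(get_coord(u'アメリカ'))  # katakana-only -> excluded
print(get_coord(u'日本'))      # explicitly excluded
```

In the main loop only the first non-zero result per tweet is kept, on the assumption that the first place name in a tweet matters most.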
Now that we have all the data, I would like to visualize it. First, let's simply plot the raw points without a map.
# Retrieve the latitude/longitude attached to the tweets themselves
loc_data = np.array([[d['coordinates']['coordinates'][1], d['coordinates']['coordinates'][0]]\
              for d in tweetdata.find({'coordinates':{"$ne":None},'spam':None},{'_id':1, 'coordinates':1})])
# Extract the list of locations inferred from the tweet text from the DB
text_coord = np.array([[d['text_coord']['latitude'], d['text_coord']['longitude']] for d in tweetdata.find({'text_coord':{'$ne':None}},{'_id':1, 'text_coord':1})])
lat1 = loc_data[:,0]    # latitude from GPS
lon1 = loc_data[:,1]    # longitude from GPS
lat2 = text_coord[:,0]  # latitude from text
lon2 = text_coord[:,1]  # longitude from text
xlim_min = [np.min(lon1)*.9, 120, 139]
xlim_max = [np.max(lon1)*1.1, 150, 140.5]
ylim_min = [np.min(lat1)*.9, 20, 35.1]
ylim_max = [np.max(lat1)*1.1, 50, 36.1]
for x1,x2,y1,y2 in zip(xlim_min, xlim_max, ylim_min, ylim_max):
    plt.figure(figsize=(10,10))
    plt.xlim(x1,x2)
    plt.ylim(y1,y2)
    plt.scatter(lon1, lat1, s=20, alpha=0.4, c='b')

for x1,x2,y1,y2 in zip(xlim_min, xlim_max, ylim_min, ylim_max):
    plt.figure(figsize=(10,10))
    plt.xlim(x1,x2)
    plt.ylim(y1,y2)
    plt.scatter(lon2, lat2, s=20, alpha=0.4, c='g')
Let's start with the Tweet data that originally contains latitude and longitude.
How is it? ...Hard to tell, isn't it.
But as you may have noticed, there is a cluster of dots in the upper right.
Let's zoom in on it a little.
It's Japan! : smile:
Since the search word is "Starbucks" this is only natural, but the fact that the Japanese archipelago can be made out from roughly 1% of the data, about 5,000 tweets, suggests that people tweeting about "Starbucks" are spread fairly evenly across the country.
Now, I would like to put this data on the map and see it more clearly.
We will use a matplotlib toolkit called Basemap, so install it by referring to this link.
# Plot on the map
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt

ar = np.arange
enlarge = [1,2,4,8,16,32]
w_list = [15000000./(i) for i in enlarge]
h_list = [9000000./(i) for i in enlarge]

xlim_min = [-142,  80, 120, 135, 139]#[3:5]
xlim_max = [ 192, 160, 150, 142, 141]#[3:5]
ylim_min = [ -55,   0,  20,  33,  35]#[3:5]
ylim_max = [  75,  50,  50,  37,  36.2]#[3:5]
ss       = [ 0.7, 0.3, 0.1, 0.03, 0.005]#[3:5]

for lon, lat in zip([lon1,lon2],[lat1,lat2]):
    for i, s in zip(ar(len(xlim_min)), ss):
        m = Basemap(projection='merc', llcrnrlat=ylim_min[i], urcrnrlat=ylim_max[i],\
                    llcrnrlon=xlim_min[i], urcrnrlon=xlim_max[i], lat_ts=20, resolution='c')
        plt.figure(figsize=(13,13))
        m.bluemarble()
        if i > 2:
            m.drawcoastlines(linewidth=0.25)
        for x, y in zip(lon, lat):
            m.tissot(x, y, s, 100, facecolor='red', zorder=100, alpha=0.4)
        plt.savefig('plot_map_%s.png' % (str(i)))  # save before show(), or the saved figure is blank
        plt.show()
Well, this is the result.
Put on a map, you can see at a glance which areas the tweets come from.
Since I searched for "Starbucks" there are naturally many in Japan, but surprisingly, "Starbucks" tweets are also coming in from Europe, the United States, Southeast Asia, and other regions!
Let's zoom in.
Japan is packed w You can also see tweets from Taiwan, China, South Korea, and Southeast Asia.
Zooming in further.
Dots are scattered all over, but the Tokyo-Nagoya-Osaka belt (Higashi-Meihan) is still especially dense.
As expected, tweets concentrate in urban areas with large populations, and almost none come from mountainous areas.
This is the highest-magnification view, focused on the metropolitan area. The whitish parts are the plains, and most tweets come from there; almost none come from the green mountainous parts. I think this matches intuition.
The latitude/longitude information that could be inferred from the text amounts to **50,310** points, nearly ten times the GPS-based data above. The code above already plots the coordinates estimated from the tweet body, so let's look at the map again.
I'm looking forward to seeing what the plot will look like from the place name in the body of the tweet.
This time the plot is focused entirely on Japan. Since the place names are extracted with MeCab against a Japanese dictionary and katakana-only names are excluded as described above, I think this result is as expected.
Enlarge.
It is much denser than before! Hokkaido is sparser, but Honshu, Shikoku, and Kyushu are far more densely covered. When converting from place name to latitude/longitude, some mysterious place names appeared and some points landed in the middle of the sea, so in terms of accuracy this cannot beat GPS information. Also, a place name in a tweet body does not always indicate the current position; accuracy should improve with a method that infers from the surrounding words how the place name is being used, so I will leave that as future work.
Enlarge.
Even so, the points are scattered in a pleasing way!
Finally, the metropolitan area again.
The number of dots appears smaller than with GPS information because coordinates obtained from a place name collapse onto identical points. Where you can see dark orange dots, many points are stacked in the same place.
So, this time, I took out the location information from the Tweet data and visualized it.
What are people overseas tweeting about "Starbucks"? There are things I am curious about, but this has grown long, so I will write about that analysis in the next installment.
The full code can be found at Gist.