[PYTHON] I tried plotting the place names that appear in Masashi Sada's lyrics on a heat map

Introduction

Many of the songs by Masashi Sada (Massan) are set in specific regions and places.

They are all wonderful songs (^o^). "Tobiume" in particular is a super-masterpiece that plays in my head every time I visit Dazaifu Tenmangu Shrine.

So which region features most often in Massan's songs, these included? Nagasaki, after all? Tokyo? Or ...? To solve this mystery, I picked up my keyboard and wrote some code.

By the way, "From the North Country" is very famous for its setting, but, as has been pointed out many times in this Advent Calendar, it has no lyrics, so it is outside the scope of this survey.

What I did (roughly)

What I did (details)

Part-of-speech decomposition of Massan's lyrics

Get lyrics

I scraped a lyrics site. For copyright reasons, I will not reproduce the lyrics here.

Part-of-speech decomposition

I used janome, a morphological analysis library for Python, for the part-of-speech decomposition.

tokenize_sample.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

from janome.tokenizer import Tokenizer


def main():
    # "Three red bridges over Shinji Pond"
    for token in Tokenizer().tokenize('心字池にかかる三つの赤い橋'):
        print(token)


if __name__ == '__main__':
    main()

Writing and running code like the above prints:

心字池	名詞,一般,*,*,*,*,心字池,シンジイケ,シンジイケ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
かかる	動詞,自立,*,*,五段・ラ行,基本形,かかる,カカル,カカル
三つ	名詞,一般,*,*,*,*,三つ,ミッツ,ミッツ
の	助詞,格助詞,一般,*,*,*,の,ノ,ノ
赤い	形容詞,自立,*,*,形容詞・アウオ段,基本形,赤い,アカイ,アカイ
橋	名詞,一般,*,*,*,*,橋,ハシ,ハシ

Oh! Shinji Pond is recognized properly!! That got me excited.

For the inner workings and history of janome, please see the slides by its developer, "PyCon JP 2015: Morphological analysis made with Python". (So it was born in 2015! Thank you!)

Extract region names and geocode

Extract region

I run janome over the lyrics scraped earlier and pick out the words tagged as "region". Since I want the number of times a word appears to be reflected in the density of the heat map, I count the words at the same time.

import csv

from janome.tokenizer import Tokenizer


def count_place():
    place_count_dict = {}
    # Build the tokenizer once; reloading the dictionary for every row is slow.
    t = Tokenizer(udic='sada_dict.csv', udic_enc='utf8')
    with open('sada_lyrics.csv', 'r', encoding='utf-8') as lyrics:
        reader = csv.reader(lyrics)
        for row in reader:
            # The second column (row[1]) is the text that gets tokenized.
            for token in t.tokenize(row[1]):
                # 名詞,固有名詞,地域 = noun, proper noun, region
                if token.part_of_speech.startswith('名詞,固有名詞,地域'):
                    place_name = token.surface
                    place_count_dict[place_name] = place_count_dict.get(place_name, 0) + 1
    return place_count_dict
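
The counting step above can also be expressed with collections.Counter from the standard library. The following is only a minimal sketch under some assumptions: so that it runs without janome or the lyrics file, it takes hand-written (surface, part_of_speech) pairs in place of real janome tokens, and the helper name count_regions and the sample data are made up for illustration. It also assumes that the tag for place names is 名詞,固有名詞,地域 (noun, proper noun, region).

```python
from collections import Counter

# Assumed IPADIC tag prefix for place names: noun, proper noun, region.
REGION_POS = '名詞,固有名詞,地域'


def count_regions(tokens):
    """Count surfaces whose part of speech starts with REGION_POS.

    `tokens` is an iterable of (surface, part_of_speech) pairs; with janome
    these would come from token.surface and token.part_of_speech.
    """
    return Counter(
        surface for surface, pos in tokens if pos.startswith(REGION_POS)
    )


# Hand-written sample standing in for tokenizer output (hypothetical data).
sample = [
    ('長崎', '名詞,固有名詞,地域,一般'),
    ('から', '助詞,格助詞,一般,*'),
    ('東京', '名詞,固有名詞,地域,一般'),
    ('長崎', '名詞,固有名詞,地域,一般'),
]
counts = count_regions(sample)
print(counts.most_common())
```

A Counter behaves like a dictionary, so it could be passed to the CSV-writing code unchanged, and most_common() gives the places sorted by frequency.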

janome can also load a user dictionary. According to the janome documentation, the dictionary format is the same as MeCab's.

sada_dict.csv


湯島聖堂,1288,1288,5000,名詞,固有名詞,一般,*,*,*,湯島聖堂,ユシマセイドウ,ユシマセイドウ
スカイツリー,1288,1288,5001,名詞,固有名詞,一般,*,*,*,スカイツリー,スカイツリー,スカイツリー

The example in the janome documentation registers "Tokyo Sky Tree", but in "Kasutira", which everyone knows, it is sung as just "Sky Tree", so I recommend registering "Sky Tree".
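
To make the column layout of such a dictionary line easier to see, here is a small sketch that assembles one line in the MeCab IPADIC CSV layout (surface, left context ID, right context ID, cost, part of speech and its sub-categories, conjugation type and form, base form, reading, pronunciation). The helper name make_udic_line is hypothetical, and the context IDs (1288) and the cost are simply copied from the sample above; whether they suit other words has not been verified.

```python
def make_udic_line(surface, reading, cost, left_id=1288, right_id=1288):
    """Build one MeCab IPADIC-format CSV line for a proper noun."""
    fields = [
        surface, str(left_id), str(right_id), str(cost),
        '名詞', '固有名詞', '一般', '*',  # part of speech and sub-categories
        '*', '*',                         # conjugation type and form (none)
        surface, reading, reading,        # base form, reading, pronunciation
    ]
    return ','.join(fields)


line = make_udic_line('スカイツリー', 'スカイツリー', 5001)
print(line)
```

Each line has 13 comma-separated fields, so a surface form that itself contains a comma would need extra handling that this sketch does not attempt.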

(Note)

Actually, I wanted to pin down the locations of these proper nouns and reflect them in the heat map, but I ran out of time and gave up. So I do not actually use the dictionary. My apologies for bringing it up anyway.

Geocoding

For geocoding, I used the googlemaps library for Python.

Since it uses the Google Maps API, you need to specify an API key. How to get one is described in the googlemaps GitHub repository.

import googlemaps


def geocode(place_name):
    gmaps = googlemaps.Client(key='write your API key')
    geocode_result = gmaps.geocode(place_name)
    # Note: this takes the north-east corner of the first result's viewport,
    # not the centre of the place.
    coord = geocode_result[0]['geometry']['viewport']['northeast']
    return coord['lat'], coord['lng']
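
The function above reads the north-east corner of the viewport, which is the corner of the suggested display box rather than the place itself. Geocoding API results also carry a geometry.location entry, so one possible variation is sketched below. This is not what the article's results were produced with; the helper name pick_location is hypothetical, and the response fragment is hand-made with made-up values so that the sketch runs without an API key.

```python
def pick_location(geocode_result):
    """Return (lat, lng) of the first result's location, or None if empty."""
    if not geocode_result:
        return None
    location = geocode_result[0]['geometry']['location']
    return location['lat'], location['lng']


# Hand-made fragment in the shape of a Geocoding API response (values made up).
fake_result = [{
    'geometry': {
        'location': {'lat': 36.2, 'lng': 138.25},
        'viewport': {
            'northeast': {'lat': 45.5, 'lng': 145.8},
            'southwest': {'lat': 24.0, 'lng': 122.9},
        },
    },
}]
print(pick_location(fake_result))
print(pick_location([]))
```

Returning None for an empty result list also avoids an IndexError when a word cannot be geocoded at all, although the caller then has to skip such rows.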

Summary so far

The following code ties together the work so far (decomposing the lyrics into parts of speech, extracting region names, and geocoding).

sada_place_geocoder.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv

import googlemaps
from janome.tokenizer import Tokenizer


def main():
    place_count_dict = count_place()
    gmaps = googlemaps.Client(key='write your API key')
    # Use a with block so the output file is flushed and closed.
    with open('sada_places.csv', 'w', encoding='utf-8', newline='') as out:
        writer = csv.writer(out, delimiter=',')
        for place_name, place_count in place_count_dict.items():
            lat, lon = geocode(gmaps, place_name)
            writer.writerow([place_name, place_count, lat, lon])


def count_place():
    place_count_dict = {}
    t = Tokenizer()  # build once and reuse for every row
    with open('sada_lyrics.csv', 'r', encoding='utf-8') as lyrics:
        reader = csv.reader(lyrics)
        for row in reader:
            for token in t.tokenize(row[1]):
                # 名詞,固有名詞,地域 = noun, proper noun, region
                if token.part_of_speech.startswith('名詞,固有名詞,地域'):
                    place_name = token.surface
                    place_count_dict[place_name] = place_count_dict.get(place_name, 0) + 1
    return place_count_dict


def geocode(gmaps, place_name):
    geocode_result = gmaps.geocode(place_name)
    # North-east corner of the first result's viewport, not the centre.
    coord = geocode_result[0]['geometry']['viewport']['northeast']
    return coord['lat'], coord['lng']


if __name__ == '__main__':
    main()

As a result, the region name, the number of appearances in the lyrics, the latitude, and the longitude are written out. Because I do not use a custom dictionary, garbage records show up here and there, but I will ignore them this time.

sada_places.csv


America,1,49.38,-66.94
Bermuda,1,14.5192371802915,121.0361231302915
Akita,1,39.86527460000001,140.5154199
Kasugayama,1,37.1489639802915,138.2363259802915
Victoria,1,48.450518,-123.322346
Mimiya,1,36.4073904302915,136.4570957
Minase,1,34.8791869802915,135.6691649802915
Kyo,3,30.5403905,120.3877692
spring,1,33.8689809,130.8083576
Asuka,3,38.8972965,139.9375578
Day,2,50.68819,5.675110099999999
Kamakura,2,35.3682478,139.5933376
Bathhouse,1,34.93531738029149,135.7610285302915
Yamami,1,36.5698502,136.9701007
Jerusalem,1,31.8829601,35.2652869
West Kyo,1,34.67190798029149,135.7844679802915
Addition,1,36.5431863,-6.255334599999999
Berlin,1,52.6754542,13.7611176
Kiraku,1,35.1904253,136.7319704
Mitsuke,3,37.5933274,139.0009869
Nagasaki,5,35.7377658,139.6976565
Urashima,1,35.4839466,139.6447166
Atago,1,35.9737504,139.6042941
Happy,1,34.4654479,135.5854033
Akishino,1,34.7155978,135.7837222
Heiankyo,1,44.5883529,127.1930004
Hong Kong,1,14.4904672802915,121.0242180302915
Karuizawa,2,36.4240846,138.6571307
Han,2,32.555258,114.2922103
Inasa,1,32.7592694,129.8647033
Kyoto,1,35.0542,135.8236
Musashi Koganei,1,35.70241118029149,139.5080892802915
Hakuhagi,1,38.2529733,140.9109412
Chino,3,34.047811,-117.5995851
Under the slope,1,35.3120498,139.5356368
Yukon,1,69.646498,-123.8009179
Home,1,34.0886418,132.9547384
Nanjing,1,32.3940135,119.050169
France,3,51.0891658,9.5597934
Mimomi,1,35.6882069802915,140.0695889802915
Welcome,1,37.9205189,112.7839926
Sound money,1,37.2102144,139.9250478
Yabu,1,35.3875492,140.1588221
Kano,1,35.2016331,135.4969237
Ebina,1,35.4774536,139.4364727
Renge,5,48.02912,8.027220699999999
Magellan,1,31.8199301,76.95342
Michinoku,1,35.5030142302915,139.6870448302915
Pearl Harbor,1,21.3885713,-157.9335744
Wharf,1,34.6863148,135.1933421
Harumi,1,35.6634906,139.7897775
Japan,2,34.6687571,135.5100311
Sophia,2,42.7877752,23.4569049
Alaska,4,71.3868712,-129.9945562
Yangtze River,1,36.4361024802915,139.8532846302915
Kitamae,1,26.3027021,127.7615069
Tsugaru,1,35.0117177302915,135.7573022302915
Ginza,1,35.6760255,139.7724941
Dew,1,51.2964846,22.6735312
United States,1,49.38,-66.94
Casablanca,2,33.6486015,-7.4582757
Tokyo,31,35.817813,139.910202
Pharmacy,1,35.0155830302915,135.7545184802915
Far north,1,12.9797045,15.683687
France orchid west,1,35.17525588029149,139.6558066802915
Gojo,1,39.5593820302915,115.7611693
Baghdad,2,33.4350586,44.5558261
Kanzeonji Temple,1,33.5222913,130.5254343
Akasaka,1,35.6782744,139.7459391
Buddha,4,34.9489952,136.9632495
New York,1,40.91525559999999,-73.70027209999999
Nishiki,1,32.2516958,130.9134777
Tigris,1,-15.4044999,-42.8735213
Hisakata,1,35.10910000000001,136.9854947
Rippling,1,33.9145777,130.8043569
Yushima,2,35.711327,139.7724702
Narayama,1,34.71184798029149,135.8116589802915
Koshien,1,34.7234607,135.3633836
Shinjuku,2,35.7298963,139.7451654
Kasumi,1,24.0234098,82.02101979999999
Fuji,1,35.3539032,138.8118555
curry,1,50.9818821,1.9320691
Nagasaki,24,32.9686469,129.9938174
Minamiyamate,2,32.7361422,129.8708733
Kutchan,1,43.015163,140.9243102
Alley,1,36.1243706,139.5655411
Sakamoto,1,37.9298369802915,140.9141139802915
Shijo,1,35.0044451802915,135.7580809302915
Mediterranean,1,45.7927967,36.215244
Akebono,1,26.2435843,127.6904124
Kagura,1,34.6626033,135.1513682
Azumi,2,36.3649943,137.8106765
Hiroshima,1,31.9163645,131.4305945
Mt. Emei,1,29.7169085,103.6231299
Yokohama,1,35.5113,139.674
Ueno,1,36.1325774,138.8291853
Chile,6,-17.4983293,-66.4169643
Yoga,1,35.62797998029149,139.6354899802915
Kilimanjaro,2,-3.0562826,37.3716347
Lifting feathers,1,35.1848028,136.9673238
Hiroshima,6,34.4426,132.4865
Nairobi,1,-1.164744,37.0493746
Tanifu,1,35.5452879,136.6135764
Nerima,2,35.779946,139.6811359
Namba,1,43.648665,-116.48121
Asakusa,1,35.7233639,139.8055923
Oshiage,2,35.71155898029149,139.8137769802915
Shinsaibashi,2,30.6801709802915,114.2062109802915
Japan,9,45.5227719,145.8175503
Tokyo,1,36.0447089,139.3743599
Rokuto,1,36.0028345,140.1105419
Tree root,2,36.9430004,137.4747414
Tateyama,1,36.5847934,137.6343407
Arakawa,1,36.1415564,139.8589857
Germany,5,41.2296285,141.0143767
Kimikage,1,34.7205324,135.1428907
Nara,3,34.70489999999999,135.8384
Shanghai,2,31.6688967,122.1137989
Yunnan,4,29.2233272,106.1977228
Yue,1,36.7995957,138.4063989
Gion,1,34.4529231,132.4693298
Shinano,2,36.8707572,138.2803909
Higashiyama,1,35.010837,135.7914226
Yotsuya,2,35.6726745,139.4551008
Nagano,1,36.835842,138.3190722
Planting,1,33.6152803,130.5166492

Generate a heat map based on the number of appearances and coordinates

sada_places.csv, generated earlier, records the coordinates and the number of appearances for each region, so I use that information to draw the heat map.

I used the Google Maps API for geocoding, so I tried using it for the heat map as well.

I wrote both the style and the script directly in the HTML, and even so the amount of code is only about this much.

<!DOCTYPE html>
<html>
  <head>
    <style>
      #map {
        width: 1200px;
        height: 600px;
      }
    </style>
    <script
      src="https://maps.googleapis.com/maps/api/js?key=YOUR_API_KEY&libraries=geometry,visualization">
    </script>
    <script>
      function initialize() {
        var mapCanvas = document.getElementById('map');
        var mapOptions = {
          center: new google.maps.LatLng(36.83566824724438,138.372802734375),
          zoom: 6,
          mapTypeId: google.maps.MapTypeId.ROADMAP
        }
        var map = new google.maps.Map(mapCanvas, mapOptions)

        var heatmapData = [
         // One object per coordinate goes here; most are omitted because the list is long.
         // It would definitely be better to generate this in an external file.
         { weight :  2 ,  location :  new google.maps.LatLng(32.7361422,129.8708733) },
         { weight :  4 ,  location :  new google.maps.LatLng(71.3868712,-129.9945562) }, 
         { weight :  3 ,  location :  new google.maps.LatLng(37.5933274,139.0009869) }
        ]
        var heatmap = new google.maps.visualization.HeatmapLayer({
          data: heatmapData,
          radius: 50,
          map: map
        });
      }
      google.maps.event.addDomListener(window, 'load', initialize);
    </script>
  </head>
  <body>
    <div id="map"></div>
  </body>
</html>
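
As the comment in the script says, writing the heatmapData entries by hand does not scale. One way to generate them from sada_places.csv is sketched below in Python. The helper name heatmap_entries is hypothetical, and the sketch assumes every row has exactly four columns (name, count, latitude, longitude); it reads two rows from memory so that it runs without the file.

```python
import csv
import io


def heatmap_entries(csv_file):
    """Turn rows of name,count,lat,lng into JavaScript object literals."""
    entries = []
    for name, count, lat, lng in csv.reader(csv_file):
        entries.append(
            '{ weight: %d, location: new google.maps.LatLng(%s, %s) }'
            % (int(count), lat, lng)
        )
    return ',\n'.join(entries)


# Two rows copied from sada_places.csv, read from memory instead of disk.
sample_csv = io.StringIO(
    'Minamiyamate,2,32.7361422,129.8708733\n'
    'Mitsuke,3,37.5933274,139.0009869\n'
)
js = heatmap_entries(sample_csv)
print(js)
```

The returned text could be pasted between the square brackets of heatmapData, or written to a separate .js file that the page loads.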

Referenced page

Result

Map of Japan

[Image: heat map over a map of Japan]

Looking at the results, as expected, Nagasaki and Tokyo, both closely tied to Massan, are the brightest. Hiroshima, from "Hiroshima no Sora", and Kyoto and Nara, which appear in "Yesterday / Kyo / Nara, Asuka / Tomorrow" and "Shuni-e", are also bright.

When I wondered why the upper right corner of Hokkaido was so bright, it turned out that the geocoding result for "Japan" had landed there ...

World map

[Image: heat map over a world map]

If you zoom out and look at the world map, you can see that many places outside Japan are sung about as well. As you would expect from a veteran with 42 years as a singer, Massan is a frighteningly global talent.

The white snow of Kilimanjaro in "Lion Standing in the Wind", which makes me cry every time I hear it live, and the Alaska that appears in "Aurora" and "Byakuya no Tasogare no Hikari", songs about a real photographer, are also faintly colored.

This is a bit off topic, but listening to "Aurora" and "Byakuya no Tasogare no Hikari" while looking up at the winter night sky will really bring you to tears, so if you have never heard them, please take this opportunity to give them a listen. "Aurora" and "Byakuya no Tasogare no Hikari": Masashi Sada, Mitsuho Agishi, and Michio Hoshino

The area around France and Germany is bright, more so than I expected. This is because words such as "Buddha" and "Germany" were classified as regions during the part-of-speech decomposition, so I think the accuracy could be improved by adding a dictionary.

In closing

The organizer of the Masashi Sada x IT Advent Calendar wrote "Easy part-of-speech decomposition of Masashi Sada using kuromoji" on day 4. Compared with that, both my heat map results and parts of my content are thin, which is embarrassing.

I would have liked to get the Sky Tree and Shinji Pond onto the map as well, and some of the Google Maps API geocoding results are questionable, so I would like to take the time to prepare and study properly and then try making the heat map again.

If I get an interesting result, I would like to send a postcard in to "Nama Sada" ...
