[Python] Visualizing Arashi's lyrics with WordCloud to unravel what they wanted to tell their fans in the 20 years since their formation


Only one year remains until Arashi suspends its activities, and it has been 20 years since they first appeared in those see-through costumes. What have these national idols, who are active in so many fields, wanted to tell their fans over the 20 years since their formation? I would love to ask them in person, but that isn't possible. So I decided to "visualize the lyrics," and I, ~~the sixth member~~, will pass on to Arashi fans the message the group wants to convey.


- Python 3.7.3, Windows 10

Rough flow

  1. Collecting lyrics (scraping)
  2. Turn lyrics into words (morphological analysis)
  3. Visualization (WordCloud)

1. Collecting lyrics (scraping)


import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

#Create a table to store scraped data
list_df = pd.DataFrame(columns=['lyrics'])

for page in range(1, 3):
	#Song page top address
	base_url = 'https://www.uta-net.com'

	#Lyrics list page
	url = 'https://www.uta-net.com/artist/3891/0/' + str(page) + '/'
	response = requests.get(url)
	soup = BeautifulSoup(response.text, 'lxml')
	links = soup.find_all('td', class_='side td1')
	for link in links:
		a = base_url + (link.a.get('href'))

		#Lyrics detail page
		response = requests.get(a)
		soup = BeautifulSoup(response.text, 'lxml')
		song_lyrics = soup.find('div', itemprop='lyrics')
		song_lyric = song_lyrics.text
		song_lyric = song_lyric.replace('\n','')
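		#Wait 1 second so as not to overload the server
		time.sleep(1)

		#Add the acquired lyrics to the table
		#(pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
		tmp_se = pd.DataFrame([song_lyric], index=list_df.columns).T
		list_df = pd.concat([list_df, tmp_se], ignore_index=True)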


#csv save
list_df.to_csv('list.csv', mode = 'a', encoding='cp932')
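
The loop above requests many pages in a row, so the pause between requests matters. The following is a minimal sketch, not the code used in this article, of how the page URLs and the wait could be factored into small functions. `build_list_urls`, `fetch_all`, and the `fetch` parameter are names assumed for this illustration; in practice `fetch` could be a thin wrapper around `requests.get(url).text`.

```python
import time
from typing import Callable, Dict, Iterable, List

# Artist list URL pattern taken from the script above
LIST_URL = 'https://www.uta-net.com/artist/3891/0/{page}/'


def build_list_urls(first_page: int, last_page: int) -> List[str]:
    """Return the lyrics list page URLs for first_page..last_page (inclusive)."""
    return [LIST_URL.format(page=p) for p in range(first_page, last_page + 1)]


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              delay: float = 1.0) -> Dict[str, str]:
    """Call fetch(url) for each URL, sleeping `delay` seconds between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause so the server is not hit back-to-back
        pages[url] = fetch(url)
    return pages


if __name__ == '__main__':
    # Stand-in fetch function (hypothetical) so the sketch runs without network access
    def fake_fetch(url: str) -> str:
        return '<html>' + url + '</html>'

    for url, html in fetch_all(build_list_urls(1, 2), fake_fetch, delay=0).items():
        print(url, len(html))
```

Passing the fetch function in as an argument keeps the waiting logic separate from the HTTP library, which also makes it easy to try out without touching the real site.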

2. Turn lyrics into words (morphological analysis)


from janome.tokenizer import Tokenizer
import pandas as pd
import re

#Read the list.csv file
df_file = pd.read_csv('list.csv', encoding='cp932')

song_lyrics = df_file['lyrics'].tolist()

t = Tokenizer()

results = []

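for s in song_lyrics:
	tokens = t.tokenize(s)
	r = []

	for tok in tokens:
		#Use the base form; fall back to the surface form when it is missing
		if tok.base_form == '*':
			word = tok.surface
		else:
			word = tok.base_form

		ps = tok.part_of_speech

		#The first comma-separated field is the main part of speech
		hinshi = ps.split(',')[0]

		#janome reports parts of speech in Japanese: noun, adjective, verb, adverb
		if hinshi in ['名詞', '形容詞', '動詞', '副詞']:
			r.append(word)

	rl = (' '.join(r)).strip()
	results.append(rl)

#Remove full-width spaces (U+3000)
result = [i.replace('\u3000','') for i in results]

#Write one song per line, with words separated by spaces
text_file = 'wakati_list.txt'
with open(text_file, 'w', encoding='utf-8') as fp:
	fp.write('\n'.join(result))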
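
To make the filtering step easier to follow, here is a small sketch that applies the same rule to hand-written token records instead of real janome output. The `Tok` tuple, the `content_words` function, and the sample tokens are assumptions made for illustration; they only mimic the `surface`, `base_form`, and `part_of_speech` attributes used in this step.

```python
from collections import namedtuple

# Hypothetical stand-in for a janome token (only the three attributes used in this step)
Tok = namedtuple('Tok', ['surface', 'base_form', 'part_of_speech'])

# Parts of speech to keep: noun, adjective, verb, adverb
KEEP = {'名詞', '形容詞', '動詞', '副詞'}


def content_words(tokens):
    """Return base forms (or surface forms when the base form is '*') of kept tokens."""
    words = []
    for tok in tokens:
        word = tok.surface if tok.base_form == '*' else tok.base_form
        # The first comma-separated field is the main part of speech
        if tok.part_of_speech.split(',')[0] in KEEP:
            words.append(word)
    return words


# Hand-written sample for "未来へ歩いた" (not produced by running janome)
sample = [
    Tok('未来', '未来', '名詞,一般,*,*'),
    Tok('へ', 'へ', '助詞,格助詞,一般,*'),
    Tok('歩い', '歩く', '動詞,自立,*,*'),
    Tok('た', 'た', '助動詞,*,*,*'),
]
print(' '.join(content_words(sample)))  # 未来 歩く
```

Particles and auxiliary verbs are dropped, and the verb is reduced to its dictionary form, which is why conjugated forms of the same word end up counted together.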

3. Visualization (WordCloud)


from wordcloud import WordCloud

#Read the space-separated words and close the file afterwards
with open('wakati_list.txt', encoding='utf-8') as text_file:
	text = text_file.read()

#Japanese font path
fpath = 'C:/Windows/Fonts/YuGothM.ttc'

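#Remove words that carry little meaning
#(stop words must match the Japanese base forms in wakati_list.txt; this list is restored from its English glosses and may need adjusting)
stop_words = ['そう', 'ない', 'いる', 'する', 'まま', 'よう', 'てる', 'なる', 'こと', 'もう', 'いい', 'ある', 'ゆく', 'れる']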

wordcloud = WordCloud(background_color='white',
	font_path=fpath, width=800, height=600, stopwords=set(stop_words)).generate(text)

#Save the image as a PNG in the same directory as the script
wordcloud.to_file('wordcloud_sample.png')

↓ ↓ So, how did it turn out? ↓ ↓

Execution result

It looks good! (Image: wordcloud_sample.png)


By visualizing the lyrics, I found that words that convey Arashi's warmth, such as "future," "us," "here," and "see," appear frequently (* ´ ▽ ` *).
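
The words that stand out in the cloud are the most frequent ones, so a plain frequency count is a quick way to double-check the picture. Here is a minimal sketch using only the standard library; `top_words` is a name assumed for this example, and the counts may differ slightly from what WordCloud uses because WordCloud applies extra processing of its own.

```python
from collections import Counter


def top_words(text, stop_words=(), n=10):
    """Count space-separated words, skip stop words, and return the n most common."""
    stop = set(stop_words)
    counts = Counter(w for w in text.split() if w not in stop)
    return counts.most_common(n)


if __name__ == '__main__':
    # Made-up sample in the same space-separated format as wakati_list.txt
    sample = '未来 僕ら する 未来 ここ 見る 未来 僕ら'
    print(top_words(sample, stop_words=['する'], n=2))
```

To run it on the real data, pass the contents of wakati_list.txt as `text` together with the same stop word list.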

Message from Arashi

Let's walk toward the future together with us. We will always be by your side. With one year left until the hiatus, we will stir up an A・RA・SHI whirlwind all over Japan (~~a message from me, the sixth member~~).

Fans already understand Arashi's feelings without me having to say it, right?

"We" Arashi fans will support Arashi with all our might until the very end. Good luck, ARASHI. And if it pops, Yea!

In conclusion

I enjoyed learning about scraping, morphological analysis, and how to use WordCloud with Arashi's songs as the subject. This post turned out long, so thank you for reading this far. If you find any mistakes, I would be very grateful if you could point them out in the comments.

