[PYTHON] "Freedom" in AKB48 and Nogizaka46 practiced in word2vec

Introduction

Using word2vec from the Python library gensim, I trained one model on the lyrics of AKB48 and another on the lyrics of Nogizaka46, then searched each model for words similar to "freedom". The processing is split across six programs. First, I get the title and ID of each AKB48 and Nogizaka46 song from Uta-Net (https://www.uta-net.com/). Next, the lyrics of each song are fetched using that ID. The lyrics are then morphologically analyzed with MeCab and filtered by part of speech as a form of stop-word removal. After that, the analyzed data is aggregated to check which words are used most often. A word2vec model is then trained on the analyzed data. Finally, I extract the words most similar to "freedom" from each model.

Environment required to run this program

Libraries such as BeautifulSoup and gensim must be installed. (A distribution such as Anaconda is recommended; in that case only gensim needs to be installed separately.) The scripts also import requests and pandas and use the html5lib parser.

MeCab must be usable from Python, with a dictionary such as mecab-ipadic-NEologd installed.

Each program introduced here can be run from the command line with "python <file name>.py".

Main flow of the program

The analysis is carried out by running the following six programs in order, once for AKB48 and once for Nogizaka46.

- Scrape the ID and title of each AKB48 / Nogizaka46 song from Uta-Net.
- Scrape the lyrics of AKB48 / Nogizaka46 from Uta-Net.
- Perform morphological analysis on the scraped lyrics.
- Aggregate the morphologically analyzed text data.
- Train a word2vec model on the morphologically analyzed text data.
- Extract words similar to "freedom" from the trained model.
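
As a rough sketch of this order, the six AKB48 scripts named later in this article can be listed and turned into the commands to run. The helper name build_commands is hypothetical, and nothing is executed here; the code only builds the command lines.

```python
# Script names are the AKB48 files used later in this article, in execution order.
AKB48_SCRIPTS = [
    "scraping_akb48_id.py",      # 1. song IDs and titles
    "scraping_akb48_lyrics.py",  # 2. lyrics
    "akb48_mecab.py",            # 3. morphological analysis
    "akb48_count.py",            # 4. word counts (optional)
    "akb48_word2vec.py",         # 5. train the model
    "akb48_word2vec_model.py",   # 6. query similar words
]

def build_commands(scripts):
    """Return the command line for each script, in order (hypothetical helper)."""
    return ["python " + name for name in scripts]

for command in build_commands(AKB48_SCRIPTS):
    print(command)
```

The Nogizaka46 scripts follow the same order with their own file names.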

Words similar to "freedom" in AKB48

First, I extract the words similar to "freedom" in AKB48. As explained above, the process is: get the song IDs and titles from Uta-Net, get the lyrics, perform morphological analysis, and train a model with word2vec.

Scraping the lyrics of AKB48

The training data for this analysis is the lyrics of every AKB48 song released so far. Uta-Net lists lyrics by artist, so I scrape them from there. On Uta-Net, each lyrics page is addressed by a song ID, so the first step is to scrape the ID and title of every song from the page that lists AKB48's songs. Each ID can then be used to fetch the lyrics of that song.

Get the song IDs and titles of AKB48

The following program scrapes the song IDs and titles from the Uta-Net search URL that narrows the results down to AKB48 songs, and writes them to a file called akb48_id.csv.

scraping_akb48_id.py


# -*- coding:utf-8 -*-
import csv
import requests
from bs4 import BeautifulSoup

# Search URL restricted to AKB48 songs; {0} is the result page number
target_url = 'https://www.uta-net.com/search/?Aselect=1&Bselect=3&Keyword=AKB48&sort=&pnum={0}'

with open('akb48_id.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)  # quotes titles that contain commas
    writer.writerow(['code', 'title'])

    for i in range(1, 4):  # result pages 1 to 3
        r = requests.get(target_url.format(i))
        soup = BeautifulSoup(r.text, 'html5lib')
        # Each of these cells holds the link to the song page and the song title
        cells = soup.find_all('td', {'class': 'side td1'})

        for cell in cells:
            # Drop the first six characters of the link and any slashes to leave the ID
            code = cell.find('a').attrs['href'][6:].replace("/", '')
            print(code, cell.text)
            writer.writerow([code, cell.text])
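
The script takes the ID by slicing off the first six characters of the link. As a hedged alternative, assuming the links have the form /song/<digits>/ (inferred from the lyrics URL used in the next program, not checked against the live site), a regular expression makes that assumption explicit. The function name extract_song_id is hypothetical.

```python
import re

# Assumed link form: /song/<digits>/ (inferred, not verified against the site)
SONG_HREF = re.compile(r"^/song/(\d+)/?$")

def extract_song_id(href):
    """Return the numeric ID from a link like /song/12345/, or None if it does not match."""
    match = SONG_HREF.match(href)
    return match.group(1) if match else None

print(extract_song_id("/song/12345/"))  # 12345
print(extract_song_id("/artist/678/"))  # None
```

Returning None for an unexpected link makes it easy to skip rows that are not songs instead of writing a malformed ID.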

Get the lyrics from the AKB48 song IDs

The lyrics text is scraped using the song IDs obtained by the previous program. The scraped text is written to a file called akb48_lyrics.csv.

scraping_akb48_lyrics.py


# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pandas as pd

target_url = 'https://www.uta-net.com/song/{0}/'

# Read the song IDs as strings so they go into the URL unchanged
akb48_01 = pd.read_csv('akb48_id.csv', dtype='object')
akb48_02 = akb48_01["code"].values.tolist()

with open('akb48_lyrics.csv', 'w', encoding='utf-8') as f:
    f.write("lyrics\n")

    for i in akb48_02:
        r = requests.get(target_url.format(i))
        soup = BeautifulSoup(r.text, 'html5lib')
        # The lyrics are inside the div whose id is kashi_area
        lyrics = soup.find_all('div', {'id': 'kashi_area'})

        for lyric in lyrics:
            # Remove commas so each song stays in the single "lyrics" column
            line = lyric.text.replace(",", '')
            print(line)
            f.write(line + "\n")
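
The script deletes commas so that the lyrics do not split into extra CSV columns. A possible alternative, shown here only as a sketch with made-up sample text, is to let the standard csv module quote the field so that commas survive a round trip.

```python
import csv
import io

sample = "la la la, la la"  # made-up text standing in for one song's lyrics

# Write the header and one row; csv.writer quotes fields that contain commas
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["lyrics"])
writer.writerow([sample])

# Read it back: the comma is still part of the single field
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows)
```

For this analysis the difference is small, since punctuation is discarded by the part-of-speech filter anyway.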

Morphological analysis of AKB48 lyrics

Next, the scraped lyrics are morphologically analyzed to produce data suitable for training a word2vec model. word2vec expects its training text as words separated by spaces, so the analyzed output is written in that space-separated form. As stop-word handling, only nouns, adjectives, verbs, and adverbs are kept, and very short tokens are dropped. The result is written to a file called akb48_wakati.txt.

akb48_mecab.py


import re
import MeCab

# Parts of speech to keep: noun, adjective, verb, adverb.
# MeCab's IPA-style dictionaries report these tags in Japanese.
TARGET_POS = ("名詞", "形容詞", "動詞", "副詞")

# Create the tagger once rather than for every line
tagger = MeCab.Tagger('-Ochasen')
tagger.parse('')  # workaround so that node.surface is read correctly

def extractKeyword(line):
    node = tagger.parseToNode(line)
    keywords = []
    while node:
        if node.feature.split(",")[0] in TARGET_POS:
            keywords.append(node.surface)
        node = node.next
    return keywords

# Matches a token that is a single hiragana character
single = r"^[あ-ん]$"

with open('akb48_lyrics.csv', 'r', encoding='utf-8') as lyrics:
    text = lyrics.readlines()

with open('akb48_wakati.txt', 'w', encoding='utf-8') as f:
    for line in text:
        for word in extractKeyword(line):
            # Blank out single hiragana and tokens shorter than 3 UTF-8 bytes
            if re.match(single, word) or len(word.encode('utf-8')) < 3:
                word = ""
            print(word, end=" ")
            f.write(word + " ")
        print("\n")
        f.write("\n")
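
To make the part-of-speech filter concrete without needing MeCab installed, here is a minimal sketch that applies the same rule to hand-written (surface, feature) pairs, using the Japanese tag names that MeCab's IPA-style dictionaries emit. The sample pairs only imitate MeCab's comma-separated feature strings, and keep_content_words is a hypothetical name.

```python
# Parts of speech kept by the filter: noun, adjective, verb, adverb
KEEP = {"名詞", "形容詞", "動詞", "副詞"}

def keep_content_words(nodes):
    """Keep surfaces whose first feature field is one of the KEEP tags."""
    return [surface for surface, feature in nodes
            if feature.split(",")[0] in KEEP]

# Hand-written sample imitating MeCab output for the phrase 自由 に 飛ぶ
sample_nodes = [
    ("自由", "名詞,形容動詞語幹,*,*"),
    ("に", "助詞,副詞化,*,*"),
    ("飛ぶ", "動詞,自立,*,*"),
]

print(keep_content_words(sample_nodes))  # ['自由', '飛ぶ']
```

The particle に is dropped because its first feature field, 助詞 (particle), is not in the kept set.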

Aggregation of morphologically analyzed text data (AKB48)

This step is not directly required for the analysis, so you can skip it. I aggregate the morphologically analyzed text simply to check which words are actually used most often. The result shows that "freedom", the word examined here, is used 75 times. This aggregation could probably be done more simply with pandas.

akb48_count.py


import collections

with open('akb48_wakati.txt', encoding='utf-8') as f:
    lines2 = f.readlines()

# Write one word per line to akb48_count.txt
with open('akb48_count.txt', 'w', encoding='utf-8') as f:
    for line in lines2:
        f.write(line.replace(" ", "\n"))

with open('akb48_count.txt', encoding='utf-8') as f:
    lines3 = f.read().split()

# Count each word and print the counts in descending order
words = collections.Counter(lines3)
print(words.most_common())
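
To look up how often one particular word occurs, rather than printing every count, the Counter can be indexed directly. The sketch below uses a few made-up lines in place of akb48_wakati.txt; 自由 is the Japanese word for "freedom".

```python
import collections

# Made-up stand-in for the lines of akb48_wakati.txt
lines = [
    "自由 世界 飛ぶ",
    "世界 自由",
    "夢 自由",
]

words = collections.Counter(word for line in lines for word in line.split())

print(words["自由"])          # 3
print(words.most_common(2))  # [('自由', 3), ('世界', 2)]
```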

Create a model from lyrics with word2vec (AKB48)

This is the training program. It trains a word2vec model on the space-separated text produced by the morphological analysis above. The result depends on parameters such as size, min_count, and window, so it is worth trying several settings. The trained model is saved as a file called akb48.model.

akb48_word2vec.py


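from gensim.models import word2vec
import logging

# Log training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# One sentence per line, words separated by spaces
sentences = word2vec.LineSentence('akb48_wakati.txt')

# sg=1: skip-gram, hs=1: hierarchical softmax, window=10: context width,
# min_count=5: ignore words seen fewer than 5 times.
# Note: size is the gensim 3.x name; gensim 4.0 and later call it vector_size.
model = word2vec.Word2Vec(sentences, size=100, min_count=5, window=10, hs=1, sg=1, seed=100)
model.save("akb48.model")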
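
Of these parameters, min_count is the easiest to reason about by hand: words that occur fewer than min_count times in the corpus are left out of the vocabulary. The sketch below imitates only that filtering step with the standard library on made-up sentences. It is not gensim's implementation, and surviving_vocabulary is a hypothetical name.

```python
import collections

def surviving_vocabulary(sentences, min_count):
    """Return the words that occur at least min_count times (sketch of the min_count rule)."""
    counts = collections.Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

# Made-up tokenized sentences
sentences = [
    ["自由", "世界"],
    ["自由", "夢"],
    ["自由", "世界"],
]

print(surviving_vocabulary(sentences, min_count=2))  # words seen at least twice
```

A word that falls below the threshold cannot be queried later, which is worth keeping in mind with a corpus as small as one group's lyrics.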

Extract words similar to "freedom" (AKB48)

Finally, the model trained above is used to extract the words most similar to "freedom". The topn argument sets how many of the top-ranked similar words are returned.

akb48_word2vec_model.py


from gensim.models import word2vec

model = word2vec.Word2Vec.load("akb48.model")
# The model's vocabulary is Japanese, so query with 自由 ("freedom")
results = model.wv.most_similar(positive=["自由"], topn=20)
for result in results:
    print(result)  # (word, similarity) pair

Output result (AKB48)

World 0.5644636154174805
Riding the wind 0.5584157705307007
Paradise 0.5396431088447571
Aiya 0.5267466902732849
Fly 0.521303117275238
Halloween 0.5185834765434265
Can fly 0.5173347592353821
Dance 0.4997383952140808
Tied up 0.4945579767227173
Good 0.4936122000217438
Risk 0.49195727705955505
Rule 0.4813954532146454
Aiyaiyai 0.472299188375473
Robbed 0.45674586296081543
Not 0.4543856084346771
Stupid 0.4531818926334381
Have 0.4521175026893616
Teru 0.44793424010276794
Reverberating 0.436395525932312
Far 0.4335517883300781
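
The number printed next to each word is a cosine similarity between word vectors. As a minimal sketch of that measure, using made-up three-dimensional vectors instead of the 100-dimensional ones in the model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, perpendicular ones score 0
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Because only the angle matters, a value around 0.5, as in the list above, indicates a moderate rather than a strong relationship.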

Words similar to "freedom" in Nogizaka46

Next, I extract the words similar to "freedom" in Nogizaka46. The process is the same as for AKB48 above: get the song IDs and titles from Uta-Net, get the lyrics, perform morphological analysis, and train a model with word2vec.

Scraping the lyrics of Nogizaka46

The training data for this analysis is the lyrics of every Nogizaka46 song released so far. Uta-Net lists lyrics by artist, so I scrape them from there. On Uta-Net, each lyrics page is addressed by a song ID, so the first step is to scrape the ID and title of every song from the page that lists Nogizaka46's songs. Each ID can then be used to fetch the lyrics of that song.

Get the song IDs and titles of Nogizaka46

The following program scrapes the song IDs and titles from the Uta-Net search URL that narrows the results down to Nogizaka46 songs, and writes them to a file called nogi46_id.csv.

scraping_nogizaka46_id.py


# -*- coding:utf-8 -*-
import csv
import requests
from bs4 import BeautifulSoup

# Search URL for the URL-encoded keyword 乃木坂46; {0} fills in Bselect
target_url = 'https://www.uta-net.com/search/?Keyword=%E4%B9%83%E6%9C%A8%E5%9D%8246&x=0&y=0&Aselect=1&Bselect={0}'

with open('nogi46_id.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)  # quotes titles that contain commas
    writer.writerow(['code', 'title'])

    for i in range(1, 2):  # runs once, with i = 1
        r = requests.get(target_url.format(i))    # Get from the web using requests
        soup = BeautifulSoup(r.text, 'html5lib')  # Extract elements
        # Each of these cells holds the link to the song page and the song title
        cells = soup.find_all('td', {'class': 'side td1'})

        for cell in cells:
            # Drop the first six characters of the link and any slashes to leave the ID
            code = cell.find('a').attrs['href'][6:].replace("/", '')
            print(code, cell.text)
            writer.writerow([code, cell.text])
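
The long %E4... part of the search URL is simply the keyword 乃木坂46 percent-encoded as UTF-8. A short sketch with the standard library shows how that string is produced:

```python
from urllib.parse import quote, unquote

keyword = "乃木坂46"

encoded = quote(keyword)  # percent-encode the UTF-8 bytes
print(encoded)            # %E4%B9%83%E6%9C%A8%E5%9D%8246
print(unquote(encoded))   # 乃木坂46
```

Building the URL this way makes it easier to reuse the script for another artist name.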

Get the lyrics from the Nogizaka46 song IDs

The lyrics text is scraped using the song IDs obtained by the previous program. The scraped text is written to a file called nogi46_lyrics.csv.

scraping_nogizaka46_lyrics.py


# -*- coding:utf-8 -*-
import requests
from bs4 import BeautifulSoup
import pandas as pd

target_url = 'https://www.uta-net.com/song/{0}/'

# Read the song IDs as strings so they go into the URL unchanged
nogi46_01 = pd.read_csv('nogi46_id.csv', dtype='object')
nogi46_02 = nogi46_01["code"].values.tolist()

with open('nogi46_lyrics.csv', 'w', encoding='utf-8') as f:
    f.write("lyrics\n")

    for i in nogi46_02:
        r = requests.get(target_url.format(i))    # Get from the web using requests
        soup = BeautifulSoup(r.text, 'html5lib')  # Extract elements
        # The lyrics are inside the div whose id is kashi_area
        lyrics = soup.find_all('div', {'id': 'kashi_area'})

        for lyric in lyrics:
            # Remove commas so each song stays in the single "lyrics" column
            line = lyric.text.replace(",", '')
            print(line)
            f.write(line + "\n")

Morphological analysis of Nogizaka46 lyrics

Next, the scraped lyrics are morphologically analyzed to produce data suitable for training a word2vec model. word2vec expects its training text as words separated by spaces, so the analyzed output is written in that space-separated form. As stop-word handling, only nouns, adjectives, verbs, and adverbs are kept, and very short tokens are dropped. The result is written to a file called nogi46_wakati.txt.

nogizaka46_mecab.py


import re
import MeCab

# Parts of speech to keep: noun, adjective, verb, adverb.
# MeCab's IPA-style dictionaries report these tags in Japanese.
TARGET_POS = ("名詞", "形容詞", "動詞", "副詞")

# Create the tagger once rather than for every line
tagger = MeCab.Tagger('-Ochasen')
tagger.parse('')  # workaround so that node.surface is read correctly

def extractKeyword(line):
    node = tagger.parseToNode(line)
    keywords = []
    while node:
        if node.feature.split(",")[0] in TARGET_POS:
            keywords.append(node.surface)
        node = node.next
    return keywords

# Matches a token that is a single hiragana character
single = r"^[あ-ん]$"

with open('nogi46_lyrics.csv', 'r', encoding='utf-8') as lyrics:
    text = lyrics.readlines()

with open('nogi46_wakati.txt', 'w', encoding='utf-8') as f:
    for line in text:
        for word in extractKeyword(line):
            # Blank out single hiragana and tokens shorter than 3 UTF-8 bytes
            if re.match(single, word) or len(word.encode('utf-8')) < 3:
                word = ""
            print(word, end=" ")
            f.write(word + " ")
        print("\n")
        f.write("\n")
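
The length check in the script works on UTF-8 bytes rather than characters. Hiragana, katakana, and common kanji take three bytes each, while an ASCII character takes one, so the check removes one- and two-letter ASCII tokens but not a single kanji. A small sketch (is_too_short is a hypothetical name):

```python
def is_too_short(token):
    """True when the token is shorter than 3 bytes in UTF-8, as in the script's check."""
    return len(token.encode("utf-8")) < 3

# Print each token with its byte length and whether it would be dropped
for token in ["a", "OK", "夢", "自由", "yes"]:
    print(token, len(token.encode("utf-8")), is_too_short(token))
```

This is why single hiragana characters need their own regular-expression check: at three bytes each, they pass the byte-length test.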

Aggregation of morphologically analyzed text data (Nogizaka46)

This step is not directly required for the analysis, so you can skip it. I aggregate the morphologically analyzed text simply to check which words are actually used most often. The result shows that "freedom", the word examined here, is used 60 times. This aggregation could probably be done more simply with pandas.

nogizaka46_count.py


import collections

with open('nogi46_wakati.txt', encoding='utf-8') as f:
    lines2 = f.readlines()

# Write one word per line to nogizaka46_count.txt
with open('nogizaka46_count.txt', 'w', encoding='utf-8') as f:
    for line in lines2:
        f.write(line.replace(" ", "\n"))

with open('nogizaka46_count.txt', encoding='utf-8') as f:
    lines3 = f.read().split()

# Count each word and print the counts in descending order
words = collections.Counter(lines3)
print(words.most_common())

Create a model from lyrics with word2vec (Nogizaka46)

This is the training program. It trains a word2vec model on the space-separated text produced by the morphological analysis above. The result depends on parameters such as size, min_count, and window, so it is worth trying several settings. The trained model is saved as a file called nogizaka46.model.

nogizaka46_word2vec.py


from gensim.models import word2vec
import logging

# Log training progress
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# One sentence per line, words separated by spaces
sentences = word2vec.LineSentence('nogi46_wakati.txt')

# Same settings as the AKB48 model so that the two results are comparable
model = word2vec.Word2Vec(sentences, size=100, min_count=5, window=10, hs=1, sg=1, seed=100)
model.save("nogizaka46.model")
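
One practical caveat: the size argument is the name used by gensim 3.x, and gensim 4.0 renamed it to vector_size. A hedged sketch of choosing the keyword from a version string follows. dimension_keyword is a hypothetical helper, and in a real script the version string would come from gensim.__version__.

```python
def dimension_keyword(gensim_version):
    """Return the Word2Vec keyword for the vector dimension for a given gensim version string."""
    major = int(gensim_version.split(".")[0])
    return "vector_size" if major >= 4 else "size"

# Build the keyword arguments without importing gensim
params = {dimension_keyword("4.3.2"): 100, "min_count": 5, "window": 10}
print(params)
print(dimension_keyword("3.8.3"))
```

The resulting dictionary could then be passed to the constructor as word2vec.Word2Vec(sentences, **params).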

Extract words similar to "freedom" (Nogizaka46)

Finally, the model trained above is used to extract the words most similar to "freedom". The topn argument sets how many of the top-ranked similar words are returned.

nogizaka46_word2vec_model.py


from gensim.models import word2vec

model = word2vec.Word2Vec.load("nogizaka46.model")
# The model's vocabulary is Japanese, so query with 自由 ("freedom")
results = model.wv.most_similar(positive=["自由"], topn=20)
for result in results:
    print(result)  # (word, similarity) pair

Output result (Nogizaka46)

Privilege 0.6652380228042603
Take off 0.6220737099647522
Naro 0.5961438417434692
Start 0.5454341173171997
Border 0.45137685537338257
Look up 0.44773566722869873
Get 0.4456521272659302
Required 0.44296208024024963
Friends 0.4364272952079773
Receive 0.4297247529029846
Advance 0.4280410706996918
Sky 0.4277403652667999
Youth 0.422186940908432
Border 0.42211493849754333
Uniform 0.4201713502407074
Discotic 0.4199380874633789
Knock 0.41815412044525146
Takumi 0.4154769778251648
Get up 0.412681519985199
Girls Talk 0.40724512934684753
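
Conceptually, most_similar ranks every other word in the vocabulary by its cosine similarity to the query word and returns the top topn. The following sketch does that over a tiny dictionary of made-up vectors. It only illustrates the idea and is not how gensim implements it internally; the words and numbers are invented, and most_similar_toy is a hypothetical name.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def most_similar_toy(vectors, word, topn):
    """Rank all other words by cosine similarity to `word` and return the top `topn`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented two-dimensional vectors
vectors = {
    "自由": [1.0, 0.0],
    "空": [0.9, 0.1],
    "制服": [0.1, 0.9],
    "友達": [0.5, 0.5],
}

for word, score in most_similar_toy(vectors, "自由", topn=2):
    print(word, score)
```

Looking up a word that is missing from the dictionary raises KeyError here, just as a query fails in a trained model when the word was removed by min_count.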

Result of analysis

In AKB48, words with a message-like feel such as "world", "riding the wind", and "paradise" stand out, while in Nogizaka46, words that suggest a bird's-eye view such as "privilege" and "border" are conspicuous.

AKB48 | Cosine similarity | Nogizaka46 | Cosine similarity
World | 0.5644636154174805 | Privilege | 0.6652380228042603
Riding the wind | 0.5584157705307007 | Take off | 0.6220737099647522
Paradise | 0.5396431088447571 | Naro | 0.5961438417434692
Aiya | 0.5267466902732849 | Start | 0.5454341173171997
Jump | 0.521303117275238 | Border | 0.45137685537338257
Halloween | 0.5185834765434265 | Look up | 0.44773566722869873
Can fly | 0.5173347592353821 | Get | 0.4456521272659302
Dance | 0.4997383952140808 | Necessary | 0.44296208024024963
Tied up | 0.4945579767227173 | Friends | 0.4364272952079773
Good | 0.4936122000217438 | Receive | 0.4297247529029846
Risk | 0.49195727705955505 | Advance | 0.4280410706996918
Rule | 0.4813954532146454 | Sky | 0.4277403652667999
I don't like it | 0.472299188375473 | Youth | 0.422186940908432
Robbed | 0.45674586296081543 | Border | 0.42211493849754333
Absent | 0.4543856084346771 | Uniform | 0.4201713502407074
Stupid | 0.4531818926334381 | Discotic | 0.4199380874633789
Have | 0.4521175026893616 | Knock | 0.41815412044525146
Teru | 0.44793424010276794 | Pioneer | 0.4154769778251648
Reverberate | 0.436395525932312 | Get up | 0.412681519985199
Far | 0.4335517883300781 | Girls talk | 0.40724512934684753

References

"OK word2vec! Tell me the meaning of "seriously""
"I tried word2vec in Python"
