[PYTHON] I tried to vectorize the lyrics of Hinatazaka46!

0: What I want to do

Using **word2vec** on the lyrics of all Hinatazaka46 songs, I would like to convert **natural language ⇒ numerical values** and play with the result.

What is Word2vec?

**Magic that turns words into vectors**


**How do words turn into vectors?**

The weights W and W′ are trained using the words surrounding the input word (**here, the words at distance 1**) as teacher data. The trained weight W represents the vector of each word.
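To make that concrete, here is a minimal sketch (my addition, using a made-up tokenized line rather than the actual lyrics data) of how (input word, surrounding word) teacher pairs are generated with a window of 1:

import itertools

tokens = ["君", "の", "笑顔", "が", "好き"]  # hypothetical tokenized lyric line
window = 1  # look one word to the left and one to the right

pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))  # (input word, context word)

print(pairs)
# [('君', 'の'), ('の', '君'), ('の', '笑顔'), ('笑顔', 'の'), ('笑顔', 'が'), ...]

word2vec trains the weight matrices so that each input word predicts its context words (or vice versa); the rows of the learned W are the word vectors.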

Flow of natural language processing


1: Data collection

Collect data according to the task you want to solve

2: Cleaning process

Remove meaningless noise such as HTML tags
・**Beautiful Soup**
・**re module in the standard library**

1,2: Data collection & cleaning process

# 1. Scraping
import requests
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')
target_url = "https://www.uta-net.com/search/?Aselect=1&Keyword=%E6%97%A5%E5%90%91%E5%9D%82&Bselect=3&x=0&y=0"
r = requests.get(target_url)
soup = BeautifulSoup(r.text,"html.parser")
music_list = soup.find_all('td', class_='side td1')
url_list = [] #Extract the URL of each song name from the song list and put it in the list
for elem in music_list:
    a = elem.find("a")
    b = a.attrs['href']
    url_list.append(b)

#<td class="side td1">
#    <a href="/song/291307/"> Azato Kawaii </a>
#</td>  
#<td class="side td1">
#    <a href="/song/250797/"> Uncomfortable and grown up </a>
#</td>  
hinatazaka_kashi = ""  # Send a request for each song and extract the lyrics
base_url = "https://www.uta-net.com"
for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    r = requests.get(target_url)
    soup = BeautifulSoup(r.text, "html.parser")
    div_list = soup.find_all("div", id="kashi_area")

    for div in div_list:
        hinatazaka_kashi += div.text

#<div id="kashi_area" itemprop="text">
#I was caught(Yeah at a glance, Yeah, Yeah)
#<br>
#I fell in love without permission
#<br>
#It's not your fault
# 2. Preprocessing (remove English letters, digits, and symbols with regular expressions)
import re
kashi = re.sub("[a-zA-Z0-9_]", "", hinatazaka_kashi)  # Delete alphanumeric characters
kashi = re.sub("[!-/:-@[-`{-~]", "", kashi)  # Delete ASCII symbols
kashi = re.sub("\n\n", "\n", kashi)  # Collapse consecutive line breaks
kashi = re.sub("\r", "", kashi)  # Delete carriage returns

# Delete spaces and full-width Japanese punctuation
for ch in [' ', '\u3000', '？', '。', '…', '！', '「', '」', '“', '”', '、', '・']:
    kashi = kashi.replace(ch, '')
with open("hinatazaka_kashi_1.txt",mode="w",encoding="utf-8") as fw:
    fw.write(kashi)

3: Word normalization

Unify half-width and full-width characters, letter case, and so on.

・**Okurigana**: 「行なう」 and 「行う」, 「受け付け」 and 「受付」

・**Character type**: 「リンゴ」 and 「りんご」, 「犬」, 「いぬ」 and 「イヌ」

・**Upper and lower case**: "Apple" and "apple"

**※ Skipped this time**
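Normalization is skipped in this article, but as a minimal sketch (my addition, standard library only): NFKC normalization unifies half-width/full-width variants, and lower() unifies case.

import unicodedata

text = "ﾋﾅﾀｻﾞｶ４６ ＡＢＣ"  # hypothetical input mixing half-width kana and full-width alphanumerics
normalized = unicodedata.normalize("NFKC", text)  # unify character widths
print(normalized)          # ヒナタザカ46 ABC
print(normalized.lower())  # ヒナタザカ46 abc  (case unified)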

4: Morphological analysis (word division)

Divide sentences into words
・**MeCab**
・**Janome**
・**Juman++**

5: Conversion to base form

Unify words to their stem (the part that does not inflect). Example: "learned" → "learn". Recent implementations sometimes skip this conversion.
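The pipeline below keeps surface forms, but for reference here is a minimal sketch of base-form conversion (my addition; it assumes MeCab and its IPA dictionary are installed). In MeCab's ChaSen output format, the third column holds the base form of each morpheme:

import MeCab

m = MeCab.Tagger("-Ochasen")
# Each ChaSen-formatted line is: surface \t reading \t base form \t part of speech \t ...
for line in m.parse("歌を歌った").splitlines():
    cols = line.split("\t")
    if len(cols) >= 3:
        print(cols[0], "→", cols[2])  # e.g. 歌っ → 歌う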

4,5: Morphological analysis (word division) & conversion to base form

path="hinatazaka_kashi_1.txt"
f = open(path,encoding="utf-8")
data = f.read()  #Returns the data read all the way to the end of the file
f.close()
#3.Morphological analysis
import MeCab

text = data
m = MeCab.Tagger("-Ochasen")#Tagger instance creation for parsing text

nouns = [line for line in m.parse(text).splitlines()#Using the parse method of the Tagger class returns the result of morphological analysis of the text
               if "noun" or "Adjectival noun" or "adjective" or"Adjectival noun" or "verb" or "固有noun" in line.split()[-1]]
自然言語処理の流れ
nouns = [line.split()[0] for line in m.parse(text).splitlines()
               if "noun" or "Adjectival noun" or "adjective" or "Adjectival noun" or "verb" or "固有noun" in line.split()[-1]]
自然言語処理の流れ

6: Stop word removal

Remove useless words, such as words that appear too frequently. Recent implementations sometimes skip this removal.


# Japanese function words, auxiliaries, and fillers to discard
my_stop_word = ["する", "てる", "なる", "いる", "こと", "の", "ん", "y", "一", "さ", "そう", "られ", "いい", "ある", "よ", "もの", "ない", "しまう",
                "れる", "くれ", "から", "かな", "あの", "けど", "だけ", "っ", "て", "まで", "って", "また", "たい", "なら", "たら", "なく", "れ", "まま", "たく"]

nouns_new = [w for w in nouns if w not in my_stop_word]  # keep only words that are not stop words
with open("hinatazaka_kashi_2.txt", mode="w", encoding="utf-8") as f:
    f.write("\n".join(nouns_new))

7: Word digitization

Convert strings to numbers so that they can be handled by machine learning

8: Model learning

Choose a model according to the task, from classic machine learning to neural networks. For now, just grasp which parts of this flow correspond to preprocessing.

7,8: Word digitization & model learning

from gensim.models import word2vec

corpus = word2vec.LineSentence("hinatazaka_kashi_2.txt")
model = word2vec.Word2Vec(corpus, size=100, min_count=3, window=5, iter=30)
# Note: gensim >= 4.0 renames size= to vector_size= and iter= to epochs=
model.save("hinatazaka.model")
model = word2vec.Word2Vec.load("hinatazaka.model")

# Look at the 10 words most similar to 好き ("like")
print('Top 10 words related to 好き (like)')
similar_words = model.wv.most_similar(positive=["好き"], topn=10)
for key, value in similar_words:
    print('{}\t\t{:.2f}'.format(key, value))

print('-----')
# Calculate the similarity between two words
similarity = model.wv.similarity(w1="笑顔", w2="夏")
print('Similarity between 笑顔 (smile) and 夏 (summer) => ' + str(similarity))

similarity = model.wv.similarity(w1="友達", w2="夏")
print('Similarity between 友達 (friend) and 夏 (summer) => ' + str(similarity))

similarity = model.wv.similarity(w1="女の子", w2="男")
print('Similarity between 女の子 (girl) and 男 (man) => ' + str(similarity))

Degree of similarity

The similarity shown here is the **cosine similarity**. Put simply, cosine similarity is a number expressing how much two vectors point in the same direction: a value near 0 indicates low similarity, and a value near 1 indicates high similarity. The cosine similarity is given by the following formula.

$$
\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}
$$
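As a quick check (my addition), the same number that model.wv.similarity returns can be computed directly from the word vectors with NumPy; the word keys assume the model trained above, and any two in-vocabulary words work.

import numpy as np

def cos_similarity(x, y):
    # Dot product of the two vectors divided by the product of their norms
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = model.wv["笑顔"]
y = model.wv["夏"]
print(cos_similarity(x, y))  # matches model.wv.similarity(w1="笑顔", w2="夏")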

[Summary]

*** "Papa Duwa Duwa Duwa Duwa Duwa Duwa Duwa Papa Papa" *** What is it? If you get the data from the member's blog instead of the lyrics, you can see how close the members are. (Let's try next time ...)

