Using **word2vec** on the lyrics of every Hinatazaka46 song, I would like to convert **natural language ⇒ numerical values** and play with the result.
**Magic that converts words into vectors**

**How do words turn into vectors?**

The weights W and W' are learned by using the words around the input word (**this time, within a distance of 1**) as the teacher data. The learned weight W gives the vector of each word.
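To make "the words around the input word" a little more concrete, here is a minimal sketch of how (input word, surrounding word) training pairs could be generated with a distance of 1. The function name `skipgram_pairs` and the toy sentence are made up for illustration; this is not what gensim does internally, only the idea behind the teacher data.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 1) -> List[Tuple[str, str]]:
    """Return (input word, surrounding word) pairs within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # the input word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, already split into words
sentence = ["I", "like", "summer", "smiles"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)
```

Each pair is one training example: the network sees the first word and is asked to predict the second, and the weights W are adjusted so that words appearing in similar surroundings end up with similar vectors.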
 
Collect data suited to the task you want to solve.
Remove meaningless noise such as HTML tags.
・ **Beautiful Soup**
・ **Standard library re module**
#1.Scraping
import requests
from bs4 import BeautifulSoup
import warnings
warnings.filterwarnings('ignore')
target_url = "https://www.uta-net.com/search/?Aselect=1&Keyword=%E6%97%A5%E5%90%91%E5%9D%82&Bselect=3&x=0&y=0"
r = requests.get(target_url)
soup = BeautifulSoup(r.text,"html.parser")
music_list = soup.find_all('td', class_='side td1')
url_list = [] #Extract the URL of each song name from the song list and put it in the list
for elem in music_list:
    a = elem.find("a")
    b = a.attrs['href']
    url_list.append(b)
#<td class="side td1">
#    <a href="/song/291307/"> Azato Kawaii </a>
#</td>  
#<td class="side td1">
#    <a href="/song/250797/"> Uncomfortable and grown up </a>
#</td>  
 
hinatazaka_kashi = "" #Send a request for each song and extract the lyrics
base_url = "https://www.uta-net.com"
for song_url in url_list:
    target_url = base_url + song_url
    r = requests.get(target_url)
    soup = BeautifulSoup(r.text,"html.parser")
    div_list = soup.find_all("div", id = "kashi_area")
    for div in div_list:
        hinatazaka_kashi += div.text
#<div id="kashi_area" itemprop="text">
#I was caught(Yeah at a glance, Yeah, Yeah)
#<br>
#I fell in love without permission
#<br>
#It's not your fault
 
#Preprocessing (remove English letters, digits, and symbols with regular expressions)
import re
kashi = re.sub(r"[a-zA-Z0-9_]", "", hinatazaka_kashi) #Delete ASCII alphanumeric characters
kashi = re.sub(r"[!-/:-@\[-`{-~]", "", kashi) #Remove ASCII symbols
kashi = re.sub(r"\n\n", "\n", kashi) #Collapse blank lines
kashi = re.sub(r"\r", "", kashi) #Remove carriage returns
kashi = re.sub("\u3000", "", kashi) #Remove full-width spaces
#Remove remaining spaces and Japanese punctuation.
#The full-width ? and ! are an assumption: the half-width forms are already removed above.
for ch in [' ', '?', '。', '…', '!', '「', '」', '“', '”', '、', '・']:
    kashi = kashi.replace(ch, '')
 
with open("hinatazaka_kashi_1.txt",mode="w",encoding="utf-8") as fw:
    fw.write(kashi)
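Unify half-width and full-width characters, lowercase and uppercase letters, and so on.
・ **Okurigana**: variant kana endings of the same word (for example the two spellings of "do", or of "reception")
・ **Character type**: the same word written in different scripts (for example "apple" in hiragana or katakana, or "dog" in kanji, katakana, or hiragana)
・ **Uppercase and lowercase**: Apple and apple
**\* Ignored this time**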
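This step is skipped in this article, but as a rough sketch, the half-width/full-width and uppercase/lowercase part could be handled with the standard library alone. The helper name `normalize_text` is made up for illustration. Okurigana and script variants are not handled by this; those would need a dictionary or hand-written rules.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold full-width/half-width variants with NFKC, then lowercase."""
    return unicodedata.normalize("NFKC", text).lower()


print(normalize_text("Apple"))  # full-width Latin letters
print(normalize_text("Apple"))      # half-width Latin letters
```

Both calls give the same string, so the two spellings would be counted as one word later on.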
Split sentences into words.
・ **MeCab**
・ **Janome**
・ **Juman++**
Unify each word to its stem (the part that does not inflect). Example: an inflected form of "learn" → "learn". Recent implementations sometimes skip the conversion to the base form.
path="hinatazaka_kashi_1.txt"
f = open(path,encoding="utf-8")
data = f.read()  #Returns the data read all the way to the end of the file
f.close()
#3.Morphological analysis
import MeCab
text = data
m = MeCab.Tagger("-Ochasen") #Create a Tagger instance for parsing text
#Part-of-speech tags as MeCab prints them (assuming the IPA dictionary): noun, adjective, verb.
#Proper nouns (固有名詞) and adjectival nouns (形容動詞語幹) are subtypes of 名詞, so they are covered too.
target_pos = ("名詞", "形容詞", "動詞")
nouns = []
for line in m.parse(text).splitlines(): #parse() returns the morphological analysis result, one token per line
    fields = line.split("\t") #ChaSen format: surface, reading, base form, part of speech, ...
    if len(fields) >= 4 and fields[3].startswith(target_pos): #Also skips the final "EOS" line
        nouns.append(fields[0])
 
Remove useless words, such as words that appear too often. In recent implementations, they are sometimes not removed.
my_stop_word=["To do","Teru","Become","Is","thing","of","Hmm","y","one","Sa","so","To be","Good","is there","Yo","もof","Absent","End up",
                 "Be","Give me","From","I wonder","That","but","Only","Tsu","hand","Until","Tsuhand","See you","Want","If","Cod","Without","Be","As it is","Taku"]
nouns_new = [word for word in nouns if word not in my_stop_word] #Keep only the words that are not stop words
 
import codecs
with codecs.open("hinatazaka_kashi_2.txt", "w", "utf-8") as f:
    f.write(" ".join(nouns_new)) #Space-separated, because word2vec.LineSentence treats each line as one sentence
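The stop word list above was written by hand. As an alternative sketch of the "words that appear too often" idea, the most frequent tokens could be picked automatically with `collections.Counter`. The helper names, the toy token list, and the cutoff `top_n=1` are all made up for illustration; a real cutoff would have to be chosen by looking at the data.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_words(tokens: Iterable[str], top_n: int) -> Set[str]:
    """Return the `top_n` most frequent tokens as stop word candidates."""
    return {word for word, _ in Counter(tokens).most_common(top_n)}


def remove_stop_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Drop every token that is in `stop_words`, keeping the original order."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical toy tokens standing in for the morphological analysis result
tokens = ["of", "summer", "of", "smile", "of", "summer", "friend"]
stop_words = frequent_words(tokens, top_n=1)
filtered = remove_stop_words(tokens, stop_words)
print(stop_words, filtered)
```

A purely frequency-based list can also throw away words you care about (in lyrics, a word like "like" is probably very frequent), so checking the candidates by eye before removing them seems safer.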
Convert strings to numbers so that they can be handled by machine learning.
Choose a model, from classic machine learning to neural networks, according to the task. Now, let's see which part of this flow corresponds to preprocessing.
from gensim.models import word2vec
corpus = word2vec.LineSentence("hinatazaka_kashi_2.txt")
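#size and iter are the gensim 3.x argument names; gensim 4.0 and later renamed them to vector_size and epochs
model = word2vec.Word2Vec(corpus, size=100 ,min_count=3,window=5,iter=30)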
model.save("hinatazaka.model")
model = word2vec.Word2Vec.load("hinatazaka.model")
#See words that are similar to "Like"
print('Top 10 words related to "Like"')
similar_words = model.wv.most_similar(positive=[u"Like"], topn=10)
for key,value in similar_words:
    print('{}\t\t{:.2f}'.format(key, value))
print('-----')
#Calculate the similarity between two words
similarity = model.wv.similarity(w1=u"Smile", w2=u"summer")
print('Similarity between "smile" and "summer"=>' + str(similarity))
similarity = model.wv.similarity(w1=u"friend", w2=u"summer")
print('Similarity between "friend" and "summer"=>' + str(similarity))
similarity = model.wv.similarity(w1=u"girl", w2=u"Man")
print('Similarity between "girl" and "man"=>' + str(similarity))
 
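The similarity that appears here is **cos similarity**. Put simply, cos similarity is a number that expresses how closely two vectors point in the same direction. A cos similarity close to 0 indicates low similarity, and a cos similarity close to 1 indicates high similarity. For two word vectors $\boldsymbol{a}$ and $\boldsymbol{b}$, the cos similarity is expressed by the following formula.

$$
\cos(\boldsymbol{a}, \boldsymbol{b}) = \frac{\boldsymbol{a} \cdot \boldsymbol{b}}{\|\boldsymbol{a}\| \, \|\boldsymbol{b}\|}
$$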
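As a small numerical illustration of cos similarity, here is a sketch in plain Python. The function name `cos_similarity` and the toy 2-dimensional vectors are made up for illustration (the real word vectors here have 100 dimensions), and the sketch assumes neither vector is all zeros.

```python
import math
from typing import Sequence


def cos_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes both vectors are non-zero


print(cos_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction
print(cos_similarity([1.0, 0.0], [0.0, 1.0]))  # at right angles
```

The first pair points in the same direction, so the value comes out at (approximately) 1 even though the lengths differ. The second pair is at right angles, so the value is 0.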
 
***"Papa Duwa Duwa Duwa Duwa Duwa Duwa Duwa Papa Papa"*** What is that? If you collected data from the members' blogs instead of the lyrics, you might be able to see how close the members are to one another. (Let's try that next time ...)
1: Basics of Natural Language Processing (TensorFlow)
2: [uepon daily memorandum](https://uepon.hatenadiary.com/entry/2019/02/10/150804)
3: [Np-Ur data analysis class](https://www.randpy.tokyo/entry/word2vec_skip_gram_model)