[PYTHON] I analyzed whether Miyuki Nakajima really sings "I" and Yumin really sings "you"

Original article

https://www.news-postseven.com/archives/20201203_1617538.html/2

In the article, it is written as follows.

Miyuki Nakajima sings "I" and Yumin sings "you". Miyuki Nakajima sings "night," "crying," and "lie." Yumin sings "morning," "love," and "like."

That does feel right. I don't know either artist in depth, which may be exactly why it sounds convincing. Apart from the annoying double quotes in the original article, it reads smoothly.

Are you sure?

But I think it is a little too catchy. Let's do a bit of analysis to see whether it really holds, and then write a catchphrase of our own.

Since I'm going to the trouble anyway, I'll also take the chance to learn some libraries I don't usually use.

How to proceed

Something like this should be doable even for someone like me, sitting in a corner of the IT industry hugging my knees.

Thank you to the people who build such nice libraries.

Scraping

For scraping I use Beautiful Soup, which I have always relied on. Scraping the lyrics is quick, but one thing worries me a little: copyright.

About copyright of lyrics

https://dailytextmining.hatenablog.com/entry/2018/08/02/065500

Hmm, it seems to be fine as long as the lyrics are only used for data analysis. Leaking the lyrics themselves, however, would be a problem. I push to git a lot these days, and if you carelessly upload the raw scraped data you are likely to qualify for the calendar below, so be careful.
https://qiita.com/advent-calendar/2020/yarakashi-production

Morphological analysis

I have used MeCab a lot until now, but it always bothered me in one way or another. This time I will try GiNZA right away.
Japanese NLP Library GiNZA Recommendation

Let me take a detour and parse and visualize a dependency structure.

I mostly use this kind of tool at work, so I envy the elephant a little...
Me: "Building a user dictionary by hand is painful. Santa Maria."

One evening, the elephant was in his hut, eating three bundles of straw and looking up at the moon of the tenth day, and said, "It's painful. Santa Maria."
Source: Aozora Bunko, "Obbel and the Elephant" by Kenji Miyazawa

The elephant's heartfelt words can be visualized like this.

displacy.PNG

what's this? 3 Do you want to go?

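import spacy
from spacy import displacy
nlp = spacy.load('ja_ginza')  # GiNZA model; needs the ginza and ja_ginza packages
doc = nlp('''One evening, an elephant was in an elephant hut, eating three straws, looking up at the moon on the tenth day, and saying, "It's painful. Santa Maria."''')
# Start a local web server that shows the dependency parse
displacy.serve(doc, style='dep')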

Return to morphological analysis

As a salaryman I feel a lot of sympathy for the elephant, but I will wipe my tears and get on with the work in GiNZA. What I want from morphological analysis is the part of speech and the base form (lemma) of each word. Since the source material is lyrics, I first thought nouns alone would be enough, but the result looked too sparse. So I keep nouns, adjectives, and verbs, and convert each extracted word to its base form.

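def make_words_list(text: str) -> list:
    """Return the lemmas of all nouns, adjectives, and verbs in text."""
    rs = []
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            # GiNZA tags look like '名詞-普通名詞-一般'; keep only the top-level category
            tag = token.tag_.split('-')[0]
            # 名詞 = noun, 形容詞 = adjective, 動詞 = verb (use ['名詞'] for nouns only)
            if tag in ['名詞', '形容詞', '動詞']:
                rs.append(token.lemma_)
    return rs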

As shown above, spaCy itself is wonderful, but I am especially grateful to the GiNZA team for making it work so well with Japanese.
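To see what the part-of-speech filter does without installing GiNZA, here is a minimal sketch that runs on hand-written (lemma, tag) pairs. The tag strings are my assumption of what `token.tag_` looks like under GiNZA (hyphen-separated Japanese categories); the sample tokens and the helper name `filter_lemmas` are hypothetical and not part of the original code.

```python
# Hypothetical (lemma, tag) pairs standing in for spaCy tokens.
# Assumption: tags are hyphen-separated, e.g. '名詞-普通名詞-一般'.
SAMPLE_TOKENS = [
    ("象", "名詞-普通名詞-一般"),
    ("は", "助詞-係助詞"),
    ("苦しい", "形容詞-一般"),
    ("見上げる", "動詞-一般"),
    ("。", "補助記号-句点"),
]

KEEP = {"名詞", "形容詞", "動詞"}  # noun, adjective, verb


def filter_lemmas(tokens, keep=KEEP):
    """Keep lemmas whose top-level part of speech is in `keep`."""
    return [lemma for lemma, tag in tokens if tag.split("-")[0] in keep]


content_words = filter_lemmas(SAMPLE_TOKENS)
nouns_only = filter_lemmas(SAMPLE_TOKENS, keep={"名詞"})
```

Restricting `keep` to nouns shows how quickly the word list thins out, which is why I widened the filter to three parts of speech.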

DataFrame state

The key is the pandas DataFrame. Next I use nlplot to visualize it nicely, and it is very convenient because **the DataFrame column can be handed over as is**.

I have omitted the code, but the title and lyrics columns are already filled in from the scraping step, and the words column holds the word list for each song.

| title | lyrics | words |
| --- | --- | --- |
| For kindness... | Good poem | [Word 1, Word 2, Word 3] |
| Airplane... | Good poem | [Word 4, Word 5, Word 6] |

Output nicely

Think about visualization

This time I will use nlplot. I have been curious about it for a long time but never had a chance to use it, so I'm taking this opportunity.

  1. N-gram bar chart
  2. N-gram tree Map
  3. wordcloud
  4. co-occurrence networks
  5. sunburst chart

Items 3-5 in particular are things I have never done before.

N-gram bar chart

Oh, this is nice! It looks clean. It is also interactive, because it is rendered in the browser with Plotly.

newplot.png
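For reference, what an N-gram bar chart counts can be sketched with the standard library alone. This is only an illustration of the idea, not nlplot's implementation; the word lists are made up and the helper name `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(word_lists, n=1):
    """Count n-grams (as space-joined strings) over a list of word lists."""
    counts = Counter()
    for words in word_lists:
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


# Made-up word lists, one per song (not real lyrics).
songs = [
    ["sky", "sea", "sky", "wind"],
    ["sky", "sea", "night"],
]

unigrams = count_ngrams(songs, n=1)
bigrams = count_ngrams(songs, n=2)
top_unigram = unigrams.most_common(1)[0]
```

The bars in the chart correspond to the largest values in a counter like `unigrams` or `bigrams`.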

N-gram tree Map

This is flashier than the bar chart. It works well when you want the overall feel rather than the fine numbers. It might be good in a presentation as a palate cleanser or as a chapter title page.

newplot (1).png

newplot (2).png

wordcloud

Here is what the word clouds look like. A word cloud doesn't look good unless there are a fair number of long words.

Yumi Matsutoya Figure 2020-12-28 161908.png

Miyuki Nakajima Figure 2020-12-28 161847.png

co-occurrence networks

This is also rendered in the browser with Plotly, so it is interactive too. A co-occurrence network is interesting when you want to look at the relationships between words, as here. Above all, I'm glad it is so easy to make.

Yumi Matsutoya newplot (3).png

Miyuki Nakajima newplot (4).png
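The edges of a co-occurrence network can be sketched roughly as follows. I am assuming here that two words "co-occur" when they appear in the same song; that is my simplification, not necessarily how nlplot builds its graph. The data and the helper name `count_cooccurrences` are made up.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(word_lists):
    """Count unordered word pairs that appear in the same document."""
    pairs = Counter()
    for words in word_lists:
        unique = sorted(set(words))  # count each pair once per document
        pairs.update(combinations(unique, 2))
    return pairs


# Made-up word lists, one per song (not real lyrics).
songs = [
    ["you", "morning", "love"],
    ["you", "love", "wind"],
    ["night", "wind"],
]

edges = count_cooccurrences(songs)
# Keep only pairs seen in at least two songs, like a minimum edge weight.
strong_edges = {pair: n for pair, n in edges.items() if n >= 2}
```

Raising the threshold thins out the graph, which is usually what makes a network like this readable.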

sunburst chart

This one is impressive too; the output is quite clean. The picture looks much the same for both artists, and I wish it carried a stronger message, but that is my own fault. I should have added stop words...

newplot (5).png

newplot (6).png
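One simple way to get stop words, sketched under my own assumptions, is to treat the most frequent words overall as stop words and drop them before plotting. This is not what I actually ran for the charts above; the data and the helper names `top_n_stopwords` and `remove_stopwords` are hypothetical.

```python
from collections import Counter


def top_n_stopwords(word_lists, top_n):
    """Treat the `top_n` most frequent words overall as stop words."""
    counts = Counter(word for words in word_lists for word in words)
    return {word for word, _ in counts.most_common(top_n)}


def remove_stopwords(word_lists, stopwords):
    """Drop stop words from every word list, preserving word order."""
    return [[w for w in words if w not in stopwords] for words in word_lists]


# Made-up word lists, one per song (not real lyrics).
songs = [
    ["do", "you", "sky", "do"],
    ["do", "you", "sea"],
    ["do", "night", "you"],
]

stopwords = top_n_stopwords(songs, top_n=2)
cleaned = remove_stopwords(songs, stopwords)
```

With the shared high-frequency words gone, the words that differ between the two artists have a better chance of showing up in the chart.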

If I were to write a catchphrase

**"Yumin sings time, Miyuki Nakajima sings place."** I was surprised when I ran the analysis: the top words are almost the same for both, aren't they?

That means the less frequent words may be more characteristic, so let's look at those. Yumin tends to use many verbs, and Miyuki Nakajima many nouns. Miyuki Nakajima also seems to have many nature words such as "sky" and "sea", while Yumin has many words referring to people, such as "two people" and "you".
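Since the top words overlap so much, one way to surface the difference is to compare relative frequencies between the two artists. The following is a minimal sketch of that idea, not the procedure behind the observations above; the word lists are made up and `distinctive_words` is a hypothetical helper.

```python
from collections import Counter


def distinctive_words(words_a, words_b, top_n=3):
    """Rank words by how much more often (relatively) they occur in A than in B."""
    count_a, count_b = Counter(words_a), Counter(words_b)
    total_a, total_b = len(words_a), len(words_b)
    scores = {
        word: count_a[word] / total_a - count_b[word] / total_b
        for word in count_a
    }
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, score in ranked[:top_n] if score > 0]


# Made-up word lists for two artists (not real lyrics).
artist_a = ["you", "you", "you", "morning", "sky"]
artist_b = ["sky", "sky", "sea", "night", "you"]

only_a = distinctive_words(artist_a, artist_b)
only_b = distinctive_words(artist_b, artist_a)
```

A word like "sky" appears for both artists here, but it only counts as distinctive for the one who uses it relatively more often.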

About the analyst

I am a little younger than the Yumin generation, and I cannot tell "Michopa" from "Yuki Poyo".

About Yumi Matsutoya

・Wind crossing the pier
・Refrain is screaming

I like these. "Wind crossing the pier" is said to be tuned to 450 Hz, which is higher than standard pitch. That refreshing feeling may be something you can only capture with audio analysis rather than natural language analysis.

About Miyuki Nakajima

・Fight!
・Light sleep

I like these. She has also written songs for many other artists.

github https://github.com/Katsutoshi-Inuga/qiita_2020_advent_cal_lyrics_nlp
