[PYTHON] What are the characteristics of an AV actress? I guessed from the title of the work! (^ _ ^) / ~~

Introduction

Have you ever been curious about **AV titles**?

One day a question occurred to me.

"The title of an AV work describes the characteristics of its actress, right?" "If so, those characteristics should reveal my own AV preferences."

Once you have an idea, act on it! Let's do it.

This time, I will test this hypothesis using a technique called a **word cloud**. (I asked my favorite, **Mia Nanasawa**, to cooperate.)

What is Word Cloud?

A "word cloud" is a single image made up of the words that appear frequently in a text, with more frequent words drawn larger. It is one of the quickest and easiest ways to get a feel for a text, because you can see at a glance what it is about.
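As a rough illustration of what a word cloud visualizes, the sketch below counts word frequencies using only the standard library. The function name `word_frequencies` and the sample words are made up for illustration and are not part of the article's data.

```python
from collections import Counter


def word_frequencies(words):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(words).most_common()


# Hypothetical sample words (not taken from the article's data).
sample_words = ["python", "cloud", "python", "word", "cloud", "python"]

for word, count in word_frequencies(sample_words):
    # In a word cloud, a higher count would mean a larger font.
    print(word, count)
```

A word cloud library does this kind of counting internally and then maps each count to a font size.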

HTML acquisition

import requests  # library for fetching web pages
from bs4 import BeautifulSoup  # library for reading and manipulating tags in the fetched HTML

url = "https://ja.wikipedia.org/wiki/%E4%B8%83%E6%B2%A2%E3%81%BF%E3%81%82"  # Mia Nanasawa's Japanese Wikipedia page
response = requests.get(url)
# apparent_encoding holds the character encoding guessed from the response body;
# assigning it to response.encoding prevents garbled characters
response.encoding = response.apparent_encoding
soup = BeautifulSoup(response.text, "html.parser")  # BeautifulSoup(HTML/XML to parse, parser to use)
# prettify() returns the HTML with indentation
print(soup.prettify())


I was able to get the HTML correctly.

Acquisition of work name

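# Collect every table cell; the work titles are in bold (<b>) inside <td> cells
td_list = soup.find_all("td")
titles = []
for td in td_list:
    tmp = td.find("b")
    if tmp is None:
        # Skip cells that contain no bold text
        continue
    print(tmp.text)
    titles.append(tmp.text)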
The output above contains characters that are not needed for this analysis, such as "!" and "-", so we will remove them next.

Cleansing

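import re

changed_titles1 = []

for title in titles:
    tmp = re.sub("!", "", title)   # exclamation marks (this also covers "!!")
    tmp = re.sub("！", "", tmp)    # full-width exclamation mark (assumed, since the titles are Japanese)
    tmp = re.sub(" ", "", tmp)     # spaces
    tmp = re.sub("〜", "", tmp)    # wave dash
    tmp = re.sub("~", "", tmp)     # tilde
    tmp = re.sub("-", "", tmp)     # hyphen
    tmp = re.sub("・", "", tmp)    # middle dot
    tmp = re.sub("「", "", tmp)    # opening Japanese quotation mark
    tmp = re.sub("」", "", tmp)    # closing Japanese quotation mark
    tmp = re.sub("七沢みあ", "", tmp)  # the actress's own name (Mia Nanasawa), as written on the Japanese page
    if tmp:
        # Keep only titles that are not empty after cleansing
        changed_titles1.append(tmp)

changed_titles1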
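The cleansing step above can also be written with a single compiled pattern. This is a minimal sketch, not the author's code: the names `NOISE`, `clean_title`, and `clean_titles` are hypothetical, the full-width characters in the pattern are an assumption, and the sample titles are invented.

```python
import re

# Characters treated as noise. The full-width "！" and the full-width
# space (\u3000) are assumptions added because the titles are Japanese.
NOISE = re.compile(r"[!！ \u3000〜~\-・「」]")


def clean_title(title, name=""):
    """Strip noise characters and, optionally, the performer's name."""
    cleaned = NOISE.sub("", title)
    if name:
        cleaned = cleaned.replace(name, "")
    return cleaned


def clean_titles(titles, name=""):
    """Clean every title and drop the ones that end up empty."""
    cleaned = (clean_title(t, name) for t in titles)
    return [t for t in cleaned if t]


# Invented sample titles, only to show the behavior.
sample = ["Hello! World", "「A・B」〜C〜", "!!"]
print(clean_titles(sample))  # prints ['HelloWorld', 'ABC']
```

A character class makes it easy to see at a glance which characters are removed, and adding one more is a one-character change.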

The unnecessary characters have now been removed. Next, we move on to morphological analysis.

Morphological analysis

import MeCab

changed_titles2 = "".join(changed_titles1)  # join the list into one string, since the parser expects a string
text = changed_titles2
m = MeCab.Tagger("-Ochasen")  # create a Tagger instance for parsing the text

# parse() returns the morphological analysis result, one token per line.
# First, look at the full lines of the nouns only
# (Japanese MeCab dictionaries label nouns as 名詞).
noun_lines = [line for line in m.parse(text).splitlines()
              if "名詞" in line.split()[-1]]
for line in noun_lines:
    print(line.split())

# Then keep only the surface form (the first column) of each noun
nouns = [line.split()[0] for line in m.parse(text).splitlines()
         if "名詞" in line.split()[-1]]
print(nouns)
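To make the noun filter easier to follow without installing MeCab, here is a small sketch that applies the same idea to a hand-written sample. It assumes the ChaSen-style layout of tab-separated columns (surface form, reading, base form, part of speech, then conjugation columns); the exact layout depends on the dictionary in use, and `extract_nouns` is a hypothetical helper, not part of MeCab.

```python
def extract_nouns(chasen_output):
    """Return the surface forms of tokens whose part of speech starts with 名詞 (noun)."""
    nouns = []
    for line in chasen_output.splitlines():
        fields = line.split("\t")
        # Skip the EOS marker and any line without a part-of-speech column.
        if len(fields) < 4:
            continue
        surface, pos = fields[0], fields[3]
        if pos.startswith("名詞"):
            nouns.append(surface)
    return nouns


# Hand-written sample in the assumed ChaSen-style layout (not real MeCab output).
sample_output = (
    "猫\tネコ\t猫\t名詞-一般\t\t\n"
    "が\tガ\tが\t助詞-格助詞-一般\t\t\n"
    "魚\tサカナ\t魚\t名詞-一般\t\t\n"
    "を\tヲ\tを\t助詞-格助詞-一般\t\t\n"
    "食べる\tタベル\t食べる\t動詞-自立\t一段\t基本形\n"
    "EOS\n"
)

print(extract_nouns(sample_output))  # prints ['猫', '魚']
```

Reading the part of speech from a fixed column, rather than from the last whitespace-separated field, keeps the filter from depending on whether the conjugation columns happen to be empty.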

And the result is...!?

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# WordCloud expects one string with words separated by spaces
text_new = " ".join(nouns)

word_cloud = WordCloud(
    background_color="white",
    # Path to a font that can render Japanese; adjust it to your own environment
    font_path=r"C:\Users\tomoh\Machine learning able\Word cloud\meiryo.ttc",
    min_font_size=5,
    prefer_horizontal=1,  # draw every word horizontally
)
word_cloud.generate(text_new)

plt.figure(figsize=(10, 8))
plt.imshow(word_cloud)
plt.axis("off")
plt.show()

[Image: word cloud for Mia Nanasawa (七沢みあ)]

The result above represents the characteristics of Mia Nanasawa **correctly**.

I can say this because I have watched every one of Mia Nanasawa's videos without missing a single one. (Sorry for bringing up my own experience.)

Looking back, I think three things attracted me strongly: **tsundere**, **provocation**, and **female college student**.

If I had a girlfriend, I wish she had these three traits...

Compare with other actresses

[Image: word cloud for Shoko Takahashi (高橋しょうこ)]

**Shoko Takahashi** is a well-known actress who debuted in the gravure world. From this result you can read not only "idol" and "gravure" but also, from the words **"boss" and "older sister"**, the character of an **older, dominant (S-type) woman**.

**Recommended for those with an M temperament who want to be scolded.**

[Image: word cloud for Yua Mikami (三上悠亜)]

**Yua Mikami** is a popular actress and a former member of SKE. From this result you can read not only "idol" but also, from the words **"luxury," "big breasts," and "soap"**, the character of a **high-end soapland hostess**.

**Recommended for those who don't have the money but still want a taste of a high-end soapland.**

[Image: word cloud for Sakura Miura (水卜さくら)]

**Sakura Miura** is an actress I followed before I became a fan of Mia Nanasawa. From this result you can read the characteristics **"boobs," "big breasts," and "plain"**. I think she is probably recommended for **those who like plain, busty women of the anime-otaku type**.

From the above results, the word clouds showed me that I like **"a plain, big-breasted, tsundere female college student."**

That may very well be right.

Shoko Takahashi and Yua Mikami also match on "big breasts," but I watch Mia Nanasawa's and Sakura Miura's videos more often than theirs, so **I consider the hypothesis confirmed.**

Please give it a try.
