[PYTHON] Clustering books from Aozora Bunko with Doc2Vec

In this article, I try to find works similar to a given one among the popular works published in Aozora Bunko, implemented with Doc2Vec. (Note: although the title says "clustering", strictly speaking this is not clustering.)

What is Doc2Vec?

Doc2Vec is an extension of the Word2Vec idea that learns a paragraph vector (document id) in addition to the word vectors. To understand Doc2Vec, I referred to the following pages: https://benrishi-ai.com/doc2vec01/ and https://kento1109.hatenablog.com/entry/2017/11/15/181838. For the implementation, I referred to https://qiita.com/g-k/items/5ea94c13281f675302ca. That Qiita article also briefly explains Doc2Vec, but I did not fully understand what the paragraph vector (document id) actually looks like. My rough understanding is that each document gets its own vector that is trained jointly with the word vectors, but I still need to read the Doc2Vec paper properly. That is a task for another time; this article is about the implementation!
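
As a rough mental model (my own sketch, not taken from the pages above), gensim's Doc2Vec learns one vector per tagged document alongside the word vectors, and can also infer a vector for unseen text:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each document gets a tag, and the model learns one vector per tag
docs = [TaggedDocument(words=["walk", "run", "melos"], tags=[0]),
        TaggedDocument(words=["lemon", "maruzen", "bomb"], tags=[1])]
toy_model = Doc2Vec(docs, vector_size=8, window=2, min_count=1, epochs=50)

print(toy_model.docvecs[0])                      # learned paragraph vector for document 0
print(toy_model.wv["melos"])                     # an ordinary word vector
print(toy_model.infer_vector(["run", "melos"]))  # vector inferred for new, unseen text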

Implementation start

This time, I will scrape Aozora Bunko to get the texts of its works, and then look for the works closest to Osamu Dazai's "No Longer Human" among them.

#Import the required libraries
from bs4 import BeautifulSoup
import requests
import jaconv
from gensim import corpora
from gensim import models
from pprint import pprint
import pandas as pd
import string
import MeCab
import unicodedata
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

Dictionary specification & function definition

It is tedious to specify the MeCab dictionary every time, so specify it up front. Defining the processing as functions also keeps the code organized.

Dictionary specification

#Specify the NEologd dictionary for MeCab.
#mecab is for full morphological analysis, wakati is for word segmentation (space-separated output)
mecab = MeCab.Tagger('-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/')
wakati = MeCab.Tagger("-Owakati -d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd/")
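
As a quick check (output varies with the installed dictionary), mecab.parse returns one token per line with part-of-speech information, while wakati.parse simply returns the text with tokens separated by spaces:

print(wakati.parse("メロスは激怒した。"))
# -> roughly: メロス は 激怒 し た 。
print(mecab.parse("メロスは激怒した。"))
# -> one line per token, e.g. メロス followed by 名詞,固有名詞,... and so on, ending with EOS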

Function definition

#Define a function that performs morphological analysis.
#With file=True it reads a file and writes the result to a file; given a string it returns a string.
#To get only the word-segmented (wakati) output, pass mecab=wakati.
def MecabMorphologicalAnalysis(path='./text.txt', output_file='wakati.txt', mecab=mecab, file=False):
    mecab_text = ''
    if file:
        with open(path) as f:
            for line in f:
                mecab_text += mecab.parse(line)
        with open(output_file, 'w') as f:
            print(mecab_text, file=f)
    else:
        for path in path.split('\n'):
            mecab_text += mecab.parse(path)
        return mecab_text
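
A minimal usage sketch of this function (assuming MeCab and the NEologd dictionary above are installed):

# String mode: pass raw text and get the parsed text back as a string
sample = "メロスは激怒した。"
print(MecabMorphologicalAnalysis(sample, mecab=wakati))  # space-separated tokens
print(MecabMorphologicalAnalysis(sample, mecab=mecab))   # full morphological analysis

# File mode: read ./text.txt and write the result to wakati.txt
# MecabMorphologicalAnalysis(path='./text.txt', output_file='wakati.txt', mecab=wakati, file=True)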
    

#Symbols get in the way of the analysis, so define a function that removes them.
#Used inside the Aozora_table function defined below.
def symbol_removal(soup):
    soup = unicodedata.normalize("NFKC", soup)
    exclusion = "「」『』【】、。・" + "\n" + "\r" + "\u3000"  # Japanese brackets, punctuation, and the full-width space
    soup = soup.translate(str.maketrans("", "", string.punctuation + exclusion))
    return soup
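
A quick check of symbol_removal (assuming the exclusion set shown above):

print(symbol_removal("「雨ニモマケズ」\u3000風ニモマケズ、雪ニモ負ケヌ。"))
# -> 雨ニモマケズ 風ニモマケズ雪ニモ負ケヌ
# The brackets and Japanese punctuation are stripped; NFKC first turns the
# full-width space into a regular space, which is why one space survives.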


#Define a function that scrapes Aozora Bunko and formats the result into table data.
#It returns as many works as specified by the argument (the default is 30).
#Uses the symbol_removal function defined above.
def Aozora_table(n=30):
    url = "https://www.aozora.gr.jp/access_ranking/2019_xhtml.html"
    res = requests.get(url)
    res.encoding = 'shift-jis'
    soup = BeautifulSoup(res.content, "html.parser")

    url_list = [url["href"] for i, url in enumerate(soup.find_all("a", target="_blank")) if i < n]

    title = []
    category = []
    text = []
    for url in url_list:
        res = requests.get(url)
        url_start = url[:37]
        res.encoding = 'shift-jis'
        soup = BeautifulSoup(res.content, "html.parser")
        for i, a in enumerate(soup.find_all("a")):
            if i == 7:
                url_end = a["href"][1:]
        url = url_start + url_end
        res = requests.get(url)
        res.encoding = 'shift-jis'
        soup = BeautifulSoup(res.content, "html.parser")
        title.append(soup.find("h1").string)
        category.append(soup.find("h2").string)
        for tag in soup.find_all(["rt", "rp"]):
            tag.decompose()
        soup = soup.find("div",{'class': 'main_text'}).get_text()
        text.append(symbol_removal(soup))
    df = pd.DataFrame({'title': title, 'category': category, 'text': text})
    return df


#Pass a nested list of tokenized works to get, for each work, a list of words sorted by TF-IDF.
def sortedTFIDF(sentences):
    
    #Attach the ID to the word.
    dictionary = corpora.Dictionary(sentences)
    
    #Count the number of appearances of words for each work
    corpus = list(map(dictionary.doc2bow, sentences))
    
    #Calculate the TF-IDF of each word
    test_model = models.TfidfModel(corpus)
    corpus_tfidf = test_model[corpus]
    
    # Convert each (word ID, TF-IDF) pair into a [TF-IDF, word] pair. Putting TF-IDF first allows sorted() to sort by TF-IDF.
    texts_tfidf = []
    for doc in corpus_tfidf:
        text_tfidf = []
        for word in doc:
            text_tfidf.append([word[1], dictionary[word[0]]])
        texts_tfidf.append(text_tfidf)
    
    # Sort by TF-IDF in descending order.
    sorted_texts_tfidf = []
    for text in texts_tfidf:
        sorted_text = sorted(text, reverse=True)
        sorted_texts_tfidf.append(sorted_text)

    return sorted_texts_tfidf

Creating a data table

First, scrape the Aozora Bunko page to create a data table, using the `Aozora_table` function defined above.

df = Aozora_table(50)
df.head()
|   | title | category | text |
|---|-------|----------|------|
| 0 | [Ame Nimomakezu] | Kenji Miyazawa | Ame nimo Makezu-style Nimo Makezu Snow Nimo Natsu no Hatsu Sanimo Makezu Durable Nakaradawomochi 慾 Hanaku decision Shite Dvesha Razuitsumo ... |
| 1 | Run, Melos! | Osamu Dazai | Melos decides to get rid of the angry king of wicked violence, and Melos doesn't understand politics ... |
| 2 | The Moon Over the Mountains | Atsushi Nakajima | Longxi Commandery's Li Zhi was supplemented by Lieutenant Gangnam at the young age of the scholarly talented Tianbao, and was supplemented by Lieutenant Gangnam, but he himself ... |
| 3 | heart | Natsume Soseki | Kami-sensei and Iichi I used to call that person a teacher, so even here, just write "teacher" and hit your real name ... |
| 4 | Rashomon | Ryunosuke Akutagawa | One day, a servant was waiting for the rain under the Rashomon gate. Under the wide gate, this man was ... |

Works whose text is written entirely in katakana cannot be parsed well, so convert them to hiragana.

for i in range(len(df)):
    if df['title'][i] in ["[Ame ni mo Makezu]", "Denden Musino Kanashimi"]:
        df['text'][i] = jaconv.kata2hira(df['text'][i])
df.head()
|   | title | category | text |
|---|-------|----------|------|
| 0 | [Ame Nimomakezu] | Kenji Miyazawa | I have a strong body that can withstand rain, wind, snow, and summer heat, and I never squint at all ... |
| 1 | Run, Melos! | Osamu Dazai | Melos decides to get rid of the angry king of wicked violence, and Melos doesn't understand politics ... |
| 2 | The Moon Over the Mountains | Atsushi Nakajima | Longxi Commandery's Li Zhi was supplemented by Lieutenant Gangnam at the young age of the scholarly talented Tianbao, and was supplemented by Lieutenant Gangnam, but he himself ... |
| 3 | heart | Natsume Soseki | Kami-sensei and Iichi I used to call that person a teacher, so even here, just write "teacher" and hit your real name ... |
| 4 | Rashomon | Ryunosuke Akutagawa | One day, a servant was waiting for the rain under the Rashomon gate. Under the wide gate, this man was ... |

I was able to successfully create the data table for Aozora Bunko.

Pre-processing to pass to Doc2Vec model

Use the MecabMorphologicalAnalysis function defined above to tokenize the texts.

texts = []
for i in range(len(df)):
    texts.append(MecabMorphologicalAnalysis(df['text'][i], mecab=wakati))
#Check the tokenization by displaying the first 5 works.
for i, text in enumerate(texts):
    if i < 5:
        display(text[:100])

'Neither the rain, the wind, the snow, the heat of the summer, the strong body, the dvesha, the dvesha, the day '
'Meros was angry and decided that he had to get rid of the king of wicked violence. Melos didn't understand politics. Meros was a village shepherd. '
'Longxi's Li Zhi was a young student of the scholarly talented heavenly treasure, and his name was added to the tiger, and then he was supplemented by Lieutenant Gangnam. '
'Kami-sensei and Iichi I used to call that person a teacher, so even here, I can't tell my real name just by writing it as a teacher. This is a refrain from the world.'
'One day's way of life, one servant was waiting for the rain under Rashomon. Under the wide gate, there was no one but this man.'

The texts were tokenized as expected.

Next, convert each word-separated text into a list of words.

#Convert to list
sentences = []
for text in texts:
    text_list = text.split(' ')
    sentences.append(text_list)

Model learning

Now that the pre-processing is done and we have a word-separated list of sentences, we can pass it to Doc2Vec for learning.

#First, create documents and pass them to the model.
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(sentences)]
model = Doc2Vec(documents, vector_size=100, window=7, min_count=1)
#Each item in documents pairs a tag (number) with the word list of the work. The author is also printed here for reference.
for i, doc in enumerate(documents):
    print(doc[1], df['title'][i], df['category'][i], doc[0][:8])

[0] [Ame ni Mo Makezu] Kenji Miyazawa ['Rain',' ni',' mo',' bonus',' zu','wind',' ni','mo'] [1] Run, Melos! Osamu Dazai ['Meros','ha',' rage','shi','ta','must',' or',''] [2] The Moon Over the Mountains Atsushi Nakajima ['Longxi','',' Li Zhi',' is','Academic Talent','Tenho','',' End'] [3] Kokoro Natsume Soseki ['Upper',' Teacher',' and',' I',' One',' I',' is',' That'] [4] Rashomon Ryunosuke Akutagawa ['One day','',' Living','',' Things',' de','Are','Alone'] [5] Night on the Galactic Railroad Kenji Miyazawa ['I',' Afternoon','',' Class',' de',' is','everyone','is' [6] Human disqualification Osamu Dazai ['Foreword','I','wa','that','man','',' photo',' [7] I am a cat Natsume Soseki ['I am a cat',' I am a cat',' name',' is','yet','not','where','de'] [8] Yamanashi Kenji Miyazawa ['Small',' Tanigawa','',' Bottom',' to',' Copy',' Ta',' Two'] [9] Ten Nights of Dreams Natsume Soseki ['No.',' One','Night','Such','Dream',' wo','See','Ta'] [10] Kusamakura Natsume Soseki ['Ichi',' Yamaji',' to',' Climb',' While',' Ko',' Thought',' Ta'] [11] Kenji Miyazawa, a restaurant with a lot of orders ['Two people','',' Young',' Gentleman',' Ga',' Completely','British',''] [12] Spider's Thread Ryunosuke Akutagawa ['One','One day','',' Things','de','Are','Masu','Shaka'] [13] Botchan Natsume Soseki ['Ichi',' Parental','',' Mutsugun','de','Kosuke','',' Time'] [14] Lemon Motojiro Kajii ['Etai','',' Know',' No',' Ominous',' Na',' Mass',' Ga'] [15] Dogra Magra Yumeno Kyusaku ['Page','',' Left and Right','Center','Intro','Song','Fetus',' Yo'] [16] Chieko's Sky Kotaro Takamura ['People',' to',' No',' Na',' N',' Is','You',''] [17] Hojoki Kamo no Chomei ['Go',' River','',' Nagare',' Ha','Continuously','Shi','Te'] [18] Recommendation of learning Satoshi Fukuzawa ['First',' Hen',' Heaven',' is','People','',' Above','to'] [19] "Spring and Shura" Kenji Miyazawa ['Page','',' Left and Right','Center','Mental Image','Sketsuchi','Spring and Shura','Taisho'] [20] Osamu Dazai ['Osamu',' Osamu',' Osamu',' Masu',' Husband',' Sama',' That','People'] [21] The Dancing Girl Ogai Mori ['Coal',' O',' Ba',' Hayaya',' Stacking',' End',' Tsu','Medium'] [22] The Nighthawk Star Kenji Miyazawa ['yo','da','ka','is','really','hard to see','bird','is'] [23] Momotaro Masao Kusuyama ['Ichi','Once upon a time','Once upon a time','Are','Tokoro',' to','Grandfather','and'] [24] Motojiro Kajii under the cherry tree ['Under the cherry tree','corpse',' is','buried','te','is','this',' is'] [25] Buying Mittens Nankichi Niimi ['Cold','Winter',' Ga','Northern',' From','Fox','',' Parent and Child'] [26] Ryunosuke Akutagawa ['Zen','Chinai','Supplement','',' Nose',' and',' Say','Ba'] [27] Takasebune Ogai Mori ['Takasebune','wa','Kyoto','',' Takasegawa',' to','up and down','do'] [28] A handful of sand Takuboku Ishikawa ['Hakodate',' Naru',' Ikuu Miyazaki',' Miyazaki',' Daishiro',' Kimi',' Country',''] [29] Tosa Nikki Ki no Tsurayuki ['Man',' Mo',' Su',' Naru',' Diary',' Toifu',' Things',' [30] Star Tour Song Kenji Miyazawa ['Akai','Medama','',' Sasori',' Hiroge',' Ta','Eagle','''] [31] Letter in the cement barrel Yoshiki Hayama ['Matsudo',' Yozo',' is','Cement','Ake',' to','Ya','Te'] [32] Girl Hell Yumeno Kyusaku ['What',' N','De',' No',' No',' Shirataka',' Hidemaru','Brother'] [33] The Setting Sun Osamu Dazai ['Ichi','Breakfast','Do',' at','Suup','O','Ichi','Spoon'] [34] Kappa Ryunosuke Akutagawa ['Please','Kappa','and','Pronunciation',',',',','Please','Introduction'] [35] Obbel and the Elephant Kenji Miyazawa ['Are','Cow','Cowman',' Ga',' Monogataru','No.',' 
One','Sunday'] [36] Sakutaro Hagiwara barking in the moon ['Cousin',' Hagiwara',' Eiji',' Mr.',' To',' Dedicated','Introduction',' Hagiwara'] [37] Matasaburo of the Wind Kenji Miyazawa ['Dododo','Dododo','Dododo','Dodo','Blue','Walnut','Mo',' Blow off'] [38] Gauche the Cellist Kenji Miyazawa ['Gauche','ha','town','',' activity photo','kan','de','cello'] [39] Sanshiro Natsume Soseki ['Ichi','Drowsiness',' As',' Eyes',' Ga','Sameru','and','Woman'] [40] Hell Screen Ryunosuke Akutagawa ['Ichi',' Horikawa','',' Large','Den-sama','',' Yau','Na'] [41] Schoolgirl Osamu Dazai ['A',' Sa',' Eyes','O',' Sa',' When','',' Feeling'] [42] Kani Kosen Takiji Kobayashi ['Ichi','Hey','Hell','Sa','Gogun','Da','De','Two people'] [43] Denden Musino Kanashimi Nankichi Niimi ['When','Piki','',' Dendenmushi','ga','Yes','Mashi','Ta'] [44] Night on the Galactic Railroad Kenji Miyazawa ['I',' Afternoon','',' Class',' de',' is','everyone',' is'] [45] Toshishun Ryunosuke Akutagawa ['Ichi',' or',' Spring','',' Higurashi',' is',' Tang','''] [46] To Yasunari Kawabata Osamu Dazai ['You',' is','Bungei Shunju','September',' Issue','to','I','to'] [47] In Praise of Shadows Junichiro Tanizaki ['○','Today','Public','Doraku','','People','ga','Jun'] [48] The Human Chair Ranpo Edogawa ['Kako',' is','every morning','husband','',' to go to the office',' to',' see off'] [49] Ryunosuke Akutagawa ['Odawara','Atami',' Between',' to','Light Railway','Laying','',' Construction']

Make predictions using a trained model

Now, let's use the model we learned to extract works that are close to Osamu Dazai's "No Longer Human".

#Index 6 is Osamu Dazai's "No Longer Human".
#It is the story of a man who suffers while losing his way in life,
#ending heavily with the protagonist committed to a mental hospital.
ranking = model.docvecs.most_similar(6, topn=50)
ranking[:5] #Top 5 works with high cosine similarity
[(14, 0.9827772974967957),
 (1, 0.9771202802658081),
 (46, 0.9766896367073059),
 (48, 0.975338876247406),
 (4, 0.9737432599067688)]
ranking[-5:] #Bottom 5 works by cosine similarity
[(5, 0.861607551574707),
 (32, 0.8596963882446289),
 (44, 0.8453813195228577),
 (22, 0.8167744874954224),
 (37, 0.8134371042251587)]
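
The indices in the ranking can be mapped back to titles and authors through df, which makes the output easier to read (a small sketch using the variables defined above):

for idx, sim in ranking[:5]:
    print(f"{sim:.3f}  {df['title'][idx]} ({df['category'][idx]})")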

The work most similar to "No Longer Human" turned out to be "Lemon". That is certainly close: like "No Longer Human", it is a story whose protagonist lives while suffering. Second and third place went to "Run, Melos!" and "To Yasunari Kawabata", both of which are also works by Osamu Dazai. And isn't the heavy atmosphere of "Rashomon" in fifth place fairly close to "No Longer Human" as well?

Most of the dissimilar works are by Kenji Miyazawa (for some reason, "Night on the Galactic Railroad" appears twice in the dataset). The protagonists of "Night on the Galactic Railroad" and "The Nighthawk Star" have their own worries, but there is still something warm about those stories, and their atmosphere feels quite different from "No Longer Human".

The results matched my intuition well. However, I am honestly a little worried that all the works are too similar to one another (every similarity is above 0.8) to be clustered properly.
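
One way to quantify that concern (a sketch, not part of the original analysis; it uses the model and documents variables above and the same model.docvecs accessor used elsewhere in this article) is to look at the distribution of pairwise cosine similarities between all document vectors:

import numpy as np

# Collect every document vector and compute all pairwise cosine similarities
vecs = np.array([model.docvecs[i] for i in range(len(documents))])
norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
sims = norm @ norm.T
pairs = sims[np.triu_indices(len(vecs), k=1)]  # upper triangle, excluding the diagonal
print(f"min {pairs.min():.3f} / mean {pairs.mean():.3f} / max {pairs.max():.3f}")

If the minimum and mean are both high, the documents really are crowded together in the vector space.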

Considering the use of TF-IDF

So let's try using TF-IDF so that only the important words are used. Here are the advantages and disadvantages I expect from using TF-IDF.

Advantages

- Text contains many frequent but nearly meaningless words, such as the particles "wo" and "ta" in "booru wo nageta" ("threw the ball"), so there is a risk that every work ends up with a similar vector. By using TF-IDF to focus on important words and emphasize each work's characteristics, clustering may become easier.

Disadvantages

- Since the input is no longer running text, it may become harder to capture an author's characteristic wording and sentence construction.
- Because duplicate words are collapsed into one, it may be a poor approach for works that rely on repeating the same expression over and over.
- In "Run, Melos!", for example, words such as "Meros" and "Serinuntius" get high TF-IDF values, but can proper nouns like these really be said to represent the character of the text? I have my doubts.

I can think of more disadvantages than advantages, but I'll try it for the time being.

Preprocessing for TF-IDF

`sentences` already holds the tokenized word lists. In this form, it can be passed directly to the sortedTFIDF function defined above to get, for each work, a word list sorted by TF-IDF. What sortedTFIDF does is the same as the TF-IDF part of my previous article, ["Implementing TF-IDF using gensim"](https://qiita.com/kei0919/items/1e191964e727b83372c0#df-idf%E3%82%92%E5%AE%9F%E8%A1%8C%E3%81%99%E3%82%8B%E5%89%8D%E3%81%AE%E6%BA%96%E5%82%99).

#Check the contents of sentences
for i, sentence in enumerate(sentences):
    if i < 5:
        print(sentence[:10])

['Rain',' to',' mo',' bonus',' zu','wind',' to',' mo',' bonus',' zu'] ['Meros','wa',' rage','shi','ta','must',' or','',' wickedness','violence'] ['Longxi','',' Li Zhi','ha',' scholarly talent','Tenba','',' end',' young',' ['On',' Teacher',' and',' I',' One',' I',' is',' That','People','] ['One day',''',' Living','',' Things',' at','Are','One person','',' Lower man']

#Use the sortedTFIDF function to get, for each work, a word list sorted by TF-IDF.
sorted_texts_tfidf = sortedTFIDF(sentences)
#Check the words with high TF-IDF for each work
for i, tfidf in enumerate(sorted_texts_tfidf):
    if i < 5:
        print('%s.' % i, '〜%s〜' % df['title'][i]) #The title is also displayed for the time being
        pprint(tfidf[:10])
        print('')
0. ~ [Ame Nimomakezu] ~
[[0.5270176717513841,'Nanmu'], [0.335824762232814,'Bonus'], [0.2720937923932906,'Maki'], [0.1360468961966453,'萓'], [0.1360468961966453,'Dvesha'], [0.1360468961966453,'Brown rice'], [0.1360468961966453,'Bodhisattva without a side'], [0.1360468961966453,'Jyogyo Bodhisattva'], [0.1360468961966453,'Bodhisattva Anritsu'], [0.1360468961966453,'Prabhutaratna']]

1. ~ Run, Melos! ~
[[0.8826660798685602,'Meros'], [0.1647643349087979,'Serinuntius'], [0.12468722733644991,'King'], [0.09375235972221699,'I'], [0.07738167204990741,'You'], [0.07061328638948482,'Muddy stream'], [0.06778538031378131,'crowd'], [0.06439978483773416,'Friend'], [0.06166407227059827,'No'], [0.05884440532457068,'Nobumi']]

2. ~ The Moon Over the Mountains ~
[[0.46018061185481746,'袁'], [0.46018061185481746,'Li Zhi'], [0.32450428070224685,'self'], [0.2989295156005757,'Plexus'], [0.1698659669116043,'Tiger'], [0.10946065580700806,'Sorry'], [0.07971453749348684,'吏'], [0.0726600966086554,'Shame'], [0.0726600966086554,'Dignified'], [0.07127857508099637,'Self-esteem']]

3. ~ Heart ~
[[0.5354264061948253,'I'], [0.4564728456651801, 'K'], [0.282595486317482,'wife'], [0.2504145163549083,'Teacher'], [0.17885230572329233,'is'], [0.1597103741100704,'Lady'], [0.12084196131850143,'Things'], [0.11917933998957644,'Father'], [0.11460565741637332,'not'], [0.10324388965526733,'I']]

4. ~ Rashomon ~
[[0.7324117083727154,'Lower'], [0.46608017805536434,'Old woman'], [0.1618414518545496,'Death'], [0.1416112703727309,'Rashomon'], [0.1059977890898406,'say'], [0.10332033314297188,'corpse'], [0.09868036169619741,'Ladder'], [0.0809207259272748,'Comedone'], [0.07675139243037576,'Tachi'], [0.07581439176692739,'Dead']]

The words are now sorted in descending order of TF-IDF. Next, create an `all_title` list containing the top 100 words of each work and pass it to the Doc2Vec model.

all_title = []
for tfidf in sorted_texts_tfidf:
    title = []
    for word in tfidf[:100]: #Narrow down to 100 words
        title.append(word[1])
    all_title.append(title)
#Check the contents of the all_title list
for i, text in enumerate(all_title):
    if i < 5:
        print(text[:10])

['Nanmu','Make','Maki',' 萓','Dvesha','Brown rice','Mubeyuki Bodhisattva','Jyogyo Bodhisattva','Anritsu Bodhisattva','Prabhutaratna'] ['Meros','Serinuntius','King','I','You','Muddy Stream','Crowd','Friends','None','Nobumi'] ['袁','Li Zhi','Self','Plexus','Tiger','Sorrow','Shame','Shame','Self-esteem'] ['I','K','wife',' teacher','is','young lady','thing','father','not','i'] ['Lower','Old Woman','Death','Rashomon',' Say','Dead','Ladder','Comedo','Tachi','Dead']

Train the Doc2Vec model again

#Pass the list of important words (all_title) to the Doc2Vec model.
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(all_title)]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1)
#Display a list of numbers and works for easy comparison with the results
for i, doc in enumerate(documents):
    print(doc[1], df['title'][i], df['category'][i], doc[0][:8])

[0] [Ame ni Mo Makezu] Kenji Miyazawa ['Nanmu','Make','Maki','萓','Dvesha','Brown Rice','Mubeyuki Bodhisattva','Jyogyo Bodhisattva'] [1] Run, Melos! Osamu Dazai ['Meros','Serinuntius','King','I','You','Muddy Stream','Crowd','Friends'] [2] The Moon Over the Mountains Atsushi Nakajima ['袁',' Li Zhi','Self','Plexus','Tiger','Sorrow',' 吏','Shame'] [3] Kokoro Natsume Soseki ['I','K','wife',' teacher','is','young lady','thing','father'] [4] Rashomon Ryunosuke Akutagawa ['Senior',' Old Woman','Death','Rashomon',' Say',' Corpse','Ladder','Comedo'] [5] Night on the Galactic Railroad Kenji Miyazawa ['Giovanni','Campanella','I say','I','Mashi','Galaxy','Capture','Milky Way'] [6] No Longer Human Osamu Dazai ['Horiki','Self','Flounder','Jester','Deshi','Yoshiko','Things','Tsuneko'] [7] I am a cat Natsume Soseki ['I Am a Cat',' I Am a Cat',' I Am a Cat',' I Am a Cat',' Master',' Kimi',' Things',' Kaneda'] [8] Yamanashi Kenji Miyazawa ['Clammbon','Crab',' Yamanashi','Wara','Foam','Dad',' Crab','Kapukapu'] [9] Ten Nights of Dreams Natsume Soseki ['Shotaro',' Unkei',' Old man',' Self',' Nioh',' Pig',' I'm',' Unkei'] [10] Kusamakura Natsume Soseki ['Yo','Speaking','Kuichi','Painting','Inhumanity','Nu','Iru','Nami'] [11] Kenji Miyazawa, a restaurant with a lot of orders ['Door',' Cream',' I','Bokura','Backside','Rattling','Please','Goton'] [12] Spider's Thread Ryunosuke Akutagawa ['The Spider's Thread','The Spider's Thread','The Spider's Thread',' Paradise',' Blood Pond','Are','Hell',' Needle Mountain'] [13] Botchan Natsume Soseki ['Red Shirt','Yamaarashi','I','Uranari','Yu','Kiyo','Yu','Principal'] [14] Lemon Motojiro Kajii ['Lemon','I','Maruzen','Town','Fruit shop','――','廂','Paints'] [15] Dogra Magra Yumeno Kyusaku ['Masaki',' Wakabayashi',' Ichiro',' Gozai',' Wu',' Doctor',' I'm','Brain'] [16] Chieko's Sky Kotaro Takamura ['Chieko','She','Tsute','Yau','ゐ','Atsu','Toifu','Tsuta'] [17] Hojoki Kamo no Chomei ['ゝ','Samurai',' Hi','ゞ','or',' Mizuka',' I',' 經'] [18] Academic Recommendations Fukuzawa Yukichi ['Government','People',' at','Beshi','Bekara','Zaru','Ara','or'] [19] "Spring and Shura" Kenji Miyazawa ['Iru','Watakushi','Tsute','Yau','Kai','Page','Swaying','┃'] [20] Osamu Dazai ['I','Peter','that','you','disciple','no','Jerusalem','said'] [21] The Dancing Girl Ogai Mori ['Yo',' Ellis','ゝ','at',' Aizawa',' Taru','I','Minister'] [22] The Nighthawk Star Kenji Miyazawa ['Hawk',' Ichizo','Nest',' I',' Burnt',' Sparrow',' Star',' Sora'] [23] Momotaro Masao Kusuyama ['Momotaro','Grandmother','Grandfather','Kibidango',' Onigashima','Pheasant','Peach','Oni'] [24] Under the cherry tree, Motojiro Kajii ['I','corpse',' 溪','Under the cherry tree','you','buried','thin feather','hair root'] [25] Nankichi Niimi to buy gloves ['Child fox',' Fox',' Boy',' Mother',' Hands',' Mother',' Hat shop',' Gloves'] [26] Ryunosuke Akutagawa ['Internal offering',' Monk',' Nose',' Disciple',' 哂',' Say',' Short',' Doji'] [27] Takasebune Ogai Mori ['Kisuke','Tsute',' 衞','Sho','Cloud','Departure','Watakushi','Yau'] [28] A handful of sand Takuboku Ishikawa ['Friend','Furusato','Ariki','Kanashi','Kanashiki','Thought','Taru','Ramu'] [29] Tosa Nikki Ki no Tsurayuki ['ゝ','read','i','ship','if','keri','country','song'] [30] Star Tour Song Kenji Miyazawa ['Medama','Small Dog','Snake','Tsuyu','Tsubasa','Shimoto','Guma no Ashio',' Oguma'] [31] Letter in Cement Barrel Yoshiki Hayama ['Cement','Lover','Barrel','He','Masu','Mixer','Small Box','Rip'] [32] Girl Hell Yumeno Kyusaku ['Shirataka','Principal 
teacher','She','Yuriko','Concubine','Niitaka','Princess','I'] [33] The Setting Sun Osamu Dazai ['Mother','Naoji','I','Uehara','I','Uncle','Uncle','O'] [34] Kappa Ryunosuke Akutagawa ['Kappa',' Totsuku',' I',' ゐ',' ゐ','Tetsu','Geel','Un'] [35] Obbel and the Elephant Kenji Miyazawa ['Obbel',' Elephant',' Gralaagaa Gralaagaa','White Elephant','Hut','Such','Handle','I'] [36] Sakutaro Hagiwara barking in the moon ['Yau',' ゐ',' tsute',' tsuta','poetry','me','ゐ','atsu'] [37] Matasaburo of the Wind Kenji Miyazawa ['Kasuke',' Saburo',' Ro',' Ichiro',' Matasaburo',' Kosuke',' Say',' Mashi'] [38] Gauche the Cellist Kenji Miyazawa ['Gauche','Cuckoo','Cerro','Conductor','Raccoon',' Gauche','Gauche',' Field Mouse'] [39] Sanshiro Natsume Soseki ['Ro',' Sanshi',' Yojiro','Mieko',' Nonomiya',' Sanshiro',' Hirota',' Haraguchi'] [40] Hell Screen Ryunosuke Akutagawa ['Yoshihide','Are','Ya','Tsute','Cloud','Den-sama','Go','Iru'] [41] Schoolgirl Osamu Dazai ['Mom','I','Japii','I'm','Imaida','Dad','Kaa','Sister'] [42] Kani Kosen Takiji Kobayashi ['Fisherman','Director','――','But','Kawasaki','Tatsu','Tsu','Captain'] [43] Denden Musino Kanashimi Nankichi Niimi ['Dendemushi','Friends','Kanasimi','Ippai','I','Yun','Iki','Naka'] [44] Night on the Galactic Railroad Kenji Miyazawa ['Giovanni','Campanella','I','Milky','Milky Way','Say','Beyond','Galaxy'] [45] Toshishun Ryunosuke Akutagawa ['Toshishun','Tsute',' ゐ',' ゐ','Emeiyama','Yau','Luoyang','Much rich'] [46] To Yasunari Kawabata Osamu Dazai ['Hana of the Clown','Kazuo Dan','Brother','I','Mr.','Yasunari Kawabata','Dostoevsky','Sentence'] [47] In Praise of Shadows Junichiro Tanizaki ['Speaking','ゝ','In Praise',' Aro','Darkness','We',''','I'] [48] Human Chair Ranpo Edogawa ['Yes','I','Chair','She','Sama','Masu','Hotel','Wife'] [49] Minecart Ryunosuke Akutagawa ['Ryohei',' Minecart',' earthwork','railroad track','――','construction','he','push']

Result announcement!

ranking = model.docvecs.most_similar(6, topn=50) #Again, query with Osamu Dazai's "No Longer Human" (index 6)
ranking[:5]
[(45, 0.25298941135406494),
 (26, 0.22999905049800873),
 (36, 0.1593010127544403),
 (21, 0.15090803802013397),
 (47, 0.1467716097831726)]
ranking[-5:]
[(12, -0.11983974277973175),
 (41, -0.12426283210515976),
 (0, -0.1281222403049469),
 (13, -0.1791403889656067),
 (25, -0.2501679062843323)]

Looking at the works judged to be similar, none of them actually feel similar to me. Conversely, the works judged dissimilar even include one by Dazai himself; its atmosphere is admittedly different, but still...

So, for novels, using TF-IDF apparently reduces the accuracy. I think the reason lies in the [disadvantages](https://qiita.com/kei0919/items/bde365bf179c0a1573af#%E3%83%87%E3%83%A1%E3%83%AA%E3%83%83%E3%83%88) listed above.

It is frustrating to end here, so next time I will use Doc2Vec together with TF-IDF to analyze news articles!
