[PYTHON] I tried morphological analysis and vectorization of words

Try using Word2vec

pip install gensim
pip install janome
#Import required libraries

from janome.tokenizer import Tokenizer
from gensim.models import word2vec
import re

#Open the txt file in binary mode and read its contents
binarydata = open("kazeno_matasaburo.txt", 'rb').read()

#As an aside, here is how I checked the types one step at a time by printing them
binarydata = open("kazeno_matasaburo.txt", 'rb')
print(type(binarydata))

Execution result <class '_io.BufferedReader'>

binarydata = open("kazeno_matasaburo.txt", 'rb').read()
print(type(binarydata))

Execution result <class 'bytes'>
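A minimal alternative sketch (my addition, assuming the file is Shift_JIS encoded, as Aozora Bunko texts are): if you pass the encoding to open(), read() already returns a str, so the manual decode step below is not needed.

text = open("kazeno_matasaburo.txt", encoding='shift_jis').read()
print(type(text))  #<class 'str'>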


#Decode the bytes into a string (Aozora Bunko texts are Shift_JIS encoded)
text = binarydata.decode('shift_jis')
#Remove unnecessary data
text = re.split(r'\-{5,}',text)[2]   #drop the Aozora header block between the dashed lines
text = re.split(r'底本：',text)[0]   #drop the bibliography footer that starts with 底本：
text = text.strip()
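For context, here is a toy sketch of the layout those two re.split calls rely on. The sample string only illustrates the usual Aozora Bunko conventions (an annotation legend between dashed lines, a footer that starts with 底本：); it is not the actual file contents.

import re
sample = ("風の又三郎\n宮沢賢治\n"
          "--------------------\n"
          "【テキスト中に現れる記号について】\n"
          "--------------------\n"
          "どっどど どどうど どどうど どどう\n"
          "底本：「風の又三郎」新潮文庫\n")
body = re.split(r'\-{5,}', sample)[2]       #[0] title block, [1] legend, [2] body + footer
body = re.split(r'底本：', body)[0].strip()  #cut everything from the footer onward
print(body)  #どっどど どどうど どどうど どどう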

#Perform morphological analysis
t = Tokenizer()
results = []
lines = text.split("\r\n")  #split the text into lines

for line in lines:
    s = line
    s = s.replace('｜','')         #remove the ruby start marker ｜
    s = re.sub(r'《.+?》','',s)     #remove ruby readings 《...》
    s = re.sub(r'［＃.+?］','',s)   #remove Aozora annotations ［＃...］
    tokens = t.tokenize(s)  #returns the analyzed tokens
    r = []
    #Take the tokens out one by one; base_form and surface are attributes of each token
    for token in tokens:
        if token.base_form == "*":
            w = token.surface
        else:
            w = token.base_form
        ps = token.part_of_speech
        hinshi = ps.split(',')[0]
        if hinshi in ['名詞','形容詞','動詞','記号']:  #noun, adjective, verb, symbol
            r.append(w)
    rl = (" ".join(r)).strip()
    results.append(rl)
    print(rl)
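
As a minimal sketch of what the loop above is working with (the sentence here is arbitrary and the printed output is approximate), each Janome token exposes surface, base_form, and part_of_speech:

from janome.tokenizer import Tokenizer
t = Tokenizer()
for token in t.tokenize('風が吹く'):
    print(token.surface, token.base_form, token.part_of_speech)
#Roughly:
#風 風 名詞,一般,*,*
#が が 助詞,格助詞,一般,*
#吹く 吹く 動詞,自立,*,*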

#Write the tokenized results out, creating the file as we go
wakachigaki_file = "matasaburo.wakati"
with open(wakachigaki_file,'w', encoding='utf-8') as fp:
    fp.write('\n'.join(results))
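
For reference, a small sketch of how gensim's LineSentence reads this file back: each line becomes one sentence, split on whitespace into a list of tokens, which is why the results above are joined with spaces. The printed list is only illustrative.

from gensim.models import word2vec
for sentence in word2vec.LineSentence("matasaburo.wakati"):
    print(sentence)  #e.g. ['どっ', 'どど', 'どどう', ...], one token list per line
    break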

#Analysis start
data = word2vec.LineSentence(wakachigaki_file)
model = word2vec.Word2Vec(data,size=200,window=10,hs=1,min_count=2,sg=1)  #in gensim 4.x the size argument is called vector_size
model.save('matasaburo.model')

#Try the model: look up words similar to 学校 (school)
model.most_similar(positive=['学校'])
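
A small usage sketch for loading the saved model again later (my addition). With gensim 4.x the similarity query goes through model.wv; with gensim 3.x the call shown above works as written.

from gensim.models import word2vec
model = word2vec.Word2Vec.load('matasaburo.model')
print(model.wv.most_similar(positive=['学校'], topn=5))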

Summary

① Get the text you want to analyze.
② Trim it down to just the body text, removing things like the bibliography at the end.
③ Take it out line by line with a for statement and strip the unneeded markup.
④ Run morphological analysis with the Tokenizer and collect the results in a list.
⑤ Write the finished list to a file.
⑥ Build a Word2Vec model from the tokenized file.
