[PYTHON] Two-dimensional visualization of document vectors using a trained Word2Vec model

Overview

--Someone has published a distributed-representation model (Word2Vec) trained on patent text, so I wrote code that loads this trained model and vectorizes text information: word vectorization → document vector generation → 2D visualization.

--It is difficult to train a model on such a huge amount of input data with the specs of a personal PC, so I am grateful when trained models are published like this.

Environment

OS: macOS Catalina
Language: Python 3.6.6

Code

Here is the code I created. MeCab is used for Japanese morphological analysis. The Word2Vec model is 300-dimensional, and scikit-learn's TSNE is used to reduce the document vectors to two dimensions.

patent_w2v_JP.py


from sklearn.manifold import TSNE
from gensim.models import word2vec
from statistics import mean
import string
import re
import MeCab
import pandas as pd
import numpy as np

#Reading input data
df = pd.read_csv('input.csv', encoding="utf-8")
#Combine title and summary
df['text'] = df['title'].str.cat(df['abstract'], sep=' ')

#List of stop words (add to this list, if any)
stop_words = []

#MeCab tokenizer
def mecab_token_list(text):
    token_list = []
    tagger = MeCab.Tagger()
    tagger.parse('') 
    node = tagger.parseToNode(text)
    while node:
        pos = node.feature.split(",")
        if node.surface and node.surface not in stop_words: #Skip empty BOS/EOS nodes and stop words
            if pos[6] != '*': #Use the base form (lemma) when MeCab provides one
                token_list.append(pos[6])
            else: #Otherwise use the surface form
                token_list.append(node.surface)
        node = node.next
    return token_list

#Remove numbers
df['text_clean'] = df.text.map(lambda x: re.sub(r'\d+', '', x))
#All English letters are unified to lowercase
df['text_clean'] = df.text_clean.map(lambda x: x.lower())
#MeCab Tokenize
df['text_tokens'] = df.text_clean.map(lambda x: mecab_token_list(x))

#Load Word2Vec model
model = word2vec.Word2Vec.load("patent_w2v_d300_20191125.model")

#Initialize an array holding one 300-dimensional vector per document
doc_vec = np.zeros((df.shape[0], 300))
#Prepare a list to store the model coverage of the appearing words in each document
coverage = []
#Store the average vector of each document in an array
for i,doc in enumerate(df['text_tokens']): #Process text information after morphological analysis in document order
    feature_vec = np.zeros(300) #Initialize 300 dimensions to 0
    num_words = 0
    no_count = 0
    for word in doc: #Process word by word in each document
        try: #Processing to add word vectors
            feature_vec += model.wv[word]
            num_words += 1
        except KeyError: #Skip words not covered by the model
            no_count += 1
    #Divide the sum of the word vectors by the number of covered words to get the average, and store it as the document vector
    if num_words > 0:
        feature_vec = feature_vec / num_words
    doc_vec[i] = feature_vec
    #Calculate and store word coverage for each document
    cover_rate = num_words / (num_words + no_count)
    coverage.append(cover_rate)

#Show average word coverage
mean_coverage = round(mean(coverage)*100, 2)
print("Word cover-rate: " + str(mean_coverage) + "%")

#Dimensionality reduction by t-SNE
tsne= TSNE(n_components=2, init='pca', verbose=1, random_state=2000, perplexity=50, learning_rate=200, method='exact', n_iter=1000)
embedding = tsne.fit_transform(doc_vec)
#Store in DataFrame
embedding = pd.DataFrame(embedding, columns=['x', 'y'])
embedding["id"]= df.id
embedding["year"]= df.year
embedding["title"]= df.title
embedding["abstract"]= df.abstract

#Output as CSV file
embedding.to_csv("output.csv", encoding="utf_8")

Input data structure

--The input data is assumed to be patent documents in a CSV file with columns such as id, year (filing year, etc.), title (title of the invention), and abstract (summary text), as shown below. Please adapt this part to your own data as appropriate.

| id | title | year | abstract |
|----|-------|------|----------|
| 1 | (Title of invention) | (Filing year, etc.) | (Text) |
| 2 | ... | ... | ... |

--It is assumed that the input data (input.csv), the Word2Vec model, and this Python script are all in the same directory. Please adjust the paths to suit your own setup (a sketch for generating a dummy input.csv for testing follows below).
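If you just want to check that the script runs, a minimal sketch like the following can generate a dummy input.csv with the columns the script expects. The titles and abstracts here are placeholder text I made up, not real patent data.

import pandas as pd

#Create a small dummy input.csv with the expected columns (id, year, title, abstract)
#Replace this with real patent data in practice
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2018, 2019, 2020],
    "title": ["サンプル発明A", "サンプル発明B", "サンプル発明C"],
    "abstract": ["本発明はサンプルに関する。", "本発明はサンプルを提供する。", "本発明はサンプルを特徴とする。"],
})
sample.to_csv("input.csv", index=False, encoding="utf-8")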

Points to remember

--To generate a document vector from Word2Vec word vectors, this code uses the average of the word vectors contained in each document as the document vector.
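As a minimal illustration of this averaging, here is a toy sketch with made-up 3-dimensional vectors (the actual model is 300-dimensional), showing that out-of-vocabulary words are simply skipped.

import numpy as np

#Toy word vectors, 3 dimensions instead of 300, purely for illustration
word_vectors = {
    "特許": np.array([0.2, 0.4, 0.1]),
    "発明": np.array([0.3, 0.1, 0.5]),
}

doc = ["特許", "発明", "未知語"]  #"未知語" is not in the toy vocabulary

#Document vector = mean of the word vectors that are in the vocabulary
vecs = [word_vectors[w] for w in doc if w in word_vectors]
doc_vector = np.mean(vecs, axis=0)
print(doc_vector)  # [0.25 0.25 0.3 ]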

--For reference, the model's word coverage rate is printed to the console. It is calculated by averaging the per-document coverage; for example, a document with 45 words found in the model and 5 not found has a coverage of 45/50 = 90%.

--Since this is based on code written for other purposes, I have confirmed that it works with sample data, but I have not tried it with an actual patent document dataset. Perhaps I should obtain one from J-PlatPat.

Image of 2D visualization

You can visualize the result in Python alone using matplotlib and similar libraries (see the sketch below), but for reference, this is what it looks like when visualized by repurposing GIS (Geographic Information System) tools. Each plotted point is a document, heat-mapped according to the point density on the 2D plane. I think this kind of representation is useful as a basis for classification and analysis. sample_w2v.png
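For a Python-only plot, a minimal matplotlib sketch like the following reads the output.csv produced above and draws a density heat map with the individual documents overlaid. The hexbin density map and the output filename are my own choices for illustration, not part of the original workflow.

import pandas as pd
import matplotlib.pyplot as plt

#Load the 2D embedding produced by patent_w2v_JP.py
emb = pd.read_csv("output.csv", encoding="utf-8")

fig, ax = plt.subplots(figsize=(8, 8))
#Heat-map the point density on the 2D plane (similar in spirit to the GIS heat map)
hb = ax.hexbin(emb["x"], emb["y"], gridsize=40, cmap="YlOrRd", mincnt=1)
fig.colorbar(hb, ax=ax, label="document count")
#Overlay the individual documents as small points
ax.scatter(emb["x"], emb["y"], s=5, c="black", alpha=0.3)
ax.set_xlabel("t-SNE x")
ax.set_ylabel("t-SNE y")
ax.set_title("2D visualization of patent document vectors")
plt.savefig("sample_w2v_matplotlib.png", dpi=150)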
