[PYTHON] Creating a patent problem-solution map with Guided LDA (first half)

1. Purpose

Patent texts are long, so I want to read them efficiently, or grasp the overall tendency of a patent group at a glance. For this, it is easy to understand if the documents can be categorized and mapped along a "problem (purpose)" axis and a "solution" axis, as in the figure below.

(Figure: example of a problem × solution map. Reference: http://www.sato-pat.co.jp/contents/service/examination/facture.html)

I want to extract this problem axis and solution axis (the labels) automatically from the text. The problem awareness is almost the same as in this article. One candidate method is LDA. However, with ordinary LDA the topics cannot be controlled freely. Guided LDA is a method that lets a human nudge the topics by saying "I want this topic to contain words like these". Let's see whether the axes can be set up as intended.

2. Guided LDA

See here and here for an overview of Guided LDA, and the official repository for details.
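In short (following the official README), the guiding happens through two extra arguments to fit: seed_topics, a dict mapping word IDs to the topic they should lean toward, and seed_confidence, a 0-to-1 value controlling how strongly they lean. A minimal sketch, assuming X is a document-term matrix and vocab its vocabulary list:

#Minimal Guided LDA usage (X and vocab assumed already prepared)
word2id = dict((v, idx) for idx, v in enumerate(vocab))
seed_topic_list = [["problem", "issue"], ["method", "means"]]  #placeholder seed words
seed_topics = {}
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        seed_topics[word2id[word]] = t_id
model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)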

3. Process flow

First of all, let's get the pipeline to the point where it can produce output.

1. Required library and function definitions

#@title ← [STEP1] (required) Preparation for execution

!pip install guidedlda
import numpy as np
import pandas as pd
import guidedlda

#Function that builds a document-term count matrix and vocabulary list from a corpus
from sklearn.feature_extraction.text import CountVectorizer

def get_X_vocab(corpus):
    vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
    X = vectorizer.fit_transform(corpus)
    return X.toarray(), vectorizer.get_feature_names()
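One caveat on the helper above: in scikit-learn 1.2+ the get_feature_names() method was removed, so on newer environments the return line needs get_feature_names_out() instead (same content, returned as a list):

    return X.toarray(), list(vectorizer.get_feature_names_out())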

#Function to extract the top words for each topic
def out1(model, vocab):
    n_top_words = 10
    dic = {}
    topic_word = model.topic_word_
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
        print('Topic {}: {}'.format(i, ' '.join(topic_words)))
        dic['topic' + str(i)] = ' '.join(topic_words)
    return dic

2. Store the patent data (csv or excel) in a pandas dataframe.

Tokenize with MeCab (it doesn't have to be MeCab; a sketch of the assumed tokenizer follows the line below). col is the column to be processed.

df[col+'_1g']= df[col].apply(wakati,args=('DE',))
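The wakati function itself is not defined in this article. A minimal sketch of what it is assumed to do, using mecab-python3 (the 'DE' argument is not explained here, so it is treated as an opaque mode flag and ignored; tokens are joined with "|" because the later steps split on that character):

import MeCab

tagger = MeCab.Tagger("-Owakati")

def wakati(text, mode='DE'):
    #Space-separated MeCab tokenization, rejoined with "|"
    tokens = tagger.parse(str(text)).strip().split()
    return "|".join(tokens)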

3. Specify the columns you want to process (problem axis and solution axis)

#@title ← [STEP2] (required) Specify the columns to be processed

col_name = "Problem that the invention tries to solve _1g" # @param {type: "string"} col_name2 = "Claims _1g" # @ param {type: "string"}

df[col_name].replace({r'\d+': ''}, regex=True, inplace=True)
df[col_name2].replace({r'\d+': ''}, regex=True, inplace=True)

#Corpus ⇒ X (document-term frequency matrix) & vocabulary list
corpus = df[col_name].apply(lambda x: " ".join(x.split("|")))
X, vocab = get_X_vocab(corpus)
word2id = dict((v, idx) for idx, v in enumerate(vocab))

#Same for the second column (solution axis side)
corpus2 = df[col_name2].apply(lambda x: " ".join(x.split("|")))
X2, vocab2 = get_X_vocab(corpus2)
word2id2 = dict((v, idx) for idx, v in enumerate(vocab2))

print ("Extracted vocabulary list ---------------") print(vocab) print ("word count:" + str (len (vocab))) pd.DataFrame (vocab) .to_csv (col_name + "word list.csv") print(vocab2) print ("word count:" + str (len (vocab2))) pd.DataFrame (vocab2) .to_csv (col_name2 + "word list.csv") print ("The word list was saved in a virtual file as" word list.xlsx "")

(Figure: the extracted vocabulary list output)

4. Specify the seed words for each axis (problem axis and solution axis) + apply Guided LDA

Since hand-picking the axis words is tedious, here they are just selected mechanically from the vocabulary list.
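In a real analysis you would hand-pick the seed words so that each topic matches an axis label you have in mind, e.g. (hypothetical words, not from this data):

#Hypothetical hand-picked seed words, one comma-joined string per topic
topic0_subj = ",".join(["cost", "manufacturing", "productivity"])
topic1_subj = ",".join(["strength", "durability", "crack"])

The STEP3 cell below instead just slices ranges out of vocab, which is enough to check that the mechanism works.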

#@title ← [STEP3] Problem axis side_Semi-supervised (Guided) LDA execution

#The word lists specified here are what really matters
topic0_subj = ",".join(vocab[51:60])
topic1_subj = ",".join(vocab[61:70])
topic2_subj = ",".join(vocab[71:80])
topic3_subj = ",".join(vocab[81:90])
topic4_subj = ",".join(vocab[91:100])
topic5_subj = ",".join(vocab[101:110])
topic6_subj = ",".join(vocab[111:120])

input_topic0 = topic0_subj.split(",")
input_topic1 = topic1_subj.split(",")
input_topic2 = topic2_subj.split(",")
input_topic3 = topic3_subj.split(",")
input_topic4 = topic4_subj.split(",")
input_topic5 = topic5_subj.split(",")
input_topic6 = topic6_subj.split(",")

topic_list = [input_topic0
               ,input_topic1
               ,input_topic2
               ,input_topic3
               ,input_topic4
               ,input_topic5
               ,input_topic6]

seed_topic_list = []
for topic in topic_list:
    if topic[0] != "":
        seed_topic_list.append(topic)

#The topic count is the number of specified seed topics + 1 (the extra one catches everything else)
num_topic = len(seed_topic_list) + 1

s_conf = 0.12 #@param {type:"slider", min:0, max:1, step:0.01}
model = guidedlda.GuidedLDA(n_topics=num_topic, n_iter=100, random_state=7, refresh=20)
seed_topics = {}
for t_id,st in enumerate(seed_topic_list):
    for word in st:
        seed_topics[word2id[word]] = t_id

model.fit(X, seed_topics=seed_topics, seed_confidence=s_conf)
#doc_topic_ is the per-document topic distribution of the seeded model
docs = model.doc_topic_
print(docs)

print ("Result --- Typical words for each topic after learning ------------------------------ ---------- ") print ("The last topic was automatically inserted" Other "topic ----------------------------") dic = out1(model,vocab)

print ("Result of topic assignment to each application ---------------------------------------- ---------------- ") print("") df["no"]=df.index.tolist() df ['LDA result_subj'] = df ["no"] .apply (lambda x: "topic" + str (docs [x] .argmax ())) df [["Application number", "LDA result_subj"]] df ['LDA result_subj'] = df ['LDA result_subj'] .replace (dic) image.png

The solution axis is then processed in exactly the same way (its result goes into the column 'LDA result_kai').
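That second pass is not shown in this half of the article; below is a minimal sketch of it, assuming it simply repeats STEP3 against X2 / vocab2 / word2id2 and that the seed slices (placeholders here) are chosen from vocab2:

#Solution axis ('_kai') side, mirroring the problem-axis steps
seed_topic_list2 = [list(vocab2[51:60]), list(vocab2[61:70])]  #placeholder seed choices
num_topic2 = len(seed_topic_list2) + 1
model2 = guidedlda.GuidedLDA(n_topics=num_topic2, n_iter=100, random_state=7, refresh=20)
seed_topics2 = {}
for t_id, st in enumerate(seed_topic_list2):
    for word in st:
        seed_topics2[word2id2[word]] = t_id
model2.fit(X2, seed_topics=seed_topics2, seed_confidence=s_conf)
docs2 = model2.doc_topic_
dic2 = out1(model2, vocab2)
df['LDA result_kai'] = df["no"].apply(lambda x: "topic" + str(docs2[x].argmax()))
df['LDA result_kai'] = df['LDA result_kai'].replace(dic2)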

5. Cross-tabulate both results (application numbers are placed in each cell of the table)

ct = pd.crosstab(df['LDA result_kai'], df['LDA result_subj'], df['Application number'], aggfunc=','.join)
ct

Result ↓ One point I devised: if the table is output as-is, the axis labels appear as "topicN", so I made it output the top 10 representative words contained in each topic instead.

(Figure: problem × solution cross-tabulation, with application numbers in each cell)

If you want to display the number of cases instead:

ct = pd.crosstab(df['LDA result_kai'], df['LDA result_subj'], df['Application number'], aggfunc=np.size)
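Incidentally, pandas computes plain frequency counts whenever crosstab is called without a values column, so (as far as the count table goes) this should be equivalent and simpler:

ct = pd.crosstab(df['LDA result_kai'], df['LDA result_subj'])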

(Figure: cross-tabulation of case counts)

4. Performance evaluation

~~It was a mess...~~ The words are grouped reasonably well, so next time I have to try to reproduce the human-made map properly. The code also feels redundant (the two axes are processed by near-duplicate code), so I need to think about how to write it more concisely.
