[PYTHON] Vectorize sentences and search for similar sentences

I used Doc2Vec to search for similar sentences, so I will introduce the implementation here.

What is Doc2Vec?

For a computer to process natural language, human language must first be converted into values the computer can handle. [Word2Vec] exists as a method for vectorizing the meaning of words. The linked article explains the details very clearly, but roughly speaking, a word is represented by the n words that appear before and after it. With this approach, for example, "dog" and "cat" appear in similar contexts, so they can be thought of as having similar "meanings". Doc2Vec is an extension of Word2Vec that vectorizes whole sentences (documents).
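The idea of describing a word by the n words before and after it can be sketched in plain Python. This toy helper and its token list are illustrative only, not part of the article's code:

```python
# Illustrative sketch: collect the (word, context word) pairs inside a
# window of n words on each side, the raw material Word2Vec-style models
# learn from.
def context_pairs(tokens, n):
    """Return (word, context_word) pairs within a window of n words."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - n)
        hi = min(len(tokens), i + n + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((word, tokens[j]))
    return pairs

print(context_pairs(["the", "dog", "runs"], 1))
# [('the', 'dog'), ('dog', 'the'), ('dog', 'runs'), ('runs', 'dog')]
```

Words that keep similar company produce similar pair sets, which is why their learned vectors end up close together.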

Implementation sample

This time, I will use Doc2Vec to implement the following two features: searching for sentences by word, and searching for similar sentences.

As a sample, I used texts from Aozora Bunko. The code used in this article is [published on GitHub][GitHub]. (I also zipped the texts used for training, but please note that the archive is large.)

Environment

Make sure Python, MeCab, and gensim (used in the code below) are available.

Learning

  1. Get the text file
  2. Get text from file
  3. Remove unnecessary parts from the text
  4. Break down into words
  5. Learn with Doc2Vec
  6. Output training data

The processing follows this flow.

1. Get the text file

import os
import sys
import MeCab
import collections
from gensim import models
# Note: in newer versions of gensim, LabeledSentence has been replaced by TaggedDocument
from gensim.models.doc2vec import LabeledSentence

First, import the required libraries.

def get_all_files(directory):
    for root, dirs, files in os.walk(directory):
        for file in files:
            yield os.path.join(root, file)

Gets all files under the given directory.
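As a quick check, the generator can be exercised on a throwaway directory tree (the file and directory names here are made up for illustration):

```python
import os
import tempfile

def get_all_files(directory):  # same helper as above
    for root, dirs, files in os.walk(directory):
        for file in files:
            yield os.path.join(root, file)

# Build a small temporary tree: two files at the top level, one in a subdirectory.
with tempfile.TemporaryDirectory() as d:
    os.makedirs(os.path.join(d, "novels"))
    for name in ("a.txt", "b.txt", os.path.join("novels", "c.txt")):
        open(os.path.join(d, name), "w").close()
    paths = sorted(get_all_files(d))
    print(len(paths))  # 3
```

Because `os.walk` descends recursively, files in nested directories are included as well.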

2. Get text from file

def read_document(path):
    # Aozora Bunko text files are Shift_JIS encoded
    with open(path, 'r', encoding='sjis', errors='ignore') as f:
        return f.read()

3. Remove unnecessary parts from the text

def trim_doc(doc):
    lines = doc.splitlines()
    valid_lines = []
    is_valid = False
    horizontal_rule_cnt = 0
    break_cnt = 0
    for line in lines:
        # The body starts after the second horizontal rule ('-----')
        if horizontal_rule_cnt < 2 and '-----' in line:
            horizontal_rule_cnt += 1
            is_valid = horizontal_rule_cnt == 2
            continue
        if not(is_valid):
            continue
        # Three consecutive blank lines mark the start of the footer
        if line == '':
            break_cnt += 1
            is_valid = break_cnt != 3
            continue
        break_cnt = 0
        valid_lines.append(line)
    return ''.join(valid_lines)

The processing here will likely change depending on the target text. This time, I stripped the explanatory sections that appear before and after the body. It is unclear how much this affects accuracy in the first place.
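To see the trimming in action, here is `trim_doc` applied to a tiny hand-written text that imitates the Aozora Bunko layout (the sample text is made up):

```python
# Copy of trim_doc from above: keep only the body between the second
# horizontal rule and the first run of three blank lines.
def trim_doc(doc):
    lines = doc.splitlines()
    valid_lines = []
    is_valid = False
    horizontal_rule_cnt = 0
    break_cnt = 0
    for line in lines:
        if horizontal_rule_cnt < 2 and '-----' in line:
            horizontal_rule_cnt += 1
            is_valid = horizontal_rule_cnt == 2
            continue
        if not(is_valid):
            continue
        if line == '':
            break_cnt += 1
            is_valid = break_cnt != 3
            continue
        break_cnt = 0
        valid_lines.append(line)
    return ''.join(valid_lines)

sample = "\n".join([
    "Title",
    "-----",
    "header notes",
    "-----",
    "body line 1",
    "body line 2",
    "", "", "",
    "footer notes",
])
print(trim_doc(sample))  # body line 1body line 2
```

The header notes and footer notes are dropped; only the body lines survive, joined into one string.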

4. Break down into words

def split_into_words(doc, name=''):
    mecab = MeCab.Tagger("-Ochasen")
    valid_doc = trim_doc(doc)
    # Parse the trimmed body; in ChaSen format, column 3 holds the part of speech
    lines = mecab.parse(valid_doc).splitlines()
    words = []
    for line in lines:
        chunks = line.split('\t')
        # Keep verbs (動詞), adjectives (形容詞), and nouns (名詞) other than numerals (名詞-数)
        if len(chunks) > 3 and (chunks[3].startswith('動詞') or chunks[3].startswith('形容詞') or (chunks[3].startswith('名詞') and not chunks[3].startswith('名詞-数'))):
            words.append(chunks[0])
    return LabeledSentence(words=words, tags=[name])

def corpus_to_sentences(corpus):
    docs = [read_document(x) for x in corpus]
    for idx, (doc, name) in enumerate(zip(docs, corpus)):
        sys.stdout.write('\rPreprocessing {} / {}'.format(idx, len(corpus)))
        yield split_into_words(doc, name)

Reads the text from each file and breaks it down into words. To improve accuracy, it is apparently common to use only nouns for training. This time, I used verbs, adjectives, and nouns (other than numerals).
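To see what the part-of-speech filter keeps, here is the same condition applied to a few hand-written ChaSen-style lines. These lines imitate MeCab's `-Ochasen` output for illustration; they are not real parser output:

```python
# Same POS filter as in split_into_words: keep verbs, adjectives, and
# nouns other than numerals. Column 3 of each tab-separated line is the
# part-of-speech tag.
def filter_words(chasen_lines):
    words = []
    for line in chasen_lines:
        chunks = line.split('\t')
        if len(chunks) > 3 and (chunks[3].startswith('動詞')
                or chunks[3].startswith('形容詞')
                or (chunks[3].startswith('名詞')
                    and not chunks[3].startswith('名詞-数'))):
            words.append(chunks[0])
    return words

lines = [
    "猫\tネコ\t猫\t名詞-一般",      # noun: kept
    "が\tガ\tが\t助詞-格助詞",      # particle: dropped
    "三\tサン\t三\t名詞-数",        # numeral noun: dropped
    "歩く\tアルク\t歩く\t動詞-自立",  # verb: kept
]
print(filter_words(lines))  # ['猫', '歩く']
```

Particles and numerals are filtered out, leaving only the content words that carry meaning.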

5. Learn with Doc2Vec

PASSING_PRECISION = 94  # training ends when 94 of the 100 checked documents are their own best match

def train(sentences):
    # Note: in gensim 4.x, `size` is now `vector_size`, and model.train()
    # requires the total_examples and epochs arguments
    model = models.Doc2Vec(size=400, alpha=0.0015, sample=1e-4, min_count=1, workers=4)
    model.build_vocab(sentences)
    for x in range(30):
        print(x)
        model.train(sentences)
        ranks = []
        for doc_id in range(100):
            inferred_vector = model.infer_vector(sentences[doc_id].words)
            sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
            rank = [docid for docid, sim in sims].index(sentences[doc_id].tags[0])
            ranks.append(rank)
        print(collections.Counter(ranks))
        if collections.Counter(ranks)[0] >= PASSING_PRECISION:
            break
    return model

The training parameters are set in the models.Doc2Vec call.

alpha
The higher it is, the faster training converges, but if it is too high, it diverges. The lower it is, the higher the accuracy, but the slower the convergence.

sample
Words that appear too often are likely to be meaningless and may be ignored; this sets that threshold.

min_count
Conversely to sample, words that appear too rarely may not be appropriate for describing a sentence and may be ignored. However, this time I targeted all words (min_count=1).

    for x in range(30):
        print(x)
        model.train(sentences)
        ranks = []
        for doc_id in range(100):
            inferred_vector = model.infer_vector(sentences[doc_id].words)
            sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs))
            rank = [docid for docid, sim in sims].index(sentences[doc_id].tags[0])
            ranks.append(rank)
        print(collections.Counter(ranks))
        if collections.Counter(ranks)[0] >= PASSING_PRECISION:
            break
    return model

This part performs training and evaluation. For evaluation, I search for similar sentences for 100 of the training sentences and count how many times the most similar result is the document itself. This time, training ends when that count reaches 94 or more (because the accuracy did not improve further after several more iterations).
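The rank computation can be illustrated in isolation with a made-up similarity list (the tags and scores below are hypothetical):

```python
# For each document: rank all documents by similarity to its inferred
# vector, then record where the document itself appears in that ranking.
# Rank 0 means the document was retrieved as its own best match.
import collections

def self_rank(sims, own_tag):
    """sims: list of (tag, similarity) pairs, sorted most similar first."""
    return [tag for tag, sim in sims].index(own_tag)

sims = [("novel_a.txt", 0.98), ("novel_b.txt", 0.71), ("novel_c.txt", 0.55)]
ranks = [self_rank(sims, "novel_a.txt"), self_rank(sims, "novel_c.txt")]
print(collections.Counter(ranks))  # Counter({0: 1, 2: 1})
```

In the training loop, `collections.Counter(ranks)[0]` is exactly the number of documents that ranked themselves first, which is what gets compared against PASSING_PRECISION.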

6. Output training data

model.save(OUTPUT_MODEL)

OUTPUT_MODEL contains the output path.

Search sentences by word

model = models.Doc2Vec.load('doc2vec.model')

def search_similar_texts(words):
    x = model.infer_vector(words)
    most_similar_texts = model.docvecs.most_similar([x])
    for similar_text in most_similar_texts:
        print(similar_text[0])

Since Doc2Vec also vectorizes words (as Word2Vec does) at the same time, I also tried searching for similar words.

def search_similar_words(words):
    for word in words:
        print()
        print(word + ':')
        # Note: in newer gensim, word-level search is model.wv.most_similar
        for result in model.most_similar(positive=word, topn=10):
            print(result[0])

Example of searching for 猫 ("cat")

(image: 猫.PNG)

Example of searching for 雪 ("snow")

(image: 雪.PNG)

Search for similar sentences

model = models.Doc2Vec.load('doc2vec.model')

def search_similar_texts(path):
    # path is the tag assigned to the document during training
    most_similar_texts = model.docvecs.most_similar(path)
    for similar_text in most_similar_texts:
        print(similar_text[0])

An example of searching with "I Am a Cat" by Natsume Soseki

(image: 夏目漱石.PNG)

An example of searching with "No Longer Human" by Osamu Dazai

(image: 太宰治.PNG)

Summary

I implemented a search for similar sentences with Doc2Vec. I hope you find it helpful.

If an error occurs

These are only the errors that occurred in my environment, but I will post the solutions.

reference

- [Word2Vec: The amazing power of word vectors that surprises the inventor][Word2Vec]
- [How Doc2Vec works and document similarity calculation tutorial using gensim][Tutorial]
- [What happens if you do machine learning with a pixiv novel (learned model data is distributed)][pixiv]
- [Use TensorFlow to check the difference in movement depending on the learning rate][Learning rate]
- [models.doc2vec – Deep learning with paragraph2vec][doc2vec]

[Word2Vec]: https://deepage.net/bigdata/machine_learning/2016/09/02/word2vec_power_of_word_vector.html
[GitHub]: https://github.com/Foo-x/doc2vec-sample
[Tutorial]: https://deepage.net/machine_learning/2017/01/08/doc2vec.html
[pixiv]: http://inside.pixiv.net/entry/2016/09/13/161454
[Learning rate]: http://qiita.com/isaac-otao/items/6d44fdc0cfc8fed53657
[doc2vec]: https://radimrehurek.com/gensim/models/doc2vec.html
