Search algorithm using word2vec [python]

Overview

I built a search algorithm using word2vec, hoping that it would match not only exact words but also synonyms. In the end, the accuracy of word2vec information alone was not satisfactory, but the expected effect did appear, so combining it with other methods seems promising.

Environment

macOS Catalina 10.15.4
Python 3.8.0

Training word2vec

I will leave the detailed training procedure to other articles. I downloaded the full text of Japanese Wikipedia, tokenized it with MeCab using the mecab-ipadic-NEologd dictionary, and trained word2vec with gensim. The training code is below; I borrowed it from "Creating a word2vec model using the Wikipedia corpus" (Adhesive devotion diary).

from gensim.models import word2vec
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus('./wiki_wakati_neologd.txt')

model = word2vec.Word2Vec(sentences, size=200, min_count=20, window=15)
model.save("./wiki_neologd.model")

Search algorithm

The similarity between the input sentence and each document in the search target is calculated as follows, and the document with the highest similarity is returned: extract the content words from both texts; for each word of the input, take the maximum word2vec similarity against the words of the document; then sum those maxima over all input words.
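This scoring scheme can be sketched with plain numpy vectors (the 2-d vectors below are made up for illustration; the real system uses word2vec word vectors):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def max_sum_similarity(query_vecs, doc_vecs):
    # For each query word vector, find its best match among the document's
    # word vectors, then sum those maxima over the whole query.
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)

# Made-up 2-d "word vectors" for illustration
query = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
doc = [np.array([1.0, 1.0])]
print(max_sum_similarity(query, doc))  # 2 * cos(45 deg) = sqrt(2) ≈ 1.414
```

Note that the score is asymmetric (it iterates over the query words), and longer queries yield larger totals.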

The code that achieves the above is below.

import re
import neologdn
import MeCab, jaconv
import pandas as pd
from gensim.models import word2vec

#loading word2vec model
MODEL_PATH = "wiki_neologd.model"
model = word2vec.Word2Vec.load(MODEL_PATH)
#Object for morphological analysis
m = MeCab.Tagger()
#Object for regular expression processing
re_kana = re.compile('[a-zA-Z\u3041-\u309F]')
re_num = re.compile('[0-9]+')

#Document normalization function before morphological analysis
def format_text(text):
  text = neologdn.normalize(text)
  return text

#Extract the base forms of the content words of a sentence, excluding single letters/kana and numbers
def extract_words(text):
  words = []
  for token in m.parse(text).splitlines()[:-1]:
    if '\t' not in token: continue

    surface = token.split('\t')[0]
    pos = token.split('\t')[1].split(',')

    ok = (pos[0]=='名詞' and pos[1] in ['一般','固有名詞','サ変接続','形容動詞語幹'])
    ok = (ok or (pos[0]=='形容詞' and pos[1] == '自立'))
    ok = (ok or (pos[0]=='副詞' and pos[1] == '一般'))
    ok = (ok or (pos[0]=='動詞' and pos[1] == '自立'))

    if not ok: continue

    stem = pos[-3] #base form (原形): the 7th of the 9 ipadic feature fields

    if stem == '*': stem = surface
    if stem == '-': continue
    if re_kana.fullmatch(stem) and len(stem)==1: continue
    if re_num.fullmatch(stem): continue
    if stem == '': continue

    words.append(stem)
  return words
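The parsing logic of `extract_words` can be exercised without installing MeCab by feeding it a hand-written analysis line. The line below is my assumption of what MeCab with ipadic emits for 走った ("ran"); the real input comes from `MeCab.Tagger().parse()`:

```python
# Hand-written stand-in for MeCab.Tagger().parse("走った"); format per line:
# surface \t pos,pos1,pos2,pos3,conj_type,conj_form,base,reading,pronunciation
raw = ("走っ\t動詞,自立,*,*,五段・ラ行,連用タ接続,走る,ハシッ,ハシッ\n"
       "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
       "EOS\n")

words = []
for token in raw.splitlines()[:-1]:  # drop the trailing EOS line
    if '\t' not in token:
        continue
    surface = token.split('\t')[0]
    pos = token.split('\t')[1].split(',')
    if pos[0] == '動詞' and pos[1] == '自立':  # keep independent verbs
        words.append(pos[-3])                  # base form (原形)
print(words)  # ['走る']
```

The auxiliary た is filtered out by the part-of-speech check, and the inflected surface 走っ is replaced by its base form 走る.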
#Extract independent words from a set of documents and create a database.
def get_document_db(documents):
  id_list = list(range(len(documents)))
  words_list = []
  for d in documents:
    d = format_text(d)
    words = extract_words(d)
    #Exclude duplicates
    words = set(words)
    #Exclude words that are not in word2vec
    words = [w for w in words if w in model.wv.vocab]
    words_list.append(words)
  db = pd.DataFrame({"id":id_list, "text":documents, "word":words_list})
  return db

#Return the similarity between word lists words1 and words2
def calc_similarity(words1, words2):
  total = 0
  for w1 in words1:
    if w1 not in model.wv.vocab: continue
    max_sim = 0
    for w2 in words2:
      if w2 not in model.wv.vocab: continue
      sim = model.wv.similarity(w1=w1, w2=w2)
      max_sim = max(max_sim, sim)
    total += max_sim
  return total

#Calculate and return the similarity between the input sentence and each document in db
def add_similarity_to_db(text, db):
  text = format_text(text)
  words1 = extract_words(text)
  #Exclude duplicates
  words1 = set(words1)
  words1 = [w for w in words1 if w in model.wv.vocab]
  similarity = []
  for words2 in db.word:
    sim = calc_similarity(words1, words2)
    similarity.append(sim)

  similarity_df = pd.DataFrame({"similarity": similarity})
  db2 = pd.concat([db,similarity_df], axis=1)
  
  db2 = db2.sort_values(["similarity"], ascending=False)
  db2 = db2.reset_index()
  return db2
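The pandas part of `add_similarity_to_db` can be seen in isolation with made-up similarity scores (all names and numbers below are toy data):

```python
import pandas as pd

db = pd.DataFrame({"id": [0, 1, 2], "text": ["doc A", "doc B", "doc C"]})
similarity_df = pd.DataFrame({"similarity": [0.2, 0.9, 0.5]})

# Attach the scores column-wise, then rank documents by descending similarity
db2 = pd.concat([db, similarity_df], axis=1)
db2 = db2.sort_values("similarity", ascending=False).reset_index(drop=True)
print(list(db2.text))  # ['doc B', 'doc C', 'doc A']
```

One detail: plain `reset_index()`, as used in `add_similarity_to_db`, keeps the old row index as an extra `index` column; `reset_index(drop=True)` discards it.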

#Output check
#Document set to be searched
#Each element of the list is one document. As an example, I use sentences extracted from my past articles.
documents = [
  'I made a function to get Japanese pronunciation in kana with MeCab and Python. For example, if you enter "I slept well today", it will return "Kyowayokunemashita".',
  'I made a python function that divides Japanese (katakana strings) into mora units (mora division). Mora and syllable are typical division units in Japanese phonology. Mora is the unit counted in the so-called "5, 7, 5" of haiku, where long vowels (ー), geminates (っ), and the nasal (ん) each count as one beat. In syllables, on the other hand, long vowels, geminates, and nasals are not counted alone but form one beat together with the preceding kana.',
  'I made a python function that divides Japanese (katakana strings) into syllable units (syllable division). Mora and syllable are typical division units in Japanese phonology. Mora is the unit counted in the so-called "5, 7, 5" of haiku, where long vowels (ー), geminates (っ), and the nasal (ん) each count as one beat. In syllables, on the other hand, long vowels, geminates, and nasals are not counted alone but form one beat together with the preceding kana. When long vowels, geminates, and nasals run consecutively, three or more moras may form a single syllable.',
  'In Python, I created a function that divides a sentence into phrases rather than words (parts of speech).',
  'I solved the problem "find the 25 trillionth digit of the factorial of 100 trillion (in decimal notation)" with python. (The original is the famous question "Is the 25 trillionth digit from the right of the factorial of 100 trillion even or odd?")',
  'This is a memo of what I did in my environment (macOS) to change the default dictionary of MeCab to mecab-ipadic-NEologd. As many people have written, the default dictionary can be changed by editing a file called mecabrc, but my environment had multiple mecabrc files and it was unclear which one to edit, so I also describe how to find the file that should actually be edited, together with what worked.'
]

db = get_document_db(documents)
input_texts = ["Separate Japanese","Factorial calculation","Separate Japanese with mora", "Separate Japanese by syllables"]
for text in input_texts:
  print(text)
  result = add_similarity_to_db(text, db)
  for sim, d in list(zip(result.similarity, result.text))[:10]:
    disp = d
    if len(d)>20: disp = d[:20]+"..."
    print(sim, disp)

The output is below. With so few documents the example is hard to judge, but the first two queries behave roughly as expected. For the last two queries, the mora article and the syllable article get exactly the same score; mora and syllable apparently co-occur throughout this document set. Weighting each word by its importance in the document (e.g. with tf-idf) might improve this, which I leave as future work.
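One possible direction for the tf-idf idea (not implemented in the article; this is a rough sketch): weight each input word's best-match score by its inverse document frequency, so that words co-occurring in almost every document contribute less. The word lists below are toy data in the same shape as the `word` column of db:

```python
import math

# Toy word lists per document (same shape as the `word` column of db)
doc_words = [["mora", "syllable", "japanese"],
             ["factorial", "python"],
             ["japanese", "python"]]

n_docs = len(doc_words)
df = {}
for words in doc_words:
    for w in set(words):
        df[w] = df.get(w, 0) + 1

# idf: words appearing in fewer documents get larger weights
idf = {w: math.log(n_docs / c) for w, c in df.items()}
print(idf["mora"] > idf["python"])  # True: "mora" appears in fewer documents
```

In `calc_similarity`, each per-word maximum could then be multiplied by `idf[w1]` before summing, so that ubiquitous words no longer dominate the score.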

Separate Japanese
2.593316972255707 divides Japanese (katakana strings) into mora units...
2.593316972255707 divides Japanese (katakana strings) into syllable units...
1.6599590480327606 Japanese pronunciation in kana with MeCab and Python...
1.5144233107566833 change the default dictionary of MeCab to mecab-...
1.4240807592868805 divides a sentence into phrases rather than words...
1.18932443857193 the 25 trillionth digit of the factorial of 100 trillion...
Factorial calculation
1.4738755226135254 the 25 trillionth digit of the factorial of 100 trillion...
1.1860262751579285 Japanese pronunciation in kana with MeCab and Python...
1.1831795573234558 divides Japanese (katakana strings) into mora units...
1.1831795573234558 divides Japanese (katakana strings) into syllable units...
1.1831795573234558 divides a sentence into phrases rather than words...
0.7110081613063812 change the default dictionary of MeCab to mecab-...
Separate Japanese with mora
3.0 divides Japanese (katakana strings) into mora units...
3.0 divides Japanese (katakana strings) into syllable units...
1.754945456981659 Japanese pronunciation in kana with MeCab and Python...
1.6068530082702637 divides a sentence into phrases rather than words...
1.226668268442154 change the default dictionary of MeCab to mecab-...
1.1506744921207428 the 25 trillionth digit of the factorial of 100 trillion...
Separate Japanese by syllables
3.0 divides Japanese (katakana strings) into mora units...
3.0 divides Japanese (katakana strings) into syllable units...
1.862914353609085 Japanese pronunciation in kana with MeCab and Python...
1.6907644867897034 divides a sentence into phrases rather than words...
1.2761026918888092 change the default dictionary of MeCab to mecab-...
1.2211730182170868 the 25 trillionth digit of the factorial of 100 trillion...
