[PYTHON] Word2Vec with BoUoW

Natural language processing: text vectorization, cosine similarity, word segmentation, machine learning... I'm interested in many of these things and keep trying them, but I always get stuck partway through...

Even if I get stuck, I'll want to try again someday, so I'm writing this down as a memo for when that time comes.

Let's start with why I needed to use BoUoW (Bash on Ubuntu on Windows). My memory is vague, but I had been doing various things with Jupyter Notebook on Windows 10 with Anaconda, and MeCab didn't work well. I've forgotten exactly how it failed. All I remember is that I couldn't figure it out.

Perhaps mecab-python wasn't working properly on Windows.

I found information online saying that rewriting the Linux version of the mecab-python source code for Windows and compiling it might work! But even with that information, I didn't know how to rewrite it for Windows...

Was I going to give up this time as well...?

While all this was running through my head and I was getting tired of sitting in front of my computer, I came across an article.

"Bash on Ubuntu on Windows"

Well, well. And it seems to be a standard feature of Windows 10. I felt motivated to sit in front of my computer again. With this, I can use mecab-python for Linux as it is!

I'd like to post the results of the various things I tried this way.

For now, the following worked, so I'm pasting the code before I forget. The environment is as follows:
・64-bit Windows 10 Creators Update
・Bash on Ubuntu on Windows
・Jupyter Notebook (anaconda3-4.4.0)
・Python 3.6

Most of the code is based on the code posted by @enta0701. Thank you. http://qiita.com/enta0701/items/87cbe783aeb44ddf41ce

I couldn't tell whether it was because I was running it on BoUoW instead of a Mac, or for some other reason, but with the original source code I couldn't read the ".txt" file properly.

(Postscript, 2017/11/16) It seems that the part around `__file__` wasn't working properly. (End of postscript)

(Postscript, 2017/11/19) I dealt with the error that occurs at `__file__` as described here: https://qiita.com/Chizizii/items/42a78754aa2fe6b6a29f (End of postscript)
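
As a general note, `__file__` is typically not defined inside a Jupyter Notebook cell, so referring to it raises a NameError there. The following is a minimal sketch of one common workaround, not necessarily what the linked article does: fall back to the current working directory when `__file__` is missing. The helper name base_dir is made up for illustration.

```python
import os


def base_dir():
    """Return the script's directory, or the current working directory
    when __file__ is undefined (for example, in a notebook cell)."""
    try:
        return os.path.dirname(os.path.abspath(__file__))
    except NameError:
        # No __file__ here, so assume the data sits in the working directory.
        return os.getcwd()


# "documents.txt" is the input file name used by the script in this post.
path = os.path.join(base_dir(), "documents.txt")
print(path)
```

With this approach the same code can run both as a script and in a notebook, as long as the notebook is started from the folder that holds the data.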

After trying various things, I wondered whether it failed because the contents of the ".txt" file were in Japanese. I then suspected that it was related to UTF-8 or something similar, but that didn't fix it either...

Next I wondered whether the path was too long, because I had put the ".txt" file deep in the Windows folder hierarchy. So I gave up on getting the path dynamically and decided to put the ".txt" file in a relatively shallow location specified by a fixed path. Along with that, I also cut some of the imports.

So, I managed to get it to work!

I use the rest almost as is, including the input text data.

Before I forget, I'll paste just the source code.

If you notice anything strange, please point it out. I don't have much knowledge...
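
For reference, here is a minimal sketch of reading a Japanese ".txt" file with the encoding given explicitly instead of relying on the system default. It assumes the file is saved as UTF-8. The helper name read_documents and the sample sentences are made up for illustration, and a temporary file stands in for the real fixed path. The "|" separator matches how the script below splits the input.

```python
import os
import tempfile


def read_documents(path, sep="|", encoding="utf-8"):
    """Read a text file and split it into documents on the separator."""
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    # Drop surrounding whitespace and any empty pieces.
    return [part.strip() for part in text.split(sep) if part.strip()]


# Write a small sample file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "documents.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("最初の文書です。|二つ目の文書です。|三つ目の文書です。\n")
    documents = read_documents(sample_path)

print(len(documents))
```

In the real script, the path would be the fixed one, '/mnt/c/BOW/documents.txt'.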

# -*- coding: utf-8 -*-
#Language specification py3
#Reference source http://qiita.com/enta0701/items/87cbe783aeb44ddf41ce

import numpy as np
import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Read the input from a fixed path (assumes the file is saved as UTF-8)
with open('/mnt/c/BOW/documents.txt', 'r', encoding='utf-8') as f:
    input_text = f.read()
print(input_text)

# The documents in the file are separated by "|"
documents = input_text.split("|")

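def words(text):
    """Return the surface forms of all nouns that MeCab finds in text."""
    out_words = []
    tagger = MeCab.Tagger('-Ochasen')
    # Parsing an empty string first is a commonly used workaround that
    # keeps node.surface from coming back empty or garbled.
    tagger.parse('')
    node = tagger.parseToNode(text)

    while node:
        # The first field of node.feature is the part of speech, which a
        # Japanese MeCab dictionary reports in Japanese (名詞 = noun).
        word_type = node.feature.split(",")[0]
        if word_type in ["名詞"]:
            out_words.append(node.surface)
        node = node.next
    return out_words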

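def vecs_array(documents):
    """Return the TF-IDF matrix of the documents as a dense NumPy array."""
    docs = np.array(documents)
    # A callable analyzer does its own tokenizing, so stop_words and
    # token_pattern (which apply only to analyzer='word') are left out.
    vectorizer = TfidfVectorizer(
        analyzer=words,
        min_df=1
    )
    vecs = vectorizer.fit_transform(docs)
    return vecs.toarray()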

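# One label per document, in the order they appear in documents.txt
tag = ["Article A", "Article B", "Article C", "Article D", "Article E", "Article F"]
# Vectorize once, then compare every document with every other document
vecs = vecs_array(documents)
cs_array = cosine_similarity(vecs, vecs)

for i, cs_item in enumerate(cs_array):
    print("[" + tag[i] + "]")
    cs_dic = {}
    for j, cs in enumerate(cs_item):
        # Skip the document's similarity with itself
        if i != j:
            cs_dic[tag[j]] = cs
    # Print the other documents, most similar first
    for k, v in sorted(cs_dic.items(), key=lambda x: x[1], reverse=True):
        print("\t" + k + " : " + str(v))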
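
To make clear what the last step computes, here is a small sketch of cosine similarity and the ranking loop in plain Python. It is a simplification, not the code above: it uses raw word counts instead of scikit-learn's TF-IDF weights, and the token lists are made-up stand-ins for the nouns MeCab would extract. The names cosine and rank_similar are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists, using raw word counts."""
    ca, cb = Counter(a), Counter(b)
    # Dot product over the words that both documents share.
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is similar to nothing.
    return dot / (norm_a * norm_b)


# Made-up, already tokenized documents (stand-ins for MeCab noun lists).
docs = {
    "Article A": ["cat", "fish", "cat"],
    "Article B": ["cat", "fish"],
    "Article C": ["dog", "walk"],
}


def rank_similar(name):
    """Return the other documents sorted from most to least similar."""
    others = [(other, cosine(docs[name], docs[other]))
              for other in docs if other != name]
    return sorted(others, key=lambda x: x[1], reverse=True)


for other, score in rank_similar("Article A"):
    print("\t" + other + " : " + str(round(score, 3)))
```

Documents that share many words score close to 1, and documents with no words in common score 0, which is the same ordering idea the script uses for its output.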

(Addition) I added "py3" as the language specification for the source code block. Thank you, shiracamus.

That's all.