[PYTHON] I made a learning kit for word2vec / doc2vec / GloVe / fastText

This article is the day-21 entry of the Natural Language Processing #2 Advent Calendar 2019. Incidentally, today is my birthday. Please celebrate. ~~Setting a deadline on my own birthday is rather masochistic of me~~

Introduction

In the word-embedding world, BERT has dominated the past year, and even ELMo is becoming a rarity. Still, there are times when you want to use classic distributed representations such as word2vec and GloVe, and times when you want to train them on your own data (at least for me). So I built a training kit for word2vec / doc2vec / GloVe / fastText for my own use, and I'm publishing it here.

word2vec / doc2vec / fastText models are trained with gensim, and GloVe models with the official implementation.

Usage instructions are in each package's README, so here I'll focus on the design concept behind the kit.

1. A standardized API for model training functions

There are many libraries and packages for word embeddings, and each one expects its dataset in a different format. Every time I wrote a one-off preprocessing script to massage my data into the right shape, the code got messier and messier. So in this kit the iterator for reading the text dataset is shared, and each training function converts the data into the format its library expects.

def train_*****_model(
    output_model_path,
    iter_docs,
    **kwargs
)

For word2vec:

def train_word2vec_model(
    output_model_path,
    iter_docs,
    size=300,
    window=8,
    min_count=5,
    sg=1,
    epoch=5
):
    """
    Parameters
    ----------
    output_model_path : string
        path of Word2Vec model
    iter_docs : iterator
        iterator over documents, each given as a list of words
    size : int
        size of word vector
    window : int
        window size of word2vec
    min_count : int
        minimum word count
    sg : int
        word2vec training algorithm (1: skip-gram, otherwise: CBOW)
    epoch : int
        number of epochs
    """

iter_docs is an iterator that yields each document as a list of words.
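
For reference, the body of such a function is essentially a thin wrapper around the backing library. Here is a minimal sketch for the word2vec case, assuming gensim 3.x (whose size and iter parameter names match the signature above); it is not necessarily the kit's exact implementation:

from gensim.models.word2vec import Word2Vec

def train_word2vec_model(output_model_path, iter_docs,
                         size=300, window=8, min_count=5, sg=1, epoch=5):
    # gensim makes multiple passes over the corpus, so collect the documents first
    sentences = [words for words in iter_docs]
    model = Word2Vec(
        sentences,
        size=size,           # vector dimensionality ("vector_size" in gensim >= 4.0)
        window=window,
        min_count=min_count,
        sg=sg,               # 1: skip-gram, 0: CBOW
        iter=epoch           # number of epochs ("epochs" in gensim >= 4.0)
    )
    model.save(output_model_path)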

2. Allow training from any text dataset

The kit provides an abstract class, TextDatasetBase, that defines the dataset-reading API. Any dataset can be handled by writing a reader class for it that inherits from this class.

from abc import ABC, abstractmethod


class TextDatasetBase(ABC):
    """
    a base class for text datasets
    """
    @abstractmethod
    def iter_docs(self):
        """
        iterator of documents, yielding each as a list of words
        """
        yield None

An example dataset class for MARD (the Multimodal Album Reviews Dataset):

import json
from pathlib import Path


class MARDDataset(TextDatasetBase):
    def __init__(self, word_tokenizer):
        self.root_path = None
        self.word_tokenizer = word_tokenizer

    def iter_docs(self, dataset_path):
        """
        iterate over the reviews in the dataset, yielding each as a list of words

        Parameters
        ----------
        dataset_path: string
            path to the dataset
        """
        self.root_path = Path(dataset_path)
        reviews_json_fn = self.root_path / "mard_reviews.json"
        # mard_reviews.json is in JSON Lines format: one review object per line
        with open(reviews_json_fn, "r") as fi:
            for line in fi:
                review_dict = json.loads(line)  # json.loads no longer takes an encoding argument in Python 3
                text = review_dict["reviewText"]
                yield self.word_tokenizer.tokenize(text)
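
Plugging this into the training function from section 1 looks roughly like the following. MyWordTokenizer is a hypothetical stand-in: any object with a tokenize(text) method that returns a list of words will do.

# hypothetical glue code; MyWordTokenizer stands in for your tokenizer of choice
dataset = MARDDataset(MyWordTokenizer())
train_word2vec_model(
    "model/word2vec.gensim.model",
    dataset.iter_docs("path/to/mard"),
    size=300
)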

I feel that PyTorch's DataLoader is about 200 million times more sophisticated than this, but this is what I came up with. Please let me know if you have a better design.

How to use

Installation

Taking word2vec as an example:

git clone git@github.com:stfate/word2vec-trainer.git
cd word2vec-trainer

git submodule init
git submodule update
pip install -r requirements.txt

Running training

python train_text_dataset.py -o $OUTPUT_PATH --dictionary-path=$DIC_PATH --corpus-path=$CORPUS_PATH --size=100 --window=8 --min-count=5

How to use the model

from gensim.models.word2vec import Word2Vec

model_path = "model/word2vec.gensim.model"
model = Word2Vec.load(model_path)
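
The loaded object is a regular gensim Word2Vec model, so the usual query API applies, for example:

# look up the words most similar to a query word
print(model.wv.most_similar("rock", topn=5))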

Caveat

When training on a large dataset such as Wikipedia, the process may eat up all available memory and crash. I'm still investigating.
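
One likely culprit is materializing the whole corpus in memory before training. gensim only needs a restartable iterable (it makes one pass over the corpus per epoch), so one possible mitigation, sketched below and not part of the kit, is to wrap the iterator factory in a small class with __iter__ so that documents are streamed from disk on every pass:

class RestartableCorpus:
    def __init__(self, make_iter):
        # make_iter: zero-argument callable returning a fresh document iterator
        self.make_iter = make_iter

    def __iter__(self):
        # called once per training pass; documents are streamed, not held in RAM
        return self.make_iter()

# e.g. Word2Vec(RestartableCorpus(lambda: dataset.iter_docs("path/to/wikipedia")), ...)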

In conclusion

It's fun to think about a library's API design.
