This article is the day-21 entry of the Natural Language Processing #2 Advent Calendar 2019. By the way, today is my birthday, so please celebrate. ~~Setting a deadline on your own birthday is rather masochistic.~~
In the world of word embeddings, BERT has dominated over the past year, and even ELMo is seen less and less. Still, there are times when you want to use legacy distributed representations such as word2vec and GloVe, and times when you want to train them on your own data (at least I do). So I built a training kit for word2vec / doc2vec / GloVe / fastText for my own use, and I am publishing it here.
word2vec, doc2vec, and fastText models are trained with gensim, and GloVe models are trained with the official implementation.
How to use each package is described in its README, so here I will write about the design concept of the training kit.
There are many libraries and packages for distributed word representations, and each one expects the dataset in a different format. Every time I wrote a preprocessing script to reshape the data for a particular library, the code got messier. So I made the iterator that reads the text dataset common to all of them, and each training function converts the data into the format its library expects. The training functions take the following form:
def train_*****_model(
    output_model_path,
    iter_docs,
    **kwargs
)
For word2vec:
def train_word2vec_model(
    output_model_path,
    iter_docs,
    size=300,
    window=8,
    min_count=5,
    sg=1,
    epoch=5
):
    """
    Parameters
    ----------
    output_model_path : string
        path of Word2Vec model
    iter_docs : iterator
        iterator of documents, each of which is a list of words
    size : int
        size of word vector
    window : int
        window size of word2vec
    min_count : int
        minimum word count
    sg : int
        word2vec training algorithm (1: skip-gram, otherwise: CBOW)
    epoch : int
        number of epochs
    """
Here, iter_docs is an iterator that yields a list of words for each document.
To provide it, I prepared an abstract class, TextDatasetBase, that defines the API for reading a dataset.
Any dataset can be handled by implementing a reader class for it that inherits from this class.
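To illustrate the idea of converting the shared iterator inside each training function, here is a minimal sketch; it is not the actual code of the kit. It assumes two hypothetical helpers. `write_glove_corpus` dumps the documents into a plain text file with space-separated tokens and one document per line, which is the corpus layout the official GloVe tools read. `RestartableDocs` wraps a generator factory so that a library that scans the corpus more than once (as gensim does for vocabulary building and for each epoch) can iterate over it again.

```python
import os
import tempfile


def write_glove_corpus(iter_docs, corpus_path):
    """Write one document per line, tokens separated by single spaces."""
    n_docs = 0
    with open(corpus_path, "w", encoding="utf-8") as fo:
        for words in iter_docs:
            fo.write(" ".join(words) + "\n")
            n_docs += 1
    return n_docs


class RestartableDocs:
    """Wrap a generator factory so the corpus can be iterated repeatedly."""

    def __init__(self, make_iter):
        # make_iter: zero-argument callable that returns a fresh iterator
        self.make_iter = make_iter

    def __iter__(self):
        return iter(self.make_iter())


def toy_docs():
    # stand-in for a dataset's iter_docs(); yields one word list per document
    yield ["natural", "language", "processing"]
    yield ["word", "embedding"]


docs = RestartableDocs(toy_docs)
first_pass = list(docs)
second_pass = list(docs)  # a bare generator would already be exhausted here

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = os.path.join(tmp_dir, "corpus.txt")
    n_written = write_glove_corpus(docs, corpus_path)
    with open(corpus_path, encoding="utf-8") as fi:
        corpus_lines = fi.read().splitlines()
```

Because both helpers consume the same kind of iterator, the dataset-reading code does not need to know which library the documents end up in.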
from abc import ABC, abstractmethod

class TextDatasetBase(ABC):
    """
    a base class for text dataset
    
    Attributes
    ----------
    """
    @abstractmethod
    def iter_docs(self):
        """
        iterator of documents
        
        Parameters
        ----------
        """
        yield None
An example dataset class for MARD:
import json
from pathlib import Path

class MARDDataset(TextDatasetBase):
    def __init__(self, word_tokenizer):
        self.root_path = None
        self.word_tokenizer = word_tokenizer

    def iter_docs(self, dataset_path):
        """
        iterate over the reviews, yielding the word list of each one
        
        Parameters
        ----------
        dataset_path: string
            path to dataset
        """
        self.root_path = Path(dataset_path)
        reviews_json_fn = self.root_path / "mard_reviews.json"
        # the file is read as one JSON object per line
        with open(reviews_json_fn, "r", encoding="utf-8") as fi:
            for line in fi:
                # json.loads() no longer accepts an encoding argument (removed in Python 3.9)
                review_dict = json.loads(line)
                text = review_dict["reviewText"]
                yield self.word_tokenizer.tokenize(text)
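Supporting another dataset follows the same pattern. The sketch below is an illustration under stated assumptions, not part of the kit: `LineTextDataset` and `WhitespaceTokenizer` are hypothetical names, the input is assumed to be a UTF-8 text file with one document per line, and the base class is repeated (with `iter_docs` taking the dataset path, as in MARDDataset) only so that the block runs on its own.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class TextDatasetBase(ABC):
    """Stand-in for the base class, repeated so this sketch runs alone."""

    @abstractmethod
    def iter_docs(self, dataset_path):
        ...


class WhitespaceTokenizer:
    """Hypothetical tokenizer exposing the tokenize() method used above."""

    def tokenize(self, text):
        return text.split()


class LineTextDataset(TextDatasetBase):
    """Hypothetical dataset: a UTF-8 text file with one document per line."""

    def __init__(self, word_tokenizer):
        self.word_tokenizer = word_tokenizer

    def iter_docs(self, dataset_path):
        with open(dataset_path, "r", encoding="utf-8") as fi:
            for line in fi:
                line = line.strip()
                if line:  # skip blank lines
                    yield self.word_tokenizer.tokenize(line)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "docs.txt")
    with open(path, "w", encoding="utf-8") as fo:
        fo.write("the quick brown fox\n\njumps over the lazy dog\n")
    dataset = LineTextDataset(WhitespaceTokenizer())
    docs = list(dataset.iter_docs(path))
```

Because the tokenizer is injected through the constructor, the same dataset class can be reused with a different tokenizer, for example a morphological analyzer for Japanese text.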
I feel that PyTorch's DataLoader is about 200 million times more sophisticated than this, but this is the best I could come up with.
Please let me know if you have a better design.
Taking word2vec as an example, first clone the repository and install the dependencies:
git clone git@github.com:stfate/word2vec-trainer.git
cd word2vec-trainer
git submodule init
git submodule update
pip install -r requirements.txt
Then train a model:
python train_text_dataset.py -o $OUTPUT_PATH --dictionary-path=$DIC_PATH --corpus-path=$CORPUS_PATH --size=100 --window=8 --min-count=5
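How the script parses these options is not shown in this article. As a rough sketch only, assuming `argparse`, the flags of the training command could be mapped to the keyword arguments of `train_word2vec_model` as follows. The flag names come from the command; `build_parser`, the `dest` name for `-o`, the help texts, the choice of required options, and the defaults (copied from the function signature) are assumptions.

```python
import argparse


def build_parser():
    # Flag names are taken from the command above; everything else is assumed.
    parser = argparse.ArgumentParser(description="Train word2vec on a text dataset")
    parser.add_argument("-o", dest="output_model_path", required=True,
                        help="where to save the trained model")
    parser.add_argument("--dictionary-path",
                        help="dictionary for the tokenizer (assumed meaning)")
    parser.add_argument("--corpus-path", required=True, help="path to the dataset")
    parser.add_argument("--size", type=int, default=300, help="word vector size")
    parser.add_argument("--window", type=int, default=8, help="context window size")
    parser.add_argument("--min-count", type=int, default=5, help="minimum word count")
    return parser


args = build_parser().parse_args([
    "-o", "model/word2vec.gensim.model",
    "--dictionary-path=dic",
    "--corpus-path=corpus",
    "--size=100", "--window=8", "--min-count=5",
])

# keyword arguments that would be forwarded to train_word2vec_model()
train_kwargs = {
    "size": args.size,
    "window": args.window,
    "min_count": args.min_count,
}
```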
Load the trained model with gensim:
from gensim.models import Word2Vec

model_path = "model/word2vec.gensim.model"
model = Word2Vec.load(model_path)
When training on a large dataset such as Wikipedia, the process may run out of memory and crash. I am still investigating this.
It's fun to think about a library's API.