Natural Language: Doc2Vec Part1 - livedoor NEWS Corpus

Target

This series covers document classification using the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we will prepare for document classification using CNTK.

The steps are introduced in the following order.

  1. Download livedoor news corpus
  2. Preprocessing of text data, creation of word dictionary
  3. Create a file to be read by the built-in reader provided by CNTK

Introduction

Download livedoor news corpus

livedoor news corpus

・Dokujo Tsushin ・IT Life Hack ・Kaden Channel ・livedoor HOMME ・MOVIE ENTER ・Peachy ・S-MAX ・Sports Watch ・Topic News

This is a corpus consisting of articles from the 9 categories above. Each article file is distributed under a Creative Commons Attribution-NoDerivatives license.

livedoor news corpus

Access the above page and download / unzip ldcc-20140209.tar.gz.
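Downloading and unpacking can also be scripted. The sketch below uses only the standard library; the download URL is an assumption based on the corpus distribution page, so verify it before relying on it.

```python
import os
import tarfile
import urllib.request

# URL of the archive on the livedoor news corpus page (an assumption; check the page).
CORPUS_URL = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"


def download_corpus(url: str, filename: str) -> None:
    """Download the archive only if it is not already present."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)


def extract_corpus(tar_path: str, dest_dir: str = ".") -> None:
    """Unpack the tar.gz; it creates a text/ directory with one subdirectory per category."""
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)


# Typical usage (not executed here to avoid an unconditional download):
#   download_corpus(CORPUS_URL, "ldcc-20140209.tar.gz")
#   extract_corpus("ldcc-20140209.tar.gz")
```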

The directory structure this time is as follows.

Doc2Vec
  |―text
  |   |―...
  |―doc2vec_corpus.py
Word2Vec

Preprocessing of text data, creation of word dictionary

The text data preprocessing reuses the functions implemented in Natural Language: Word2Vec Part1 --Japanese Corpus.

For word splitting, MeCab with the NEologd dictionary is used for segmentation, followed by stopword removal.
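A minimal sketch of the stopword-removal step. The tokens are assumed to come from MeCab with the NEologd dictionary (e.g. via `MeCab.Tagger("-Owakati")`), which is not shown here, and the stopword set below is a toy example, not the list used in doc2vec_corpus.py.

```python
# Illustrative stopword set only; the real list is larger.
STOPWORDS = {"の", "に", "は", "を", "が", "と", "て", "です", "ます"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, preserving order."""
    return [t for t in tokens if t not in stopwords]


# Example with tokens as a word splitter would produce them:
tokens = ["今日", "の", "ニュース", "は", "面白い"]
print(remove_stopwords(tokens))  # ['今日', 'ニュース', '面白い']
```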

Also, for model performance evaluation, 10 documents are separated from each category as verification data.
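Holding out 10 documents per category might look like the following sketch. The function name and the choice of taking the first 10 files are assumptions; the actual selection in doc2vec_corpus.py may differ.

```python
import os


def split_per_category(text_dir, n_val=10):
    """For each category directory under text_dir, hold out the first n_val
    article files as validation data and keep the rest for training.
    Returns two lists of (category, filename) pairs."""
    train, val = [], []
    for category in sorted(os.listdir(text_dir)):
        cat_dir = os.path.join(text_dir, category)
        if not os.path.isdir(cat_dir):
            continue  # skip license/readme files next to the category dirs
        files = sorted(f for f in os.listdir(cat_dir) if f.endswith(".txt"))
        val.extend((category, f) for f in files[:n_val])
        train.extend((category, f) for f in files[n_val:])
    return train, val
```

With 9 categories this yields 90 validation documents, matching the counts reported in the results below.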

This time, following Computer Vision: Image Caption Part1 - STAIR Captions, words that appeared only once were replaced with UNK.
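Building the word dictionary with rare words mapped to UNK can be sketched as follows. The function names are illustrative; only the `word2id`/`id2word` names follow the files the program saves.

```python
from collections import Counter


def build_vocab(documents, min_count=2, unk="UNK"):
    """Count words over all documents and assign IDs only to words that
    appear at least min_count times; everything else maps to UNK (ID 0)."""
    counts = Counter(w for doc in documents for w in doc)
    frequent = [w for w, c in counts.items() if c >= min_count]
    word2id = {unk: 0}
    for w in sorted(frequent):
        word2id[w] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def encode(doc, word2id, unk="UNK"):
    """Map a tokenized document to word IDs, falling back to UNK."""
    return [word2id.get(w, word2id[unk]) for w in doc]
```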

Creating a file to be read by the built-in reader provided by CNTK

During training, we will use CTFDeserializer, one of CNTK's built-in readers. This time, one category label is assigned to each document, which consists of many words.

The general processing flow of the program preparing for Doc2Vec is as follows.

  1. Preparation of training data and verification data
  2. Preprocessing of text data and creation of word dictionary
  3. Writing documents and category labels
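Step 3 writes the text file that CTFDeserializer reads. The sketch below produces the layout shown in the Commentary section: one sparse `|word` entry per line, with the `|label` entry attached to the first line of each sequence. The function name is an assumption.

```python
def write_ctf(path, documents, labels, word2id):
    """Write encoded documents in CNTK Text Format: the leftmost number is
    the sequence ID shared by all lines of one document, each line carries
    one |word entry, and the first line also carries the |label entry."""
    with open(path, "w", encoding="utf-8") as f:
        for seq_id, (doc, label) in enumerate(zip(documents, labels)):
            for i, word in enumerate(doc):
                line = "{} |word {}:1".format(seq_id, word2id[word])
                if i == 0:
                    line += "\t|label {}:1".format(label)
                f.write(line + "\n")
```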

Implementation

Execution environment

hardware

・CPU Intel(R) Core(TM) i7-6700K 4.00GHz

software

・Windows 10 Pro 1909
・Python 3.6.6
・MeCab 0.996

Program to run

The implemented program is published on GitHub.

doc2vec_corpus.py


Commentary

Here I pick out and explain some parts of the program.

Doc2Vec input and output

The contents of the CTFDeserializer used for this training are as follows.

0 |word 346:1	|label 0:1
0 |word 535:1
0 |word 6880:1
...
1 |word 209:1	|label 0:1
1 |word 21218:1
1 |word 6301:1
...

The number on the far left is the sequence ID shared by all lines of one document: each word in the document appears as a |word entry, and a single |label entry on the first line assigns the category label to that document.

result

When you run the program, the word dictionary is created and saved as follows.

Number of total words: 73794
Number of words: 45044

Saved word2id.
Saved id2word.

Now 1000 samples...
Now 2000 samples...
...
Now 7000 samples...

Number of training samples 7277
Number of validation samples 90

Now that we are ready to train, Part 2 will use CNTK to train Doc2Vec for document classification.

reference

livedoor news corpus

Computer Vision: Image Caption Part1 - STAIR Captions
Natural Language: Word2Vec Part1 - Japanese Corpus
