Natural Language: Doc2Vec Part1 - livedoor NEWS Corpus

Target

This series covers document classification using the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we will prepare for document classification using CNTK.

The steps are introduced in the following order.

  1. Download livedoor news corpus
  2. Preprocessing of text data, creation of word dictionary
  3. Create a file to be read by the built-in reader provided by CNTK

Introduction

Download livedoor news corpus

livedoor news corpus

・Dokujo Tsushin ・IT Life Hack ・Kaden Channel ・livedoor HOMME ・MOVIE ENTER ・Peachy ・S-MAX ・Sports Watch ・Topic News

This is a corpus consisting of articles from the 9 categories above. Each article file is distributed under a Creative Commons Attribution-NoDerivatives license.

livedoor news corpus

Access the above page and download / unzip ldcc-20140209.tar.gz.
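Downloading and unpacking can also be scripted. The sketch below uses only the standard library; the download URL is an assumption based on the corpus distribution page, so verify it before relying on it.

```python
import os
import tarfile
import urllib.request

# URL of the archive on the livedoor news corpus page (an assumption; check the page).
CORPUS_URL = "https://www.rondhuit.com/download/ldcc-20140209.tar.gz"


def download_corpus(url: str, filename: str) -> None:
    """Download the archive only if it is not already present."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)


def extract_corpus(tar_path: str, dest_dir: str = ".") -> None:
    """Unpack the tar.gz; it creates a text/ directory with one subdirectory per category."""
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)


# Typical usage (not executed here to avoid an unconditional download):
#   download_corpus(CORPUS_URL, "ldcc-20140209.tar.gz")
#   extract_corpus("ldcc-20140209.tar.gz")
```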

The directory structure this time is as follows.

Doc2Vec
  |―text
  |   |―...
  |―doc2vec_corpus.py
Word2Vec

Preprocessing of text data, creation of word dictionary

The text data preprocessing reuses the functions implemented in Natural Language: Word2Vec Part1 --Japanese Corpus.

For word splitting, MeCab with the NEologd dictionary is used for segmentation, followed by stopword removal.
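A minimal sketch of the stopword-removal step. The tokens are assumed to come from MeCab with the NEologd dictionary (e.g. via `MeCab.Tagger("-Owakati")`), which is not shown here, and the stopword set below is a toy example, not the list used in doc2vec_corpus.py.

```python
# Illustrative stopword set only; the real list is larger.
STOPWORDS = {"の", "に", "は", "を", "が", "と", "て", "です", "ます"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set, preserving order."""
    return [t for t in tokens if t not in stopwords]


# Example with tokens as a word splitter would produce them:
tokens = ["今日", "の", "ニュース", "は", "面白い"]
print(remove_stopwords(tokens))  # ['今日', 'ニュース', '面白い']
```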

Also, for model performance evaluation, 10 documents are separated from each category as verification data.
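Holding out 10 documents per category might look like the following sketch. The function name and the choice of taking the first 10 files are assumptions; the actual selection in doc2vec_corpus.py may differ.

```python
import os


def split_per_category(text_dir, n_val=10):
    """For each category directory under text_dir, hold out the first n_val
    article files as validation data and keep the rest for training.
    Returns two lists of (category, filename) pairs."""
    train, val = [], []
    for category in sorted(os.listdir(text_dir)):
        cat_dir = os.path.join(text_dir, category)
        if not os.path.isdir(cat_dir):
            continue  # skip license/readme files next to the category dirs
        files = sorted(f for f in os.listdir(cat_dir) if f.endswith(".txt"))
        val.extend((category, f) for f in files[:n_val])
        train.extend((category, f) for f in files[n_val:])
    return train, val
```

With 9 categories this yields 90 validation documents, matching the counts reported in the results below.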

This time, following Computer Vision: Image Caption Part1 - STAIR Captions, words that appeared only once were replaced with UNK.
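Building the word dictionary with rare words mapped to UNK can be sketched as follows. The function names are illustrative; only the `word2id`/`id2word` names follow the files the program saves.

```python
from collections import Counter


def build_vocab(documents, min_count=2, unk="UNK"):
    """Count words over all documents and assign IDs only to words that
    appear at least min_count times; everything else maps to UNK (ID 0)."""
    counts = Counter(w for doc in documents for w in doc)
    frequent = [w for w, c in counts.items() if c >= min_count]
    word2id = {unk: 0}
    for w in sorted(frequent):
        word2id[w] = len(word2id)
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def encode(doc, word2id, unk="UNK"):
    """Map a tokenized document to word IDs, falling back to UNK."""
    return [word2id.get(w, word2id[unk]) for w in doc]
```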

Creating a file to be read by the built-in reader provided by CNTK

During training, we will use CTFDeserializer, one of CNTK's built-in readers. This time, one category label is assigned to each document, which consists of many words.

The general processing flow of the program preparing for Doc2Vec is as follows.

  1. Preparation of training data and verification data
  2. Preprocessing of text data and creation of word dictionary
  3. Writing documents and category labels
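Step 3 writes the text file that CTFDeserializer reads. The sketch below produces the layout shown in the Commentary section: one sparse `|word` entry per line, with the `|label` entry attached to the first line of each sequence. The function name is an assumption.

```python
def write_ctf(path, documents, labels, word2id):
    """Write encoded documents in CNTK Text Format: the leftmost number is
    the sequence ID shared by all lines of one document, each line carries
    one |word entry, and the first line also carries the |label entry."""
    with open(path, "w", encoding="utf-8") as f:
        for seq_id, (doc, label) in enumerate(zip(documents, labels)):
            for i, word in enumerate(doc):
                line = "{} |word {}:1".format(seq_id, word2id[word])
                if i == 0:
                    line += "\t|label {}:1".format(label)
                f.write(line + "\n")
```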

Implementation

Execution environment

hardware

・CPU Intel(R) Core(TM) i7-6700K 4.00GHz

software

・Windows 10 Pro 1909
・Python 3.6.6
・MeCab 0.996

Program to run

The implemented program is published on GitHub.

doc2vec_corpus.py


Commentary

Here I pick out and explain some parts of the program.

Doc2Vec input and output

The contents of the CTFDeserializer used for this training are as follows.

0 |word 346:1	|label 0:1
0 |word 535:1
0 |word 6880:1
...
1 |word 209:1	|label 0:1
1 |word 21218:1
1 |word 6301:1
...

The number on the far left is the sequence ID shared by all lines of one document: each word in the document appears as a |word entry, and a single |label entry on the first line assigns the category label to that document.

result

When you run the program, the word dictionary is created and saved as follows.

Number of total words: 73794
Number of words: 45044

Saved word2id.
Saved id2word.

Now 1000 samples...
Now 2000 samples...
...
Now 7000 samples...

Number of training samples 7277
Number of validation samples 90

Now that we are ready to train, Part 2 will use CNTK to train Doc2Vec for document classification.

reference

livedoor news corpus

Computer Vision: Image Caption Part1 - STAIR Captions
Natural Language: Word2Vec Part1 - Japanese Corpus
