[PYTHON] Categorize news articles with deep learning

Introduction

Starting from Bag-of-Words (BoW) representations of news articles, I tried predicting their categories with Stacked Denoising Autoencoders.

Dataset

The dataset is the livedoor news corpus. According to its description:

This corpus is a collection of news articles from "livedoor news", operated by NHN Japan Corporation, to which a Creative Commons license applies, with HTML tags removed as far as possible.

--Topic News

Since there are a total of 9 categories, it is a 9-class classification problem.

Data preprocessing

First, each article has to be converted into a BoW vector. For this I used corpus.py from yasunori's Random-Forest-Example repository.

python corpus.py

Running this creates a dictionary (livedoordic.txt). It only needs to be run once. Next, create the data to be given as input. I rewrote only the get_class_id method of corpus.py as shown below, but I did it a long time ago and can no longer remember why.

corpus.py


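def get_class_id(file_name):
    '''
    Determine the class ID from the file name.
    Used when creating the training data.
    '''
    dir_list = get_dir_list()
    # Keep the category directories whose name appears in the file path.
    # A list (not a lazy filter object) is used so that [0] also works on Python 3.
    dir_name = [x for x in dir_list if x in file_name]
    return dir_list.index(dir_name[0])

import corpus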
import numpy as np

dictionary = corpus.get_dictionary(create_flg=False)
contents = corpus.get_contents()

data = []
target = []
for file_name, content in contents.items():
    data.append(corpus.get_vector(dictionary, content))
    target.append(corpus.get_class_id(file_name))
data = np.array(data, np.float32)      # input data (one BoW vector per article)
target = np.array(target, np.int32)    # correct labels (class IDs)
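
corpus.py is not reproduced in this article, so as a rough illustration of what a BoW vector looks like, here is a minimal sketch. It assumes each article has already been split into tokens (for Japanese this would need a morphological analyzer, which is not shown) and that the vector holds raw word counts; whether corpus.get_vector uses counts or something else is not stated here. The names build_vocabulary and to_bow are hypothetical.

```python
import numpy as np

def build_vocabulary(tokenized_docs):
    """Map every distinct token to a column index, in first-seen order."""
    vocab = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def to_bow(tokens, vocab):
    """Count how often each vocabulary word occurs in one document."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for token in tokens:
        idx = vocab.get(token)
        if idx is not None:  # ignore out-of-vocabulary tokens
            vec[idx] += 1.0
    return vec

# Two toy "articles", already tokenized (assumption).
docs = [["soccer", "match", "soccer"], ["movie", "review", "match"]]
vocab = build_vocabulary(docs)
bow = np.array([to_bow(d, vocab) for d in docs], dtype=np.float32)
print(bow)
```

Each row has one column per dictionary entry, which is why the input layer of the network has to be as wide as the dictionary.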

Stacked Denoising Autoencoders

This time I train Stacked Denoising Autoencoders (SDA), a deeper model built by stacking Denoising Autoencoders. The implementation uses Chainer. For an explanation of autoencoders, I personally found kenmatsu4's article "[Deep Learning] Try Autoencoder with Chainer and visualize the result" (http://qiita.com/kenmatsu4/items/99d4a54d5a57405ecaf8) very easy to understand.

The SDA stacks three autoencoders, uses Dropout and masking noise to prevent overfitting, and uses ReLU as the activation function. The ratio of training data to evaluation data is 9:1. The SDA code is in SDA.py in my deep-learning-chainer repository on GitHub. The code I ran is shown below.

import numpy as np
from SDA import SDA
from chainer import cuda

cuda.init(0)

rng = np.random.RandomState(1)
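sda = SDA(rng=rng,
          data=data,
          target=target,
          n_inputs=6974,            # dimensionality of the BoW vectors
          n_hidden=[500,500,500],   # three stacked autoencoders, 500 units each
          n_outputs=9,              # number of categories
          gpu=0)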
sda.pre_train(n_epoch=10)
sda.fine_tune(n_epoch=30)
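
SDA.py itself is not shown in this article. To give a rough idea of what a single denoising autoencoder with masking noise computes, here is a minimal NumPy sketch. It is not the Chainer implementation: tied weights, the mean squared error loss, and the corruption rate of 0.3 are all assumptions made for the illustration, and the function names are hypothetical.

```python
import numpy as np

def masking_noise(x, corruption_rate, rng):
    """Set a random fraction of the input entries to zero."""
    mask = rng.binomial(1, 1.0 - corruption_rate, size=x.shape)
    return x * mask

def relu(z):
    return np.maximum(z, 0.0)

def da_reconstruction_loss(x, W, b_hidden, b_visible, corruption_rate, rng):
    """Mean squared error between the clean input and its reconstruction
    from a corrupted copy (tied weights: the decoder uses W transposed)."""
    x_noisy = masking_noise(x, corruption_rate, rng)
    hidden = relu(x_noisy.dot(W) + b_hidden)
    reconstruction = hidden.dot(W.T) + b_visible
    return np.mean((reconstruction - x) ** 2)

rng = np.random.RandomState(1)
x = rng.rand(4, 6).astype(np.float32)           # 4 toy samples, 6 features
W = (rng.randn(6, 3) * 0.1).astype(np.float32)  # 6 inputs -> 3 hidden units
loss = da_reconstruction_loss(x, W, np.zeros(3), np.zeros(6), 0.3, rng)
print(loss)
```

The key point is that the loss compares the reconstruction with the clean input, not the corrupted one. When stacking, the hidden activations of one trained autoencoder become the input of the next, and fine-tuning then trains the whole stack with the class labels.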

Execution result

SDA

C:\Python27\lib\site-packages\skcuda\cublas.py:273: UserWarning: creating CUBLAS
 context to get version number
  warnings.warn('creating CUBLAS context to get version number')
--------First DA training has started!--------
epoch 1
train mean loss=0.106402929114
test mean loss=0.088471424426
epoch 2
train mean loss=0.0816160233447
test mean loss=0.0739360584434
--
(omitted)
--
epoch 9
train mean loss=0.0519113916775
test mean loss=0.0670968969548
epoch 10
train mean loss=0.0511762971061
test mean loss=0.0661109716832
--------Second DA training has started!--------
epoch 1
train mean loss=1.28116437635
test mean loss=0.924632857176
epoch 2
train mean loss=0.908878781048
test mean loss=0.763214301707
--
(omitted)
--
epoch 9
train mean loss=0.500251602623
test mean loss=0.55466137691
epoch 10
train mean loss=0.485327716237
test mean loss=0.517578341663
--------Third DA training has started!--------
epoch 1
train mean loss=1.0635086948
test mean loss=0.778134044507
epoch 2
train mean loss=0.656580147385
test mean loss=0.612065581324
--
(omitted)
--
epoch 9
train mean loss=0.433458953354
test mean loss=0.486904190264
epoch 10
train mean loss=0.400864538789
test mean loss=0.46137621372
fine tuning epoch  1
fine tuning train mean loss=1.33540507985, accuracy=0.614027133827
fine tuning test mean loss=0.363009182577, accuracy=0.902306635635
fine tuning epoch  2
fine tuning train mean loss=0.451324046692, accuracy=0.869683239884
fine tuning test mean loss=0.235001576683, accuracy=0.945725910052
fine tuning epoch  3
fine tuning train mean loss=0.233203321021, accuracy=0.937104056863
fine tuning test mean loss=0.172718693961, accuracy=0.952510164098
fine tuning epoch  4
fine tuning train mean loss=0.156541177815, accuracy=0.957164381244
fine tuning test mean loss=0.167446922435, accuracy=0.962008120247
--
(omitted)
--
fine tuning epoch  27
fine tuning train mean loss=0.0105007310127, accuracy=0.997586716714
fine tuning test mean loss=0.217954038866, accuracy=0.960651269438
fine tuning epoch  28
fine tuning train mean loss=0.00783754364192, accuracy=0.998340867404
fine tuning test mean loss=0.206009919964, accuracy=0.957937559732
fine tuning epoch  29
fine tuning train mean loss=0.00473990425367, accuracy=0.998491696822
fine tuning test mean loss=0.245603679721, accuracy=0.95793756782
fine tuning epoch  30
fine tuning train mean loss=0.00755465408512, accuracy=0.998190036187
fine tuning test mean loss=0.228568312999, accuracy=0.962008120247

The graph of how the classification accuracy changed over the epochs is as follows.

Multilayer perceptron

I wanted to find out what would happen without pre-training, so I ran the same experiment with a multilayer perceptron of the same structure. Like the SDA, it uses Dropout to prevent overfitting and ReLU as the activation function.
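
For reference, the forward pass of such a network can be sketched in NumPy as below. This is only an illustration, not the code used in the experiment: the dropout ratio of 0.5, the softmax output, the weight scale, and the toy layer sizes are assumptions, and the function names are hypothetical. "Without pre-training" here simply means the weights start from random values instead of from the trained autoencoders.

```python
import numpy as np

def dropout(h, ratio, rng, train):
    """Inverted dropout: zero units at random during training and rescale
    the survivors so the expected activation is unchanged."""
    if not train:
        return h
    mask = rng.binomial(1, 1.0 - ratio, size=h.shape)
    return h * mask / (1.0 - ratio)

def mlp_forward(x, weights, biases, rng, train=True, ratio=0.5):
    """Hidden layers use ReLU followed by dropout; the last layer is linear
    and its output is turned into class probabilities with softmax."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = dropout(np.maximum(h.dot(W) + b, 0.0), ratio, rng, train)
    logits = h.dot(weights[-1]) + biases[-1]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.RandomState(1)
sizes = [20, 8, 8, 8, 9]  # toy sizes; the article uses 6974-500-500-500-9
weights = [rng.randn(m, n) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
probs = mlp_forward(rng.rand(5, 20), weights, biases, rng, train=False)
print(probs.shape, probs.sum(axis=1))
```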

Conclusion

The final classification accuracy of the SDA was about 95%, which is a very good result. The multilayer perceptron reached about 92%, which suggests that its generalization performance is worse than the SDA's. However, each experiment was run only once, so the reliability of this comparison is doubtful.
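
One way to make such a comparison more trustworthy would be to repeat the 9:1 split with several random seeds and report the mean and standard deviation of the test accuracy. The sketch below only shows the bookkeeping: split_9_to_1, repeated_holdout, and dummy_evaluate are hypothetical names, the sample count of 1000 is arbitrary, and dummy_evaluate is a placeholder that would have to be replaced by actual training and evaluation.

```python
import numpy as np

def split_9_to_1(n_samples, rng):
    """Shuffle the sample indices and cut them into 90% train / 10% test."""
    perm = rng.permutation(n_samples)
    n_train = n_samples * 9 // 10
    return perm[:n_train], perm[n_train:]

def repeated_holdout(n_samples, evaluate, seeds):
    """Call evaluate(train_idx, test_idx) once per seed and summarise."""
    accuracies = []
    for seed in seeds:
        rng = np.random.RandomState(seed)
        train_idx, test_idx = split_9_to_1(n_samples, rng)
        accuracies.append(evaluate(train_idx, test_idx))
    return float(np.mean(accuracies)), float(np.std(accuracies))

def dummy_evaluate(train_idx, test_idx):
    # Placeholder so the sketch runs: a real version would train on
    # train_idx and return the accuracy measured on test_idx.
    return len(test_idx) / float(len(train_idx) + len(test_idx))

mean_acc, std_acc = repeated_holdout(1000, dummy_evaluate, seeds=range(5))
print(mean_acc, std_acc)
```

If the gap between the two models stays clearly larger than the standard deviation across seeds, the difference is less likely to be an accident of one particular split.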

If you notice anything wrong, I would appreciate it if you could point it out.
