[Python] Text filtering with Naive Bayes in sklearn

Let's use the dataset provided by sklearn:

from sklearn.datasets import fetch_20newsgroups

Code to output the categories:

from pprint import pprint
pprint(list(fetch_20newsgroups(subset='train').target_names))

These appear to be article categories, but at first it was unclear what the names meant.

Upon investigation, they turned out to be newsgroup names from net news. For example, a page about the fj newsgroups lists names in the same style:

Newsgroups read: fj.comp.applications.excel, fj.comp.oldies, fj.comp.misc, fj.os.ms-windows.win95, fj.os.msdos, fj.net.providers, fj.net.words, fj.life.hometown.hokkaido, fj.jokes.d, fj.rec.autos, fj.rec.motorcycles, fj.news.group.*, fj.news.policy, fj.news.misc, fj.news.adm, fj.news.net-abuse, fj.questions.fj, fj.questions.internet, fj.questions, misc, fj.sci.chem, fj.engr.misc
http://www2s.biglobe.ne.jp/~kyashiki/fj/arukikata/WonderfulFj.html

Net news is distributed using the Network News Transfer Protocol (NNTP), and fj is one such newsgroup hierarchy.


import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from pprint import pprint

def stopwords():
    # English stop words from NLTK plus punctuation symbols.
    # Requires a one-time nltk.download('stopwords').
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    words = nltk.corpus.stopwords.words('english')
    return words + symbols

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test  = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

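# Show the news categories
pprint(list(newsgroups_train.target_names))
# Show the article data (first article)
# print(newsgroups_train.data[0])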

# Create the CountVectorizer with the custom stop-word list
vectorizer = CountVectorizer(stop_words=stopwords())

# Train
# Learn the vocabulary dictionary and build the document-term matrix X.
# The vectorizer must be fitted before transform() can be used on the test data.
X = vectorizer.fit_transform(newsgroups_train.data)
# print(newsgroups_train.target)
y = newsgroups_train.target
# print(X.shape)

clf = MultinomialNB()
clf.fit(X, y)

# Test
X_test = vectorizer.transform(newsgroups_test.data) 
y_test = newsgroups_test.target

print(clf.score(X_test, y_test))
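The same fit/transform flow can be tried without downloading anything. The following is a minimal sketch on a tiny made-up corpus (the documents, labels, and variable names are assumptions for illustration, not part of the experiment above). It also shows that CountVectorizer has to learn its vocabulary before transform() can be used.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: label 0 = cars, label 1 = baseball.
train_docs = [
    "the engine and the brakes of this car",
    "new tires for my car engine",
    "the pitcher threw the ball",
    "a home run won the ball game",
]
train_labels = [0, 0, 1, 1]
test_docs = ["car engine trouble", "the ball game tonight"]

vectorizer = CountVectorizer(stop_words="english")

# Calling transform() before fitting fails: no vocabulary exists yet.
try:
    vectorizer.transform(test_docs)
    raised_not_fitted = False
except NotFittedError:
    raised_not_fitted = True

# fit_transform() learns the vocabulary and builds the training matrix.
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# The test documents reuse the vocabulary learned from the training data;
# words never seen in training (e.g. "trouble") are simply ignored.
X_test = vectorizer.transform(test_docs)
predicted = clf.predict(X_test)

print(raised_not_fitted)
print(predicted)
```

Using transform() (not fit_transform()) on the test documents keeps the columns of X_test aligned with the columns the classifier was trained on.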


The results seem to be as follows: a correct answer rate (accuracy) of 60% on the data and 80% on the test data.
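For reference, the value that clf.score returns for a classifier is the mean accuracy, that is, the fraction of samples whose predicted label matches the true label. A minimal sketch of that calculation in plain Python, using made-up labels (the function name accuracy and the label lists are assumptions for illustration, not results from the experiment):

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 4 of the 5 predictions match.
y_true = [0, 1, 1, 2, 2]
y_pred = [0, 1, 2, 2, 2]
result = accuracy(y_true, y_pred)
print(result)
```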

I used the following article as a reference. It was very helpful.
http://qiita.com/kotaroito/items/76a505a88390c5593eba
