[PYTHON] [Machine learning] LDA topic classification using scikit-learn

About LDA topic classification

--LDA = Latent Dirichlet Allocation

LDA assumes that each word in a document belongs to a hidden (latent) topic, and that the document is generated from those topics according to a probability distribution. Given the observed words, it infers which topics each document belongs to.

--Paper: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

[Figure: ldapic.png]

--alpha: parameter that determines the topic distribution
--beta: parameter that determines the word distribution within each topic
--theta: parameter of the multinomial distribution (the topic mixture of a document)
--w: word
--z: topic
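
To make the roles of these symbols concrete, here is a minimal sketch of the generative process that LDA assumes, written with NumPy only. The function name `generate_document`, the vocabulary size, and all numeric values are made up for illustration; this is not how scikit-learn fits the model, only the forward sampling story that inference tries to invert.

```python
import numpy as np

def generate_document(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process (illustrative only).

    alpha: Dirichlet parameter over topics, shape (n_topics,)
    beta:  word distribution of each topic, shape (n_topics, vocab_size)
    """
    theta = rng.dirichlet(alpha)                        # topic mixture of this document
    z = rng.choice(len(alpha), size=n_words, p=theta)   # one topic per word position
    # each word is drawn from the word distribution of its topic
    w = np.array([rng.choice(beta.shape[1], p=beta[k]) for k in z])
    return theta, z, w

rng = np.random.default_rng(0)
alpha = np.full(3, 0.5)                    # hypothetical prior for 3 topics
beta = rng.dirichlet(np.ones(8), size=3)   # hypothetical 3 topics over 8 word ids
theta, z, w = generate_document(alpha, beta, n_words=12, rng=rng)
print(theta, z, w)
```

Fitting LDA goes in the opposite direction: only w is observed, and theta and z are estimated from it.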

This time, we will use LDA to see whether documents can be categorized by topic.

Data set

Validation uses the 20 Newsgroups data set

--Approximately 20,000 documents in 20 categories
--The 20 categories are as follows

comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
misc.forsale
soc.religion.christian

--This time, we use the following 4 categories

--'rec.sport.baseball': baseball
--'rec.sport.hockey': hockey
--'comp.sys.mac.hardware': Mac computers
--'comp.windows.x': X Window System

Learning

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import mglearn
import numpy as np

# data: load only the four categories used in this article
categories = ['rec.sport.baseball', 'rec.sport.hockey',
              'comp.sys.mac.hardware', 'comp.windows.x']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
# TF-IDF features: drop English stop words, terms that occur in more than
# 10% of the documents, and terms that occur in fewer than 5 documents
tfidf_vec = TfidfVectorizer(lowercase=True, stop_words='english',
                            max_df=0.1, min_df=5).fit(twenty_train.data)
X_train = tfidf_vec.transform(twenty_train.data)
X_test = tfidf_vec.transform(twenty_test.data)

# get_feature_names() was removed in scikit-learn 1.2
feature_names = tfidf_vec.get_feature_names_out()
#print(feature_names[1000:1050])
#print()

# train: fit LDA with one topic per category
topic_num = 4
lda = LatentDirichletAllocation(n_components=topic_num, max_iter=50,
                                learning_method='batch', random_state=0, n_jobs=-1)
lda.fit(X_train)
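
The max_df and min_df arguments passed to TfidfVectorizer decide which words the topic model ever gets to see, so it may help to check their effect on a tiny example. The four-sentence corpus and the thresholds below are made up for illustration and are much smaller than the 20 Newsgroups settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" is in every document, "dog" and "bird" in only one
docs = ["the cat sat", "the dog sat", "the bird flew", "the cat flew"]

# max_df=0.75 drops terms found in more than 75% of the documents,
# min_df=2 drops terms found in fewer than 2 documents
vec = TfidfVectorizer(max_df=0.75, min_df=2).fit(docs)
kept = sorted(vec.vocabulary_)
print(kept)  # ['cat', 'flew', 'sat']
```

Terms that are too common or too rare never reach LDA, so they cannot appear in the top words of any topic.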

Check the learned topics below

sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
mglearn.tools.print_topics(topics=range(topic_num),
                           feature_names=np.array(feature_names),
                           topics_per_chunk=topic_num,
                           sorting=sorting,n_words=10)
topic 0       topic 1       topic 2       topic 3       
--------      --------      --------      --------      
nhl           window        mac           wpi           
toronto       mit           apple         nada          
teams         motif         drive         kth           
league        uk            monitor       hcf           
player        server        quadra        jhunix        
roger         windows       se            jhu           
pittsburgh    program       scsi          unm           
cmu           widget        card          admiral       
runs          ac            simms         liu           
fan           file          centris       carina 

--topic1: comp.windows.x (X Window System)
--topic2: Mac computers
--topic0: baseball and hockey are mixed together and were not separated as expected
--topic3: computer-related? Not classified as expected

topic1 and topic2 appear to have been separated cleanly at the learning stage.
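
If mglearn is not available, the same kind of ranking can be obtained with NumPy alone by sorting each row of lda.components_. The following is a minimal sketch: `top_words` is a hypothetical helper name, and the small weight matrix stands in for lda.components_ so that the example runs without downloading the data.

```python
import numpy as np

def top_words(components, feature_names, n_words=10):
    """Return the n_words highest-weighted words of each topic."""
    # argsort is ascending, so reverse each row to put the largest weight first
    order = np.argsort(components, axis=1)[:, ::-1][:, :n_words]
    return [[str(feature_names[j]) for j in row] for row in order]

# Made-up topic-word weights: 2 topics over 3 words
components = np.array([[0.1, 2.0, 0.5],
                       [3.0, 0.2, 1.0]])
names = np.array(["apple", "hockey", "mac"])
result = top_words(components, names, n_words=2)
print(result)  # [['hockey', 'mac'], ['apple', 'mac']]
```

With the fitted model, the call would be top_words(lda.components_, feature_names).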

Inference

For the inference data, I borrowed text from the English Wikipedia article on Apple. Two passages from the article are assigned to text11 and text12.

text11="an American multinational technology company headquartered in Cupertino, "+ \
        "California, that designs, develops, and sells consumer electronics,"+ \
        "computer software, and online services."
text12="The company's hardware products include the iPhone smartphone,"+ \
        "the iPad tablet computer, the Mac personal computer,"+ \
        "the iPod portable media player, the Apple Watch smartwatch,"+ \
        "the Apple TV digital media player, and the HomePod smart speaker."

Perform inference below

# predict
test1 = [text11, text12]
X_test1 = tfidf_vec.transform(test1)
doc_topics = lda.transform(X_test1)  # one row of topic probabilities per text
# the loop variables must not be named `lda`, or the fitted model is overwritten
for i, (text, probs) in enumerate(zip(test1, doc_topics)):
    print("### ", i)
    # indices of the topic(s) with the highest probability
    topicid = [k for k, p in enumerate(probs) if p == max(probs)]
    print(text)  # the text being classified
    print(probs, " >>> topic", topicid)
    print("")

Result

###  0
an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics,computer software, and online services.
[0.06391161 0.06149079 0.81545564 0.05914196]  >>> topic [2]

###  1
The company's hardware products include the iPhone smartphone,the iPad tablet computer, the Mac personal computer,the iPod portable media player, the Apple Watch smartwatch,the Apple TV digital media player, and the HomePod smart speaker.
[0.34345051 0.05899806 0.54454404 0.05300738]  >>> topic [2]

Both passages about the Mac (Apple) were inferred to belong most probably to topic2 (Mac computers), with probabilities of about 0.82 and 0.54, so they can be said to have been classified correctly.
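
The two results differ in how decisive they are: the first passage puts most of its probability on topic2, while the second also gives noticeable weight to topic0. One possible way to make that visible is to report the dominant topic together with a confidence flag. In this sketch, `dominant_topics` and the `min_prob` threshold are assumptions of mine, not part of scikit-learn, and the input simply reuses the probabilities printed above.

```python
import numpy as np

def dominant_topics(doc_topic, min_prob=0.5):
    """Return the most probable topic per document and whether it reaches min_prob."""
    doc_topic = np.asarray(doc_topic)
    best = doc_topic.argmax(axis=1)                # index of the largest probability
    confident = doc_topic.max(axis=1) >= min_prob  # hypothetical confidence rule
    return best, confident

# Topic probabilities copied from the result above
doc_topic = np.array([[0.06391161, 0.06149079, 0.81545564, 0.05914196],
                      [0.34345051, 0.05899806, 0.54454404, 0.05300738]])
best, confident = dominant_topics(doc_topic, min_prob=0.6)
print(best.tolist(), confident.tolist())  # [2, 2] [True, False]
```

With a threshold of 0.6, both passages are still assigned to topic2, but only the first would count as a confident assignment.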
