[PYTHON] I tried machine learning with liblinear

liblinear is a machine learning library for linear classification that scales to data with millions of instances and features. https://www.csie.ntu.edu.tw/~cjlin/liblinear/ What sets it apart from other libraries is that it can process a huge number of features quickly.

Anyway, unpack it on your desktop and try it out. First, set up the imports. Instead of opening a terminal (or cmd) and running the ipython command right away, move to the working directory (folder) first and then launch ipython.

import sys
# Add the python/ directory of the unpacked liblinear archive to the module path
sys.path.append('/Users/xxxx/Desktop/liblinear-2.1/python')
from liblinearutil import *
from liblinear import *

The data used is news20 from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html . liblinear uses a special (LIBSVM sparse) data format, so when you actually use the library you need to preprocess your data into this shape.

y, x = svm_read_problem('news20.txt')
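The file read above stores each example as `label index:value index:value ...`, with 1-based, ascending feature indices and zero-valued features omitted. As a rough illustration of what `svm_read_problem` returns (this is a simplified toy parser on made-up lines, not the library's actual implementation):

```python
# Minimal parser for the LIBSVM sparse format (illustration only;
# liblinear's svm_read_problem does this for you).
def read_libsvm_line(line):
    parts = line.split()
    label = float(parts[0])
    # Features are "index:value" pairs with 1-based indices.
    features = {int(i): float(v) for i, v in (p.split(':') for p in parts[1:])}
    return label, features

# Two toy examples in the same shape as lines of news20.txt:
sample = [
    "1 3:0.5 7:1.2",
    "2 1:0.1 3:0.9",
]
y = []
x = []
for line in sample:
    label, feats = read_libsvm_line(line)
    y.append(label)
    x.append(feats)

print(y)     # [1.0, 2.0]
print(x[0])  # {3: 0.5, 7: 1.2}
```

So `y` is a list of labels and `x` is a list of sparse feature dicts, which is exactly the pair of values `svm_read_problem` hands to `train`.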

train → predict

In [20]: len(y)
Out[20]: 15935
# There are 15935 examples, so use the first 5000 as training data.
m = train(y[:5000], x[:5000])
save_model('news20.model', m)

optimization finished, #iter = 1000

WARNING: reaching max number of iterations
Using -s 2 may be faster (also see FAQ)

Objective value = -38.201637
nSV = 1028
.*
optimization finished, #iter = 17
Objective value = -18.665411
nSV = 903

I don't fully understand the output, but it looks like training succeeded. Save the model and take a look inside; it is written to the working folder as news20.model.

weight = open('news20.model').readlines()
weight[:10]

['solver_type L2R_L2LOSS_SVC_DUAL\n', 'nr_class 20\n', 'label 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20\n', 'nr_feature 62017\n', 'bias -1\n', 'w\n', '-0.339495505987624 -0.2729835882642053 -0.1726449590446147 -0.2479101793530862 -0.4274669775000416 -0.2412066297893888 -0.2293917779069297 -0.1540898055174211 -0.215426735582579 -0.2955027766952972 -0.07665514316560521 -0.2067955978156952 -0.2129682323900661 -0.3178416173675406 -0.1100450398128613 -0.1089058297966 0.2118441015471185 -0.1789025390838444 -0.2308991526979358 -0.3216302447541755 \n', '0.03464116990799743 0.03296276686709169 -0.005516289618528965 0 0 8.487270131488089e-19 -0.03693284638681263 0 0 0 -0.0005436471560843025 0 4.336808689942018e-19 0 0 0 -1.355252715606881e-20 0.005881877772996123 0.0004078249397363432 -0.005592803559260878 \n', '0 0 0 0 -0.006337527074141217 0 -0.01043809306013021 -0.02848401075118318 -0.02192217208113558 0 -0.002743696876587976 -0.002823046244597745 5.421010862427522e-19 0 -0.01184141317622985 -0.00327656833111874 -0.00300798970221013 0.07620931881353635 0.07709902339068471 -0.007496992406231962 \n', '0 0.000336438903090087 -0.002105522336459381 -0.003408253600602967 0.04532864192038737 0.00358490636419236 -0.01288493688454648 -0.03829009043077678 -0.02192217208113558 0 -0.002743696876587976 -0.006148372938504376 0.04416917489366715 0 -0.03749035441444219 0.00486249738297638 -0.003188508027714593 0.1323725656877747 0.09645265180639011 -0.01123137774909418 \n']

There are 20 labels, each with its own weight per feature. The vocabulary.txt from http://qwone.com/~jason/20Newsgroups/ serves as the feature index, so now you can see which words are effective for the classification.
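Concretely, each line in the `w` section of the model file is one feature (word), and each of its 20 columns is that word's weight for one label. A sketch of looking up the most influential words for a class — the vocabulary and weight matrix below are made-up toy stand-ins for vocabulary.txt and the real 62017-row model:

```python
# Toy stand-ins: 5 "words" and a 5-feature x 3-class weight matrix
# (the real model has 62017 features and 20 classes).
vocab = ["atheism", "graphics", "windows", "hockey", "space"]
w = [
    [0.30, -0.10, 0.00],
    [-0.20, 0.40, 0.10],
    [0.05, 0.25, -0.30],
    [-0.15, 0.00, 0.50],
    [0.10, -0.05, 0.20],
]

def top_words_for_class(w, vocab, cls, k=3):
    # Pair each word with its weight in column `cls`, largest weight first.
    scored = sorted(zip(vocab, (row[cls] for row in w)),
                    key=lambda t: t[1], reverse=True)
    return scored[:k]

print(top_words_for_class(w, vocab, cls=2))
# [('hockey', 0.5), ('space', 0.2), ('graphics', 0.1)]
```

With the real files you would read `vocab` from vocabulary.txt and parse `w` out of news20.model the same way.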

predict

p_label, p_acc, p_val = predict(y[5000:], x[5000:], m)

Accuracy = 74.3576% (8131/10935) (classification)

The accuracy is about 74%, which seems fair.
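The accuracy that `predict` reports is nothing more than the fraction of predicted labels that equal the true labels. Computing it by hand on toy label vectors (made up here for illustration):

```python
# Accuracy = matching labels / total labels, the same number predict() prints.
y_true  = [1, 2, 3, 4, 5, 7, 4, 8, 9, 4]
p_label = [1, 2, 2, 4, 5, 7, 4, 8, 9, 4]

correct = sum(t == p for t, p in zip(y_true, p_label))
accuracy = 100.0 * correct / len(y_true)
print(f"Accuracy = {accuracy}% ({correct}/{len(y_true)})")
# Accuracy = 90.0% (9/10)
```

For the real run above that works out to 8131/10935 = 74.3576%.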

Let's look at the prediction results. First, put the correct answers y, the predicted labels p_label, and the per-label scores p_val into a data frame to make them easier to inspect.

import pandas as pd
# Rows: true labels, predicted labels, per-label decision values
a = pd.DataFrame([y[5000:], p_label, p_val])
a[0]
a[0][2]

0                                                    1
1                                                    1
2    [-0.434941406833, -2.4992939688, -1.9156773889...
Name: 0, dtype: object

[-0.43494140683299093, -2.499293968803961, -1.9156773889387406, -1.652996684855934, -0.64663025115734, -1.981531321375946, -2.0506304515990794, -1.9845217707935987, -1.816531448715213, -1.9993917151454117, -2.6192052686130403, -2.375782174561902, -2.1841316767499994, -2.787946449405093, -1.981463462884227, -2.4769599630955956, -1.3508140247538216, -1.7235783924583472, -1.7785165908522975, -2.2096245620379604]

Label 1 has the highest (least negative) decision value, which matches the correct answer.
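In other words, the predicted label is the arg-max over the 20 decision values in `p_val`. Checking that against the row printed above (assuming the label order matches the `label 1 2 ... 20` line of the model file):

```python
# The predicted label is the arg-max over the 20 decision values.
# These are the values from a[0][2] above.
labels = list(range(1, 21))
p_val_row = [-0.43494140683299093, -2.499293968803961, -1.9156773889387406,
             -1.652996684855934, -0.64663025115734, -1.981531321375946,
             -2.0506304515990794, -1.9845217707935987, -1.816531448715213,
             -1.9993917151454117, -2.6192052686130403, -2.375782174561902,
             -2.1841316767499994, -2.787946449405093, -1.981463462884227,
             -2.4769599630955956, -1.3508140247538216, -1.7235783924583472,
             -1.7785165908522975, -2.2096245620379604]

best = max(range(len(p_val_row)), key=lambda i: p_val_row[i])
print(labels[best])  # 1, matching the true label
```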

Let's see how much it matches

b=pd.DataFrame([y[5000:],p_label])
b

   0   1   2   3   4   5   6   7   8   9  ...  10925  10926  10927  10928  10929  10930  10931  10932  10933  10934
0  1   2   3   4   5   6   7   8   9  10  ...     18     19      7      9     15     16     17     18     19     17
1  1   2   2   4   5   7   4   8   9   4  ...     18     18      7      9     15     16     17     15     19     17

It seems to classify reasonably well.

By the way, I still don't understand the parameters well ... and the remaining question is how to prepare data in this format ...

I found csv2libsvm.py in https://github.com/zygmuntz/phraug ( https://github.com/zygmuntz/phraug/blob/master/csv2libsvm.py ), which converts from CSV.
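The core of that conversion can be sketched in a few lines: take the label from the first CSV column, and turn the remaining columns into 1-based `index:value` pairs, skipping zeros to keep the format sparse. This is a simplified hand-rolled illustration of the idea, not the actual csv2libsvm.py script:

```python
import csv
import io

def csv_row_to_libsvm(row):
    # First column is the label; remaining columns become 1-based
    # "index:value" pairs. Zero values are omitted (sparse format).
    label, *values = row
    pairs = [f"{i}:{v}" for i, v in enumerate(values, start=1)
             if float(v) != 0.0]
    return " ".join([label] + pairs)

csv_text = "1,0.5,0,1.2\n2,0,0.9,0\n"
lines = [csv_row_to_libsvm(r) for r in csv.reader(io.StringIO(csv_text))]
print(lines)  # ['1 1:0.5 3:1.2', '2 2:0.9']
```

Writing those lines to a file gives something `svm_read_problem` can load directly.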

If you know a good way to do this, please let me know.
