What is machine learning, anyway? Writing down what comes to mind, it looks like this:

- Classification
- Regression
- Discrimination

There may be others, but that's all I can think of. Tutorials in this area tend to use the iris data when done in R, and I wondered whether the same data is available in Python. → It is.
First, try R. Building a classifier in R with SVM, neural networks, naive Bayes, random forests, and so on requires almost no thought. For each classification method, a classifier is trained on the training set and then evaluated on the test set to see whether it classifies correctly.
In R, running data(iris) makes the iris flower dataset available. It is a data.frame of 150 samples, with sepal length, sepal width, petal length, and petal width as explanatory variables and the iris species (setosa, versicolor, virginica) as the objective variable. First, use sample() to randomly assign half of the samples to the training set and the other half to the test set.
## Load the iris data
data(iris)
## Randomly choose half of the rows for the training set
train_ids <- sample(nrow(iris), nrow(iris) * 0.5)
## Create the training set
iris.train <- iris[train_ids, ]
## Use the remaining half as the test set
iris.test <- iris[-train_ids, ]
By the way, the flowers of the three iris species, setosa, versicolor, and virginica, look like the following. They all look the same! Telling them apart by the shape of their petals feels like a punishment game.
### Run SVM
library(kernlab)
iris.svm <- ksvm(Species ~ ., data = iris.train)
svm.predict <- predict(iris.svm, iris.test)
### Display the results
table(svm.predict, iris.test$Species)
### Run a neural network
library(nnet)
iris.nnet <- nnet(Species ~ ., data = iris.train, size = 3)
nnet.predict <- predict(iris.nnet, iris.test, type = "class")
### Display the results
table(nnet.predict, iris.test$Species)
### Run naive Bayes
library(e1071)
iris.nb <- naiveBayes(Species ~ ., iris.train)
nbayes.predict <- predict(iris.nb, iris.test)
### Display the results
table(nbayes.predict, iris.test$Species)
### Run random forest
library(randomForest)
iris.rf <- randomForest(Species ~ ., iris.train)
rf.predict <- predict(iris.rf, iris.test)
### Display the results
table(rf.predict, iris.test$Species)
When you run these, every method classifies with an accuracy of roughly 73 out of 75. Not a perfect score, but tuning the parameters might (or might not) get you there.
Now, to try the same thing in Python, use scikit-learn. More details can be found in the scikit-learn tutorial.
Why redo in unfamiliar Python what could easily be done in R, as above? I can only say that I wanted to try it in Python. If there is a mountain there, I climb it; if there is a puddle, I fall into it; if a table is set, I eat from it.
If Anaconda is installed, scikit-learn comes with it, so just import it. Note that the package is imported under the name sklearn. The iris data can also be loaded from this library via datasets.
from sklearn import svm, datasets
iris = datasets.load_iris()
As for the contents, you can see them with print(iris.data) or print(iris.target): iris.data holds the explanatory variables and iris.target the objective variable. Divide this into a training dataset and a test dataset. In R I used the sample() function; in scikit-learn there is a function called train_test_split() in sklearn.cross_validation (moved to sklearn.model_selection in newer versions). As in R, split half into a training set (iris_data_train, iris_target_train) and half into a test set (iris_data_test, iris_target_test).
from sklearn import svm, datasets
# note: in scikit-learn >= 0.18 this lives in sklearn.model_selection instead
from sklearn.cross_validation import train_test_split
import numpy as np
iris = datasets.load_iris()
# split half into training data and half into test data
iris_data_train, iris_data_test, iris_target_train, iris_target_test = train_test_split(iris.data, iris.target, test_size=0.5)
Build a classifier from this training set. There are various SVMs (linear? non-linear?) and various parameters to set, but here everything is left at the defaults.
svc = svm.SVC()
svc.fit(iris_data_train, iris_target_train)
svc.predict(iris_data_test)
I wrote fit() and predict() on two lines, but since you just initialize the classifier, feed it the training data, and then feed it the test data, it can also be written in one line.
iris_predict = svm.SVC().fit(iris_data_train, iris_target_train).predict(iris_data_test)
Now check how well the result of svc.predict() (iris_predict) matches iris_target_test. In R it was output in table form with table(svm.predict, iris.test$Species), so try the same thing. This table is called a confusion matrix, and accuracy_score gives you the accuracy directly.
from sklearn.metrics import confusion_matrix, accuracy_score
print(confusion_matrix(iris_target_test, iris_predict))
print(accuracy_score(iris_target_test, iris_predict))
According to the official documentation, the first argument of confusion_matrix() is the true values and the second is the values predicted by the classifier, so pass them in that order.
It takes a bit more typing than R, but I got similar results.
According to the documentation above, this can also be drawn as a figure. First, normalize each row of the confusion matrix so that it sums to 1. (I couldn't have written this expression myself.)
cm = confusion_matrix(iris_target_test, iris_predict)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
Draw this normalized confusion matrix as a heatmap using matplotlib. By the way, since svc.predict() runs on a different random split each time and the result varies slightly, the contents of the confusion matrix differ slightly between the figures above and below.
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    '''Display a confusion_matrix as a heatmap

    Keyword arguments:
    cm    -- the confusion_matrix
    title -- figure title
    cmap  -- color map to use
    '''
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
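Calling it on the normalized matrix from the previous step should draw the heatmap; something like this (plt.show() may be needed outside an interactive environment):

# draw the normalized confusion matrix computed above
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
plt.show()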
This time everything was left at the defaults, as in svc = svm.SVC(), but of course you should be able to classify with various kernels. The Examples page of scikit-learn's official website has all sorts of demos and tutorials, so you can probably find hints there when you need them.
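As a minimal sketch, switching kernels is just a constructor argument; 'linear', 'poly', and 'rbf' below are standard svm.SVC kernel options, with everything else left untuned:

# compare a few standard kernels on the same split
for kernel in ['linear', 'poly', 'rbf']:
    clf = svm.SVC(kernel=kernel)
    clf.fit(iris_data_train, iris_target_train)
    print(kernel, accuracy_score(iris_target_test, clf.predict(iris_data_test)))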
As in R, I'm curious about neural networks, naive Bayes, and random forests, but the results are probably about the same (I haven't checked); a rough sketch follows below.
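I haven't verified these either, but since scikit-learn estimators share the fit()/predict() interface, naive Bayes and random forest should look something like this (GaussianNB and RandomForestClassifier are standard scikit-learn classes; a neural network would need MLPClassifier, available from scikit-learn 0.18):

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# naive Bayes: same fit/predict interface as svm.SVC
nb_predict = GaussianNB().fit(iris_data_train, iris_target_train).predict(iris_data_test)
print(confusion_matrix(iris_target_test, nb_predict))

# random forest, default parameters
rf_predict = RandomForestClassifier().fit(iris_data_train, iris_target_train).predict(iris_data_test)
print(confusion_matrix(iris_target_test, rf_predict))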
I feel like I've touched a corner of machine learning, but in the end I never did figure out which iris is which.
Initially the title was "Trying to touch machine learning with Python", but I was told that "the touching part is the most exciting bit, so that title doesn't fit", so I changed it.
Personally I think "building a classifier with SVM in Python" is the exciting part, but apparently that was presumptuous of me. Sorry for talking about machine learning as if it were just SVM. I'll be back.