[PYTHON] How to use shogun

Notes on using SHOGUN, a machine learning library, from Python. Installation is described separately.

1. Label

BinaryLabels holds binary labels, each represented by -1 or 1. It can be created from an array or a CSV file.

from modshogun import BinaryLabels

#Allocate 5 labels (the values appear to be uninitialized memory, not meaningful labels)
label = BinaryLabels(5)

label.get_num_labels() 
→ 5

label.get_values()
→ array([  2.00000000e+000,   2.00000000e+000,   1.38338381e-322,0.00000000e+000,   0.00000000e+000])

from modshogun import CSVFile

#Can be created from a CSV file prepared in advance
label_from_csv = BinaryLabels(CSVFile(file_path))
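
Since binary labels must be -1 or 1, it can help to convert and validate raw class ids before building the label object. The following is a minimal sketch in plain Python; `to_binary_labels`, `check_binary_labels`, and the `positive` parameter are hypothetical names introduced here and are not part of SHOGUN.

```python
def to_binary_labels(class_ids, positive=1):
    """Map class ids to the -1/+1 encoding used for binary labels.

    Hypothetical helper for illustration: `positive` is the id that
    should become +1, every other id becomes -1.
    """
    return [1.0 if c == positive else -1.0 for c in class_ids]


def check_binary_labels(values):
    """Return True if every value is -1 or 1, otherwise raise ValueError."""
    bad = [v for v in values if v not in (-1.0, 1.0)]
    if bad:
        raise ValueError("not binary labels: %r" % (bad,))
    return True


raw = [0, 1, 1, 0, 1]            # e.g. class ids stored as 0/1
encoded = to_binary_labels(raw)  # converted to -1/+1
check_binary_labels(encoded)
print(encoded)
```

The resulting list could then be wrapped in a numpy array and passed to BinaryLabels, as described above.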

2. Features

Features can be created from a numpy matrix or a CSV file. Note that one sample (feature vector) is represented by one column, not one row.

from modshogun import RealFeatures
import numpy as np

#3x3 random matrix
feat_arr = np.random.rand(3, 3)
→ array([[ 0.02818103,  0.72093824,  0.92727711],
       [ 0.66853622,  0.14594737,  0.90522684],
       [ 0.97941639,  0.14188234,  0.80854797]])

#Initialize RealFeatures
features = RealFeatures(feat_arr)

#Display the feature matrix
features.get_feature_matrix()
→ array([[ 0.02818103,  0.72093824,  0.92727711],
       [ 0.66853622,  0.14594737,  0.90522684],
       [ 0.97941639,  0.14188234,  0.80854797]])

#Get features for a particular column
features.get_feature_vector(1)
→array([ 0.72093824,  0.14594737,  0.14188234])

#Number of feature dimensions (number of rows)
features.get_num_features()
→3

#Number of feature vectors, i.e. samples (number of columns)
features.get_num_vectors()
→3

from modshogun import CSVFile

#Of course, this can also be read from a CSV file.
feats_from_csv = RealFeatures(CSVFile(file_path))
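
Because each sample is a column, data stored with one sample per row has to be transposed first. The following is a small sketch in plain Python of that layout change; `samples_to_columns` is a hypothetical helper name, and with numpy the same thing is simply the `.T` attribute of the array.

```python
def samples_to_columns(samples):
    """Turn a list of samples (one row each) into a column-per-sample matrix."""
    if len({len(s) for s in samples}) > 1:
        raise ValueError("all samples must have the same number of features")
    # zip(*rows) transposes: row i of the result collects feature i of every sample
    return [list(row) for row in zip(*samples)]


samples = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 samples, 2 features each
matrix = samples_to_columns(samples)

num_features = len(matrix)     # number of rows
num_vectors = len(matrix[0])   # number of columns

# Sample 1 is column 1 of the matrix
vector_1 = [row[1] for row in matrix]
print(matrix, num_features, num_vectors, vector_1)
```

Here `num_features` is 2 and `num_vectors` is 3, which mirrors what get_num_features() and get_num_vectors() report for a matrix laid out this way.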

3. Kernel

An example with a chi-square kernel.

from modshogun import Chi2Kernel, RealFeatures, CSVFile

#Training data
feats_train = RealFeatures(CSVFile(file_path))

#Test data
feats_test = RealFeatures(CSVFile(file_path))

#Kernel width
width = 1.4

#size_cache settings
size_cache = 10

#Kernel generation
kernel = Chi2Kernel(feats_train, feats_train, width, size_cache)

#Re-initialize the kernel between the training data and the test data
kernel.init(feats_train, feats_test)
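
To make clear what the width parameter does, here is a rough sketch of an exponential chi-square kernel in plain Python. It assumes the form k(x, y) = exp(-(1/width) * sum_i (x_i - y_i)^2 / (x_i + y_i)); I believe SHOGUN's Chi2Kernel follows this form, but treat that as an assumption and check the library documentation. `chi2_kernel` and `kernel_matrix` are hypothetical names.

```python
import math


def chi2_kernel(x, y, width):
    """Exponential chi-square kernel between two non-negative vectors."""
    total = 0.0
    for a, b in zip(x, y):
        denom = a + b
        if denom > 0:  # skip bins that are empty in both vectors
            total += (a - b) ** 2 / float(denom)
    return math.exp(-total / width)


def kernel_matrix(lhs, rhs, width):
    """Kernel value for every pair (one row per lhs vector, one column per rhs vector)."""
    return [[chi2_kernel(x, y, width) for y in rhs] for x in lhs]


train = [[1.0, 0.0], [0.5, 0.5]]
test = [[0.0, 1.0]]
print(kernel_matrix(train, train, 1.4))  # train vs. train, used for training
print(kernel_matrix(train, test, 1.4))   # train vs. test, used for prediction
```

A larger width makes the kernel value decay more slowly as the two vectors move apart. The two calls at the end correspond to the two ways the kernel is initialized above: first on the training data alone, then on training versus test data.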

4. SVMLight

Classification with a support vector machine using SVMLight.

from modshogun import SVMLight, CSVFile, BinaryLabels, RealFeatures, Chi2Kernel

feats_train = RealFeatures(CSVFile(train_data_file_path))
feats_test = RealFeatures(CSVFile(test_data_file_path))

kernel = Chi2Kernel(feats_train, feats_train, 1.4, 10)

labels = BinaryLabels(CSVFile(label_traindat_path))
 
C = 1.2
epsilon = 1e-5
num_threads = 1
svm = SVMLight(C, kernel, labels)
svm.set_epsilon(epsilon)
svm.parallel.set_num_threads(num_threads)
svm.train()

kernel.init(feats_train, feats_test)
res = svm.apply().get_labels()

res
→array of predicted labels
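
Once the predicted labels are available, they can be compared with the true test labels. The following is a minimal sketch of computing accuracy by hand in plain Python; `accuracy` is a hypothetical helper, and the label values are made up for illustration.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("label lists must have the same length")
    if not expected:
        raise ValueError("label lists must not be empty")
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / float(len(expected))


predicted = [1.0, -1.0, 1.0, 1.0]   # made-up classifier output
expected = [1.0, -1.0, -1.0, 1.0]   # made-up true labels
print(accuracy(predicted, expected))
```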

5. Cross-validation

Import the CrossValidation class. To initialize CrossValidation, pass the following as arguments:

Classifier (CMachine class such as SVMLight or LibLinear)
Features (CFeatures classes such as RealFeatures and DenseFeatures)
Labels (CLabels classes such as BinaryLabels and MulticlassLabels)
Data partitioning method (CSplittingStrategy class such as CrossValidationSplitting)
Evaluation criteria (CEvaluation class such as ContingencyTableEvaluation)

from modshogun import CrossValidation, LibLinear, L2R_L2LOSS_SVC, BinaryLabels, RealFeatures, CrossValidationSplitting, ContingencyTableEvaluation, CSVFile, ACCURACY

#Classifier
classifier = LibLinear(L2R_L2LOSS_SVC)
#Features
features = RealFeatures(CSVFile(feature_file_path))
#Labels
labels = BinaryLabels(CSVFile(label_file_path))


#The splitting strategy specifies how the data is split. I don't know much about it. In this example, the data is split into 5 subsets.
splitting_strategy = CrossValidationSplitting(labels, 5)

#Evaluation criteria class. ACCURACY is just a constant declared in EContingencyTableMeasureType.
evaluation_criterium = ContingencyTableEvaluation(ACCURACY)

#Cross-validation class.
cross_validation = CrossValidation(classifier, features, labels, splitting_strategy, evaluation_criterium)
cross_validation.set_autolock(False)

#Setting the number of repetitions
cross_validation.set_num_runs(10)

#Presumably sets a 95% confidence interval (alpha = 0.05), but I'm not sure
cross_validation.set_conf_int_alpha(0.05)

#The return value is the CEvaluationResult class
result = cross_validation.evaluate()

#You can get the mean of the cross-validation results.
print(result.mean)

#Use this if you want to output everything else as well
result.print_result()
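
To show what splitting into five subsets means, here is a rough sketch of k-fold index splitting in plain Python. It uses contiguous folds without shuffling, which is an assumption made for simplicity and may differ from how CrossValidationSplitting actually assigns samples; `k_fold_indices` and `train_test_splits` are hypothetical names.

```python
def k_fold_indices(num_samples, num_folds):
    """Split indices 0..num_samples-1 into num_folds contiguous folds."""
    base, extra = divmod(num_samples, num_folds)
    folds = []
    start = 0
    for i in range(num_folds):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(num_samples, num_folds):
    """Yield (train_indices, test_indices), using each fold once as the test set."""
    folds = k_fold_indices(num_samples, num_folds)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in train_test_splits(10, 5):
    print(train, test)
```

Each run of cross-validation trains on four of the five folds and evaluates on the remaining one, and the reported mean is the average of the evaluation results over those folds and over the number of runs.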

6. Grid search

Once you have gotten this far, grid search is quite easy. Initialize GridSearchModelSelection (a CModelSelection class) by passing it the following:

CrossValidation
ModelSelectionParameters

After that, you can already run the grid search.

---Everything up to initializing the CrossValidation class with LibLinear is omitted---

from modshogun import ModelSelectionParameters, R_EXP
from modshogun import GridSearchModelSelection

#An object that stores parameters to change
param_tree_root = ModelSelectionParameters()

#Parameter C1
c1 = ModelSelectionParameters("C1")
param_tree_root.append_child(c1)

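#build_values() sets the minimum value, the maximum value, and the range type (how the values increase). There are three types, R_EXP (exponential), R_LOG (logarithmic), and R_LINEAR (linear), but the details are unknown. Judging from the print_tree() output below, R_EXP with -1.0 and 0.0 yields 2^-1 = 0.5 and 2^0 = 1.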
c1.build_values(-1.0, 0.0, R_EXP)

c2 = ModelSelectionParameters("C2")
param_tree_root.append_child(c2)
c2.build_values(-1.0, 0.0, R_EXP)

#Running print_tree() here shows that param_tree_root has a tree structure.
param_tree_root.print_tree()
→root with
	 with values: vector=[0.5,1]
	 with values: vector=[0.5,1]

#Generate grid search class
model_selection = GridSearchModelSelection(cross_validation, param_tree_root)

#This will automatically determine the best parameters and return an object of class CParameterCombination. Also, if you pass True as an argument, the combination of each parameter and the result will be output.
best_parameters = model_selection.select_model()

#It is also possible to apply the best returned parameters as classifier or model parameters.
best_parameters.apply_to_machine(classifier)
result = cross_validation.evaluate()
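
The idea behind the grid search can be sketched in plain Python: build the candidate values for each parameter, try every combination, and keep the best one. The exponential range below assumes base 2 and a step of 1, which matches the [0.5, 1] values printed above but is otherwise an assumption. `exp_range`, `grid_search`, and `toy_score` are hypothetical names, and the score function is made up in place of a real cross-validation run.

```python
import itertools


def exp_range(min_exp, max_exp, step=1.0, base=2.0):
    """Values base**e for e = min_exp, min_exp + step, ... up to max_exp."""
    values = []
    exponent = min_exp
    while exponent <= max_exp + 1e-12:  # small tolerance for float steps
        values.append(base ** exponent)
        exponent += step
    return values


def grid_search(param_grid, score):
    """Evaluate score(params) for every combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    """Made-up score that peaks at C1 = 1 and C2 = 0.5 (stand-in for cross-validation)."""
    return -((params["C1"] - 1.0) ** 2 + (params["C2"] - 0.5) ** 2)


grid = {"C1": exp_range(-1.0, 0.0), "C2": exp_range(-1.0, 0.0)}
best_params, best_score = grid_search(grid, toy_score)
print(grid)
print(best_params, best_score)
```

With two values per parameter there are only four combinations here, but the number grows multiplicatively with each added parameter, so the ranges are worth keeping small.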

7. Save and load the created model

Objects can be saved and loaded using the save_serializable() and load_serializable() functions of CSGObject, which is the base of almost all classes.

from modshogun import SerializableAsciiFile
from modshogun import MulticlassLabels
from numpy import array

save_labels = MulticlassLabels(array([1.0, 2, 3]))

#Set the file name. csv and asc are supported
save_file = SerializableAsciiFile("foo.csv", "w")
#Save file
save_labels.save_serializable(save_file)

load_file = SerializableAsciiFile("foo.csv", "r")
load_labels = MulticlassLabels()
load_labels.load_serializable(load_file)

#Check the loaded labels
load_labels.get_labels()
→[ 1.  2.  3.]

8. Output logs

You can output logs for each object. Pass MSG_DEBUG for debug-level logs, or MSG_ERROR for error logs only. These constants are declared in EMessageType.

from modshogun import MSG_DEBUG, MSG_ERROR
from modshogun import Chi2Kernel
from modshogun import LibSVM

kernel = Chi2Kernel()
svm = LibSVM()

kernel.io.set_loglevel(MSG_DEBUG)
svm.io.set_loglevel(MSG_ERROR)

In conclusion

These notes are somewhat disorganized, so if you have any requests, please leave a comment.