[PYTHON] Introduction to Machine Learning Library SHOGUN

This post is a version of my blog post reformatted for Qiita. Any additions will be written on the blog: ["Introduction to Machine Learning Library SHOGUN"](http://rest-term.com/archives/3090/)

Introduction to Machine Learning Library SHOGUN

As the official site puts it: "The machine learning toolbox's focus is on large scale kernel methods and especially on Support Vector Machines (SVM)."

- Environment
- Installation
- NOTE: Dependency library installation
  - SWIG
  - BLAS (ATLAS) / LAPACK / GLPK / Eigen3
  - NumPy
- Precautions when compiling
- hello, world (libshogun)
- Notes on memory management
- Python Modular

Installation

The official website describes the setup procedure assuming a Debian-based OS, but it can be installed on Red Hat-based systems without any trouble. On Debian an older version is distributed as a deb package, but since the environment here is CentOS, we compile and install from source. Machine-learning tasks often run for a long time, so I recommend building this kind of software (not just SHOGUN) in the environment where it will actually run, so that it operates in the best possible state.

Earlier versions were apparently built with Autotools (./configure && make), but the latest release has switched to CMake. CMake adoption has grown over the last few years, with projects such as OpenCV and MySQL using it as well.

$ git clone git://github.com/shogun-toolbox/shogun.git
$ cd shogun
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=/usr/local/shogun-2.1.0 \
        -DCMAKE_BUILD_TYPE=Release \
        -DBUNDLE_EIGEN=ON \
        -DBUNDLE_JSON=ON \
        -DCmdLineStatic=ON \
        -DPythonModular=ON ..
##Dependent libraries are checked and the build configuration is displayed.
-- Summary of Configuration Variables
--
-- The following OPTIONAL packages have been found:

 * GDB
 * OpenMP
 * BLAS
 * Threads
 * LAPACK
 * Atlas
 * GLPK
 * Doxygen
 * LibXml2
 * CURL
 * ZLIB
 * BZip2
 * Spinlock

-- The following REQUIRED packages have been found:

 * SWIG (required version >= 2.0.4)
 * PythonLibs
 * PythonInterp
 * NumPy

-- The following OPTIONAL packages have not been found:

 * CCache
 * Mosek
 * CPLEX
 * ARPACK
 * NLopt
 * LpSolve
 * ColPack
 * ARPREC
 * HDF5
 * LibLZMA
 * SNAPPY
 * LZO

-- ==============================================================================================================
-- Enabled Interfaces
--   libshogun is ON
--   python modular is ON
--   octave modular is OFF       - enable with -DOctaveModular=ON
--   java modular is OFF         - enable with -DJavaModular=ON
--   perl modular is OFF         - enable with -DPerlModular=ON
--   ruby modular is OFF         - enable with -DRubyModular=ON
--   csharp modular is OFF       - enable with -DCSharpModular=ON
--   R modular is OFF            - enable with -DRModular=ON
--   lua modular is OFF          - enable with -DLuaModular=ON
--
-- Enabled legacy interfaces
--   cmdline static is ON
--   python static is OFF        - enable with -DPythonStatic=ON
--   octave static is OFF        - enable with -DOctaveStatic=ON
--   matlab static is OFF        - enable with -DMatlabStatic=ON
--   R static is OFF             - enable with -DRStatic=ON
-- ==============================================================================================================
##If there seems to be no problem, compile and install
$ make -j32
$ sudo make install

With a compiler that supports C++11, some C++11 features (std::atomic, etc.) are taken advantage of. Note that the GCC bundled with Red Hat 6.x (v4.4.x) supports very few C++11 features.

In my environment, I installed the command-line interface and the Python interface in addition to libshogun, the library itself. If you want interfaces for many scripting languages, you need to install SWIG separately.

NOTE: Dependency library installation

SWIG

SWIG is a tool that generates bindings so that modules (shared libraries) written in C/C++ can be used from high-level languages such as scripting languages. I also occasionally use SWIG at work to create PHP bindings for web interfaces. As of November 2013, the packages installable with yum do not meet SHOGUN's version requirement, so compile and install SWIG from source as well. On Debian, $ apt-get install swig2.0 is enough.

##Install the dependency PCRE (Perl Compatible Regular Expressions) if it is not present
$ sudo yum install pcre-devel.x86_64

##Remove the old rpm package if it is installed
$ sudo yum remove swig

$ wget http://prdownloads.sourceforge.net/swig/swig-2.0.11.tar.gz
$ tar zxf swig-2.0.11.tar.gz
$ cd swig-2.0.11
$ ./configure --prefix=/usr/local/swig-2.0.11
$ make -j2
$ sudo make install
##Put a symbolic link to the binary somewhere in your PATH
$ sudo ln -s /usr/local/swig-2.0.11/bin/swig /usr/local/bin

Once the runtimes of the languages you want SWIG to generate bindings for are in place, build SHOGUN again.

BLAS (ATLAS) / LAPACK / GLPK / Eigen3

A set of linear algebra libraries. The packages installable with yum are fine here. ATLAS is one of the optimized BLAS implementations; it is easy to install and worth including (BLAS itself is the reference implementation). SHOGUN also supports CPLEX in addition to GLPK, so if you use it in business, introducing CPLEX should improve performance further. Let's do our best writing the approval paperwork (it appears to be free for academic use). As for Eigen3, as of November 2013 the packages installable with yum do not meet SHOGUN's version requirement, but if Eigen3 is not found you can instruct CMake to download the source code (header files only, since it is a template library). Just add -DBUNDLE_EIGEN=ON to the CMake options.

##Install all the linear algebra related libraries
##If you install atlas-devel, lapack-devel is not required
$ sudo yum install blas-devel.x86_64 lapack-devel.x86_64 atlas-devel.x86_64 glpk-devel.x86_64

NumPy

**NumPy is required to install the Python interface.** I have covered NumPy on my blog before, so see that for reference. Incidentally, the Python interface of OpenCV (the computer vision library) also uses NumPy.

Installation is easy with pip (a tool for installing and managing Python packages).

$ sudo pip install numpy

The overall structure of the SHOGUN library is shown in the figure below (shogun_overview.jpg); it is honestly hard to take in at a glance. In addition to the interfaces in the figure, Java, Ruby, Lua, and others are also supported as scripting-language interfaces. As mentioned above, install SWIG and build bindings for the language you want to use.

Precautions when compiling

If you build SHOGUN in Release mode on a cheap VPS with little physical and virtual memory, there is a high chance that the cc1plus process will be killed by the OOM killer. I tried it in a virtual environment with 1 GB RAM / 2 GB swap, and it was killed with a terrible score.

kernel: Out of memory: Kill process 30340 (cc1plus) score 723 or sacrifice child
kernel: Killed process 30340, UID 500, (cc1plus) total-vm:2468236kB, anon-rss:779716kB, file-rss:2516kB
kernel: cc1plus invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=0, oom_score_adj=0

In that case, consider increasing swap space. If you have only 1 GB of physical memory, swap of about twice that is probably still not enough, so temporarily securing about four times is safer. If you are on an OpenVZ virtual environment, it may be better to give up.
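For reference, a temporary swap file can be created like this (a sketch only; the path and the 4 GB size are just examples, and this does not work inside OpenVZ containers):

##Create and enable a temporary 4GB swap file
$ sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
##Disable and remove it once the build is done
$ sudo swapoff /swapfile
$ sudo rm /swapfile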

Also, a virtual environment like a VPS usually has few inodes, so you probably cannot store a large amount of training data there either. It is less trouble to simply build on a physical server. I built it here in an environment with 32 cores / 96 GB RAM, and with ample resources the build went smoothly.

By the way, the GCC optimization options in my environment are as follows.

-march=core2 -mcx16 -msahf -maes -mpclmul -mavx --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=generic

hello, world (libshogun)

Let's start with a simple task using libshogun. This sample classifies data with an [SVM (Support Vector Machine)](http://ja.wikipedia.org/wiki/%E3%82%B5%E3%83%9D%E3%83%BC%E3%83%88%E3%83%99%E3%82%AF%E3%82%BF%E3%83%BC%E3%83%9E%E3%82%B7%E3%83%B3).

/* hello_shogun.cpp */
#include <shogun/labels/BinaryLabels.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/kernel/GaussianKernel.h>
#include <shogun/classifier/svm/LibSVM.h>
#include <shogun/base/init.h>
#include <shogun/lib/common.h>
#include <shogun/io/SGIO.h>

using namespace shogun;

int main(int argc, char** argv) {
  // initialize
  init_shogun_with_defaults();

  // create some data
  SGMatrix<float64_t> matrix(2,3);
  for(int i=0; i<6; i++) {
    matrix.matrix[i] = i;
  }
  matrix.display_matrix();

  // create three 2-dimensional vectors
  CDenseFeatures<float64_t>* features = new CDenseFeatures<float64_t>();
  features->set_feature_matrix(matrix);

  // create three labels
  CBinaryLabels* labels = new CBinaryLabels(3);
  labels->set_label(0, -1);
  labels->set_label(1, +1);
  labels->set_label(2, -1);

  // create gaussian kernel(RBF) with cache 10MB, width 0.5
  CGaussianKernel* kernel = new CGaussianKernel(10, 0.5);
  kernel->init(features, features);

  // create libsvm with C=10 and train
  CLibSVM* svm = new CLibSVM(10, kernel, labels);
  svm->train();

  SG_SPRINT("total sv:%d, bias:%f\n", svm->get_num_support_vectors(), svm->get_bias());

  // classify on training examples
  for(int i=0; i<3; i++) {
    SG_SPRINT("output[%d]=%f\n", i, svm->apply_one(i));
  }

  // free up memory
  SG_UNREF(svm);

  exit_shogun();
  return 0;
}
##For a relatively new compiler, it is good to compile with C++11 enabled.
$ g++ -g -Wall -std=c++0x hello_shogun.cpp -o hello_shogun -L/usr/local/lib64 -lshogun
$ ./hello_shogun
matrix=[
[       0,      2,      4],
[       1,      3,      5]
]
total sv:3, bias:-0.333333
output[0]=-0.999997
output[1]=1.000003
output[2]=-1.000005

Extract the feature vectors (CDenseFeatures) from the matrix, set the ground-truth labels (CBinaryLabels), and train an SVM (CLibSVM) using a Gaussian kernel (CGaussianKernel). A few points to note:

- Feature vectors are read from the matrix in column-major order, as in OpenGL, cuBLAS, etc. (each column becomes one feature vector).
- For SVM training, external libraries such as LibSVM and SVMLight can be used internally through the SHOGUN interface (see under shogun/classifier/svm/).
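As a quick sanity check of this column-major convention, here is a minimal sketch using the python_modular interface introduced later (it assumes NumPy and the modshogun module are installed; RealFeatures is the binding of CDenseFeatures<float64_t>):

import numpy as np
import modshogun as sg

# same data as the C++ sample: three 2-dimensional examples, one per column
mat = np.array([[0., 2., 4.],
                [1., 3., 5.]])
feats = sg.RealFeatures(mat)
print(feats.get_num_vectors())   # -> 3 (one vector per column)
print(feats.get_num_features())  # -> 2 (the dimensionality)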

Notes on memory management

At first glance the code above looks like it leaks memory, but checking with valgrind shows that memory is managed properly internally. SHOGUN manages objects by reference counting, and macros (SG_REF / SG_UNREF) are provided to increment and decrement the count; however, you do not need to manipulate the count manually for every instance. Reading SHOGUN's memory-management code, when a referencing instance is released (that is, in its destructor), it decrements the reference counts of the instances it refers to. An instance is freed once its reference count drops to zero or below, so in the code above, releasing the SVM instance releases the other instances in a chain.

Note that **nothing happens when a variable goes out of scope**; an instance is only considered for release when its reference count changes. It feels like well-meant meddling, but there is no way around it, so you have to work with this mechanism. Here is one such policy:

// Hold the instance in a C++ standard smart pointer, and manually
// increment SHOGUN's reference count right after creating it
std::unique_ptr<CDenseFeatures<float64_t> > features(new CDenseFeatures<float64_t>());
SG_REF(features);
// Do not decrement the reference count yourself
// (the unique_ptr releases the instance when it goes out of scope)

Next, let's feed in unknown data. We will use training and test data generated in Python as shown below and written out to files.

import numpy as np

def genexamples(n):
    class1 = 0.6*np.random.randn(n, 2)
    class2 = 1.2*np.random.randn(n, 2) + np.array([5, 1])
    labels = np.hstack((np.ones(n), -np.ones(n)))
    return (class1, class2, labels)
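The file-writing step itself is omitted above; a minimal sketch could look like the following. The orientation is my assumption: one example per column, space-separated (np.savetxt's default), matching the column-major convention and the space-separated format CCSVFile accepts. Adjust it to your setup if needed.

class1, class2, labels = genexamples(200)
# stack to (400, 2), then transpose to (2, 400): one example per column
np.savetxt('traindata.dat', np.vstack((class1, class2)).T)
np.savetxt('labeldata.dat', labels)

test1, test2, _ = genexamples(200)
np.savetxt('testdata.dat', np.vstack((test1, test2)).T)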

**Note: the following code has been confirmed to work up to libshogun.so.14.**

#include <shogun/labels/BinaryLabels.h>
#include <shogun/features/DenseFeatures.h>
#include <shogun/kernel/GaussianKernel.h>
#include <shogun/classifier/svm/LibSVM.h>
#include <shogun/io/SGIO.h>
#include <shogun/io/CSVFile.h>
#include <shogun/evaluation/ContingencyTableEvaluation.h>
#include <shogun/base/init.h>
#include <shogun/lib/common.h>

using namespace std;
using namespace shogun;

int main(int argc, char** argv) {
  try {
    init_shogun_with_defaults();

    // training examples
    CCSVFile train_data_file("traindata.dat");
    // labels of the training examples
    CCSVFile train_labels_file("labeldata.dat");
    // test examples
    CCSVFile test_data_file("testdata.dat");

    SG_SPRINT("training ...\n");
    SGMatrix<float64_t> train_data;
    train_data.load(&train_data_file);

    CDenseFeatures<float64_t>* train_features = new CDenseFeatures<float64_t>(train_data);
    SG_REF(train_features);
    SG_SPRINT("num train vectors: %d\n", train_features->get_num_vectors());

    CBinaryLabels* train_labels = new CBinaryLabels();
    SG_REF(train_labels);
    train_labels->load(&train_labels_file);
    SG_SPRINT("num train labels: %d\n", train_labels->get_num_labels());

    float64_t width = 2.1;
    CGaussianKernel* kernel = new CGaussianKernel(10, width);
    SG_REF(kernel);
    kernel->init(train_features, train_features);

    float64_t C = 1.0;
    CLibSVM* svm = new CLibSVM(C, kernel, train_labels);
    SG_REF(svm);
    svm->train();
    SG_SPRINT("total sv:%d, bias:%f\n", svm->get_num_support_vectors(), svm->get_bias());
    SG_UNREF(train_features);
    SG_UNREF(train_labels);
    SG_UNREF(kernel);

    CBinaryLabels* predict_labels = svm->apply_binary(train_features);
    SG_REF(predict_labels);

    CErrorRateMeasure* measure = new CErrorRateMeasure();
    SG_REF(measure);
    measure->evaluate(predict_labels, train_labels);
    float64_t accuracy = measure->get_accuracy()*100;
    SG_SPRINT("accuracy: %f\%\n", accuracy);
    SG_UNREF(predict_labels);
    SG_UNREF(measure);

    SG_SPRINT("testing ...\n");
    SGMatrix<float64_t> test_data;
    test_data.load(&test_data_file);

    CDenseFeatures<float64_t>* test_features = new CDenseFeatures<float64_t>(test_data);
    SG_REF(test_features);
    SG_SPRINT("num test vectors: %d\n", test_features->get_num_vectors());

    CBinaryLabels* test_labels = svm->apply_binary(test_features);
    SG_REF(test_labels);
    SG_SPRINT("num test labels: %d\n", test_labels->get_num_labels());
    SG_SPRINT("test labels: ");
    test_labels->get_labels().display_vector();
    CCSVFile test_labels_file("test_labels_file.dat", 'w');
    test_labels->save(&test_labels_file);

    SG_UNREF(svm);
    SG_UNREF(test_features);
    SG_UNREF(test_labels);

    exit_shogun();
  } catch(ShogunException& e) {
    SG_SPRINT(e.get_exception_string());
    return -1;
  }
  return 0;
}
training ...
num train vectors: 400
num train labels: 400
total sv:37, bias:-0.428868
accuracy: 99.750000%
testing ...
num test vectors: 400
num test labels: 400
test labels: vector=[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1]

SHOGUN's class design is at just the right granularity and the API is nicely abstracted, but the code gets messy once reference-counting operations are mixed in. Also, the training data is read from files with the CCSVFile class; despite the CSV name, it can also read files in which 2D data is separated by spaces rather than commas.

To use SHOGUN and OpenCV together, it may be useful to write an adapter that converts between shogun::SGMatrix and cv::Mat. OpenCV is also moving ahead with CUDA support, so I would like SHOGUN to support it as well.
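From Python this interoperability is already straightforward, since OpenCV's Python API also uses NumPy arrays. A small sketch (mat_to_features is a hypothetical helper of mine, assuming one sample per row on the OpenCV side):

import numpy as np
import modshogun as sg

def mat_to_features(mat):
    # OpenCV/NumPy matrices hold one sample per row;
    # SHOGUN expects one sample per column, as float64
    return sg.RealFeatures(np.asarray(mat.T, dtype=np.float64))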

Python Modular

Next, let's use SHOGUN's Python bindings. Two flavors are provided, python_static and python_modular; I will use python_modular because its interface is cleaner.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import modshogun as sg
import numpy as np
import matplotlib.pyplot as plt

def classifier():
    train_datafile = sg.CSVFile('traindata.dat')
    train_labelsfile = sg.CSVFile('labeldata.dat')
    test_datafile = sg.CSVFile('testdata.dat')

    train_features = sg.RealFeatures(train_datafile)
    train_labels = sg.BinaryLabels(train_labelsfile)
    test_features = sg.RealFeatures(test_datafile)

    print('training ...')
    width = 2.1
    kernel = sg.GaussianKernel(train_features, train_features, width)

    C = 1.0
    svm = sg.LibSVM(C, kernel, train_labels)
    svm.train()
    sv = svm.get_support_vectors()
    bias = svm.get_bias()
    print('total sv:%s, bias:%s' % (len(sv), bias))

    predict_labels = svm.apply(train_features)
    measure = sg.ErrorRateMeasure()
    measure.evaluate(predict_labels, train_labels)
    print('accuracy: %s%%' % (measure.get_accuracy()*100))

    print('testing ...')
    test_labels = svm.apply(test_features)
    print(test_labels.get_labels())

if __name__=='__main__':
    classifier()
training ...
total sv:37, bias:-0.428868128708
accuracy: 99.75%
testing ...
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.
... (output truncated)

This gives the same result as the C++ (libshogun) version. Let's visualize the classification result using matplotlib.

import numpy as np
import matplotlib.pyplot as plt
import modshogun as sg

##Get the classification boundary from a trained classifier
def getboundary(plotrange, classifier):
    x = np.arange(plotrange[0], plotrange[1], .1)
    y = np.arange(plotrange[2], plotrange[3], .1)
    xx, yy = np.meshgrid(x, y)
    gridmatrix = np.vstack((xx.flatten(), yy.flatten()))
    gridfeatures = sg.RealFeatures(gridmatrix)
    gridlabels = classifier.apply(gridfeatures)
    zz = gridlabels.get_labels().reshape(xx.shape)
    return (xx, yy, zz)

##Get the classification boundary for the drawing range, using the svm trained above
xx, yy, zz = getboundary([-4,8,-4,5], svm)
##Draw the classification boundary
plt.contour(xx, yy, zz, [1,-1])
plt.show()

shogun_svm.png

I have now gone over basic usage from both C++ and Python. SHOGUN implements not only SVMs but a variety of machine learning algorithms, so I would like to continue verifying it with more practical tasks.
