[Translation] scikit-learn 0.18 User Guide 1.4. Support Vector Machines

http://scikit-learn.org/0.18/modules/svm.html (Google translation)

From [scikit-learn 0.18 User Guide 1. Supervised Learning](http://qiita.com/nazoking@github/items/267f2371757516f8c168)


1.4. Support vector machine

**Support Vector Machines (SVMs)** are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

- Effective in high dimensional spaces.
- Still effective in cases where the number of dimensions is greater than the number of samples.
- Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
- Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

- If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and the regularization term is crucial.
- SVMs do not directly provide probability estimates; these are calculated using an expensive cross-validation (see Scores and probabilities, below).

The support vector machines in scikit-learn support both dense (numpy.ndarray, or anything convertible to it by numpy.asarray) and sparse (any scipy.sparse) sample vectors as input. However, to use an SVM to make predictions for sparse data, it must have been fit on such data. For optimal performance, use C-ordered numpy.ndarray (dense) or scipy.sparse.csr_matrix (sparse) with dtype=float64.
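
As a small sketch of both input types (the data below is made up purely for illustration):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> from sklearn import svm
>>> X_dense = np.array([[0., 0.], [1., 1.]], dtype=np.float64)   # C-ordered float64
>>> y = [0, 1]
>>> clf = svm.SVC().fit(csr_matrix(X_dense), y)    # fit on sparse data...
>>> pred = clf.predict(csr_matrix([[2., 2.]]))     # ...so predictions are made on sparse data too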

1.4.1. Classification

SVC, NuSVC and LinearSVC are classes capable of performing multi-class classification on a dataset.

SVC and NuSVC are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see the section Mathematical formulation below). LinearSVC, on the other hand, is another implementation of Support Vector Classification for the case of a linear kernel. Note that LinearSVC does not accept the keyword kernel, as the kernel is assumed to be linear. It also lacks some of the members of SVC and NuSVC, such as support_.

Like other classifiers, SVC, NuSVC and LinearSVC take as input two arrays: an array X of size [n_samples, n_features] holding the training samples, and an array y of class labels (strings or integers) of size [n_samples]:

>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)  
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

After fitting, you can use the model to predict new values.

>>> clf.predict([[2., 2.]])
array([1])

The SVM decision function depends on some subset of the training data, called the support vectors. Some properties of these support vectors can be found in the members support_vectors_, support_ and n_support_:

>>> # get support vectors
>>> clf.support_vectors_
array([[ 0.,  0.],
       [ 1.,  1.]])
>>> # get indices of support vectors
>>> clf.support_ 
array([0, 1]...)
>>> # get number of support vectors for each class
>>> clf.n_support_ 
array([1, 1]...)

1.4.1.1. Multi-class classification

SVC and NuSVC implement the "one-against-one" approach (Knerr et al., 1990) for multi-class classification. If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are constructed, each one trained on data from two classes. To provide a consistent interface with other classifiers, the decision_function_shape option allows the results of the "one-against-one" classifiers to be aggregated into a decision function of shape (n_samples, n_classes):

>>> X = [[0], [1], [2], [3]]
>>> Y = [0, 1, 2, 3]
>>> clf = svm.SVC(decision_function_shape='ovo')
>>> clf.fit(X, Y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes: 4*3/2 = 6
6
>>> clf.decision_function_shape = "ovr"
>>> dec = clf.decision_function([[1]])
>>> dec.shape[1] # 4 classes
4

LinearSVC, on the other hand, implements a "one-vs-rest" multi-class strategy, thus training n_class models. If there are only two classes, only one model is trained:

>>> lin_clf = svm.LinearSVC()
>>> lin_clf.fit(X, Y) 
LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
>>> dec = lin_clf.decision_function([[1]])
>>> dec.shape[1]
4

See Mathematical formulation for a complete description of the decision function.

Note that LinearSVC also implements an alternative multi-class strategy, the so-called multi-class SVM formulated by Crammer and Singer, by using the option multi_class='crammer_singer'. This method is consistent, which is not true for the one-vs-rest classification. In practice, one-vs-rest classification is usually preferred, since the results are mostly similar but the run time is significantly shorter.

For "one-vs-rest" LinearSVC, the attributes coef_ and intercept_ have the shapes [n_class, n_features] and [n_class] respectively. Each row of the coefficients corresponds to one of the n_class "one-vs-rest" classifiers, and similarly for the intercepts, in the order of the "one" class.

In the case of "one-vs-one" SVC, the layout of the attributes is a little more involved. For a linear kernel, the layout of coef_ and intercept_ is similar to the one described for LinearSVC above, except that the shape of coef_ is [n_class * (n_class - 1) / 2, n_features], corresponding to as many binary classifiers. The order for the classes 0 to n is "0 vs 1", "0 vs 2", ..., "0 vs n", "1 vs 2", "1 vs 3", ..., "1 vs n", ..., "n-1 vs n".

The shape of dual_coef_ is [n_class - 1, n_SV], which is a somewhat hard to grasp layout. The columns correspond to the support vectors involved in any of the n_class * (n_class - 1) / 2 "one-vs-one" classifiers. Each support vector is used in n_class - 1 classifiers, and the n_class - 1 entries in each row correspond to the dual coefficients for these classifiers.

This might be clearer with an example: consider a three-class problem with class 0 having three support vectors $v^{0}_0, v^{1}_0, v^{2}_0$ and classes 1 and 2 having two support vectors $v^{0}_1, v^{1}_1$ and $v^{0}_2, v^{1}_2$ respectively. For each support vector $v^{j}_i$, there are two dual coefficients. Let us call the coefficient of support vector $v^{j}_i$ in the classifier between classes $i$ and $k$ $\alpha^{j}_{i,k}$. Then dual_coef_ looks like this:

| | | |
|:---|:---|:---|
| $\alpha^{0}_{0,1}$ | $\alpha^{0}_{0,2}$ | Coefficients for SVs of class 0 |
| $\alpha^{1}_{0,1}$ | $\alpha^{1}_{0,2}$ | |
| $\alpha^{2}_{0,1}$ | $\alpha^{2}_{0,2}$ | |
| $\alpha^{0}_{1,0}$ | $\alpha^{0}_{1,2}$ | Coefficients for SVs of class 1 |
| $\alpha^{1}_{1,0}$ | $\alpha^{1}_{1,2}$ | |
| $\alpha^{0}_{2,0}$ | $\alpha^{0}_{2,1}$ | Coefficients for SVs of class 2 |
| $\alpha^{1}_{2,0}$ | $\alpha^{1}_{2,1}$ | |
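
To make these shapes concrete, here is a minimal sketch on a made-up three-class toy dataset (the data is purely illustrative):

>>> import numpy as np
>>> from sklearn import svm
>>> X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]])
>>> y = [0, 0, 1, 1, 2, 2]
>>> clf = svm.SVC(kernel='linear').fit(X, y)
>>> clf.dual_coef_.shape[0]   # n_class - 1
2
>>> clf.coef_.shape           # one row per "one-vs-one" pair for the linear kernel
(3, 2)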

1.4.1.2. Scores and probabilities

The SVC method decision_function gives per-class scores for each sample (or a single score per sample in the binary case). When the constructor option probability is set to True, class membership probability estimates (from the methods predict_proba and predict_log_proba) are enabled. In the binary case, the probabilities are calibrated using Platt scaling: logistic regression on the SVM's scores, fit by an additional cross-validation on the training data. In the multiclass case, this is extended as per Wu et al. (2004).

Needless to say, the cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores, in the sense that the "argmax" of the scores may not be the argmax of the probabilities. (For example, in binary classification, a sample may be labeled by predict as belonging to a class that has probability < 1/2 according to predict_proba.) Platt's method is also known to have theoretical issues. If confidence scores are required, but these do not have to be probabilities, then it is advisable to set probability=False and use decision_function instead of predict_proba.
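
As a small sketch (on made-up random data) of how the calibrated probabilities relate to the raw scores:

>>> import numpy as np
>>> from sklearn import svm
>>> rng = np.random.RandomState(0)
>>> X = np.r_[rng.randn(10, 2) - 1, rng.randn(10, 2) + 1]   # 20 illustrative points
>>> y = [0] * 10 + [1] * 10
>>> clf = svm.SVC(probability=True, random_state=0).fit(X, y)
>>> proba = clf.predict_proba([[0., 0.]])       # Platt-calibrated probabilities, shape (1, 2)
>>> score = clf.decision_function([[0., 0.]])   # raw score; its sign may occasionally disagree with argmax(proba)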

- Reference: Wu, Lin and Weng, "Probability estimates for multi-class classification by pairwise coupling", JMLR 5:975-1005, 2004.

1.4.1.3. Unbalanced problems

In problems where it is desired to give more importance to certain classes or certain individual samples, the keywords class_weight and sample_weight can be used.

SVC (but not NuSVC) implements the keyword class_weight in the fit method. It is a dictionary of the form {class_label: value}, where value is a floating point number > 0 that sets the parameter C of class class_label to C * value.

SVC, NuSVC, SVR, NuSVR and OneClassSVM also implement weights for individual samples in the fit method through the keyword sample_weight. Similar to class_weight, this sets the parameter C for the i-th example to C * sample_weight[i].
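
A minimal sketch of both keywords (the data and weight values are made up purely for illustration):

>>> import numpy as np
>>> from sklearn import svm
>>> X = [[0, 0], [1, 1], [1, 0], [0, 1]]
>>> y = [0, 0, 1, 1]
>>> # class_weight: the effective C for class 1 becomes C * 10
>>> wclf = svm.SVC(class_weight={1: 10}).fit(X, y)
>>> # sample_weight: the effective C for the third sample becomes C * 5
>>> swclf = svm.SVC().fit(X, y, sample_weight=np.array([1., 1., 5., 1.]))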

- Examples:
  - Plot different SVM classifiers in the iris dataset
  - SVM: Maximum margin separating hyperplane
  - SVM: Separating hyperplane for unbalanced classes
  - SVM-Anova: SVM with univariate feature selection
  - Non-linear SVM
  - SVM: Weighted samples

1.4.2. Regression

The method of Support Vector Classification can be extended to solve regression problems. This method is called Support Vector Regression.

The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function for building the model ignores any training data close to the model prediction.

There are three different implementations of Support Vector Regression: SVR, NuSVR and LinearSVR. LinearSVR provides a faster implementation than SVR but only considers the linear kernel, while NuSVR implements a slightly different formulation than SVR and LinearSVR. See Implementation details for further details.

As with the classification classes, the fit method takes argument vectors X, y, only that in this case y is expected to have floating point values instead of integer values:

>>> from sklearn import svm
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = svm.SVR()
>>> clf.fit(X, y) 
SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
>>> clf.predict([[1, 1]])
array([ 1.5])

- Example: Support Vector Regression (SVR) using linear and non-linear kernels

1.4.3. Density estimation, novelty detection

One-class SVM is used for novelty detection, that is, given a set of samples, it will detect the soft boundary of that set so as to classify new points as belonging to that set or not. The class that implements this is called OneClassSVM. In this case, as it is a type of unsupervised learning, the fit method takes only an array X as input, as there are no class labels. See the section Novelty and Outlier Detection for more details on this usage.
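
A minimal sketch of this usage, on made-up data (the parameter values are arbitrary):

>>> import numpy as np
>>> from sklearn import svm
>>> # only "normal" observations are passed to fit; there are no labels
>>> X_train = np.array([[0., 0.], [0.1, 0.1], [-0.1, 0.05], [0.05, -0.1]])
>>> clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.5).fit(X_train)
>>> pred = clf.predict([[0., 0.05], [5., 5.]])   # +1 for points inside the boundary, -1 for outliers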

- Examples:
  - One-class SVM with non-linear kernel (RBF)
  - Species distribution modeling

1.4.4. Complexity

Support vector machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating the support vectors from the rest of the training data. The QP solver used by the libsvm-based implementation scales between $O(n_{features} \times n_{samples}^2)$ and $O(n_{features} \times n_{samples}^3)$, depending on how efficiently the libsvm cache is used in practice (which is dataset dependent). If the data is very sparse, $n_{features}$ should be replaced by the average number of non-zero features in a sample vector.

Also note that for the linear case, the algorithm used in LinearSVC by the liblinear implementation is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.

1.4.5. Practical tips

- **Avoiding data copy**: For SVC, SVR, NuSVC and NuSVR, if the data passed to certain methods is not C-ordered contiguous and double precision, it will be copied before calling the underlying C implementation. You can check whether a given numpy array is C-contiguous by inspecting its flags attribute.
- For LinearSVC (and LogisticRegression), any input passed as a numpy array will be copied and converted to liblinear's internal sparse data representation (double precision floats and int32 indices of non-zero components). If you want to fit a large-scale linear classifier without copying a dense numpy C-contiguous double precision array as input, we suggest using the SGDClassifier class instead. Its objective function can be configured to be almost the same as that of the LinearSVC model.
- **Kernel cache size**: For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 (MB) or 1000 (MB).
- **Setting C**: C is 1 by default and it is a reasonable default choice. If you have a lot of noisy observations you should decrease it: decreasing C corresponds to regularizing the estimate more.
- Support vector machine algorithms are not scale invariant, so **it is highly recommended to scale your data**. For example, scale each attribute of the input vector X to [0, 1] or [-1, +1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results (a minimal sketch follows this list). See section Preprocessing data for more details on scaling and normalization.
- The parameter nu in NuSVC / OneClassSVM / NuSVR approximates the fraction of training errors and support vectors.
- In SVC, if the data for classification is unbalanced (e.g. many positives and few negatives), set class_weight='balanced' and/or try different penalty parameters C.
- The underlying LinearSVC implementation uses a random number generator to select features when fitting the model. It is thus not uncommon to get slightly different results for the same input data. If that happens, try with a smaller tol parameter.
- Using the L1 penalty as provided by LinearSVC(loss='l2', penalty='l1', dual=False) yields a sparse solution, i.e. only a subset of feature weights is different from zero and contributes to the decision function. Increasing C yields a more complex model (more features are selected). The C value that yields a "null" model (all weights equal to zero) can be calculated using l1_min_c.
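
As mentioned in the scaling tip above, here is a minimal sketch using a Pipeline so that exactly the scaling learned on the training data is also applied at prediction time (the data is made up for illustration):

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn import svm
>>> X = [[0., 10.], [1., 20.], [2., 15.], [3., 25.]]
>>> y = [0, 0, 1, 1]
>>> clf = make_pipeline(StandardScaler(), svm.SVC(C=1.0)).fit(X, y)
>>> pred = clf.predict([[1.5, 18.]])   # the stored scaling is applied to the test vector as well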

1.4.6. Kernel functions

The kernel function can be one of the following:

- linear: $\langle x, x' \rangle$.
- polynomial: $(\gamma \langle x, x' \rangle + r)^d$. $d$ is specified by the keyword degree, $r$ by coef0.
- rbf: $\exp(-\gamma \|x - x'\|^2)$. $\gamma$ is specified by the keyword gamma and must be greater than 0.
- sigmoid: $\tanh(\gamma \langle x, x' \rangle + r)$, where $r$ is specified by coef0.

Different kernels are specified at initialization by the keyword kernel:

>>> linear_svc = svm.SVC(kernel='linear')
>>> linear_svc.kernel
'linear'
>>> rbf_svc = svm.SVC(kernel='rbf')
>>> rbf_svc.kernel
'rbf'

1.4.6.1. Custom kernel

You can define your own kernels, either by giving the kernel as a Python function or by precomputing the Gram matrix. Classifiers with a custom kernel behave the same way as any other classifier, except that:

- The field support_vectors_ is now empty; only the indices of the support vectors are stored in support_.
- A reference (and not a copy) of the first argument of the fit() method is stored for future reference. If that array is modified between the use of fit() and predict(), you will get unexpected results.

1.4.6.1.1. Using Python functions as kernels

You can use your own defined kernels by passing a function to the constructor keyword kernel. Your kernel must take two matrices of shapes (n_samples_1, n_features) and (n_samples_2, n_features) as arguments and return a kernel matrix of shape (n_samples_1, n_samples_2). The following code defines a linear kernel and creates a classifier instance that will use that kernel:

>>> import numpy as np
>>> from sklearn import svm
>>> def my_kernel(X, Y):
...     return np.dot(X, Y.T)
...
>>> clf = svm.SVC(kernel=my_kernel)
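
For completeness, a minimal continuation (with made-up data) showing that the classifier is then used like any other; at prediction time the same function is evaluated between the test and the training samples:

>>> X = np.array([[0, 0], [1, 1]])
>>> Y = [0, 1]
>>> clf = svm.SVC(kernel=my_kernel).fit(X, Y)
>>> pred = clf.predict(np.array([[2., 2.]]))   # my_kernel(X_test, X_train) is computed internally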

- Example: SVM with custom kernel

1.4.6.1.2. Using the Gram matrix

Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. Note that the kernel values between all training vectors and the test vectors must be provided:

>>> import numpy as np
>>> from sklearn import svm
>>> X = np.array([[0, 0], [1, 1]])
>>> y = [0, 1]
>>> clf = svm.SVC(kernel='precomputed')
>>> # linear kernel computation
>>> gram = np.dot(X, X.T)
>>> clf.fit(gram, y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto',
    kernel='precomputed', max_iter=-1, probability=False,
    random_state=None, shrinking=True, tol=0.001, verbose=False)
>>> # predict on training examples
>>> clf.predict(gram)
array([0, 1])
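
To predict on new, unseen samples with kernel='precomputed', the matrix passed to predict must hold the kernel values between the test samples and the training samples. A minimal continuation of the example above (the test point is made up):

>>> X_test = np.array([[2., 2.]])
>>> gram_test = np.dot(X_test, X.T)   # kernel between test and training vectors
>>> clf.predict(gram_test)
array([1])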

1.4.6.1.3. Parameters of the RBF kernel

When training an SVM with the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. The parameter C, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly. gamma defines how much influence a single training example has: the larger gamma is, the closer other examples must be in order to be affected. Proper choice of C and gamma is critical to the SVM's performance. One is advised to use sklearn.model_selection.GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
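
A minimal sketch of such a search (the grid values are arbitrary and only meant to illustrate exponential spacing):

>>> import numpy as np
>>> from sklearn.model_selection import GridSearchCV
>>> from sklearn import svm
>>> param_grid = {'C': np.logspace(-2, 3, 6), 'gamma': np.logspace(-4, 1, 6)}
>>> search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=3)
>>> # search.fit(X, y) would then expose search.best_params_ and search.best_estimator_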

- Example: RBF SVM parameters

1.4.7. Mathematical formulation

A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

1.4.7.1. SVC

Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, ..., n$, in two classes, and a vector $y \in \{1, -1\}^n$, SVC solves the following primal problem:

\min_ {w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i
\begin{align}
\textrm {subject to } & y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i,\\
 & \zeta_i \geq 0, i=1, ..., n
\end{align}

Its dual is:

\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
\begin{align}
\textrm {subject to } & y^T \alpha = 0\\
& 0 \leq \alpha_i \leq C, i=1, ..., n
\end{align}

where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$. The decision function is:

\operatorname{sgn}(\sum_{i=1}^n y_i \alpha_i K(x_i, x) + \rho)

**Note**: While SVM models derived from libsvm and liblinear use C as the regularization parameter, most other estimators use alpha. The relation between the two is $C = \frac{n\_samples}{alpha}$.

These parameters can be accessed through the members dual_coef_ which holds the product $y_i \alpha_i$, support_vectors_ which holds the support vectors, and intercept_ which holds the independent term $\rho$.

- References:
  - "Automatic Capacity Tuning of Very Large VC-dimension Classifiers", I. Guyon, B. Boser, V. Vapnik - Advances in Neural Information Processing 1993.
  - "Support-vector networks", C. Cortes, V. Vapnik - Machine Learning, 20, 273-297 (1995).

1.4.7.2. NuSVC

We introduce a new parameter $\nu$ which controls the number of support vectors and training errors. The parameter $\nu \in (0, 1]$ is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

It can be shown that the $\nu$-SVC formulation is a reparametrization of C-SVC and is therefore mathematically equivalent.

1.4.7.3. SVR

Given training vectors $x_i \in \mathbb{R}^p$, $i = 1, ..., n$, and a vector $y \in \mathbb{R}^n$, $\varepsilon$-SVR solves the following primal problem:

\min_{w, b, \zeta, \zeta^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)
\begin{align}
\textrm{subject to } & y_i - w^T \phi(x_i) - b \leq \varepsilon + \zeta_i,\\
& w^T \phi(x_i) + b - y_i \leq \varepsilon + \zeta_i^*,\\
& \zeta_i, \zeta_i^* \geq 0, i=1, ..., n
\end{align}

Its dual is:

   \min_{\alpha, \alpha^*} \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon e^T (\alpha + \alpha^*) - y^T (\alpha - \alpha^*)
\begin{align}
   \textrm {subject to } & e^T (\alpha - \alpha^*) = 0\\
   & 0 \leq \alpha_i, \alpha_i^* \leq C, i=1, ..., n
\end{align}

where $e$ is the vector of all ones, $C > 0$ is the upper bound, $Q$ is an $n$ by $n$ positive semidefinite matrix, and $Q_{ij} \equiv K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$. The decision function is:

 \sum_{i=1}^n (\alpha_i - \alpha_i^*) K(x_i, x) + \rho

These parameters can be accessed through the members dual_coef_ which holds the difference $\alpha_i - \alpha_i^*$, support_vectors_ which holds the support vectors, and intercept_ which holds the independent term $\rho$.

- Reference: "A Tutorial on Support Vector Regression", Alex J. Smola, Bernhard Schölkopf - Statistics and Computing archive, Volume 14, Issue 3, August 2004, p. 199-222.

1.4.8. Implementation details

Internally, we use libsvm and liblinear to handle all computations. These libraries are wrapped using C and Cython.

- References: for a description of the implementation and details of the algorithms used, please refer to:
  - LIBSVM: A Library for Support Vector Machines
  - LIBLINEAR - A Library for Large Linear Classification


From [scikit-learn 0.18 User Guide 1. Supervised Learning](http://qiita.com/nazoking@github/items/267f2371757516f8c168)

© 2010 - 2016, scikit-learn developers (BSD license).
