[PYTHON] On the processing speed of scikit-learn's SVM (SVC)

2016.09.14: Added a note on the variation in processing time

I compared the processing speed of scikit-learn's SVC (rbf kernel and linear kernel) and LinearSVC.

The data used is the spam dataset included in the R kernlab package. The explanatory variables consist of 4601 samples with 57 dimensions; the labels are spam (1813 samples) and non-spam (2788 samples).

The results obtained when varying the number of samples and the number of dimensions are as follows.

SVC with the linear kernel is far too slow. It is tempting to simply include the kernel type in the grid search, but for the linear case it seems better to use LinearSVC instead.
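
As a rough illustration of the point above, the following sketch fits `SVC(kernel="linear")` and `LinearSVC` once each and compares the fit time and how often their predictions agree. It assumes a current scikit-learn on Python 3 and uses synthetic data from `make_classification` as a stand-in for the spam data; the sizes, `C=1.0`, and `max_iter=10000` are illustrative choices, not values from the article. Note that the two estimators are backed by different solvers (libsvm and liblinear) and use slightly different default formulations, so they are close but not identical models.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC


def fit_seconds(estimator, X, y):
    """Fit the estimator once and return the wall-clock seconds it took."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


# Synthetic stand-in data (assumption: not the spam dataset from the article).
X, y = make_classification(n_samples=1000, n_features=29, random_state=0)
X = StandardScaler().fit_transform(X)

svc = SVC(kernel="linear", C=1.0)
lsvc = LinearSVC(C=1.0, max_iter=10000, random_state=0)

timings = {
    "SVC(linear)": fit_seconds(svc, X, y),
    "LinearSVC": fit_seconds(lsvc, X, y),
}

# Fraction of training samples on which both models predict the same label.
agreement = float(np.mean(svc.predict(X) == lsvc.predict(X)))

for name, seconds in timings.items():
    print("%-12s %.4f s" % (name, seconds))
print("prediction agreement: %.3f" % agreement)
```

The absolute numbers depend on the machine and the data, so this is only a way to reproduce the comparison locally, not a benchmark result.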

The verification code is given in `test_svm.py` below. The parameter C is swept over a few values (1, 10, 100) so that the processing time is easy to measure. For feature selection (dimensionality reduction) I used the feature importances of a Random Forest, because the processing time became longer when the features were chosen arbitrarily.
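
The feature-selection step on its own can be sketched as follows. This is a minimal illustration, assuming a current scikit-learn and NumPy, with synthetic data standing in for the spam data. It uses `np.quantile` where the script uses `scipy.stats.mstats.mquantiles`; the two use different default interpolation rules, so the intermediate thresholds can differ slightly, while the first and last thresholds are the maximum and minimum importance in both cases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: sizes chosen only for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

# Thresholds run from the largest importance down to the smallest, so each
# step keeps at least as many columns as the step before it.
thresholds = np.quantile(importances, np.linspace(1.0, 0.0, 5))

ndim = []
for thr in thresholds:
    X_sel = X[:, importances >= thr]
    ndim.append(X_sel.shape[1])
    print("threshold %.4f keeps %d features" % (thr, X_sel.shape[1]))
```

Selecting by importance keeps the most informative columns first, which is why the smallest subsets are still reasonable inputs for the SVMs.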

`test_svm.py`


# -*- coding: utf-8 -*-

import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn import cross_validation
from sklearn.grid_search import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier
from scipy.stats.mstats import mquantiles


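def grid_search(X, y, estimator, params, cv, n_jobs=3):
    """Run a grid search and return the elapsed wall-clock time in seconds."""
    mdl = GridSearchCV(estimator, params, cv=cv, n_jobs=n_jobs)
    # time.time() is wall-clock on every platform; time.clock() reports CPU time
    # of the parent process on Unix, which misses work done by n_jobs workers.
    t1 = time.time()
    mdl.fit(X, y)
    t2 = time.time()
    return t2 - t1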


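if __name__ == "__main__":
    # Load the spam data; the 'type' column holds the label.
    data = pd.read_csv('spam.txt', header=0)
    y = data['type']
    del data['type']

    # Shuffle the rows, then standardize each feature.
    data, y = shuffle(data, y, random_state=0)
    data = StandardScaler().fit_transform(data)

    # Rank the features by Random Forest importance.
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(data, y)

    # Part 1: vary the number of dimensions by thresholding the importances.
    ndim, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
    for thr in mquantiles(clf.feature_importances_, prob=np.linspace(1., 0., 5)):
        print thr,
        X = data[:,clf.feature_importances_ >= thr]
        ndim.append(X.shape[1])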
        
        cv = cross_validation.StratifiedShuffleSplit(y, test_size=0.2, random_state=0)

        print 'rbf',
        elp_rbf.append(grid_search(X, y, SVC(random_state=0),
            [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))

        print 'linear',
        elp_lnr.append(grid_search(X, y, SVC(random_state=0),
            [{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))

        print 'lsvm'
        elp_lsvm.append(grid_search(X, y, LinearSVC(random_state=0),
            [{'C': [1, 10, 100]}], cv))

    plt.figure()
    plt.title('Elapsed time - # of dimensions')
    plt.ylabel('Elapsed time [sec]')
    plt.xlabel('# of dimensions')
    plt.grid()
    plt.plot(ndim, elp_rbf, 'o-', color='r',
             label='SVM(rbf)')
    plt.plot(ndim, elp_lnr, 'o-', color='g',
             label='SVM(linear)')
    plt.plot(ndim, elp_lsvm, 'o-', color='b',
             label='LinearSVM')
    plt.legend(loc='best')
    plt.savefig('dimensions.png', bbox_inches='tight')
    plt.close()


    # Part 2: vary the number of samples, using all features.
    nrow, elp_rbf, elp_lnr, elp_lsvm = [], [], [], []
    for r in np.linspace(0.1, 1., 5):
        print r,
        n = int(r * data.shape[0])  # slice bounds must be integers
        X = data[:n, :]
        yy = y[:n]
        nrow.append(X.shape[0])
        
        cv = cross_validation.StratifiedShuffleSplit(yy, test_size=0.2, random_state=0)

        print 'rbf',
        elp_rbf.append(grid_search(X, yy, SVC(random_state=0),
            [{'kernel': ['rbf'], 'C': [1, 10, 100]}], cv))

        print 'linear',
        elp_lnr.append(grid_search(X, yy, SVC(random_state=0),
            [{'kernel': ['linear'], 'C': [1, 10, 100]}], cv))

        print 'lsvm'
        elp_lsvm.append(grid_search(X, yy, LinearSVC(random_state=0),
            [{'C': [1, 10, 100]}], cv))

    plt.figure()
    plt.title('Elapsed time - # of samples')
    plt.ylabel('Elapsed time [sec]')
    plt.xlabel('# of samples')
    plt.grid()
    plt.plot(nrow, elp_rbf, 'o-', color='r',
             label='SVM(rbf)')
    plt.plot(nrow, elp_lnr, 'o-', color='g',
             label='SVM(linear)')
    plt.plot(nrow, elp_lsvm, 'o-', color='b',
             label='LinearSVM')
    plt.legend(loc='best')
    plt.savefig('samples.png', bbox_inches='tight')
    plt.close()
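
The script above targets Python 2.7 and scikit-learn 0.17. In later scikit-learn releases the `sklearn.cross_validation` and `sklearn.grid_search` modules were removed in favor of `sklearn.model_selection`, and `StratifiedShuffleSplit` no longer takes the labels in its constructor. A minimal sketch of the timed grid search against the newer API might look like this. The function name `grid_search_seconds`, the synthetic data, `n_jobs=1`, and `max_iter=10000` are assumptions made for the example, not part of the original script.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def grid_search_seconds(X, y, estimator, params, cv, n_jobs=1):
    """Run a grid search; return (elapsed wall-clock seconds, fitted search)."""
    search = GridSearchCV(estimator, params, cv=cv, n_jobs=n_jobs)
    start = time.perf_counter()
    search.fit(X, y)
    return time.perf_counter() - start, search


# Synthetic stand-in data (assumption: not the spam dataset from the article).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# The splitter is now configured without labels; GridSearchCV passes y to it.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

elapsed, search = grid_search_seconds(
    X, y, LinearSVC(random_state=0, max_iter=10000), {"C": [1, 10, 100]}, cv)
print("elapsed: %.3f s, best C: %s" % (elapsed, search.best_params_["C"]))
```

The same helper can be called with `SVC()` and a `kernel` entry in the parameter grid to reproduce the other two curves.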

Addendum

I received a comment about the processing time of SVM (linear), so I checked it. Using Python 2.7.12 and scikit-learn 0.17.1, the following figure shows the variation in processing time over 200 trials with 1000 samples and 29 features.
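
One way to measure this kind of trial-to-trial variation is sketched below. The article does not say how the 200 trials were set up, so this is an assumption: the same data is refit with a fresh estimator each time, and only system noise and solver behavior cause the spread. The data is synthetic, the trial count is reduced to 20 to keep the run short, and the helper name `time_trials` is hypothetical.

```python
import statistics
import time

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC


def time_trials(make_estimator, X, y, n_trials):
    """Fit a fresh estimator n_trials times; return the wall-clock seconds."""
    times = []
    for _ in range(n_trials):
        estimator = make_estimator()
        start = time.perf_counter()
        estimator.fit(X, y)
        times.append(time.perf_counter() - start)
    return times


# 1000 samples and 29 features as in the addendum, but synthetic data.
X, y = make_classification(n_samples=1000, n_features=29, random_state=0)
X = StandardScaler().fit_transform(X)

n_trials = 20  # the addendum used 200; reduced here to keep the sketch quick
results = {
    "SVC(linear)": time_trials(
        lambda: SVC(kernel="linear", C=1.0), X, y, n_trials),
    "LinearSVC": time_trials(
        lambda: LinearSVC(C=1.0, max_iter=10000), X, y, n_trials),
}

for name, times in results.items():
    print("%-12s median %.4f s, min %.4f s, max %.4f s, stdev %.4f s" % (
        name, statistics.median(times), min(times), max(times),
        statistics.stdev(times)))
```

Reporting the median alongside the minimum and maximum makes occasional slow runs visible without letting them dominate the summary.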

SVM (linear) looks suspicious ...