Machine learning models tend to overfit their training data, so it is common to split the available data into a training set and a test set used for performance evaluation (validation). The scikit-learn documentation on cross-validation gives an easy-to-follow overview of the various ways to make this split: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
For classification, I want the split to keep the distribution of the ground-truth labels the same in each part. For that reason, I think scikit-learn's StratifiedKFold is the splitter that is used most often.
python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Four samples, each with a single binary label
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])

skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Result
TRAIN: [1 3] TEST: [0 2]
TRAIN: [0 2] TEST: [1 3]
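With only two samples per class, the effect of stratification is hard to see. As a minimal sketch, the check below uses a made-up imbalanced label vector (8 samples of class 0, 4 of class 1; the data and the names `X_demo`, `y_demo` are assumptions for illustration, not part of the example above) and compares the class-1 rate in each test fold with the overall rate.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced data: 8 samples of class 0, 4 samples of class 1
y_demo = np.array([0] * 8 + [1] * 4)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

overall_rate = y_demo.mean()
fold_rates = []
for _, test_idx in StratifiedKFold(n_splits=4).split(X_demo, y_demo):
    # Fraction of class-1 samples in this test fold
    fold_rates.append(y_demo[test_idx].mean())

print("overall:", overall_rate, "per fold:", fold_rates)
```

Each test fold here should hold two samples of class 0 and one of class 1, so every fold rate matches the overall rate of 1/3.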
StratifiedKFold is sufficient when each sample has only one label, but it does not support multilabel targets.
python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Eight samples, each with two binary labels (multilabel indicator matrix)
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

skf = StratifiedKFold(n_splits=2)
# split() raises ValueError here because y is a multilabel target
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Result
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
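The error message refers to the target type that scikit-learn infers from `y`. As a small sketch, `sklearn.utils.multiclass.type_of_target` can be called directly to see how the two kinds of `y` used so far are classified; the variable names `y_single` and `y_multi` are only illustrative.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# One label per sample (first example)
y_single = np.array([0, 0, 1, 1])
# Two binary labels per sample (second example)
y_multi = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

single_type = type_of_target(y_single)
multi_type = type_of_target(y_multi)
print(single_type)
print(multi_type)
```

The first array is reported as `binary`, which StratifiedKFold accepts; the second is reported as `multilabel-indicator`, the type named in the error.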
The iterative-stratification package (imported as iterstrat) supports multilabel targets: https://github.com/trent-b/iterative-stratification
terminal
pip install iterative-stratification
python
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

# K-fold split stratified over all label columns
mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)
for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Result
TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
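To see what stratification means for multilabel data, the sketch below takes the test indices printed above (copied by hand rather than recomputed, so it does not need iterstrat) and compares the positive rate of each label column per fold with the rate over the whole data set. The names `test_folds`, `overall` and `fold_rates` are illustrative.

```python
import numpy as np

y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

# Test indices copied from the output printed above
test_folds = [np.array([1, 2, 5, 7]), np.array([0, 3, 4, 6])]

overall = y.mean(axis=0)  # positive rate of each label column
fold_rates = [y[idx].mean(axis=0) for idx in test_folds]

print("overall:", overall)
for k, rates in enumerate(fold_rates):
    print("fold", k, ":", rates)
```

For this toy data both labels are positive in half of the samples, and both test folds reproduce that rate for each label.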
python
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

# Shuffle-split variant: n_splits random splits, each with a fixed test size
msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
for train_index, test_index in msss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Result
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
TRAIN: [1 2 5 6] TEST: [0 3 4 7]
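Unlike K-fold, a shuffle-split draws each split separately, so the test sets do not have to partition the data. As a rough illustration, the sketch below counts how often each sample index appears in the three test sets printed above (indices copied by hand; `test_sets` and `test_counts` are illustrative names).

```python
import numpy as np

n_samples = 8
# Test indices copied from the three splits printed above
test_sets = [[0, 3, 4, 6], [0, 1, 4, 5], [0, 3, 4, 7]]

# How often each sample index lands in a test set
test_counts = np.zeros(n_samples, dtype=int)
for idx in test_sets:
    test_counts[idx] += 1

print(test_counts)
```

In this run samples 0 and 4 are tested in every split while sample 2 is never tested, which is worth keeping in mind when every sample needs an out-of-fold prediction.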
python
from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

# Repeats the stratified 2-fold split n_repeats times (4 folds in total)
rmskf = RepeatedMultilabelStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)
for train_index, test_index in rmskf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Result
TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [0 1 4 5] TEST: [2 3 6 7]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
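The four folds printed above can be read as two repetitions of a 2-fold split. As a small sketch using the printed test indices (copied by hand; the variable names are illustrative), the check below groups the folds by repetition and counts how often each sample is tested.

```python
import numpy as np

n_samples, n_splits, n_repeats = 8, 2, 2
# Test indices copied from the four folds printed above
test_sets = [[1, 2, 5, 7], [0, 3, 4, 6], [2, 3, 6, 7], [0, 1, 4, 5]]

# Consecutive groups of n_splits folds belong to one repetition
for r in range(n_repeats):
    rep = test_sets[r * n_splits:(r + 1) * n_splits]
    covered = sorted(i for fold in rep for i in fold)
    print("repetition", r, "covers", covered)

# Number of times each sample index appears in a test fold
test_counts = np.bincount(np.concatenate(test_sets), minlength=n_samples)
print(test_counts)
```

Each repetition covers all eight samples exactly once, so every sample is tested `n_repeats` times in total.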