[PYTHON] Feature Selection Datasets

Feature Selection Datasets

Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying and benchmarking machine learning methods.

http://featureselection.asu.edu/datasets.php

The collection is large, so I wanted to list its contents and find suitable datasets, and ran a light analysis on it.

Besides downloading the data and inspecting its structure, I also used scikit-learn's RandomForestClassifier to gauge how difficult each classification problem is.

Code

import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Here is the list of datasets I downloaded. I fixed a few incorrect URLs.

dataset_url = [
        "http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
        "http://featureselection.asu.edu/files/datasets/PCMAC.mat",
        "http://featureselection.asu.edu/files/datasets/RELATHE.mat",
        "http://featureselection.asu.edu/files/datasets/COIL20.mat",
        "http://featureselection.asu.edu/files/datasets/ORL.mat",
        "http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
        "http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
        "http://featureselection.asu.edu/files/datasets/Yale.mat",
        "http://featureselection.asu.edu/files/datasets/USPS.mat",
        "http://featureselection.asu.edu/files/datasets/ALLAML.mat",
        "http://featureselection.asu.edu/files/datasets/Carcinom.mat",
        "http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
        "http://featureselection.asu.edu/files/datasets/colon.mat",
        "http://featureselection.asu.edu/files/datasets/GLI_85.mat",
        "http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
        "http://featureselection.asu.edu/files/datasets/leukemia.mat",
        "http://featureselection.asu.edu/files/datasets/lung.mat",
        "http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
        "http://featureselection.asu.edu/files/datasets/lymphoma.mat",
        "http://featureselection.asu.edu/files/datasets/nci9.mat",
        "http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
        "http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
        "http://featureselection.asu.edu/files/datasets/TOX_171.mat",
        "http://featureselection.asu.edu/files/datasets/arcene.mat",
        "http://featureselection.asu.edu/files/datasets/gisette.mat",
        "http://featureselection.asu.edu/files/datasets/Isolet.mat",
        "http://featureselection.asu.edu/files/datasets/madelon.mat"
]
result = {
    'dataset':[],
    'byte':[],
    'X.shape':[],
    'X_type':[],
    'y.shape':[],
    'n_class':[],
    'RF_max':[],
    'RF_mean':[],
    'RF_min':[],
    'sec':[],
    }

for url in dataset_url:
    # Use the file name at the end of the URL as the dataset label.
    result['dataset'].append(url.split("/")[-1])

    # Download to a scratch file that is overwritten on every iteration.
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Each .mat file stores the feature matrix under 'X' and the labels under 'Y'.
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # Number of distinct values in the first feature column only.
    result['X_type'].append(pd.DataFrame(X).nunique()[0])

    result['y.shape'].append(y.shape)
    # Number of distinct labels.
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Repeat a random train/test split 10 times and record accuracy and fit time.
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test,y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))

    # Mean fit time in seconds over the 10 runs.
    result['sec'].append(sum(times) / len(times))
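The loop above downloads every file again each time it runs, which adds up for the larger datasets. One way to avoid that is a small on-disk cache. The following is only a sketch: the helper names `cached_path` and `fetch_cached`, the `datasets` directory, and the `download` parameter are hypothetical and not part of the original code.

```python
import os
import urllib.request
from urllib.parse import urlparse


def cached_path(url, cache_dir="datasets"):
    """Return the local path for a dataset URL (hypothetical helper)."""
    # The last path component of the URL, e.g. "colon.mat", is the file name.
    name = os.path.basename(urlparse(url).path)
    return os.path.join(cache_dir, name)


def fetch_cached(url, cache_dir="datasets", download=urllib.request.urlretrieve):
    """Download url into cache_dir unless the file is already there."""
    path = cached_path(url, cache_dir)
    if not os.path.exists(path):
        os.makedirs(cache_dir, exist_ok=True)
        # The downloader is called as download(url, path), like urlretrieve.
        download(url, path)
    return path
```

Under these assumptions, `filename = fetch_cached(url)` could stand in for the fixed `'dataset.mat'` name and the `urlretrieve` call, and each dataset would keep its own file between runs.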

Result

pd.DataFrame(result).sort_values("RF_max")
              dataset      byte       X.shape  X_type  y.shape  n_class    RF_max   RF_mean    RF_min       sec
21           nci9.mat    169288    (60, 9712)       3    (60,)        9  0.666667  0.433333  0.266667  0.183649
23    SMK_CAN_187.mat  11861244  (187, 19993)     171   (187,)        2  0.723404  0.655319  0.574468  0.670948
28        madelon.mat   1496573   (2600, 500)      40  (2600,)        2  0.733846  0.707385  0.680000  2.456003
13    CLL_SUB_111.mat   5875157  (111, 11340)     111   (111,)        3  0.750000  0.657143  0.464286  0.326307
24        TOX_171.mat   3470586   (171, 5748)     169   (171,)        4  0.813953  0.772093  0.697674  0.405085
16         GLIOMA.mat   1462087    (50, 4434)      50    (50,)        4  0.846154  0.669231  0.538462  0.154852
 9           Yale.mat    161021   (165, 1024)      77   (165,)       15  0.857143  0.769048  0.595238  0.306511
25         arcene.mat   1900005  (200, 10000)      82   (200,)        2  0.900000  0.788000  0.680000  0.417719
20       lymphoma.mat    110185    (96, 4026)       3    (96,)        9  0.916667  0.829167  0.708333  0.169875
 2        RELATHE.mat    226918  (1427, 4322)       2  (1427,)        2  0.921569  0.898880  0.876751  1.218853
14          colon.mat     36319    (62, 2000)       3    (62,)        2  0.937500  0.768750  0.687500  0.135427
 7      warpAR10P.mat    279711   (130, 2400)      63   (130,)       10  0.939394  0.851515  0.757576  0.274956
 1          PCMAC.mat    191131  (1943, 3289)       4  (1943,)        2  0.944444  0.922634  0.899177  1.491283
 4            ORL.mat    376584   (400, 1024)     151   (400,)       40  0.950000  0.921000  0.830000  1.216780
15         GLI_85.mat   8743262   (85, 22283)      85    (85,)        2  0.954545  0.863636  0.772727  0.269521
27         Isolet.mat   3652673   (1560, 617)    1340  (1560,)       26  0.956410  0.938205  0.905128  2.222803
18           lung.mat   4762671   (203, 3312)     203   (203,)        5  0.960784  0.929412  0.882353  0.380843
22    Prostate_GE.mat   1524983   (102, 5966)      29   (102,)        2  0.961538  0.900000  0.807692  0.207986
10           USPS.mat  15138167   (9298, 256)    1617  (9298,)       10  0.965161  0.960258  0.955699  9.295629
26        gisette.mat  10619742  (7000, 5000)     345  (7000,)        2  0.974286  0.968971  0.961714  9.597926
12       Carcinom.mat   6917199   (174, 9182)     156   (174,)       11  0.977273  0.868182  0.772727  0.557979
 0       BASEHOCK.mat    279059  (1993, 4862)       2  (1993,)        2  0.985972  0.974349  0.965932  1.789281
 3         COIL20.mat   3024549  (1440, 1024)      10  (1440,)       20  1.000000  0.998889  0.994444  1.873450
11         ALLAML.mat   3639219    (72, 7129)      66    (72,)        2  1.000000  0.938889  0.833333  0.183536
 6      pixraw10P.mat    520463  (100, 10000)      11   (100,)       10  1.000000  0.972000  0.920000  0.338596
17       leukemia.mat    154743    (72, 7070)       3    (72,)        2  1.000000  0.950000  0.777778  0.155346
 8     warpPIE10P.mat    458267   (210, 2420)      36   (210,)       10  1.000000  0.962264  0.924528  0.410544
 5     orlraws10P.mat    951783  (100, 10304)      46   (100,)       10  1.000000  0.988000  0.960000  0.415471
19  lung_discrete.mat      7516     (73, 325)       3    (73,)        7  1.000000  0.800000  0.526316  0.131734

I figured that solving a problem that is too easy would be boring, so I sorted the datasets in ascending order of RF_max, which puts the hardest ones first.
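To narrow the table down to the harder problems, one option is to filter the `result` dictionary on RF_mean, which is less sensitive to a single lucky split than RF_max. This is a rough sketch only: the function name `select_hard` and the 0.8 cutoff are assumptions for illustration, and the sample values are four rows copied from the table above.

```python
def select_hard(result, max_mean=0.8):
    """Return dataset names with mean RF accuracy below max_mean, hardest first."""
    rows = zip(result["dataset"], result["RF_mean"])
    hard = [(mean, name) for name, mean in rows if mean < max_mean]
    # Sorting the (mean, name) pairs orders them by ascending mean accuracy.
    return [name for mean, name in sorted(hard)]


# Four rows from the result table, in the same dict-of-lists layout as `result`.
sample = {
    "dataset": ["nci9.mat", "madelon.mat", "USPS.mat", "COIL20.mat"],
    "RF_mean": [0.433333, 0.707385, 0.960258, 0.998889],
}
print(select_hard(sample))  # ['nci9.mat', 'madelon.mat']
```

The cutoff is arbitrary and worth adjusting to how hard a benchmark you are looking for.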

I hope this helps you choose a dataset.
