Feature Selection Datasets
Feature Selection Datasets is a collection of datasets that appears to have been assembled for studying and benchmarking machine learning methods.
http://featureselection.asu.edu/datasets.php
There are so many datasets that I wanted to list their contents and find the right one for my purposes, so I ran a quick analysis.
Besides downloading the data and looking at its structure, I also used scikit-learn's RandomForestClassifier to gauge how difficult each classification problem is.
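As a rough illustration of what "looking at the data structure" means here, the sketch below summarizes a single .mat file. It assumes the file stores features under the key 'X' and labels under the key 'Y', as the loading code in this post does. The helper name summarize_mat and the toy file are hypothetical and only stand in for a real download.

```python
import os
import tempfile

import numpy as np
from scipy import io


def summarize_mat(path):
    """Summarize a .mat file that keeps features in 'X' and labels in 'Y'."""
    matdata = io.loadmat(path, squeeze_me=True)
    X = matdata["X"]
    y = np.asarray(matdata["Y"]).ravel()
    return {
        # loadmat also returns metadata keys such as __header__; skip those
        "keys": sorted(k for k in matdata if not k.startswith("__")),
        "X.shape": X.shape,
        "X.dtype": str(X.dtype),
        "n_class": len(np.unique(y)),
    }


# Toy stand-in for a downloaded file (made-up data, not from the repository)
rng = np.random.default_rng(0)
toy = {"X": rng.normal(size=(6, 4)), "Y": np.array([[1], [1], [2], [2], [3], [3]])}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mat")
    io.savemat(path, toy)
    summary = summarize_mat(path)
print(summary)
```

Running the same helper on a real download would show at a glance whether a dataset is wide (few samples, many features) and how many classes it has.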
import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
Here is the list of datasets I fetched. I fixed a few incorrect URLs.
dataset_url = [
"http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
"http://featureselection.asu.edu/files/datasets/PCMAC.mat",
"http://featureselection.asu.edu/files/datasets/RELATHE.mat",
"http://featureselection.asu.edu/files/datasets/COIL20.mat",
"http://featureselection.asu.edu/files/datasets/ORL.mat",
"http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
"http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
"http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
"http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
"http://featureselection.asu.edu/files/datasets/Yale.mat",
"http://featureselection.asu.edu/files/datasets/USPS.mat",
"http://featureselection.asu.edu/files/datasets/ALLAML.mat",
"http://featureselection.asu.edu/files/datasets/Carcinom.mat",
"http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
"http://featureselection.asu.edu/files/datasets/colon.mat",
"http://featureselection.asu.edu/files/datasets/GLI_85.mat",
"http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
"http://featureselection.asu.edu/files/datasets/leukemia.mat",
"http://featureselection.asu.edu/files/datasets/lung.mat",
"http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
"http://featureselection.asu.edu/files/datasets/lymphoma.mat",
"http://featureselection.asu.edu/files/datasets/nci9.mat",
"http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
"http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
"http://featureselection.asu.edu/files/datasets/TOX_171.mat",
"http://featureselection.asu.edu/files/datasets/arcene.mat",
"http://featureselection.asu.edu/files/datasets/gisette.mat",
"http://featureselection.asu.edu/files/datasets/Isolet.mat",
"http://featureselection.asu.edu/files/datasets/madelon.mat"
]
result = {
'dataset':[],
'byte':[],
'X.shape':[],
'X_type':[],
'y.shape':[],
'n_class':[],
'RF_max':[],
'RF_mean':[],
'RF_min':[],
'sec':[],
}
for url in dataset_url:
    result['dataset'].append(url.split("/")[-1])
    # Download each file to the same local name, overwriting the previous one
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # X_type: number of distinct values in the first feature column
    result['X_type'].append(pd.DataFrame(X).nunique()[0])
    result['y.shape'].append(y.shape)
    result['n_class'].append(pd.DataFrame(y).nunique()[0])
    scores = []
    times = []
    # Ten random train/test splits per dataset
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        # Time a single fit of the model
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))
    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))
    result['sec'].append(sum(times) / len(times))
# Sort ascending by the best accuracy, so the hardest datasets come first
pd.DataFrame(result).sort_values("RF_max")
index | dataset | byte | X.shape | X_type | y.shape | n_class | RF_max | RF_mean | RF_min | sec |
---|---|---|---|---|---|---|---|---|---|---|
21 | nci9.mat | 169288 | (60, 9712) | 3 | (60,) | 9 | 0.666667 | 0.433333 | 0.266667 | 0.183649 |
23 | SMK_CAN_187.mat | 11861244 | (187, 19993) | 171 | (187,) | 2 | 0.723404 | 0.655319 | 0.574468 | 0.670948 |
28 | madelon.mat | 1496573 | (2600, 500) | 40 | (2600,) | 2 | 0.733846 | 0.707385 | 0.680000 | 2.456003 |
13 | CLL_SUB_111.mat | 5875157 | (111, 11340) | 111 | (111,) | 3 | 0.750000 | 0.657143 | 0.464286 | 0.326307 |
24 | TOX_171.mat | 3470586 | (171, 5748) | 169 | (171,) | 4 | 0.813953 | 0.772093 | 0.697674 | 0.405085 |
16 | GLIOMA.mat | 1462087 | (50, 4434) | 50 | (50,) | 4 | 0.846154 | 0.669231 | 0.538462 | 0.154852 |
9 | Yale.mat | 161021 | (165, 1024) | 77 | (165,) | 15 | 0.857143 | 0.769048 | 0.595238 | 0.306511 |
25 | arcene.mat | 1900005 | (200, 10000) | 82 | (200,) | 2 | 0.900000 | 0.788000 | 0.680000 | 0.417719 |
20 | lymphoma.mat | 110185 | (96, 4026) | 3 | (96,) | 9 | 0.916667 | 0.829167 | 0.708333 | 0.169875 |
2 | RELATHE.mat | 226918 | (1427, 4322) | 2 | (1427,) | 2 | 0.921569 | 0.898880 | 0.876751 | 1.218853 |
14 | colon.mat | 36319 | (62, 2000) | 3 | (62,) | 2 | 0.937500 | 0.768750 | 0.687500 | 0.135427 |
7 | warpAR10P.mat | 279711 | (130, 2400) | 63 | (130,) | 10 | 0.939394 | 0.851515 | 0.757576 | 0.274956 |
1 | PCMAC.mat | 191131 | (1943, 3289) | 4 | (1943,) | 2 | 0.944444 | 0.922634 | 0.899177 | 1.491283 |
4 | ORL.mat | 376584 | (400, 1024) | 151 | (400,) | 40 | 0.950000 | 0.921000 | 0.830000 | 1.216780 |
15 | GLI_85.mat | 8743262 | (85, 22283) | 85 | (85,) | 2 | 0.954545 | 0.863636 | 0.772727 | 0.269521 |
27 | Isolet.mat | 3652673 | (1560, 617) | 1340 | (1560,) | 26 | 0.956410 | 0.938205 | 0.905128 | 2.222803 |
18 | lung.mat | 4762671 | (203, 3312) | 203 | (203,) | 5 | 0.960784 | 0.929412 | 0.882353 | 0.380843 |
22 | Prostate_GE.mat | 1524983 | (102, 5966) | 29 | (102,) | 2 | 0.961538 | 0.900000 | 0.807692 | 0.207986 |
10 | USPS.mat | 15138167 | (9298, 256) | 1617 | (9298,) | 10 | 0.965161 | 0.960258 | 0.955699 | 9.295629 |
26 | gisette.mat | 10619742 | (7000, 5000) | 345 | (7000,) | 2 | 0.974286 | 0.968971 | 0.961714 | 9.597926 |
12 | Carcinom.mat | 6917199 | (174, 9182) | 156 | (174,) | 11 | 0.977273 | 0.868182 | 0.772727 | 0.557979 |
0 | BASEHOCK.mat | 279059 | (1993, 4862) | 2 | (1993,) | 2 | 0.985972 | 0.974349 | 0.965932 | 1.789281 |
3 | COIL20.mat | 3024549 | (1440, 1024) | 10 | (1440,) | 20 | 1.000000 | 0.998889 | 0.994444 | 1.873450 |
11 | ALLAML.mat | 3639219 | (72, 7129) | 66 | (72,) | 2 | 1.000000 | 0.938889 | 0.833333 | 0.183536 |
6 | pixraw10P.mat | 520463 | (100, 10000) | 11 | (100,) | 10 | 1.000000 | 0.972000 | 0.920000 | 0.338596 |
17 | leukemia.mat | 154743 | (72, 7070) | 3 | (72,) | 2 | 1.000000 | 0.950000 | 0.777778 | 0.155346 |
8 | warpPIE10P.mat | 458267 | (210, 2420) | 36 | (210,) | 10 | 1.000000 | 0.962264 | 0.924528 | 0.410544 |
5 | orlraws10P.mat | 951783 | (100, 10304) | 46 | (100,) | 10 | 1.000000 | 0.988000 | 0.960000 | 0.415471 |
19 | lung_discrete.mat | 7516 | (73, 325) | 3 | (73,) | 7 | 1.000000 | 0.800000 | 0.526316 | 0.131734 |
I figured it would be boring to solve a problem that is too easy, so I sorted the datasets in ascending order of RF_max, with the hardest ones first.
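Sorting is one way to read the table; filtering is another. The sketch below copies a few rows from the result table above and keeps only datasets that are not too easy and whose scores do not swing too much between splits. The helper pick_datasets and both thresholds are hypothetical choices for illustration, not part of the original analysis.

```python
import pandas as pd

# A few rows copied from the result table above
table = pd.DataFrame(
    {
        "dataset": ["nci9.mat", "madelon.mat", "USPS.mat", "COIL20.mat", "lung_discrete.mat"],
        "n_samples": [60, 2600, 9298, 1440, 73],
        "RF_max": [0.666667, 0.733846, 0.965161, 1.000000, 1.000000],
        "RF_mean": [0.433333, 0.707385, 0.960258, 0.998889, 0.800000],
        "RF_min": [0.266667, 0.680000, 0.955699, 0.994444, 0.526316],
    }
)


def pick_datasets(df, max_mean=0.9, max_spread=0.2):
    """Keep datasets that are hard enough and reasonably stable across splits."""
    # Spread between best and worst split, a rough indicator of noise
    spread = df["RF_max"] - df["RF_min"]
    mask = (df["RF_mean"] <= max_mean) & (spread <= max_spread)
    return df.loc[mask].sort_values("RF_mean")


picked = pick_datasets(table)
print(picked[["dataset", "n_samples", "RF_mean"]])
```

With these example thresholds, only madelon.mat survives from the five rows: nci9.mat and lung_discrete.mat vary too much between splits, while USPS.mat and COIL20.mat are too easy.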
I hope this helps you choose a dataset.
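One caveat when choosing: several datasets have fewer than 100 samples, and train_test_split by default holds out 25% of the rows without stratifying, so scores vary a lot between splits (lung_discrete.mat ranges from about 0.53 to 1.00 in the table). The sketch below shows one possible alternative, repeated stratified cross-validation, on synthetic data. The synthetic dataset, the fold counts, and n_estimators=50 are assumptions made for illustration. Stratified folds also require every class to have at least n_splits samples, which may not hold for the smallest datasets.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for a small, wide dataset (not one of the real files)
X, y = make_classification(
    n_samples=120, n_features=50, n_informative=5, n_classes=3, random_state=0
)

# 5 stratified folds repeated twice gives 10 scores, matching the 10 splits used above
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print("min ", scores.min())
print("mean", scores.mean())
print("max ", scores.max())
```

Each fold keeps the class proportions of the full dataset, so no class is accidentally missing from the training or test part.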