[PYTHON] Feature Selection Datasets

Feature Selection Datasets is a collection of datasets that appears to have been assembled for machine learning research and for benchmarking methods.

http://featureselection.asu.edu/datasets.php

Because the collection is large, I ran a light analysis to list its contents and make it easier to find a suitable dataset.

Besides downloading each file and inspecting its structure, I trained scikit-learn's RandomForestClassifier on each dataset to gauge how hard the classification problem is.
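As a minimal sketch of that difficulty measurement on its own, the function below repeats a random train/test split several times and reports the maximum, mean, and minimum test accuracy. The name `rf_difficulty`, the synthetic data from `make_classification`, and the fixed `random_state` seeds are assumptions added for illustration and reproducibility; they are not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def rf_difficulty(X, y, n_repeats=10, seed=0):
    """Return (max, mean, min) test accuracy over repeated random splits."""
    scores = []
    for i in range(n_repeats):
        # A different seed per repeat gives a different split and forest
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, random_state=seed + i)
        model = RandomForestClassifier(random_state=seed + i)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return max(scores), sum(scores) / len(scores), min(scores)


# Synthetic stand-in for a downloaded dataset (assumption, not real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, random_state=0)
rf_max, rf_mean, rf_min = rf_difficulty(X_demo, y_demo, n_repeats=5)
print(rf_max, rf_mean, rf_min)
```

A wide gap between the maximum and the minimum suggests that the score depends heavily on which samples land in the test set, which is common for the very small datasets in this collection.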

Code

import os
import timeit
from scipy import io
import pandas as pd
import urllib.request
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Here is the list of datasets to download. I corrected two URLs that were wrong.

dataset_url = [
        "http://featureselection.asu.edu/files/datasets/BASEHOCK.mat",
        "http://featureselection.asu.edu/files/datasets/PCMAC.mat",
        "http://featureselection.asu.edu/files/datasets/RELATHE.mat",
        "http://featureselection.asu.edu/files/datasets/COIL20.mat",
        "http://featureselection.asu.edu/files/datasets/ORL.mat",
        "http://featureselection.asu.edu/files/datasets/orlraws10P.mat",
        "http://featureselection.asu.edu/files/datasets/pixraw10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpAR10P.mat",
        "http://featureselection.asu.edu/files/datasets/warpPIE10P.mat",
        "http://featureselection.asu.edu/files/datasets/Yale.mat",
        "http://featureselection.asu.edu/files/datasets/USPS.mat",
        "http://featureselection.asu.edu/files/datasets/ALLAML.mat",
        "http://featureselection.asu.edu/files/datasets/Carcinom.mat",
        "http://featureselection.asu.edu/files/datasets/CLL_SUB_111.mat",
        "http://featureselection.asu.edu/files/datasets/colon.mat",
        "http://featureselection.asu.edu/files/datasets/GLI_85.mat",
        "http://featureselection.asu.edu/files/datasets/GLIOMA.mat",
        "http://featureselection.asu.edu/files/datasets/leukemia.mat",
        "http://featureselection.asu.edu/files/datasets/lung.mat",
        "http://featureselection.asu.edu/files/datasets/lung_discrete.mat",
        "http://featureselection.asu.edu/files/datasets/lymphoma.mat",
        "http://featureselection.asu.edu/files/datasets/nci9.mat",
        "http://featureselection.asu.edu/files/datasets/Prostate_GE.mat",
        "http://featureselection.asu.edu/files/datasets/SMK_CAN_187.mat",
        "http://featureselection.asu.edu/files/datasets/TOX_171.mat",
        "http://featureselection.asu.edu/files/datasets/arcene.mat",
        "http://featureselection.asu.edu/files/datasets/gisette.mat",
        "http://featureselection.asu.edu/files/datasets/Isolet.mat",
        "http://featureselection.asu.edu/files/datasets/madelon.mat"
]
result = {
    'dataset':[],
    'byte':[],
    'X.shape':[],
    'X_type':[],
    'y.shape':[],
    'n_class':[],
    'RF_max':[],
    'RF_mean':[],
    'RF_min':[],
    'sec':[],
    }

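for url in dataset_url:
    # Use the file name at the end of the URL as the dataset label
    result['dataset'].append(url.split("/")[-1])

    # Download to a scratch file that is overwritten on every iteration
    filename = 'dataset.mat'
    urllib.request.urlretrieve(url, filename)
    result['byte'].append(os.path.getsize(filename))

    # Read the feature matrix from 'X' and the labels from 'Y'
    matdata = io.loadmat(filename, squeeze_me=True)
    X = matdata['X']
    y = matdata['Y'].flatten()
    result['X.shape'].append(X.shape)
    # Note: this counts the distinct values in the first feature column only
    result['X_type'].append(pd.DataFrame(X).nunique()[0])

    result['y.shape'].append(y.shape)
    # Number of distinct labels, i.e. the number of classes
    result['n_class'].append(pd.DataFrame(y).nunique()[0])

    # Repeat a random train/test split 10 times to see how much accuracy varies
    scores = []
    times = []
    for _ in range(10):
        X_train, X_test, y_train, y_test = train_test_split(X, y)
        model = RandomForestClassifier()
        # Time a single fit, then record the accuracy on the held-out split
        times.append(timeit.timeit(lambda: model.fit(X_train, y_train), number=1))
        scores.append(model.score(X_test, y_test))

    result['RF_max'].append(max(scores))
    result['RF_mean'].append(sum(scores) / len(scores))
    result['RF_min'].append(min(scores))

    # Mean fit time in seconds
    result['sec'].append(sum(times) / len(times))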

result

pd.DataFrame(result).sort_values("RF_max")
    dataset                byte  X.shape       X_type  y.shape  n_class    RF_max   RF_mean    RF_min       sec
21  nci9.mat             169288  (60, 9712)         3  (60,)          9  0.666667  0.433333  0.266667  0.183649
23  SMK_CAN_187.mat    11861244  (187, 19993)     171  (187,)         2  0.723404  0.655319  0.574468  0.670948
28  madelon.mat         1496573  (2600, 500)       40  (2600,)        2  0.733846  0.707385  0.680000  2.456003
13  CLL_SUB_111.mat     5875157  (111, 11340)     111  (111,)         3  0.750000  0.657143  0.464286  0.326307
24  TOX_171.mat         3470586  (171, 5748)      169  (171,)         4  0.813953  0.772093  0.697674  0.405085
16  GLIOMA.mat          1462087  (50, 4434)        50  (50,)          4  0.846154  0.669231  0.538462  0.154852
 9  Yale.mat             161021  (165, 1024)       77  (165,)        15  0.857143  0.769048  0.595238  0.306511
25  arcene.mat          1900005  (200, 10000)      82  (200,)         2  0.900000  0.788000  0.680000  0.417719
20  lymphoma.mat         110185  (96, 4026)         3  (96,)          9  0.916667  0.829167  0.708333  0.169875
 2  RELATHE.mat          226918  (1427, 4322)       2  (1427,)        2  0.921569  0.898880  0.876751  1.218853
14  colon.mat             36319  (62, 2000)         3  (62,)          2  0.937500  0.768750  0.687500  0.135427
 7  warpAR10P.mat        279711  (130, 2400)       63  (130,)        10  0.939394  0.851515  0.757576  0.274956
 1  PCMAC.mat            191131  (1943, 3289)       4  (1943,)        2  0.944444  0.922634  0.899177  1.491283
 4  ORL.mat              376584  (400, 1024)      151  (400,)        40  0.950000  0.921000  0.830000  1.216780
15  GLI_85.mat          8743262  (85, 22283)       85  (85,)          2  0.954545  0.863636  0.772727  0.269521
27  Isolet.mat          3652673  (1560, 617)     1340  (1560,)       26  0.956410  0.938205  0.905128  2.222803
18  lung.mat            4762671  (203, 3312)      203  (203,)         5  0.960784  0.929412  0.882353  0.380843
22  Prostate_GE.mat     1524983  (102, 5966)       29  (102,)         2  0.961538  0.900000  0.807692  0.207986
10  USPS.mat           15138167  (9298, 256)     1617  (9298,)       10  0.965161  0.960258  0.955699  9.295629
26  gisette.mat        10619742  (7000, 5000)     345  (7000,)        2  0.974286  0.968971  0.961714  9.597926
12  Carcinom.mat        6917199  (174, 9182)      156  (174,)        11  0.977273  0.868182  0.772727  0.557979
 0  BASEHOCK.mat         279059  (1993, 4862)       2  (1993,)        2  0.985972  0.974349  0.965932  1.789281
 3  COIL20.mat          3024549  (1440, 1024)      10  (1440,)       20  1.000000  0.998889  0.994444  1.873450
11  ALLAML.mat          3639219  (72, 7129)        66  (72,)          2  1.000000  0.938889  0.833333  0.183536
 6  pixraw10P.mat        520463  (100, 10000)      11  (100,)        10  1.000000  0.972000  0.920000  0.338596
17  leukemia.mat         154743  (72, 7070)         3  (72,)          2  1.000000  0.950000  0.777778  0.155346
 8  warpPIE10P.mat       458267  (210, 2420)       36  (210,)        10  1.000000  0.962264  0.924528  0.410544
 5  orlraws10P.mat       951783  (100, 10304)      46  (100,)        10  1.000000  0.988000  0.960000  0.415471
19  lung_discrete.mat      7516  (73, 325)          3  (73,)          7  1.000000  0.800000  0.526316  0.131734
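
One caveat when reading the X_type column: the script computes it as `pd.DataFrame(X).nunique()[0]`, which is the number of distinct values in the first feature column only, not in the whole matrix. If a count over every entry is wanted, a sketch along the following lines may help. The helper name `count_unique_values` and the small demo matrix are hypothetical, and the sparse branch is only a precaution in case a file loads as a SciPy sparse matrix.

```python
import numpy as np
from scipy import sparse


def count_unique_values(X):
    """Count distinct values over all entries of X, not just one column."""
    if sparse.issparse(X):
        # Densify so that implicit zeros are counted as a value too
        X = X.toarray()
    return int(np.unique(np.asarray(X)).size)


# Tiny made-up matrix: the first column is constant, the others are not
X_demo = np.array([[0, 1, 2],
                   [0, 3, 2],
                   [0, 1, 4]])

first_column_only = int(np.unique(X_demo[:, 0]).size)
whole_matrix = count_unique_values(X_demo)
print(first_column_only, whole_matrix)
```

For the demo matrix the first column holds a single value while the whole matrix holds five, which shows how far the two counts can diverge.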

I thought it would be boring to solve a problem that was too easy, so I sorted the table in ascending order of RF_max, which puts the hardest datasets at the top.
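
To shortlist datasets by difficulty, one option is to filter the summary on a score threshold. The sketch below uses four rows copied from the table above; the variable names and the 0.9 cutoff on RF_mean are arbitrary assumptions for illustration, not a recommendation from the original analysis.

```python
import pandas as pd

# Four rows copied from the results table
summary = pd.DataFrame({
    "dataset": ["nci9.mat", "madelon.mat", "USPS.mat", "COIL20.mat"],
    "RF_max": [0.666667, 0.733846, 0.965161, 1.000000],
    "RF_mean": [0.433333, 0.707385, 0.960258, 0.998889],
})

# Hypothetical cutoff: keep datasets whose mean accuracy stays below 0.9
threshold = 0.9
hard = summary[summary["RF_mean"] < threshold].sort_values("RF_mean")
print(hard["dataset"].tolist())
```

Filtering on RF_mean rather than RF_max is a design choice: with only 10 random splits, the maximum can reach 1.0 by chance on small datasets such as lung_discrete.mat, whose mean is only 0.8.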

I hope it will be helpful when choosing a dataset.
