[PYTHON] train_test_split for features held in a dict

What I want to do

I have features held in a dict, as shown below.

{'field1': array([0, 1, 2, 3, 4, 5]), 'field2': array([5, 4, 3, 2, 1, 0]), 'label': array([1, 0, 1, 0, 0, 0])}

I want to split them into train and test sets while keeping the dict structure, as shown below.

{'field1': array([4, 0, 3]), 'field2': array([1, 5, 2]), 'label': array([0, 1, 0])}
{'field1': array([5, 1, 2]), 'field2': array([0, 4, 3]), 'label': array([0, 0, 1])}

Method

Implementing options such as stratified sampling (extraction that respects the class distribution) by hand is quite difficult, so I wrote a wrapper around scikit-learn's model_selection.train_test_split. The approach: create an array of row indices, split it together with the label array using train_test_split, and then index each array in the dict with the resulting train and test indices.

import numpy as np
from sklearn.model_selection import train_test_split


class Splitter:
    def __init__(self, train_size, label_col: str):
        self.train_size = train_size
        self.label_col = label_col
        self.train_indices = None
        self.test_indices = None

    def set_split_indices(self, field_to_values):
        # Split the row indices rather than the data, stratified by label
        labels = field_to_values[self.label_col]
        split_indices = np.arange(len(labels))
        self.train_indices, self.test_indices, _, _ = train_test_split(
            split_indices, labels, train_size=self.train_size, stratify=labels)

    def split(self, field_to_values):
        # Apply the stored indices to every array in the dict
        train = {field: values[self.train_indices] for field, values in field_to_values.items()}
        test = {field: values[self.test_indices] for field, values in field_to_values.items()}
        return train, test

Execution result

>>> field_to_values = {"field1": np.array([0, 1, 2, 3, 4, 5]), "field2": np.array([5, 4, 3, 2, 1, 0]), "label": np.array([1, 0, 1, 0, 0, 0])}
>>> splitter = Splitter(train_size=0.5, label_col="label")
>>> splitter.set_split_indices(field_to_values)
>>> splitter.split(field_to_values)
({'field1': array([4, 0, 3]), 'field2': array([1, 5, 2]), 'label': array([0, 1, 0])}, {'field1': array([5, 1, 2]), 'field2': array([0, 4, 3]), 'label': array([0, 0, 1])})
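The result above changes from run to run, because no random_state is passed to train_test_split. The following is a minimal sketch of the underlying index split with a fixed seed (the seed value 0 is an arbitrary choice for illustration, not part of the original code). It also checks that stratification gives each half one of the two rows with label 1.

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.array([1, 0, 1, 0, 0, 0])
indices = np.arange(len(labels))

# random_state fixes the shuffle, so the same split comes back on every run
train_idx, test_idx = train_test_split(
    indices, train_size=0.5, stratify=labels, random_state=0)

# With stratify, each half receives one of the two rows labelled 1
n_pos_train = int(labels[train_idx].sum())
n_pos_test = int(labels[test_idx].sum())
print(n_pos_train, n_pos_test)  # 1 1
```

If reproducibility matters, the same keyword could be added to the train_test_split call inside the class.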

This time I wanted to split multiple dictionaries with the same indices, so I wrote it as a class. Written as a function, it looks like this:

import numpy as np
from sklearn.model_selection import train_test_split


def train_test_split_dict(field_to_values, train_size, label_col: str):
    # Split the row indices, stratified by label, then index every array
    labels = field_to_values[label_col]
    split_indices = np.arange(len(labels))
    train_indices, test_indices, _, _ = train_test_split(
        split_indices, labels, train_size=train_size, stratify=labels)
    train = {field: values[train_indices] for field, values in field_to_values.items()}
    test = {field: values[test_indices] for field, values in field_to_values.items()}
    return train, test
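Note that the function form draws new indices on every call, so calling it separately on two dicts generally does not keep their rows aligned. Below is a rough sketch of one way to share a single split across several dicts without a class. The helper name split_dicts_together, the random_state parameter, and the row_id field are all hypothetical and introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_dicts_together(dicts, labels, train_size, random_state=None):
    """Split every dict in `dicts` with one shared set of indices (hypothetical helper)."""
    indices = np.arange(len(labels))
    train_idx, test_idx = train_test_split(
        indices, train_size=train_size, stratify=labels, random_state=random_state)
    train = [{k: v[train_idx] for k, v in d.items()} for d in dicts]
    test = [{k: v[test_idx] for k, v in d.items()} for d in dicts]
    return train, test


features = {"field1": np.array([0, 1, 2, 3, 4, 5]), "label": np.array([1, 0, 1, 0, 0, 0])}
extras = {"row_id": np.array([10, 11, 12, 13, 14, 15])}  # made-up second dict

(train_features, train_extras), (test_features, test_extras) = split_dicts_together(
    [features, extras], labels=features["label"], train_size=0.5, random_state=0)

# Rows stay aligned: row_id equals field1 + 10 in both halves
aligned = (np.array_equal(train_extras["row_id"], train_features["field1"] + 10)
           and np.array_equal(test_extras["row_id"], test_features["field1"] + 10))
print(aligned)  # True
```

The class version achieves the same thing by storing the indices once in set_split_indices and reusing them in each split call.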
