[Python] What is Pipeline ...

Hello.

I recently became interested in machine learning and deep learning, so I entered a Kaggle competition. Kaggle has a Notebook feature, so I eagerly set out to understand the code that other people had shared!

"I don't know what this means at all"

I had no programming knowledge at all, so the code in Kaggle notebooks looked like a cipher to me (laughs). I decided to work through it slowly, one piece at a time, and to write down what I learn here as a kind of diary.

This time, the topic is "Pipeline".

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

iris_data = datasets.load_iris()
input_data = iris_data.data   # explanatory variables (150 samples x 4 features)
correct = iris_data.target    # objective variable (class labels 0, 1, 2)

To start with, I looked at the official documentation: sklearn.pipeline.Pipeline (scikit-learn 0.23.2 documentation).

According to this, the basic shape is

from sklearn.pipeline import Pipeline
pipe = Pipeline([(preprocessing method), (learning method)])
pipe.fit(explanatory variables, objective variable)

Written this way, the code apparently becomes simpler.

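Based on this, I tried training a random forest on the iris data.

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hold out part of the data for evaluation (25% by default)
X_train, X_test, y_train, y_test = train_test_split(input_data, correct)
# Standardize the features, then train a random forest
pipe = Pipeline([('scaler', StandardScaler()),
                 ('RandomForestClassifier', RFC())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)  # mean accuracy on the test split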

# 0.9473684210526315

Here the explanatory variables are standardized and then used to train a random forest. Bundling the steps into a Pipeline like this keeps the code "concise".
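To see what fit actually did inside the pipeline, the fitted steps can be looked up by name through named_steps. The following is only a minimal sketch: the step name 'clf', the variable names, and the random_state values are my own choices for illustration and are not part of the code above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)  # seed chosen only for reproducibility

pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', RandomForestClassifier(random_state=0))])
pipe.fit(X_train, y_train)

# Each fitted step is stored under the name given in the list of tuples.
scaler = pipe.named_steps['scaler']

# The scaler learned its mean from the training data only.
scaler_uses_train_mean = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(scaler_uses_train_mean)

# predict() sends X_test through the fitted scaler, then through the classifier.
pred = pipe.predict(X_test)
print(pred.shape)
```

The point to notice is that the scaler never sees the test data during fit; the pipeline only applies the already-learned transformation to it.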

Below is the code I used to check this.

from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(input_data, correct)
tr_x, te_x, tr_y, te_y = X_train.copy(), X_test.copy(), y_train.copy(), y_test.copy()  # copies for the manual check

# Version 1: scaling and classifier wrapped in a Pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('Classifier', RFC())])
pipe.fit(X_train, y_train)
print("pipe score = " + str(pipe.score(X_test, y_test)))

# Version 2: the same steps written out by hand
stdsc = StandardScaler()
tr_x = stdsc.fit(tr_x).transform(tr_x)
te_x = stdsc.transform(te_x)  # reuse the training statistics; do not refit on the test data

clf = RFC()
clf.fit(tr_x, tr_y)
print("RFC score = ", clf.score(te_x, te_y))

# pipe score = 0.9473684210526315
# RFC score =  0.9473684210526315

The two scores matched, so Pipeline's preprocessing appears to work as expected. (Nothing is seeded here, so the exact scores vary from run to run.)
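Matching scores are fairly weak evidence, since two different models can reach the same accuracy. A stricter check, sketched below, fixes random_state (the value 0 is an arbitrary assumption of mine) and compares the predictions themselves rather than the scores.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Pipeline version
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', RandomForestClassifier(random_state=0))])
pipe.fit(X_train, y_train)
pipe_pred = pipe.predict(X_test)

# Manual version: fit the scaler on the training data, reuse it for the test data
scaler = StandardScaler().fit(X_train)
clf = RandomForestClassifier(random_state=0)
clf.fit(scaler.transform(X_train), y_train)
manual_pred = clf.predict(scaler.transform(X_test))

# With identical seeds and identical inputs, both versions should agree exactly.
same_predictions = np.array_equal(pipe_pred, manual_pred)
print(same_predictions)
```

If the manual version refitted the scaler on the test data instead, the two versions would no longer be doing the same thing, even if the scores happened to agree.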

I see, so now I have a rough idea of what Pipeline does. But what if there are several preprocessing steps? Can only one of them be used?

Apparently, multiple steps can be combined.

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Preprocessing for categorical (string) columns
preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # fill missing values with the string 'missing'
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])                    # one-hot encoding; unseen categories are ignored

rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RFC())])

# X_train is assumed to hold categorical columns here;
# a string fill_value is rejected for numeric data such as iris
rf.fit(X_train, y_train)

In this way, in the basic form pipe = Pipeline([(preprocessing method), (learning method)]), the (preprocessing method) slot can itself be a Pipeline. Nesting Pipelines like this reminds me of the recursive definitions in BNF notation (this is only an analogy).
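To try the nested form end to end, here is a small sketch on made-up categorical data. X_toy and y_toy are hypothetical values I invented for illustration, and random_state=0 is likewise my own choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns, one missing value (np.nan).
X_toy = np.array([['red', 'small'],
                  ['blue', 'large'],
                  [np.nan, 'small'],
                  ['red', 'large']], dtype=object)
y_toy = np.array([0, 1, 0, 1])

# Inner pipeline: fill missing values, then one-hot encode.
preprocessing = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Outer pipeline: the inner pipeline is used as a single preprocessing step.
model = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=0))])
model.fit(X_toy, y_toy)

# Nested steps can be reached by chaining named_steps lookups.
onehot = model.named_steps['preprocess'].named_steps['onehot']
n_encoded_columns = sum(len(c) for c in onehot.categories_)
print(n_encoded_columns)

pred = model.predict(X_toy)
print(pred.shape)
```

The filled-in value 'missing' becomes a category of its own in the first column, so the encoder produces one extra column for it.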
