While reading the 1st-place code from the Mercari competition I ran into make_pipeline, but I didn't really understand FunctionTransformer, so I looked into both.
make_pipeline → Combines code such as [preprocessing + training + prediction] into a single estimator, which keeps the code shorter.
FunctionTransformer → Converts an arbitrary function into a transformer. This is needed because every step passed to Pipeline must be a transformer: at minimum it must implement fit and transform.
In the example below, PCA() runs first and then SVC, so preprocessing and classification are executed as one sequence of operations.
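As a minimal sketch of what FunctionTransformer does, the snippet below wraps numpy's log1p (an arbitrary choice here, just to have a stateless function) so it gains fit/transform and can be placed in a Pipeline:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap an ordinary function so it exposes fit/transform
log_transformer = FunctionTransformer(np.log1p, validate=True)

X = np.array([[0.0, 1.0], [2.0, 3.0]])
log_transformer.fit(X)                  # fit is a no-op for a stateless function
print(log_transformer.transform(X))     # element-wise log(1 + x)
```

Because the wrapped function holds no learned state, fit does nothing; transform simply applies the function to its input.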
Reference site for the example below
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn import datasets

# Prepare the sample data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Create the pipeline
estimators = [('reduce_dim', PCA()), ('clf', SVC())]
pipe = Pipeline(steps=estimators)

# Train
pipe.fit(X, y)

# Predict
pipe.predict(X)
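The same pipeline can be written with make_pipeline, which generates the step names automatically from the class names instead of requiring explicit ('reduce_dim', ...) tuples. A short sketch:

```python
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(PCA(), SVC())
pipe.fit(X, y)
print(pipe.steps[0][0])  # 'pca'
print(pipe.steps[1][0])  # 'svc'
```

This is why the Mercari code can chain transformers without naming any of the steps.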
A partial excerpt from the Mercari competition 1st-place code:
from operator import itemgetter
from sklearn.pipeline import make_pipeline, make_union, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer as Tfidf

def on_field(f: str, *vec) -> Pipeline:
    # Select column(s) f, then apply the given vectorizers in order
    return make_pipeline(FunctionTransformer(itemgetter(f), validate=False), *vec)

vectorizer = make_union(
    on_field('name', Tfidf(max_features=100000, token_pattern=r'\w+')),
    on_field('text', Tfidf(max_features=100000, token_pattern=r'\w+', ngram_range=(1, 2))),
    on_field(['shipping', 'item_condition_id'],
             FunctionTransformer(to_records, validate=False),  # to_records is defined elsewhere in the original code
             DictVectorizer()),
    n_jobs=4)
Here, itemgetter and Tfidf are chained with make_pipeline. itemgetter is turned into a transformer with FunctionTransformer, which creates a custom converter. This makes it possible to extract the target column with itemgetter and then vectorize the important strings with Tfidf in a single sequence of steps.
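A minimal runnable sketch of the same pattern, using a toy dict in place of the Mercari DataFrame (the column names and texts below are only illustrative):

```python
from operator import itemgetter
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data standing in for the competition DataFrame
data = {'name': ['red shirt', 'blue shirt'],
        'text': ['soft red cotton', 'blue denim']}

# itemgetter('name')(data) == data['name'], so wrapping it in
# FunctionTransformer turns column selection into a pipeline step
pipe = make_pipeline(FunctionTransformer(itemgetter('name'), validate=False),
                     TfidfVectorizer())
X = pipe.fit_transform(data)
print(X.shape)  # (2 rows, vocabulary size)
```

validate=False is essential here: the input is a dict, not a numeric array, so scikit-learn's input checks must be skipped.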