[Python] I made my own AutoML

A world where anyone can easily create AI

With the release of Windows 95 in 1995 and the spread of PCs into ordinary homes, the Internet became a tool that anyone could easily use. I believe this is what is meant by "development of Internet infrastructure."

The same thing is about to happen with machine learning technology. Services like DataRobot and Azure Machine Learning are typical examples. In the past, data analysis with machine learning was the **exclusive domain** of professionals such as engineers and data scientists. With the advent of AutoML, however, a wave of "democratization of machine learning" has begun.

This time, my goal is to build my own (simple) AutoML.

What is ML?

Before talking about AutoML, let me start with the question: what is machine learning (ML)?

The English version of Wikipedia gives the following description.

Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.

In other words, machine learning **predicts** the future from past experience (data) without explicit human programming. See the figure below: you prepare raw data, and predict the "target" through the whole flow of "preprocessing" → "feature engineering" → "learning" → "model selection / scoring". **This entire flow is what we call machine learning.**

For example, if you want to predict tomorrow's weather, you could plausibly predict it from information such as yesterday's and today's weather, temperature, humidity, and wind direction. Here "tomorrow's weather" is the "target", and past information such as "yesterday's and today's observations" is the "raw data". (Since this is time-series data, there would actually be many other things to consider. A toy illustration follows the figure.)

(Figure: the machine learning flow from raw data, through preprocessing, feature engineering, and learning, to the predicted target)
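To make the "raw data vs. target" distinction concrete, here is a toy illustration (my own example, not from the post): past observations form the raw data, and tomorrow's weather is the target.

import pandas as pd

# Toy illustration: past observations are the raw data,
# and tomorrow's weather is the target the model should predict.
raw_data = pd.DataFrame({
    'weather':  ['sunny', 'rain'],
    'temp_c':   [24.0, 18.5],
    'humidity': [0.40, 0.85],
    'wind_dir': ['N', 'SW'],
}, index=['yesterday', 'today'])
target = 'tomorrow_weather'  # the label we want to predict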

Machine learning tasks fall into two broad categories: "classification problems" and "regression problems". This time, I will focus on the classification problem.

AutoML

So what is AutoML? It refers to machine learning that automates the "preprocessing" → "feature engineering" steps of the flow described above.

AutoML can be summarized in many ways, but the goal this time is to develop an AutoML that has the following functions and compares the accuracy of each model (a sketch of the intended interface follows the list). Properly, tasks such as parameter tuning should also be automated, but please forgive me this time > <

- Load data from a data path
- One-hot encoding
- Missing-value imputation with one of "mean", "median", or "mode"
- Feature selection
- Grid search
- Random search
- Confusion matrix
- ROC curve
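For orientation, here is a minimal sketch of the interface these features imply. The method names follow the usage example later in the post; the bodies and defaults are my assumptions, not the author's implementation.

import pandas as pd

class MyAML:
    def __init__(self, model_data, scoring_data, onehot_columns=None):
        # load data from the given data paths
        self.X_model = pd.read_csv(model_data)
        self.X_scoring = pd.read_csv(scoring_data)
        self.ohe_columns = onehot_columns  # None = one-hot encode all object/category columns

    def drop_cols(self, cols):
        # drop columns that should not be used as features
        self.X_model = self.X_model.drop(columns=cols)

    def preprocessing(self, target_col, index_col, feature_selection=False):
        # one-hot encoding, missing-value imputation, optional feature selection
        ...

    def holdout_method(self, pipelines, scoring='acc'):
        # fit every pipeline and compare train/test scores
        ...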

Full code

The complete code is available on GitHub.

Data preparation

This time, we will use the familiar Titanic dataset.

Directory structure

aml
|----data
|      |---train.csv
|      |---test.csv
|
|----model
|      |---(trained models are saved here)
|
|----myaml.ipynb
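Before anything else, you can take a quick peek at the training file (column names here assume the standard Kaggle Titanic dataset):

import pandas as pd

# quick look at the Titanic training data (standard Kaggle columns assumed)
train = pd.read_csv('data/train.csv')
print(train.shape)  # (891, 12) for the standard Kaggle training split
print(train[['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age']].head())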

Preprocessing, feature engineering, and learning

First, here is the usage code, which also serves as an example of the API.

model_data = 'data/train.csv'
scoring_data = 'data/test.csv'

aml = MyAML(model_data, scoring_data, onehot_columns=None)
aml.drop_cols(['Name', 'Ticket', 'Cabin'])  # do not use the Name, Ticket, and Cabin columns

# preprocessing and feature engineering (feature selection)
aml.preprocessing(target_col='Survived', index_col='PassengerId', feature_selection=False)

# training and display of the model comparison (holdout method)
aml.holdout_method(pipelines=pipelines_pca, scoring='auc')
              test     train
gb        0.754200  0.930761
knn       0.751615  0.851893
logistic  0.780693  0.779796
rf        0.710520  0.981014
rsvc      0.766994  0.837220
tree      0.688162  1.000000

Preprocessing

The preprocessing here refers to the following two operations. In the code below, only the important parts are excerpted.

- One-hot encoding
- Missing-value imputation with one of "mean", "median", or "mode"

One-hot encoding

    def _one_hot_encoding(self, X: pd.DataFrame) -> pd.DataFrame:
        ...

        # one-hot encoding
        if self.ohe_columns is None:  # one-hot encode only object/category columns
            X_ohe = pd.get_dummies(X,
                                   dummy_na=True,    # make NULL a dummy variable too
                                   drop_first=True)  # exclude the first category

        else:  # one-hot encode only the columns specified by self.ohe_columns
            X_ohe = pd.get_dummies(X,
                                   dummy_na=True,    # make NULL a dummy variable too
                                   drop_first=True,  # exclude the first category
                                   columns=self.ohe_columns)
        ...

When the MyAML class is initialized, the `onehot_columns` argument (stored in the instance variable `ohe_columns`) receives a list of column names to one-hot encode. If nothing is specified, all columns of type object or category in the received data frame are one-hot encoded.
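To see what dummy_na=True and drop_first=True actually do, here is a small standalone example (toy data, not from the post):

import pandas as pd

# toy example showing dummy_na and drop_first
df = pd.DataFrame({'Embarked': ['S', 'C', None, 'Q']})
ohe = pd.get_dummies(df, dummy_na=True, drop_first=True)
print(ohe.columns.tolist())
# ['Embarked_Q', 'Embarked_S', 'Embarked_nan']
# -> 'C' was dropped as the first category, and NaN got its own dummy column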

Missing-value imputation

    def _impute_null(self, impute_null_strategy):
        """
        Impute missing values with impute_null_strategy.
        Supported values of impute_null_strategy:
            mean          ... impute with the mean
            median        ... impute with the median
            most_frequent ... impute with the mode
        """
        self.imp = SimpleImputer(strategy=impute_null_strategy)
        self.X_model_columns = self.X_model.columns.values
        self.X_model = pd.DataFrame(self.imp.fit_transform(self.X_model),
                                    columns=self.X_model_columns)

Missing values are imputed with scikit-learn's SimpleImputer class. `impute_null_strategy` is the argument that indicates how to impute. The supported strategies are as follows.

- mean ... impute with the mean
- median ... impute with the median
- most_frequent ... impute with the mode
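A tiny self-contained demonstration of SimpleImputer (toy data, not from the post):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# fill the NaN in 'Age' with the median of the observed values
X = pd.DataFrame({'Age': [22.0, np.nan, 26.0, 35.0]})
imp = SimpleImputer(strategy='median')
print(imp.fit_transform(X).ravel())  # [22. 26. 26. 35.]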

Feature engineering

Feature engineering is a deep topic in itself, but this time I will keep things simple and consider only "feature selection with a **random forest**".

    def _feature_selection(self, estimator=RandomForestClassifier(n_estimators=100, random_state=0), cv=5):
        """
        Feature selection
        @param estimator: learner used to perform feature selection
        @param cv: number of cross-validation folds
        """
        self.selector = RFECV(estimator=estimator, step=.05, cv=cv)
        self.X_model = pd.DataFrame(self.selector.fit_transform(self.X_model, self.y_model),
                                    columns=self.X_model_columns[self.selector.support_])
        self.selected_columns = self.X_model_columns[self.selector.support_]

The first line initializes the RFECV class; here the estimator defaults to RandomForestClassifier. The next line fits the selector and keeps only the most important features. Finally, the **selected column names** are stored in the instance variable selected_columns.
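After fitting, the selector exposes which features survived. A small illustration of the attributes used above (this assumes a hypothetical `aml` object that has already run preprocessing with feature_selection=True):

# hypothetical fitted object; attribute names follow the snippet above
print(aml.selected_columns)    # names of the surviving features
print(aml.selector.support_)   # boolean mask over the original columns
print(aml.selector.ranking_)   # 1 = selected; larger = eliminated earlier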

Learning

The holdout method checks how well each model fits the data. It splits the data into training data (used to fit the model) and test data (held out for validation and never used for fitting). This way, the training data is always used for training and the test data is always used for evaluation.

Cross-validation is also implemented as another way to compare models against the data, but I will omit the detailed explanation (a rough sketch follows).
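For reference, here is a minimal sketch of what the cross-validation variant might look like (my own sketch, not the author's implementation), using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

# rough sketch: evaluate each pipeline with k-fold CV instead of a single split
def cross_validation_method(X, y, pipelines, cv=5, scoring='roc_auc'):
    scores = {}
    for pipe_name, pipeline in pipelines.items():
        cv_scores = cross_val_score(pipeline, X, y, cv=cv, scoring=scoring)
        scores[pipe_name] = cv_scores.mean()  # average score across the folds
    return scores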

    def holdout_method(self, pipelines=pipelines_pca, scoring='acc'):
        """
        Check the accuracy of each model with the holdout method
        @param pipelines: dictionary of model pipelines to try
        @param scoring: evaluation metric
            acc: accuracy
            auc: area under the ROC curve
        """
        X_train, X_test, y_train, y_test = train_test_split(self.X_model,
                                                            self.y_model,
                                                            test_size=.2,
                                                            random_state=1)
        y_train = np.reshape(y_train, (-1))
        y_test = np.reshape(y_test, (-1))

        scores = {}
        for pipe_name, pipeline in pipelines.items():
            pipeline.fit(X_train, y_train)
            joblib.dump(pipeline, './model/' + pipe_name + '.pkl')
            if scoring == 'acc':
                scoring_method = accuracy_score
            elif scoring == 'auc':
                scoring_method = roc_auc_score
            scores[(pipe_name, 'train')] = scoring_method(y_train, pipeline.predict(X_train))
            scores[(pipe_name, 'test')] = scoring_method(y_test, pipeline.predict(X_test))
        display(pd.Series(scores).unstack())

Here, the variable `pipelines` has the following format.

# imports needed for this snippet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# make pipelines for PCA
# format -- 'model name': Pipeline([('scl', scaler),
#                                   ('pca', principal component analysis),
#                                   ('est', model)])
pipelines_pca = {
    'knn': Pipeline([('scl', StandardScaler()),
                     ('pca', PCA(random_state=1)),
                     ('est', KNeighborsClassifier())]),

    'logistic': Pipeline([('scl', StandardScaler()),
                          ('pca', PCA(random_state=1)),
                          ('est', LogisticRegression(random_state=1))]),

     ...
}

Each of the three steps packed into the Pipeline plays the following role:

- 'scl': standardization
- 'pca': principal component analysis
- 'est': the model

Therefore, calling pipeline.fit(X_train, y_train) runs the whole flow of "standardization" → "principal component analysis" → "learning" in sequence.
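Since holdout_method saves each fitted pipeline with joblib, scoring later is just load-and-predict. A usage sketch (the file name assumes the 'logistic' pipeline saved above):

import joblib

best_model = joblib.load('./model/logistic.pkl')
predictions = best_model.predict(X_test)  # standardization -> PCA -> prediction in one call
# predict_proba is available for models that support it (e.g. logistic regression)
probabilities = best_model.predict_proba(X_test)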

My Dream

I have a dream: to realize a society where anyone can easily create machine learning and deep learning models, just as anyone can use the Internet. As a first step toward building that AI infrastructure, I implemented a system in which a whole machine learning pipeline runs simply by passing in a data path. There are still many places I haven't reached, but I will keep doing my best.
