Supports Python 3 (2016.01.25)
I made a tool called MALSS (Machine Learning Support System) to support machine learning in Python (PyPI / [GitHub](https://github.com/canard0328/malss)). Last time I wrote about how to install it, so this time I will cover its basic usage.
First, import MALSS.
```python
from malss import MALSS
```
Next, prepare the data. This time we will use the Heart Disease data from the book *An Introduction to Statistical Learning*. The AHD column indicates whether or not the patient has heart disease, and the goal is to predict it. The explanatory variables (the values used for prediction) in this data include categorical variables (non-numeric values), so the pandas library is used for loading. If the data were all numeric, numpy's loadtxt method could be used instead.
```python
import pandas as pd

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Heart.csv',
                   index_col=0, na_values=[''])
y = data['AHD']
del data['AHD']
```
Now we can finally run the analysis. All you have to do is create an instance and call the fit method. The analysis takes from several minutes to several tens of minutes, depending on the amount of data and the machine specifications.
```python
cls = MALSS('classification',
            shuffle=True, standardize=True, n_jobs=3,
            random_state=0, lang='jp', verbose=True)
cls.fit(data, y, 'result_classification')
```
The only required argument of the MALSS constructor is the analysis task. **This time it is "classification" because we are predicting a label (Yes/No); for a regression task that predicts a numeric value, pass "regression".**
The other options all have default values, so they can be omitted.
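For example, a regression analysis would be started in the same way, only changing the task argument. This is a minimal sketch; `data_reg` and `y_reg` stand for hypothetical regression data, not the Heart data above.

```python
# Minimal sketch: the same workflow for a regression task, relying on the default options.
# data_reg / y_reg are hypothetical regression data, not the Heart data above.
reg = MALSS('regression')
reg.fit(data_reg, y_reg, 'result_regression')
```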
The arguments of the fit method are the input data (features / explanatory variables), the output data (objective variable), and the output directory (default: None). If an output directory is passed, an analysis report is written to it.
Open report.html in the directory specified above in a browser. (Note: the screenshots below may be from an older version.)
Several machine learning algorithms are selected automatically according to the analysis task and the data (mainly its size), and the cross-validation score of each algorithm is displayed. Looking at this table, we can see that Logistic Regression achieves the highest score on this data. In machine learning analysis, a technique called **cross-validation** is used to prevent the model from overfitting to the given data. Explanations like this appear throughout the report, and technical terms are linked to explanatory pages (Wikipedia, etc.). Once you are used to them, you can hide these comments by setting the verbose option to False.
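For reference, the kind of cross-validation MALSS runs for you can also be written directly with scikit-learn; this is only an illustrative sketch using synthetic data, not what MALSS does internally:

```python
# Illustrative sketch of 5-fold cross-validation with scikit-learn (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())  # average score over the five held-out folds
```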
Next, a summary of the data is displayed. MALSS converts categorical data using **dummy variables** and performs **missing-value imputation**, and explanations of both are also included in the report.
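For reference, those two preprocessing steps can be sketched by hand with pandas; this is only a rough illustration of the idea, not MALSS's exact implementation:

```python
# Rough sketch of the two preprocessing steps (not MALSS's exact implementation):
# categorical columns -> dummy (one-hot) variables, then simple mean imputation.
X_prep = pd.get_dummies(data)          # data: the Heart features loaded above
X_prep = X_prep.fillna(X_prep.mean())  # fill missing values with column means
```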
After that, the analysis results for each algorithm are displayed.
The performance of most machine learning algorithms changes with their parameters. When there are multiple parameters to tune, a **grid search** is used to tune them. As noted in the comment, if the optimal parameter value (shown in red) lies at the edge of the search range, the range needs to be changed (how to do this will be covered next time). In the case of this figure, it would be worth also considering values of the parameter *C* around 1.
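As an illustration of what a grid search looks like when written directly with scikit-learn (the parameter grid here is just an example, not the grid MALSS uses):

```python
# Illustrative grid search over the regularization parameter C (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
gs = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
gs.fit(X_demo, y_demo)
print(gs.best_params_, gs.best_score_)  # best C and its cross-validation score
```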
For details on each parameter, follow the algorithm-name link to the scikit-learn documentation.
In classification tasks, when the label distribution is imbalanced, it may not be appropriate to evaluate simply by accuracy (the fraction of correct predictions). For example, if only 1% of the data has AHD = Yes, a model that always predicts No still has an accuracy of 99%. In such cases, the **F-measure** (F1 score) is a more appropriate evaluation metric.
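A tiny made-up example of that point (not the Heart data):

```python
# Made-up labels: 99 "No" and 1 "Yes"; a model that always predicts "No".
from sklearn.metrics import accuracy_score, f1_score

y_true = ['No'] * 99 + ['Yes']
y_pred = ['No'] * 100
print(accuracy_score(y_true, y_pred))             # 0.99 even though "Yes" is never found
print(f1_score(y_true, y_pred, pos_label='Yes'))  # 0.0, exposing the useless model
```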
Finally, the **learning curve** is displayed. The learning curve shows how the evaluation score changes as the data size changes. By looking at it, you can see what state the model is in (**high variance** or **high bias**) and get hints on what to do to improve performance.
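For reference, the values behind a learning curve can also be computed with scikit-learn; here is a minimal sketch with synthetic data (plotting omitted):

```python
# Minimal sketch: compute learning-curve scores with scikit-learn (plotting omitted).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(), X_demo, y_demo, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
print(train_scores.mean(axis=1))  # training score at each data size
print(valid_scores.mean(axis=1))  # cross-validation score at each data size
```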
Even if the analysis can be run automatically, it is of little use unless the result can be incorporated into a system. MALSS can output a sample module that uses the algorithm with the best evaluation score.
```python
cls.generate_module_sample('sample_code.py')
```
The generated sample module can be trained with fit and used for prediction with predict, just like a scikit-learn estimator.
```python
from sample_code import SampleClass
from sklearn.metrics import f1_score

cls = SampleClass()
cls.fit(X_train, y_train)   # X_train, y_train: training data (the same data passed to MALSS is fine)
pred = cls.predict(X_test)  # X_test: unseen data coming into the real system
print(f1_score(y_test, pred))
```
In practice, you will not want to retrain the model every time, so it is better to dump the trained model with pickle.
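A minimal sketch of doing that with pickle (the file name is arbitrary):

```python
# Minimal sketch: persist the trained model with pickle and reuse it later.
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(cls, f)          # cls: the trained SampleClass instance from above

with open('model.pkl', 'rb') as f:
    loaded = pickle.load(f)
pred = loaded.predict(X_test)    # predict with the restored model
```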
That covers the basic usage of MALSS. Any feedback on parts that are hard to understand would be appreciated.
Next time, I will write about more advanced usage, such as adding your own algorithms.