MALSS, a tool that supports machine learning in Python (basic usage)

Supports Python 3 (2016.01.25)

I made a tool called MALSS (Machine Learning Support System) to support machine learning in Python (PyPI/[GitHub](https://github.com/canard0328/malss)). Last time I wrote about how to install it, so this time I cover the basic usage.

Package import

First, import MALSS.

python


from malss import MALSS

Data preparation

Next, prepare the data. This time we use the Heart Disease data used in the book here. The AHD column indicates whether the patient has heart disease, and the goal is to predict it. The explanatory variables (the values used for prediction) include categorical (non-numeric) variables, so the pandas library is used for loading. If the data is all numeric, you can use numpy's loadtxt function instead.

python


import pandas as pd
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Heart.csv',
                   index_col=0, na_values=[''])
y = data['AHD']
del data['AHD']
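For the all-numeric case mentioned above, a minimal sketch with numpy's loadtxt might look like the following. The CSV text, its column names, and the variable names are made up for illustration and are not part of the Heart data.

```python
import io
import numpy as np

# Hypothetical all-numeric CSV: two feature columns and a 0/1 label column.
csv_text = """age,chol,label
63,233,0
67,286,1
41,204,0
"""

# skiprows=1 skips the header line; delimiter=',' parses comma-separated values.
table = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1)

X_num = table[:, :-1]  # every column except the last: explanatory variables
y_num = table[:, -1]   # last column: objective variable
print(X_num.shape, y_num.shape)
```

With a real file you would pass its path instead of the StringIO object.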

Analysis

Now for the analysis itself. All you have to do is create an instance and call the fit method. The analysis takes from several minutes to several tens of minutes, depending on the amount of data and the machine's specifications.

python


cls = MALSS('classification',
            shuffle=True, standardize=True, n_jobs=3,
            random_state=0, lang='jp', verbose=True)
cls.fit(data, y, 'result_classification')

The only required argument to the MALSS constructor is the analysis task. **Here it is "classification", because this is a classification (identification) task that predicts a label (Yes / No). For a regression task, which predicts a numeric value, pass "regression".**

The other options have default values, so you do not need to specify them.

  • *shuffle*: whether to shuffle the data when performing machine learning (default: True)
  • *standardize*: whether to standardize the data so that each column has mean 0 and variance 1 (default: True)
  • *scoring*: name of the evaluation metric, or an evaluation function (default: None, meaning F1 score for classification and MSE for regression)
  • *n_jobs*: number of parallel jobs (default: 1)
  • *random_state*: seed used when machine learning needs random numbers (default: 0)
  • *lang*: language used in the analysis report (default: en (English))
  • *verbose*: whether to print progress to the console (default: True)

The arguments of the fit method are the input data (features / explanatory variables), the output data (objective variable), and the output directory (default: None). If an output directory is passed, the analysis report is written there.
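To make the standardize option concrete, here is a rough numpy sketch of what "mean 0 and variance 1 per column" means. It is only an illustration on made-up numbers and is not taken from the MALSS source, whose implementation may differ.

```python
import numpy as np

# Toy feature matrix (made-up values): columns on very different scales.
X_toy = np.array([[1.0, 200.0],
                  [2.0, 300.0],
                  [3.0, 400.0]])

col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)
col_std[col_std == 0] = 1.0  # avoid division by zero for constant columns

# After this, each column has mean 0 and variance 1.
X_std = (X_toy - col_mean) / col_std
print(X_std.mean(axis=0), X_std.var(axis=0))
```

Standardizing puts every column on a comparable scale, which matters for algorithms that are sensitive to feature magnitude.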

Check the report

Open report.html in the directory specified above in a browser. (Note: the screenshots may be from an older version.)

Analysis results

[Figure: analysis results (分析結果.png)]

Several machine learning algorithms are selected automatically according to the analysis task and the data (mainly its size), and the cross-validation score of each algorithm is displayed. This table shows that Logistic Regression scores highest on this data. In machine learning, a technique called **cross-validation** is used to keep the model from overfitting to the given data. The report explains such terms beneath each result and links to explanatory pages (Wikipedia, etc.). Once you are used to it, you can hide these comments by setting the verbose option to False.
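As a rough illustration of the idea behind cross-validation, the sketch below splits sample indices into k folds so that every sample is used for testing exactly once. The function name k_fold_indices is hypothetical, and this is not how MALSS itself performs the split.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    # A model would be trained on train_idx and scored on test_idx here.
    print(sorted(test_idx), len(train_idx))
```

The cross-validation score is then the average of the scores obtained on the held-out folds.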

Data summary

[Figure: data summary (データ概要.png)]

Next, a summary of the data is displayed. MALSS can convert categorical data into **dummy variables** and perform **missing value interpolation**, and the report explains both.
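The two preprocessing steps can be sketched with pandas as follows. The toy table is made up, and filling gaps with the column mean is an assumption chosen for illustration; the strategy MALSS actually uses may differ.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one numeric column with a gap, one categorical column.
df = pd.DataFrame({
    'Age': [63.0, np.nan, 41.0],
    'ChestPain': ['typical', 'asymptomatic', 'typical'],
})

# Missing value interpolation (assumed strategy): fill gaps with the column mean.
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Dummy variables: one indicator column per category value.
encoded = pd.get_dummies(df, columns=['ChestPain'])
print(encoded.columns.tolist())
```

After this conversion every column is numeric, so the data can be passed to algorithms that do not accept strings or missing values.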

Analysis results by algorithm

After that, the analysis results for each algorithm are displayed.

Parameter tuning result

SVM.png

The performance of most machine learning algorithms changes when their parameters are adjusted. When there are several parameters to adjust, **grid search** is used to tune them. As the comment in the report notes, if the optimal parameter (shown in red) lies at the edge of the searched range, the range needs to be changed (how to do so will be covered next time). In this figure, it is worth considering the case where the parameter *C* is 1.

For details on the parameters, follow the link on the algorithm name to the scikit-learn documentation.
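The following is a small sketch of grid search, including the "optimum at the edge of the range" check described above. The scoring function toy_score and the parameter grid are hypothetical stand-ins for a real cross-validation score and do not reproduce the values in the figure.

```python
from itertools import product
from math import log10

def toy_score(C, gamma):
    """Hypothetical stand-in for a cross-validation score (peaks at C=10, gamma=0.01)."""
    return 1.0 - 0.1 * abs(log10(C) - 1) - 0.1 * abs(log10(gamma) + 2)

grid = {'C': [0.01, 0.1, 1.0], 'gamma': [0.001, 0.01, 0.1]}
names = list(grid)

# Try every combination of parameter values and keep the best one.
best_params, best_score = None, float('-inf')
for values in product(*(grid[n] for n in names)):
    params = dict(zip(names, values))
    score = toy_score(**params)
    if score > best_score:
        best_params, best_score = params, score

# Parameters whose best value sits at either end of its searched range.
at_edge = [n for n in names if best_params[n] in (grid[n][0], grid[n][-1])]
print(best_params, at_edge)
```

Here the best C is the largest value tried, so the true optimum may lie outside the grid and the range for C should be widened.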

Classification result

[Figure: classification result (分類結果.png)]

In a classification (identification) task, if the label ratio is skewed, evaluating by accuracy alone (the fraction of predictions that are correct) may not be appropriate. For example, if data with AHD = Yes makes up 1% of the total, a model that always predicts No has 99% accuracy. In such cases the **F-measure** (F1 score) is a more appropriate metric.
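The 1% example above can be checked with a few lines of plain Python. The helper functions accuracy and f1 are written here for illustration; in practice scikit-learn's metrics would normally be used.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive='Yes'):
    """F1 score: harmonic mean of precision and recall for the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = ['Yes'] * 1 + ['No'] * 99   # 1% of samples are Yes
y_pred = ['No'] * 100                # a model that always predicts No

print(accuracy(y_true, y_pred), f1(y_true, y_pred))
```

The accuracy looks excellent while the F1 score is zero, which exposes that the model never finds a positive case.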

Learning curve

[Figure: learning curve (学習曲線.png)]

Finally, the **learning curve** is displayed. The learning curve shows how the evaluation score changes as the data size changes. By looking at it, you can see what state the model is in (**high variance** or **high bias**) and get hints on how to improve performance.
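One rough way to read a learning curve is to compare the training and validation scores at the largest data size. The sketch below encodes that rule of thumb; the function name diagnose and both thresholds are hypothetical and chosen only for illustration, so treat the result as a hint rather than a verdict.

```python
def diagnose(train_scores, valid_scores, target=0.9, gap_tol=0.05):
    """Classify a learning curve by its final (largest data size) scores.

    target and gap_tol are hypothetical thresholds chosen for illustration.
    """
    train_end, valid_end = train_scores[-1], valid_scores[-1]
    if train_end - valid_end > gap_tol:
        return 'high variance'  # large train/validation gap: overfitting
    if valid_end < target:
        return 'high bias'      # both scores low and close: underfitting
    return 'ok'

# Made-up score sequences, ordered by increasing training data size.
print(diagnose([0.99, 0.98], [0.70, 0.80]))
print(diagnose([0.75, 0.72], [0.68, 0.70]))
```

Typical remedies are more data or stronger regularization for high variance, and more features or a more flexible model for high bias.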

Sample module output

Automatic analysis is of little use unless the result can be built into a real system. MALSS can output a sample module that uses the learning algorithm with the best evaluation score.

python


cls.generate_module_sample('sample_code.py')

The generated sample module can be trained with fit and used for prediction with predict, just like scikit-learn's machine learning algorithms.

python


from sample_code import SampleClass
from sklearn.metrics import f1_score

cls = SampleClass()
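cls.fit(X_train, y_train)  # X_train, y_train: training data (the same data passed to MALSS is fine)
pred = cls.predict(X_test)  # X_test: unseen data fed to the real system
# print is a function in Python 3; if the labels are strings such as 'Yes'/'No',
# f1_score also needs pos_label='Yes'
print(f1_score(y_test, pred))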

In practice you would not retrain the model every time, so it is better to dump the trained model with pickle.
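A minimal sketch of saving and restoring with pickle follows. The dictionary trained_model is only a stand-in for a fitted SampleClass instance, and the file name model.pkl is arbitrary.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; in practice this would be the fitted
# SampleClass instance from the generated module.
trained_model = {'algorithm': 'LogisticRegression', 'coef': [0.5, -1.2]}

path = os.path.join(tempfile.mkdtemp(), 'model.pkl')

# Save once, after training.
with open(path, 'wb') as f:
    pickle.dump(trained_model, f)

# Load in the real system, without retraining.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == trained_model)
```

Only load pickle files you created yourself, because unpickling untrusted data can execute arbitrary code.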

In conclusion

I have explained the basic usage of MALSS. I would be grateful for feedback on anything that is hard to understand.

Next time I will write about more advanced usage, such as adding your own algorithms.
