Supports Python 3 (2016.01.25)
I made a tool called MALSS (Machine Learning Support System) to support machine learning in Python (PyPI / [GitHub](https://github.com/canard0328/malss)). Last time I wrote about how to install it, so this time I will cover its basic usage.
First, import MALSS.
```python
from malss import MALSS
```
Next, prepare the data. This time we will use the Heart Disease data from the book *An Introduction to Statistical Learning*. The AHD column indicates whether or not the patient has heart disease, and the goal is to predict it. The explanatory variables (the values used for prediction) in this data include categorical variables (non-numeric values), so the pandas library is used for loading. If the data were all numeric, numpy's loadtxt method could be used instead.
```python
import pandas as pd

data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Heart.csv',
                   index_col=0, na_values=[''])
y = data['AHD']
del data['AHD']
```
Now we can finally run the analysis. All you have to do is create an instance and call the fit method. The analysis takes from several minutes to several tens of minutes, depending on the amount of data and the machine specifications.
```python
cls = MALSS('classification',
            shuffle=True, standardize=True, n_jobs=3,
            random_state=0, lang='jp', verbose=True)
cls.fit(data, y, 'result_classification')
```
The only required argument of the MALSS constructor is the analysis task. **This time it is "classification" because we are predicting a label (Yes/No); for a regression task that predicts a numeric value, pass "regression".**
The other options all have default values, so they can be omitted.
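For example, a regression analysis would be started in the same way, only changing the task argument. This is a minimal sketch; `data_reg` and `y_reg` stand for hypothetical regression data, not the Heart data above.

```python
# Minimal sketch: the same workflow for a regression task, relying on the default options.
# data_reg / y_reg are hypothetical regression data, not the Heart data above.
reg = MALSS('regression')
reg.fit(data_reg, y_reg, 'result_regression')
```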
The arguments of the fit method are the input data (features / explanatory variables), the output data (objective variable), and the output directory (default: None). If an output directory is passed, an analysis report is written to it.
Open report.html in the directory specified above in a browser. (Note: the screenshots below may be from an older version.)
Several machine learning algorithms are selected automatically according to the analysis task and the data (mainly its size), and the cross-validation score of each algorithm is displayed. Looking at this table, we can see that Logistic Regression achieves the highest score on this data. In machine learning analysis, a technique called **cross-validation** is used to prevent the model from overfitting to the given data. Explanations like this appear throughout the report, and technical terms are linked to explanatory pages (Wikipedia, etc.). Once you are used to them, you can hide these comments by setting the verbose option to False.
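For reference, the kind of cross-validation MALSS runs for you can also be written directly with scikit-learn; this is only an illustrative sketch using synthetic data, not what MALSS does internally:

```python
# Illustrative sketch of 5-fold cross-validation with scikit-learn (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())  # average score over the five held-out folds
```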
Next, a summary of the data is displayed. MALSS converts categorical data using **dummy variables** and performs **missing-value imputation**, and explanations of both are also included in the report.
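For reference, those two preprocessing steps can be sketched by hand with pandas; this is only a rough illustration of the idea, not MALSS's exact implementation:

```python
# Rough sketch of the two preprocessing steps (not MALSS's exact implementation):
# categorical columns -> dummy (one-hot) variables, then simple mean imputation.
X_prep = pd.get_dummies(data)          # data: the Heart features loaded above
X_prep = X_prep.fillna(X_prep.mean())  # fill missing values with column means
```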
After that, the analysis results for each algorithm are displayed.
The performance of most machine learning algorithms changes with their parameters. When there are multiple parameters to tune, a **grid search** is used to tune them. As noted in the comment, if the optimal parameter value (shown in red) lies at the edge of the search range, the range needs to be changed (how to do this will be covered next time). In the case of this figure, it would be worth also considering values of the parameter *C* around 1.
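As an illustration of what a grid search looks like when written directly with scikit-learn (the parameter grid here is just an example, not the grid MALSS uses):

```python
# Illustrative grid search over the regularization parameter C (synthetic data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
gs = GridSearchCV(LogisticRegression(), {'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
gs.fit(X_demo, y_demo)
print(gs.best_params_, gs.best_score_)  # best C and its cross-validation score
```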
For details on each parameter, follow the algorithm-name link to the scikit-learn documentation.
In classification tasks, when the label distribution is imbalanced, it may not be appropriate to evaluate simply by accuracy (the fraction of correct predictions). For example, if only 1% of the data has AHD = Yes, a model that always predicts No still has an accuracy of 99%. In such cases, the **F-measure** (F1 score) is a more appropriate evaluation metric.
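A tiny made-up example of that point (not the Heart data):

```python
# Made-up labels: 99 "No" and 1 "Yes"; a model that always predicts "No".
from sklearn.metrics import accuracy_score, f1_score

y_true = ['No'] * 99 + ['Yes']
y_pred = ['No'] * 100
print(accuracy_score(y_true, y_pred))             # 0.99 even though "Yes" is never found
print(f1_score(y_true, y_pred, pos_label='Yes'))  # 0.0, exposing the useless model
```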
Finally, the **learning curve** is displayed. The learning curve shows how the evaluation score changes as the data size changes. By looking at it, you can see what state the model is in (**high variance** or **high bias**) and get hints on what to do to improve performance.
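For reference, the values behind a learning curve can also be computed with scikit-learn; here is a minimal sketch with synthetic data (plotting omitted):

```python
# Minimal sketch: compute learning-curve scores with scikit-learn (plotting omitted).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X_demo, y_demo = make_classification(n_samples=300, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(), X_demo, y_demo, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))
print(train_scores.mean(axis=1))  # training score at each data size
print(valid_scores.mean(axis=1))  # cross-validation score at each data size
```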
Even if the analysis can be run automatically, it is of little use unless the result can be incorporated into a system. MALSS can output a sample module that uses the algorithm with the best evaluation score.
```python
cls.generate_module_sample('sample_code.py')
```
The generated sample module can be trained with fit and used for prediction with predict, just like a scikit-learn estimator.
```python
from sample_code import SampleClass
from sklearn.metrics import f1_score

cls = SampleClass()
cls.fit(X_train, y_train)   # X_train, y_train: training data (the same data passed to MALSS is fine)
pred = cls.predict(X_test)  # X_test: unseen data coming into the real system
print(f1_score(y_test, pred))
```
In practice, you will not want to retrain the model every time, so it is better to dump the trained model with pickle.
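A minimal sketch of doing that with pickle (the file name is arbitrary):

```python
# Minimal sketch: persist the trained model with pickle and reuse it later.
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(cls, f)          # cls: the trained SampleClass instance from above

with open('model.pkl', 'rb') as f:
    loaded = pickle.load(f)
pred = loaded.predict(X_test)    # predict with the restored model
```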
That covers the basic usage of MALSS. Any feedback on parts that are hard to understand would be appreciated.
Next time, I will write about more advanced usage, such as adding your own algorithms.