When studying or teaching machine learning with PyData.Tokyo Tutorial #1, I find that the part running from splitting the training data through training, prediction, and validation is hard to understand. This article explains that part.
--Supervised learning, in other words the data is labeled
--The dataset has a fixed number of samples (890 in this tutorial)
--20% of the data is held out as test data; the rest is used for training and the held-out part for validation
--The feature matrix is multidimensional (naturally ...)
--sklearn (scikit-learn) is used
--Estimation is done with logistic regression
See pydatatokyo_tutorial_ml.ipynb in PyData.Tokyo Tutorial #1 for the detailed code.
Given the feature matrix X and the class label data y, the data can be split as follows.
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=1)
--X_train: feature matrix for training (80%)
--X_val: feature matrix for evaluation (20%); treated as unknown data
--y_train: class labels for training (80%)
--y_val: class labels for evaluation (20%); used to check the answers on the unknown data (keep it hidden until then)
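
To make the 80/20 split concrete, here is a minimal sketch. The random arrays `X` and `y` are an assumption standing in for the tutorial data (890 rows, with a made-up count of 3 features); only the `train_test_split` call mirrors the one above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial data (assumption): 890 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(890, 3))        # feature matrix
y = rng.integers(0, 2, size=890)     # binary class labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

# Rows are shuffled and split; features and labels stay aligned row by row.
print(X_train.shape, X_val.shape)  # (712, 3) (178, 3)
print(y_train.shape, y_val.shape)  # (712,) (178,)
```

Fixing `random_state` makes the split reproducible, so repeated runs evaluate on the same held-out rows.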
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
This initializes clf, which is used for the training, prediction, and validation below.
Train with the fit method of the initialized clf, i.e. clf.fit(X_train, y_train). The data passed in is the training 80%: its feature matrix and its class labels.
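
The training step can be sketched as follows. The data here comes from `make_classification`, which is an assumption used only to keep the example self-contained; it is not the tutorial dataset, and the feature count of 5 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (assumption, not the tutorial data).
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression()
clf.fit(X_train, y_train)  # only the training 80% is passed in

print(clf.classes_)     # class labels seen during training
print(clf.coef_.shape)  # one weight per feature: (1, 5) for a binary problem
```

Note that X_val and y_val do not appear in the call to fit; the model never sees them during training.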
y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)
Predict with clf's predict method.
--y_train_pred: predictions made again on the training data the model was fitted on
--y_val_pred: predictions made on the evaluation data
So far, y_val has not been used. That is, X_val has been treated as unknown data.
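
As a small illustration of this point, the sketch below runs the prediction step on synthetic data (an assumption; `make_classification` stands in for the tutorial dataset). The predict method receives only feature matrices, so y_val plays no part in producing the predictions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (assumption, not the tutorial data).
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# predict takes features only; no labels are passed in.
y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

# One predicted class label per input row.
print(y_train_pred.shape == y_train.shape)
print(y_val_pred.shape == y_val.shape)
```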
from sklearn.metrics import accuracy_score
train_score = accuracy_score(y_train, y_train_pred)
val_score = accuracy_score(y_val, y_val_pred)
accuracy_score is given the class label data and the predicted results above, and outputs the accuracy (the rate of correct answers).
--train_score: accuracy of the predictions made on the training data
--val_score: accuracy of the predictions made on the evaluation data, i.e. how well the model predicts unknown data
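
Putting the steps together, the following sketch runs the whole flow and checks that the accuracy is simply the fraction of predictions that match the true labels. The data is synthetic (an assumption; it is not the tutorial dataset), and `manual_val_score` is a name introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (assumption, not the tutorial data).
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

# y_val is used here for the first time, to check the answers.
train_score = accuracy_score(y_train, y_train_pred)
val_score = accuracy_score(y_val, y_val_pred)

# The same number computed by hand: fraction of matching labels.
manual_val_score = np.mean(y_val == y_val_pred)

print(f"train: {train_score:.3f}, val: {val_score:.3f}")
```

A train_score much higher than val_score would suggest the model fits the training data better than it generalizes to unknown data.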