[Python] Machine learning: splitting the training data, then learning, prediction, and verification

When studying or teaching machine learning with PyData.Tokyo Tutorial #1, I find that the stretch from splitting the training data through learning, prediction, and verification is hard to understand. This post explains that part.


- Supervised learning, in other words the data is labeled
- There is a fixed number of samples (890 in this tutorial)
- 20% of the data is held out as test data for learning and verification
- The feature matrix is multidimensional (naturally...)
- sklearn (scikit-learn) is used
- Estimation is done with logistic regression
- See pydatatokyo_tutorial_ml.ipynb in PyData.Tokyo Tutorial #1 for the detailed code.

Training data split

Given a feature matrix X and class label data y, the data can be split as follows.

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.8, random_state=1)


- X_train: feature matrix for training (80%)
- X_val: feature matrix for evaluation (20%), treated as unknown data
- y_train: class labels for training (80%)
- y_val: class labels for evaluation (20%), the answers for the unknown data (keep them hidden)
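To see the split concretely, here is a minimal sketch. The tutorial's real data is not reproduced here, so the sketch assumes a synthetic stand-in generated with `make_classification`; the 5 features are a hypothetical choice, and only the 890 sample count comes from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial data (an assumption, not the real dataset):
# 890 samples, 5 hypothetical features, binary class labels.
X, y = make_classification(n_samples=890, n_features=5, random_state=0)

# Same call as in the text: 80% for training, the rest for evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

# Rows are divided about 80/20; the number of feature columns is unchanged.
print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)
```

Because `random_state` is fixed, the same rows land in each part on every run, which keeps later scores comparable.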

Learning / prediction / verification

Initialization of classifier (learner)

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

Initialize clf and use it for the following learning, prediction, and verification.


clf.fit(X_train, y_train)

Train using the fit method of the initialized clf. The data passed in are the feature matrix and class labels of the 80% training portion.
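As a rough illustration of what fit leaves behind, the sketch below trains on a synthetic stand-in dataset (an assumption; the 5 features are hypothetical) and inspects the fitted attributes that scikit-learn exposes on `LogisticRegression`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial data (an assumption, not the real dataset).
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression()
clf.fit(X_train, y_train)  # only the 80% training portion is seen here

# After fitting, the model holds one weight per feature plus an intercept.
print(clf.coef_.shape)       # (1, 5) for a binary problem with 5 features
print(clf.intercept_.shape)  # (1,)
print(clf.classes_)          # the distinct labels found in y_train
```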


y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

Predict with clf's predict method.

- y_train_pred: result of predicting again on the training data
- y_val_pred: result of predicting on the evaluation data

So far, y_val has not been used. That is, X_val is treated as unknown data.
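The point that prediction needs no labels can be checked directly. The following sketch again assumes a synthetic stand-in dataset with 5 hypothetical features; note that `y_val` never appears in a call to `fit` or `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial data (an assumption, not the real dataset).
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# predict takes a feature matrix only; no labels are passed in.
y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

# One predicted label per row, drawn from the classes learned during fit.
print(y_val_pred.shape)
print(y_val_pred[:10])
```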

Evaluation / verification

from sklearn.metrics import accuracy_score
train_score = accuracy_score(y_train, y_train_pred)
val_score = accuracy_score(y_val, y_val_pred)

`accuracy_score` is given the class label data and the predicted results above, and outputs the correct answer rate (accuracy).

- train_score: accuracy of the predictions made on the training data
- val_score: accuracy of the predictions made on the evaluation data, that is, on data the model has not seen
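Putting the steps together, here is a minimal end-to-end sketch. It assumes a synthetic stand-in dataset (5 hypothetical features), so the scores it prints say nothing about the tutorial's actual data. The variable `manual_val_score` is added only for illustration: it shows that the accuracy is the fraction of predictions that match the true labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tutorial data (an assumption, not the real dataset).
X, y = make_classification(n_samples=890, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, random_state=1
)

clf = LogisticRegression()
clf.fit(X_train, y_train)

y_train_pred = clf.predict(X_train)
y_val_pred = clf.predict(X_val)

# This is the first place y_val is used: as the answer key.
train_score = accuracy_score(y_train, y_train_pred)
val_score = accuracy_score(y_val, y_val_pred)

# Illustration only: accuracy is the share of matching labels.
manual_val_score = np.mean(y_val == y_val_pred)

print(train_score, val_score, manual_val_score)
```

If train_score is much higher than val_score, the model is likely fitting the training data more closely than it generalizes to unseen data.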
