Aidemy 2020/9/23

Hello, this is Yope! I am a liberal arts college student, but I am interested in the AI field, so I am studying at the AI-specialized school "Aidemy". I would like to share the knowledge I gain there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary article. Thank you! This time, I will note the key points from the introduction to machine learning.

・ Machine learning includes __"supervised learning", "unsupervised learning", and "reinforcement learning"__.

・ Supervised learning is a method in which the computer is given training data together with correct-answer (teacher) data and keeps learning until its answers match the correct ones. It is the most commonly used type.

・ Unsupervised learning is a method in which only training data is given and the computer finds regularities in it by itself.

・ Reinforcement learning is a method in which an agent keeps learning so as to maximize the reward it can obtain.

・ __Supervised learning procedure__: Data collection → Data cleansing → Training → Evaluation with test data → Implementation

・ __Supervised learning practice 1__: Holdout method: The data is split into training data and test data with the train_test_split() function. __X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=fraction of test data, random_state=0)__

- X is the feature data and y is the correct label. "train" marks training data and "test" marks test data. random_state is the seed for the random shuffle that decides which samples become test data.
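
To make the holdout split concrete, here is a minimal sketch. The toy arrays are made up for illustration (they are not from the Aidemy course), and the 20% test fraction is just an assumed value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each, and binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the samples as test data; random_state fixes the shuffle,
# so the same samples are selected on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)
```

Changing random_state changes which samples land in the test set, but not how many.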

・ __Supervised learning practice 2__: k-fold cross-validation: The data is divided into k parts, and one of them is used as test data while the rest is used for training. The test part is changed each time, so the model is verified k times in total, and the average performance is calculated. (For example, with 20 samples and k = 20, 19 samples are training data and 1 is test data each time, and verification is run 20 times in total.)
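
As a rough sketch of k-fold cross-validation, the example below uses scikit-learn's KFold and cross_val_score. The choice of the bundled iris dataset, LogisticRegression, and k = 5 are my own assumptions for illustration. The second half reproduces the 20-sample, k = 20 case from the text.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Part 1: average performance over k = 5 folds (assumed model and dataset).
X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(len(scores))    # 5 (one score per fold)
print(scores.mean())  # average accuracy over the 5 folds

# Part 2: 20 samples with k = 20, so each fold holds exactly one test sample.
X_small = np.arange(20).reshape(20, 1)
splits = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in KFold(n_splits=20).split(X_small)
]
print(len(splits), splits[0])  # 20 (19, 1)
```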

・ __Overfitting__: A state in which the model fits the training data too closely, so it fails to generalize and cannot handle unknown data.
・ __Dropout__: A means of avoiding overfitting in neural networks. Some units are randomly ignored during training.
・ __Ensemble learning__: A way to improve accuracy by training multiple models and combining their results, for example by averaging them.
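
The averaging idea behind ensemble learning can be sketched as follows. The probabilities are hypothetical values standing in for the outputs of three separately trained models, and the 0.5 threshold is an assumption.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class from three
# separately trained models, for the same four samples.
model_probs = np.array([
    [0.9, 0.4, 0.2, 0.6],  # model A
    [0.8, 0.6, 0.1, 0.4],  # model B
    [0.7, 0.2, 0.3, 0.8],  # model C
])

# Average the three models' outputs, then threshold at 0.5 (assumed).
avg_probs = model_probs.mean(axis=0)
ensemble_pred = (avg_probs >= 0.5).astype(int)

print(ensemble_pred)  # [1 0 0 1]
```

Model B alone would answer 1 for the second sample and 0 for the fourth, but the averaged result follows the other two models.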

・ __Confusion matrix__: A table used to evaluate the accuracy of a model. The results are classified into __"true positive", "false positive", "true negative", and "false negative"__. "True/false" indicates whether the model's answer was correct, and "positive/negative" indicates the model's answer. (That is, a "false positive" means the model answered positive, but the actual answer was negative.)

・ Implementation of the confusion matrix: Write it as follows. Pass the list of correct answers as "y_true" and the list of the model's answers as "y_pred".

```
from sklearn.metrics import confusion_matrix

# Define the correct answers and the model's answers as lists (0 is positive, 1 is negative)
y_true = [1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0]

# Rows are the actual classes, columns are the predicted classes (labels sorted: 0, 1)
confmat = confusion_matrix(y_true, y_pred)
print(confmat)
# [[0 0]   # [[true positive   false negative]
#  [3 3]]  #  [false positive  true negative]]
```

・ __Accuracy (correct answer rate)__: The fraction of all answers that were correct. ((True positive + True negative) / Total)

・ __Precision__: The fraction of the samples predicted "positive" that were actually positive. (True positive / (True positive + False positive))

・ __Recall__: The fraction of the actually positive samples that were predicted "positive". (True positive / (True positive + False negative))

・ __F value (F1 score)__: The harmonic mean of precision and recall. (2 * precision * recall / (precision + recall))

- All of these take values from 0 to 1, and the closer the value is to 1, the better the performance.
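
To check the four formulas by hand, here is a small sketch in plain Python. The evaluate function is a hypothetical helper written for this note (it is not a scikit-learn API), and the label lists are made-up values with 1 treated as positive.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class never appears.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical data: TP = 2, FN = 1, FP = 1, TN = 4
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

accuracy, precision, recall, f1 = evaluate(y_true, y_pred)
print(accuracy)             # 0.75
print(round(precision, 3))  # 0.667
print(round(recall, 3))     # 0.667
print(round(f1, 3))         # 0.667
```

Because precision and recall are equal here, their harmonic mean (the F value) is the same number.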

・ Implementation of the above evaluation metrics: Import each function and pass "y_true" and "y_pred" as arguments.

```
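# precision_score: precision, recall_score: recall, f1_score: F value
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

# Output the F value (label 1 is the positive class by default, pos_label=1)
print("F1: {:.6f}".format(f1_score(y_true, y_pred)))
# F1: 0.800000
# With pos_label=0 (0 treated as positive), the same data gives 0.666667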
```

・ __PR curve__: A graph with precision on the vertical axis and recall on the horizontal axis.
・ __Precision and recall are in a trade-off relationship__, so in some cases you need to consider which one to emphasize. If you have no particular preference, it is recommended to use the F value or the point on the PR curve where precision and recall are equal (the __break-even point (BEP)__).
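
As a rough illustration of the PR curve and the break-even point, the sketch below uses scikit-learn's precision_recall_curve. The labels and scores are hypothetical, and picking the BEP as the point where |precision - recall| is smallest is my own simplification.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels (1 = positive) and model scores for four samples.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on the score gives one (recall, precision) point of the curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Approximate the break-even point as the curve point where
# precision and recall are closest to each other.
bep_idx = int(np.argmin(np.abs(precision - recall)))
print(precision[bep_idx], recall[bep_idx])  # 0.5 0.5
```

Plotting recall on the horizontal axis and precision on the vertical axis (for example with matplotlib) gives the PR curve itself.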
