[PYTHON] Personal notes and links about machine learning ① (Machine learning)

Introduction

There is a limit to what you can do from scratch. As the phrase "standing on the shoulders of giants" suggests, I would like to treat useful reference articles as the wisdom of our predecessors and use them to raise my own level.

-Personal notes and links about machine learning ① (Machine learning)
-Personal notes and links about machine learning ② (Deep Learning)
-[Personal notes and links about machine learning ③ (BI / Visualization)](https://qiita.com/CraveOwl/items/7846abccbbaebed6ce63)

Machine learning method

There are various methods for machine learning, and it is helpful to organize them as follows.

-Overview of machine learning techniques learned from scikit-learn
-Thaw! There are many data analysis and machine learning methods, but when should I use them?

Classification

Decision Tree

Predictive accuracy is not particularly high, but because the fitted tree can be visualized, the model is easy to explain.

-[Decision tree analysis with scikit-learn (CART method)](https://pythondatascience.plavox.info/scikit-learn/scikit-learn%E3%81%A7%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90)
-Decision Tree and Random Forest
-Generate Python code from scikit-learn decision tree / random forest rules
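As a minimal sketch of the explainability of a decision tree, the following trains a shallow tree and prints its rules as text. The iris dataset bundled with scikit-learn, `max_depth=2`, and `random_state=0` are assumptions chosen for illustration; they are not taken from the linked articles.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: the iris dataset bundled with scikit-learn (an assumption).
iris = load_iris()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the learned splits as indented, rule-like text.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Limiting the depth trades some accuracy for rules that fit on one screen, which is usually the point when the tree is used for explanation.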

Support Vector Machine

Random Forest

-Machine Learning for Package Users (5): Random Forest
-What I was asked when using Random Forest in practice
-Importance of features that can be calculated by Random Forest
-[Compare Random Forest vs SVM with Python scikit-learn](http://yut.hatenablog.com/entry/20121012/1349997641)
-Verification of tuneRF function behavior
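Feature importance, which one of the links above covers, can be read directly from a fitted forest. This is only a sketch: the iris dataset, `n_estimators=100`, and the use of the out-of-bag score as a quick accuracy estimate are my own assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: the iris dataset bundled with scikit-learn (an assumption).
iris = load_iris()

# oob_score=True evaluates each tree on the samples left out of its bootstrap.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importances; they are normalized to sum to 1.
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: -pair[1],
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
print("OOB accuracy:", round(forest.oob_score_, 3))
```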

Regression

Linear regression

-[Machine learning] Regression analysis using scikit learn
-Linear regression in Python (statmodels, scikit-learn, PyMC3)
-Linear? non-linear?

Lasso regression

A regression model with L1 regularization.
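The practical effect of the L1 penalty is that coefficients of unhelpful features tend to become exactly zero. Here is a small sketch with scikit-learn's `Lasso`; the synthetic data (only two of ten features matter) and `alpha=0.1` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption): only the first two of ten features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# alpha sets the strength of the L1 penalty; 0.1 is an arbitrary choice.
model = Lasso(alpha=0.1)
model.fit(X, y)

# Count how many coefficients the penalty has driven to exactly zero.
n_zero = int(np.sum(model.coef_ == 0.0))
print(model.coef_.round(2), n_zero)
```

Larger values of `alpha` zero out more coefficients, so in practice it is usually chosen by cross-validation.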

SVR

-Multivariable regression model with scikit-learn --I tried to compare and verify SVR

Time series analysis

Clustering

Hierarchical cluster analysis (aggregation method)

This method draws a dendrogram (tree diagram) showing how close the objects are to one another, which makes it possible to judge visually how many clusters the data should be divided into. However, the number of objects is in practice limited to a few hundred, since that is about what a tree diagram can display. Beyond that, the diagram becomes hard to read.

As data volumes have grown enormously in the world of data mining and big data, the method has become less popular.

-Heatmap with Dendrogram in Python + matplotlib
-Python: Hierarchical clustering dendrogram drawing and threshold division
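The dendrogram-then-cut workflow can be sketched with SciPy as follows. The synthetic three-group data, Ward linkage, and the choice of three clusters are assumptions for illustration. `no_plot=True` is used so the example runs without matplotlib.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic data (an assumption): three well-separated groups of 2-D points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Agglomerative clustering with Ward linkage builds the merge tree.
Z = linkage(X, method="ward")

# no_plot=True returns the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)

# Cutting the tree into at most 3 clusters plays the role of the threshold
# one would read off the drawn dendrogram.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])
```

Dropping `no_plot=True` (with matplotlib installed) draws the tree itself, which is where the limit of a few hundred objects becomes visible.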

Non-hierarchical cluster analysis (k-means)

The most famous non-hierarchical clustering technique. Given the number of clusters K, the algorithm automatically finds a good way to partition the input data into K groups.

The defining feature of this method, and its biggest weakness, is that the number of clusters (K) must be decided in advance. To work around this, methods such as X-means that estimate the number of clusters automatically have been developed. (K-means++, often mentioned alongside it, improves how the initial centroids are chosen but still requires K.)

It is also used to cluster customers by purchasing tendency, but the split is often extreme, for example one cluster with tens of thousands of people alongside a cluster with only a few. Tuning the parameters to avoid this is difficult, so I personally do not use it much.

-[Cluster analysis with scikit-learn (K-means method)](https://pythondatascience.plavox.info/scikit-learn/%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E5%88%86%E6%9E%90-k-means)
-I checked the X-means method that automatically estimates the number of clusters
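A rough sketch of k-means with scikit-learn follows. Since K has to be given, the example tries several values and records the inertia (within-cluster sum of squares) of each, then checks the cluster sizes for one K. The synthetic data and the candidate values of K are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (an assumption): three well-separated groups of 2-D points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# K must be supplied up front. init="k-means++" only affects how the initial
# centroids are chosen; it does not pick K.
inertias = {}
for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    km.fit(X)
    inertias[k] = km.inertia_  # within-cluster sum of squared distances

# Cluster sizes for K=3, useful for spotting extremely uneven splits.
km3 = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
sizes = np.bincount(km3.labels_)
print(inertias, sizes)
```

Looking at where the inertia stops dropping sharply is one common heuristic for choosing K, though it is a judgment call rather than an automatic rule.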

Spectral clustering

-Spectral Clustering Story
-I tried spectral clustering

Self-organizing map (SOM, Kohonen)

A type of neural network that represents the similarity of the input data as distance on a map.

Since the result is expressed on a two-dimensional map, the number of clusters has to be chosen as a product of rows and columns, such as a 3x3 map. (A prime number of clusters such as 5 or 7 can therefore only be laid out as 1x5 or 1x7, which is somewhat awkward.)

Personally, I like this method so much that it is my default choice for customer clustering. Compared with other methods such as K-means, it is less prone to extreme splits, and because the clusters are arranged along the vertical and horizontal axes of the map, the results are easy for anyone to interpret.

Because the model was devised by Dr. T. Kohonen, it is often simply called "Kohonen" rather than a self-organizing map (SOM).

-Self-organizing map in Python NumPy version
-Generative Topographic Mapping (GTM) -Upward compatible method of self-organizing map (SOM)-
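To make the "similarity as distance on a map" idea concrete, here is a minimal NumPy sketch of a 3x3 self-organizing map. It is not the implementation from the linked article. The function names `train_som` and `assign`, the linear decay schedules, and the random "customer" data are all hypothetical choices for illustration.

```python
import numpy as np

def train_som(data, rows=3, cols=3, n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM and return weights of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    # Grid coordinates of every node, used to measure distance on the map.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-2   # neighbourhood radius shrinks
        x = data[rng.integers(len(data))]
        # Best matching unit (BMU): the node whose weights are closest to x.
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighbourhood around the BMU, measured on the grid.
        grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Pull the BMU and its map neighbours towards the sample.
        weights += lr * h[..., None] * (x - weights)
    return weights

def assign(data, weights):
    """Map each sample to the (row, col) of its best matching unit."""
    d = np.linalg.norm(weights[None] - data[:, None, None, :], axis=-1)
    flat = d.reshape(len(data), -1).argmin(axis=1)
    return np.stack(np.unravel_index(flat, weights.shape[:2]), axis=1)

# Synthetic "customers" with 4 features in [0, 1) (an assumption).
rng = np.random.default_rng(1)
data = rng.random((300, 4))
weights = train_som(data)
cells = assign(data, weights)
# Number of samples that landed in each occupied map cell.
print(np.unique(cells, axis=0, return_counts=True)[1])
```

Because neighbouring nodes are updated together, adjacent cells end up with similar weight vectors, which is what makes the rows and columns of the map interpretable.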

Topic model

A topic model is a kind of probabilistic model. It was originally used in natural language processing, as a method of statistical latent semantic analysis, to estimate the probability of a word appearing in a document. When applied to numerical data, it can also be used for clustering that is not 1:1. For example, one customer does not belong to a single cluster but to several, with the membership probability split between them: 60% for cluster A, 30% for B, and so on.

Although there are various methods for topic models, LDA (Latent Dirichlet Allocation) is often used.

Because the model gives each item a separate membership probability per topic, I personally think it goes well with the idea of product DNA.

-"Statistical Latent Semantics Analysis by Topic Model" Reading Group "Chapter 1 What is Statistical Latent Semantics"
-Consider the probability of generating topics and documents with LDA
-Machine learning_Latent semantic analysis_Implemented with python
-PLSA (Stochastic Latent Semantics)
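The "membership probability split across clusters" idea can be seen directly in the output of LDA. This is a sketch using scikit-learn's `LatentDirichletAllocation`; the toy corpus and the choice of two topics are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (an assumption): two rough themes, food and sports.
docs = [
    "apple banana fruit juice",
    "banana fruit salad apple",
    "soccer goal match team",
    "team match baseball goal",
    "fruit juice after soccer match",
]

# Bag-of-words counts are the usual input for LDA.
counts = CountVectorizer().fit_transform(docs)

# n_components is the number of topics; 2 is chosen to match the toy corpus.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)

# Each row is one document's membership probability over the topics.
print(doc_topic.round(2))
```

Each row sums to 1, so a document (or, by analogy, a customer) is described by a mix of topics rather than a single hard label.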

Dimensionality reduction

-[Data science by R] Multidimensional scaling (continued) Non-metric MDS
-Notice of release of python library of high-dimensional vector data search technology "NGT"

Mechanisms that support learning (arguably the main part)

Parameter tuning

-Parameter optimization by grid search from Scikit learn
-Easy tuning with grid search function or option for machine learning with R
-Automatically optimize machine learning hyperparameters, Preferred Networks publishes library
-Hyperparameter automatic optimization tool "Optuna" released
-Optimize CNN hyperparameters with Optuna + Keras
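Grid search, the first technique in the links above, can be sketched with scikit-learn's `GridSearchCV`. The estimator (`SVC`), the iris dataset, and the candidate parameter values are assumptions chosen to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative data: the iris dataset bundled with scikit-learn (an assumption).
X, y = load_iris(return_X_y=True)

# Candidate values are arbitrary; every combination (3 x 2 = 6) is evaluated.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 5-fold cross-validation scores each combination on held-out folds.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The cost grows with the product of the grid sizes, which is one reason tools such as Optuna search the space adaptively instead.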

Feature selection

-[Machine learning] Selection of features using RFE
-Feature engineering for machine learning starting with the 1st Google Colaboratory
-Feature engineering for machine learning starting with the 2nd Google Colaboratory
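RFE (recursive feature elimination), mentioned in the first link, repeatedly fits a model and drops the weakest feature until the requested number remains. Here is a minimal sketch with scikit-learn. The synthetic dataset, the logistic regression estimator, and `n_features_to_select=3` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption): 10 features, 3 of which are informative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# RFE drops the feature with the smallest coefficient magnitude at each step.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for kept, higher for dropped.
print(selector.support_, selector.ranking_)
```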

Other

-Useful tool when using sklearn from pandas
-Pivot table
-Parallel processing
-Save classifiers together
