[PYTHON] I tried clustering with PyCaret

Introduction

I tried out PyCaret, a machine learning library that was released recently. It automates data feature analysis and performance comparison across multiple models, so I expect it to significantly reduce the time data scientists spend on this work.

This time, I will cluster the Mice Protein Expression Data Set (2015) and look at the results.

Data overview: expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.

1. Install PyCaret

Run the command below to install it. I use Anaconda, and I created a virtual environment dedicated to PyCaret before installing. Installing into an existing Conda-managed environment may cause errors (probably due to conflicts between pip and conda).

pip install pycaret
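
Because pip and conda environments are easy to mix up, it can help to confirm that the package really landed in the interpreter you are running. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, not part of PyCaret.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if PyCaret is visible to this interpreter, else None.
print(installed_version("pycaret"))
```

If this prints None inside the environment where you ran pip, the install most likely went to a different environment.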

2. Data acquisition

from pycaret.datasets import get_data
dataset = get_data('mice')

result image.png

Let's take a look at the contents of the data using profile_report() from pandas-profiling.

import pandas_profiling
dataset.profile_report()

result image.png

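Next, split the data: 95% is used for modeling and 5% is held out as test data (called Unseen Data). The held-out rows are selected by dropping the sampled rows by their original index, so the indices are reset only after the split.

data = dataset.sample(frac=0.95, random_state=786)
# Drop the sampled rows by their original index, then renumber both frames
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)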

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

result

Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
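
The split only works as a hold-out if the unseen rows are exactly the rows that were not sampled. Here is a minimal pure-Python sketch of that idea, not pandas or PyCaret code. The row count (1080) and the 95% fraction come from the example above; `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, frac, seed):
    """Return (modeling, unseen) row indices that never overlap."""
    rng = random.Random(seed)
    n_model = round(n_rows * frac)
    modeling = sorted(rng.sample(range(n_rows), n_model))
    chosen = set(modeling)
    # The unseen rows are whatever was NOT sampled, identified by original index.
    unseen = [i for i in range(n_rows) if i not in chosen]
    return modeling, unseen


modeling, unseen = split_indices(1080, 0.95, seed=786)
print(len(modeling), len(unseen))  # 1026 54
```

The sampled indices differ from what pandas would pick for the same seed; only the sizes and the no-overlap property carry over.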

3. Data preprocessing

Preprocess the data with setup().

from pycaret.clustering import *
data_clust = setup(data, normalize = True, 
                   ignore_features = ['MouseID'],
                   session_id = 123)

normalize = True normalizes the numeric features, ignore_features excludes the 'MouseID' column, and session_id = 123 fixes the random seed.
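
To make the normalization step concrete, here is a minimal sketch of z-score scaling on a single column, assuming the z-score method (which, as far as I know, is PyCaret's default `normalize_method`). The `zscore` helper and the sample values are made up for illustration and are not taken from the mice data.

```python
import statistics


def zscore(values):
    """Scale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


expression = [0.50, 0.75, 1.00, 1.25, 1.50]  # made-up values
scaled = zscore(expression)
print([round(v, 3) for v in scaled])
```

Scaling matters for k-means because the algorithm relies on Euclidean distance, so a feature with a larger numeric range would otherwise dominate the clustering.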

result image.png

4. Generation of analytical model

Select a clustering model and build it with create_model(). This time, I use the kmeans model.

kmeans = create_model('kmeans',num_clusters = 5 )
print(kmeans)

Here the number of clusters is set to 5; the default is 4.

result

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto', random_state=123, tol=0.0001, verbose=0)
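
For intuition about what the kmeans model does, here is a minimal pure-Python sketch of Lloyd's algorithm. It is not the scikit-learn implementation shown in the output above: it swaps k-means++ for a deterministic farthest-point initialization and runs only once. The function names and toy points are assumptions for illustration.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point init (a stand-in for k-means++)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iter=100):
    """Return (centroids, labels) after alternating assign/update steps."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, k=2)
print(labels)     # [0, 0, 0, 0, 1, 1, 1, 1]
print(centroids)  # [(0.5, 0.5), (10.5, 10.5)]
```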

5. Cluster label assignment

Assign a cluster label to each row of the data passed to setup() (1026 samples) using assign_model().

kmean_results = assign_model(kmeans)
kmean_results.head()

result image.png

6. Model visualization

Visualize the clustering results using plot_model().

6.1. PCA Plot

plot_model(kmeans)

result image.png

6.2. Elbow Plot

plot_model(kmeans, plot = 'elbow')

The Elbow Plot suggests a recommended number of clusters. In this case, it indicates 5 as the optimal number of clusters.

result image.png
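
An elbow plot typically charts a within-cluster error measure against the number of clusters and looks for the point where adding clusters stops paying off. The sketch below computes one such measure, the within-cluster sum of squares (inertia), for three hand-assigned labelings of toy data. The labelings are made up rather than fitted, and `inertia` is a hypothetical helper, so treat this as an illustration of the idea and not of what plot_model() does internally.

```python
def inertia(points, labels):
    """Within-cluster sum of squared distances to each cluster mean."""
    total = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        center = [sum(x) / len(members) for x in zip(*members)]
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, center)) for p in members
        )
    return total


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labelings = {
    1: [0, 0, 0, 0, 0, 0, 0, 0],  # everything in one cluster
    2: [0, 0, 0, 0, 1, 1, 1, 1],  # the two obvious groups
    3: [0, 0, 2, 2, 1, 1, 1, 1],  # one group split further
}
for k, labels in labelings.items():
    print(k, inertia(points, labels))
```

On this toy data the error drops sharply from k=1 to k=2 and barely changes at k=3, so the "elbow" sits at k=2.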

6.3. Silhouette Plot

plot_model(kmeans, plot = 'silhouette')

result image.png
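
A silhouette plot shows, for every sample, how much closer it is to its own cluster than to the nearest other cluster. Below is a minimal sketch of the per-sample silhouette coefficient s = (b - a) / max(a, b), where a is the mean distance to the sample's own cluster and b is the lowest mean distance to any other cluster. It assumes at least two clusters and non-identical points; the helper name and toy data are made up for illustration.

```python
import math


def silhouette_scores(points, labels):
    """Return one silhouette coefficient per point (needs >= 2 clusters)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            scores.append(0.0)  # a singleton cluster scores 0 by convention
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(
            sum(d) / len(d) for lab, d in by_cluster.items() if lab != own
        )
        scores.append((b - a) / max(a, b))
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = silhouette_scores(points, labels)
print(round(sum(scores) / len(scores), 3))
```

Values near 1 mean tight, well-separated clusters; values near 0 or below suggest that samples sit between clusters or are assigned to the wrong one.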

6.4. Distribution Plot

plot_model(kmeans, plot = 'distribution', feature = 'class')

result image.png
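
The distribution plot is essentially a count of how the values of one feature (here class) fall into each cluster. A minimal sketch of that tally using only the standard library is shown below. The (cluster, class) pairs are made-up assignments for illustration, not results from the mice data.

```python
from collections import Counter

# (cluster label, class) pairs; made-up assignments for illustration only
rows = [
    ("Cluster 0", "c-CS-m"),
    ("Cluster 0", "c-CS-m"),
    ("Cluster 0", "t-CS-m"),
    ("Cluster 1", "c-SC-s"),
    ("Cluster 1", "c-SC-s"),
    ("Cluster 1", "c-CS-m"),
]

# Count how many rows of each class ended up in each cluster.
counts = Counter(rows)
for (cluster, cls), n in sorted(counts.items()):
    print(cluster, cls, n)
```

If the clusters lined up well with the classes, each cluster would be dominated by one or two class values.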

7. Prediction

unseen_predictions = predict_model(kmeans, data=data_unseen)
unseen_predictions.head()

The Label column holds the cluster predicted for each row.

result image.png
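
For a k-means model, predicting on unseen data amounts to assigning each new row to its nearest fitted centroid. Here is a minimal sketch of that step. The centroid values, the unseen rows, and the `predict` helper are hypothetical, and the sketch assumes the new rows have already been scaled the same way as the training data.

```python
import math


def predict(centroids, rows):
    """Return the index of the nearest centroid for each row."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(row, centroids[c]))
        for row in rows
    ]


centroids = [(0.5, 0.5), (10.5, 10.5)]          # hypothetical fitted centroids
unseen = [(0.2, 0.9), (9.8, 11.1), (4.0, 4.0)]  # hypothetical new rows
print(predict(centroids, unseen))  # [0, 1, 0]
```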

8. Summary

  1. I tried clustering, an unsupervised learning task, with PyCaret.

8.1 List of PyCaret functions used for clustering

  1. Data preprocessing: setup()
  2. Model generation: create_model()
  3. Cluster label assignment: assign_model()
  4. Visualization: plot_model()
  5. Prediction: predict_model()

