I tried out PyCaret, a machine learning library released recently. It automates feature analysis and the comparison of multiple models, which should significantly cut down the time data scientists spend on this kind of work.
This time, we will run clustering on the Mice Protein Expression Data Set (2015) and look at the results.
Data overview: expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.
Run the command below to install it. I use Anaconda, and I created a fresh virtual environment dedicated to PyCaret before installing. Installing into an existing Conda-managed environment may raise an error (probably due to a conflict between pip and conda packages).
pip install pycaret
from pycaret.datasets import get_data
dataset = get_data('mice')
result
Let's take a look at the contents of the data using pandas-profiling's profile_report().
import pandas_profiling
dataset.profile_report()
result
Next, the data is split: 95% for modeling and 5% held back as test data (called Unseen Data).
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
result
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
Preprocess the data with setup().
from pycaret.clustering import *
data_clust = setup(data, normalize = True,
                   ignore_features = ['MouseID'],
                   session_id = 123)
Here we normalize the numeric data, ignore the 'MouseID' feature, and fix the random seed with session_id = 123.
result
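For reference, normalize = True applies z-score scaling to the numeric features (PyCaret's default normalization method, as far as I know), after missing values are imputed. A minimal sketch of roughly the same transformation done by hand with scikit-learn:
from sklearn.preprocessing import StandardScaler

# Rough illustration of what setup(normalize = True) does with its defaults:
# mean-impute the numeric columns, then z-score scale them
numeric_cols = data.select_dtypes(include='number').columns
filled = data[numeric_cols].fillna(data[numeric_cols].mean())
scaled = StandardScaler().fit_transform(filled)
print(scaled.mean(axis=0).round(2)[:5], scaled.std(axis=0).round(2)[:5])  # ~0 mean, ~1 std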
Next, select a clustering model with create_model(). This time we will use the kmeans model.
kmeans = create_model('kmeans', num_clusters = 5)
print(kmeans)
The number of clusters is set to 5 (the default is 4).
result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto', random_state=123, tol=0.0001, verbose=0)
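As the printed output shows, create_model() is wrapping a scikit-learn KMeans estimator. Just as a sketch, the equivalent direct scikit-learn call would look like the following, where X is a stand-in for the normalized feature matrix that setup() prepares internally (illustrative only, not something PyCaret requires you to write):
from sklearn.cluster import KMeans

# Hypothetical: X is the normalized, imputed feature matrix prepared by setup()
km = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300,
            tol=0.0001, random_state=123)
labels = km.fit_predict(X)  # integer cluster label (0-4) for each sample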
Assign a cluster label to each sample in the data passed to setup() (1,026 samples) using assign_model().
kmean_results = assign_model(kmeans)
kmean_results.head()
result
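To get a quick feel for the assignment, it is worth counting how many samples landed in each cluster. In the PyCaret version I used the added column is called 'Cluster'; check kmean_results.columns if yours differs.
# Number of samples assigned to each of the 5 clusters
print(kmean_results['Cluster'].value_counts())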
Visualize the clustering results using plot_model().
6.1. PCA Plot
plot_model(kmeans)
result
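This plot projects the clustered samples onto principal components. A rough equivalent built directly with scikit-learn, reusing the stand-in matrix X and the labels array from the earlier sketch:
from sklearn.decomposition import PCA

# Hypothetical: reduce the normalized features to 2 components for a scatter plot
coords = PCA(n_components=2, random_state=123).fit_transform(X)
print(coords[:5])   # 2-D coordinates of the first few samples
print(labels[:5])   # their cluster labels, used to color the points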
6.2. Elbow Plot
plot_model(kmeans, plot = 'elbow')
The Elbow Plot suggests a recommended number of clusters. In this case, the optimal number of clusters is indicated as 5.
result
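Under the hood, an elbow plot is just the k-means inertia (within-cluster sum of squared distances) evaluated over a range of k; the "elbow" is where adding clusters stops paying off. A minimal sketch of that idea with scikit-learn, using the same stand-in matrix X:
from sklearn.cluster import KMeans

# Hypothetical: compute inertia for k = 2..10 on the normalized feature matrix X
for k in range(2, 11):
    inertia = KMeans(n_clusters=k, random_state=123).fit(X).inertia_
    print(k, round(inertia, 1))  # look for the k where the drop flattens out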
6.3. Silhouette Plot
plot_model(kmeans, plot = 'silhouette')
result
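The silhouette plot shows, per sample, how well it fits its own cluster compared with the nearest other cluster. The average score can also be computed directly with scikit-learn (same stand-in X and labels as above):
from sklearn.metrics import silhouette_score

# Hypothetical: average silhouette over all samples; closer to 1 means better-separated clusters
print(round(silhouette_score(X, labels), 3))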
6.4. Distribution Plot
plot_model(kmeans, plot = 'distribution', feature = 'class')
result
Finally, use predict_model() to assign cluster labels to the unseen data (the 5% held back earlier).
unseen_predictions = predict_model(kmeans, data=data_unseen)
unseen_predictions.head()
The Label column represents the result of the prediction.
result
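Since the dataset also carries the original class column, one quick sanity check is to cross-tabulate the predicted cluster against it. A small sketch, assuming the prediction column is named Label as shown above (it may be called Cluster in other PyCaret versions):
import pandas as pd

# How the predicted clusters line up with the known mouse classes in the unseen data
print(pd.crosstab(unseen_predictions['Label'], unseen_predictions['class']))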