I tried out PyCaret, a machine learning library released recently. It automates feature analysis and the comparison of multiple models, which should significantly cut down the time data scientists spend on this kind of work.
This time, we will run clustering on the Mice Protein Expression Data Set (2015) and look at the results.
Data overview: expression levels of 77 proteins measured in the cerebral cortex of 8 classes of control and Down syndrome mice exposed to context fear conditioning, a task used to assess associative learning.
Run the command below to install it. I use Anaconda, and I created a fresh virtual environment dedicated to PyCaret before installing. Installing into an existing Conda-managed environment may raise an error (probably due to a conflict between pip and conda packages).
pip install pycaret
from pycaret.datasets import get_data
dataset = get_data('mice')
result
Let's take a look at the contents of the data using pandas-profiling's profile_report().
import pandas_profiling
dataset.profile_report()
result
Next, the data is split: 95% for modeling and 5% held back as test data (called Unseen Data).
data = dataset.sample(frac=0.95, random_state=786).reset_index(drop=True)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
result
Data for Modeling: (1026, 82)
Unseen Data For Predictions: (54, 82)
Preprocess the data with setup().
from pycaret.clustering import *
data_clust = setup(data, normalize = True,
                   ignore_features = ['MouseID'],
                   session_id = 123)
Here we normalize the numeric data, ignore the 'MouseID' feature, and fix the random seed with session_id = 123.
result
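For reference, normalize = True applies z-score scaling to the numeric features (PyCaret's default normalization method, as far as I know), after missing values are imputed. A minimal sketch of roughly the same transformation done by hand with scikit-learn:
from sklearn.preprocessing import StandardScaler

# Rough illustration of what setup(normalize = True) does with its defaults:
# mean-impute the numeric columns, then z-score scale them
numeric_cols = data.select_dtypes(include='number').columns
filled = data[numeric_cols].fillna(data[numeric_cols].mean())
scaled = StandardScaler().fit_transform(filled)
print(scaled.mean(axis=0).round(2)[:5], scaled.std(axis=0).round(2)[:5])  # ~0 mean, ~1 std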
Next, select a clustering model with create_model(). This time we will use the kmeans model.
kmeans = create_model('kmeans', num_clusters = 5)
print(kmeans)
The number of clusters is set to 5 (the default is 4).
result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto', random_state=123, tol=0.0001, verbose=0)
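As the printed output shows, create_model() is wrapping a scikit-learn KMeans estimator. Just as a sketch, the equivalent direct scikit-learn call would look like the following, where X is a stand-in for the normalized feature matrix that setup() prepares internally (illustrative only, not something PyCaret requires you to write):
from sklearn.cluster import KMeans

# Hypothetical: X is the normalized, imputed feature matrix prepared by setup()
km = KMeans(n_clusters=5, init='k-means++', n_init=10, max_iter=300,
            tol=0.0001, random_state=123)
labels = km.fit_predict(X)  # integer cluster label (0-4) for each sample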
Assign a cluster label to each sample in the data passed to setup() (1,026 samples) using assign_model().
kmean_results = assign_model(kmeans)
kmean_results.head()
result
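To get a quick feel for the assignment, it is worth counting how many samples landed in each cluster. In the PyCaret version I used the added column is called 'Cluster'; check kmean_results.columns if yours differs.
# Number of samples assigned to each of the 5 clusters
print(kmean_results['Cluster'].value_counts())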
Visualize the clustering results using plot_model().
6.1. PCA Plot
plot_model(kmeans)
result
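This plot projects the clustered samples onto principal components. A rough equivalent built directly with scikit-learn, reusing the stand-in matrix X and the labels array from the earlier sketch:
from sklearn.decomposition import PCA

# Hypothetical: reduce the normalized features to 2 components for a scatter plot
coords = PCA(n_components=2, random_state=123).fit_transform(X)
print(coords[:5])   # 2-D coordinates of the first few samples
print(labels[:5])   # their cluster labels, used to color the points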
6.2. Elbow Plot
plot_model(kmeans, plot = 'elbow')
The Elbow Plot suggests a recommended number of clusters. In this case, the optimal number of clusters is indicated as 5.
result
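Under the hood, an elbow plot is just the k-means inertia (within-cluster sum of squared distances) evaluated over a range of k; the "elbow" is where adding clusters stops paying off. A minimal sketch of that idea with scikit-learn, using the same stand-in matrix X:
from sklearn.cluster import KMeans

# Hypothetical: compute inertia for k = 2..10 on the normalized feature matrix X
for k in range(2, 11):
    inertia = KMeans(n_clusters=k, random_state=123).fit(X).inertia_
    print(k, round(inertia, 1))  # look for the k where the drop flattens out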
6.3. Silhouette Plot
plot_model(kmeans, plot = 'silhouette')
result
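The silhouette plot shows, per sample, how well it fits its own cluster compared with the nearest other cluster. The average score can also be computed directly with scikit-learn (same stand-in X and labels as above):
from sklearn.metrics import silhouette_score

# Hypothetical: average silhouette over all samples; closer to 1 means better-separated clusters
print(round(silhouette_score(X, labels), 3))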
6.4. Distribution Plot
plot_model(kmeans, plot = 'distribution', feature = 'class')
result
Finally, use predict_model() to assign cluster labels to the unseen data (the 5% held back earlier).
unseen_predictions = predict_model(kmeans, data=data_unseen)
unseen_predictions.head()
The Label column represents the result of the prediction.
result
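Since the dataset also carries the original class column, one quick sanity check is to cross-tabulate the predicted cluster against it. A small sketch, assuming the prediction column is named Label as shown above (it may be called Cluster in other PyCaret versions):
import pandas as pd

# How the predicted clusters line up with the known mouse classes in the unseen data
print(pd.crosstab(unseen_predictions['Label'], unseen_predictions['class']))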