Introduction

Machine learning experiments are difficult to manage because you have to track not only the training code but also the datasets, the artifacts produced by preprocessing, the trained models, and so on, all together. Proper experiment management also matters when you move code that worked in the experimental stage into production and need to reproduce similar prediction results.

MLflow is well known for machine learning experiment management, but I came across another experiment management tool called ClearML (formerly Allegro Trains), so this article gives a brief introduction to using it.

ClearML: https://github.com/allegroai/clearml (Apache-2.0 License)
Official documentation: https://allegro.ai/clearml/docs/index.html#

The following articles are also very helpful for understanding the concept of experiment management.
--Thinking about experiment management
--Re: ML life starting from zero

Summary

ClearML is a tool that provides machine learning experiment management and MLOps functions. It supports time-consuming and error-prone tasks related to development and version tracking in the machine learning life cycle.

ClearML has the following three main functions.

--Experiment management --Automatic experiment tracking, including the environment and training results
--MLOps --Automation, pipelines, and orchestration of ML/DL jobs
--Data management --Data management and version control on object storage (S3/GS/Azure/NAS)

Of these three functions, this article mainly covers experiment management. The ClearML architecture is briefly described at the end.

I have tried using ClearML and confirmed that the following information can be managed as experiment management.

--Code version
  --The commit ID of the code used for training and the library versions are logged
--Data version
  --Intermediate artifacts and models produced by a run can be managed
--Hyperparameters
  --Python argparse parameters are logged automatically
--Metrics
  --Common metrics such as loss, accuracy, and confusion matrices can be recorded
--Environment
  --The working directory location on the machine used for training is logged
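To make the list above concrete, here is a minimal hand-rolled sketch of the kind of record an experiment tracker collects for the code version and environment. It is not how ClearML gathers this information internally; the function names `current_commit_id` and `snapshot_run` are hypothetical and only the Python standard library is used.

```python
import platform
import subprocess
import sys
from pathlib import Path


def current_commit_id(repo_dir="."):
    """Return the Git commit ID of repo_dir, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a Git repository
        return None
    return result.stdout.strip() or None


def snapshot_run(repo_dir="."):
    """Collect a few of the facts an experiment tracker typically records."""
    return {
        "commit_id": current_commit_id(repo_dir),
        "python_version": platform.python_version(),
        "script": sys.argv[0],
        "working_dir": str(Path.cwd()),
    }


record = snapshot_run()
for key, value in record.items():
    print(f"{key}: {value}")
```

Writing and maintaining this by hand for every script is exactly the chore that a tool like ClearML is meant to remove.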

Tested environment

Ubuntu 18.04
Python 3.7
ClearML 0.17.2

Please note that newer versions of ClearML may not behave as described in this article.

Set up a free ClearML host service

This article uses the free, externally hosted ClearML server. The setup follows the document below. https://allegro.ai/clearml/docs/docs/getting_started/getting_started_clearml_hosted_service.html

You can apparently also set up your own ClearML server on-premises or on AWS or GCP, so if you have security requirements, follow the procedure in the document below. https://allegro.ai/clearml/docs/rst/deploying_clearml/index.html

--Sign up at the following site to register an account.
--You can apparently register with a Google, Bitbucket, or GitHub account.

https://app.community.clear.ml/login?redirect=%2F

--Enter your name, email, interests, etc. and click "SIGN UP" to register your account.

--Run the following command to install clearml.

pip install clearml

--Execute the following command to start the ClearML setup wizard.

clearml-init

--A message asks you to create account credentials, so obtain them: click User Account > Profile in the upper-right corner of the hosted service's web screen.

--Click Create new credentials> Copy to clipboard.

--When you paste the copied credentials into the terminal, a message confirms that they were detected, as shown below.

Detected credentials key="********************" secret="*******"

--Specify the URL of the web server. Here, press Enter to accept the default.

WEB Host configured to: [https://app.community.clear.ml]

--Next, specify the URL of the API server. Keep the defaults and press Enter.

API Host configured to: [https://api.community.clear.ml]

--The following message will be displayed, and the setup is complete.

CLEARML Hosts configuration:
Web App: https://app.community.clear.ml
API: https://api.community.clear.ml
File Store: https://files.community.clear.ml

Verifying credentials ...
Credentials verified!

New configuration stored in /home/<username>/clearml.conf
CLEARML setup completed successfully.
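As a quick sanity check after the wizard finishes, the sketch below reads the host settings back from the generated file. It assumes the file contains lines of the form `api_server: <url>`, `web_server: <url>` and `files_server: <url>`, which is an assumption about the file layout rather than something shown in this article; `read_hosts` is a hypothetical helper, and the path is taken to be `~/clearml.conf` as in the wizard output above.

```python
import re
from pathlib import Path

# Assumed layout: one "<name>: <url>" (or "<name> = <url>") setting per line.
HOST_LINE = re.compile(
    r'^\s*(api_server|web_server|files_server)\s*[:=]\s*"?([^"\s]+)"?\s*$'
)


def read_hosts(conf_text):
    """Return the host settings found in the text of a clearml.conf file."""
    hosts = {}
    for line in conf_text.splitlines():
        match = HOST_LINE.match(line)
        if match:
            hosts[match.group(1)] = match.group(2)
    return hosts


conf_path = Path.home() / "clearml.conf"
if conf_path.exists():
    print(read_hosts(conf_path.read_text()))
else:
    print(f"{conf_path} not found; run clearml-init first")
```

If the printed hosts do not match the ones shown by the wizard, rerunning clearml-init is probably the simplest fix.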

Try the Reporting Tutorial

--ClearML ships with tutorial code, so clone the repository.

cd ~
git clone https://github.com/allegroai/clearml.git
cd ~/clearml/examples/frameworks/pytorch
pip install -r requirements.txt
pip install pandas scikit-learn

--There is a script for Reporting Tutorial called pytorch_mnist.py, so copy it and rename the file.

cp pytorch_mnist.py pytorch_mnist_tutorial.py

Set the directory where model checkpoints are saved

--The directory where model checkpoints are written can be set by passing output_uri to Task.init.
--Change the following line.

task = Task.init(project_name='examples', task_name='pytorch mnist train')

--With the following change, checkpoints are saved under ./clearml.

model_snapshots_path = './clearml'
# Create the output directory if it does not exist yet (uses the os module)
os.makedirs(model_snapshots_path, exist_ok=True)

task = Task.init(project_name='examples',
    task_name='extending automagical ClearML example',
    output_uri=model_snapshots_path)

--When you run the script, ClearML will create the following directory structure.

+ - <output destination name>
|   +-- <project name>
|       +-- <task name>.<Task Id>
|           +-- models
|           +-- artifacts

Set up Logger

In addition to automatic logging, ClearML supports explicit reporting of plots, log text, tables, and more. https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-2-logger-class-reporting-methods

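--The logger can be obtained from the Task as follows.

logger = task.get_logger()

--Alternatively, the logger of the current task can be obtained anywhere in the code as follows.

logger = Logger.current_logger()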

--Use the Logger.report_scalar method to log scalar metrics as follows:

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            Logger.current_logger().report_scalar(
                "train", "loss", iteration=(epoch * len(train_loader) + batch_idx), value=loss.item())
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
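The iteration value passed to report_scalar in the training loop is a running batch counter. The sketch below isolates that arithmetic and passes the reporting function in as an argument, so the logic can be tried without a ClearML server. `global_iteration`, `report_train_loss` and `fake_report_scalar` are hypothetical names, and the numbers are made up for illustration.

```python
def global_iteration(epoch, batch_idx, batches_per_epoch):
    """Running batch index, as in epoch * len(train_loader) + batch_idx."""
    return epoch * batches_per_epoch + batch_idx


def report_train_loss(report_scalar, epoch, batch_idx, batches_per_epoch, loss_value):
    """Report one training loss value through the given reporting callable."""
    report_scalar(
        "train",
        "loss",
        iteration=global_iteration(epoch, batch_idx, batches_per_epoch),
        value=loss_value,
    )


# Stand-in for Logger.current_logger().report_scalar that only records calls.
recorded = []


def fake_report_scalar(title, series, value, iteration):
    recorded.append((title, series, iteration, value))


report_train_loss(fake_report_scalar, epoch=1, batch_idx=10,
                  batches_per_epoch=938, loss_value=0.25)
print(recorded)  # [('train', 'loss', 948, 0.25)]
```

In the real script, Logger.current_logger().report_scalar would be passed in place of fake_report_scalar. Note that with epochs counted from 1, the first reported iteration is batches_per_epoch rather than 0.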

--Metrics other than scalar values, such as histograms and confusion matrices, can be reported as follows.

def test(args, model, device, test_loader, epoch):
    save_test_loss = []
    save_correct = []
    preds = []
    targets = []
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True)  # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

            preds.append(pred.cpu().detach().numpy())
            targets.append(target.cpu().detach().numpy())

            save_test_loss.append(test_loss)
            save_correct.append(correct)

    test_loss /= len(test_loader.dataset)

    Logger.current_logger().report_scalar(
        "test", "loss", iteration=epoch, value=test_loss)
    Logger.current_logger().report_scalar(
        "test", "accuracy", iteration=epoch, value=(correct / len(test_loader.dataset)))
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    
    preds = np.concatenate(preds)
    targets = np.concatenate(targets)
    matrix = confusion_matrix(targets, preds)  # scikit-learn: rows are true labels, columns are predictions
    Logger.current_logger().report_confusion_matrix(title='Confusion matrix example',
        series='Test loss / correct', matrix=matrix, iteration=1,
        xaxis='pred', yaxis='correct', yaxis_reversed=True)
    
    Logger.current_logger().report_histogram(title='Histogram example', series='correct',
        iteration=1, values=save_correct, xaxis='Test', yaxis='Correct')
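To see what the matrix handed to report_confusion_matrix contains, here is a pure-Python sketch that counts (true label, predicted label) pairs using the same orientation as scikit-learn's confusion_matrix: rows are true labels and columns are predictions. `confusion_counts` is a hypothetical helper and the labels are toy data.

```python
def confusion_counts(targets, preds, num_classes):
    """Count label pairs; matrix[t][p] is the number of samples with true label t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(targets, preds):
        matrix[true_label][predicted_label] += 1
    return matrix


targets = [0, 1, 2, 2, 1]
preds = [0, 2, 2, 2, 1]
matrix = confusion_counts(targets, preds, num_classes=3)
for row in matrix:
    print(row)
# [1, 0, 0]
# [0, 1, 1]
# [0, 0, 2]
```

The diagonal holds the correct predictions, so in this toy data the single off-diagonal entry in row 1, column 2 is a sample of class 1 that was predicted as class 2.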

--You can also use Logger.report_text to output a text message at the log level given by the level argument (the logging module must be imported).

Logger.current_logger().report_text('The default output destination for model snapshots and artifacts is: {}'.format(model_snapshots_path), level=logging.DEBUG)

Register an artifact

Artifacts can also be uploaded to ClearML Server by registering them while the script runs. If a registered artifact changes, ClearML Server logs the change. However, as of December 29, 2020, only Pandas DataFrames are supported. https://allegro.ai/clearml/docs/docs/tutorials/tutorial_explicit_reporting.html#step-3-registering-artifacts

--To register an artifact, add the following code to the test function.

# Create the Pandas DataFrame
test_loss_correct = {
        'test lost': save_test_loss,
        'correct': save_correct
}
df = pd.DataFrame(test_loss_correct, columns=['test lost','correct'])

# Register the test loss and correct as a Pandas DataFrame artifact
Task.current_task().register_artifact('Test_Loss_Correct', df, metadata={'metadata string': 'apple', 
    'metadata int': 100, 'metadata dict': {'dict string': 'pear', 'dict int': 200}})

--A registered artifact can be retrieved from Python code as follows and used in later processing.

# Once the artifact is registered, we can get it and work with it. Here, we sample it.
sample = Task.current_task().get_registered_artifacts()['Test_Loss_Correct'].sample(frac=0.5, 
    replace=True, random_state=1)

Upload an artifact

You can upload artifacts generated by the script to ClearML with the Task.upload_artifact method. Unlike the registration above, however, uploaded artifacts are not tracked for changes.

--Put the following code in the test function to upload the accumulated test loss values as an artifact named Predictions.

# Upload the test loss as an artifact. Here, the artifact is a NumPy array
Task.current_task().upload_artifact('Predictions', artifact_object=np.array(save_test_loss),
    metadata={'metadata string': 'banana', 'metadata integer': 300,
    'metadata dictionary': {'dict string': 'orange', 'dict int': 400}})

Run the Reporting script

--Run the script with the following command. ClearML's log output and the model's training loss are displayed during execution.

python3 pytorch_mnist_tutorial.py

--In this case, the models are saved as follows.

ls clearml/examples/extending\ automagical\ ClearML\ example.13e46b70da274fa085e772ed700df028/models/
mnist_cnn.pt  test.pt  training.pt
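Because the task ID in the directory name is not known in advance, it can be handy to search for the saved models by task name. The following is a minimal sketch under the directory layout described earlier; `list_model_files` is a hypothetical helper, and the demonstration runs on a temporary directory with a made-up task ID rather than on real ClearML output.

```python
import glob
import tempfile
from pathlib import Path


def list_model_files(output_root, project_name, task_name):
    """Return {task directory name: [model file names]} for every run of task_name."""
    project_dir = Path(output_root) / project_name
    # Escape glob metacharacters in the task name, then match any task ID suffix.
    pattern = f"{glob.escape(task_name)}.*"
    runs = {}
    for task_dir in sorted(project_dir.glob(pattern)):
        models_dir = task_dir / "models"
        if models_dir.is_dir():
            runs[task_dir.name] = sorted(
                path.name for path in models_dir.iterdir() if path.is_file()
            )
    return runs


# Demonstration on a throwaway directory that mimics the layout.
with tempfile.TemporaryDirectory() as root:
    models = Path(root) / "examples" / "pytorch mnist train.0123abcd" / "models"
    models.mkdir(parents=True)
    (models / "mnist_cnn.pt").touch()
    found = list_model_files(root, "examples", "pytorch mnist train")
    print(found)  # {'pytorch mnist train.0123abcd': ['mnist_cnn.pt']}
```

For the run in this article, the call would be list_model_files('./clearml', 'examples', 'extending automagical ClearML example').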

Checking learning results on the web screen

--You can check the training results on the web screen. Because project_name='examples' was passed to Task.init, click the examples project on the web screen.

--Because task_name='extending automagical ClearML example' was passed to Task.init, click the experiment displayed as extending automagical ClearML example, which corresponds to this training run.

--In the EXECUTION tab of EXPERIMENTS, you can check information about the source code that was run for training. The file name of the executed script and the commit ID are logged so that the experiment can be reproduced.

--In CONFIGURATION, you can check the hyperparameters logged for the training run. I found this useful because I don't need to write my own logging code for hyperparameters.

--ARTIFACTS shows information about the output models and artifacts.

--RESULTS shows the logs for scalar values and plots, including the plots of how loss and accuracy change during training.

--The confusion matrix plot is also shown there.
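To get a feel for what appears under CONFIGURATION, the sketch below builds a small argparse parser and prints the parsed values as a dictionary, which is the kind of name/value data that is captured automatically. The flag names and defaults are hypothetical and are not claimed to match those of pytorch_mnist.py.

```python
import argparse


def build_parser():
    """Build a parser with a few typical training options (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Hypothetical training options")
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.01)
    parser.add_argument("--log-interval", type=int, default=10)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--epochs", "3"])
hyperparameters = vars(args)
print(hyperparameters)
# {'batch_size': 64, 'epochs': 3, 'lr': 0.01, 'log_interval': 10}
```

Both the values given on the command line and the defaults are part of the parsed namespace, so a tracker that reads it sees the complete set of hyperparameters for the run.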

That's it for the Reporting Tutorial, which logs metrics and artifacts.

ClearML Architecture

ClearML consists of the following components.

ClearML Python Package (clearml) --Integrates your existing script with ClearML by adding a few lines of code
ClearML Server (clearml-server) --Stores experiment, model, and workflow data. It also provides experiment management through the Web UI and MLOps automation for reproducibility and tuning.
ClearML Agent (clearml-agent) --The MLOps orchestration agent, responsible for orchestrating experiments and workflows

Source: https://allegro.ai/clearml/docs/rst/architecture/index.html

The ClearML Server used in this article is the free externally hosted one. As noted earlier, you can also run ClearML Server yourself, either on-premises or in a cloud such as AWS or GCP.

Another apparent advantage is that the same few lines of code work both in the data scientist's environment and on GPU machines (on-premises or cloud), which appear on the left of the architecture diagram in the documentation.

What I liked about ClearML

--Easy to use: just pip install it and add a few lines of code
--Easy to get started with the free externally hosted server
--By setting up your own server, you can use it both on-premises and in the cloud
--The example code is extensive
  --https://github.com/allegroai/clearml/tree/master/examples
--Supports various frameworks such as PyTorch, PyTorch Lightning, TensorFlow, Keras, and AutoKeras
  --https://allegro.ai/clearml/docs/rst/integrations/index.html
--The web UI looks good
--There appear to be MLOps-style features, for example iterative hyperparameter tuning

References

--ClearML: https://github.com/allegroai/clearml (Apache-2.0 License)
--ClearML official documentation: https://allegro.ai/clearml/docs/index.html#
--Thinking about experiment management
--Re: ML life starting from zero

Disclaimer

The author has taken care over the content of this article but does not guarantee that it is accurate or safe, and accepts no responsibility for it. The author and the organization to which the author belongs (NS Solutions Corporation) shall not be liable for any inconvenience or damage caused to users by the use of this article's content.

[PYTHON] Introduction to ClearML-Easy to manage machine learning experiments-