--Kubeflow v1.0 RC had been released at the time of writing this article.
--In my environment, some v0.7.1 functions such as Pipelines and Katib do not work, so I would like to try v1.0 RC as soon as possible...
--But since I had already written this article about v0.7.1, I am publishing it as a send-off for that version (most of the steps carry over to v1.0).

This article is a two-parter:

--How to install Kubeflow v0.7.1
--Using my own Notebook container image, and mounting data on the node (Kaggle's Titanic dataset) into the Notebook environment

The Pipelines and Katib features I was initially hoping for did not work well, so the article ends with "let's try v1.0 RC next time" (I wanted to build a preprocessing Pipeline and do parameter tuning with Kaggle's Titanic data).
While working on machine learning projects, I had felt problems with things like:

--Quality control of machine learning models in production
--Automation of data processing
--Management of ad hoc data analysis artifacts

At first I built tools myself to deal with them, but data analysis work (my main job) demanded more and more of my time, and eventually the tools were left unattended. No one could use the obsolete tools, and no one else could run accuracy verification or data preprocessing for training.

I decided to try Kubeflow, wondering whether this tool I often hear about in the context of MLOps could improve the situation.
In this article, I prepare a Kubernetes and Kubeflow environment on Ubuntu 18.04. Docker is assumed to have been set up separately.
Before Kubeflow, prepare a Kubernetes (k8s) environment.

There are various ways to build one, but this time I used microk8s, which is the simplest.

(As an aside, from Kubeflow v1.0 RC, Kubeflow is provided as an add-on for microk8s. Choosing microk8s turned out to be a good call.)
Please refer to the following site for the installation steps: https://v0-7.kubeflow.org/docs/other-guides/virtual-dev/getting-started-multipass/

The installation is scripted and takes just six lines.
git clone https://github.com/canonical-labs/kubernetes-tools
cd kubernetes-tools
git checkout eb91df0 # Pin this commit, as the scripts may be updated for v1.0
sudo ./setup-microk8s.sh
microk8s.enable registry #Required to use your own Notebook image
microk8s.enable gpu #If you have a GPU
This script installs microk8s v1.15. Note that the latest microk8s installed by a plain snap install is v1.17, which Kubeflow v0.7 does not support.
After installation, run

kubectl get pod --all-namespaces

and make sure all pods are Running.
Accessing the k8s dashboard is useful for debugging.

Type the following commands and record the TOKEN that is output:
token=$(microk8s.kubectl -n kube-system get secret | grep default-token | cut -d " " -f1)
microk8s.kubectl -n kube-system describe secret $token
Set up port forwarding:
microk8s.kubectl port-forward -n kube-system service/kubernetes-dashboard 10443:443 --address=0.0.0.0
Go to https://<hostname>:10443 and sign in with the TOKEN recorded earlier.

If the Dashboard is displayed, you're all set.
Follow the steps in https://v0-7.kubeflow.org/docs/started/k8s/kfctl-k8s-istio/.

(There is also a script collection called kubeflow-tools, similar to the kubernetes-tools used earlier, but I didn't use it because the Kubeflow version it installs is old.)
wget https://github.com/kubeflow/kubeflow/releases/download/v0.7.1/kfctl_v0.7.1-2-g55f9b2a_linux.tar.gz
tar -xvf kfctl_v0.7.1-2-g55f9b2a_linux.tar.gz
# Put the kfctl executable in your PATH
export PATH=$PATH:"<path-to-kfctl>"
# Name the deployment appropriately (I used 'kf-yums')
export KF_NAME=<your choice of name for the Kubeflow deployment>
# Directory to place yaml files etc. (I used ~/.local/)
export BASE_DIR=<path to a base directory>
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml"
mkdir -p ${KF_DIR}
cd ${KF_DIR}
# It may not succeed on the first try. Retry several times.
kfctl apply -V -f ${CONFIG_URI}
If you run kubectl get pod --all-namespaces, you will see that many containers have been created. Wait a while until everything is Running.
Set up port forwarding:

# Anyone can access this, so restrict access as appropriate
kubectl port-forward -n istio-system svc/istio-ingressgateway 10080:80 --address 0.0.0.0
If you access port 10080 with http, the Dashboard (Welcome screen) will appear.
Note that this configuration poses a security concern: anyone who knows the URL can access it. When using it in-house, it may be better to put authentication such as Dex in front of it, or restrict access at the port-forwarding step.
As you proceed, you will be taken to the Namespace creation screen; set it as you like (I chose kf-yums).

Press Finish and you can access the Kubeflow Dashboard.
One of Kubeflow's features is hosting of Jupyter Notebooks. Anyone can easily spin up their own Notebook environment just by specifying the required resources (memory, CPU, GPU) and the environment (Docker image), which cuts down the time spent building analysis infrastructure.

Here I use this feature to host my own Docker image with Jupyter Notebook. I also make data on the node running k8s accessible from the notebook.
I recommend referring to https://www.kubeflow.org/docs/notebooks/custom-notebook/.

For now, figuring I would try Random Forest or the like, I created a Dockerfile like the one below.
FROM python:3.8-buster
RUN pip --no-cache-dir install pandas numpy scikit-learn jupyter
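# Kubeflow sets NB_PREFIX to the URL prefix under which the notebook is served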
ENV NB_PREFIX /
EXPOSE 8888
CMD ["sh","-c", "jupyter notebook --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]
Build.
docker build -t myimage .
To push this image to the microk8s container registry (localhost:32000), edit daemon.json as follows.

> sudo vim /etc/docker/daemon.json

Add the following:
{
"insecure-registries" : ["localhost:32000"]
}
After adding, restart Docker and push it to the microk8s registry.
sudo systemctl restart docker
docker tag myimage:latest localhost:32000/myimage:latest
docker push localhost:32000/myimage:latest
You can confirm that the pushed image is in the registry as follows:
microk8s.ctr -n k8s.io images ls | grep myimage
The Notebook image created above does not contain the input data (it usually shouldn't).

Create a PV and PVC in advance so that the input data on the node can be referenced from within the notebook.

Here, let's make Kaggle's Titanic dataset, located in /data/titanic/, visible from the Notebook.
# Assumes the data has been downloaded from Kaggle in advance
> find /data/titanic
/data/titanic/gender_submission.csv
/data/titanic/train.csv
/data/titanic/test.csv
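Before wiring these files into k8s, it doesn't hurt to sanity-check them. A minimal stdlib-only Python sketch, assuming the standard Kaggle Titanic split:

import csv

# Peek at the training data before exposing it to the cluster
with open("/data/titanic/train.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])        # header: PassengerId, Survived, Pclass, ...
print(len(rows) - 1)  # 891 passengers in the standard train split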
Define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) as follows.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: titanic-pv
  namespace: <namespace created on kubeflow>
spec:
  storageClassName: standard
  capacity:
    storage: 1Gi
  claimRef:
    namespace: <namespace created on kubeflow>
    name: titanic-pvc
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/titanic/
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: titanic-pvc
  namespace: <namespace created on kubeflow>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Save this as titanic_data.yaml and apply it:

kubectl apply -f titanic_data.yaml
The Volume is now created. By specifying this Volume when creating a Notebook Server, you can access the data in it from the Notebook.
From the Kubeflow Dashboard, go to Notebook Servers -> NEW SERVER.

On the Notebook creation screen, check Custom Image under the Image item and specify the image pushed earlier:

localhost:32000/myimage:latest

Set the Name of the Notebook server as you like.
In the DataVolumes item, specify the PVC created earlier.
After setting the above, press CREATE at the bottom, and the Notebook Server will start up in just a few seconds. (If it doesn't start, something has gone wrong behind the scenes; you can check on the k8s dashboard.)
Press CONNECT and the familiar Jupyter Notebook screen will appear.

Of course, the Titanic data is mounted as well.

After that, create notebooks as usual and analyze away.
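For example, a first cell might look like the following. This is a minimal sketch of my own, not something from the Kubeflow docs: the mount point (assumed here to be /home/jovyan/titanic-pvc) depends on the volume name entered on the creation screen, and the feature handling is deliberately crude.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical mount point; check the actual one with `ls /home/jovyan`
DATA_DIR = "/home/jovyan/titanic-pvc"

train = pd.read_csv(f"{DATA_DIR}/train.csv")

# Very rough preprocessing: a few numeric features plus encoded sex and imputed age
X = train[["Pclass", "SibSp", "Parch", "Fare"]].copy()
X["Sex"] = (train["Sex"] == "female").astype(int)
X["Age"] = train["Age"].fillna(train["Age"].median())
y = train["Survived"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # rough CV accuracy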
Hosting a Notebook Server is just one of Kubeflow's many features. I was particularly interested in two of them:

--Pipelines, which can run data processing, training, and inference and track various KPIs
--Katib, which performs parameter tuning

But these two... did not work in my environment.
Pipelines fails with an error saying Docker is not available when running a job. The main cause is that microk8s uses containerd instead of Docker as its container runtime. An issue had been filed about this as well.
When Katib runs a job, the following messages are output and the process does not proceed:

INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.

There is an issue about this too, although it has been closed.
Both look like they could be fixed by an update, so I will try Kubeflow v1.0 RC right away.