--Kubeflow v1.0 RC had been released at the time of writing this article.
--In my environment, some v0.7.1 functions such as Pipelines and Katib do not work, so I would like to try v1.0 RC as soon as possible...
--But since I had already written this article about v0.7.1, I am publishing it as a send-off for that version (most of the steps carry over to v1.0).

This article is a two-parter:

--How to install Kubeflow v0.7.1
--Using my own Notebook container image, and mounting data on the node (Kaggle's Titanic dataset) into the Notebook environment

The Pipelines and Katib features I was initially hoping for did not work well, so the article ends with "let's try v1.0 RC next time" (I wanted to build a preprocessing Pipeline and do parameter tuning with Kaggle's Titanic data).
While working on machine learning projects, I had felt problems with things like:

--Quality control of machine learning models in production
--Automation of data processing
--Management of ad hoc data analysis artifacts

At first I built tools myself to deal with them, but data analysis work (my main job) demanded more and more of my time, and eventually the tools were left unattended. No one could use the obsolete tools, and no one else could run accuracy verification or data preprocessing for training.

I decided to try Kubeflow, wondering whether this tool I often hear about in the context of MLOps could improve the situation.
In this article, I prepare a Kubernetes and Kubeflow environment on Ubuntu 18.04. Docker is assumed to have been set up separately.
Before Kubeflow, prepare a Kubernetes (k8s) environment.

There are various ways to build one, but this time I used microk8s, which is the simplest.

(As an aside, from Kubeflow v1.0 RC, Kubeflow is provided as an add-on for microk8s. Choosing microk8s turned out to be a good call.)
Please refer to the following site for the installation steps: https://v0-7.kubeflow.org/docs/other-guides/virtual-dev/getting-started-multipass/

The installation is scripted and takes just six lines.
git clone https://github.com/canonical-labs/kubernetes-tools
cd kubernetes-tools
git checkout eb91df0 # Pin this commit, as the scripts may be updated for v1.0
sudo ./setup-microk8s.sh
microk8s.enable registry #Required to use your own Notebook image
microk8s.enable gpu #If you have a GPU
This script installs microk8s v1.15. Note that the latest microk8s installed by a plain snap install is v1.17, which Kubeflow v0.7 does not support.
After installation, run

kubectl get pod --all-namespaces

and make sure all pods are Running.
Accessing the k8s dashboard is useful for debugging.

Type the following commands and record the TOKEN that is output:
token=$(microk8s.kubectl -n kube-system get secret | grep default-token | cut -d " " -f1)
microk8s.kubectl -n kube-system describe secret $token
Set up port forwarding:
microk8s.kubectl port-forward -n kube-system service/kubernetes-dashboard 10443:443 --address=0.0.0.0
Go to https://<hostname>:10443 and sign in with the TOKEN recorded earlier.

If the Dashboard is displayed, you're all set.
Follow the steps in https://v0-7.kubeflow.org/docs/started/k8s/kfctl-k8s-istio/.

(There is also a script collection called kubeflow-tools, similar to the kubernetes-tools used earlier, but I didn't use it because the Kubeflow version it installs is old.)
wget https://github.com/kubeflow/kubeflow/releases/download/v0.7.1/kfctl_v0.7.1-2-g55f9b2a_linux.tar.gz
tar -xvf kfctl_v0.7.1-2-g55f9b2a_linux.tar.gz
# Put the kfctl executable in your PATH
export PATH=$PATH:"<path-to-kfctl>"
# Name the deployment appropriately (I used 'kf-yums')
export KF_NAME=<your choice of name for the Kubeflow deployment>
# Directory to place yaml files etc. (I used ~/.local/)
export BASE_DIR=<path to a base directory>
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_k8s_istio.0.7.1.yaml"
mkdir -p ${KF_DIR}
cd ${KF_DIR}
# It may not succeed on the first try. Retry several times.
kfctl apply -V -f ${CONFIG_URI}
If you run kubectl get pod --all-namespaces, you will see that many containers have been created. Wait a while until everything is Running.
Set up port forwarding:

# Anyone can access this, so restrict access as appropriate
kubectl port-forward -n istio-system svc/istio-ingressgateway 10080:80 --address 0.0.0.0
If you access port 10080 with http, the Dashboard (Welcome screen) will appear.
Note that this configuration poses a security concern: anyone who knows the URL can access it. When using it in-house, it may be better to put authentication such as Dex in front of it, or restrict access at the port-forwarding step.
As you proceed, you will be taken to the Namespace creation screen; set it as you like (I chose kf-yums).

Press Finish and you can access the Kubeflow Dashboard.
One of Kubeflow's features is hosting of Jupyter Notebooks. Anyone can easily spin up their own Notebook environment just by specifying the required resources (memory, CPU, GPU) and the environment (Docker image), which cuts down the time spent building analysis infrastructure.

Here I use this feature to host my own Docker image with Jupyter Notebook. I also make data on the node running k8s accessible from the notebook.
I recommend referring to https://www.kubeflow.org/docs/notebooks/custom-notebook/.

For now, figuring I would try Random Forest or the like, I created a Dockerfile like the one below.
FROM python:3.8-buster
RUN pip --no-cache-dir install pandas numpy scikit-learn jupyter
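# Kubeflow sets NB_PREFIX to the URL prefix under which the notebook is served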
ENV NB_PREFIX /
EXPOSE 8888
CMD ["sh","-c", "jupyter notebook --notebook-dir=/home/jovyan --ip=0.0.0.0 --no-browser --allow-root --port=8888 --NotebookApp.token='' --NotebookApp.password='' --NotebookApp.allow_origin='*' --NotebookApp.base_url=${NB_PREFIX}"]
Build.
docker build -t myimage .
To push this image to the microk8s container registry (localhost:32000), edit daemon.json as follows.

> sudo vim /etc/docker/daemon.json

Add the following:
{
"insecure-registries" : ["localhost:32000"]
}
After adding, restart Docker and push it to the microk8s registry.
sudo systemctl restart docker
docker tag myimage:latest localhost:32000/myimage:latest
docker push localhost:32000/myimage:latest
You can confirm that the pushed image is in the registry as follows:
microk8s.ctr -n k8s.io images ls | grep myimage
The Notebook image created above does not contain the input data (it usually shouldn't).

Create a PV and PVC in advance so that the input data on the node can be referenced from within the notebook.

Here, let's make Kaggle's Titanic dataset, located in /data/titanic/, visible from the Notebook.
# Assumes the data has been downloaded from Kaggle in advance
> find /data/titanic
/data/titanic/gender_submission.csv
/data/titanic/train.csv
/data/titanic/test.csv
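Before wiring these files into k8s, it doesn't hurt to sanity-check them. A minimal stdlib-only Python sketch, assuming the standard Kaggle Titanic split:

import csv

# Peek at the training data before exposing it to the cluster
with open("/data/titanic/train.csv", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])        # header: PassengerId, Survived, Pclass, ...
print(len(rows) - 1)  # 891 passengers in the standard train split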
Define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) as follows.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: titanic-pv
  namespace: <namespace created on kubeflow>
spec:
  storageClassName: standard
  capacity:
    storage: 1Gi
  claimRef:
    namespace: <namespace created on kubeflow>
    name: titanic-pvc
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: /data/titanic/
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: titanic-pvc
  namespace: <namespace created on kubeflow>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
Save this as titanic_data.yaml and apply it:

kubectl apply -f titanic_data.yaml
The Volume is now created. By specifying this Volume when creating a Notebook Server, you can access the data in it from the Notebook.
From the Kubeflow Dashboard, go to Notebook Servers -> NEW SERVER.

On the Notebook creation screen, check Custom Image under the Image item and specify the image pushed earlier:

localhost:32000/myimage:latest

Set the Name of the Notebook server as you like.
In the DataVolumes item, specify the PVC created earlier.
After setting the above, press CREATE at the bottom, and the Notebook Server will start up in just a few seconds. (If it doesn't start, something has gone wrong behind the scenes; you can check on the k8s dashboard.)
Press CONNECT and the familiar Jupyter Notebook screen will appear.

Of course, the Titanic data is mounted as well.

After that, create notebooks as usual and analyze away.
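For example, a first cell might look like the following. This is a minimal sketch of my own, not something from the Kubeflow docs: the mount point (assumed here to be /home/jovyan/titanic-pvc) depends on the volume name entered on the creation screen, and the feature handling is deliberately crude.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical mount point; check the actual one with `ls /home/jovyan`
DATA_DIR = "/home/jovyan/titanic-pvc"

train = pd.read_csv(f"{DATA_DIR}/train.csv")

# Very rough preprocessing: a few numeric features plus encoded sex and imputed age
X = train[["Pclass", "SibSp", "Parch", "Fare"]].copy()
X["Sex"] = (train["Sex"] == "female").astype(int)
X["Age"] = train["Age"].fillna(train["Age"].median())
y = train["Survived"]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # rough CV accuracy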
Hosting a Notebook Server is just one of Kubeflow's many features. I was particularly interested in two of them:

--Pipelines, which can run data processing, training, and inference and track various KPIs
--Katib, which performs parameter tuning

But these two... did not work in my environment.
Pipelines fails with an error saying Docker is not available when running a job. The main cause is that microk8s uses containerd instead of Docker as its container runtime. An issue had been filed about this as well.
When Katib runs a job, the following messages are output and the process does not proceed:

INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.

There is an issue about this too, although it has been closed.
Both look like they could be fixed by an update, so I will try Kubeflow v1.0 RC right away.