In this workshop we build an end-to-end machine learning pipeline inside a Kubernetes cluster.
We will train models to recognize hand gestures showing N fingers, and the best model from our experiments will be served as an API.
It's a bit like an automated "Miss Universe" selection.
Prerequisites:
Start from a clean Kind cluster:
kind delete cluster
kind create cluster
Alternatively, point your kubeconfig (kubectl) at an existing cluster.
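A quick sanity check that kubectl points at the cluster you intend to use (standard kubectl commands, nothing workshop-specific):
kubectl config current-context
kubectl cluster-info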
The skill of an IT engineer is not deploying flawlessly (nobody can), but troubleshooting quickly.
Optional step: connect cluster to Komodor:
xdg-open https://app.komodor.com
helm repo add komodorio https://helm-charts.komodor.io
helm upgrade --install --create-namespace --namespace komodor komodor-agent komodorio/komodor-agent --set apiKey=`cat komodor-key.txt` --set clusterName=local-kind-`date +%s`
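Before looking for the cluster in the Komodor UI, you can confirm the agent pod came up (pod names depend on the chart version):
kubectl get pods -n komodor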
A dedicated image reduces startup overhead, so our cluster needs a special Docker image. If it cannot be pulled from a registry, you can build it locally and load it into the Kind cluster:
docker build . -t docker.io/komikomodor/mlops-workshop:latest
kind load docker-image docker.io/komikomodor/mlops-workshop:latest
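To double-check that the image actually landed on the Kind node (assuming the default cluster name "kind", whose node container is kind-control-plane):
docker exec kind-control-plane crictl images | grep mlops-workshop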
Workshop setting: auth is bypassed and everything runs in a single cluster and a single namespace. In short, a real production setup is far more complex; we simplified it. Storage is handled with PVCs and ConfigMaps, and everything must live in the same namespace because we share data through PVCs. A real-world setup would use Git, S3, NFS, etc.
kubectl apply -f C0S1-workshop-pvc.yaml
kubectl apply -f C0S2-workshop-helper-pod.yaml
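Verify the shared volume and the helper pod before moving on; note that the PVC may stay Pending until the helper pod mounts it, depending on the storage class:
kubectl get pvc
kubectl wait --for=condition=Ready pod/workshop-helper-pod --timeout=180s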
How do you eat an elephant? One piece at a time.
helm repo add apache-airflow https://airflow.apache.org/
helm upgrade --install airflow apache-airflow/airflow -f C1S2-airflow-values.yaml
Wait for Airflow pods to become ready.
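One way to wait: watch the pods, or use kubectl wait (the "release: airflow" pod label is the chart's usual convention, but verify with kubectl get pods --show-labels):
kubectl get pods -w
kubectl wait --for=condition=Ready pod -l release=airflow --timeout=600s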
Start the port-forward in its own terminal (it blocks), then open the UI:
kubectl port-forward svc/airflow-webserver 8082:8080
xdg-open http://localhost:8082
The Airflow UI will be empty at this point.
kubectl cp C1S3-dag1.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag1_all_stubs && airflow dags trigger dag1_all_stubs"
Go to the Airflow UI and watch the first DAG turn all green. Trigger it again manually if you wish.
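You can also check the run state from the CLI instead of the UI:
kubectl exec deployment/airflow-scheduler -- airflow dags list-runs -d dag1_all_stubs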
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm upgrade --install spark-operator spark-operator/spark-operator
kubectl apply -f C2S1-airflow-spark-crb.yaml
Wait for the operator deployment to complete.
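A simple way to check (the exact pod names depend on the chart version):
kubectl get pods | grep spark-operator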
We load the koryakinp/fingers dataset from Kaggle, taking 200 images as the "small" dataset; the full set of images is used later in the workshop.
kubectl exec workshop-helper-pod -- mkdir -p /data/scripts
kubectl cp C2S2-preprocess_fingers.py workshop-helper-pod:/data/scripts/
kubectl cp C2S3-dag2.py workshop-helper-pod:/data/dags/
kubectl cp C2S4-spark-preprocess.yaml workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag2_spark_preprocess && airflow dags trigger dag2_spark_preprocess"
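While the DAG runs, you can watch the SparkApplication resource it creates (assuming the operator registered the usual sparkapplications CRD):
kubectl get sparkapplications -w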
Go to the Airflow UI, open the second DAG, and look at the logs of the preprocess_data step. Trigger it a second time and notice that it runs faster (the download is cached).
Note that we'll keep this step disabled in the later DAGs to make them run faster; the dataset is already on the storage volume.
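If you want to confirm the data really is on the volume (the exact directory layout comes from the preprocessing script; the path below matches the test image used later in the serving step):
kubectl exec workshop-helper-pod -- ls /data/data/fingers_data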
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install mlflow bitnami/mlflow --set artifactRoot.s3.enabled=false --set tracking.auth.enabled=false --set artifactRoot.filesystem.enabled=true
Wait for the MLflow pods to become ready.
Start the port-forward in its own terminal, then open the UI:
kubectl port-forward svc/mlflow-tracking 8083:80
xdg-open http://localhost:8083
The MLflow UI will be empty at this point.
Now we start tracking experiments, initially with fake metrics from a fake training step.
kubectl cp C3S1-dag3.py workshop-helper-pod:/data/dags/
kubectl cp C3S2-fake-train.py workshop-helper-pod:/data/scripts/
kubectl cp C3S3-mlflow-chooser.py workshop-helper-pod:/data/scripts/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag3_mlflow_tracking && airflow dags trigger dag3_mlflow_tracking"
Go to the Airflow UI and watch the third DAG run to completion. Then go to the MLflow UI, refresh the list of experiments, find "finger-counting-fake" and browse into it: runs, metrics, etc. Add the "val_accuracy" column to the table.
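If you prefer the command line, the tracking server can also be queried over its REST API through the port-forward (a sketch, assuming an MLflow 2.x server exposing the experiments/search endpoint):
curl -s -X POST http://localhost:8083/api/2.0/mlflow/experiments/search -H 'Content-Type: application/json' -d '{"max_results": 10}'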
Next we run direct training of a single model variant. The training script is parameterized to cover all of our combinations. We skip realistic, lengthy data loading (e.g. from S3) by mounting the volume into the Pod. Beware: training on CPU is not recommended for children or pregnant women; we do it here only for the sake of simplicity.
kubectl cp C4S1_train.py workshop-helper-pod:/data/scripts/
kubectl cp C4S2-dag4.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag4_real_training && airflow dags trigger dag4_real_training"
In Airflow, open the fourth DAG and watch it succeed. After completion, go to the MLflow UI, refresh the list of experiments, find "finger-counting-debug", and look at the model metric graphs. The best-model chooser should pick that run, which can be validated in the /data/best_model_info.txt file on the data volume.
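To check what the chooser wrote:
kubectl exec workshop-helper-pod -- cat /data/best_model_info.txt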
Ray is very sensitive to version mismatches between its client and cluster and between Python versions. It also needs access to all Python sources on both the head and the workers, which is why our Docker image is critical here. We mount the volume into the Ray containers to skip data loading and script fetching.
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm upgrade --install kuberay-operator kuberay/kuberay-operator -f C5S1-kuberay-operator-values.yaml
helm upgrade --install ray-cluster kuberay/ray-cluster -f C5S2-ray-values.yaml
Wait for the Ray pods to become ready.
Start the port-forward in its own terminal, then open the UI:
kubectl port-forward service/ray-cluster-kuberay-head-svc 8265:8265
xdg-open http://localhost:8265
The Ray dashboard will be empty at this point. Let's run a Ray-scaled experiment: 4 models with 2 learning rates = 8 experiments in total. Ray decides how to schedule the experiments according to their resource requests.
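Before launching, you can optionally confirm the head and workers joined the cluster (a hedged check: KubeRay normally labels the head pod with ray.io/node-type=head, but verify this for your chart version):
kubectl exec $(kubectl get pod -l ray.io/node-type=head -o name) -- ray status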
kubectl cp C5S4-dag5.py workshop-helper-pod:/data/dags/
kubectl cp C5S5-ray-tune.py workshop-helper-pod:/data/scripts/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag5_ray_tune && airflow dags trigger dag5_ray_tune"
At this stage your cooling fan should go WE-E-E-E-E-E-E, and you may hit OutOfMemory problems too.
That's quite alright. The main skill of a modern data engineer seems to be "drinking your coffee slowly".
Watch the running DAG in Airflow and the multiple experiment runs in MLflow; also look at the Ray UI showing the experiment runs. This starts to take a lot of time, so we'll skip this step in the next chapter; the results are already in MLflow, ready for querying.
Wait for the Airflow DAG to complete (or proceed further: the best model was already written in previous steps).
Let's deploy the best model for serving. For speed, the other steps of this DAG are stubbed.
kubectl cp C6S1-serve.py workshop-helper-pod:/data/scripts/
kubectl cp C6S2-dag6.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag6_ray_serve && airflow dags trigger dag6_ray_serve"
Wait for the Airflow DAG to complete, then validate that Ray Serve is functioning inside the cluster:
kubectl exec workshop-helper-pod -- /bin/bash -c "echo '{\"image\":\"'\$(base64 -w 0 /data/data/fingers_data/test/ffb81e76-93e8-4e7f-87cf-7a91a62c8f25_3R.png)'\"}' | curl -X POST -H 'Content-Type: application/json' -d @- http://ray-cluster-kuberay-head-svc:8000/"
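If the request fails, inspecting the Serve application status from the head pod usually tells you why (same ray.io/node-type=head label assumption as above):
kubectl exec $(kubectl get pod -l ray.io/node-type=head -o name) -- serve status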
Skip this step if you trust that it works, because it runs for a long time. It is also a good idea to delete all small-dataset runs from the "finger-counting" experiment, so they do not compete in the best-model chooser.
kubectl cp C7S1-ray-tune-large.py workshop-helper-pod:/data/scripts/
kubectl cp C7S2-dag7.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag7_complete"
After this, your machine will be busy for a long time; this is where you would want to switch to GPU training.
On my reasonably powerful desktop, cnn_deep (the fastest of our models) trained in about 40 minutes, cnn_wide in about 100 minutes, and mobilenet_v2 in about 160 minutes.
Let's add a web UI to interact with our model's inference endpoint.
kubectl cp C8S1-index.html workshop-helper-pod:/data/scripts/
kubectl cp C8S2-server.py workshop-helper-pod:/data/scripts/
kubectl exec workshop-helper-pod -- /bin/bash -c "cd /data/scripts && python C8S2-server.py"
That command will keep running (it blocks), so open a second terminal in parallel.
Start the port-forward, then open the UI:
kubectl port-forward workshop-helper-pod 8080:8080
xdg-open http://localhost:8080
Have fun uploading images from disk or even capturing them via camera.
You can also try a local app that uses the camera live:
python3 C8S3-camera.py
To improve results, the training data could perhaps be switched to https://www.kaggle.com/c/finger-n/data
Cleanup, in reverse order:
helm uninstall ray-cluster
helm uninstall kuberay-operator
helm uninstall mlflow
helm uninstall spark-operator
helm uninstall airflow
helm uninstall --namespace komodor komodor-agent
kubectl delete -f C2S1-airflow-spark-crb.yaml
kubectl delete -f C0S2-workshop-helper-pod.yaml
kubectl delete -f C0S1-workshop-pvc.yaml
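If the Kind cluster was created only for this workshop, you can simply delete it instead:
kind delete cluster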