MLOps Workshop

[Image: Kaggle dataset illustration]

In this workshop, we build an end-to-end machine learning pipeline inside a Kubernetes cluster.

We will train models to recognize hand gestures showing N fingers. The best model among the experiments will be served as an API.

It's just like an automated "Miss Universe" contest.

Chapter 0: Preparations

Prerequisites: kind, kubectl, helm and docker installed locally (these are used throughout the workshop); python3 is needed for the optional local apps in Chapter 8.

Starting from a clean Kind cluster:

kind delete cluster
kind create cluster

Alternatively, configure your kubeconfig (kubectl) to point at an existing cluster.

The skill of an IT engineer is not to deploy flawlessly (nobody can), but to troubleshoot quickly.

Optional step: connect the cluster to Komodor:

xdg-open https://app.komodor.com
helm repo add komodorio https://helm-charts.komodor.io 
helm upgrade --install --create-namespace --namespace komodor komodor-agent komodorio/komodor-agent --set apiKey=`cat komodor-key.txt` --set clusterName=local-kind-`date +%s` 

Our cluster needs a dedicated Docker image; baking the dependencies into it reduces startup overhead. If you cannot pull it from a registry, build it locally and load it into the Kind cluster:

docker build . -t docker.io/komikomodor/mlops-workshop:latest
kind load docker-image docker.io/komikomodor/mlops-workshop:latest

Workshop setting: we bypass auth and keep everything in a single cluster and a single namespace. In short, a real production setup is very complex, so we simplified it. We use PVCs and ConfigMaps for storage, and everything has to live in the same namespace because the data is shared through PVCs. The real world would use Git, S3, NFS, etc.

kubectl apply -f C0S1-workshop-pvc.yaml
kubectl apply -f C0S2-workshop-helper-pod.yaml

How do you eat an elephant? Piece by piece.

Chapter 1: Airflow as top-level orchestrator

helm repo add apache-airflow https://airflow.apache.org/
helm upgrade --install airflow apache-airflow/airflow -f C1S2-airflow-values.yaml

Wait for Airflow pods to become ready.

xdg-open http://localhost:8082
kubectl port-forward svc/airflow-webserver 8082:8080 

You should see an empty Airflow UI.

kubectl cp C1S3-dag1.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag1_all_stubs && airflow dags trigger dag1_all_stubs"

Go to the Airflow UI and watch the first DAG pass all green. Start it again manually if you wish.
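
For orientation, a DAG with all steps stubbed boils down to a few dummy tasks chained together. A minimal sketch of the idea (the task breakdown here is an assumption, not the exact contents of C1S3-dag1.py):

# Minimal sketch of an all-stubs Airflow DAG (task names are assumptions).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stub(step_name):
    print(f"Pretending to run step: {step_name}")


with DAG(dag_id="dag1_all_stubs", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False):
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=stub,
                                op_args=["preprocess_data"])
    train = PythonOperator(task_id="train_model", python_callable=stub,
                           op_args=["train_model"])
    serve = PythonOperator(task_id="serve_model", python_callable=stub,
                           op_args=["serve_model"])
    preprocess >> train >> serve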

Chapter 2: Spark for data preparation

helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm upgrade --install spark-operator spark-operator/spark-operator
kubectl apply -f C2S1-airflow-spark-crb.yaml

Wait for the operator deployment to complete.

We load the koryakinp/fingers dataset from Kaggle, taking 200 images as the "small" dataset; the full set of images serves as the large dataset later on.

kubectl exec workshop-helper-pod -- mkdir -p /data/scripts
kubectl cp C2S2-preprocess_fingers.py workshop-helper-pod:/data/scripts/
kubectl cp C2S3-dag2.py workshop-helper-pod:/data/dags/
kubectl cp C2S4-spark-preprocess.yaml workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag2_spark_preprocess && airflow dags trigger dag2_spark_preprocess"

Go to the Airflow UI, open the second DAG, and look at the logs of the preprocess_data step. Start it a second time and notice it goes faster (cached download).

Note that we'll keep this DAG step disabled until the end, to make the DAGs run faster; the dataset is already in the storage volume.
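
Conceptually, the Spark job only parallelizes the image preprocessing across executors and writes the result to the shared volume. A rough sketch of the idea (paths, image size and the exact steps are assumptions; see C2S2-preprocess_fingers.py for the real version):

# Rough sketch of Spark-parallelized image preprocessing (paths and sizes are assumptions).
import io
from PIL import Image
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preprocess-fingers").getOrCreate()

def resize(record):
    path, content = record
    img = Image.open(io.BytesIO(content)).convert("L").resize((128, 128))
    out = io.BytesIO()
    img.save(out, format="PNG")
    return path.split("/")[-1], out.getvalue()

# Read the raw Kaggle images, keep a small subset, resize in parallel on executors.
raw = spark.sparkContext.binaryFiles("/data/data/fingers_raw/*.png")
small = spark.sparkContext.parallelize(raw.take(200))
for name, png in small.map(resize).collect():
    with open(f"/data/data/fingers_data/train/{name}", "wb") as f:
        f.write(png)

spark.stop()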

Chapter 3: MLFlow to track experiments

helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install mlflow bitnami/mlflow --set artifactRoot.s3.enabled=false --set tracking.auth.enabled=false --set artifactRoot.filesystem.enabled=true

Wait for its pods to become ready.

xdg-open http://localhost:8083
kubectl port-forward svc/mlflow-tracking 8083:80 

You should see the empty MLFlow UI.

Let's start tracking experiments, initially with fake metrics from a fake training step.

kubectl cp C3S1-dag3.py workshop-helper-pod:/data/dags/
kubectl cp C3S2-fake-train.py workshop-helper-pod:/data/scripts/
kubectl cp C3S3-mlflow-chooser.py workshop-helper-pod:/data/scripts/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag3_mlflow_tracking && airflow dags trigger dag3_mlflow_tracking"

Go to the Airflow UI to see the third DAG run and complete. After it finishes, go to the MLFlow UI, refresh the list of experiments, find "finger-counting-fake", and browse into it: runs, metrics, etc. Add the "val_accuracy" column to the table.
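
Under the hood, the fake training just logs made-up metrics to MLFlow. A minimal sketch of the idea (the tracking URI, model names and values are assumptions; see C3S2-fake-train.py for the real version):

# Minimal sketch of logging fake metrics to MLFlow (URI and values are assumptions).
import random
import mlflow

mlflow.set_tracking_uri("http://mlflow-tracking:80")  # in-cluster service, assumed
mlflow.set_experiment("finger-counting-fake")

for model_name in ["cnn_deep", "cnn_wide"]:
    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model", model_name)
        for epoch in range(5):
            mlflow.log_metric("val_accuracy", random.uniform(0.5, 0.99), step=epoch)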

Chapter 4: PyTorch and real training

We'll run direct training of a single model variant. The training script is parameterized to cover all our combinations. We skip realistic, slow data loading (e.g. from S3) by mounting the volume into the Pod. Beware: training on CPU is not recommended for children and pregnant women; we do it here for the sake of simplicity.
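
The core of such a training script is an ordinary PyTorch loop that logs to MLFlow, with the model variant and learning rate passed as parameters. A condensed sketch under those assumptions (the architecture, data layout and paths are illustrative, not the actual C4S1_train.py):

# Condensed sketch of a parameterized CPU training loop with MLFlow logging.
# Architecture, data layout and hyperparameters are assumptions.
import argparse
import mlflow
import torch
from torch import nn
from torchvision import datasets, transforms

parser = argparse.ArgumentParser()
parser.add_argument("--model", default="cnn_deep")
parser.add_argument("--lr", type=float, default=1e-3)
args = parser.parse_args()

transform = transforms.Compose([transforms.Grayscale(), transforms.Resize((128, 128)),
                                transforms.ToTensor()])
train_ds = datasets.ImageFolder("/data/data/fingers_data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128, 256), nn.ReLU(),
                      nn.Linear(256, 6))  # 0..5 fingers; the real models differ per --model
opt = torch.optim.Adam(model.parameters(), lr=args.lr)
loss_fn = nn.CrossEntropyLoss()

mlflow.set_tracking_uri("http://mlflow-tracking:80")
mlflow.set_experiment("finger-counting-debug")
with mlflow.start_run(run_name=args.model):
    mlflow.log_params(vars(args))
    for epoch in range(3):
        correct, total = 0, 0
        for x, y in loader:
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
            total += y.numel()
        # For brevity the accuracy here is computed on the training split.
        mlflow.log_metric("val_accuracy", correct / total, step=epoch)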

kubectl cp C4S1_train.py workshop-helper-pod:/data/scripts/
kubectl cp C4S2-dag4.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag4_real_training && airflow dags trigger dag4_real_training"

In Airflow, open the fourth DAG and watch it succeed. After completion, go to the MLFlow UI, refresh the list of experiments, find "finger-counting-debug", and look at the model metric graphs. The best-model chooser should pick that run, which can be verified in the /data/best_model_info.txt file on the data volume.
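
The best-model chooser itself is essentially one MLFlow query ordered by val_accuracy. A minimal sketch (experiment name, URI and output format follow the conventions above; the real C3S3-mlflow-chooser.py may differ):

# Minimal sketch of choosing the best run by val_accuracy and recording it.
import mlflow

mlflow.set_tracking_uri("http://mlflow-tracking:80")
runs = mlflow.search_runs(experiment_names=["finger-counting-debug"],
                          order_by=["metrics.val_accuracy DESC"], max_results=1)
best = runs.iloc[0]
with open("/data/best_model_info.txt", "w") as f:
    f.write(f"{best['run_id']} val_accuracy={best['metrics.val_accuracy']}\n")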

Chapter 5: Ray Tune Experiment Scaling

Ray is very sensitive to version mismatches between its client, the cluster, and Python. It also needs access to all Python sources on both the head and the workers, which is why our Docker image is critical here. We mount the volume into the Ray containers to skip data loading and script fetching.

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm upgrade --install kuberay-operator kuberay/kuberay-operator -f C5S1-kuberay-operator-values.yaml
helm upgrade --install ray-cluster kuberay/ray-cluster -f C5S2-ray-values.yaml

Wait for its pods to become ready.

xdg-open http://localhost:8265
kubectl port-forward service/ray-cluster-kuberay-head-svc 8265:8265 

You should see the empty Ray UI. Let's run a Ray-scaled experiment: 4 models with 2 learning rates = 8 experiments total. Ray will decide how to schedule them according to their resource requests.
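
At its core this is a plain Ray Tune grid search over model type and learning rate. A rough sketch of the idea (train_one stands in for the real training code; model names, learning rates and resource requests are assumptions, see C5S5-ray-tune.py for the real version):

# Rough sketch of a Ray Tune grid search: 4 models x 2 learning rates = 8 trials.
import ray
from ray import tune

def train_one(config):
    # The real trainable would train a PyTorch model and log to MLFlow here.
    accuracy = 0.5 + 100 * config["lr"]  # placeholder result
    return {"val_accuracy": accuracy}

ray.init(address="auto")  # assumes we run against the existing Ray cluster
tuner = tune.Tuner(
    tune.with_resources(train_one, {"cpu": 1}),
    param_space={
        # Fourth model name below is a placeholder.
        "model": tune.grid_search(["cnn_deep", "cnn_wide", "mobilenet_v2", "cnn_small"]),
        "lr": tune.grid_search([1e-3, 1e-4]),
    },
)
results = tuner.fit()
print(results.get_best_result(metric="val_accuracy", mode="max").config)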

kubectl cp C5S4-dag5.py workshop-helper-pod:/data/dags/
kubectl cp C5S5-ray-tune.py workshop-helper-pod:/data/scripts/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag5_ray_tune && airflow dags trigger dag5_ray_tune"

At this stage your cooling fan should go WE-E-E-E-E-E-E. You may hit OutOfMemory problems, too.

That's quite alright. The main skill of a modern data engineer seems to be "drinking your coffee slowly".

Look at the running DAG in Airflow and the multiple experiment runs in MLFlow. This starts to take a lot of time, so we'll skip this step in the next chapter; the results are already in MLFlow, ready for querying. Also look at the Ray UI showing the experiment runs.

Wait for the Airflow DAG to complete (or proceed further; the best model was already written in previous steps).

Chapter 6: Ray Serve the best model

Let's deploy the best model for serving. For speed, the other DAG steps are stubbed.
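
Conceptually, the serving script wraps the best model in a small Ray Serve deployment with a JSON-over-HTTP endpoint. A minimal sketch (the model path, preprocessing and response shape are assumptions; see C6S1-serve.py for the real version):

# Minimal sketch of a Ray Serve deployment answering base64-image requests.
# Model path, preprocessing and response shape are assumptions.
import base64
import io

import torch
from PIL import Image
from ray import serve
from starlette.requests import Request
from torchvision import transforms


@serve.deployment(num_replicas=1)
class FingerCounter:
    def __init__(self):
        self.model = torch.load("/data/best_model.pt", map_location="cpu")  # assumed export path
        self.model.eval()
        self.transform = transforms.Compose([transforms.Grayscale(),
                                             transforms.Resize((128, 128)),
                                             transforms.ToTensor()])

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        img = Image.open(io.BytesIO(base64.b64decode(payload["image"])))
        x = self.transform(img).unsqueeze(0)
        with torch.no_grad():
            fingers = int(self.model(x).argmax(1).item())
        return {"fingers": fingers}


serve.run(FingerCounter.bind(), route_prefix="/")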

kubectl cp C6S1-serve.py workshop-helper-pod:/data/scripts/
kubectl cp C6S2-dag6.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag6_ray_serve && airflow dags trigger dag6_ray_serve"

Wait for the Airflow DAG to complete. Validate that Ray Serve is functioning inside the cluster:

kubectl exec workshop-helper-pod -- /bin/bash -c "echo '{\"image\":\"'\$(base64 -w 0 /data/data/fingers_data/test/ffb81e76-93e8-4e7f-87cf-7a91a62c8f25_3R.png)'\"}' | curl -X POST -H 'Content-Type: application/json' -d @- http://ray-cluster-kuberay-head-svc:8000/"
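
The same check can be done from Python, for anything running inside the cluster that can resolve the service name (request and response shape as above):

# Calling the inference endpoint from Python instead of curl.
import base64
import requests

with open("/data/data/fingers_data/test/ffb81e76-93e8-4e7f-87cf-7a91a62c8f25_3R.png", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode()}

resp = requests.post("http://ray-cluster-kuberay-head-svc:8000/", json=payload)
print(resp.json())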

Chapter 7: Everything together, at once

Skip this step if you trust that it will work, because it runs for a long time. It's a good idea to delete all small-dataset runs from the "finger-counting" experiment first, so they don't compete in the best-model chooser.

kubectl cp C7S1-ray-tune-large.py workshop-helper-pod:/data/scripts/
kubectl cp C7S2-dag7.py workshop-helper-pod:/data/dags/
kubectl exec deployment/airflow-scheduler -- /bin/bash -c "airflow dags reserialize && airflow dags unpause dag7_complete"

After this, your machine will be busy for a long time. This is where you want to switch to GPU training.

On my decent desktop, cnn_deep (the fastest of our models) trained in ~40 minutes, cnn_wide in ~100 minutes, and mobilenet_v2 in ~160 minutes.

Chapter 8: Consuming the results

Let's add a web UI to interact with our model inference.
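
The helper server we copy in below conceptually only does two things: serve the HTML page and proxy inference requests to Ray Serve. A rough sketch of the idea (the real C8S2-server.py may differ; the endpoint and ports follow the workshop setup):

# Rough sketch of a tiny web server: serves the page and forwards requests to Ray Serve.
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

RAY_URL = "http://ray-cluster-kuberay-head-svc:8000/"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        with open("C8S1-index.html", "rb") as f:
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        raw = self.rfile.read(int(self.headers["Content-Length"]))
        req = urllib.request.Request(RAY_URL, data=raw,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(resp.read())

HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()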

kubectl cp C8S1-index.html workshop-helper-pod:/data/scripts/
kubectl cp C8S2-server.py workshop-helper-pod:/data/scripts/
kubectl exec workshop-helper-pod -- /bin/bash -c "cd /data/scripts && python C8S2-server.py"

This command will keep running in the foreground.

In a parallel terminal:

xdg-open http://localhost:8080
kubectl port-forward workshop-helper-pod 8080:8080

Have fun uploading images from disk or even capturing them via camera.

You can also try a local app that uses the camera live:

python3 C8S3-camera.py

To improve results, the training data could perhaps be switched to https://www.kaggle.com/c/finger-n/data

Cleanup

In reverse order:

helm uninstall ray-cluster
helm uninstall kuberay-operator
helm uninstall mlflow
helm uninstall spark-operator
helm uninstall airflow
helm uninstall --namespace komodor komodor-agent

kubectl delete -f C2S1-airflow-spark-crb.yaml
kubectl delete -f C0S2-workshop-helper-pod.yaml
kubectl delete -f C0S1-workshop-pvc.yaml
