In this article we will review the technology stack required to enable this, as well as the logistics of setting it all up and operating against it. The solution uses Azure Container Service, Docker, Kubernetes, Azure Storage, Jupyter, TensorFlow and TensorBoard. WOW! That is a lot of technology; so rather than a deep-dive how-to, we'll cover some pointers on how to get provisioned and then the high-level process for using it.
So to date, this is the best large-scale deep learning solution I have used. I've tried out Spark, TensorBrick, TensorFlow on Spark, just raw TensorFlow, Data Science VMs and gosh, I don't know what all else; a lot of stuff. This solution, bar none, beats every single one of them. So what does this unique combination of tools bring to the table where all else fails?
Decoupled Data and Compute
This is problem number 1. I need my compute to be elastic and my data persisted, and I need it done in a way that my team doesn't have to think about. With DSVMs, Jupyter and other solutions, decoupling data from compute is often complicated, or it requires manual (or custom-coded) synchronization and orchestration which often just gets dropped. This solution solves that problem by putting all data into Azure Files and then mounting those file shares as drives inside the compute containers. Sweet. Below is an image where I have accessed two separate 5 TB data shares and used Jupyter Notebook's command interface to list and manage my mounted drives. One drive is literally a 'dump zone' and the other is rvlcdip (doc data). I will mention that Spark (HDInsight) solves this problem too, albeit you have to learn all of the Spark-specific stuff and deal with how slow it is, along with the fact that YARN sucks up a bunch of resources.
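For reference, mounting an Azure Files share into a container boils down to a small pod-spec fragment like the sketch below. The share names, secret name, image name and mount paths are just illustrative stand-ins for my setup:

```yaml
# Pod-spec fragment (sketch): two Azure Files shares mounted as drives.
containers:
- name: training
  image: drcrook/deeplearning:cifar10    # hypothetical image
  volumeMounts:
  - name: dumpzone
    mountPath: /data/dumpzone
  - name: rvlcdip
    mountPath: /data/rvlcdip
volumes:
- name: dumpzone
  azureFile:
    secretName: azure-files-secret   # holds storage account name + key
    shareName: dumpzone
    readOnly: false
- name: rvlcdip
  azureFile:
    secretName: azure-files-secret
    shareName: rvlcdip
    readOnly: false
```

Because the data lives in the shares and not the pods, any container that mounts them sees the same files.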
Elastic Scale GPUs and CPUs
Next up is the need to dynamically pay for only what I need, in an easy-to-manage way. I can't be standing up VMs all the time; I have a team to run and science to do. So I need to provision a deep learning workload and let it rip. If there aren't enough GPUs and CPUs currently available, it needs to provision them. When my work is done, it should de-provision them, or schedule pending work against them if there is any. Below is an image of several deep learning workloads running, and a pending image-classification workload moving from pending to running state after a new multi-GPU node has been provisioned for it to execute against.
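For the record, growing the agent pool on ACS was a one-liner with the Azure CLI. The sketch below just prints the command (with made-up resource group and cluster names) rather than running it, so you can review it first:

```shell
# Sketch: print the ACS scale command instead of executing it.
# Resource group and cluster names here are hypothetical.
RESOURCE_GROUP="my-dl-rg"
CLUSTER_NAME="my-dl-cluster"
NEW_AGENT_COUNT=5

CMD="az acs scale --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME --new-agent-count $NEW_AGENT_COUNT"
echo "$CMD"
```

Run the printed command yourself once the names match your subscription.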
Externally Visible Micro-Services
So in a deep learning context, my micro-services are TensorBoard and Jupyter. I might even run Jupyter against a GPU machine. Because my data is decoupled from my compute, I can have several Jupyter notebooks mounted to the central Jupyter file share, some with GPUs and some with CPUs, and operate against them. Better yet, I can get interactive charts from plotly (suck on that, Spark; even though I know you claim to do it, I know you can't (as of this writing anyway)). Let's check out some workloads…
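The "externally visible" part comes from a Kubernetes Service of type LoadBalancer, which makes Azure hand out a public IP. A minimal sketch for TensorBoard might look like this (the label selector and external port are assumptions; 6006 is TensorBoard's default port):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tensorboard
spec:
  type: LoadBalancer    # Azure provisions a public IP for this service
  selector:
    app: tensorboard    # assumed label on the TensorBoard pod
  ports:
  - port: 80            # what the team browses to
    targetPort: 6006    # TensorBoard's default listening port
```

The same pattern exposes Jupyter; no tunneling required.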
Now I can finally check my workloads without doing any crazy tunneling or anything. Better yet; I can check all of my simultaneously training algorithms. There was a time when only one of my engineers could actually access tensorboard because he used to work in networking and knew how to set up that stuff (but only for himself; jerk).
Quick and Easy Deployments, Healing Etc
Because it's all Kubernetes under the hood with Docker, these systems are very easy to heal. I can replace any service within a few minutes, and thanks to the data decoupling, I can just destroy and rebuild services at will. This has been critical, as I have noticed some instabilities in TensorBoard that cause it to crash, and the occasional bug in a deep learning workload that crashes it as well. Each system is completely independent and easy to reprovision.
This system kicks butt. I am definitely going to start rolling this out for my day job with customers. I have already rolled it out for internal usage and for my own startup's usage. It really covers everything you need, and I suspect upgrades will keep coming.
Provisioning Your Cluster
So how do you get started? The cluster really is a combination of a variety of blog articles and technologies mashed together. Do your best to follow these:
Some of those instructions don't work exactly as written, so be prepared to modify them, but they are a great starting point. I have also open-sourced a repo which contains some common configurations as well as code which runs against the centralized data repo: https://github.com/drcrook1/CIFAR10/tree/master/TensorFlow_K8
Using Your Cluster – Pre-Requirements
I do all of my development on a Windows 10 machine, so I am going to list out the pre-requirements based on that.
- Developer Mode Turned On
- Windows Subsystem for Linux Enabled
- Hyper-V Enabled
- Docker for Windows installed
- Git for Windows installed
- kubectl installed within the Linux subsystem
- Visual Studio Code w/Python extension
- Anaconda stack & TensorFlow installed
- Docker Hub Subscription
Your basic process is to write your Python code in Visual Studio Code and manage your Docker images and such in Windows. You only manage your workloads from kubectl within the Linux subsystem.
Process for Using your Cluster – Deep Learning Workloads
- Modify your code; make sure you follow my CIFAR10 sample, double-check your mount locations, and change your model name in CONSTANTS.py every time you deploy a new model. The reason you change the model name is so that each run is reflected separately in TensorBoard for monitoring.
- Dockerize your code; I created a sample buildenv.ps1. Execute that file from the same folder your Dockerfile is in; therefore, from a Windows command prompt, you should execute the command "powershell K8/buildenv.ps1". Make sure the container name in the buildenv.ps1 file matches what you are looking for. I prefer to follow the convention <user>/<productsuite>:<product>. Once you have an image, you can push it using docker push <user>/<productsuite>:<product>
- Modify your job.yaml; the key things here are to ensure you have a unique job name, double-check your image name, and confirm that imagePullPolicy is still set to Always. Also double-check your volume mounts, your requested number of GPUs, and that you are executing the correct Python file to kick off training.
- Schedule your workload; from your Bash on Windows command prompt, navigate to the folder with the job.yaml file and execute the command "kubectl create -f job.yaml". You can then view the job via "kubectl get pods". Ensure the final few characters of the pod name stay consistent between checks; if they keep changing, the pod is failing and being recreated. You can view the logs via "kubectl logs <pod id>". If your workload fails, execute "kubectl delete job <jobname>" or "kubectl delete -f job.yaml".
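The job.yaml being modified might look roughly like the sketch below. The job name, image, entry point, mount paths and GPU resource key are all assumptions (the GPU resource name in particular varied by Kubernetes version in the ACS era):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: cifar10-train-001          # must be unique per submission
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: drcrook/deeplearning:cifar10       # hypothetical image
        imagePullPolicy: Always
        command: ["python", "/code/train.py"]     # assumed entry point
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1     # GPU request (pre-1.8 key)
        volumeMounts:
        - name: rvlcdip
          mountPath: /data/rvlcdip
      volumes:
      - name: rvlcdip
        azureFile:
          secretName: azure-files-secret
          shareName: rvlcdip
          readOnly: false
      restartPolicy: Never
```

Bump the job name and the model name together each time you submit.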
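For the first step, the model-name bump in CONSTANTS.py can be automated so you can't forget it. This is just a sketch of the idea, not the actual file from the CIFAR10 repo; the base name and the /data/… path are assumptions:

```python
# Hypothetical CONSTANTS.py fragment: stamp the model name with the
# build time so every deployment shows up as its own run in TensorBoard.
from datetime import datetime

MODEL_BASE = "cifar10_cnn"                       # assumed base name
MODEL_NAME = "{}_{}".format(
    MODEL_BASE, datetime.utcnow().strftime("%Y%m%d_%H%M"))
SUMMARY_DIR = "/data/tensorboard/" + MODEL_NAME  # assumed mounted share path
```

Point your summary writer at SUMMARY_DIR and each deployment gets its own TensorBoard entry automatically.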
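The dockerize step reduces to two Docker commands once the image name is settled. The sketch below only echoes them, using made-up values in the <user>/<productsuite>:<product> convention, so you can review before running from the Dockerfile's folder:

```shell
# Hypothetical image coordinates following <user>/<productsuite>:<product>
DOCKER_USER="drcrook"
SUITE="deeplearning"
PRODUCT="cifar10"
IMAGE="$DOCKER_USER/$SUITE:$PRODUCT"

# Printed rather than executed here; run them yourself once reviewed.
echo "docker build -t $IMAGE ."
echo "docker push $IMAGE"
```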
Process for Using your Cluster – Services
To view the public IPs for your services, execute the command "kubectl get svc"; this will list out your services. TensorBoard currently is not secured and can be navigated to directly via its external IP. Jupyter depends on how you provisioned it, but if you are using the TensorFlow image, you can check the logs for the login token using "kubectl logs <jupyter pod name>".
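If you are grabbing that Jupyter token often, a one-line regex pulls it out of the captured log text. The log line and token below are made up to match the general shape the TensorFlow Jupyter image prints; adjust the pattern if your image's output differs:

```python
import re

# Made-up sample of the sort of line the Jupyter startup log contains.
log = ("Copy/paste this URL into your browser:\n"
       "    http://localhost:8888/?token=abc123def456\n")

match = re.search(r"token=([0-9a-f]+)", log)
token = match.group(1) if match else None
print(token)
```

Pipe `kubectl logs <jupyter pod name>` into a script like this and paste the result into the browser prompt.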
This greatly simplifies the deep learning process at scale: multiple engineers working against multiple shares, with the ability to do cross-data analysis and a TensorBoard set up for each team. This is absolutely fantastic.