Jupyter is a great tool for experimental science. Running a Jupyter notebook server can be tricky, though, especially if you want to preserve all of the data and notebooks stored in it. I have seen many strategies, but I have come up with one that I like best of all. It is based on my “Micro Services for Data Science” strategy: by decoupling data and compute, we can literally trash our Jupyter container and all of our data and notebooks still survive. So why not put it in a self-healing orchestrator and deploy via Kubernetes :D.
Step 1: Get a container registry
I use Azure’s Container Registry. It’s a private container registry that comes with Azure and gets billed through my Azure subscription; I just like keeping everything in one place. I’ll go through how to use a private container registry here as well, but if you don’t want to worry about that, you can simply create a Docker Hub repository to push your containers to.
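If you go the Azure route, creating a registry and logging your local Docker client into it looks roughly like this with the Azure CLI (the resource group name `my-rg` and registry name `myregistry` are placeholders for your own):

```shell
# Create a private container registry in an existing resource group
az acr create --resource-group my-rg --name myregistry --sku Basic

# Log the local docker client in so you can push images to it
az acr login --name myregistry
```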
Step 2: Create a lightweight container with Jupyter in it.
Here is the Dockerfile:
FROM ubuntu:16.04
RUN apt-get update -y
RUN apt-get upgrade -y
RUN apt-get install -y -qq build-essential libssl-dev libffi-dev python3-dev curl python3-pip
RUN pip3 install --upgrade pip
ADD requirements.txt /app/
RUN pip3 install -r /app/requirements.txt
EXPOSE 8888
We start from an Ubuntu 16.04 base image, run some upgrades, install Python 3, upgrade pip, install our requirements, and expose port 8888 (Jupyter’s default port).
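With the Dockerfile and requirements.txt in the same directory, building and pushing the image looks roughly like this (the image name `jupyter-persisted` and registry login server `myregistry.azurecr.io` are placeholders for your own):

```shell
# Build the image from the directory containing the Dockerfile
docker build -t jupyter-persisted:latest .

# Tag it for your registry's login server
docker tag jupyter-persisted:latest myregistry.azurecr.io/jupyter-persisted:latest

# Push it so the Kubernetes cluster can pull it
docker push myregistry.azurecr.io/jupyter-persisted:latest
```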
Here is our requirements.txt file:
numpy
pandas
scipy
jupyter
azure_common
azure-storage
scikit-learn
nltk
plotly
Notice that Jupyter is in there. I also added a few other things that I very commonly use, including numpy, pandas, plotly, scikit-learn, and some Azure packages.
Step 3: Create a .yaml file
apiVersion: v1
kind: Service
metadata:
  labels:
    app: jupyter-persisted
  name: jupyter-persisted
spec:
  ports:
  - port: 80
    targetPort: 8888
  selector:
    app: jupyter-persisted
  type: LoadBalancer
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: jupyter-persisted
  name: jupyter-persisted
spec:
  template:
    metadata:
      labels:
        app: jupyter-persisted
    spec:
      volumes:
      - name: dumpzone
        azureFile:
          secretName: storagesecret
          shareName: dumpzone
          readOnly: false
      - name: rvlcdip
        azureFile:
          secretName: storagesecret
          shareName: rvlcdip
          readOnly: false
      - name: jupyter
        azureFile:
          secretName: storagesecret
          shareName: jupyter
          readOnly: false
      - name: models
        azureFile:
          secretName: storagesecret
          shareName: models
          readOnly: false
      - name: tensorboard
        azureFile:
          secretName: storagesecret
          shareName: tensorboard
          readOnly: false
      containers:
      - name: jupyter
        image: YOUR_REGISTRY_REPO_OR_LOGIN_SERVER/YOURIMAGE:YOURTAG
        imagePullPolicy: Always
        command: ["bash", "-c"]
        args: ["jupyter notebook --no-browser --port=8888 --ip=0.0.0.0 --notebook-dir=/jupyter --allow-root --NotebookApp.password='YOUR NOTE BOOK PASSWORD'"]
        ports:
        - containerPort: 8888
        volumeMounts:
        - mountPath: "/dumpzone"
          name: dumpzone
        - mountPath: "/rvlcdip"
          name: rvlcdip
        - mountPath: "/jupyter"
          name: jupyter
        - mountPath: "/models"
          name: models
        - mountPath: "/tensorboard"
          name: tensorboard
      imagePullSecrets:
      - name: regsecret
This is where most of the magic happens. Notice the volume spec and the mounts specifically: these are Azure Files. I mount a variety of file shares which have data in them so my teams can operate against them. Very importantly, notice that we mount an Azure File share to /jupyter, which happens to be the exact directory we run Jupyter from. This means my notebooks are persisted to geo-redundant storage which is separate from my container or even my VM. So my container and my VM can go down, and Kubernetes will simply reschedule the pod with the same mount, and my notebook will be automatically online again without me touching a single thing… WONDERFUL!
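One thing the spec above assumes is a Kubernetes secret named storagesecret holding the credentials for the storage account behind those Azure File shares. Creating it looks roughly like this (the account name and key values are placeholders; azurestorageaccountname and azurestorageaccountkey are the keys the azureFile volume plugin expects):

```shell
# Create the secret that every azureFile volume above references
kubectl create secret generic storagesecret \
  --from-literal=azurestorageaccountname=YOUR_STORAGE_ACCOUNT \
  --from-literal=azurestorageaccountkey=YOUR_STORAGE_ACCOUNT_KEY
```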
You can also see in the spec above that I have mounted all of my other file shares so I can manage them from Jupyter: exploratory analysis, etc. etc.
Generating the notebook password
It’s just some simple code:
from notebook.auth import passwd
print(passwd())
If you run that code, you will be prompted to enter a password twice, and it gives back a hashed string that you put into the .yaml file.
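For the curious, here is a rough sketch of what that hash looks like under the hood. The helper name and the 12-character salt length here are my own assumptions about how passwd appears to work; the point is just that the output is a salted digest in the form algorithm:salt:hexdigest, so the plaintext password never goes into the .yaml file:

```python
import hashlib
import random

def hash_notebook_password(passphrase, algorithm="sha1"):
    """Sketch (hypothetical helper) of a salted password hash in the
    'algorithm:salt:hexdigest' format that notebook.auth.passwd emits."""
    salt = "%012x" % random.getrandbits(48)  # 12 hex characters of salt
    h = hashlib.new(algorithm)
    # Digest covers the passphrase followed by the salt
    h.update(passphrase.encode("utf-8") + salt.encode("ascii"))
    return ":".join((algorithm, salt, h.hexdigest()))

print(hash_notebook_password("correct horse"))
```

Because the salt is random, each run produces a different string, but any of them will match the same password at login time.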
Using a Private Container Registry
You simply need to provide a registry secret in Kubernetes for the cluster to use; the deployment above references it as regsecret under imagePullSecrets.
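Creating that secret looks roughly like this (the server, username, password, and email values are placeholders; for Azure Container Registry the username and password can be a service principal’s ID and password):

```shell
# Create the image pull secret the deployment references as "regsecret"
kubectl create secret docker-registry regsecret \
  --docker-server=YOUR_REGISTRY.azurecr.io \
  --docker-username=YOUR_USERNAME \
  --docker-password=YOUR_PASSWORD \
  --docker-email=you@example.com
```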
Deploy to Kubernetes and get your endpoint
Just a couple of simple commands:
kubectl create -f YOURYAMLFILE.yaml
kubectl get svc
The first command will schedule the service and deployment on the cluster, and the second will give you a list of all services on the cluster along with each one’s public IP. Navigate to that public IP address, enter your password, and off you go.
This Kubernetes thing is pretty freaking awesome. I’m taking a big liking to this type of approach for my workloads. I can now run workloads from anywhere in the world on whatever compute I want, know they are going to be reliable, and give other folks access in a secure way. Good luck!