Hosting Conference Workshops with JupyterHub on Google Kubernetes Engine: A Step-by-Step Guide (Part 1)

Meret Ringwald, Dr. Philipp Thomann

Introduction

Conferences serve as vibrant hubs for knowledge exchange, collaboration, and skill enhancement within diverse communities. Among the various interactive elements of these gatherings, workshops stand out as invaluable opportunities for hands-on learning and skill development. As organizers, we are dedicated to making these sessions as effective as possible. That’s where our focus on finding the right hosting solution comes in. At D ONE, we want every participant to have a seamless and enriching experience.

We’re searching for a hosting solution that is robust yet user-friendly. It needs to be adaptable for all users, regardless of their technical background, and must include customizable options for RAM and GPU to accommodate different workshop needs. The platform should be simple to set up, easy to access, and capable of supporting prolonged engagement. Our goal is to provide a tool that allows participants to smoothly share and execute code, making the technology aspect hassle-free so they can concentrate on the learning and interactive experience.

Typically, workshop organizers either distribute their course material so that participants can run it locally, or they use free services like Binder (mybinder.org) or Google Colab. We have used both options extensively in the past. However, they often come with challenges or usability issues:

  • Distribution of material / datasets: Do you want to have them openly available for the world? If not, how do you distribute them, where do you host them?
  • Local installation might work well for very technical participants, but even then it poses problems with different setups and needs more support by the organizers for a successful workshop. For less technical people, it might just not be feasible at all.
  • Google Colab is a very good solution for self-contained notebooks and can offer GPUs. On the downside, distributing datasets is always a hassle, and there are limitations on starting servers in the background. Furthermore, it requires a Google Account, which not everyone has, and this might mean that some of our clients cannot use their company laptop. Lastly, participants can run out of free credits and are then unable to use GPUs or even CPUs.
  • Binder: Thanks to repo2docker, it can easily be configured to run anything you want in a Docker container (see its documentation). However, there is a hard 2 GB RAM limit, and sessions are lost completely after 10 minutes without interaction. Also, no GPUs are available.

After evaluating the different options, we decided to install the JupyterHub infrastructure on one of our cloud providers, as this meets all our requirements:

  • Robust and User-friendly: Ensures stability and ease of use.
  • Adaptable to All Users: Suitable for both technical and non-technical participants.
  • Customizable RAM and GPU: Allows adjustments to meet varying technical demands.
  • Simple Setup: Easy to configure and start using.
  • Easy Access: Facilitates effortless entry and participation.
  • Support for Prolonged Engagement: Capable of handling extended use without issues.
  • Smooth Code Sharing and Execution: Streamlines the process of sharing and running code to focus on learning.

We chose to use Google Kubernetes Engine on GCP, but it could easily have been AKS on Azure or EKS on AWS or any Kubernetes Cluster. In fact we started development of the deployment (e.g. Helm Charts) on a local Kubernetes cluster.

We tested our setup with a couple of small workshops first with 20, then with 40 participants, where we gained a decent amount of experience. When we started preparing our yearly Growing Together Company Conference it became clear that this setup would make the conference much smoother for both the participants and the workshop organizers. We expected almost 200 participants that would hit our interface in up to 3 workshops in parallel over 3 time slots over the day. We estimated the needed resources, prepared the infrastructure, and invited all the organizers to submit their workshop material.

Everything was ready for the big day! On the day itself, everything ran like a breeze, and both the internal and external participants were amazed by the setup. We don’t want to keep our learnings a secret, so we are sharing our setup and experience with you in a series of blog posts to help you make your workshops just as successful.

We will first describe the benefits of the technologies involved and then walk through the step-by-step process of setting up JupyterHub on GKE, so that you can replicate our setup and host engaging and impactful workshops of your own.

We assume that you have some basic understanding of using Jupyter Notebooks and Python, and it might help if you have some basic knowledge of Docker and Kubernetes. If you want to follow the instructions step by step, you also need access to a Google Cloud project. You can also use any other existing Kubernetes Cluster.

Harnessing Synergy — JupyterHub, Google Kubernetes Engine (GKE), and Helm Charts in Workshop Hosting

Our workshop hosting solution consists of the following components:

  • Hosting: A cloud infrastructure, in our case a Kubernetes cluster running on Google Kubernetes Engine. Kubernetes provides scalable, efficient, and reliable hosting. It supports our demand for high availability and enables us to dynamically adjust resources as workshop participation scales up or down.
  • JupyterHub: Starts a Kubernetes Pod for each user from the Docker image of the respective workshop (see the last item of this list) and proxies the user session for a seamless experience. JupyterHub allows for the easy deployment of interactive coding environments. It simplifies access for users by managing authentication and routing, making it straightforward for participants to start working with no setup delays. This setup ensures that each participant’s environment is isolated and consistent, enhancing learning and reducing technical issues.
  • Helm Charts: We use Helm charts to manage and deploy application configurations on our Kubernetes cluster. Helm charts facilitate efficient and consistent deployment of our applications, making updates and maintenance smoother and reducing setup errors.
  • Docker Images Build: We build the images for the individual workshops locally with custom Dockerfiles that suit our situation and push those to a private registry (Artifact Registry on Google Cloud). This allows for a high degree of customization and control over the environment in which our workshops operate, ensuring that all dependencies and configurations are aligned with our needs.

In the following we give you a quick description of the technologies used.

JupyterHub: Collaborative Computing and Educational Significance

JupyterHub is a powerful server application that enables collaborative computing and data analysis by granting multiple users simultaneous access to Jupyter Notebook instances within a shared environment. It serves as an open-source gateway to customizable computing environments and supports various programming languages and interactive tools, which makes it a versatile solution for collaborative work and educational initiatives.

Google Kubernetes Engine (GKE): Scalable Application Deployment

GKE is a managed Kubernetes service within Google Cloud that provides robust orchestration for containerized applications. Kubernetes, the underlying open-source platform, automates the deployment, scaling, and management of containerized workloads and forms the backbone of GKE.

GKE simplifies application deployment in containers by abstracting underlying infrastructure complexities. Its core strengths lie in scalable, resilient application hosting, leveraging automated scaling, self-healing capabilities, and efficient resource utilization. GKE’s ability to manage complex systems makes it an ideal choice for deploying and managing scalable applications, and it lets us meet some of the more sophisticated demands of our JupyterHub setup.

Helm Charts for Streamlined Deployment

In our deployment journey, we harnessed the power of Helm charts — a valuable tool for simplifying the setup process. Helm charts provide a structured way to package, share, and deploy complex Kubernetes applications. They streamline the deployment of JupyterHub on GKE, making it more accessible and efficient.

Setting up JupyterHub on GKE

In this section, we will provide you with a step-by-step guide on how to set up JupyterHub on Google Kubernetes Engine (GKE), utilizing Helm charts for streamlined deployment. These instructions will help you establish a robust environment for hosting workshops. Let’s dive into the key steps:

Step 1: Creating a GKE Cluster

We begin by creating a GKE cluster through the Google Cloud Platform (GCP) console or the Google Cloud SDK. You need a GCP project and the necessary access rights to set up the cluster. For those planning to repeat this setup process, adopting a “code-based” approach is advisable. We use the gcloud command-line tool for creating and configuring the GKE cluster. Ensure you have the gcloud CLI installed (see the official gcloud CLI installation instructions). We will define the cluster’s specifications, including node count, machine type, zone, and other configurations that match your anticipated workshop requirements.

# adjust to your setup
# names may only contain lowercase letters, digits and hyphens
CLUSTER_NAME="cluster-name"
PROJECT_NAME="gcp-project"
REGION="europe-west4"
ZONE="europe-west4-b"

# create cluster
gcloud container clusters create $CLUSTER_NAME \
   --project=$PROJECT_NAME \
   --region=$REGION \
   --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER,STORAGE,POD,DEPLOYMENT,STATEFULSET,DAEMONSET,HPA

# create node pool and attach to cluster
gcloud container node-pools create workload \
   --cluster=$CLUSTER_NAME \
   --project=$PROJECT_NAME \
   --region=$REGION \
   --machine-type="c3-standard-8" \
   --num-nodes="1" \
   --enable-autoscaling \
   --min-nodes="1" \
   --max-nodes="10" \
   --enable-autoupgrade \
   --enable-autorepair \
   --scopes="https://www.googleapis.com/auth/cloud-platform" \
   --node-locations=$ZONE

# delete default node pool (it is automatically created with the cluster)
gcloud container node-pools delete default-pool \
   --cluster=$CLUSTER_NAME \
   --project=$PROJECT_NAME \
   --region=$REGION



Let’s walk through the code: Before you run the script, make sure you set the variables at the beginning of the script to match your setup. The first command creates a cluster with the specified name in your project and desired location. After that, we create a node pool attached to the previously created cluster. We specify the machine type (--machine-type) and enable autoscaling (--enable-autoscaling). You must specify the minimum and maximum number of nodes you want the node pool to have (--min-nodes and --max-nodes). The --num-nodes flag specifies the number of nodes the node pool starts with right after creation. Last but not least, we delete the default node pool, which is created automatically with the cluster. We won’t need it because we have created our own node pool. Be aware that creating or updating the cluster or the node pool can take up to 30 minutes!
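Because a failed cluster creation costs time, a small pre-flight check of the variables can be worthwhile. The following is only a sketch: the function names are hypothetical, and the naming rule it encodes (lowercase letters, digits and hyphens, starting with a letter, at most 40 characters) is stated as an assumption that you should verify against the current GKE documentation.

```shell
# Hypothetical defaults; override them with your own values.
CLUSTER_NAME="${CLUSTER_NAME:-workshop-hub}"
REGION="${REGION:-europe-west4}"
ZONE="${ZONE:-europe-west4-b}"

# Assumed rule: lowercase letters, digits and hyphens, starts with a letter,
# ends with a letter or digit, no longer than 40 characters.
valid_cluster_name() {
  name="$1"
  [ "${#name}" -le 40 ] || return 1
  printf '%s\n' "$name" | grep -Eq '^[a-z]([-a-z0-9]*[a-z0-9])?$'
}

# A zone such as europe-west4-b belongs to the region europe-west4.
zone_in_region() {
  case "$1" in
    "$2"-?) return 0 ;;
    *) return 1 ;;
  esac
}

if valid_cluster_name "$CLUSTER_NAME" && zone_in_region "$ZONE" "$REGION"; then
  echo "settings look plausible"
else
  echo "check CLUSTER_NAME, REGION and ZONE" >&2
fi
```

The check runs entirely locally and does not call gcloud, so it is safe to execute before any cloud resources exist.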

While gcloud CLI is useful for setting up and configuring your cluster, for daily interactions with your cluster, kubectl is recommended. Install it using:

gcloud components install kubectl

Next, store your cluster credentials in a kubeconfig file:

gcloud container clusters get-credentials "$CLUSTER_NAME" --region "$REGION" --project "$PROJECT_NAME"

With these steps completed, you’re all set to interact with your GKE cluster!
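To confirm that kubectl now talks to the new cluster, you can compare the active context with the name that gcloud registers, which follows the pattern gke_<project>_<location>_<cluster>. The sketch below assumes hypothetical default values and a hypothetical helper name; it only reads the kubeconfig and changes nothing.

```shell
# Hypothetical defaults; override them with the values from your setup.
PROJECT_NAME="${PROJECT_NAME:-my-project}"
REGION="${REGION:-europe-west4}"
CLUSTER_NAME="${CLUSTER_NAME:-workshop-hub}"

# Build the context name gcloud uses for a GKE cluster.
gke_context_name() {
  printf 'gke_%s_%s_%s\n' "$1" "$2" "$3"
}

expected="$(gke_context_name "$PROJECT_NAME" "$REGION" "$CLUSTER_NAME")"

if command -v kubectl >/dev/null 2>&1; then
  # An empty result means no context is selected yet.
  current="$(kubectl config current-context 2>/dev/null || true)"
  if [ "$current" = "$expected" ]; then
    echo "kubectl already points at $expected"
  else
    echo "current context is '${current:-none}', switch with:"
    echo "  kubectl config use-context $expected"
  fi
else
  echo "kubectl not found; expected context name: $expected"
fi
```

This is particularly handy when you switch between a local test cluster and GKE.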

Step 2: Deploying JupyterHub Using Helm Charts

Helm charts simplify the deployment of JupyterHub on GKE by providing a structured and efficient way to package and manage Kubernetes applications. We will walk you through the process of deploying JupyterHub using Helm charts and review key settings and configurations.

First, make sure you have Helm installed on your computer (see the official Helm installation instructions). With Helm installed, you can now focus on tailoring the JupyterHub deployment to your requirements. The configuration for the Helm chart has two main sections: single-user configurations and hub configurations. The first customizes the environment that is provided to the users after they log in, while the hub configuration contains settings that apply to the hub itself.

Here’s a snippet of how these configurations might look in your Helm configuration file:

singleuser:
  image:
    pullPolicy: Always
  # `cmd: null` allows the custom CMD of the Jupyter docker-stacks to be used
  # which performs further customization on startup.
  cmd: null
  memory:
    limit: 3G
    guarantee: 3G
  cpu:
    limit: 2
    guarantee: 0.5
  storage:
    # The following configuration will increase the SHM allocation by mounting a tmpfs (ramdisk) at /dev/shm, replacing the default 64MB allocation.
    extraVolumes:
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: shm-volume
        mountPath: /dev/shm
  lifecycleHooks:
    postStart:
      exec:
        command:
          - "sh"
          - "-c"
          - >
            NAME=`basename "$JUPYTER_IMAGE"`;
            mkdir -p /home/jovyan/"$NAME";
            cp -r /workshop-notebooks/* /home/jovyan/"$NAME"/;
            ln -s /workshop-data /home/jovyan/"$NAME"/data;
            echo done setting up

hub:
  config:
    Authenticator:
      admin_users:
        - admins
    DummyAuthenticator:
      password: your_password
    JupyterHub:
      authenticator_class: dummy


Let’s go through the most important settings:

Single User Configurations:

  • Image:
    pullPolicy: Always: This ensures the image is always pulled before starting containers and helps to be up to date in development when new images get pushed. We didn’t see any big impact on startup performance on the workshop day because all the layers are cached already after the initial startup.
    cmd: null: This setting allows the use of a custom CMD from the Jupyter docker-stacks, enabling further customization at startup.
  • Resource Allocation: Memory and CPU limits and guarantees can be set, such as memory.limit: 3G and cpu.limit: 2. This controls the resources allocated to each user’s instance.
  • Storage Management: Enhanced Shared Memory (SHM) allocation: By configuring extraVolumes and extraVolumeMounts, you can increase the SHM allocation, which is crucial for certain computational tasks.
  • Lifecycle Hooks: postStart: These hooks can be used for initial setup tasks. For example, we copy the workshop notebooks into the user’s workspace and set up symbolic links to the data directory that is coming from the docker image.
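The memory and CPU guarantees determine how many user pods fit on one node, and therefore how large the node pool has to be able to grow. A rough back-of-the-envelope estimate can be scripted as below. All numbers are illustrative assumptions rather than measurements from our conference: a c3-standard-8 machine has 8 vCPUs and 32 GB of memory, but part of that is reserved for the system, so the allocatable values used here (28000 MB and 7900 millicores) are placeholders you should replace with what your nodes actually report.

```shell
# users_per_node ALLOC_MEM_MB ALLOC_CPU_MILLI USER_MEM_MB USER_CPU_MILLI
# The scarcer resource limits how many users fit on one node.
users_per_node() {
  by_mem=$(( $1 / $3 ))
  by_cpu=$(( $2 / $4 ))
  if [ "$by_mem" -lt "$by_cpu" ]; then
    echo "$by_mem"
  else
    echo "$by_cpu"
  fi
}

# nodes_needed CONCURRENT_USERS USERS_PER_NODE (rounds up)
nodes_needed() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Guarantees of 3G memory and 0.5 CPU per user, as in the configuration above.
per_node=$(users_per_node 28000 7900 3000 500)
echo "users per node: $per_node"
echo "nodes for 60 concurrent users: $(nodes_needed 60 "$per_node")"
```

With these assumed numbers, memory is the limiting factor, so raising or lowering the memory guarantee changes the required node count most directly. Compare the result with the maximum size of your node pool before the workshop day.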

Hub Configurations:

  • Authenticator: admin_users: This is a list of users who have admin rights in JupyterHub.
  • DummyAuthenticator: password: This uses a dummy authentication method for JupyterHub, in which participants pick any username and all share a single password, here set to ‘your_password’. Replace it with a value of your own.
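Since the shared password is the only barrier in front of the hub, it is worth generating a random one instead of inventing it by hand. The snippet below is a minimal sketch with a hypothetical helper name. It relies only on /dev/urandom and standard tools, and the commented helm call illustrates one possible way to pass the value at deploy time rather than storing it in the configuration file.

```shell
# Produce 32 hexadecimal characters (16 random bytes).
generate_password() {
  head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

HUB_PASSWORD="$(generate_password)"
echo "generated a password of ${#HUB_PASSWORD} characters"

# Hypothetical usage: override the value from the configuration file.
# --set-string keeps a purely numeric password from being parsed as a number.
# helm upgrade --install <release-name> jupyterhub/jupyterhub \
#   --namespace <namespace> --values config.yaml \
#   --set-string hub.config.DummyAuthenticator.password="$HUB_PASSWORD"
```

Remember that the dummy authenticator is a convenience for short-lived workshops. For anything longer-running, a real authenticator is the safer choice.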

Once you’ve configured your Helm chart, you can deploy JupyterHub to your GKE cluster. First, add the JupyterHub Helm chart repository:

helm repo add jupyterhub https://jupyterhub.github.io/h...
helm repo update


Next, deploy JupyterHub using Helm. Replace <release-name> with your desired release name for the deployment and <namespace> with the Kubernetes namespace you wish to deploy into. If you don’t specify a namespace, it will default to ‘default’. Make sure you have stored your Helm configuration in config.yaml or adjust the file name accordingly.

helm upgrade --install <release-name> jupyterhub/jupyterhub --namespace <namespace> --create-namespace --version=3.1.0 --values config.yaml


Now you could connect via the external IP of your proxy-public Kubernetes service. However, that opens an unencrypted connection (http), so do this only if you are running it on your local machine or in a secured intranet. It is better to use port forwarding with kubectl:

kubectl port-forward -n <namespace> service/proxy-public 8080:http


After which you can access the JupyterHub application in your browser at http://localhost:8080/

This is all you need to set up your JupyterHub. If you change any configuration settings, you can easily re-apply them by running the same helm upgrade command as before.
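If you re-apply the configuration often, wrapping the command in a small function saves typing and avoids typos. The sketch below uses a hypothetical function name, release name and namespace. By default it only prints the command it would run. Set DRY_RUN=0 to actually execute it.

```shell
# deploy_hub RELEASE NAMESPACE [VALUES_FILE]
deploy_hub() {
  release="$1"
  namespace="$2"
  values="${3:-config.yaml}"
  # Assemble the same helm command that is used for the first installation.
  set -- helm upgrade --install "$release" jupyterhub/jupyterhub \
    --namespace "$namespace" --create-namespace \
    --version=3.1.0 --values "$values"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"   # dry run: only show the command
  else
    "$@"        # run helm for real
  fi
}

# Prints the command without contacting the cluster.
deploy_hub workshop-hub jhub
```

Defaulting to a dry run is a deliberate choice here: an accidental call then cannot touch a running hub in the middle of a workshop.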

As a matter of fact, you can do all of this on any Kubernetes cluster, including those you can run locally on your development machine, e.g. with minikube or the Kubernetes cluster included with Docker Desktop. The exact same helm and kubectl commands apply as above. Just remember to always select the correct Kubernetes context before issuing the commands.

Step 3: Load Balancer (Advanced)

Having a LoadBalancer is the way to go if you want your participants to be able to seamlessly access this service on your custom (sub-)domain. However, it is somewhat finicky and also needs support from the DNS admin of your organization.

Here we give you some general pointers on how to set it up in the GCP Console UI, but your mileage may vary:

  • Go to the Google Kubernetes Page in the GCP Console UI
  • In the menu on the left, select “Gateway, Services & Ingress” in the Networking Section. In the Services Tab select your cluster and namespace and create an Ingress for proxy-public service and wait until it is ready
  • Then go to the Load Balancers section of the GCP Console to create an “Application Load Balancer (HTTP/S)”. We used the following specifications:
    - A “Classic Application Load Balancer”
    - “Standard Network Service Tier”
    - Any fixed IP address (needed for the DNS configuration)
    - A Google-managed certificate for the domain
    - Use the public IP of the load balancer and configure it in your DNS
    - Create a new backend of type “zonal network endpoint group - GCE & GKE” with a simple health check. Also think about increasing the timeout from 30s to e.g. 3600s so the websockets from the browser don’t have to reconnect every 30 seconds
  • You will need your DNS Admin to register the IP address as an A record in your DNS together with the verification entries that GCP will ask of you

It might take some time until the DNS entry propagates to the system your Browser runs on. In the meantime you can start testing the LoadBalancer with the IP address (see section before).

Conclusion

In this guide, we’ve explored the potent combination of JupyterHub and Google Kubernetes Engine (GKE) for hosting conference workshops. Let’s quickly recap the key points:

  • Seamless Collaboration and Adaptability: By integrating JupyterHub with the scalable infrastructure of GKE, we ensure that our hosting solution can easily adapt to both small and large groups, promoting effective collaboration across diverse technical levels.
  • Customizable and Efficient Deployment: Using Helm charts, we have streamlined the deployment process, allowing for quick and customizable setups that cater to specific workshop needs.
  • Robust and User-Friendly Platform: The choice of technologies, including Docker and Kubernetes, ensures our platform is both powerful and easy to use, facilitating a smooth experience for all participants, regardless of their technical background.

In the next blog post, we will go into more detail on how to set up the individual environments for each workshop and how to allocate the needed resources. Stay tuned for part 2 and reach out if you have any questions!
