Drinking the Slurm-aid: GPU clusters for AI research 🥤
“Our researchers need a GPU cluster for training and fine-tuning models.”
You may be tempted — as I once was — to build on Kubernetes. After all, Kubernetes is designed to manage a cluster of computers. That is its bread and butter.
A reasonable Kubernetes-based platform for AI research looks like this:
- Researchers develop on local laptops or machines without GPUs.
- They’ll use some Makefile / Bash script / CLI to build and upload a container with their code in it, and talk to the Kubernetes API to spin up their workload (a sketch of that step follows this list).
- You may also record their job submission in some other more persistent store, so that they can refer to their workload after the Kubernetes API has garbage-collected it.
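For concreteness, here is a minimal sketch of that submission step using the official Kubernetes Python client. The namespace, image name, job name, and GPU count are all placeholders; a real platform would wrap this inside the Makefile or CLI mentioned above.

```python
# Minimal sketch: submit a single-container GPU training job through the
# Kubernetes API. Namespace, image, and GPU count are placeholder values.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="finetune-llm", namespace="research"),
    spec=client.V1JobSpec(
        backoff_limit=0,  # fail fast instead of silently retrying train.py
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/team/trainer:abc123",
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "8"}  # one full node's GPUs
                        ),
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="research", body=job)
```

Note that the fail-fast settings (backoff_limit=0, restart_policy="Never") have to be opted into; the defaults lean towards retrying, which feeds directly into the friction described next.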
This architecture sounds reasonable. But it introduces friction for AI research:
- Containers are hard: containers work best when they contain applications, i.e. when you can enumerate all the files, runtime dependencies, and privileges they require. The opposite is true for AI training, which is tightly coupled to the hardware and needs many system-level dependencies (CUDA, GPUs, network devices, …).
- Time-to-first-batch is slow: ML containers are huge, and they are constantly being built, rebuilt, uploaded, and downloaded. Pip install is slow, and knowledge of how best to optimise the image cache is not widespread. New experiments take many attempts to iron out shape errors and the like, so iterating slowly is frustrating and time-consuming.
- Kubernetes is not designed for fixed-length batch jobs: it is designed to run long-lived server processes. If a container dies, Kubernetes’ default behaviour is to restart it rather than tell you. This strands GPUs, leaving them to keep executing a crash-looping train.py. Kubernetes also has no built-in gang scheduling, which means that if you absolutely, categorically require 16 GPU containers for your distributed training to work, it will quite happily run just 15 of them, stalling the entire workload and preventing anything else from consuming those GPUs.
So if Kubernetes is an entirely reasonable answer that is pretty frustrating to use in practice, what can we do instead?
### Drinking the Slurm-aid
It turns out the ideal scheduler was invented over 20 years ago: Slurm. Traditionally only found in national laboratories (running, among other things, nuclear weapon simulations) and university supercomputing centres, Slurm has enjoyed a resurgence as the scheduler du jour. Around 70% of clusters run Slurm (20% Kubernetes, 10% in-house)¹. It has many features that are useful for running AI workloads:
- Workloads run directly on compute instances, without additional virtualisation or containerisation.
- It has support for gang scheduling, queuing, and partitioning of clusters.
- It persists historic job information in an accounting database.
- Code is (typically) saved to an attached network drive shared throughout the cluster, which greatly reduces cold-start latency.
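To make that concrete, here is a hedged sketch of what a submission might look like. sbatch will generally accept a batch script in any interpreted language (it honours the shebang) and reads #SBATCH directives from the comment lines at the top, so the file below doubles as both the batch script and the launcher. The partition name, node and GPU counts, and the train.py / config paths are all assumptions.

```python
#!/usr/bin/env python3
# Hypothetical submission script: run with `sbatch submit_train.py`.
# sbatch reads the #SBATCH directives below; because they are ordinary
# Python comments, this one file is both the batch script and the launcher.
#
# Two whole nodes with 8 GPUs each, gang-scheduled: the job starts only
# when all of them are free, and the time limit caps the run.
#SBATCH --job-name=finetune-llm
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8
#SBATCH --time=24:00:00
#SBATCH --output=%x-%j.out
import subprocess

# srun launches one task per node inside the allocation. The code and the
# Python environment live on the shared network drive, so there is no image
# to build, push, or pull before the first batch.
subprocess.run(
    ["srun", "python", "train.py", "--config", "configs/finetune.yaml"],
    check=True,
)
```

A researcher submits this with sbatch, watches the queue with squeue, and can pull historic records later with sacct from the accounting database mentioned above.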
Slurm on its own is no silver bullet:
- You’re usually responsible for running and configuring the Slurm control plane (of the big three cloud providers, only AWS has a managed Slurm product). Slurm is finicky and difficult to configure.
- You’re required to have a fast shared NFS drive attached to all nodes.
- You’ve also got to manage a set of “login nodes” for researchers to submit jobs from. They’ll probably want to author code in VS Code, which tends to be resource-intensive.
- Researchers need to learn how to use Slurm. There’s a reason every university supercomputing lab has written their own extensive “how to use Slurm” documentation.
- You’ll still need some additional system to visualise hardware occupancy and utilisation, and correlate that with Slurm jobs (see the sketch after this list).
- Because Slurm runs code directly from a user directory, and by default stores logs in the same directory, it can become hard to work out what code ran when, and therefore difficult to reproduce research.
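On the correlation point, the accounting database does at least give you something to join against. A rough sketch, assuming sacct is available on the login node and that the cluster tracks GPUs as a gres/gpu trackable resource:

```python
# Rough sketch: list the last day's jobs with their GPU allocations by
# querying Slurm's accounting database via sacct. The field names are
# standard sacct output fields; the gres/gpu key assumes GPUs are
# configured as a trackable resource (TRES) on this cluster.
import subprocess

FIELDS = "JobID,JobName,Partition,Elapsed,State,AllocTRES"

out = subprocess.run(
    ["sacct", "--allocations", "--parsable2",
     "--starttime", "now-1days", "--format", FIELDS],
    check=True, capture_output=True, text=True,
).stdout

header, *rows = out.splitlines()
for row in rows:
    job_id, name, partition, elapsed, state, tres = row.split("|")
    # AllocTRES looks like "cpu=32,gres/gpu=8,mem=256G,node=1"
    alloc = dict(kv.split("=", 1) for kv in tres.split(",") if "=" in kv)
    gpus = alloc.get("gres/gpu", "0")
    print(f"{job_id:>10}  {name:<20} {state:<12} {elapsed:>10}  {gpus} GPU(s)")
```

This answers what ran and for how long; live occupancy and utilisation graphs still need a separate metrics system, which is what the bullet above is referring to.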
Kubernetes’ reliance on containers, and its inappropriately-designed scheduler, will slow your AI research down. Slurm has a higher setup and operational cost, but it is simpler and faster for AI workloads, and many researchers are likely already familiar with it. Clusterfudge introduces modern comforts that make it easier to get started.