Run web services on HPC? Stupid?

Dirk Petersen
3 min read · Sep 25, 2024


Or are there some use cases that can benefit from forever-slurm?

So, what do you do if you want to offer a long-running backend web service that requires a lot of compute power, but all the powerful machines in your organization, including the ones with GPUs, are part of a high-performance computing (HPC) system that only lets you submit batch jobs?

You are facing two problems:

  1. The HPC batch system will only let your job run for a few hours, or perhaps days, before killing it. It may also kill the job at any time when a higher-priority job comes along, and your web frontend or client process will then no longer be able to reach the backend.
  2. The beefy compute nodes of an HPC system are often hidden behind a gateway machine called a “login node”. You can access the login node, and only from there can you see the compute nodes; they are not reachable directly from your web server or client process.

So, which use cases are we talking about? As an example, many organizations do not allow ChatGPT or other cloud-based AI systems and want to build their own on-premises chatbot. For this, you need a beefy inference server with GPUs to host an open-source large language model (LLM) such as Llama. But what can you do if all your beefy machines are in the HPC system?

forever-slurm solves this by using a load-balancing proxy called Traefik that sits on the login node and routes traffic to one or more compute nodes that are currently running suitable jobs. forever-slurm submits as many jobs to the cluster as needed to maintain a stable environment in which at least one node is running the appropriate software at all times.
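To make this concrete, the routing Traefik performs could look roughly like the following dynamic configuration for its file provider. This is a hand-written sketch, not output from forever-slurm itself; the service name, node hostnames, ports, and health-check path are all assumptions for illustration.

```yaml
# Hypothetical Traefik dynamic configuration (file provider).
# Node names and ports are placeholders; in practice this mapping
# is kept up to date from the metadata about running Slurm jobs.
http:
  routers:
    llm-inference:
      rule: "PathPrefix(`/v1`)"
      service: llm-inference
  services:
    llm-inference:
      loadBalancer:
        healthCheck:
          path: "/health"       # assumed endpoint exposed by the inference server
          interval: "10s"
        servers:
          - url: "http://gpu-node-17:8001"
          - url: "http://gpu-node-23:8001"
```

When a job is preempted or ends, its node simply disappears from the `servers` list and Traefik keeps routing to the remaining healthy backends.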

There is a bit more to it, such as keeping metadata that tells Traefik which nodes and ports to use, and ensuring that processes belonging to the same service do not land on the same compute node, which provides some level of high availability. You can get forever-slurm ready to run in a few seconds; simply execute the setup script ./config.sh.

Now, you could argue that this is a hilarious idea, because the huge Llama 3.1 405B model will block something like 6 x A100 GPUs with 80GB each for an LLM inference server that does nothing most of the time. That would be the equivalent of a $150k machine idling! What now? Well, one solution could be buying a $150k dedicated server outside the HPC cluster. This does two things:

  1. The HPC sysadmins and the HPC steering committee feel better, because an inefficient process has been removed from a system they care about.
  2. The inefficiency for the entire organization has increased further, because this machine will now be idle even more, as it can never be used for other computations.

Clearly a Verschlimmbesserung. How can we overcome this? We know that some parts of an enterprise HPC system are always idle; in fact, a system with 80% or more average utilization would be considered quite busy. However, most HPC systems have an option to dish up those remaining resources temporarily, provided you allow the system to take them away from you should a higher-priority batch job come along. You just need to add the “preemptable” option to your job, which is something your HPC sysadmin can set up for the job queue. With this feature, forever-slurm can grab idle cycles on the HPC system and will ensure that there is always at least one compute node running that Traefik can route traffic to.
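A preemptable job of this kind might be submitted with a batch script along these lines. This is a sketch under assumptions: the QOS name, GPU resource string, and the inference-server command are placeholders, since partition and QOS names vary by site and forever-slurm generates its own job scripts.

```shell
#!/bin/bash
# Hypothetical Slurm batch script for a preemptable inference job.
# QOS, partition, and resource names below are site-specific assumptions.
#SBATCH --job-name=llm-inference
#SBATCH --qos=preemptable        # allow Slurm to reclaim the node for higher-priority jobs
#SBATCH --requeue                # put the job back in the queue after preemption
#SBATCH --gres=gpu:a100:6        # six 80GB A100s for Llama 3.1 405B (illustrative)
#SBATCH --time=7-00:00:00

# Start an inference server on a port the login-node proxy can route to.
# vLLM's OpenAI-compatible server is used here purely as an example backend.
srun python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-405B --port 8001
```

With `--requeue`, a preempted job goes back into the queue automatically, so the pool of running backends refills itself whenever idle cycles become available again.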
