Dear Research Computing Community:

As you may have noticed, our newly added H200 GPUs on the Explorer cluster have quickly become highly popular and heavily utilized. We are excited that these state-of-the-art GPUs are enabling a wide range of scientific and engineering explorations.

To enhance your computing experience and ensure our shared resources are used effectively, we’ve implemented several operational improvements:

Introduction of the “gpu-interactive” queue.

We have launched a new “gpu-interactive” queue dedicated solely to interactive GPU jobs submitted through the Open OnDemand interface. The existing GPU queues (gpu-short, gpu, multi-gpu) are reserved for traditional long-running batch jobs. This separation ensures that batch jobs are not blocked by bursts of interactive GPU jobs, improving GPU availability for all users.

Active GPU utilization best practices.

GPUs are valuable shared resources, so please request only what you need and release GPUs when they are no longer in use. There is always someone waiting in the queue to test a new idea and do exciting research on these shared GPUs. To support this, we have deployed the “GPU IdleBot” tool, which detects when a GPU is practically idle. If a GPU remains idle for an hour, it is automatically released to other users in the queue.
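
For readers curious how an idle check of this kind might work, here is a minimal sketch in Python. It is not the actual GPU IdleBot implementation, which this message does not describe. The names Sample, idle_seconds, and should_release are hypothetical, and the 5% utilization threshold is an assumed stand-in for "practically idle"; only the one-hour window comes from the text above.

```python
from dataclasses import dataclass
from typing import Sequence

# Assumption: "practically idle" means utilization below 5%.
# The real tool's threshold is not stated in this announcement.
IDLE_UTIL_THRESHOLD = 5.0
# One hour, as stated in the announcement.
IDLE_WINDOW_SECONDS = 3600.0


@dataclass
class Sample:
    """One utilization reading for a single GPU (hypothetical structure)."""
    timestamp: float    # seconds since some reference point
    utilization: float  # GPU utilization in percent (0 to 100)


def idle_seconds(samples: Sequence[Sample], now: float,
                 threshold: float = IDLE_UTIL_THRESHOLD) -> float:
    """Return how long the GPU has been continuously idle as of `now`.

    Samples are assumed to be in chronological order. Any reading at or
    above the threshold resets the idle streak.
    """
    idle_since = None
    for s in samples:
        if s.utilization >= threshold:
            idle_since = None          # GPU was busy: streak broken
        elif idle_since is None:
            idle_since = s.timestamp   # first idle reading of a new streak
    return 0.0 if idle_since is None else now - idle_since


def should_release(samples: Sequence[Sample], now: float,
                   threshold: float = IDLE_UTIL_THRESHOLD,
                   window: float = IDLE_WINDOW_SECONDS) -> bool:
    """True if the GPU has been idle for at least the full window."""
    return idle_seconds(samples, now, threshold) >= window


if __name__ == "__main__":
    # Readings every 10 minutes for one hour, all at 0% utilization.
    quiet = [Sample(t, 0.0) for t in range(0, 3601, 600)]
    print(should_release(quiet, now=3600.0))

    # Same readings, but with a burst of work at the 30-minute mark.
    bursty = [Sample(t, 90.0 if t == 1800 else 0.0) for t in range(0, 3601, 600)]
    print(should_release(bursty, now=3600.0))
```

In this sketch, a single busy reading resets the idle clock, so a job that does real work at least once an hour would not be flagged. To check a job's own utilization, running nvidia-smi inside the job shows the current GPU utilization.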

Improvements to GPU queue structure and access.

To help reduce queue wait times and improve fairness among users, we have refined the GPU queue structure and adjusted per-user job submission quotas (please see this link). We have also reinforced that direct SSH access to GPU nodes is not supported. This is standard practice across HPC centers and prevents accidental or malicious disruptions.

I thank our RC staff team for successfully transitioning most of our GPU nodes in the public partition to our shiny Explorer cluster and working tirelessly to deploy these new improvements.

I would like to highlight that these improvements were shaped in part by the feedback we received from our user community, so please continue sharing your suggestions and ideas with us. We are always listening and improving.

Warmly,

…∂t