Dear Research Computing Community:
Introduction of “gpu-interactive” queue.
We have launched a new “gpu-interactive” queue dedicated solely to interactive GPU jobs submitted via the Open OnDemand interface. All other existing GPU queues (gpu-short, gpu, multi-gpu) remain dedicated to traditional long-running batch jobs. This separation ensures that batch jobs are not blocked by bursts of interactive GPU jobs, improving GPU availability for all users.
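For example, a traditional batch job would still target one of the batch queues. A minimal Slurm script sketch follows; the partition names come from this announcement, but the resource flags and the workload shown are illustrative assumptions, not a prescribed template for our cluster:

```shell
#!/bin/bash
#SBATCH --partition=gpu-short      # batch queues: gpu-short, gpu, multi-gpu
#SBATCH --gres=gpu:1               # request a single GPU (assumed GRES syntax)
#SBATCH --time=01:00:00            # wall-clock limit for a short batch job
#SBATCH --job-name=train-demo      # illustrative job name

# Interactive GPU sessions should instead use the gpu-interactive
# queue through the Open OnDemand web interface.
srun python train.py               # illustrative workload
```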
Active GPU utilization best practices.
GPUs are valuable shared resources; please request only what you need, and release GPUs when they are no longer in use. There is always someone waiting in the queue to test a new idea and do exciting research on these shared GPUs. To support this, we have deployed the “GPU IdleBot” tool, which detects when a GPU is practically idle. If a GPU remains idle for an hour, it will be automatically released back to others in the queue.
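The internals of GPU IdleBot are not described here, so as a minimal sketch: assuming it periodically samples per-GPU utilization (e.g., via nvidia-smi) and treats sustained near-zero utilization as idle, the core release decision might look like this. The 5% threshold and the class/function names are illustrative assumptions; only the one-hour limit comes from this announcement.

```python
from dataclasses import dataclass

IDLE_THRESHOLD_PCT = 5      # utilization below this counts as "practically idle" (assumed value)
IDLE_LIMIT_SECONDS = 3600   # release after one hour of continuous idleness

@dataclass
class GpuIdleTracker:
    """Tracks continuous idle time for one GPU from periodic utilization samples."""
    idle_seconds: int = 0

    def observe(self, utilization_pct: float, interval_seconds: int) -> bool:
        """Record one sample; return True once the GPU should be released."""
        if utilization_pct < IDLE_THRESHOLD_PCT:
            self.idle_seconds += interval_seconds
        else:
            self.idle_seconds = 0  # any real activity resets the idle clock
        return self.idle_seconds >= IDLE_LIMIT_SECONDS

tracker = GpuIdleTracker()
# Simulate 60 one-minute samples of a completely idle GPU.
released = any(tracker.observe(0.0, 60) for _ in range(60))
print(released)  # True: released after one hour of continuous idleness
```

Note the reset on activity: a GPU that does real work even briefly starts its idle clock over, so only genuinely abandoned sessions are reclaimed.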
Improvements to GPU queue structure and access.
To help reduce queue wait times and improve fairness among users, we have refined the GPU queue structure, adjusted per-user job submission quotas (please see this link), and reinforced that direct SSH access to GPU nodes is not supported, a standard practice across HPC centers to prevent accidental or malicious disruptions.
I thank our RC staff team for successfully transitioning most of our GPU nodes in the public partition to our shiny Explorer cluster and working tirelessly to deploy these new improvements.
I would like to highlight that these improvements were shaped in part by feedback from our user community. So, please continue sharing your suggestions and ideas with us; we are always listening and improving.
Warmly,
…∂t
Devesh Tiwari
Associate Vice Provost for Research Computing
Office of the Provost
d.tiwari@northeastern.edu
https://coe.northeastern.edu/people/tiwari-devesh/
