USC High-Performance Computing recently transitioned from PBS Torque/Moab to Slurm. The transition has gone surprisingly well, and I wanted to blog a little about what we did as ACI-REFs to prepare for it.
The transition took place during the week of March 19th. By early January, the lead Slurm administrator had set up a representative test cluster (one head node and four compute nodes) running Slurm for external use, beyond the internal development cluster, and HPC began reaching out to early adopters: individual researchers, cluster managers, and PIs who owned a small number of condo nodes that could potentially be moved to the new test cluster.
ACI-REFs—my colleague Cesar Sul and I—followed up on this initial outreach, offering to discuss the upcoming transition and to help researchers with testing. We started researching Slurm and developing a presentation and documentation: the presentation explained why we were transitioning to Slurm and what would change, and the documentation described how to use the test cluster. We met with six project groups, urging users to start testing their research workflows.
Eventually, repeating “Your PBS scripts will no longer work…” had the desired effect. As we began working with early adopters on the test cluster, and as 11 condo nodes and 2 partitions were added, the earliest issues, like cgroup configuration and cluster size, were administrative. Before long we were assisting serious testers, who asked many good questions that we discussed in weekly meetings with the cluster administrators.
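To give a sense of why existing scripts broke, here is a minimal sketch of the kind of translation users faced. The directive names are standard PBS and Slurm syntax, but the job, partition, and program names are made up for illustration and don't reflect USC's actual configuration:

```shell
#!/bin/bash
# Hypothetical Slurm batch script; each directive's rough PBS
# equivalent is noted in the trailing comment.
#SBATCH --job-name=myjob               # was: #PBS -N myjob
#SBATCH --nodes=1 --ntasks-per-node=8  # was: #PBS -l nodes=1:ppn=8
#SBATCH --time=01:00:00                # was: #PBS -l walltime=01:00:00
#SBATCH --partition=main               # was: #PBS -q main

srun ./my_program                      # mpiexec/mpirun generally becomes srun
```

The day-to-day commands map similarly: `qsub` becomes `sbatch`, `qstat` becomes `squeue`, and `qdel` becomes `scancel`.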
The first real issues arose because, unlike PBS, Slurm does not automatically clear the runtime environment at job start, and because many researchers were not in the habit of specifying memory constraints, which they now needed to do. We were also encouraging users to switch from node-centric to cpu-centric allocation requests, to prepare for HPC’s planned implementation of node sharing and to make better use of compute resources. Finally, we had to find substitutes for the resource-usage summary in PBS’ epilogue, and for PBS’ informational commands generally.
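The three script-level habits above can be sketched in a job header like the following. This is an illustrative fragment, not USC's recommended template; the values are placeholders:

```shell
#!/bin/bash
# Hypothetical Slurm job header illustrating the three changes discussed above.
#SBATCH --export=NONE      # start with a clean environment, as PBS did by default
                           # (Slurm otherwise propagates your submission shell's
                           # environment; with NONE, reload any needed modules here)
#SBATCH --mem-per-cpu=2G   # memory must now be requested explicitly
#SBATCH --ntasks=16        # cpu-centric: 16 tasks wherever they fit, instead of
                           # the node-centric --nodes=2 --ntasks-per-node=8

srun ./my_program
```

As for the epilogue's resource report, after a job finishes something like `sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,State` recovers the elapsed time and peak memory that PBS used to print, while `squeue` and `sinfo` cover the day-to-day informational queries.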
The new cluster was growing, maturing, and being used; it was exciting. Researchers were sharing their findings and administrators were making final configuration changes. As ACI-REFs, we sat between the two, documenting the process almost daily on two new webpages—a short-term transition page and a long-term documentation page—and, more generally, updating the information and examples on the website.
On March 19th, the entire 2711 node cluster went down for a week…