In late March, USC's Center for High-Performance Computing (HPC) brought our cluster back online after a week-long downtime. We schedule two downtimes per year for regular maintenance, but, as mentioned in part one, this particular downtime required extra planning because we were making a major change to how jobs are scheduled in our environment. In fact, we had been working out the details of the transition from PBS to Slurm for about a year. I spent most of the downtime week reviewing our online documentation to make sure it was accurate. Our biggest concern was that we would be overwhelmed by people asking, “Why don’t my PBS scripts work?” so having accurate documentation explaining the new job scheduler was a high priority for me.
I think having good documentation really paid off: we received very few questions about basics like Slurm syntax that could easily be found on our website. Most of the issues our users hit were pitfalls that our testers had already reported to us, so we had a good idea of how to help. The remaining problems were related to our Slurm configuration; these only surfaced once the cluster was in production. Because so much prep work was done before the downtime, the ACI-REFs were able to handle most of the tickets, which freed the system administrators to focus on the tougher configuration issues.
There’s never a good time to make a disruptive change to a cluster environment, but based on our Slurm experience we recommend the following:
- Create a separate test cluster environment
- Identify people who are willing to be early testers
- Use their feedback to create documentation
- Open the test cluster to everyone
- Use additional feedback to refine documentation
- Offer as much training as possible before the change
- Periodically send out answers to frequently asked questions after the change