Introducing the latest Slurm on Google Cloud scripts
Google Cloud is a great home for your high performance computing (HPC) workloads. As with all things Google Cloud, Google works hard to make complex tasks seem easy. For HPC, a big part of user friendliness is support for popular tools such as schedulers.
If you run HPC workloads, you’re likely familiar with the Slurm workload manager. Today, with SchedMD, Google is announcing the newest set of features for Slurm running on Google Cloud, including one-click hybrid configuration, Google Cloud Storage data migration support, real-time configuration updates, Bulk API support, improved error handling, and more. You can find these new features today in the Slurm on Google Cloud GitHub repository and on the Google Cloud Marketplace.
Slurm is one of the leading open-source HPC workload managers used in TOP500 supercomputers around the world. Over the past five years, Google has worked with SchedMD, the company behind Slurm, to release ever-improving versions of Slurm on Google Cloud.
Here’s more information about the newest features:
Turnkey hybrid configuration
You can now use a simple hybrid Slurm configuration setup script to enable Google Cloud partitions in an existing Slurm controller, allowing Slurm users to connect an on-premises cluster to Google Cloud quickly and easily.
Google Cloud Storage data migration support
Slurm now has a workflow script that supports Google Cloud Storage, allowing users to define data movement actions to and from storage buckets as part of their job. Note that Slurm can handle jobs with input and output data pointing to different Google Cloud Storage locations.
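To make the idea of separate input and output locations concrete, here is a minimal Python sketch. It is not part of the Slurm on Google Cloud scripts; the `job_data` dictionary, the bucket names, and the `split_gcs_uri` helper are all hypothetical and only illustrate how a `gs://` location breaks down into a bucket and an object prefix.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_path)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical job description: inputs and outputs live in different buckets.
job_data = {
    "stage_in": "gs://example-input-bucket/datasets/run42/",
    "stage_out": "gs://example-results-bucket/run42/output/",
}

for action, uri in job_data.items():
    bucket, prefix = split_gcs_uri(uri)
    print(f"{action}: bucket={bucket} prefix={prefix}")
```

The actual syntax for declaring data movement actions is defined by the workflow script itself, so check the repository README before relying on any of the names used here.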
Real-time configuration updates
Slurm now supports post-deployment reconfiguration of partitions, with responsive actions taken as needed, allowing users to make changes to their HPC environment on the fly.
Bulk API support
Building on the Bulk API integration completed in the Slurm scripts released last year, the newest scripts now support Bulk API’s Regional Endpoint calls, Spot VMs, and more.
Clearer error handling
This latest version of Slurm on Google Cloud indicates the specific place (e.g., job node, node info, or filtered log file) where an API error has occurred, and exposes any underlying Google API errors directly to users. The scripts also add an “installing” animation and guidance on how to check for errors if the installation takes longer than expected.
Billing tracking in BigQuery and Stackdriver
You can now access usage data in BigQuery, which you can merge with Google Cloud billing data to compute the costs of individual jobs, and track and display custom metrics for Stackdriver jobs.
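The per-job cost calculation described above comes down to multiplying node-hours by a price. The Python sketch below shows that arithmetic only. The `JobUsage` fields and the `HOURLY_PRICE` table are assumptions made for illustration; they are not the schema of the BigQuery usage table, and the prices are not real Google Cloud rates.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class JobUsage:
    """One usage record for a finished job (hypothetical fields)."""
    job_id: int
    machine_type: str
    node_count: int
    runtime_seconds: float


# Hypothetical hourly price per node; NOT real Google Cloud pricing.
HOURLY_PRICE: Dict[str, float] = {"machine-a": 2.00, "machine-b": 0.50}


def job_cost(usage: JobUsage, prices: Dict[str, float]) -> float:
    """Cost = nodes x hours x hourly price for the job's machine type."""
    node_hours = usage.node_count * usage.runtime_seconds / 3600.0
    return node_hours * prices[usage.machine_type]


# Four nodes for half an hour at the assumed "machine-a" rate.
example = JobUsage(job_id=1001, machine_type="machine-a",
                   node_count=4, runtime_seconds=1800)
print(f"job {example.job_id}: ${job_cost(example, HOURLY_PRICE):.2f}")
```

In practice the join between usage rows and billing rows would more likely be done as a query inside BigQuery; the sketch just makes the underlying calculation explicit.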
Adherence to Terraform and Image Creation best practices
The Slurm image creation process has now been converted to a Packer-based solution. The necessary scripts are incorporated into an image and then parameters are provided via metadata to define the Ansible configuration, all of which follows Terraform and Image Creation best practices. All new Terraform resources now use Cloud Foundation Toolkit modules where available, and you can use bootstrap scripts to configure and deploy Terraform modules.
You can now enable or disable OS Login and install LDAP libraries across your Slurm cluster (the available options include OSLogin, LDAP, and Disabled). Note that the admin must manually configure any non-OS Login authentication across the cluster.
Support for Instance Templates
Following on the Instance Template support launched in last year’s Slurm on Google Cloud version, you can now use additional Instance Template features launched in the intervening year (e.g. hyperthreading, Spot VM).
Enhanced customization of partitions
The latest version of Slurm on Google Cloud adds multiple ways to customize your deployed partitions, including injection of custom prolog and epilog scripts, per-partition startup scripts, and the ability to configure more Slurm capabilities on compute nodes.
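For readers who have not written a prolog before, the following Python sketch shows the kind of logic a custom prolog might contain: deriving a per-job scratch directory from the `SLURM_JOB_ID` environment variable that Slurm sets for prolog and epilog scripts. The `scratch_dir_for` helper and the `/tmp/slurm-scratch` base path are hypothetical, and the way a script is registered with a partition is defined by the Slurm on Google Cloud scripts, not shown here.

```python
import os
from typing import Mapping


def scratch_dir_for(env: Mapping[str, str],
                    base: str = "/tmp/slurm-scratch") -> str:
    """Return the per-job scratch path for the job described by env.

    The base directory is an assumed location chosen for this example.
    """
    job_id = env.get("SLURM_JOB_ID")
    if not job_id:
        raise KeyError("SLURM_JOB_ID is not set; is this running under Slurm?")
    return os.path.join(base, f"job-{job_id}")


def prepare_scratch(env: Mapping[str, str]) -> str:
    """Create the scratch directory if needed and return its path."""
    path = scratch_dir_for(env)
    os.makedirs(path, exist_ok=True)
    return path


# A prolog script would call prepare_scratch(os.environ); a matching
# epilog could remove the same directory once the job has finished.
```

Keeping the path logic in a pure function makes it easy to test outside a cluster by passing a plain dictionary instead of `os.environ`.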
The Slurm experts at SchedMD built this new release. You can download this release from SchedMD’s GitHub repository. For more information, check out the included README. If you need help getting started with Slurm, check out the quick start guide. For help with the Slurm features for Google Cloud, check out the Slurm Auto-Scaling Cluster codelab and the Deploying a Slurm cluster on Google Compute Engine and Installing apps in a Slurm cluster on Compute Engine solution guides. If you have further questions, you can post on the Slurm on Google Cloud Google discussion group, or contact SchedMD directly.