By Alex Lovell-Troy, Head of the OpenCHAMI Technical Steering Committee (TSC)
We’re pleased to announce that OpenCHAMI has joined the High Performance Software Foundation (HPSF). The consortium formed to steward OpenCHAMI believes in a broad coalition of developers and operators who support each other to improve the overall state of the HPC industry. Joining the HPSF connects us with an even larger community and lets us share tools and techniques with other open source projects in pursuit of our collaboration goals.
About OpenCHAMI
OpenCHAMI is a cloud-like provisioning and management toolkit for on-premises HPC systems. Its core components are composable microservices that tailor the provisioning and boot operations of each HPC node to the needs of its users, even when those needs differ across the same cluster. While it has been used on large, multi-row HPC systems, all of the complexity is opt-in. Admins with only a few nodes can be up and running in less than five minutes with the same security levels and the exact same APIs available at full supercomputer scale.
Workloads are the purpose of HPC systems, but it is the job of the sysadmin to deliver availability and reliability of the HPC resources to make workloads possible. As a project, OpenCHAMI is focused on empowering sysadmins with flexible, powerful, and secure tools to meaningfully improve the availability and reliability of HPC systems.
Recent Highlights
The OpenCHAMI project is still young, having not yet achieved a 1.0 release. However, the team at Los Alamos National Laboratory (LANL) successfully integrated the first production OpenCHAMI cluster in February of 2025. This system, with hardware from Dell and NVIDIA, will be part of LANL’s Institutional Commitment to AI and will rely on OpenCHAMI to perform with high availability and reliability.
OpenCHAMI is designed with flexibility in mind, allowing each deployment to be extensively customized without committing all users to the same feature set. As an example, one site set out to reduce boot times and simplify management overhead. They extended the OpenCHAMI cloud-init metadata server, which allowed them to remove Ansible plays from the boot process. In their place, the site relies on cloud-init, a tool that is ubiquitous in the cloud. Through a custom metadata server and WireGuard tunneling, each compute node in an HPC cluster can boot faster and more securely. Initial tests show the time to boot a 650-node cluster, measured after the hardware Power-On Self-Test (POST), dropping from over eight minutes to roughly 40 seconds. While Ansible can still play an important role in ongoing maintenance of the machine, having the option to exclude it from boot met their needs well. See our blog at https://openchami.org/blog for more information.
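To give a feel for what per-node cloud-init data can look like, here is a minimal Python sketch. It is not the OpenCHAMI metadata server or the site's extension; the function names, group names, and package names are hypothetical and exist only for illustration. It relies on two cloud-init conventions: cloud-config user-data begins with a `#cloud-config` line, and meta-data carries `instance-id` and `local-hostname` keys.

```python
# Hypothetical per-group settings; a real site would draw these from its inventory.
GROUP_PACKAGES = {
    "compute": ["example-scheduler-agent"],
    "gpu": ["example-scheduler-agent", "example-gpu-tools"],
}


def render_user_data(group):
    """Return cloud-config user-data that installs the packages for one node group."""
    lines = ["#cloud-config", "packages:"]
    lines += [f"  - {pkg}" for pkg in GROUP_PACKAGES[group]]
    return "\n".join(lines) + "\n"


def render_meta_data(node_id, hostname):
    """Return the minimal meta-data document cloud-init expects for an instance."""
    return f"instance-id: {node_id}\nlocal-hostname: {hostname}\n"


if __name__ == "__main__":
    # Node identifiers here are made up for the example.
    print(render_meta_data("x1000c0s0b0n0", "nid001"))
    print(render_user_data("gpu"))
```

A metadata server built along these lines would choose the group from the identity of the requesting node, which is how nodes in the same cluster can receive different configuration at boot without a separate configuration-management pass.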
New and Upcoming Initiatives
As more consortium partners put more systems into production with OpenCHAMI, we are pursuing a 1.0 release that includes all microservices and deployment tooling necessary to run a turnkey HPC system. In particular, we are focused on integrating Redfish-based power management and a microservice approach to console management.
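For readers unfamiliar with Redfish, the sketch below shows the kind of request that Redfish-based power management rests on. It illustrates the DMTF Redfish standard's `ComputerSystem.Reset` action, not OpenCHAMI's power management service; the BMC address and system ID are placeholders, and the helper name is hypothetical. The request is only constructed, never sent, and authentication is left out.

```python
import json
import urllib.request


def build_reset_request(bmc_url, system_id, reset_type="On"):
    """Build (but do not send) a Redfish ComputerSystem.Reset request.

    reset_type is one of the standard Redfish ResetType values,
    for example "On", "ForceOff", or "GracefulShutdown".
    """
    url = (
        f"{bmc_url.rstrip('/')}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder BMC address and system ID; real values are hardware-specific.
    req = build_reset_request("https://bmc.example", "1", "ForceOff")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Because every Redfish-compliant BMC accepts the same action, a management service can wrap calls like this behind one API instead of carrying vendor-specific power tooling.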
In addition, several partners are collaborating on extensive end-to-end integration tests that can be used to validate every pull request and every installation environment.
Get Involved
We need your help. Use OpenCHAMI as it exists today and tell us how it should evolve over time. Join us on Slack. Review our open issues and pull requests. Engage with our RFDs, which inform our overall architecture decisions. Join our mailing list to hear about upcoming tutorials at conferences across the globe.
About the Author
Alex Lovell-Troy is a Research Scientist at Los Alamos National Laboratory. He is the head of the Technical Steering Committee for OpenCHAMI and has been working on the project since its inception. Before OpenCHAMI, he contributed to the HPC community as a cloud architect at Cray and through his work with NCSA and the Large Binocular Telescope Observatory.