AsianScientist (Mar. 28, 2019) – From massive systems to minuscule structures, scientists are collecting data at every imaginable scale: satellites and sensors constantly survey our planet and the cosmos, while researchers in laboratories around the world scrutinize the movements of individual atoms and molecules.
But data in its raw form is like unprocessed mineral ore—ugly and not particularly useful. The fact that there is so much of it only compounds the problem—sifting through the noise to derive insights from data is not only expensive and time-consuming, but also almost impossible to do manually. As the rate of data generation outstrips the pace of analysis, scientists are finding supercomputers increasingly necessary to conduct meaningful research.
“For example, modeling the Big Bang and forecasting the weather are very computationally intensive. Also, in biology, the number-crunching capabilities of supercomputers are needed for modeling molecular dynamics,” said Dr. Andreas Wilm, team leader of the bioinformatics core at the Genome Institute of Singapore (GIS). “There are plenty of research areas that require high performance computing (HPC).”
HPC as a public utility
Useful as supercomputers may be, not everyone has access to them. Here, access has two dimensions: the availability of supercomputing resources, and the technical ability to use them productively.
Most of the world’s HPC prowess is currently concentrated in the US, China and Japan, which means that one does not simply plug into a petaFLOPS supercomputer and run an experiment. This is set to change as cloud computing providers such as Microsoft Azure increasingly offer HPC-like resources on demand, said Wilm.
“Rather than own supercomputing hardware themselves, researchers can tap on systems like AWS’ CfnCluster or Microsoft’s Cycle Computing to spin up their very own ‘supercomputer’ in the cloud from their laptops,” he added.
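As a rough illustration of what Wilm describes, and not an official recipe, the short Python sketch below drives the CfnCluster command-line tool from a laptop. It assumes cfncluster is installed and that a cluster template (instance types, scheduler, queue sizes) plus AWS credentials have already been set up in ~/.cfncluster/config; the subcommand names are taken from CfnCluster 1.x and may differ in its successor, AWS ParallelCluster.

    # Rough sketch: provisioning and tearing down a cloud cluster from a laptop
    # by shelling out to the cfncluster CLI. Cluster name and subcommands are
    # illustrative; check the version of the tooling you actually use.
    import subprocess

    def cfncluster(*args):
        """Run a cfncluster subcommand and stop if it fails."""
        subprocess.run(["cfncluster", *args], check=True)

    cfncluster("create", "my-cluster")   # provision a head node, compute fleet and shared storage
    cfncluster("status", "my-cluster")   # confirm the cluster came up
    # ...log in to the head node, submit jobs to its scheduler, fetch results...
    cfncluster("delete", "my-cluster")   # tear the cluster down to stop incurring charges

Everything here, from the cluster name to the exact commands, stands in for the real tooling; the point is that provisioning and dismantling a cluster reduces to a handful of commands run from a laptop.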
Mr. Alex Nodeland, cofounder and CEO of Archanan, a company that develops cloud-based development environments for HPC software, thinks that in the future, “the majority of the world’s HPC resources will be accessible over a cloud-like interface.” This looks set to usher in an era of democratized supercomputing, where geography and cost no longer prevent scientists from performing complex and computationally intensive analyses.
FLOPS is not where it stops
Hardware woes aside, the processes and practices surrounding the use of supercomputers also pose barriers to their widespread adoption in research.
“For complex analytics such as those used in computational genomics, researchers may have to execute many different programs on different slices of data in a specific order, and with certain dependencies between jobs. Managing this orchestration can become an overwhelming task,” explained Wilm.
Nodeland also highlighted the steep learning curve that researchers face.
“HPC users come from diverse backgrounds and are not always well equipped with knowledge of computer architecture and networking. Combined with very diverse user interfaces and complex computing environments, it can be challenging for scientists to take advantage of supercomputers for their research.”
Supercomputing simplified
Fortunately for researchers, the HPC community is working to improve the usability of supercomputers. To overcome obstacles in orchestration, for instance, Wilm recommended the use of workflow management software such as Snakemake and Nextflow, which simplify the development, execution and scaling of a computational pipeline.
“With a given workflow recipe, users—in theory—only have to invoke one command, which then automatically orchestrates hundreds of others,” he said, alluding to a domino effect where an initial trigger drives all downstream activities. “At GIS, we use Nextflow to orchestrate the analysis of thousands of genomic samples on the National Supercomputing Centre’s ASPIRE 1 supercomputer.”
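To make that concrete, here is a toy Snakefile written in Snakemake’s Python-based workflow language. The sample names, file paths and shell commands are placeholders invented for illustration rather than anything from GIS’s actual pipelines; the point is that the dependency graph is declared once, after which one command runs everything in the right order.

    # Hypothetical sample IDs; a real pipeline might list thousands.
    SAMPLES = ["sample1", "sample2"]

    # Default target: asking for the final report makes Snakemake work out,
    # and run, every upstream job needed to produce it.
    rule all:
        input:
            "results/report.txt"

    # One job per sample; independent jobs can run in parallel or be
    # dispatched to a cluster scheduler.
    rule process_sample:
        input:
            "data/{sample}.fastq"
        output:
            "processed/{sample}.txt"
        shell:
            "wc -l {input} > {output}"    # stand-in for a real analysis step

    # Runs only after every per-sample job above has finished.
    rule report:
        input:
            expand("processed/{sample}.txt", sample=SAMPLES)
        output:
            "results/report.txt"
        shell:
            "cat {input} > {output}"

Running a single command such as "snakemake --cores 8" then triggers the whole cascade of jobs, which is the domino effect Wilm alludes to; Nextflow, the tool used at GIS, plays the same role with its own syntax.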
Meanwhile, Nodeland is looking to go beyond workflow management and enable researchers to custom-build their own HPC software. Yet, software created on traditional desktops may not run properly when installed on supercomputers due to differences in core count and coding standards.
Because of this, about one third of a supercomputer’s capacity typically goes towards development and debugging, hogging HPC resources that could otherwise have been used for running productive computations, he said.
“We need to bridge the gap between the creation of large-scale parallel codes on conventional workstations and the deployment of software on multi-million-dollar supercomputers,” he added.
Simulation station
Nodeland’s solution involves getting researchers to offload their software engineering processes to the cloud. His company, Archanan, has built a platform consisting of tools such as debuggers, profilers and memory map analytics, which allows researchers to develop, test and deploy HPC applications on cloud-based infrastructure that emulates a supercomputer.
With this platform, researchers can have greater confidence that whatever software they code on the cloud will function as intended when ported over to an actual supercomputer, since the development (cloud) and deployment (supercomputer) conditions are now largely similar.
Nodeland told Supercomputing Asia that early adopters of Archanan’s platform are already seeing improvements in development life cycles for supercomputing software pertaining to materials science, bioengineering, physics, economics, data mining and artificial intelligence, among others.
“By removing the burden of dealing with system architecture, Archanan allows HPC users to focus on their research, leading to better results in less time,” he concluded.
This article was first published in the print version of Supercomputing Asia, January 2019.
———
Copyright: Asian Scientist Magazine; Photo: Shutterstock.
Disclaimer: This article does not necessarily reflect the views of AsianScientist or its staff.