Move It!

Teams raced to move two terabytes of data across five different countries in the inaugural ‘Move that Data!’ Data Mover Challenge.

AsianScientist (Jul. 19, 2019) – Since the days when goods were moved between countries on horse carts, the invention of container ships and cargo planes has vastly increased the volume of cargo that can cross geographical boundaries. Like physical cargo, Big Data must also be moved around the world quickly, so innovations for borderless, rapid data transfer are urgently needed.

To encourage data scientists and researchers to think up novel data transfer strategies, the National Supercomputing Centre (NSCC) Singapore organized the inaugural ‘Move that Data!’ Data Mover Challenge.

Seven teams participated and were each given one week to deploy software to transfer two terabytes of data between sites in five different countries: NSCC in Singapore, National Computational Infrastructure (NCI) in Australia, Korea Institute of Science and Technology Information (KISTI) in Korea, National Institute of Information and Communications Technology (NICT) in Japan, and StarLight in the US.

Supercomputing Asia caught up with one of the organizers, Associate Professor Francis Lee Bu Sung of Nanyang Technological University, Singapore, who is also vice-president of the Singapore Advanced Research and Education Network (SingAREN), to find out more about the challenge.

Why is high-speed data transfer important for research?

Assoc. Prof. Francis Lee: In this day and age of digital data and data-intensive computing, moving scientific data from one spot to another is important for global collaboration. If you cannot move data fast enough, scientists working together on a research project will not be able to make their discoveries in a timely manner.

In the past, data was moved around the globe by FedEx, which means there was a lag between data generation and data analysis. But things have changed; the volume of data has grown even bigger, and people want things faster online, even on the spot. I have seen collaborations among researchers where the minute they capture data, it is shared with other researchers elsewhere in the world.

We are tackling bigger problems with bigger data, so I think rapid data transfer is essential. There is no use waiting a day or two for data to be transferred; you need to get it out as fast as possible.

Why is transferring research data more difficult than transferring everyday internet data?

FL: The transfer of everyday internet data, like streaming data and such, usually involves what we call small files, ranging in size from a few hundred megabytes to gigabytes. On the other hand, research data, like genomic data, for instance, can be as large as one terabyte!

Even when data files are not that large, transferring large volumes of data at a time can be a challenge. For example, I might be helping Indonesia with the transfer of satellite data which may not be as large as one terabyte, but which consists of a lot of files, each of them a satellite image between 60 and 200 megabytes in size.

To enable the fast transfer of such research data, we need to tune the device parameters of data transfer nodes, such as the network card parameters, the input/output (I/O) scheduler, disk access and so on. Just by tuning the I/O scheduler, you can optimize the sequence in which data is read into the computer and sent out to the network. This allows you to more than double the throughput and achieve higher speeds of data transfer needed to move large amounts of data.
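As a rough illustration of what checking the I/O scheduler can involve, the sketch below assumes a Linux data transfer node; the interview does not say which operating system or tools the teams used. On Linux, each block device lists its schedulers in /sys/block/<device>/queue/scheduler, with the active one in square brackets. The function names and the sample string are hypothetical.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def parse_scheduler(text: str) -> Tuple[Optional[str], List[str]]:
    """Split a sysfs scheduler line into (active scheduler, all schedulers).

    The kernel marks the active scheduler with square brackets,
    for example "mq-deadline kyber [bfq] none".
    """
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    return active, available


def read_scheduler(device: str) -> Tuple[Optional[str], List[str]]:
    """Read the scheduler line for one block device (Linux only)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())


# Illustrative sample line, not taken from any machine in the challenge.
sample = "mq-deadline kyber [bfq] none"
active, available = parse_scheduler(sample)
print("active:", active, "available:", available)
```

On a real node, read_scheduler("sda") would report the current setting for the device sda, where the device name is again only an example. Which scheduler suits a given transfer workload has to be established by measurement.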

What are the motivations behind organizing this challenge?

FL: Singapore has a data transfer rate of 100 Gbps to Japan and the US, and very soon to Europe as well. But it is not just about the bandwidth; you also need software that can make use of that bandwidth. That’s why the Data Mover Challenge came about. We wanted to tackle the question: How can we get the best tools to move data efficiently from one point to another, so that it can be shared?

At SingAREN, we have been tuning our servers and doing some data transfer optimization work as well, but we felt that if we opened up and engaged with external stakeholders in the field, we could learn and adopt best practices that we may not have thought about. And true enough, we found people doing things in innovative ways.

I think the main advantage of throwing out a challenge like this is the diversity of participants we get to engage with. Academics and industry players alike came in to try and solve something that is relevant to everyone in the field. We had teams from Argonne National Laboratory and FermiLab in the US who took part alongside others from the University of Tokyo and the Japan Aerospace Exploration Agency.

From the private sector, companies like Fujitsu and Zettar joined in as well, so we really had a good range of expertise involved. More importantly, because everyone came under the same roof at the SupercomputingAsia 2019 conference, there were many opportunities to interact and exchange notes, and hopefully take the field of rapid data transfer to the next level.

What are some of the hurdles the teams had to overcome?

FL: Each team was given one week to set up their software in multiple nodes in Japan, Korea, the US, Singapore and Australia, then demonstrate the transfer of two terabytes of data disk-to-disk, not just memory-to-memory.

Eventually, our intention is to deploy good data transfer node software around the globe, which is why deployment across the five sites is essential.

There was a variety of large and small files to be transferred, thus allowing us to assess how the teams’ software handled different packets of data. We gave participants a week to set up, after which the data transfers were carried out while we monitored the process.

Thereafter, the teams had to present to the judges their data transfer rate and how they achieved that rate.

While the speed of data transfer was a main criterion for judging, it was not the only one. For example, we also wanted to know how each team’s software maintained the data transfer rate over international links when the delay differs from link to link. You need to demonstrate what the performance is when the delay is very large, when the delay is not so large, and so on. It takes a lot of effort to deploy software over five servers, make them run properly and tune them accordingly.
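One way to see why delay matters is the bandwidth-delay product: the amount of data that must be in flight to keep a link full. The sketch below uses the 100 Gbps figure from the interview. The round-trip times and the 64 MiB window are hypothetical values chosen for illustration, not measurements from the challenge. It also shows the ceiling on a single stream whose window is fixed, which is the window size divided by the round-trip time.

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to fill the link (bandwidth x delay)."""
    return bandwidth_bps * rtt_s / 8


def window_limited_bps(window_bytes: float, rtt_s: float) -> float:
    """Upper bound on throughput for one stream with a fixed window."""
    return window_bytes * 8 / rtt_s


LINK_BPS = 100e9  # 100 Gbps, as quoted in the interview

# Hypothetical round-trip times in seconds (assumptions, not measured).
HYPOTHETICAL_RTTS = {
    "short path": 0.005,
    "regional path": 0.070,
    "intercontinental path": 0.200,
}

WINDOW_BYTES = 64 * 1024 * 1024  # assumed 64 MiB window

for name, rtt in HYPOTHETICAL_RTTS.items():
    needed_gb = bdp_bytes(LINK_BPS, rtt) / 1e9
    ceiling_gbps = min(LINK_BPS, window_limited_bps(WINDOW_BYTES, rtt)) / 1e9
    print(f"{name}: RTT {rtt * 1000:.0f} ms, "
          f"{needed_gb:.2f} GB in flight to fill the link, "
          f"fixed-window ceiling {ceiling_gbps:.2f} Gbps")
```

Under these assumptions, the same software setting that fills the link on a short path falls far below line rate on a long one, which is why performance had to be demonstrated at several delays.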

What skills did the teams need to have in order to succeed?

FL: I would say that a variety of skills are required, from computer science to network engineering. Some participants had software development backgrounds while others were good at tuning the networks. Of course, teamwork was essential as well.

Overall, I think all the teams performed very well. Initially, we said we only wanted to announce one overall winner. But as the competition went along, we felt that there was another deserving team, so we decided to recognize that team—the StarLight/International Center for Advanced Internet Research team—with the ‘Most Innovative’ award.

At the end of the day, we hope that competitions such as this will facilitate the transfer of ideas, knowledge and skills among people from diverse disciplines to enable faster transfer of research data around the world.

This article was first published in the print version of Supercomputing Asia, July 2019.


Copyright: Asian Scientist Magazine.
Disclaimer: This article does not necessarily reflect the views of AsianScientist or its staff.

Lidao is a chemistry undergraduate at the University of Oxford, UK. An aspiring scientist, he is excited to learn about the latest discoveries and even more intrigued by the fascinating stories behind them. In his free time, he enjoys badminton, swimming and running around Singapore to uncover the best durian spots.
