The grueling job of managing petabytes of scientific data
and how we can lessen the archiving pain with some additional tools
“Oh, wow, the IT department charges us $11,000 a month for data storage. Why don’t we put some data on cheaper storage? Mark, why don’t you look into that?” the principal investigator asks the postdoc who just joined the team.
There are thousands of similar discussions in research labs around the world, sometimes because there is pricing sticker shock and at other times because the IT department is asking labs to curb storage consumption through quotas. In some cases, there is a quick solution (“How hard can that be?”), but in many other cases, this leads to a real quagmire. Let’s follow Mark’s journey through this quagmire and propose a solution with the help of tools such as Starfish or Froster.
After asking around, Mark not only finds a few files that are worth archiving, but he also learns that Janice, the senior staff scientist, knows much more about the data. Eventually, Janice spends a lot of time helping Mark find data that could safely be archived. “Oh, now we already have two people working on a seemingly simple archiving project,” thinks the investigator, but does not say anything. Janice and Mark pull in Eric, their scientific programmer, who knows how to copy the data to cloud storage. Eric also calls it object storage, which confuses Mark. Are “object” and “cloud” the same? They sound very different. Eric suggests multiple tools for the job, so Mark is busy reading up on which one is best. After checking back with Janice, they decide that Rclone is the best choice because it supports many different cloud storage systems, and they can also push data to an on-premises archive if needed. But as Mark looks more closely at Rclone, it turns out to have a bazillion options. Which ones are needed? Are there any that put our data at risk? Mark is getting concerned; he is told that the investigator, the most pleasant person one could imagine, became very angry years ago when someone’s mistake caused a data loss that was very expensive to recover from. Anyway, after a week of careful testing, Mark can report that he is able to copy data to the cloud.

“Did you take checksums?” Eric asks at the next team meeting. “No, should I?” “Taking checksums takes longer,” Janice throws in, “but it also proves that the data in our system is identical to the data in the cloud. Perhaps we only do this for the important data?” “But how do I distinguish important from unimportant data?” asks Mark. “Perhaps I put together an Excel sheet with all the folders we have, and we discuss which ones are important at our next team meeting?” “That would be great,” says Janice, but she has that sinking feeling that the next team meeting will not finish on time.

Meanwhile, Mark figures out how to use checksums and is able to archive a few more things. Eric shows him how to use the Unix ‘find’ command to look for big files that are a bit older. It is a handy tool; the only problem is that it takes about 11 days to search through a fraction of the team’s folders, as they have more than 1 billion files. “Wow, once we finish scanning all our files, I will probably have moved on to my next job,” thinks Mark. After searching a bit on GitHub, Eric finds a tool called ‘Pwalk’, which can crawl the file system using many parallel workers to speed things up. It creates a CSV file that contains all the file metadata, such as file name, change date, and file owner and group. Using Pwalk still takes a few days, and it spits out a CSV file more than 100 GB in size. Now Mark needs to analyze this CSV file and extract the names of folders that contain large files that have not been used in a long time. After removing some Spanish-language characters that caused trouble, Mark uses DuckDB to extract the information and create his spreadsheet of large folders for the next team meeting (a sketch of such a query follows below).

Even though they are able to make some progress at the team meeting, there are discussions around each dataset. In some cases, nobody knows what the data is about because the person who generated it has left the lab, and in other cases, someone mentions that project X or collaboration Y involving data in folder Z requires some follow-up or a new analysis. Mark starts copying data to the cloud, but the process is sometimes interrupted, and he needs to restart it.
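For readers who want to try Mark’s last step at home, here is a rough sketch in Python. The file names are made up, and the column list assumes Pwalk’s default headerless CSV layout, so check both against your own output before trusting the numbers:

```python
# Rough sketch: boil a huge Pwalk CSV down to the folders that hold large,
# long-untouched files. Requires `pip install duckdb`; the input and output
# file names and the assumed Pwalk column order are placeholders.
import duckdb

duckdb.sql("""
COPY (
    SELECT regexp_replace(filename, '/[^/]*$', '') AS folder,
           round(sum(st_size) / 1e12, 2)           AS terabytes,
           count(*)                                AS big_files,
           to_timestamp(max(st_atime))             AS last_access
    FROM read_csv('pwalk-output.csv', header=false,
        columns={'inode':'BIGINT','parent_inode':'BIGINT','depth':'INTEGER',
                 'filename':'VARCHAR','ext':'VARCHAR','uid':'BIGINT','gid':'BIGINT',
                 'st_size':'BIGINT','st_dev':'BIGINT','st_blocks':'BIGINT',
                 'st_nlink':'BIGINT','st_mode':'BIGINT','st_atime':'BIGINT',
                 'st_mtime':'BIGINT','st_ctime':'BIGINT','fcount':'BIGINT','dirsum':'BIGINT'})
    WHERE st_size >= 100 * 1024 * 1024                      -- only files over ~100 MB
      AND to_timestamp(st_atime) < now() - INTERVAL 2 YEAR  -- untouched for 2+ years
    GROUP BY folder
    ORDER BY terabytes DESC
) TO 'large_old_folders.csv' (HEADER, DELIMITER ',');
""")
```

The resulting large_old_folders.csv is small enough to paste into a spreadsheet for the team meeting.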
“Why don’t you use the HPC cluster for this?” suggests Eric. “These are long-running jobs, and that is what a cluster is for.” “Good idea.” Mark is happy to make some progress but also hopes he can go back to his main job one day.
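A restartable copy job like the one Mark ends up running on the cluster could look roughly like the Python wrapper below. The remote name, bucket, and Slurm options are placeholders for your own setup; the useful property is that ‘rclone copy’ skips files that already match on the destination, so re-running the same job after an interruption is safe:

```python
# Sketch of a restartable archive copy, meant to be submitted as a batch job,
# e.g.: sbatch --time=7-0 --wrap "python3 archive_copy.py /data/lab/big_project"
# The "aws:" remote and the bucket/prefix below are placeholders.
import subprocess
import sys

def archive(folder: str, remote: str = "aws:my-lab-archive/projects") -> int:
    """Copy a folder to object storage; safe to re-run after an interruption."""
    cmd = [
        "rclone", "copy", folder, f"{remote}/{folder.strip('/')}",
        "--checksum",            # compare checksums instead of just size and modtime
        "--transfers", "8",      # number of parallel uploads
        "--retries", "5",        # retry transient failures before giving up
        "--log-level", "INFO",
        "--log-file", "rclone-archive.log",
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    sys.exit(archive(sys.argv[1]))
```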
A couple of weeks later, Janice shows Mark the cloud invoice and, to both their surprise, the storage costs for AWS S3 are higher than what their IT department charges per terabyte. “Ouch, did you not use Glacier Deep Archive, the cheapest one?” “No, I asked the folks in IT about it, and they said it is much more expensive if you want to get it back later. They offered a thing called ‘Intelligent-Tiering’ that moves the data automatically to the cheapest S3 storage class in the background.” Janice is now more confused, as “cheapest” does not really translate into savings when it still costs more than what they pay IT, and she digs into the details. She is aware that she will probably get in trouble with the investigator because she was supposed to contribute to two papers by the end of the week, but she is also curious because her colleagues and the IT folks seem to contradict each other, and the investigator really wanted to lower the cost of storing data.

She finds out that pulling data from Glacier Deep Archive costs up to $30 per terabyte, which seems like a lot. But then there is a “bulk restore” option that costs only a little more than $2 per terabyte, although you may have to wait up to 48 hours. “That’s not a problem if we archive files that have not been touched in years.” Then there is a so-called “egress” cost, a fee of $20–$90 per terabyte for all data that leaves AWS and is copied back to the university’s storage system. The size of the fee depends on how the university’s data center is connected to the AWS cloud.

Janice asks Keith from a different lab, as she has heard that they are using Glacier Deep Archive: “Why does IT tell me that the restore costs for Glacier are high when that is really the egress cost, which affects all systems in the AWS cloud and not just Glacier?” “I don’t know,” answers Keith, “but it is not a big problem anymore, because we only needed to restore once, and even that was free because there is an egress waiver.” “What is that?” Janice thinks about her two papers and which one might be more important if she cannot get both done. “It’s a discount for universities. As our university spends about $50k a month in the cloud, we get 15%, or $7,500, of free egress. It does not matter who exactly spends the money in the cloud; we all get to share that $7,500 of free egress.” “That does not sound like a lot.” Janice is skeptical. “Oh no, that is plenty,” answers Keith. “It allows any lab to download 100TB each month for free, and we only had to use it once, when we downloaded 50TB. This option is totally underutilized, and even if you pay for egress, you are still saving $5,000 every month if you put half your data in Glacier.”

Janice is intrigued but still concerned. “Don’t worry,” says Keith, “every lab I speak with has this discussion. I call it ‘egress fear.’ It is a thing until it is not. The only thing that I would really avoid with Glacier is tiny files. They charge you for at least 40KB per file. So, if you have 1 million tiny 1-byte files, they will charge you for 40GB instead of 1MB, and if you have a billion tiny files, that can get expensive. Better to tar up those many small files.”

Janice needs to think about this. What if the investigator walks in one day and wants to analyze several hundred terabytes of old data that would then have to be restored, and much of it untarred? In the hallway, Janice runs into Eric and shares her concern. He laughs: “What is the chance that an asteroid hits Earth? We could also skip restoring the data to our machines and just run the analysis in the cloud; then there is no egress fee. Also, Amazon just released a new policy that allows you to get your data back without an egress charge if you close your account; you just need to contact them for that.” “OK, enough, Eric, my head hurts, and I need to get two papers done. This is what I understood: it does not really matter whether the files are in regular S3 or in Glacier; both are affected by the main cost risk, which is egress.”
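For the curious, a “bulk” restore from Glacier Deep Archive is a single API call; here is a minimal boto3 sketch with a placeholder bucket and key. Note that the restore itself only thaws the object inside AWS; downloading it back to campus is what triggers the egress fee discussed above:

```python
# Minimal sketch: thaw one object from S3 Glacier Deep Archive via a bulk restore.
# Bucket and key are placeholders.
import boto3

s3 = boto3.client("s3")
s3.restore_object(
    Bucket="my-lab-archive",
    Key="projects/big_project/sample001.tar",
    RestoreRequest={
        "Days": 30,                                # keep the thawed copy readable for 30 days
        "GlacierJobParameters": {"Tier": "Bulk"},  # cheapest retrieval tier, also the slowest
    },
)

# Later, the 'Restore' field of head_object() shows whether the thaw has finished.
print(s3.head_object(Bucket="my-lab-archive",
                     Key="projects/big_project/sample001.tar").get("Restore"))
```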
Two weeks later, at the team meeting, Janice feels more confident and reports on all the work that Mark, Eric, and she have done, and recommends giving “Glacier Deep Archive” a try. “Nice!” The investigator seems pleased, “but this seems like a lot of effort. Are we trying to boil the ocean here? This should not be that hard, and it seems three people have been working on it?” “I don’t think that’s fair!” Mia, the informatician, has not been much involved so far. She wants to support Janice but also feels that something is missing. “This is actually not that simple if you have to research a lot of different questions. But there is even more; have you heard of the FAIR principles? Labs that follow these principles develop a plan for how the data should be managed and shared, and they make sure that data is easily (F)indable and (A)ccessible to users, both human and computer. These principles can also help ensure that data is (I)nteroperable with other data and (R)eusable for future projects. For example, we can work on a dictionary of metadata that we use to tag the data.”
“Thank you, Mia, that sounds really cool.” The investigator looks both excited and concerned. “I’d love to spend more time on this, but I wonder how we can balance our priorities. I want to prioritize the goal of cost savings and, to say it with the words of Randy Pausch, a lecturer I really admire, ‘If you need to eat three frogs, you should not start with the smallest one.’ And our biggest frogs are clearly our largest, terabyte-sized datasets. Can we do that?” “That won’t be enough, I’m afraid,” Mia insists. “Look, I was actually searching for data that Mark had already archived, but the folder where I expected the files to be was gone, and I could not figure out where it went.” “That is indeed a problem.” Eric scratches his head. “Perhaps we need to leave the folder there and just remove the data files and replace them with a text file that explains where the data went, so that every team member can find it easily.” “Good idea,” says Mia, “but we also need additional metadata that describes how the data was generated and other details. We already have to fill that stuff out when we submit results data to the repositories of our funding agencies.” “That’s always a major effort,” Janice notes. “It would be nice if we could do some of that work proactively.” The investigator wants to move on but also wants to recognize that this is a good discussion. “Why don’t we start small? If we store the grant ID in that text file, we can link the data with a lot of extra information, such as funding, collaborators, and specific aims. Mark, you can find the grant ID online in NIH RePORTER; just search for me and enter a few other keywords.” “Nice, I will do that.”
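A note like the one Eric and the investigator describe can be dropped into each archived folder with a few lines of Python. The file name, fields, and grant number below are purely illustrative:

```python
# Sketch: leave a breadcrumb file in an archived folder so lab members can see
# where the data went and which grant it belongs to. All values are examples.
from datetime import date
from pathlib import Path

def leave_breadcrumb(folder: str, s3_url: str, grant_id: str, contact: str) -> None:
    note = Path(folder) / "WHERE-DID-THE-DATA-GO.txt"
    note.write_text(
        f"This folder was archived on {date.today()}.\n"
        f"Archive location : {s3_url}\n"
        f"Grant ID         : {grant_id}  (look it up in NIH RePORTER)\n"
        f"Contact          : {contact}\n"
        "Data in Glacier Deep Archive must be restored before it can be downloaded.\n"
    )

leave_breadcrumb(
    "/data/lab/big_project",
    "s3://my-lab-archive/projects/big_project",
    "R01XX012345",        # placeholder grant number
    "mark@example.edu",
)
```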
Amazing, you are still reading this? Isn’t this entire topic a bit annoying? Indeed, but it also shows the reality of how many discussions around data management start and then continue as the complexity increases along the way. Cheaper storage is always near the top of the list for scientists who are on a fixed budget, for example, a federal grant. But going too cheap and not having any plan or processes in place can be risky.
If you are in the IT department, for example, in a research computing team, I would encourage you to take a look at Starfish, a platform that can maintain an ongoing inventory of all your files and then take certain actions, such as backup or archiving, based on rules you define. It requires some dedicated hardware and some folks who manage it.
If you are in a research team, you can join forces with others to convince your IT folks to deploy Starfish, as it will likely be too large a project for your team alone. If you are instead looking for an ad-hoc solution that works for your small team, and you are managing your file permissions well enough, I would recommend you try Froster (https://github.com/dirkpetersen/froster). You can install Froster in team mode, which is ideal for a small team of data stewards that is jointly responsible for managing the data of a research lab. Froster addresses all of the features that Mark, Eric, Janice, and Keith discuss in the story above.
Each individual task or problem described above appears trivial compared to the large scientific problems our researchers are working on. But in aggregate, ongoing data management can be an extremely annoying and time-consuming problem. Froster helps you with good defaults that drive Slurm and Rclone, and it is also a simple enough tool that it can be easily extended. For example, once you have identified what a certain function does, you can ask ChatGPT to modify it for you. That is what I do. In partnership with developers from https://doit-now.tech, we will be adding more features to Froster over the next few months.
PS: I asked the fancy paid version of Google Gemini to do some spell checking, etc., on this article because the text is a tad too long to be slurped up by ChatGPT in one go:
“can you improve grammar and spelling but leave style as is. This is a magazine article with lots of dialog / quotes”
and this was Gemini’s response:
“I do not have enough information about that person to help with your request. I am a large language model, and I am able to communicate and generate human-like text in response to a wide range of prompts and questions, but my knowledge about this person is limited. Is there anything else I can do to help you with this request?”
Eeeeeeeh