As Software Evolution Engineers at Wizeline, we’re responsible for managing and handling production systems. Here, we take care of all the aspects of the system, such as the code, data platforms, code pipeline, databases, services, etc. We have many indicators that allow us to evaluate how well we are performing our work or where we need to focus our attention, such as the number of bugs, incidents, lines of code changed, and user requests.
There’s one indicator of particular interest to us and our clients in cloud-native environments: the cost per cloud service. Keep reading for the story of how we helped one customer reduce the cost of S3 storage by 85%.
The What: Identifying Something’s Not Right
While working for a client with a cloud-native data platform, we identified a problem with their S3 storage account. We reviewed the monthly bill for the last three months and saw a constant increase in the price, around $300 per month. If we do the math, the cost per GB per month of S3 Standard storage is about $0.023, depending on the region, which means we were producing roughly 13TB of new data per month.
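The back-of-the-envelope math looks like this (the $300 figure comes from our bill; the $0.023/GB-month rate is the approximate S3 Standard price in our region):

```python
# Rough estimate of how much new data a monthly cost increase implies,
# assuming the S3 Standard rate of ~$0.023 per GB-month.
STANDARD_RATE_USD_PER_GB_MONTH = 0.023

def implied_new_data_tb(monthly_increase_usd: float) -> float:
    """Translate a monthly cost increase into TB of new data per month."""
    new_gb = monthly_increase_usd / STANDARD_RATE_USD_PER_GB_MONTH
    return new_gb / 1024  # GB -> TB

print(round(implied_new_data_tb(300), 1))  # -> 12.7, i.e. roughly 13TB/month
```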
This indicator appeared in a data project. 13TB of data seems normal for a data project, and for big data projects, it may even seem insignificant. The problem became apparent when we investigated the amount of data we submitted to the data pipeline per month, which was 1.7TB. So, how were the other 11.3TB of data being generated? For a data project ingesting 1.7TB of data and generating 11.3TB, either the system duplicates a lot of data, or something else is going on.
The Where: Charting the Data Path
To understand the situation, we first identified where the data was being generated. AWS offers two metrics by default when you access a bucket: Total Bucket Size and Total Number of Objects. We based our search on the first metric: Total Bucket Size. We navigated through each of the buckets in the account and discovered one in particular that accounted for 90% of the total size charged monthly. Due to its size, we considered this bucket to be the one where the problem was occurring.
Next, we identified the path where most of the objects were stored. The first approach, quick and easy, was to use the AWS CLI. You can get the total size of a path with this command:
aws s3 ls s3://mybucket --recursive --human-readable --summarize
The problem was that the sum of the sizes of the directories did not match the total size of the bucket. It was not a small difference either: the total size reported by the command was only around 10% of the total size of the bucket. How? What was the problem? It was Versioning.
In an S3 bucket, you can enable a feature called Versioning. With this feature, previous versions of the objects are saved in the bucket. For example, if you create an object with the name “A.txt” and later overwrite it with a new version, as soon as you save the latest version of A.txt, AWS will keep a copy of the previous version. The copies of the earlier versions are kept in the same location but hidden, and they are only visible in the console when you click on “Show versions.” AWS uses the object’s metadata to differentiate the current version of the object from the previous versions. This metadata includes a field named “is_latest,” which is true only for the object’s current version; the fields “version_id” and “last_modified” help differentiate one version of an object from another.
Returning to the original problem, the reason the size displayed by the AWS CLI output and the total size of the bucket did not match was that the total size shown by the AWS CLI command considered only the current version of each object, while the total size of the bucket reflected the size of all objects, including the previous versions.
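A simplified illustration of the mismatch, using hypothetical version metadata shaped like the fields S3 exposes per version (the keys, sizes, and version IDs below are made up):

```python
# Hypothetical version metadata for one object in a versioned bucket.
versions = [
    {"key": "A.txt", "version_id": "v3", "is_latest": True,  "size": 100},
    {"key": "A.txt", "version_id": "v2", "is_latest": False, "size": 400},
    {"key": "A.txt", "version_id": "v1", "is_latest": False, "size": 500},
]

# What `aws s3 ls --summarize` reports: current versions only.
cli_total = sum(v["size"] for v in versions if v["is_latest"])

# What the Total Bucket Size metric (and the bill) reflects: every version.
bucket_total = sum(v["size"] for v in versions)

print(cli_total, bucket_total)  # -> 100 1000
```

In this toy example, the CLI sees 100 bytes while the bucket (and the bill) carries 1000, which is exactly the 10%-vs-100% gap we observed.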
To deal with the versioned objects, we used an S3 feature called S3 Inventory. This feature creates a dump of the metadata for all the objects in the bucket at a frequency we set, and the best part is that we can query the dump with AWS Athena. Success!
Using this feature, we can use a query engine to work with the objects’ metadata. In the metadata, we can find fields like “is_latest,” “version_id,” “size,” “last_modified_date,” or “storage_class,” and with the query engine, we can filter with WHERE clauses, aggregate (sum, min, max, count), group, and order. You can already imagine how easy it is to identify large objects or large object paths with this feature.
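For example, a query along these lines surfaces the prefixes holding the most non-current data (the table name `s3_inventory` is a placeholder; column names follow the S3 Inventory fields as they appear in Athena, so adjust them to your inventory configuration):

```sql
-- Sum the size of previous (non-current) versions per top-level prefix.
SELECT regexp_extract(key, '^[^/]+/') AS prefix,
       SUM(size) / 1024.0 / 1024 / 1024 AS noncurrent_gb,
       COUNT(*) AS noncurrent_objects
FROM s3_inventory              -- placeholder table name for the inventory dump
WHERE is_latest = false
GROUP BY regexp_extract(key, '^[^/]+/')
ORDER BY noncurrent_gb DESC
LIMIT 20;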
With the S3 inventory, we could identify the path where most of the objects were created. Now that we knew what the problem was and where it originated, we needed to also understand how and why.
The How: Too Many Versions Saved
In the path where most of the objects were created, we observed a pattern across all subdirectories. In each subdirectory, the current versions of the objects had been deleted, but the previous versions remained there.
To understand how this kind of problem arises, let’s look at another behavior of versioned buckets. In a versioned bucket, when you delete the current object version, the previous version of the object remains there. So, if you create the object B.txt and then decide you no longer need it and delete it, a copy of that B.txt object will be kept as a previous version, and a “delete marker” will be created in place of the original object, but without content. This is to let you know an original object existed there and was deleted.
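A toy in-memory model of this behavior (our simplification, not the real S3 API) makes it clear why storage keeps growing even as objects are "deleted":

```python
# Toy model of a versioned bucket: deleting an object never frees space;
# it only hides the object behind a delete marker.
# (Our simplification, not the real S3 API.)
bucket = {}  # key -> list of versions, newest last

def put(key, size):
    bucket.setdefault(key, []).append({"size": size, "delete_marker": False})

def delete(key):
    # The previous version stays; a zero-byte delete marker goes on top.
    bucket.setdefault(key, []).append({"size": 0, "delete_marker": True})

def stored_bytes():
    return sum(v["size"] for versions in bucket.values() for v in versions)

put("B.txt", 500)
delete("B.txt")
print(stored_bytes())  # -> 500; the "deleted" object still counts
```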
So, that’s what was happening in this directory. We had thousands of delete markers, each with a copy of the original object that was deleted. This let us conclude that temporary objects were being created and deleted in a bucket with versioning enabled.
The Why: Creating New Data Instead of Deleting Temporary Objects
After analyzing the code for the jobs pointing to the S3 location, we identified the culprit: a Spark job. The Spark job was created to perform set operations over multiple data sources. For each operation on a different source, the job created a temporary object containing the result of that operation. The job then used all the objects containing the results of the set operations to generate the final data object, and the temporary objects were deleted.
The problem was that, because we were using a versioned bucket, instead of the temporary objects actually being removed, a previous version and a delete marker were created for each one. Bingo!
The Solution: Fixing the Code & Deleting Hundreds of Thousands of Small Objects
Using the objects’ metadata in AWS Athena, we got a list of all the object versions created by the Spark job. The list comprised hundreds of thousands of small objects, but together they were using 90% of our S3 storage. We deleted them with a script that reads this list and runs a delete command from the AWS CLI for each object in the list.
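A sketch of that cleanup script (the CSV layout, bucket name, and keys below are our assumptions for illustration): it reads key/version_id pairs exported from Athena and emits one `aws s3api delete-object` command per version. Passing `--version-id` removes that specific version permanently instead of adding a delete marker.

```python
import csv
import io

# Hypothetical Athena export: one row per object version to delete.
athena_export = io.StringIO(
    "key,version_id\n"
    "tmp/part-0001,3sL4kqtJlcpXroDTDm\n"
    "tmp/part-0002,QUpfdndhfd8438MNFD\n"
)

def delete_commands(rows, bucket="my-bucket"):
    """Build one AWS CLI delete command per (key, version_id) pair."""
    return [
        f"aws s3api delete-object --bucket {bucket} "
        f"--key {row['key']} --version-id {row['version_id']}"
        for row in csv.DictReader(rows)
    ]

for cmd in delete_commands(athena_export):
    print(cmd)
```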
Finally, we analyzed the possibility of moving the S3 location for this Spark job to a bucket without versioning enabled. However, due to upstream/downstream dependencies, it was not feasible. We ended up creating a Lambda function triggered by the arrival of the dumps containing the objects’ metadata. As soon as a dump arrives, the Lambda loads it into AWS Athena, queries it to determine the temporary objects that were created, and deletes them.
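One core piece of such a Lambda is turning Athena's query results back into key/version pairs. Athena returns rows of `VarCharValue` cells with a header row first; a minimal parser for that shape (query submission and the actual deletes omitted, column names assumed) might look like:

```python
def parse_athena_results(result_set):
    """Turn an Athena ResultSet (header row + data rows of VarCharValue
    cells) into (key, version_id) tuples ready for deletion."""
    rows = result_set["ResultSet"]["Rows"]
    header = [cell["VarCharValue"] for cell in rows[0]["Data"]]
    key_idx = header.index("key")
    ver_idx = header.index("version_id")
    return [
        (r["Data"][key_idx]["VarCharValue"], r["Data"][ver_idx]["VarCharValue"])
        for r in rows[1:]
    ]

# Hypothetical response shaped like Athena's get_query_results output.
sample = {"ResultSet": {"Rows": [
    {"Data": [{"VarCharValue": "key"}, {"VarCharValue": "version_id"}]},
    {"Data": [{"VarCharValue": "tmp/part-0001"}, {"VarCharValue": "v1"}]},
]}}
print(parse_athena_results(sample))  # -> [('tmp/part-0001', 'v1')]
```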
Having this solution in place helped us delete the objects we don’t need, keep the size of the bucket under control, and save on the monthly S3 storage bill. For our client, this reduced the cost of object storage by more than 85%.
Issues like this one can arise in many cloud-native and other production systems. The Product Evolution Discipline is ready to detect them, handle them, and provide you with the best practices to mitigate them.
To learn more about our Product Evolution practice at Wizeline, visit Wizeline Product Evolution.