Wednesday, January 29, 2014

Getting Size and File Count of a 25 Million Object S3 Bucket

Amazon S3 is a highly durable storage service offered by AWS. With eleven 9s (99.999999999%) durability, high bandwidth to EC2 instances and low cost, it is a popular location for storing input and output files for Grid Engine jobs. (As a side note, we were at Werner Vogels's AWS Summit 2013 NYC keynote where he disclosed that S3 stores 2 trillion objects and handles 1.1 million requests/second.)

[Image: AWS Summit 2013 keynote - Werner Vogels announces over 2T objects stored in S3]

On the other hand, S3 presents its own set of challenges. For example, unlike a POSIX filesystem, there are no directories in S3 - i.e. S3 buckets have a flat namespace, and the keys of files (AWS refers to them as "objects") can contain delimiters that are used to create a pseudo directory structure.

Further, there is no API that returns the size of an S3 bucket or the total number of objects. The only way to find the bucket size is to iteratively perform LIST API calls, each of which returns information on up to 1,000 objects. The boto library offers a higher-level function that abstracts the iteration for us, so all it takes is a few lines of Python:
from boto.s3.connection import S3Connection

s3bucket = S3Connection().get_bucket('<name of bucket>')
size = 0

# list() transparently pages through the bucket, one LIST call
# per 1,000 objects
for key in s3bucket.list():
    size += key.size

print "%.3f GB" % (size * 1.0 / 1024 / 1024 / 1024)

However, when the above code is run against an S3 bucket with 25 million objects, it takes 2 hours to finish. A closer look at the boto network traffic confirms that the high-level list() function is doing all the heavy lifting of calling the lower-level S3 LIST API - over 25,000 LIST operations for the bucket. Such a chatty protocol explains the 2-hour run time.
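A quick back-of-envelope calculation, using only the numbers above, shows why the sequential run is so slow:

```python
# Back-of-envelope: sequential LIST round trips and the implied latency.
objects = 25 * 1000 * 1000      # 25 million objects in the bucket
per_call = 1000                 # each LIST call returns at most 1,000 keys
calls = objects // per_call     # 25,000 sequential round trips

run_seconds = 2 * 60 * 60               # the observed ~2-hour run
latency = float(run_seconds) / calls    # ~0.29 s per LIST round trip
```

At roughly 0.3 seconds per round trip, the calls themselves dominate - the work is latency-bound, which is exactly the kind of workload that parallelism can help with.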

With a quick Google search we found that most people work around it by either:
  • storing the size of each object in a database
  • extracting the size information from the AWS billing console
The database approach looks like:

mysql> SELECT SUM(size) FROM s3objects;
+------------+
| SUM(size)  |
+------------+
| 8823199678 |
+------------+
1 row in set (20 min 19.83 sec)
Neither workaround is ideal. While a DB query can be much quicker, we are usually not the creator of the S3 bucket (recall that the bucket, which is owned by another AWS account, stores job input files), so no such database exists for us. Also, in many cases we need other information, such as the total number of files in the S3 bucket, so the total size shown in the billing console doesn't give us the complete picture. And lastly, we need the information about the bucket now - we can't wait until the next day to start the Grid Engine cluster and run jobs.

Running S3 LIST Calls in Parallel
The S3 LIST API takes a prefix parameter, which "limits the response to keys that begin with the specified prefix". Thus we can run concurrent LIST calls - one call handling the first half of the key space, another handling the second half - and then it is just a matter of reducing and merging the numbers.

With 2 workers (using a Python multiprocessing worker pool) running LIST in parallel, we reduced the run time to 59 minutes, and further to 34 minutes with 4 workers. At that point we had maxed out the 2-core c1.medium instance - both processors were 100% busy - so we migrated to a c1.xlarge instance with 8 cores. With 4 times the CPU cores and higher network bandwidth, we started with 16 workers, which finished in 11 minutes. Upping the worker count to 24 brought it down to 9 minutes and 30 seconds!
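The parallel approach can be sketched as follows. This is a minimal version under some assumptions: boto 2 (as in the sequential code above), single-character prefixes to partition the key space, and helper names (make_prefixes, list_prefix, bucket_totals) that are ours, not boto's; the worker count and prefix scheme are tuning knobs:

```python
from multiprocessing import Pool

PREFIX_CHARS = '0123456789abcdefghijklmnopqrstuvwxyz'

def make_prefixes(chars=PREFIX_CHARS):
    # One task per leading character; LIST with prefix=p returns only
    # keys beginning with p, so the tasks partition the bucket.
    return list(chars)

def list_prefix(args):
    # Worker: total size and object count under one prefix.
    # (boto 2 call as in the sequential version; needs AWS credentials.)
    bucket_name, prefix = args
    from boto.s3.connection import S3Connection
    bucket = S3Connection().get_bucket(bucket_name)
    size = count = 0
    for key in bucket.list(prefix=prefix):
        size += key.size
        count += 1
    return size, count

def merge(results):
    # Reduce the per-prefix (size, count) pairs into bucket totals.
    return sum(s for s, _ in results), sum(c for _, c in results)

def bucket_totals(bucket_name, workers=24):
    pool = Pool(workers)
    try:
        results = pool.map(list_prefix,
                           [(bucket_name, p) for p in make_prefixes()])
    finally:
        pool.close()
        pool.join()
    return merge(results)
```

Because each LIST stream is independent, the map step needs no coordination between workers, and the reduce step is a trivial pair of sums.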

So what if we use even more workers? While our initial goal was to finish the pre-processing of the input bucket in under 15 minutes, we believed we could extract more parallelism from S3: behind the scenes, S3 is not one server but many, designed to scale with a large number of concurrent operations.

Taking Advantage of S3 Key Hashing
So as a final test, we picked a 32-vCPU instance and ran 64 workers. It took only 3 minutes and 37 seconds to finish. Then we noticed that the way we distribute work to the workers doesn't take advantage of S3 key hashing!

As the 25-million-object bucket has keys that start with UUID-like names, the first character of a key can be 0-9 or a-z (36 characters in total). So the two-character key prefixes were generated like:
00, 01, ..., 09, 0a, ..., 0z
10, 11, ..., 19, 1a, ..., 1z

Since adjacent prefixes share the same leading character, workers running in parallel at any point in time had a higher chance of hitting the same S3 server. We performed a loop interchange, so the prefixes become:
00, 10, ..., 90, a0, ..., z0
01, 11, ..., 91, a1, ..., z1
Now it takes 2 minutes and 55 seconds.
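The loop interchange above can be illustrated in a few lines - the same 36×36 prefix set, emitted in column-major instead of row-major order so that consecutive prefixes differ in their first character:

```python
CHARS = '0123456789abcdefghijklmnopqrstuvwxyz'

# Row-major (original): 00, 01, ..., 0z, 10, ... - adjacent prefixes
# share the first character, so concurrent workers pile onto the same
# region of the key space.
row_major = [a + b for a in CHARS for b in CHARS]

# Column-major (after loop interchange): 00, 10, ..., z0, 01, ... -
# adjacent prefixes differ in the first character, spreading concurrent
# workers across S3's key-hash partitions.
col_major = [a + b for b in CHARS for a in CHARS]
```

Both orderings cover exactly the same 1,296 prefixes; only the order in which workers pick them up changes.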