Recently, we provisioned a 10,000-node Grid Engine cluster in Amazon EC2 to test the scalability of Grid Engine. As the official maintainer of open-source Grid Engine, we have an obligation to make sure that Grid Engine continues to scale in modern datacenters.
Grid Engine Scalability - From 1,000 to 10,000 Nodes
From time to time, we receive questions about Grid Engine scalability, which is not surprising given that modern HPC clusters and compute farms continue to grow in size. Back in 2005, Hess Corp. ran into issues when its Grid Engine cluster exceeded 1,000 nodes. We quickly fixed the limitation in the low-level Grid Engine communication library and contributed the code back to the open-source code base. In fact, our code continues to live in every fork of Sun Grid Engine (including Oracle Grid Engine and other commercial versions from smaller players) today.
So Grid Engine can handle thousands of nodes, but can it handle tens of thousands of nodes? And how would Grid Engine perform with over 10,000 nodes in a single cluster? Previously, besides simulating virtual hosts, we could only take a reactive approach: customers reported issues, we collected the error logs remotely, and then we tried to provide workarounds or fixes. In the Cloud age, shouldn't we take a proactive approach, now that hardware resources are so much more accessible?
Running Grid Engine in the Cloud
In 2008, my former coworker published a paper about benchmarking Amazon EC2 for HPC workloads (many people have quoted the paper, including me on the Beowulf Cluster mailing list back in 2009), so running Grid Engine in the Cloud is not unimaginable.
Since we don't want to maintain our own servers, using the Cloud for regression testing makes sense. In 2011, after joining the MIT StarCluster project, we started using StarCluster to help us test new releases, and indeed Grid Engine 2011.11 was the first release tested solely in the Cloud. Running scalability tests in the Cloud makes sense too, as we also don't want to maintain a 10,000-node cluster that sits idle most of the time!
Requesting 10,000 nodes in the Cloud
Most of us just request resources in the Cloud and never need to worry about resource contention. But EC2 has default soft limits, set by Amazon to catch runaway resource requests: each account can only have 20 on-demand and 100 spot instances per region. Since running 10,000 nodes (instances) exceeds the soft limit many times over, we asked Amazon to raise the limits on our account. Luckily, the process was painless! (Thinking about it, the limit actually makes sense for new EC2 customers - imagine the financial damage an infinite loop of instance requests could do, or, in the extreme case, the damage a hacker with a stolen credit card could do by launching a 10,000-node botnet!)
Launching a 10,000-node Grid Engine Cluster
With the limits increased, we first started a master node on a c1.xlarge (High-CPU Extra Large Instance, a machine with 8 virtual cores and 7GB of memory). We then added instances (nodes) in chunks - we kept each chunk below 1,000 instances, which let us change the instance type and the bid price with each request (see the sketch after this list). We mainly requested spot instances because:
- Spot instances are usually much cheaper than on-demand instances. (As I write this blog, a c1.xlarge spot instance costs only 10.6% of the on-demand price.)
- It is easier for Amazon to handle our 10,000-node Grid Engine
cluster, as Amazon can terminate some of our spot instances if there is a spike in demand for on-demand instances.
- We can also test the behavior of Grid Engine when some of the nodes go down. With traditional HPC clusters, we had to kill nodes manually to simulate hardware failure, but spot termination does this for us. We are in fact taking advantage of spot termination!
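For illustration, here is a minimal sketch of how such chunked spot requests could be issued with boto, the Python library we used for provisioning. The AMI ID, key pair, security group, bid prices, and chunk sizes below are placeholders rather than the exact values from our runs:

```python
# Sketch: request spot instances in chunks of fewer than 1,000, adjusting the
# instance type and bid price per chunk. All values below are placeholders.
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

chunks = [
    # (instance_type, bid price in USD, count) - hypothetical values
    ('c1.xlarge', 0.08, 500),
    ('c1.xlarge', 0.10, 800),
    ('m1.large',  0.05, 900),
]

for instance_type, price, count in chunks:
    requests = conn.request_spot_instances(
        price=str(price),
        image_id='ami-xxxxxxxx',         # placeholder StarCluster AMI
        count=count,
        instance_type=instance_type,
        key_name='mykey',                # placeholder key pair
        security_groups=['mycluster'],   # placeholder security group
    )
    print('submitted %d spot requests for %s at $%.2f' %
          (len(requests), instance_type, price))
```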
All went well until we had around 2,000 instances in a region, when we found that further spot requests were fulfilled but the instances were terminated almost instantly. A quick look at the error logs and the AWS Management Console showed that we had exceeded the EBS volume limit, which is 5,000 volumes or 20TB of total size. We were using the StarCluster EBS AMI (e.g. ami-899d49e0 in us-east-1), which has a 10GB root volume, so 2,000 of those instances running in a region would hit the 20TB limit (2,000 x 10GB = 20TB)!
Luckily, StarCluster also offers S3 (instance-store) AMIs, although they carry an older OS version and the older SGE 6.2u5. As all releases of Grid Engine from the Open Grid Scheduler project are wire-compatible with SGE 6.2u5, we quickly launched a few of those S3 AMIs as a test, and not surprisingly the new nodes joined the cluster without any issues. We then continued to add instance-store instances, soon achieving our goal of a 10,000-node cluster.
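To keep an eye on how many execution hosts had actually joined, one can simply count the hosts known to the qmaster. Below is a small sketch that parses the output of qhost -xml (a standard Grid Engine option); it is only an illustration, not the exact script we used:

```python
# Sketch: count the execution hosts currently known to the Grid Engine qmaster
# by parsing `qhost -xml`. Assumes the SGE environment is already sourced.
import subprocess
import xml.etree.ElementTree as ET

xml_out = subprocess.check_output(['qhost', '-xml'])
root = ET.fromstring(xml_out)

# qhost reports a pseudo-host named "global" that is not a real node.
# Note: some Grid Engine versions declare an XML namespace; adjust the
# element lookup below if findall() returns nothing on your version.
hosts = [h.get('name') for h in root.findall('host') if h.get('name') != 'global']
print('%d execution hosts reporting in' % len(hosts))
```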
A few more findings:
- We kept sending spot requests to the us-east-1 region until "capacity-not-able" was returned to us. We expected the bid price to go sky-high when an instance type ran out, but that did not happen.
- When a certain instance type is mostly used up, further spot requests for that type are fulfilled more and more slowly.
- At peak rate, we were able to provision over 2,000 nodes in less than 30 minutes. In total, we spent less than 6 hours constructing the cluster, debugging the issue caused by the EBS volume limit, running a small number of Grid Engine tests, and taking down the cluster.
- Instance boot time was independent of the instance type: EBS-backed c1.xlarge and t1.micro took roughly the same amount of time to boot.
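Related to the findings above, one simple way to watch for unfulfilled capacity is to poll the state of the outstanding spot requests. The following rough sketch uses boto for this; the polling interval and the idea of tallying request states are our own choices, not something prescribed by EC2:

```python
# Sketch: poll outstanding spot requests and tally their states, so we can
# notice when requests stop being fulfilled (e.g. when capacity runs out).
import time
from collections import Counter

import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

for _ in range(10):                      # poll a few times as an example
    requests = conn.get_all_spot_instance_requests()
    states = Counter(req.state for req in requests)   # e.g. open/active/closed
    print(dict(states))
    time.sleep(60)
```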
HPC in the Cloud
To put a 10,000-node cluster into perspective, we can take a look at some of the recent TOP500 entries running x64 processors:
- SuperMUC, #6 in TOP500, has 9,400 compute nodes
- Stampede, #7 in TOP500, will have 6,000+ nodes when completed
- Tianhe-1A, #8 in TOP500 (was #1 till June 2011), has 7,168 nodes
In fact, a quick look at the TOP500 list shows that over 90% of the entries have fewer than 10,000 nodes, so in terms of raw processing power, one can easily rent a supercomputer in the Cloud and get comparable compute power.
However, some HPC workloads are very sensitive to network (MPI) latency, and we believe dedicated HPC clusters still have an edge when running those workloads (after all, those clusters are designed to have a low-latency network). It is also worth mentioning that the Cluster Compute instance types with 10GbE networking can reduce some of that latency.
Future work
With 10,000 nodes, embarrassingly parallel workloads can complete in 1/10 of the time compared to running on a 1,000-node cluster. Also worth mentioning is that the working storage holding the input and output is needed for only 1/10 of the time as well (e.g. instead of 10 TB-months, only 1 TB-month is needed). So not only do you get the results faster, it actually costs less due to reduced storage costs!
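As a back-of-the-envelope illustration of that point (the storage size and price below are made-up round numbers, not figures from this experiment):

```python
# Sketch: the same workload on a 10x larger cluster finishes in 1/10 of the
# wall-clock time, so the working storage is rented for 1/10 as long.
STORAGE_TB = 10             # working storage needed while the workload runs
PRICE_PER_TB_MONTH = 100.0  # hypothetical storage price in USD

def storage_cost(nodes, months_on_1000_nodes=1.0):
    # Wall-clock time shrinks in proportion to cluster size for
    # embarrassingly parallel workloads.
    months = months_on_1000_nodes * 1000.0 / nodes
    return STORAGE_TB * months * PRICE_PER_TB_MONTH

print(storage_cost(1000))    # 10 TB for 1 month    -> 1000.0
print(storage_cost(10000))   # 10 TB for 0.1 month  ->  100.0
```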
With 10,000 nodes and a small number of submitted jobs, the multi-threaded Grid Engine qmaster process consistently uses 1.5 to 2 cores (at most 4 cores at peak) on the master node, and around 500MB of memory. With more tuning or a more powerful instance type, we believe Grid Engine can handle at least 20,000 nodes.
Note: we are already working on enhancements that
further reduce the CPU usage!
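For those curious how we watched the qmaster, here is a minimal sketch of sampling its CPU and memory usage with standard pgrep/ps; the sampling interval and output format are our own choices, not part of Grid Engine:

```python
# Sketch: periodically sample CPU and memory usage of the sge_qmaster process.
import subprocess
import time

def sample_qmaster():
    pid = subprocess.check_output(['pgrep', '-o', 'sge_qmaster']).split()[0].decode()
    out = subprocess.check_output(['ps', '-p', pid, '-o', '%cpu=', '-o', 'rss=']).decode()
    cpu, rss_kb = out.split()
    return float(cpu), int(rss_kb) // 1024   # %CPU (100 = one full core), MB of RSS

for _ in range(5):                           # take a few samples as an example
    cpu, mem_mb = sample_qmaster()
    print('qmaster: %.0f%% CPU, %d MB RSS' % (cpu, mem_mb))
    time.sleep(30)
```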
Summary
Cluster size       | 10,000+ slave nodes (master did not run jobs)
SGE versions       | Grid Engine 6.2u5 (from Sun Microsystems), GE 2011.11 (from the Open Grid Scheduler project)
Regions            | us-east-1 (ran over 75% of the instances), us-west-1, us-west-2
Purchasing options | On-demand, Spot
Instance types     | from c1.xlarge to t1.micro
Operating systems  | Ubuntu 10, Ubuntu 11, Oracle Linux 6.3 with the Unbreakable Enterprise Kernel
Other software     | Python, Boto