<b>The Open Source Grid Engine Blog</b><br />
<br />
<b>Getting Size and File Count of a 25 Million Object S3 Bucket</b> (January 29, 2014)<br />
Amazon S3 is a highly durable storage service offered by AWS. With eleven 9s (99.999999999%) of durability, high bandwidth to EC2 instances, and low cost, it is a popular input and output file storage location for Grid Engine jobs. (As a side note, we were at Werner Vogels's AWS Summit 2013 NYC keynote, where he disclosed that S3 stores 2 trillion objects and handles 1.1 million requests/second.)<br />
<br />
<a href="http://2.bp.blogspot.com/-LB5tAJHFGBk/Uun-EgYHK9I/AAAAAAAAACY/gdvINeGLoeM/s1600/AmazonS3GridEngine.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img alt="AWS Summit 2013 keynote - Werner Vogels announces over 2T objects stored in S3" border="0" src="http://2.bp.blogspot.com/-LB5tAJHFGBk/Uun-EgYHK9I/AAAAAAAAACY/gdvINeGLoeM/s1600/AmazonS3GridEngine.JPG" height="130" title="AWS Summit 2013 keynote" width="200" /></a><span class="st">On the other hand, S3 presents its own set of challenges. For example, unlike a POSIX filesystem, there are no directories in S3 - i.e. S3 buckets have a flat namespace, and files (AWS refers them as "objects") can have delimiters that are used to create pseudo directory structures.</span><br />
<br />
<div style="text-align: right;">
</div>
<span class="st">Further, there is no API that returns the size of an S3 bucket or the total number of objects. The only way to find the bucket size is to </span><span class="st"><span class="st">iteratively </span>perform LIST API calls, each of which gives you information on 1000 objects. In fact the boto library offers a higher level function that abstracts the iterating for us, so all it takes is a few lines of python:</span><br />
<blockquote class="tr_bq">
<span style="color: blue;"><span style="font-size: xx-small;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st">from boto.s3.connection import S3Connection<br /><br />s3bucket = S3Connection().get_bucket(<span style="color: red;"><name of bucket></span>)<br />size = 0<br /><br />for key in s3bucket.list():<br /> size += key.size<br /> </span></span></span></span><br />
<span style="color: blue;"><span style="font-size: xx-small;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st"> print "%.3f GB" % (size*1.0/1024/1024/1024)</span></span></span></span></blockquote>
<span class="st"><br /></span>
<span class="st">However, when the above code is run against an S3 bucket with </span><span class="st">25 million objects, it takes 2 hours to finish. A closer look at the boto network traffic confirms that the high level list() function is doing all the heavy lifting of calling the lower level S3 LIST (i.e. over 25,000 LIST operations for the bucket). With such a chatty protocol it explains why it takes 2 hours.</span><br />
<br />
<span class="st">With a quick google search we found most people workaround it by either:</span><br />
<ul>
<li><span class="st">store the size of each object in a database</span></li>
<li><span class="st">extract the size information from the AWS billing console</span></li>
</ul>
<blockquote class="tr_bq">
<span style="color: blue;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st" style="font-size: xx-small;">mysql> SELECT SUM(size) FROM s3objects;</span><span style="font-size: xx-small;"><br /><span class="st">+------------+</span><br /><span class="st">| SUM(size) |</span><br /><span class="st">+------------+</span><br /><span class="st">| 8823199678 |</span><br /><span class="st">+------------+</span><br /><span class="st">1 row in set (20 min 19.83 sec)</span></span></span></span></blockquote>
<span class="st">Note that both are not ideal. While it is much quicker to perform a DB query, we are not usually the creator of the S3 bucket (recall that the bucket, which is owned by another AWS account, stores job input files). Also, in many cases, we need other information such as the total number of files in the S3 bucket, so the total size shown in the billing console doesn't give us the complete picture. And lastly, we need the information of the bucket now and can't wait until the next day to start the Grid Engine cluster and run jobs.</span><br />
<span class="st"><br /></span>
<b><span class="st">Running S3 LIST calls in Parallel</span></b><br />
<span class="st">Actually, the LIST S3 API takes the prefix parameter, which "limits the response to keys that begin with the specified prefix". Thus, we can run concurrent LIST calls, where one call handles the first half, and the other call handles the second half. Then it is just a matter of reducing and/or merging the numbers!</span><br />
<span class="st"><br /></span>
<span class="st">With 2 workers (we use the Python multiprocessing worker pool) running LIST in parallel, we reduced the run time to 59 minutes, and further to 34 minutes with 4 workers. We then maxed out the 2-core c1.medium instance as both processors were 100% busy. At that point we migrated to a c1.xlarge instance that has 8 cores. With 4 times more CPU cores and higher network bandwidth, we started with 16 workers, which took 11 minutes to finish. Then we upped the number of workers to 24, and it took 9 minutes and 30 seconds!</span><br />
<span class="st"><br /></span>
<span class="st">So what if we use even more workers? While our initial goal is to finish the pre-processing of the input bucket in less than 15 minutes, we believe we can extract more parallelism from S3, as S3 is not just 1 but multiple servers that are designed to scale with a large number of operations.</span><br />
<br />
<span class="st"><b>Taking Advantage of S3 Key Hashing</b></span><br />
<span class="st">So as a final test, we picked a 32-vCPU instance and ran 64 workers. It only took 3 minutes and 37 seconds to finish. Then we observed that the work we distribute to the workers doesn't take advantage of S3 key hashing!</span><br />
<span class="st"><br /></span>
<span class="st">As the 25 million object-bucket has keys that start with UUID like names, the first character of the prefix can be 0-9 and a-z (36 characters in total). So key prefixes are generated like:</span><br />
<blockquote class="tr_bq">
<span style="color: blue;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st" style="font-size: x-small;">00, 01, ..., 09, 0a, ..., 0z</span></span></span><br />
<span style="color: blue;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st" style="font-size: x-small;">10, 11, ..., 19, 1a, ..., 1z</span></span></span></blockquote>
<span class="st"><br /></span>
<span class="st">And workers running in parallel at any point in time can have a higher chance of hitting the same S3 server. We performed a loop interchange, and thus the keys become:</span><br />
<blockquote class="tr_bq">
<span style="color: blue;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st" style="font-size: x-small;">00, 10, ..., 90, a0, ..., z0</span></span></span><br />
<span style="color: blue;"><span style="font-family: "Courier New",Courier,monospace;"><span class="st" style="font-size: x-small;">01, 11, ..., 91, a1, ..., z1</span></span></span></blockquote>
<span class="st">Now it takes 2 minutes and 55 seconds.</span><br />
<span class="st"><br /></span>
<br />
<span class="st"></span>Ceciliahttp://www.blogger.com/profile/09205618612839177335noreply@blogger.comtag:blogger.com,1999:blog-5265089473672771219.post-87900583000781826522014-01-07T22:40:00.004-08:002014-01-07T22:49:07.125-08:00Enhanced Networking in the AWS Cloud - Part 2We looked at the <a href="http://blogs.scalablelogic.com/2013/12/enhanced-networking-in-aws-cloud.html">AWS Enhanced Networking performance</a> in the previous blog entry, and this week we just finished benchmarking the remaining instance types in the C3 family. C3 instances are extremely popular as they offer the best price-performance for many big data and HPC workloads.<br />
<br />
<b>Placement Groups</b><br />
An additional detail we didn't mention before: we booted all SUT (System Under Test) pairs in their own AWS Placement Groups. Within a Placement Group, instances get full-bisection bandwidth and lower, more predictable latency for node-to-node communication.<br />
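For reference, here is a boto sketch of booting an SUT pair into its own Placement Group; the group name, AMI ID, and subnet ID are placeholders:<br />
<blockquote class="tr_bq">
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">from boto.ec2 import connect_to_region<br />
<br />
conn = connect_to_region("us-east-1")<br />
conn.create_placement_group("sut-pair-1", strategy="cluster")<br />
# a VPC subnet, so that Enhanced Networking is in effect<br />
conn.run_instances("<span style="color: red;"><HVM AMI id></span>", min_count=2, max_count=2,<br />
                   instance_type="c3.8xlarge",<br />
                   placement_group="sut-pair-1",<br />
                   subnet_id="<span style="color: red;"><subnet id></span>")</span></blockquote>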
<br />
<b>Bandwidth</b><br />
With c3.8xlarge instances, which have 10-Gbit Ethernet, Enhanced Networking offers 44% higher network throughput. The smaller C3 instance types have lower network throughput caps; Enhanced Networking still improves their throughput, but the difference is not as big.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-fjaO2Esi_50/UszqnnvvZPI/AAAAAAAAABw/g-0q4OnRzPY/s1600/c3-4xlargeBW.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-fjaO2Esi_50/UszqnnvvZPI/AAAAAAAAABw/g-0q4OnRzPY/s1600/c3-4xlargeBW.png" height="212" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-pGy10cIdOo0/UszqnjevmwI/AAAAAAAAAB0/04ygpJYLADM/s1600/c3-8xlargeBW.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-pGy10cIdOo0/UszqnjevmwI/AAAAAAAAAB0/04ygpJYLADM/s1600/c3-8xlargeBW.png" height="212" width="400" /></a></div>
<br />
<b>Round-trip Latency</b><br />
The c3.4xlarge and c3.8xlarge have network latency similar to the c3.2xlarge: the round-trip latency for these larger instance types is between 92 and 100 microseconds.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-7Wz6SY1Ckb4/Uszqw7aZuOI/AAAAAAAAACA/o_DaXc_qA6w/s1600/c3-4xlarge-Latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-7Wz6SY1Ckb4/Uszqw7aZuOI/AAAAAAAAACA/o_DaXc_qA6w/s1600/c3-4xlarge-Latency.png" height="212" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-VEzjUPBHwn8/UszqxH_Wg_I/AAAAAAAAACE/DIIDg0qgoBU/s1600/c3-8xlarge-Latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-VEzjUPBHwn8/UszqxH_Wg_I/AAAAAAAAACE/DIIDg0qgoBU/s1600/c3-8xlarge-Latency.png" height="212" width="400" /></a></div>
<br />
<b>Conclusion</b><br />
All C3 instance types with Enhanced Networking enabled offer, in many cases, half the latency for no additional cost.<br />
<br />
On the other hand, without Enhanced Networking, bandwidth-sensitive applications running on c3.8xlarge instances won't be able to fully exploit the 10-Gbit Ethernet when a single thread handles all network traffic -- a common problem decomposition method we have seen in our users' code: MPI for inter-node communication, and OpenMP or even Pthreads for intra-node communication. In such hybrid HPC codes, only one MPI task per node handles network communication. With Enhanced Networking that task gets over 95% of the 10-Gbit Ethernet bandwidth; without it, the MPI task gets only 68% of the available network bandwidth.<br />
<br />
<i>Posted by Cecilia</i><br />
<br />
<b>Enhanced Networking in the AWS Cloud</b> (December 31, 2013)<br />
At re:Invent 2013, Amazon announced the C3 and I2 instance families, which have the higher-performance Xeon Ivy Bridge processors and SSD ephemeral drives, together with support for the new <i>Enhanced Networking</i> feature.<br />
<br />
<b>Enhanced Networking - SR-IOV in EC2</b><br />
Traditionally, EC2 instances send network traffic through the Xen hypervisor. With SR-IOV (Single Root I/O Virtualization) support in the C3 and I2 families, each physical Ethernet NIC presents itself as multiple independent PCIe Ethernet NICs, each of which can be assigned to a Xen guest.<br />
<br />
Thus, an EC2 instance running on hardware that supports Enhanced Networking can "own" one of the virtualized network interfaces, which means it can send and receive network traffic without invoking the Xen hypervisor.<br />
<br />
Enabling Enhanced Networking is as simple as:<br />
<ul>
<li>Create a VPC and subnet</li>
<li>Pick an HVM AMI with the Intel <tt>ixgbevf </tt>Virtual Function driver<tt><br /></tt></li>
<li>Launch a C3 or I2 instance using the HVM AMI</li>
</ul>
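With boto, the three steps look roughly like the sketch below; the AMI and subnet IDs are placeholders, and the last call simply verifies that the instance reports SR-IOV support:<br />
<blockquote class="tr_bq">
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">from boto.ec2 import connect_to_region<br />
<br />
conn = connect_to_region("us-east-1")<br />
res = conn.run_instances("<span style="color: red;"><HVM AMI id></span>",  # AMI with the ixgbevf driver<br />
                         instance_type="c3.xlarge",<br />
                         subnet_id="<span style="color: red;"><VPC subnet id></span>")<br />
# "simple" means SR-IOV (Enhanced Networking) is enabled<br />
print conn.get_instance_attribute(res.instances[0].id, "sriovNetSupport")</span></blockquote>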
<br />
<b>Benchmarking</b><br />
We use the Amazon Linux AMI, as it already has the <i>ixgbevf</i> driver installed, and Amazon Linux is available in all regions. We use netperf to benchmark C3 instances running in a VPC (i.e. Enhanced Networking enabled) against non-VPC (i.e. Enhanced Networking disabled).<br />
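netperf measures bulk throughput and round-trip latency with its TCP_STREAM and TCP_RR tests, respectively; a typical invocation looks like the following (the peer IP is a placeholder):<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">$ netserver                            # on the receiving instance<br />
$ netperf -H <peer ip> -t TCP_STREAM   # bulk throughput<br />
$ netperf -H <peer ip> -t TCP_RR       # request/response round-trip latency</span></blockquote>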
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<b>Bandwidth</b><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Enhanced Networking offers up to a 7.3% gain in throughput. Note that with or without Enhanced Networking, both c3.xlarge and c3.2xlarge almost reach 1 Gbps (which we believe is the hard limit set by Amazon for those instance types).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-GkVCnUzbjXo/UsJ8ypyj7JI/AAAAAAAAAAg/lg6koCKEHDs/s1600/c3-large-BW.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-GkVCnUzbjXo/UsJ8ypyj7JI/AAAAAAAAAAg/lg6koCKEHDs/s400/c3-large-BW.png" height="212" width="400" /> </a> </div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-N9HIuk8C8VA/UsJ8zAOGJFI/AAAAAAAAAA0/MEL82S_UTYU/s1600/c3-xlarge-BW.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-N9HIuk8C8VA/UsJ8zAOGJFI/AAAAAAAAAA0/MEL82S_UTYU/s400/c3-xlarge-BW.png" height="212" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-EtUP16sSxp4/UsJ8yZihQqI/AAAAAAAAAAo/ZpDRhg13NSk/s1600/c3-2xlarge-BW.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-EtUP16sSxp4/UsJ8yZihQqI/AAAAAAAAAAo/ZpDRhg13NSk/s400/c3-2xlarge-BW.png" height="212" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<b>Round-trip Latency</b><br />
Many message-passing MPI and HPC applications are latency sensitive. Here Enhanced Networking support really shines, with a maximum speedup of 2.37x over the normal EC2 networking stack.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-J2LvU_3YKoI/Uszltfw7RLI/AAAAAAAAABY/nTnv00leUyc/s1600/c3-large-Latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-J2LvU_3YKoI/Uszltfw7RLI/AAAAAAAAABY/nTnv00leUyc/s1600/c3-large-Latency.png" height="212" width="400" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-lnoakwTAH6k/UszltT3wimI/AAAAAAAAABU/OaAmuPQFrPI/s1600/c3-xlarge-Latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-lnoakwTAH6k/UszltT3wimI/AAAAAAAAABU/OaAmuPQFrPI/s1600/c3-xlarge-Latency.png" height="212" width="400" /> </a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-Jlxp-dlQEBI/UszltcofY_I/AAAAAAAAABQ/JULdEPUTzfs/s1600/c3-2xlarge-Latency.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-Jlxp-dlQEBI/UszltcofY_I/AAAAAAAAABQ/JULdEPUTzfs/s1600/c3-2xlarge-Latency.png" height="212" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<b>Conclusion 1</b><br />
Amazon says that both the c3.large and c3.xlarge instances have "Moderate" network performance, but we found that c3.large peaks at around 415 Mbps, while c3.xlarge almost reaches 1Gbps. We believe the extra bandwidth headroom is for EBS traffic, as c3.xlarge can be configured as "EBS-optimized" while c3.large cannot.<br />
<br />
<b>Conclusion 2 </b><br />
Notice that c3.2xlarge with Enhanced Networking enabled has a round-trip latency of 92 microseconds, which is much higher than that of the smaller instance types in the C3 family. We repeated the test in both the us-east-1 and us-west-2 regions and got identical results.<br />
<br />
Currently AWS has a shortage of C3 instances -- all c3.4xlarge and c3.8xlarge launch requests we have issued so far resulted in "Insufficient capacity". We are closely monitoring the situation, and we plan to benchmark the c3.4xlarge and c3.8xlarge instance types to see if we can reproduce the increased-latency issue.<br />
<br />
<i>Updated Jan 8, 2014</i>: We have published <a href="http://blogs.scalablelogic.com/2014/01/enhanced-networking-in-aws-cloud-part-2.html">Enhanced Networking in the AWS Cloud (Part 2)</a> that includes the benchmark results for the remaining C3 types.<br />
<br />
<i>Posted by Cecilia</i><br />
<br />
<b>Running a 10,000-node Grid Engine Cluster in Amazon EC2</b> (November 21, 2012)<br />
Recently, we provisioned a 10,000-node Grid Engine cluster in Amazon EC2 to test the scalability of Grid Engine. As <a href="http://gridscheduler.sourceforge.net/">the official maintainer of open-source Grid Engine</a>, we have an obligation to make sure that Grid Engine continues to scale in modern datacenters.<br />
<br />
<b>Grid Engine Scalability - From 1,000 to 10,000 Nodes</b><br />
From time to time, we receive questions about Grid Engine scalability, which is not surprising given that modern HPC clusters and compute farms continue to grow in size. Back in 2005, Hess Corp. was having issues when its Grid Engine cluster exceeded 1,000 nodes. We quickly fixed the limitation in the low-level Grid Engine communication library and contributed the code back to the open-source code base. In fact, our code continues to live in every fork of Sun Grid Engine (including Oracle Grid Engine and other commercial versions from smaller players) today.<br />
<br />
So Grid Engine can handle thousands of nodes, but can it handle tens of thousands? And how would Grid Engine perform with over 10,000 nodes in a single cluster? Previously, besides simulating virtual hosts, we could only take a reactive approach: customers report issues, we collect the error logs remotely, and then we try to provide workarounds or fixes. In the Cloud age, shouldn't we take a proactive approach, now that hardware resources are so much more accessible?<br />
<br />
<b>Running Grid Engine in the Cloud</b><br />
<a href="http://3.bp.blogspot.com/-evU6S1wERjw/ULVU6ZoKNsI/AAAAAAAAACA/0JITauIFicA/s1600/GridEngineCloud.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://3.bp.blogspot.com/-evU6S1wERjw/ULVU6ZoKNsI/AAAAAAAAACA/0JITauIFicA/s1600/GridEngineCloud.JPG" /></a>In 2008, my former coworker published a paper about <a href="https://www.usenix.org/publications/login/october-2008-volume-33-number-5/benchmarking-amazon-ec2-high-performance">benchmarking Amazon EC2 for HPC workloads</a> (many people have quoted the paper, including <a href="http://www.beowulf.org/pipermail/beowulf/2009-May/025634.html">me on the Beowulf Cluster mailing list</a> back in 2009), so running Grid Engine in the Cloud is not something unimaginable.<br />
<br />
Since we don't want to maintain our own servers, using the Cloud for regression testing makes sense. In 2011, after joining the MIT StarCluster project, we started using <a href="http://star.mit.edu/cluster/">MIT StarCluster</a> to help us test new releases; indeed, the Grid Engine 2011.11 release was the first one tested solely in the Cloud. Running scalability tests in the Cloud makes sense for the same reason: we also don't want to maintain a 10,000-node cluster that sits idle most of the time!<br />
<br />
<b>Requesting 10,000 nodes in the Cloud</b><br />
Most of us just request resources in the Cloud and never need to worry about resource contention. But EC2 has default soft limits, set by Amazon to catch runaway resource requests: each account can only have 20 on-demand and 100 spot instances per region. As 10,000 nodes (instances) exceeds the soft limit many times over, we asked Amazon to raise the limits on our account. Luckily, the process was painless! (Thinking about it, the limit actually makes sense for new EC2 customers - imagine the financial damage an infinite loop of instance requests could do, or, in the extreme case, the damage a hacker with a stolen credit card could do by launching a 10,000-node botnet!)<br />
<br />
<b>Launching a 10,000-node Grid Engine Cluster</b><br />
With the limit increased, we first started a master node on a c1.xlarge (High-CPU Extra Large Instance, a machine with 8 virtual cores and 7GB of memory). We then added instances (nodes) in chunks - always fewer than 1,000 per chunk, so that we could change the instance type and the bid price with each request chunk (see the sketch after this list). We mainly requested spot instances because:<br />
<ul>
<li>Spot instances are usually cheaper than on-demand instances. (As I write this, a c1.xlarge spot instance costs only 10.6% of the on-demand price.)</li>
<li>It is easier for Amazon to handle our 10,000-node Grid Engine
cluster, as Amazon can terminate some of our spot instances if there is a spike in demand for on-demand instances.</li>
<li>We can also test the behavior of Grid Engine when some of the nodes
go down. With traditional HPC clusters, we needed to kill some of the nodes manually to simulate hardware failure, but spot termination does this for us. <u>We are in fact taking advantage of spot termination! </u></li>
</ul>
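A boto sketch of one request chunk; the bid price is made up, and ami-899d49e0 is the StarCluster AMI mentioned later in this post:<br />
<blockquote class="tr_bq">
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">from boto.ec2 import connect_to_region<br />
<br />
conn = connect_to_region("us-east-1")<br />
conn.request_spot_instances(price="0.08",  # bid, in USD per hour<br />
                            image_id="ami-899d49e0",<br />
                            count=500,     # keep each chunk under 1,000<br />
                            instance_type="c1.xlarge")</span></blockquote>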
<br />
All went well until we had around 2,000 instances in a region, when we found that further spot requests got fulfilled but the instances were terminated almost instantly. A quick look at the error logs and the AWS Management Console showed that we had exceeded the EBS volume limit, which is 5,000 volumes or 20 TB of total size. We were using the StarCluster EBS AMI (e.g. ami-899d49e0 in us-east-1), which has a size of 10GB, so 2,000 of those instances running in a region definitely exceeded the 20TB limit!<br />
<br />
Luckily, StarCluster also offers S3 (instance-store) AMIs, although with an older OS version and the older SGE 6.2u5. As all Grid Engine releases from the Open Grid Scheduler project are wire-compatible with SGE 6.2u5, we quickly launched a few of those S3 AMIs as a test. Not surprisingly, the new nodes joined the cluster without any issues, so we continued adding instance-store instances, soon achieving our goal of a 10,000-node cluster:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-25eCOQvtIDw/UKyBJ4qsJiI/AAAAAAAAABc/QD9bCuzAeMA/s1600/GridEngineCloudComputing.png" imageanchor="1"><img alt="10,000-node Grid Engine cluster on Amazon EC2 Cloud" border="0" src="http://3.bp.blogspot.com/-25eCOQvtIDw/UKyBJ4qsJiI/AAAAAAAAABc/QD9bCuzAeMA/s1600/GridEngineCloudComputing.png" title="" /></a></div>
A few more findings:<br />
<ul>
<li>We kept sending spot requests to the us-east-1 region until "capacity-not-able" was returned. We expected the bid price to go sky-high when an instance type ran out, but that did not happen.</li>
<li>When a certain instance-type gets mostly used up, further spot requests for the instance type get slower and slower.</li>
<li>At peak rate, we were able to provision over 2,000 nodes in less than 30 minutes. In total, we spent less than 6 hours constructing the cluster, debugging the EBS volume limit issue, running a small number of Grid Engine tests, and taking the cluster down.</li>
<li>Instance boot time was independent of the instance type: EBS-backed c1.xlarge and t1.micro took roughly the same amount of time to boot.</li>
</ul>
<br />
<b>HPC in the Cloud</b><br />
To put a 10,000-node cluster into perspective, we can take a look at some of the recent TOP500 entries running x64 processors:<br />
<ul>
<li>SuperMUC, #6 in TOP500, has 9,400 compute nodes</li>
<li>Stampede, #7 in TOP500, will have 6,000+ nodes when completed</li>
<li>Tianhe-1A, #8 in TOP500 (was #1 till June 2011), has 7,168 nodes</li>
</ul>
In fact, a quick look at the TOP500 list shows that over 90% of the entries have fewer than 10,000 nodes, so in terms of <i>raw processing power</i>, one can easily rent a supercomputer in the Cloud and get comparable compute power.
However, some HPC workloads are very sensitive to network (MPI) latency, and we believe dedicated HPC clusters still have an edge when running those workloads (after all, those clusters are designed to have a low-latency network). It's also worth mentioning that the <a href="http://aws.typepad.com/aws/2010/07/the-new-amazon-ec2-instance-type-the-cluster-compute-instance.html">Cluster Compute instance type</a> with 10GbE can reduce some of the latency.<br />
<br />
<b>Future work</b><br />
With 10,000 nodes, embarrassingly parallel workloads can complete in 1/10 of the time compared to a 1,000-node cluster. Also worth mentioning: the working storage that holds the input and output is needed for 1/10 of the time as well (e.g. instead of 10 TB-months, only 1 TB-month is needed). So not only do you get the results faster, it actually costs less due to reduced storage costs!<br />
<br />
With 10,000 nodes and a small number of submitted jobs, the multi-threaded Grid Engine qmaster process constantly uses 1.5 to 2
cores (at most 4 cores at peak) on the master node, and around 500MB of memory. With more tuning or a more powerful instance type, we believe Grid Engine can handle at least 20,000 nodes. <br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-7G8vXmRtucA/UKyF3090h0I/AAAAAAAAABs/8osvMZzpMb4/s1600/GridEngineQmaster.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-7G8vXmRtucA/UKyF3090h0I/AAAAAAAAABs/8osvMZzpMb4/s640/GridEngineQmaster.png" height="235" width="640" /></a></div>
<br />
Note: we are already working on enhancements that
further reduce the CPU usage!<br />
<br />
<b>Summary</b><br />
<br />
<table border="2" cellpadding="5px" style="background-color: while; border-collapse: collapse; margin: auto;">
<tbody>
<tr>
<td>Cluster size</td>
<td>10,000+ slave nodes (master did not run jobs)</td>
</tr>
<tr>
<td>SGE versions</td>
<td>Grid Engine 6.2u5 (from Sun Microsystems)<br />
GE 2011.11 (from the Open Grid Scheduler Project)</td>
</tr>
<tr>
<td>Regions</td>
<td>us-east-1 (ran over 75% of the instances)<br />
us-west-1<br />
us-west-2</td>
</tr>
<tr>
<td>Purchasing options</td>
<td>On-demand<br />
Spot</td>
</tr>
<tr>
<td>Instance types</td>
<td>from c1.xlarge to t1.micro</td>
</tr>
<tr>
<td>Operating systems</td>
<td>Ubuntu 10<br />
Ubuntu 11<br />
Oracle Linux 6.3 with the Unbreakable Enterprise Kernel</td>
</tr>
<tr><td>Other software</td>
<td>Python, Boto</td></tr>
</tbody></table>
<i>Posted by Rayson</i><br />
<br />
<b>Using Grid Engine in the Xeon Phi Environment</b> (July 31, 2012)<br />
The <a href="http://www.intel.com/content/www/us/en/architecture-and-technology/many-integrated-core/intel-many-integrated-core-architecture.html">Intel Xeon Phi</a> (codename MIC - Many Integrated Core) is an interesting HPC hardware platform. While it is not in production yet, a lot of porting and optimization work has already been done for Xeon Phi. At the beginning of this year we also started a conversation with Intel - we wanted to make sure that Open Grid Scheduler/Grid Engine works in the Xeon Phi environment, as we had received requests for Xeon Phi support at SC11.<br />
<br />
While a lot of information is still under NDA, many papers have already been published on the Internet, and Intel engineers themselves have disclosed information about the Xeon Phi architecture and programming models. For example, it is by now widely known that Xeon Phi runs an embedded Linux OS image on the board - in fact, users can log onto the board and use it as a multi-core Linux machine.<br />
<br />
One Xeon Phi execution model is the more traditional <i>offload execution model</i>, where applications running on the host processors offload sections of code to the Xeon Phi accelerator. (Intel also defines the Language Extensions for Offload (LEO) compiler directives to ease porting.) With the <i>standalone execution model</i>, on the other hand, users execute code directly and natively on the board, and the host processors are not involved in the computation.<br />
<br />
Open Grid Scheduler/Grid Engine can easily be configured to handle the offload execution model, as a Xeon Phi used this way is very similar to a GPU device: Grid Engine can schedule jobs to the hosts that have Xeon Phi boards and make sure that the hardware resources are not oversubscribed. The standalone execution model is more interesting - it provides the Linux OS environment that most HPC users are familiar with, but it adds a level of indirection to job execution. We don't have the design finalized yet, as the software environment has not been released to the public, but our plan is to support both execution models in a future version of Open Grid Scheduler/Grid Engine.<br />
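For the offload model, one plausible setup - the consumable complex name <tt>phi</tt> and the host name are our own, following the pattern commonly used for GPUs - is to model each board as a consumable host resource:<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">$ qconf -mc            # add a consumable:  phi  phi  INT  <=  YES  YES  0  0<br />
$ qconf -me phihost01  # a host with 2 boards: complex_values phi=2<br />
$ qsub -l phi=1 offload_job.sh   # each job reserves one board</span></blockquote>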
<br />Ceciliahttp://www.blogger.com/profile/09205618612839177335noreply@blogger.comtag:blogger.com,1999:blog-5265089473672771219.post-78415165591836275542012-07-09T23:18:00.002-07:002012-08-03T15:14:51.730-07:00Optimizing Grid Engine for AMD Bulldozer SystemsThe AMD Bulldozer series (including Piledriver, which was released recently) is very interesting from a micro-architecture point of view. A Bulldozer processor can have as many as 16-cores, and cores are further grouped into modules. With the current generation, each module contains 2 cores, so an 8-module processor has 16 cores, and a 4-module processor has 8 cores, etc.<br />
<br />
<ul>
<a href="http://3.bp.blogspot.com/-FF4zL35p-SM/T_vHD3XkKhI/AAAAAAAAAAw/l2hgAbmhQCk/s1600/GridEngine-AMD.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="211" src="http://3.bp.blogspot.com/-FF4zL35p-SM/T_vHD3XkKhI/AAAAAAAAAAw/l2hgAbmhQCk/s320/GridEngine-AMD.JPG" width="320" /></a>
<li>The traditional SMT implementation (e.g. Intel Hyper-Threading) pretty much duplicates the register file and the processor front-end, but as most of the execution pipeline is shared between the pair of SMT threads, performance can be greatly affected when the SMT threads compete for hardware resources.</li>
<li>For Bulldozer, only the Floating-Point Unit, <a href="http://developer.amd.com/Assets/SharedL1InstructionCacheonAMD15hCPU.pdf">L1 instruction cache</a>, L2 cache, and a small part of the execution pipeline are shared, making resource contention a much smaller concern.</li>
</ul>
<div>
A lot of HPC clusters turn SMT off completely, as performance is the main objective for those installations. Bulldozer processors are less affected by a topology-unaware OS scheduler, but it still helps if system software understands the architecture of the processor. For example, the patched Windows scheduler that understands the AMD hardware can <a href="http://blogs.amd.com/play/2012/01/11/early-results-achieved-with-amd-fx-processor-using-windows%C2%AE-7-scheduler-update/">boost performance by 10%</a>.</div>
<div>
<br /></div>
<div>
And what does this mean for Grid Engine? The Open Grid Scheduler project implemented <a href="http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html">Grid Engine Job Binding</a> with the hwloc library (initially for the AMD Magny-Cours Opteron 6100 series - the original PLPA library used by Sun Grid Engine could not handle newer AMD processors), and we also added Linux cpuset support (when the <a href="http://blogs.scalablelogic.com/2012/05/grid-engine-cgroups-integration.html">Grid Engine cgroups Integration</a> is enabled and the cpuset controller is present). In both cases, the execution daemon essentially becomes the local scheduler that dispatches processes to the execution cores. With a smarter execution daemon (execd), we can speed up job execution with no changes to any application code.<br />
<br />
(And yes, this feature will be available in the GE 2011.11 update 1 release.)</div>
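For example, assuming the two cores of a Bulldozer module are enumerated adjacently in the topology, a binding request like the following keeps each job within a single module (the job script name is a placeholder):<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"># bind the job to 2 successive cores, i.e. one Bulldozer module<br />
$ qsub -binding linear:2 ./my_threaded_job.sh</span></blockquote>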
<div>
<br /></div>Raysonhttp://www.blogger.com/profile/13615612986236137908noreply@blogger.comtag:blogger.com,1999:blog-5265089473672771219.post-70985018022031809692012-06-06T09:03:00.004-07:002012-06-06T09:11:39.594-07:00Grid Engine Cygwin PortOpen Grid Scheduler/Grid Engine will support Windows/Cygwin with the GE 2011.11u1 release. We found that many sites just need to submit jobs to the Unix/Linux compute farm from Windows workstations, so in this release only the client-side is supported under our <a href="http://www.scalablelogic.com/scalable-grid-engine-support">commercial Grid Engine support program</a>. For sites that need to run client-side Grid Engine commands (eg. qsub, qstat, qacct, qhost, qdel, etc), our Cygwin port totally fits their needs. We are satisfied with Cygwin until our true native Windows port is ready...<br />
<div>
<br /></div>
<div>
Running daemons under Cygwin is currently a technology preview. We've tested a small cluster of Windows execution machines, and the results look promising:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-RdT09sVvyZI/T896_XCMK1I/AAAAAAAAAAk/PbGy0p7p3vA/s1600/GridEngine-Cygwin.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="http://2.bp.blogspot.com/-RdT09sVvyZI/T896_XCMK1I/AAAAAAAAAAk/PbGy0p7p3vA/s400/GridEngine-Cygwin.png" width="306" /></a></div>
<div>
<br /></div>
<div>
<div>
<br /></div>
<div>
Note that this is not the first attempt to port Grid Engine to the Cygwin environment. In 2003, Andy mentioned in the <a href="http://markmail.org/message/djdjqwvg4q4f3qj4">Compiling Grid Engine under Cygwin</a> thread that Sun/Gridware had ported the Grid Engine CLI to Cygwin, but daemons were not supported in that original port. Ian Chesal (who worked for Altera at the time) ported the Grid Engine daemons as well, but did not release any code. In 2011 we started from scratch again, and we checked in code changes earlier this year - the majority of the code is already in the GE 2011.11 patch 1 release, with the rest coming in the update 1 release.</div>
</div>
</div>
<div>
<br /></div>
<div>
So finally, the Cygwin port is in the source tree this time - no more out-of-tree patches.</div>
<br />
<i>Posted by Rayson</i><br />
<br />
<b>Giving Away a Cisco Live Full Conference Pass</b> (June 1, 2012)<br />
Back in May, we attended a local Cisco event here in Toronto. Besides talking to Cisco engineers about their datacenter products and networking technologies, we also met with some technical UCS server people (more on Cisco UCS Blade Servers and Open Grid Scheduler/Grid Engine in a later blog entry).<br />
<br />
<a href="http://3.bp.blogspot.com/-aWn0wBqO1_E/T8jalEkdZkI/AAAAAAAAAAQ/rErtiLhNrKQ/s1600/220px-CiscoUCS.JPG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://3.bp.blogspot.com/-aWn0wBqO1_E/T8jalEkdZkI/AAAAAAAAAAQ/rErtiLhNrKQ/s1600/220px-CiscoUCS.JPG" /></a>We also received a <a href="http://www.ciscolive.com/us/registration-packages/">Cisco Live Conference Pass</a>, which allows us to attend everything at the conference (ie. the full experience) in San Diego, CA on June 10-14, 2012, and we are planning to give it to the first person who sends us the right answer to the following question:<br />
<br />
<i><b>When run with 20 MPI processes, what will the value of recvbuf[i][i] be for i=0..19 in MPI_COMM_WORLD rank 17 when this application calls MPI_Finalize?</b></i><br />
<br />
<br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">#include <mpi.h></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">int sendbuf[100];</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">int recvbuf[20][100];</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">MPI_Request reqs[40];</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">MPI_Request send_it(int dest, int len)</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">{</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> int i;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Request req;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> for (i = 0; i < len; ++i) {</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> sendbuf[i] = dest;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Isend(sendbuf, len, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> return req;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">}</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">MPI_Request recv_it(int src, int len)</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">{</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Request req;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Irecv(recvbuf[src], len, MPI_INT, src, 0, MPI_COMM_WORLD, &req);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> return req;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">}</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">int main(int argc, char *argv[])</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">{</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> int i, j, rank, size;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Init(NULL, NULL);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Comm_rank(MPI_COMM_WORLD, &rank);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Comm_size(MPI_COMM_WORLD, &size);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> /* Bound the number of procs involved, just so we can be lazy and</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> use a fixed-length sendbuf/recvbuf. */</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> if (rank < 20) {</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> for (i = j = 0; i < size; ++i) {</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> reqs[j++] = send_it(i, 5);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> reqs[j++] = recv_it(i, 5);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Waitall(j, reqs, MPI_STATUSES_IGNORE);</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> }</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"><br /></span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> MPI_Finalize();</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;"> return 0;</span><br />
<span style="color: blue; font-family: 'Courier New', Courier, monospace; font-size: x-small;">}</span><br />
<br />
<br />
The code above and the question were written by <a href="http://blogs.cisco.com/category/performance/">Mr Open MPI, Jeff Squyres</a>, who has worked with us since as early as the pre-Oracle Grid Engine days on PLPA, and who suggested that we migrate to the hwloc topology library. (Side note: when Open Grid Scheduler became the maintainer of the open-source Grid Engine code base in 2011, <a href="http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html">Grid Engine Multi-Core Processor Binding with hwloc</a> was one of the first major features we added to Open Grid Scheduler/Grid Engine, to support discovery of newer system topologies.)<br />
<br />
So send <a href="mailto:rayson@scalablelogic.com">us</a> the answer, and the first one who answers the question correctly will get the pass to attend the conference!<br />
<br />
<br />
<br />
<i>Posted by Rayson</i><br />
<br />
<b>Grid Engine cgroups Integration</b> (May 22, 2012)<br />
The PDC (Portable Data Collector) in Grid Engine's job execution daemon tracks job-process membership for resource usage accounting, for job control (i.e. making sure that jobs don't exceed their resource limits), and for signaling (e.g. stopping and killing jobs).<br />
<div>
<br /></div>
<div>
Since most operating systems don't have a mechanism to group processes into jobs, Grid Engine adds an additional group ID to each job. As normal processes can't change their GID membership, this is a safe way to tag processes with the job they belong to. On operating systems where the PDC module is enabled, the execution daemon periodically scans all the processes running on the system and groups processes into jobs by looking for the additional GID tag.</div>
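On Linux, the tag shows up in each job process's supplementary groups; the PID and GID below are made-up examples (the GID comes from gid_range):<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"># 20017 is the extra GID that tags this process as belonging to a job<br />
$ grep Groups: /proc/12345/status<br />
Groups: 100 20017</span></blockquote>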
<div>
<br /></div>
<div>
<span style="font-size: large;"><b>So far so good, but...</b></span></div>
<div>
Adding an extra GID has side-effects. We have received reports that applications behave strangely with an unresolvable GID. For example, on Ubuntu, we get:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">$ qrsh</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">groups: cannot find name for group ID 20017</span></div>
</div>
<div>
<br /></div>
<div>
Another problem: it takes time for the PDC to warm up. For some short-running jobs, you will find:</div>
<div>
<br /></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">removing unreferenced job 64623.394 without job report from ptf</span></div>
<div>
<br /></div>
<div>
The third problem is that if the PDC runs too often, it takes too much CPU time. In SGE 6.2u5, a memory accounting bug was introduced because the Grid Engine developers added a workaround to reduce the CPU usage of the PDC on Linux. (Shameless plug: we, the Open Grid Scheduler developers, fixed that bug back in 2010, way ahead of any other Grid Engine implementation still active today.) Imagine running <span style="font-family: 'Courier New', Courier, monospace;">ps -elf</span> every second on your execution nodes - that is how intrusive the PDC is!</div>
<div>
<br /></div>
<div>
The final major issue is that the PDC is not accurate. Grid Engine itself does not trust the information from the PDC at job cleanup. The end result is runaway jobs consuming resources on the execution hosts. Cluster administrators then need to enable a special flag to tell Grid Engine to do proper job cleanup (by default <span style="font-family: 'Courier New', Courier, monospace;">ENABLE_ADDGRP_KILL</span> is off). Quoting the <a href="http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html">Grid Engine sge_conf manpage</a>:</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;">ENABLE_ADDGRP_KILL</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> If this parameter is set then Sun Grid Engine uses the</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> supplementary group ids (see gid_range) to identify all</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> processes which are to be terminated when a job is</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> deleted, or when sge_shepherd(8) cleans up after job</span></div>
<div>
<span style="font-family: 'Courier New', Courier, monospace;"> termination.</span></div>
</div>
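To turn it on, set the flag in <span style="font-family: 'Courier New', Courier, monospace;">execd_params</span> in the cluster configuration, e.g.:<br />
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace; font-size: x-small;">$ qconf -mconf    # then, in the editor, set:<br />
execd_params  ENABLE_ADDGRP_KILL=true</span></blockquote>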
<div>
<br /></div>
<div>
<span style="font-size: large;"><b>Grid Engine cgroups Integration</b></span></div>
<div>
In Grid Engine 2011.11 update 1, we switch from the additional GID to cgroups as the process-tagging mechanism.</div>
<div>
<br /></div>
<div>
(We the <a href="http://gridscheduler.sourceforge.net/">Open Grid Scheduler / Grid Engine</a> developers wrote the PDC code for AIX, HP-UX, and the initial PDC code for MacOS X, which is used as the base for the FreeBSD and NetBSD PDC. We even wrote a PDC prototype for Linux that does not rely on GID. Our code was contributed to Sun Microsystems, and is used in every implementation of Grid Engine - whether it is commercial, or open source, or commercial open source like Open Grid Scheduler.)</div>
<div>
<br /></div>
<div>
As almost half of the PDCs were developed by us, we know the issues in the PDC well.</div>
<div>
<br /></div>
<div>
We are switching to cgroups now, rather than earlier, because:</div>
<div>
<ol>
<li>Most Linux distributions ship kernels that have cgroups support.</li>
<li>We are seeing more and more cgroups improvements. Lots of cgroups performance issues were fixed in recent Linux kernels.</li>
</ol>
</div>
<div>
With the cgroups integration in Grid Engine 2011.11 update 1, all the PDC issues mentioned above are handled. Further, cgroups gives us bonus features:</div>
<div>
<ol>
<li>Accurate memory usage accounting: ie. shared pages are accounted correctly.</li>
<li>Resource limit at the job level, not at the individual process level.</li>
<li>Out of the box SSH integration.</li>
<li>RSS (real memory) limit: we all have jobs that try to use every single byte of memory they can get, even though capping their RSS does not hurt their performance. We may as well cap the RSS so that the spare memory can be given back to other jobs.</li>
<li>With the cpuset cgroup controller, Grid Engine can set processor binding and memory locality reliably. Jobs that change their own processor binding are not handled by the original <a href="http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html">Grid Engine Processor Binding with hwloc</a> (another shameless plug: we were again the first to switch to hwloc for processor binding). It is very rare to encounter jobs that change their own processor binding, but if a job or external process changes its own processor mask, it can affect other jobs running on the system - a cpuset enforces the binding regardless.</li>
<li>Finally, with the freezer controller, we can have a safe mechanism for stopping and resuming jobs:</li>
</ol>
</div>
<div>
<blockquote class="tr_bq">
<span style="font-family: 'Courier New', Courier, monospace;">$ qstat<br />job-ID prior name user state submit/start at<br />queue slots ja-task-ID<br />-----------------------------------------------------------------------------------------------------------------<br /> 16 0.55500 sleep sgeadmin r 05/07/2012 05:44:12<br />all.q@master 1<br />$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state<br /><span style="color: red;">THAWED</span><br />$ qmod -sj 16<br />sgeadmin - suspended job 16<br />$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state<br /><span style="color: red;">FROZEN</span><br />$ qmod -usj 16<br />sgeadmin - unsuspended job 16<br />$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state<br /><span style="color: red;">THAWED</span></span></blockquote>
<div>
<br /></div>
<div>
We will be announcing more new features in <u>Grid Engine 2011.11 update 1</u> here on this blog. Stay tuned for our announcement.</div>
</div>
<div>
<br /></div><i>Posted by Rayson</i>