We looked at the AWS Enhanced Networking performance in the previous blog entry, and this week we just finished benchmarking the remaining instance types in the C3 family. C3 instances are extremely popular as they offer the best price-performance for many big data and HPC workloads.
Placement Groups
An additional detail we didn't mention before: we booted all SUT (System Under Test) pairs in their own AWS Placement Groups. Instances in a Placement Group get full-bisection bandwidth and lower, more predictable network latency for node-to-node communication.
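For reference, setting up such a pair looks roughly like the sketch below. The group name, AMI ID, and instance IDs are placeholders we made up, not the resources used in these benchmarks:

```shell
# Hypothetical name; the AMI ID below is a placeholder, not a real image.
PG_NAME=en-bench-pg
echo "placement group: $PG_NAME"

# A "cluster" placement group puts instances on the same low-latency
# network segment:
# aws ec2 create-placement-group --group-name "$PG_NAME" --strategy cluster
#
# Launch the SUT pair into it:
# aws ec2 run-instances --image-id ami-xxxxxxxx \
#     --instance-type c3.8xlarge --count 2 \
#     --placement "GroupName=$PG_NAME"
```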
Bandwidth
On c3.8xlarge instances, which have 10Gbit Ethernet, Enhanced Networking offers 44% higher network throughput. The smaller C3 instance types have lower network throughput caps, so while Enhanced Networking still improves their throughput, the difference is not as large.
Round-trip Latency
The c3.4xlarge and c3.8xlarge have network latency similar to the c3.2xlarge: between 92 and 100 microseconds round trip.
Conclusion
In many cases, all C3 instance types with Enhanced Networking enabled offer half the latency, at no additional cost.
On the other hand, without Enhanced Networking, bandwidth-sensitive applications running on c3.8xlarge instances cannot fully take advantage of the 10Gbit Ethernet when a single thread handles all network traffic. This is a common problem decomposition method we have seen in our users' code: MPI for inter-node communication, and OpenMP or even Pthreads for intra-node communication, so only one MPI task per node touches the network. For those hybrid HPC codes, Enhanced Networking delivers over 95% of the 10Gbit Ethernet bandwidth; with Enhanced Networking disabled, the MPI task gets only 68% of the available network bandwidth.
Tuesday, January 7, 2014
Tuesday, December 31, 2013
Enhanced Networking in the AWS Cloud
At re:Invent 2013, Amazon announced the C3 and I2 instance families, which pair higher-performance Ivy Bridge Xeon processors and SSD ephemeral drives with support for the new Enhanced Networking feature.
Enhanced Networking - SR-IOV in EC2
Traditionally, EC2 instances send network traffic through the Xen hypervisor. With SR-IOV (Single Root I/O Virtualization) support in the C3 and I2 families, each physical Ethernet NIC presents itself as multiple independent PCIe Ethernet NICs, each of which can be assigned to a Xen guest.
Thus, an EC2 instance running on hardware that supports Enhanced Networking can "own" one of the virtualized network interfaces, which means it can send and receive network traffic without invoking the Xen hypervisor.
Enabling Enhanced Networking is as simple as:
- Create a VPC and subnet
- Pick an HVM AMI with the Intel ixgbevf Virtual Function driver
- Launch a C3 or I2 instance using the HVM AMI
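Once an instance is up, you can check whether Enhanced Networking is actually in effect. A minimal sketch; the instance ID is a placeholder and we assume the interface is named eth0:

```shell
# From your workstation, ask EC2 whether the sriovNetSupport attribute
# is set on the instance (i-xxxxxxxx is a placeholder):
# aws ec2 describe-instance-attribute \
#     --instance-id i-xxxxxxxx --attribute sriovNetSupport

# From inside the instance: with Enhanced Networking active, the NIC
# should be driven by ixgbevf rather than the Xen netfront driver.
check_driver() {
    if command -v ethtool >/dev/null 2>&1; then
        ethtool -i "$1" 2>/dev/null | awk '/^driver:/ {print $2}'
    else
        # Fall back to sysfs, which needs no extra tools.
        basename "$(readlink "/sys/class/net/$1/device/driver" 2>/dev/null)" 2>/dev/null
    fi
}

drv=$(check_driver eth0)
echo "eth0 driver: ${drv:-unknown}"
```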
Benchmarking
We use the Amazon Linux AMI, as it already has the ixgbevf driver installed and is available in all regions. We use netperf to benchmark C3 instances running in a VPC (i.e., Enhanced Networking enabled) against those outside a VPC (i.e., Enhanced Networking disabled).
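The measurements boil down to netperf's bulk-transfer and request/response tests. A sketch of the invocations; SERVER_IP and the 30-second run length are our assumptions, not necessarily the exact parameters used here:

```shell
# On the server instance of the SUT pair, start the daemon:
#     netserver
#
# On the client instance (SERVER_IP is the server's private address):
#     netperf -H "$SERVER_IP" -t TCP_STREAM -l 30   # throughput, Mbit/s
#     netperf -H "$SERVER_IP" -t TCP_RR -l 30       # transactions/s

# TCP_RR reports request/response transactions per second; the mean
# round-trip time is its inverse: rtt_us = 1e6 / rate.
rr_to_us() { awk -v r="$1" 'BEGIN { printf "%.1f\n", 1000000 / r }'; }
rr_to_us 10869   # prints 92.0 -- i.e., ~10869 trans/s is a ~92 us RTT
```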
Bandwidth
Enhanced Networking offers up to a 7.3% gain in throughput. Note that with or without Enhanced Networking, both c3.xlarge and c3.2xlarge almost reach 1 Gbps (which we believe is the hard limit Amazon sets for those instance types).
Round-trip Latency
Many message-passing MPI and HPC applications are latency sensitive. Here Enhanced Networking support really shines, with a maximum speedup of 2.37x over the normal EC2 networking stack.
Conclusion 1
Amazon says that both the c3.large and c3.xlarge instances have "Moderate" network performance, but we found that c3.large peaks at around 415 Mbps, while c3.xlarge almost reaches 1 Gbps. We believe the extra bandwidth headroom is for EBS traffic, as c3.xlarge can be configured as "EBS-optimized" while c3.large cannot.
Conclusion 2
Notice that c3.2xlarge with Enhanced Networking enabled has a round-trip latency of 92 microseconds, which is much higher than that of the smaller instance types in the C3 family. We repeated the test in both the us-east-1 and us-west-2 regions and got identical results.
Currently AWS has a shortage of C3 instances -- all c3.4xlarge and c3.8xlarge launch requests we have issued so far resulted in "Insufficient capacity" errors. We are closely monitoring the situation, and we plan to benchmark the c3.4xlarge and c3.8xlarge instance types to see if we can reproduce the increased-latency issue.
Updated Jan 8, 2014: We have published Enhanced Networking in the AWS Cloud (Part 2) that includes the benchmark results for the remaining C3 types.
Friday, June 1, 2012
Giving Away a Cisco Live Full Conference Pass
Back in May, we attended a local Cisco event here in Toronto. Besides talking to Cisco engineers about their datacenter products and networking technologies, we also met with some technical UCS server people (more on Cisco UCS Blade Servers & Open Grid Scheduler/Grid Engine in a later blog entry).
We also received a Cisco Live Conference Pass, which allows the holder to attend everything at the conference (i.e., the full experience) in San Diego, CA on June 10-14, 2012. We are planning to give it to the first person who sends us the right answer to the following question:
When run with 20 MPI processes, what will the value of recvbuf[i][i] be for i=0..19 in MPI_COMM_WORLD rank 17 when this application calls MPI_Finalize?
#include <mpi.h>

int sendbuf[100];
int recvbuf[20][100];
MPI_Request reqs[40];

MPI_Request send_it(int dest, int len)
{
    int i;
    MPI_Request req;
    for (i = 0; i < len; ++i) {
        sendbuf[i] = dest;
    }
    MPI_Isend(sendbuf, len, MPI_INT, dest, 0, MPI_COMM_WORLD, &req);
    return req;
}

MPI_Request recv_it(int src, int len)
{
    MPI_Request req;
    MPI_Irecv(recvbuf[src], len, MPI_INT, src, 0, MPI_COMM_WORLD, &req);
    return req;
}

int main(int argc, char *argv[])
{
    int i, j, rank, size;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Bound the number of procs involved, just so we can be lazy and
       use a fixed-length sendbuf/recvbuf. */
    if (rank < 20) {
        for (i = j = 0; i < size; ++i) {
            reqs[j++] = send_it(i, 5);
            reqs[j++] = recv_it(i, 5);
        }
        MPI_Waitall(j, reqs, MPI_STATUSES_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
The code above and the question were written by "Mr Open MPI" himself, Jeff Squyres, who has worked with us since as early as the pre-Oracle Grid Engine days on PLPA, and who suggested that we migrate to the hwloc topology library. (Side note: when Open Grid Scheduler became the maintainer of the open source Grid Engine code base in 2011, Grid Engine Multi-Core Processor Binding with hwloc was one of the first major features we added to Open Grid Scheduler/Grid Engine to support discovery of newer system topologies.)
So send us your answer -- the first person to answer the question correctly will get the pass to attend the conference!