Tuesday, May 22, 2012

Grid Engine cgroups Integration

The PDC (Portable Data Collector) in Grid Engine's job execution daemon tracks job-process membership for resource usage accounting, for job control (i.e. making sure that jobs don't exceed their resource limits), and for signaling (e.g. stopping or killing jobs).

Since most operating systems don't have a mechanism to group processes into jobs, Grid Engine adds an additional group ID to each job. As unprivileged processes can't change their supplementary group membership, this is a safe way to tag processes as belonging to a job. On operating systems where the PDC module is enabled, the execution daemon periodically scans all the processes running on the system, and groups processes into jobs by looking for the additional GID tag.
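For illustration, here is a minimal sketch of such a scan on Linux - not the actual PDC code - assuming the job was tagged with the additional GID 20017 (a hypothetical value from gid_range; the same number appears in the Ubuntu example below):

# List every process whose supplementary groups include the tag GID 20017.
# /proc/<pid>/status has a "Groups:" line listing the supplementary GIDs.
awk '/^Groups:/ { for (i = 2; i <= NF; i++) if ($i == 20017) print FILENAME }' \
    /proc/[0-9]*/status 2>/dev/null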

So far so good, but...
Adding an extra GID has side effects. We have received reports of applications behaving strangely when they encounter a GID that cannot be resolved to a group name. For example, on Ubuntu we get:

$ qrsh
groups: cannot find name for group ID 20017

Another problem: it takes time for the PDC to warm up. For some short-running jobs, you will find:

removing unreferenced job 64623.394 without job report from ptf

The third problem is that if the PDC runs too often, it consumes too much CPU time. In SGE 6.2u5, a workaround added to reduce the PDC's CPU usage on Linux introduced a memory accounting bug. (Shameless plug: we, the Open Grid Scheduler developers, fixed that bug back in 2010, way ahead of any other Grid Engine implementation that is still active these days.) Imagine running ps -elf every second on your execution nodes - that is how intrusive the PDC is!

The final major issue is that the PDC is not accurate. Grid Engine itself does not trust the information from the PDC at job cleanup. The end result is run-away jobs consuming resources on the execution hosts. Cluster administrators then need to enable a special flag to tell Grid Engine to do proper job cleanup (by default, ENABLE_ADDGRP_KILL is off). Quoting the Grid Engine sge_conf manpage:

ENABLE_ADDGRP_KILL
          If this parameter is set then Sun Grid Engine uses  the
          supplementary group ids (see gid_range) to identify all
          processes which are to be  terminated  when  a  job  is
          deleted,  or  when  sge_shepherd(8) cleans up after job
          termination.
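
The flag lives under execd_params in the cluster configuration (edited with qconf -mconf); a host with the flag enabled would show something like the following (illustrative output - the rest of the execd_params line will vary per site):

$ qconf -sconf | grep execd_params
execd_params                 ENABLE_ADDGRP_KILL=TRUE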

Grid Engine cgroups Integration
In Grid Engine 2011.11 update 1, we are switching from the additional GID to cgroups as the process tagging mechanism.
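Conceptually, the cgroup tag replaces the GID scan: the execution daemon creates a per-job cgroup, and the kernel itself tracks membership as processes fork and exit. A minimal sketch of the idea - reusing the illustrative Job16.1 path from the transcript below, not necessarily the exact layout Grid Engine creates:

# Create a cgroup for job 16, task 1, and tag the job's top process.
CG=/cgroups/cpu_and_memory/gridengine/Job16.1
mkdir -p "$CG"
echo $$ > "$CG/tasks"    # children inherit membership automatically
cat "$CG/tasks"          # enumerates every PID in the job - no periodic scan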

(We, the Open Grid Scheduler / Grid Engine developers, wrote the PDC code for AIX and HP-UX, and the initial PDC code for Mac OS X, which was used as the base for the FreeBSD and NetBSD PDCs. We even wrote a PDC prototype for Linux that does not rely on the GID. Our code was contributed to Sun Microsystems, and is used in every implementation of Grid Engine - whether commercial, open source, or commercial open source like Open Grid Scheduler.)

Having developed almost half of the PDCs ourselves, we know the issues in the PDC well.

We are switching to cgroups now, rather than earlier, because:
  1. Most Linux distributions now ship kernels with cgroups support.
  2. We are seeing more and more cgroups improvements; lots of cgroups performance issues were fixed in recent Linux kernels.
With the cgroups integration in Grid Engine 2011.11 update 1, all of the PDC issues mentioned above are addressed. Furthermore, cgroups gives us bonus features:
  1. Accurate memory usage accounting: i.e. shared pages are accounted correctly.
  2. Resource limits at the job level, not at the individual process level (see the sketch after the transcript below).
  3. Out-of-the-box SSH integration.
  4. RSS (real memory) limit: we all have jobs that try to use every single byte of memory, yet capping their RSS does not hurt their performance. We may as well cap the RSS so that we can take back the spare memory for other jobs.
  5. With the cpuset cgroup controller, Grid Engine can set processor binding and memory locality reliably (also sketched below). Note that jobs that change their own processor binding are not handled by the original Grid Engine processor binding with hwloc (another shameless plug: we were again the first to switch to hwloc for processor binding) - such jobs are very rare, but if a job or an external process does change its own processor mask, it will affect other jobs running on the system.
  6. Finally, with the freezer controller, we have a safe mechanism for stopping and resuming jobs:
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    16 0.55500 sleep      sgeadmin     r     05/07/2012 05:44:12 all.q@master                       1
$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state
THAWED
$ qmod -sj 16
sgeadmin - suspended job 16
$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state
FROZEN
$ qmod -usj 16
sgeadmin - unsuspended job 16
$ cat /cgroups/cpu_and_memory/gridengine/Job16.1/freezer.state
THAWED
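
To make the bonus features concrete, here is a hedged sketch of the per-job knobs, assuming the memory and cpuset controllers are co-mounted in the same hierarchy as the freezer above (the exact files and layout depend on how the controllers are mounted):

CG=/cgroups/cpu_and_memory/gridengine/Job16.1
cat "$CG/memory.usage_in_bytes"           # job-level usage; a shared page is charged once, not per process
echo 2G  > "$CG/memory.limit_in_bytes"    # cap the whole job's real memory at 2 GB
echo 0-3 > "$CG/cpuset.cpus"              # bind every process in the job to cores 0-3
echo 0   > "$CG/cpuset.mems"              # and allocate memory from NUMA node 0 only

With swap available, a job that grows past memory.limit_in_bytes has its pages reclaimed to swap instead of being killed, which is why capping the RSS of memory-hoarding jobs costs them little performance.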

We will be announcing more new features in Grid Engine 2011.11 update 1 here on this blog. Stay tuned!