Friday, February 13, 2015

Capsicumizing strings

Like many others, I had an initial reaction of boggle when our very own Michal Zalewski showed that the strings utility isn't safe to run on untrusted files (CVE-2014-8485), because of its use of libbfd to parse object file internals.

However, the strings code gives us a perfect target to demonstrate the process of applying a Capsicum sandbox to an existing small application (similar to Ben Laurie's example of Capsicumizing bzip2). The net result is a sandboxed version of the program, where any remote code execution (RCE) attack is significantly restricted in what it can do.

In particular, the Capsicum-contained code would only be able to read the input files (content and metadata) and output to stdout / stderr. Given that the attacker could already read the input file (since they provided it) and could already output arbitrary strings to stdout (just by including them in the input file), exploiting a Capsicum-protected strings doesn't gain an attacker much*.

[*: There are still some avenues of attack against the Capsicum-protected code. The two most obvious are that an attacker can spew out arbitrary data, not just ASCII strings, and that Capsicum provides little protection against local DoS / resource exhaustion attacks.]

Existing Code Structure

We start by looking at the main loop that does the work of string-hunting (here and below, I'm compacting the code and removing some error-checking arms to save space and help clarity):

  if (optind >= argc) {
      datasection_only = FALSE;
      print_strings("{standard input}", stdin, 0, 0, 0, (char *) NULL);
  } else {
      for (; optind < argc; ++optind) {
        if (strcmp(argv[optind], "-") == 0) {
          datasection_only = FALSE;
        } else {
          exit_status |= strings_file(argv[optind]) == FALSE;
        }
      }
  }

The datasection_only global variable governs whether the binary file descriptor (BFD) library is used for parsing, and is controlled by the -a and -d options. Each file specified on the command line is passed into strings_file() in turn, so that needs a closer examination:

static bfd_boolean strings_file(char *file)
{
  struct stat st;
  if (stat(file, &st) < 0)
    return FALSE;

  if (!datasection_only || !strings_object_file(file)) {
      FILE *stream = fopen(file, FOPEN_RB);
      if (stream == NULL) {
        fprintf(stderr, "%s: ", program_name); perror(file);
        return FALSE;
      }

      print_strings(file, stream, (file_ptr) 0, 0, 0, (char *) 0);

      if (fclose(stream) == EOF) {
        fprintf(stderr, "%s: ", program_name); perror(file);
        return FALSE;
      }
  }
  return TRUE;
}

In both the main loop and this per-file function, print_strings() is used to do a raw search of a particular FILE* for strings, but strings_object_file() may get first crack at each file. Assuming the file looks like a supported object format, this function asks the BFD library to iterate over sections and call back into strings_a_section() for each section:

static bfd_boolean strings_object_file(const char *file)
{
  filename_and_size_t filename_and_size = {file, 0};
  bfd *abfd = bfd_openr(file, target);
  if (abfd == NULL)  /* Treat the file as a non-object file.  */
    return FALSE;

  if (!bfd_check_format(abfd, bfd_object)) {
    bfd_close(abfd);
    return FALSE;
  }

  got_a_section = FALSE;
  bfd_map_over_sections(abfd, strings_a_section, &filename_and_size);

  if (!bfd_close(abfd)) {
    bfd_nonfatal(file);
    return FALSE;
  }
  return got_a_section;
}

Finally, the callback function strings_a_section() limits its search to specific sections, gets the section information from the BFD library, and passes the location of the section to the workhorse print_strings() function:

static void strings_a_section(bfd *abfd, asection *sect, void *arg)
{
  filename_and_size_t *filename_and_sizep = (filename_and_size_t *)arg;
  bfd_size_type sectsize;
  void *mem;

  if ((sect->flags & DATA_FLAGS) != DATA_FLAGS)
    return;
  sectsize = bfd_get_section_size(sect);
  if (sectsize <= 0)
    return;

  mem = xmalloc(sectsize);
  if (bfd_get_section_contents (abfd, sect, mem, (file_ptr) 0, sectsize)) {
    got_a_section = TRUE;
    print_strings(filename_and_sizep->filename, NULL, sect->filepos,
                  0, sectsize, (char *) mem);
  }
  free(mem);
}

This also makes clear that print_strings() has two different modes – one where it reads characters from a location in a FILE*, and one where it processes already-read bytes in a buffer; the details of this are implemented in the get_char() utility function.
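The two modes can be pictured with a small sketch. This is not the real get_char() (whose prototype appears in the patches below); byte_source_t and next_byte() are hypothetical names used only for illustration, and the sketch assumes that a NULL stream selects buffer mode:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical cursor over one input: either a stream or a memory buffer. */
typedef struct {
  FILE *stream;              /* non-NULL: read bytes from this stream */
  const unsigned char *buf;  /* used only when stream is NULL */
  size_t remaining;          /* bytes left in buf */
} byte_source_t;

/* Returns the next byte (0..255), or EOF when the source is exhausted. */
static long next_byte(byte_source_t *src)
{
  assert(src != NULL);  /* precondition: caller supplies a cursor */

  if (src->stream != NULL) {
    /* Mode 1: pull the next character from the FILE*. */
    int c = getc(src->stream);
    return (c == EOF) ? (long) EOF : (long) c;
  }

  /* Mode 2: walk the bytes that were already read into memory. */
  if (src->remaining == 0)
    return EOF;
  src->remaining--;
  return (long) *src->buf++;
}
```

A caller scanning an object-file section would fill in buf and remaining from the section contents, while a whole-file scan would set only stream.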

Plan of Attack

Now that we understand the code structure, we can plan how to Capsicumize the program. The default plan of attack for these kinds of applications works here:

  • Open all of the input files before entering the main loop.
  • Restrict the input files to just have CAP_READ and CAP_SEEK rights.
  • Restrict the outputs (stdout and stderr here) to just have the CAP_WRITE right.
  • Enter Capsicum capability mode before starting the main loop.

However, there are additional tasks that are specific to this particular application; in particular, functions that take a char *filename argument need to be converted to alternatively take a pre-opened int fd or FILE *stream argument. We also need to take care that the detailed output of strings is preserved – for example, by deferring error messages from failed file-open operations until the appropriate moment.

(It's worth mentioning that there are other design patterns for Capsicumizing more complicated applications. For example, a compartmentalized application might pass (capability) file descriptors around over local sockets.)

FILEs Not File Names

Converting strings to use FILE* streams rather than filenames internally turns out to be fairly straightforward, thanks to the richness of the BFD API. The main cause for concern is the call to bfd_openr(), which asks the BFD library to open a particular filename; however, the BFD API also includes the bfd_openstreamr() and bfd_fopen() entrypoints that allow the user to supply an existing FILE * stream or file descriptor instead of a filename.

One small wrinkle is that any call to bfd_close() on the resulting bfd object also closes the underlying file. This is awkward if the BFD library decides that this isn't a valid object file after all: we can't fall back to byte-by-byte examination of the file if it has already been closed. So we dup() the underlying file descriptor before letting the BFD library have at it.

The remainder of the patch merely moves the file open/close code out of the strings_file() function, and into its caller, ready for the next stage.

From d662f77cb63c4b8247ba8bc2288c42576dc7f7a0 Mon Sep 17 00:00:00 2001
From: David Drysdale <drysdale@google.com>
Date: Tue, 25 Nov 2014 12:52:24 +0000
Subject: [PATCH 1/4] Pass around opened FILE* not a filename

strings_file() and strings_object_file() are converted to take a FILE*
to operate on, rather than just a filename that they open themselves.
---
 binutils/strings.c | 62 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 26 deletions(-)

diff --git a/binutils/strings.c b/binutils/strings.c
index 2cf046fded1a..09eefc250e69 100644
--- a/binutils/strings.c
+++ b/binutils/strings.c
@@ -139,8 +139,8 @@ typedef struct
 } filename_and_size_t;

 static void strings_a_section (bfd *, asection *, void *);
-static bfd_boolean strings_object_file (const char *);
-static bfd_boolean strings_file (char *);
+static bfd_boolean strings_object_file (const char *, FILE *);
+static bfd_boolean strings_file (char *, FILE *);
 static void print_strings (const char *, FILE *, file_ptr, int, int, char *);
 static void usage (FILE *, int);
 static long get_char (FILE *, file_ptr *, int *, char **);
@@ -306,8 +306,26 @@ main (int argc, char **argv)
             datasection_only = FALSE;
           else
             {
+              FILE *stream;
+
               files_given = TRUE;
-              exit_status |= strings_file (argv[optind]) == FALSE;
+              stream = fopen (argv[optind], FOPEN_RB);
+              if (stream == NULL)
+                {
+                  fprintf (stderr, "%s: ", program_name);
+                  perror (argv[optind]);
+                  exit_status = TRUE;
+                }
+              else
+                {
+                  exit_status |= strings_file (argv[optind], stream) == FALSE;
+                  if (fclose (stream) == EOF)
+                    {
+                      fprintf (stderr, "%s: ", program_name);
+                      perror (argv[optind]);
+                      return FALSE;
+                    }
+                }
             }
         }
     }
@@ -383,16 +401,24 @@ strings_a_section (bfd *abfd, asection *sect, void *arg)
    FALSE if not (such as if FILE is not an object file).  */

 static bfd_boolean
-strings_object_file (const char *file)
+strings_object_file (const char *file, FILE *stream)
 {
   filename_and_size_t filename_and_size;
+  int fd;
   bfd *abfd;

-  abfd = bfd_openr (file, target);
+  fd = dup(fileno(stream));
+  if (fd < 0)
+    return FALSE;
+
+  abfd = bfd_fopen (file, target, FOPEN_RB, fd);

   if (abfd == NULL)
-    /* Treat the file as a non-object file.  */
-    return FALSE;
+    {
+      /* Treat the file as a non-object file.  */
+      close(fd);
+      return FALSE;
+    }

   /* This call is mainly for its side effect of reading in the sections.
      We follow the traditional behavior of `strings' in that we don't
@@ -420,13 +446,13 @@ strings_object_file (const char *file)
 /* Print the strings in FILE.  Return TRUE if ok, FALSE if an error occurs.  */

 static bfd_boolean
-strings_file (char *file)
+strings_file (char *file, FILE *stream)
 {
   struct stat st;

   /* get_file_size does not support non-S_ISREG files.  */

-  if (stat (file, &st) < 0)
+  if (fstat (fileno(stream), &st) < 0)
     {
       if (errno == ENOENT)
         non_fatal (_("'%s': No such file"), file);
@@ -440,26 +466,10 @@ strings_file (char *file)
      try to open it as an object file and only look at
      initialized data sections.  If that fails, fall back to the
      whole file.  */
-  if (!datasection_only || !strings_object_file (file))
+  if (!datasection_only || !strings_object_file (file, stream))
     {
-      FILE *stream;
-
-      stream = fopen (file, FOPEN_RB);
-      if (stream == NULL)
-        {
-          fprintf (stderr, "%s: ", program_name);
-          perror (file);
-          return FALSE;
-        }
-
       print_strings (file, stream, (file_ptr) 0, 0, 0, (char *) 0);

-      if (fclose (stream) == EOF)
-        {
-          fprintf (stderr, "%s: ", program_name);
-          perror (file);
-          return FALSE;
-        }
     }

   return TRUE;
--
1.9.1

Opening In Advance

The next change we need is to ensure that all of the files we need are opened before the main business of scanning (untrusted, user-provided) files starts. This is a fairly straightforward restructure, but the need to preserve the existing behaviour and order of output leads to the creation of a new stream_info_t structure. This structure tracks:

  • the filename
  • the FILE* stream (if successfully opened)
  • whether this input is a direct stream (i.e. stdin) rather than a file-based stream
  • whether this input should trigger a switch to non-BFD mode
  • any errno value from a failed attempt to open the file

With an array of these structures describing the input files, the main loop has all the information needed to perform the work of strings, while preserving the output behaviour of the existing code.

From 5ab9c0f0797790d1184c54bf6221f6bc95c75088 Mon Sep 17 00:00:00 2001
From: David Drysdale <drysdale@google.com>
Date: Tue, 25 Nov 2014 17:08:03 +0000
Subject: [PATCH 2/4] Open all streams in advance

Preserve the order of output by saving errors on file open,
and emit the error message later.
---
 binutils/strings.c | 83 ++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 59 insertions(+), 24 deletions(-)

diff --git a/binutils/strings.c b/binutils/strings.c
index 09eefc250e69..e5dce2f5a8bc 100644
--- a/binutils/strings.c
+++ b/binutils/strings.c
@@ -138,6 +138,15 @@ typedef struct
   bfd_size_type filesize;
 } filename_and_size_t;

+typedef struct
+{
+  char * filename;
+  FILE * stream;
+  bfd_boolean datasection_only;
+  bfd_boolean direct;
+  int error;
+} stream_info_t;
+
 static void strings_a_section (bfd *, asection *, void *);
 static bfd_boolean strings_object_file (const char *, FILE *);
 static bfd_boolean strings_file (char *, FILE *);
@@ -155,6 +164,9 @@ main (int argc, char **argv)
   bfd_boolean files_given = FALSE;
   char *s;
   int numeric_opt = 0;
+  int ii;
+  int num_streams;
+  stream_info_t *streaminfo;

 #if defined (HAVE_SETLOCALE)
   setlocale (LC_ALL, "");
@@ -291,45 +303,68 @@ main (int argc, char **argv)
   bfd_init ();
   set_default_bfd_target ();

+  /* Pre-open all of the streams involved, and save any errors */
+  num_streams = (optind >= argc) ? 1 : (argc - optind);
+  streaminfo = xmalloc (sizeof (stream_info_t) * num_streams);
   if (optind >= argc)
     {
-      datasection_only = FALSE;
+      streaminfo[0].filename = "{standard input}";
+      streaminfo[0].stream = stdin;
+      streaminfo[0].datasection_only = FALSE;
+      streaminfo[0].direct = TRUE;
+      streaminfo[0].error = 0;
       SET_BINARY (fileno (stdin));
-      print_strings ("{standard input}", stdin, 0, 0, 0, (char *) NULL);
       files_given = TRUE;
     }
   else
     {
-      for (; optind < argc; ++optind)
+      for (ii = 0; ii < num_streams; ++ii)
         {
-          if (strcmp (argv[optind], "-") == 0)
-            datasection_only = FALSE;
+          streaminfo[ii].filename = argv[optind + ii];
+          streaminfo[ii].stream = NULL;
+          streaminfo[ii].datasection_only = TRUE;
+          streaminfo[ii].direct = FALSE;
+          streaminfo[ii].error = 0;
+          if (strcmp (streaminfo[ii].filename, "-") == 0)
+            streaminfo[ii].datasection_only = FALSE;
           else
             {
-              FILE *stream;
-
               files_given = TRUE;
-              stream = fopen (argv[optind], FOPEN_RB);
-              if (stream == NULL)
-                {
-                  fprintf (stderr, "%s: ", program_name);
-                  perror (argv[optind]);
-                  exit_status = TRUE;
-                }
-              else
-                {
-                  exit_status |= strings_file (argv[optind], stream) == FALSE;
-                  if (fclose (stream) == EOF)
-                    {
-                      fprintf (stderr, "%s: ", program_name);
-                      perror (argv[optind]);
-                      return FALSE;
-                    }
-                }
+              streaminfo[ii].stream = fopen (streaminfo[ii].filename, FOPEN_RB);
+              if (streaminfo[ii].stream == NULL)
+                streaminfo[ii].error = errno;
             }
         }
     }

+  for (ii = 0; ii < num_streams; ++ii)
+    {
+      if (streaminfo[ii].datasection_only == FALSE)
+        datasection_only = FALSE;
+      if (streaminfo[ii].stream)
+        {
+          if (streaminfo[ii].direct)
+            print_strings (streaminfo[ii].filename, streaminfo[ii].stream,
+                           0, 0, 0, (char *) NULL);
+          else
+            exit_status |= strings_file (streaminfo[ii].filename,
+                                         streaminfo[ii].stream) == FALSE;
+          if (fclose (streaminfo[ii].stream) == EOF)
+          {
+            fprintf (stderr, "%s: ", program_name);
+            perror (argv[optind]);
+            return FALSE;
+          }
+        }
+      else if (streaminfo[ii].error)
+        {
+          fprintf (stderr, "%s: %s: %s\n", program_name,
+                   streaminfo[ii].filename, strerror(streaminfo[ii].error));
+          exit_status = TRUE;
+        }
+    }
+  free(streaminfo);
+
   if (!files_given)
     usage (stderr, 1);

--
1.9.1

Restricting Rights

With all the preparatory work done, the first pass at adding Capsicum support is straightforward.

From 036f463b51b03f17cb0ea15025d69840a7d13ab0 Mon Sep 17 00:00:00 2001
From: David Drysdale <drysdale@google.com>
Date: Tue, 25 Nov 2014 17:42:48 +0000
Subject: [PATCH 3/4] Add initial rights restriction on file descriptors

---
 binutils/strings.c   | 15 +++++++++++++++
 1 files changed, 15 insertions(+)

diff --git a/binutils/strings.c b/binutils/strings.c
index e5dce2f5a8bc..24bec7c68895 100644
--- a/binutils/strings.c
+++ b/binutils/strings.c
@@ -72,6 +72,8 @@
 #include "safe-ctype.h"
 #include "bucomm.h"

+#include <sys/capsicum.h>
+
 #define STRING_ISGRAPHIC(c) \
       (   (c) >= 0 \
        && (c) <= 255 \
@@ -337,6 +339,19 @@ main (int argc, char **argv)
         }
     }

+  {
+    cap_rights_t rights;
+    cap_rights_limit(fileno(stdout), cap_rights_init(&rights, CAP_WRITE));
+    cap_rights_limit(fileno(stderr), cap_rights_init(&rights, CAP_WRITE));
+    cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
+    for (ii = 0; ii < num_streams; ++ii)
+      {
+        if (streaminfo[ii].stream)
+          cap_rights_limit(fileno(streaminfo[ii].stream), &rights);
+      }
+  }
+  cap_enter();
+
   for (ii = 0; ii < num_streams; ++ii)
     {
       if (streaminfo[ii].datasection_only == FALSE)
--
1.9.1

Hunting Other Operations

However, running the resulting strings binary over a few test files quickly reveals that not all is well. Files that are processed in -a mode are scanned fine, but files that should be parsed as object files are not generating output.

To fine-tune the Capsicumization process, strace is your friend. Repeating the failed run with strace quickly shows some ENOTCAPABLE errors (which show up as ERRNO_135 because strace isn't yet Capsicum-aware):

    fcntl(7, F_GETFL)                       = -1 ERRNO_135 (Unknown error 135)
    fcntl(3, F_GETFL)                       = -1 ERRNO_135 (Unknown error 135)
    fstat(1, 0x7fffab1bac40)                = -1 ERRNO_135 (Unknown error 135)

The last of these just means that stdout needs CAP_FSTAT as an additional right, and the earlier errors tell us that the input files need CAP_FCNTL to allow the BFD library to check for file status flags. We don't want to open up all possible fcntl(2) operations, so we further restrict the CAP_FCNTL right so that only F_GETFL is allowed. Our penultimate code patch tweaks these rights, and the resulting strings binary works fine.
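For reference, F_GETFL is the fcntl(2) command that returns a descriptor's status flags and access mode. The following is a minimal sketch of the kind of query that needs it; fd_opened_for_reading() is a hypothetical helper written for this post, not the BFD library's actual code:

```c
#include <assert.h>
#include <fcntl.h>

/* Returns 1 if FD was opened for reading, 0 if it was opened write-only,
   and -1 if fcntl() failed (under Capsicum, for example, when the
   descriptor lacks the CAP_FCNTL right). */
static int fd_opened_for_reading(int fd)
{
  int flags;

  assert(fd >= 0);  /* precondition: a plausible descriptor number */

  flags = fcntl(fd, F_GETFL);
  if (flags < 0)
    return -1;

  /* The access mode is a small enumeration, not a bit mask. */
  flags &= O_ACCMODE;
  return flags == O_RDONLY || flags == O_RDWR;
}
```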

From 7d8d0de149a7e6811d3f499db5fcacd361415cbd Mon Sep 17 00:00:00 2001
From: David Drysdale <drysdale@google.com>
Date: Tue, 25 Nov 2014 17:44:40 +0000
Subject: [PATCH 4/4] Tweak rights needed

On examining strace output, there are some additional rights
needed:
 - CAP_FSTAT is needed for stdout
 - CAP_FCNTL is needed for the input file descriptors

For the latter, only allow the F_GETFL operation.
---
 binutils/strings.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/binutils/strings.c b/binutils/strings.c
index 24bec7c68895..cd02bf5a7b41 100644
--- a/binutils/strings.c
+++ b/binutils/strings.c
@@ -341,13 +341,15 @@ main (int argc, char **argv)

   {
     cap_rights_t rights;
-    cap_rights_limit(fileno(stdout), cap_rights_init(&rights, CAP_WRITE));
+    cap_rights_limit(fileno(stdout), cap_rights_init(&rights, CAP_WRITE, CAP_FSTAT));
     cap_rights_limit(fileno(stderr), cap_rights_init(&rights, CAP_WRITE));
-    cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
+    cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT, CAP_FCNTL);
     for (ii = 0; ii < num_streams; ++ii)
       {
-        if (streaminfo[ii].stream)
+        if (streaminfo[ii].stream) {
           cap_rights_limit(fileno(streaminfo[ii].stream), &rights);
+          cap_fcntls_limit(fileno(streaminfo[ii].stream), CAP_FCNTL_GETFL);
+        }
       }
   }
   cap_enter();
--
1.9.1

Closing Holes

At this point, we think we've restricted all of the file descriptors for the application, but it's best to make sure. To make this easy, we temporarily insert a stop in the program (kill(getpid(), SIGSTOP);), just after the call to cap_enter(). We can now explore information about the stopped program, to check everything is covered.

The first thing to check is that capability mode has been engaged. In the Linux implementation of Capsicum, this is (mostly) implemented as a seccomp-BPF filter program, so running cat /proc/6620/status | grep Seccomp shows the value 2 (SECCOMP_MODE_FILTER), as expected.
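The same check can be made from code rather than from the shell. This is a small sketch that scans a /proc/<pid>/status-style stream for the Seccomp field; read_seccomp_mode() is a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Scans STATUS line by line for "Seccomp:" and returns its numeric value
   (2 is SECCOMP_MODE_FILTER), or -1 if no such line is present. */
static int read_seccomp_mode(FILE *status)
{
  char line[256];
  int mode;

  assert(status != NULL);  /* precondition: an open stream */

  while (fgets(line, sizeof(line), status) != NULL) {
    /* The literal "Seccomp:" must match exactly, so other fields with a
       similar prefix are skipped. */
    if (sscanf(line, "Seccomp: %d", &mode) == 1)
      return mode;
  }
  return -1;
}
```

On Linux, the stream would typically come from fopen("/proc/self/status", "r") or from the status file of the stopped process.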

More interesting is to examine the open file descriptors of the process, via Linux's /proc filesystem:

    % cd /proc/6620/fdinfo
    % ls
    0  1  2  3  4  5
    % more *
    ::::::::::::::
    0
    ::::::::::::::
    pos:        0
    flags:        0100002
    mnt_id:        18
    ::::::::::::::
    1
    ::::::::::::::
    pos:        0
    flags:        0100002
    mnt_id:        18
    rights:        0x200000000080002        0x400000000000000
     fcntls: 0x000000
    ::::::::::::::
    2
    ::::::::::::::
    pos:        0
    flags:        0100002
    mnt_id:        18
    rights:        0x200000000000002        0x400000000000000
     fcntls: 0x000000
    ::::::::::::::
    3
    ::::::::::::::
    pos:        0
    flags:        0100000
    mnt_id:        20
    rights:        0x20000000008800d        0x400000000000000
     fcntls: 0x000008
    ::::::::::::::
    4
    ::::::::::::::
    pos:        0
    flags:        0100000
    mnt_id:        20
    rights:        0x20000000008800d        0x400000000000000
     fcntls: 0x000008
    ::::::::::::::
    5
    ::::::::::::::
    pos:        0
    flags:        0100000
    mnt_id:        20
    rights:        0x20000000008800d        0x400000000000000
     fcntls: 0x000008

This shows that rights have been limited for all of the file descriptors, except for file descriptor zero – stdin. So our final patch just closes stdin when it's not needed.

From cc972e504a187d21a37857a720f1f5a91240e7d9 Mon Sep 17 00:00:00 2001
From: David Drysdale <drysdale@google.com>
Date: Fri, 5 Dec 2014 12:09:51 +0000
Subject: [PATCH] Close stdin if not needed

---
 binutils/strings.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/binutils/strings.c b/binutils/strings.c
index cd02bf5a7b41..5a81e2d813ab 100644
--- a/binutils/strings.c
+++ b/binutils/strings.c
@@ -337,6 +337,7 @@ main (int argc, char **argv)
                 streaminfo[ii].error = errno;
             }
         }
+      fclose(stdin);
     }

   {
--
1.9.1

And that's the end of the changes needed to provide Capsicum protection for strings. Again, we should emphasize that this doesn't prevent the RCE – Michal's original test file would still crash strings. However, if the exploit were to actually try to do anything "useful" (scan the local system, contact a remote C&C server, join a DDoS botnet, …), those operations would fail.

Onward

Having made the fairly straightforward code changes for Capsicumizing strings, the next step is considerably harder: updating the autotools configuration to detect whether Capsicum support is available at compile-time. But that's a story for another day…

An Overview of Capsicum

[2015-04-27: edited to add section on discretionary access controls]

[2015-04-07: edited to add section on POSIX capabilities]

Although the original Capsicum papers are very readable, I thought it might be useful to have an accessible overview of Capsicum, together with some musing about how it compares with other Linux security technologies.

Capability-Based Security

Capsicum is a security framework that takes concepts from capability-based security, and applies those concepts in a pragmatic way to UNIX applications. This gives many of the benefits of a well-analysed security model, but allows for gradual migration and full interoperation with non-Capsicum applications – in other words, it's much easier than re-writing for a full capability-based operating system (such as seL4).

In capability-based security, any access to an object needs an unforgeable token, the capability, which identifies the object in question and defines what rights the capability holder has for that object – that is, what operations they're allowed to perform with the object.

This goes hand-in-hand with an absence of object naming schemes – there is no way to refer to an object by some parseable name ("/etc/passwd", 10.1.2.3:53, uid:root, pid:3145, …) in order to get a reference to the object.

Taken together, these attributes give capability-based systems a security model that is simple to analyze: the capabilities that a program holds completely describe what the program is capable of doing, and who it is capable of communicating with. This in turn assists the process of privilege separation for improved security – interactions between different components of a program can be analysed by examining the capabilities that are exchanged (also, those interactions are then less susceptible to confused deputy problems).

In real-world terms, applications and services that interact with untrusted and remote users are potentially vulnerable to a particularly scary style of attack – the remote code execution (RCE) attack, where the attacker subverts the application and runs their own code with full access to the entire ambient authority of the subverted application. As a result, the RCE exploit code can access any object that the application is able to access.

As with other sandboxing techniques, Capsicum's capability-based approach often de-fangs these RCE attacks: although the attacker may still be able to run arbitrary code, what that code is capable of doing is severely limited. Scanning the local system, contacting a remote C&C server, joining a DDoS botnet, snooping on other programs – all of these operations require access to some global namespace or other (e.g. names on the filesystem, IP addresses and ports, process IDs).

Object capabilities can be passed around or inherited from a parent, but are only created by taking an existing capability and subsetting its rights. This raises a difficult bootstrap question: if you can't name objects, where does an ordinary application get its capabilities from?

In a pure object-capability system, this bootstrap problem is typically addressed by having capabilities cascade down from an initial uber-capability that covers the entire system at start-of-day – but this is a radically different model from how current UNIX-based systems work. However, because Capsicum is a pragmatic hybrid of capability-based security with normal POSIX semantics, the bootstrap problem is neatly sidestepped: the application can create all of its capabilities before it shifts into capability mode.

(Of course, Capsicum still allows the normal ways that a program would acquire capabilities in a pure capability system – the capabilities could be inherited from a parent process, or could be explicitly passed to the program across a local socket – both of which are useful if the program is untrusted from its inception.)

Capsicum Capabilities

Several explanations of capabilities use the analogy of a UNIX file descriptor (FD) to describe the key concepts of capabilities in familiar terms:

  • When the kernel opens a file on behalf of a userspace application, it just hands an integer token (the file descriptor) to the application; the kernel object that tracks the status of the open file is entirely internal to the kernel. The application can't get a new token without the help of the kernel, and any operations on the object (read(), write(), close(), …) have to specify the token.
    • The UNIX model of "all the world's a file" also means that various sorts of kernel-managed objects use the same file descriptor mechanism for access – open files, sockets (both network and local), message queues, timers, file system notifications, etc.
  • File descriptors can be passed between processes, over local sockets with sendmsg(2) / recvmsg(2). (Of course, the local sockets are also described by file descriptors – so they too are capabilities.)
  • Depending on the mode in which the file was opened, the kernel only allows certain operations to be performed on the file (e.g. no write() for a file opened O_RDONLY). However, the existing UNIX restrictions are very simplistic, just read-write-execute, and positively misleading in their implementation – for example, fchmod(2) will happily change the permissions of a read-only file descriptor.
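The last point is easy to see on a stock UNIX system. The sketch below reopens a scratch file read-only, then tries both a write() and an fchmod() on that descriptor; the struct and function names are hypothetical, and the use of /tmp for the scratch file is an assumption of this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Outcome of poking at a file descriptor opened with O_RDONLY. */
struct rdonly_probe {
  int write_errno;  /* errno from write(), or 0 if it unexpectedly succeeded */
  int fchmod_rc;    /* return value of fchmod() on the same descriptor */
  mode_t mode;      /* permission bits of the file after fchmod() */
};

/* Creates a scratch file, reopens it read-only, then attempts a write and a
   permission change through that read-only descriptor.
   Returns 0 if the probe ran, -1 if the scratch file could not be set up. */
static int probe_readonly_fd(struct rdonly_probe *out)
{
  char path[] = "/tmp/rdonly-probe-XXXXXX";
  struct stat st;
  int rc = -1;
  int fd;

  assert(out != NULL);  /* precondition: caller supplies the result slot */

  fd = mkstemp(path);
  if (fd < 0)
    return -1;
  close(fd);

  fd = open(path, O_RDONLY);
  if (fd >= 0) {
    out->write_errno = (write(fd, "x", 1) < 0) ? errno : 0;
    out->fchmod_rc = fchmod(fd, 0640);
    if (fstat(fd, &st) == 0) {
      out->mode = st.st_mode & 0777;
      rc = 0;
    }
    close(fd);
  }
  unlink(path);
  return rc;
}
```

The write is refused with EBADF because of the open mode, while the permission change goes through because the caller owns the file: the descriptor's "read-only" status says nothing about metadata operations.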

Capsicum makes this capability analogy into a reality by associating a much more fine-grained set of rights with file descriptors, and extending the kernel so that it accurately polices those rights. A normal file descriptor implicitly has all possible rights, thus preserving all existing behaviour; a new system call (cap_rights_limit()) is then used to restrict the rights associated with a file descriptor to be a subset of those it already has (i.e. rights can never be extended, only restricted). A file descriptor that has had its rights so restricted is referred to as a Capsicum capability.

If an application attempts to perform an operation on a capability file descriptor that is not allowed by its associated rights, the kernel will fail the operation with ENOTCAPABLE (a new Capsicum-specific errno value; having a distinct value makes it easier to track down missing rights when applying Capsicum to an existing application).

The current Capsicum implementations define around 60 distinct rights, as previous experience has shown that permissions models become more granular as time goes on, so being more granular at the beginning helps maintainability. Commonly-used rights include:

  • CAP_READ: allow operations that read the content of the object
  • CAP_WRITE: allow operations that write the content of the object
  • CAP_SEEK: allow operations that operate at an arbitrary offset within the object, or which explicitly alter the file offset (i.e. lseek(2))
  • CAP_LOOKUP: allow operations that search within a directory
  • CAP_CREATE: allow operations that create files within a directory
  • CAP_FCHMOD: allow operations that change the file permissions
  • CAP_FSTAT: allow operations that retrieve file metadata/status
  • CAP_ACCEPT: allow accept socket operations
  • CAP_BIND: allow bind socket operations
  • CAP_CONNECT: allow connect socket operations
  • CAP_FCNTL: allow file control (fcntl(2)) operations (but see below)
  • CAP_IOCTL: allow I/O control (ioctl(2)) operations (but see below)

Because the fcntl(2) and (particularly) ioctl(2) system calls act as multiplexors whose precise arguments determine the operation to be performed, Capsicum also includes the ability to restrict these syscalls to particular operations (with cap_fcntls_limit() and cap_ioctls_limit()).
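Put together, restricting one input descriptor might look like the sketch below. It uses the same calls as the strings patches (cap_rights_init(), cap_rights_limit(), cap_fcntls_limit()), but the wrapper name limit_to_readonly_input() is hypothetical, and so is the build guard: it assumes that <sys/capsicum.h> is available exactly when __FreeBSD__ is defined, which will not hold for older FreeBSD releases or for the Linux port. Elsewhere the function is a stub that fails with ENOSYS.

```c
#include <assert.h>
#include <errno.h>

#if defined(__FreeBSD__)
#include <sys/capsicum.h>
#define SKETCH_HAVE_CAPSICUM 1  /* assumption: header present on this system */
#else
#define SKETCH_HAVE_CAPSICUM 0
#endif

/* Restricts FD to reading, seeking, fstat() and fcntl(), and then narrows
   fcntl() down to the F_GETFL command only.
   Returns 0 on success, -1 with errno set on failure (ENOSYS when this
   sketch was built without Capsicum support). */
static int limit_to_readonly_input(int fd)
{
  assert(fd >= 0);  /* precondition: an open descriptor */

#if SKETCH_HAVE_CAPSICUM
  {
    cap_rights_t rights;

    cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT, CAP_FCNTL);
    if (cap_rights_limit(fd, &rights) < 0)
      return -1;
    return cap_fcntls_limit(fd, CAP_FCNTL_GETFL);
  }
#else
  errno = ENOSYS;
  return -1;
#endif
}
```

Because rights can only ever be removed, calling this twice on the same descriptor is harmless, but there is no way to undo it.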

Finally, one detail to note about Capsicum capabilities is that the restricted rights apply to file descriptors, not open file objects. This means that different file descriptors can refer to the same underlying kernel object (e.g. after dup(2)) but have different rights associated with them. This is particularly useful when passing capability file descriptors between processes – a file descriptor passed across to a service can be tightly restricted before doing so, while the sender can continue to hold a more-capable FD for the same object.
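Passing a descriptor across a local socket uses SCM_RIGHTS ancillary data with sendmsg(2) / recvmsg(2). The sketch below shows only that transport step; send_fd() and recv_fd() are hypothetical helper names, and a Capsicum-aware sender would first restrict a dup() of the descriptor before sending it, which is omitted here so that the code builds on systems without Capsicum.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer big enough for one descriptor, suitably aligned. */
union fd_ctrl {
  struct cmsghdr align;
  char buf[CMSG_SPACE(sizeof(int))];
};

/* Sends FD over the local socket SOCK. Returns 0 on success, -1 on error. */
static int send_fd(int sock, int fd)
{
  char byte = 'F';  /* at least one byte of ordinary data must accompany it */
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union fd_ctrl ctrl;
  struct msghdr msg;
  struct cmsghdr *cmsg;

  assert(sock >= 0 && fd >= 0);  /* precondition: both descriptors are open */

  memset(&msg, 0, sizeof(msg));
  memset(&ctrl, 0, sizeof(ctrl));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof(ctrl.buf);

  cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

  return (sendmsg(sock, &msg, 0) == 1) ? 0 : -1;
}

/* Receives one descriptor from SOCK. Returns the new FD, or -1 on error. */
static int recv_fd(int sock)
{
  char byte;
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union fd_ctrl ctrl;
  struct msghdr msg;
  struct cmsghdr *cmsg;
  int fd;

  memset(&msg, 0, sizeof(msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof(ctrl.buf);

  if (recvmsg(sock, &msg, 0) != 1)
    return -1;
  cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
      || cmsg->cmsg_type != SCM_RIGHTS
      || cmsg->cmsg_len < CMSG_LEN(sizeof(int)))
    return -1;
  memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
  return fd;
}
```

The receiver gets a fresh descriptor number that refers to the same underlying kernel object as the sender's.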

Capability Mode

Of course, restricting the operations that can be performed with existing file descriptors is of little use if an attacker can just mint new file descriptors. Capsicum therefore includes capability mode, which (permanently) restricts the system calls available to the current process (and any future descendants). In particular, system calls that allow the creation of new file descriptors from scratch by referring to an object via a global namespace are banned, failing with a new ECAPMODE errno value.

This doesn't completely remove all FD-creating system calls; there are a couple of system calls that create file descriptors without using a global name, by referencing an existing file descriptor:

  • The openat(int dfd, char *path, int flags) system call allows a file to be opened relative to an existing (directory) file descriptor. This syscall is allowed in capability mode (as long as the directory FD has the CAP_LOOKUP right), but is policed so that the path cannot be used to escape the directory – no leading / or .. components (nor symlinks likewise).
  • The accept(int sockfd, …) system call allows a connected socket to be extracted from a listening socket; this syscall is allowed in capability mode (as long as the listening socket has the CAP_ACCEPT right).
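
To illustrate the path rule for openat() described above, here is a small user-space check that applies the same lexical test: no leading / and no .. components. It is only a sketch of the rule as stated here, not the kernel's implementation. In particular it cannot see symlinks, which the kernel also has to police. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if path is non-empty, relative, and has no ".." component.
 * Purely lexical: symlinks are NOT considered. */
bool path_is_contained(const char *path) {
    if (path == NULL || path[0] == '\0' || path[0] == '/')
        return false;

    const char *p = path;
    while (*p != '\0') {
        size_t len = strcspn(p, "/");       /* length of this component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;                    /* would climb out of the directory */
        p += len;
        while (*p == '/')                    /* skip one or more separators */
            p++;
    }
    return true;
}
```

So "logs/today.txt" passes, while "/etc/passwd" and "logs/../../etc/passwd" are rejected. A name such as "..backup" is fine, because only a component that is exactly ".." counts.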

In both of these cases, the newly-minted file descriptor inherits the rights of the parent file descriptor (directory file descriptor or listening socket respectively). This simplifies the implementation, both of the application using Capsicum and Capsicum itself, but has the downside that the parent FD must have all the rights that are needed for any future derived FDs.

(Tighter restrictions on rights can be achieved by having a separate process that opens/accepts things, then restricts the rights and passes the resulting capability to the main worker process – but that's obviously at a cost of much more complexity, particularly for an application that is not already compartmentalized.)
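
The descriptor-passing step in that design uses the standard SCM_RIGHTS mechanism on a UNIX-domain socket, sketched below. The helper names send_fd, recv_fd and demo_fd_passing are made up for this example. The demo runs both ends in one process for brevity, whereas a real opener and worker would be separate processes, and the opener would restrict the descriptor's rights just before calling send_fd.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send fd over a UNIX-domain socket as SCM_RIGHTS ancillary data. */
int send_fd(int sock, int fd) {
    char byte = 'F';                          /* must carry at least one data byte */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(); returns the new fd or -1. */
int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } ctrl;
    struct msghdr msg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass the read end of a pipe across a socketpair, then check that data
 * written to the pipe is readable via the received descriptor. 0 = OK. */
int demo_fd_passing(void) {
    int sv[2], p[2], result = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (pipe(p) != 0) {
        close(sv[0]); close(sv[1]);
        return -1;
    }
    if (send_fd(sv[0], p[0]) == 0) {
        int got = recv_fd(sv[1]);
        if (got >= 0) {
            if (write(p[1], "x", 1) == 1 && read(got, &c, 1) == 1 && c == 'x')
                result = 0;
            close(got);
        }
    }
    close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
    return result;
}
```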

Putting It Together

The core primitives of Capsicum apply particularly well to traditional UNIX utilities that have a central core that processes input from a set of work sources, generating a combined output. For these cases, the application needs straightforward modifications so that:

  • All of the input work sources and output destinations are opened before entering the main loop.
  • Each input file descriptor is Capsicum-restricted to only allow read operations (CAP_READ and friends).
  • The output destinations are Capsicum-restricted to allow write operations.
  • Any unnecessary file descriptors are closed.
  • The application enters capability mode before starting its main loop (and thus before it reads any untrusted, potentially attacker-supplied, inputs).
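
The steps above can be sketched roughly as follows, again using the FreeBSD flavour of the API. The function names are hypothetical, and the exact rights chosen (CAP_READ, CAP_SEEK and CAP_FSTAT for inputs; CAP_WRITE and CAP_FSTAT for outputs) are an assumption about what a typical filter-style utility needs. On platforms without Capsicum the restriction calls compile to no-ops, which is emphatically not a sandbox. Closing unneeded descriptors is left to the caller.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#define HAVE_CAPSICUM 1
#else
#define HAVE_CAPSICUM 0   /* assumption: no Capsicum, restrictions are no-ops */
#endif

/* Restrict fd to read-side rights (for_output == 0) or write-side rights. */
int restrict_fd(int fd, int for_output) {
#if HAVE_CAPSICUM
    cap_rights_t rights;
    if (for_output)
        cap_rights_init(&rights, CAP_WRITE, CAP_FSTAT);
    else
        cap_rights_init(&rights, CAP_READ, CAP_SEEK, CAP_FSTAT);
    return cap_rights_limit(fd, &rights);
#else
    (void)fd;
    (void)for_output;
    return 0;
#endif
}

/* Open every path in paths[0..n) read-only into fds[], restrict the inputs
 * and stdout/stderr, then enter capability mode. Returns 0 on success and
 * -1 on failure; the caller is expected to exit on failure. */
int open_inputs_and_sandbox(const char *const *paths, size_t n, int *fds) {
    /* 1. Open all work sources up front, and 2. restrict each to reading. */
    for (size_t i = 0; i < n; i++) {
        fds[i] = open(paths[i], O_RDONLY);
        if (fds[i] < 0 || restrict_fd(fds[i], 0) != 0)
            return -1;
    }
    /* 3. Restrict the output destinations to writing. */
    if (restrict_fd(STDOUT_FILENO, 1) != 0 || restrict_fd(STDERR_FILENO, 1) != 0)
        return -1;
    /* 4. Enter capability mode before any untrusted input is read. */
#if HAVE_CAPSICUM
    if (cap_enter() != 0)
        return -1;
#endif
    return 0;
}
```

After a successful return, the main loop works purely on the descriptors in fds[] and never needs to call open(2) again.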

(An upcoming article will cover this process in excruciating detail.)

However, this isn't the only possible model for using Capsicum. More generally, if a larger application is compartmentalized into distinct security domains, those domains can then be individually sandboxed and Capsicum capabilities passed between them.

[This leads naturally on to an aspect of Capsicum that hasn't been discussed here, namely process descriptors. These are (roughly) file descriptors that can be used in place of pid_t values, allowing processes to be controlled despite the fact that capability mode disables process manipulation functions that use the global pid_t namespace. However, I'll discuss process descriptors at a later date.]

Conditions May Apply

Capsicum is not a panacea for all security problems, so in this section we cover some of the limitations of Capsicum, and how they affect real-world use.

The first limitation of Capsicum is that it can only protect those objects in a UNIX system that are associated with file descriptors, or which are nameable via a global namespace. This covers a lot of ground, but around half of the system calls on a Linux box are still enabled in capability mode, and only around half of those are file-descriptor based. In particular, Capsicum provides no protection from some kinds of resource exhaustion attacks. Subverted code can still spin the CPU(s) and allocate arbitrary amounts of virtual memory, as these resources are not associated with any kind of file-descriptor model. (This is a Capsicum-specific restriction – in a pure capability system, these kinds of resources are governed by capabilities.)

The second limitation of Capsicum quickly becomes clear when real applications are sandboxed using it – many underlying libraries, both system and third-party, rely on the use of global namespaces "under the covers". This was illustrated by the first program to be Capsicumized, tcpdump: in its normal mode of operation (without the -n option), tcpdump tries to convert IP addresses to names by performing reverse-DNS lookups. However, the library functions for DNS operations need access to all sorts of global namespaces: reading /etc/resolv.conf to find name servers, access to port 53 on those nameservers, local connections to DNS caches, and so on.

The FreeBSD implementation of Capsicum includes some efforts to mitigate this problem, notably Casper – a system daemon that applications can connect to before entering capability mode, and which can then be used to provide those services (e.g. DNS, group/user enumeration, random number generation) that would otherwise need global namespace access. The FreeBSD dynamic linker has also been upgraded to allow the library path to be specified as a list of (capability) directory file descriptors, rather than path names, so dynamically linked programs can still be executed from within a Capsicum sandbox. However, there is no denying that this limitation makes the process of Capsicumizing an existing application more complicated, and there is more work to be done in this area.

Finally, the use of Capsicum does incur a small performance overhead. This is minimal for capability rights checks – a few bitmask checks – but may be higher for capability mode policing, as each syscall may require additional checking. [The FreeBSD implementation was originally measured to only show a ~10% overhead, but the current Linux implementation is likely to be slower.]

Compare and Contrast

In this section we compare Capsicum with a variety of other Linux security technologies, attempting to highlight the pros and cons of Capsicum relative to each. Note that the second half of the 2010 USENIX paper on Capsicum also compares Capsicum with other sandboxing technologies.

However, before moving on to individual comparisons it is worth pointing out that all of these different technologies can be composed, allowing for defense in depth. Capsicum on Linux is not implemented as an LSM, so it can interoperate with LSM-based MAC frameworks; capability mode is (mostly) implemented as a seccomp-BPF filter, and such filters can be combined.

POSIX.1e Capabilities

For clarity, the first thing to note is that Linux already includes a feature named capabilities, covering entirely different functionality. These existing Linux capabilities are based on a withdrawn POSIX.1e draft, and effectively divide up the privileges of root into distinct areas of functionality, which can be enabled and disabled independently on a process-wide basis.

As with Capsicum capabilities, this drives towards the principle of least authority: if a (setuid) program doesn't need root's full authority, it should drop the parts it doesn't need. However, the privileges that remain are still ambient authority for the program, potentially available for nefarious purposes should the program be compromised – and many escalations from one POSIX capability to full root authority have been observed. Also, in practice a large fraction of behaviour has ended up being controlled by the single CAP_SYS_ADMIN POSIX capability, making it almost as powerful as root's ambient authority, even without escalations.

Discretionary Access Control (DAC)

Traditional UNIX security is based around a discretionary access control (DAC) model: access to files and processes is policed according to the associated user and group IDs (and the POSIX.1e capabilities of the previous section are a more fine-grained example of this model).

Historically, this model was designed around a goal of protecting different users of the same system from each other, but a more significant problem for modern single-user systems is to protect the user against their own programs. Under a naive DAC model, a subverted program has access to everything that the user has access to – and consequently many systems use DAC in a more sophisticated way, running some (or even all, as in Android) applications under specially-created role accounts.

Capsicum is implemented as an additional layer of policing on top of the existing UNIX DAC model, rather than instead of the DAC model. In concrete terms, this means that if a discretionary access check (such as a uid check) would prevent an operation, holding an appropriate Capsicum capability does not override the check, and the operation fails. This has the advantage that the existing security properties of the system are unaffected by Capsicum, but with the downside that the resulting mechanism is a less pure object-capability system.

Mandatory Access Control (MAC) frameworks

Linux includes the option to configure one of any number of mandatory access control (MAC) frameworks, such as SELinux, AppArmor, Smack or Tomoyo. Each of these frameworks is implemented using the kernel's Linux Security Module (LSM) hooks, which ensure that kernel code consults with the LSM at key points during kernel processing. An LSM-based MAC framework then typically consults its own configuration to decide whether the processing should continue, generate an error, or fail, based on factors like the path names being accessed, the program being run, the user/group IDs involved, and the operation to be performed.

One big advantage of MAC frameworks is that an application can be effectively sandboxed without requiring code changes in the application itself. An administrator can observe the behaviour of the application running normally (e.g. with tools like strace or lsof), and use this information to craft a MAC configuration that only allows the application's "normal" behaviours. Some MAC frameworks also include a learning mode, which helps automate the process of generating a MAC configuration for the application.

A tightly-specified MAC configuration can also achieve the de-fanging of RCE exploits, as described for Capsicum in the first section of this document. A configuration that denies all access to unexpected IP addresses, ports, files and other processes can implement roughly similar constraints to those imposed by Capsicum.

However, this separation of code and configuration can also be a problem. A policy that is not generated by the application developer may only be an approximation of the app's behaviour, covering the most common code paths. Such a policy is then brittle against the use of less common options and code paths, and is likely to drift as new versions of the application are developed (and this drift is usually in the direction of a more lenient policy rather than a stricter policy).

A deeper knowledge of the application is needed to apply Capsicum, because code changes are involved. However, the resulting changes are likely to be less brittle – partly because they are applied by someone with understanding of the code, and partly because the changes flow logically from the design of the application (what objects does the application access, and why?).

This alignment of the Capsicum sandbox with the design of the application also potentially allows for domain-specific protection, which is difficult or impossible to encode as a MAC policy. For example, a web server could use a different sub-process for different virtual domains, each with its own specific set of capabilities.

seccomp-bpf Syscall Sandboxing

Modern versions of Linux include seccomp-bpf, a secure computing framework that allows the creation of flexible sandboxes that police the specific system calls allowed for a process. The sandbox is specified as a Berkeley Packet Filter (BPF) program; this program is executed on every system call, receiving inputs of the system call number and arguments, and generating a return code that indicates whether the syscall should go ahead, fail, log or terminate.

These sandboxes can be extremely flexible, as the BPF program can restrict which syscalls are allowed, and with what explicit arguments. There are some limitations, however – in particular, user memory cannot be examined, so seccomp-BPF sandboxes cannot police pathnames or the internals of structures (such as struct sockaddr or struct msghdr) that are passed as pointer arguments to a syscall.
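
For a feel of what such a filter looks like, here is a deliberately tiny, Linux-only sketch. It installs a BPF program that makes a single syscall fail with EPERM and allows everything else, then checks the effect in a child process. The function names are hypothetical and getppid is an arbitrary, harmless target. As a loud simplification, the filter does not check the architecture field of struct seccomp_data, which any real filter should do.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a filter that makes syscall number nr fail with EPERM.
 * SIMPLIFICATION: no architecture check, so not suitable for real use. */
int deny_syscall(int nr) {
    struct sock_filter filter[] = {
        /* Load the syscall number from the seccomp_data structure. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it matches nr, fall through to the ERRNO return; else skip it. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Runs the filter in a child so the caller stays unfiltered. Returns 0 if
 * getppid failed with EPERM as intended, 1 if it did not, 2 if the filter
 * could not be installed, and -1 on fork/wait errors. */
int demo_deny_getppid(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (deny_syscall(SYS_getppid) != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that the filter can only compare the syscall number and its raw argument values. It has no way to follow a pointer argument into user memory, which is exactly the limitation described above.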

This flexibility in BPF specification allows for sandboxes that are extremely tight, and which drastically reduce the kernel attack surface exposed to the application. However, generating such a precise filter program for an existing application is a difficult job, and results in a sandbox configuration that needs effort to keep in sync with changes to the code. The process of applying a seccomp-bpf sandbox is much easier for applications that have been designed with compartmentalization in mind, for example Chrome's separation of renderer processes from the rest of the application – but that's also the case for many other security technologies, including Capsicum and MAC frameworks.

Applying a Capsicum sandbox to an existing application is generally easier than a seccomp-BPF sandbox, because the code changes flow from an investigation of the kernel objects that the application manipulates, rather than trying to enumerate every system call that the application (and the library code that it links to) uses.

Capsicum also reduces the kernel attack surface, as capability mode disables roughly half of the system calls on Linux. However, this is more of a side effect than a primary goal: capability mode's aim is to remove the ability to globally name objects, which happens to involve about half of the syscall attack surface.

Namespaces

Another security-related technology available in recent versions of the Linux kernel is namespaces. Namespaces have a goal that is related to, but different from, that of Capsicum's capability mode – where capability mode disallows access to global namespaces, Linux namespaces instead give individual processes the illusion that they are operating on a global namespace when in fact they are not.

This approach immediately has the advantage of requiring fewer code changes: existing code can continue to work under the illusion that it is able to enumerate the users on the system, or that it can access particular IP addresses and ports, while actually being contained within a tightly specified subset.

However, setting up these namespaces (potentially across six distinct categories), and the requisite mappings between in-namespace and outside identifiers, is complex, and typically involves configuration that is maintained separately from the application. Although recent developments in the world of Linux containers have helped with this, it is still an area that involves considerable effort.

[As a comparatively recent feature of the kernel, and a complicated one at that, namespaces also have the disadvantage that they expose a new, and comparatively un-hardened, area of kernel attack surface. However, this situation can only improve over time.]

Control Groups

The previous section mentioned Linux containers, which are built on top of two kernel features, namespaces and control groups. Control groups allow resource limits (memory, CPU, I/O operations etc.) to be applied to groups of processes. Although not directly a security feature, control groups allow the effects of denial-of-service style attacks to be limited, if the cgroup configuration for a vulnerable program is specified appropriately.

As such, control groups are potentially a useful feature to combine with Capsicum, which offers little protection against resource exhaustion attacks (as discussed above).

Closing Remarks

Capsicum brings another tool to the Linux security toolbox: one rooted in the concepts of capability-based security, with the aim of being a reasonable compromise between the ease of application, the tightness of the resulting protection, and the long-term maintainability of the result.

Capsicum applies particularly well to some classes of application, where the capabilities involved align naturally with the objects that the application manipulates – notably traditional-style UNIX command line applications and applications which are compartmentalized into distinct security domains.

For objects and resources that are identified by file descriptors (which for UNIX is most of them), the capability approach also brings a security model that is simple to analyze: enumerating a program's capabilities tells you what it can do and who it can talk to. For the latter, (recursively) examining the capabilities of the program's communication peers then gives an overall list of the operations that the system can perform.