[2015-04-27: edited to add section on discretionary access controls]
[2015-04-07: edited to add section on POSIX capabilities]
Although the original Capsicum papers are very readable, I thought it might be useful to have an accessible overview of Capsicum, together with some musing about how it compares with other Linux security technologies.
Capsicum is a security framework that takes concepts from capability-based security, and applies those concepts in a pragmatic way to UNIX applications. This gives many of the benefits of a well-analysed security model, but allows for gradual migration and full interoperation with non-Capsicum applications – in other words, it's much easier than re-writing for a full capability-based operating system (such as seL4).
In capability-based security, any access to an object needs an unforgeable token, the capability, which identifies the object in question and defines what rights the capability holder has for that object – that is, what operations they're allowed to perform with the object.
This goes hand-in-hand with an absence of object naming schemes – there is no way to refer to an object by some parseable name ("/etc/passwd", 10.1.2.3:53, uid:root, pid:3145, …) in order to get a reference to the object.
Taken together, these attributes give capability-based systems a security model that is simple to analyze: the capabilities that a program holds completely describe what the program is capable of doing, and who it is capable of communicating with. This in turn assists the process of privilege separation for improved security – interactions between different components of a program can be analysed by examining the capabilities that are exchanged (also, those interactions are then less susceptible to confused deputy problems).
In real-world terms, applications and services that interact with untrusted and remote users are potentially vulnerable to a particularly scary style of attack – the remote code execution (RCE) attack, where the attacker subverts the application and runs their own code with full access to the entire ambient authority of the subverted application. As a result, the RCE exploit code can access any object that the application is able to access.
As with other sandboxing techniques, Capsicum's capability-based approach often de-fangs these RCE attacks: although the attacker may still be able to run arbitrary code, what that code is capable of doing is severely limited. Scanning the local system, contacting a remote C&C server, joining a DDoS botnet, snooping on other programs – all of these operations require access to some global namespace or other (e.g. names on the filesystem, IP addresses and ports, process IDs), and that access is exactly what a capability-based sandbox takes away.
Object capabilities can be passed around or inherited from a parent, but are only created by taking an existing capability and subsetting its rights. This raises a difficult bootstrap question: if you can't name objects, where does an ordinary application get its capabilities from?
In a pure object-capability system, this bootstrap problem is typically addressed by having capabilities cascade down from an initial uber-capability that covers the entire system at start-of-day – but this is a radically different model from how current UNIX-based systems work. However, because Capsicum is a pragmatic hybrid of capability-based security with normal POSIX semantics, the bootstrap problem is neatly sidestepped: the application can create all of its capabilities before it shifts into capability mode.
(Of course, Capsicum still allows the normal ways that a program would acquire capabilities in a pure capability system – the capabilities could be inherited from a parent process, or could be explicitly passed to the program across a local socket – both of which are useful if the program is untrusted from its inception.)
Several explanations of capabilities use the analogy of a UNIX file descriptor (FD) to describe the key concepts of capabilities in familiar terms:
- When the kernel opens a file on behalf of a userspace application, it just hands an integer token (the file descriptor) to the application; the kernel object that tracks the status of the open file is entirely internal to the kernel. The application can't get a new token without the help of the kernel, and any operations on the object (read(), write(), close(), …) have to specify the token.
- The UNIX model of "all the world's a file" also means that various sorts of kernel-managed objects use the same file descriptor mechanism for access – open files, sockets (both network and local), message queues, timers, file system notifications, etc.
- File descriptors can be passed between processes, over local sockets with sendmsg(2) / recvmsg(2). (Of course, the local sockets are also described by file descriptors – so they too are capabilities.)
- Depending on the mode in which the file was opened, the kernel only allows certain operations to be performed on the file (e.g. no write() for a file opened O_RDONLY). However, the existing UNIX restrictions are very simplistic, just read-write-execute, and positively misleading in their implementation – for example, fchmod(2) will happily change the permissions of a read-only file descriptor.
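The fchmod(2) quirk in the last bullet is easy to reproduce. The following is a minimal Python sketch (Python's os module wraps the same system calls); the helper name and the temporary file are illustrative assumptions, not anything taken from Capsicum itself.

```python
import os
import stat
import tempfile


def fchmod_via_readonly_fd():
    """Return (write_succeeded, final_mode) for a descriptor opened O_RDONLY."""
    with tempfile.NamedTemporaryFile() as tmp:
        os.chmod(tmp.name, 0o600)
        fd = os.open(tmp.name, os.O_RDONLY)
        try:
            try:
                os.write(fd, b"x")  # refused: the descriptor is read-only
                write_succeeded = True
            except OSError:
                write_succeeded = False
            # Changing the file's permissions through the very same
            # read-only descriptor is nevertheless accepted.
            os.fchmod(fd, 0o644)
            final_mode = stat.S_IMODE(os.fstat(fd).st_mode)
        finally:
            os.close(fd)
    return write_succeeded, final_mode


if __name__ == "__main__":
    print(fchmod_via_readonly_fd())
```

The write is rejected because of the open mode, while the permission change goes through; this is the gap that a finer-grained right such as CAP_FCHMOD is meant to close.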
Capsicum makes this capability analogy into a reality by associating a much more fine-grained set of rights with file descriptors, and extending the kernel so that it accurately polices those rights. A normal file descriptor implicitly has all possible rights, thus preserving all existing behaviour; a new system call (cap_rights_limit()) is then used to restrict the rights associated with a file descriptor to be a subset of those it already has (i.e. rights can never be extended, only restricted). A file descriptor that has had its rights so restricted is referred to as a Capsicum capability.
If an application attempts to perform an operation on a capability file descriptor that is not allowed by its associated rights, the kernel will fail the operation with ENOTCAPABLE. This is a new Capsicum-specific errno value, which makes the process of applying Capsicum to an existing application easier.
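The "rights can only shrink" rule and the rights check can be illustrated with a toy model. This is a sketch in Python, not the kernel API: the Right, ToyDescriptor and NotCapable names are hypothetical, and only a handful of the rights listed in this article are modelled.

```python
import enum


class Right(enum.Flag):
    """A few of the rights named in the text (the real set is much larger)."""
    READ = enum.auto()
    WRITE = enum.auto()
    SEEK = enum.auto()
    FSTAT = enum.auto()
    FCHMOD = enum.auto()


ALL_RIGHTS = Right.READ | Right.WRITE | Right.SEEK | Right.FSTAT | Right.FCHMOD


class NotCapable(Exception):
    """Stands in for the ENOTCAPABLE errno value."""


class ToyDescriptor:
    """Hypothetical model of a descriptor plus its rights; not a real API."""

    def __init__(self):
        self.rights = ALL_RIGHTS  # an ordinary FD implicitly has every right

    def limit(self, new_rights):
        # Rights may only shrink: asking for anything not already held fails.
        if (new_rights & self.rights) != new_rights:
            raise NotCapable("cannot add rights to a descriptor")
        self.rights = new_rights

    def check(self, needed):
        # The per-operation check, reduced to a bitmask comparison.
        if (needed & self.rights) != needed:
            raise NotCapable("operation needs rights the descriptor lacks")


def allowed(desc, needed):
    """Return True if an operation needing `needed` would be permitted."""
    try:
        desc.check(needed)
        return True
    except NotCapable:
        return False
```

After `d = ToyDescriptor(); d.limit(Right.READ | Right.FSTAT)`, a read is still permitted, a write is refused, and a later attempt to widen the rights back to include WRITE raises NotCapable.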
The current Capsicum implementations define around 60 distinct rights, as previous experience has shown that permissions models become more granular as time goes on, so being more granular at the beginning helps maintainability. Commonly-used rights include:
- CAP_READ: allow operations that read the content of the object
- CAP_WRITE: allow operations that write the content of the object
- CAP_SEEK: allow operations that operate at an arbitrary offset within the object, or which explicitly alter the file offset (i.e. lseek(2))
- CAP_LOOKUP: allow operations that search within a directory
- CAP_CREATE: allow operations that create files within a directory
- CAP_FCHMOD: allow operations that change the file permissions
- CAP_FSTAT: allow operations that retrieve file metadata/status
- CAP_ACCEPT: allow accept socket operations
- CAP_BIND: allow bind socket operations
- CAP_CONNECT: allow connect socket operations
- CAP_FCNTL: allow file control (fcntl(2)) operations (but see below)
- CAP_IOCTL: allow I/O control (ioctl(2)) operations (but see below)
Because the fcntl(2) and (particularly) ioctl(2) system calls act as multiplexors whose precise arguments determine the operation to be performed, Capsicum also includes the ability to restrict these syscalls to particular operations (with cap_fcntls_limit() and cap_ioctls_limit()).
Finally, one detail to note about Capsicum capabilities is that the restricted rights apply to file descriptors, not open file objects. This means that different file descriptors can refer to the same underlying kernel object (e.g. after dup(2)) but have different rights associated with them. This is particularly useful when passing capability file descriptors between processes – a file descriptor passed across to a service can be tightly restricted before doing so, while the sender can continue to hold a more-capable FD for the same object.
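The hand-off itself uses the ordinary descriptor-passing mechanism. Below is a minimal Python sketch of sending a descriptor over a local socket with sendmsg()/recvmsg(); it runs in a single process for brevity, the function names are illustrative, and the rights-restriction step needs a Capsicum kernel, so it appears only as a comment.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one descriptor as SCM_RIGHTS ancillary data."""
    payload = array.array("i", [fd])
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, payload)])


def recv_fd(sock):
    """Receive one descriptor sent with send_fd()."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
    return fds[0]


def pass_pipe_end():
    """Pass the read end of a pipe across a socket pair and read through it."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    received = -1
    try:
        # On a Capsicum system the sender would dup() read_end here and
        # restrict the duplicate (e.g. to CAP_READ) before sending it,
        # keeping its own, more capable descriptor.
        send_fd(left, read_end)
        received = recv_fd(right)
        os.write(write_end, b"hello")
        data = os.read(received, 5)
        return received != read_end, data
    finally:
        for fd in (read_end, write_end, received):
            if fd >= 0:
                os.close(fd)
        left.close()
        right.close()
```

The receiver ends up with a new descriptor number that refers to the same underlying pipe, which is why per-descriptor rights are the natural place to attach restrictions.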
Of course, restricting the operations that can be performed with existing file descriptors is of little use if an attacker can just mint new file descriptors. Capsicum therefore includes capability mode, which (permanently) restricts the system calls available to the current process (and any future descendants). In particular, system calls that allow the creation of new file descriptors from scratch by referring to an object via a global namespace are banned, failing with a new ECAPMODE errno value.
This doesn't completely remove all FD-creating system calls; there are a couple of system calls that create file descriptors without using a global name, by referencing an existing file descriptor:
- The openat(int dfd, const char *path, int flags) system call allows a file to be opened relative to an existing (directory) file descriptor. This syscall is allowed in capability mode (as long as the directory FD has the CAP_LOOKUP right), but is policed so that the path cannot be used to escape the directory: no leading / and no .. components are allowed, nor any symlinks that contain them.
- The accept(int sockfd, …) system call allows a connected socket to be extracted from a listening socket; this syscall is allowed in capability mode (as long as the listening socket has the CAP_ACCEPT right).
In both of these cases, the newly-minted file descriptor inherits the rights of the parent file descriptor (directory file descriptor or listening socket respectively). This simplifies the implementation, both of the application using Capsicum and Capsicum itself, but has the downside that the parent FD must have all the rights that are needed for any future derived FDs.
(Tighter restrictions on rights can be achieved by having a separate process that opens/accepts things, then restricts the rights and passes the resulting capability to the main worker process – but that's obviously at a cost of much more complexity, particularly for an application that is not already compartmentalized.)
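The path policing described above for openat() can be approximated in user space. This is only a sketch: open_beneath is a hypothetical helper, and unlike the kernel check it does not police symlinks in intermediate path components, so it illustrates the rule rather than enforcing it.

```python
import os
import tempfile


def open_beneath(dir_fd, path, flags=os.O_RDONLY):
    """Open `path` relative to `dir_fd`, refusing obvious escape attempts."""
    if path.startswith("/") or ".." in path.split("/"):
        raise PermissionError("path could escape the directory: %r" % path)
    # O_NOFOLLOW only rejects a symlink in the final component; a kernel
    # implementation also has to check every intermediate component.
    return os.open(path, flags | os.O_NOFOLLOW, dir_fd=dir_fd)


def demo_open_beneath():
    """Open one file under a directory FD and try three escaping paths."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "a.txt"), "w") as f:
            f.write("hi")
        dir_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
        try:
            fd = open_beneath(dir_fd, "a.txt")
            content = os.read(fd, 16)
            os.close(fd)
            rejected = []
            for bad in ("/etc/passwd", "../a.txt", "sub/../../x"):
                try:
                    os.close(open_beneath(dir_fd, bad))
                except PermissionError:
                    rejected.append(bad)
            return content, rejected
        finally:
            os.close(dir_fd)
```

The important design point is that the directory descriptor, not a global path, is the root of everything the helper can reach.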
Putting It Together
The core primitives of Capsicum apply particularly well to traditional UNIX utilities that have a central core that processes input from a set of work sources, generating a combined output. For these cases, the application needs straightforward modifications so that:
- All of the input work sources and output destinations are opened before entering the main loop.
- Each input file descriptor is Capsicum-restricted to only allow read operations (CAP_READ and friends).
- The output destinations are Capsicum-restricted to allow write operations.
- Any unnecessary file descriptors are closed.
- The application enters capability mode before starting its main loop (and thus before it reads any untrusted, potentially attacker-supplied, inputs).
(An upcoming article will cover this process in excruciating detail.)
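The ordering of those steps can be sketched with a cat-like utility. This is a skeleton under stated assumptions rather than a working sandbox: limit_rights() and enter_capability_mode() are hypothetical placeholders (no-ops here) marking where cap_rights_limit() and the switch into capability mode (cap_enter() on FreeBSD) would go.

```python
import os
import tempfile


def limit_rights(fd, rights):
    """Hypothetical placeholder for cap_rights_limit(); does nothing here."""


def enter_capability_mode():
    """Hypothetical placeholder for entering capability mode; a no-op here."""


def concatenate(input_paths, output_path):
    """Copy every input to the output, touching only pre-opened descriptors."""
    fds = []
    try:
        # Open all sources and the destination while path names still work.
        for path in input_paths:
            fds.append(os.open(path, os.O_RDONLY))
        out_fd = os.open(output_path,
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        fds.append(out_fd)
        in_fds = fds[:-1]
        # Restrict inputs to reading and the output to writing.
        for fd in in_fds:
            limit_rights(fd, {"CAP_READ"})
        limit_rights(out_fd, {"CAP_WRITE"})
        # Nothing else was opened here, so there is nothing extra to close.
        enter_capability_mode()
        # Main loop: from here on only descriptors are used, never paths.
        copied = 0
        for fd in in_fds:
            while True:
                chunk = os.read(fd, 65536)
                if not chunk:
                    break
                while chunk:
                    written = os.write(out_fd, chunk)
                    chunk = chunk[written:]
                    copied += written
        return copied
    finally:
        for fd in fds:
            os.close(fd)


def demo_concatenate():
    """Build two small files, concatenate them, and return the result."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for name, text in (("a", b"one\n"), ("b", b"two\n")):
            path = os.path.join(root, name)
            with open(path, "wb") as f:
                f.write(text)
            paths.append(path)
        out = os.path.join(root, "out")
        copied = concatenate(paths, out)
        with open(out, "rb") as f:
            return copied, f.read()
```

Because every open() happens before the placeholder for capability mode, the main loop would keep working unchanged once the placeholders are replaced with real calls.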
However, this isn't the only possible model for using Capsicum. More generally, if a larger application is compartmentalized into distinct security domains, those domains can then be individually sandboxed and Capsicum capabilities passed between them.
[This leads naturally on to an aspect of Capsicum that hasn't been discussed here, namely process descriptors. These are (roughly) file descriptors that can be used in place of pid_t values, allowing processes to be controlled despite the fact that capability mode disables process manipulation functions that use the global pid_t namespace. However, I'll discuss process descriptors at a later date.]
Conditions May Apply
Capsicum is not a panacea for all security problems, so in this section we cover some of the limitations of Capsicum, and how they affect real-world use.
The first limitation of Capsicum is that it can only protect those objects in a UNIX system that are associated with file descriptors, or which are nameable via a global namespace. This covers a lot of ground, but around half of the system calls on a Linux box are still enabled in capability mode, and only around half of those are file-descriptor based. In particular, Capsicum provides no protection from some kinds of resource exhaustion attacks. Subverted code can still spin the CPU(s) and allocate arbitrary amounts of virtual memory, as these resources are not associated with any kind of file-descriptor model. (This is a Capsicum-specific restriction – in a pure capability system, these kinds of resources are governed by capabilities.)
The second limitation of Capsicum quickly becomes clear when real applications are sandboxed using it – many underlying libraries, both system and third-party, rely on the use of global namespaces "under the covers". This was illustrated by the first program to be Capsicumized, tcpdump: in its normal mode of operation (without the -n option), tcpdump tries to convert IP addresses to names by performing reverse-DNS lookups. However, the library functions for DNS operations need access to all sorts of global namespaces: reading /etc/resolv.conf to find name servers, access to port 53 on those nameservers, local connections to DNS caches, and so on.
The FreeBSD implementation of Capsicum includes some efforts to mitigate this problem, notably Casper – a system daemon that applications can connect to before entering capability mode, and can then be used to provide those services (e.g. DNS, group/user enumeration, random number generation) that would otherwise need global namespace access. The FreeBSD dynamic linker has also been upgraded to allow the library path to be specified as a list of (capability) directory file descriptors, rather than path names, so dynamically linked programs can still be executed from within a Capsicum sandbox. However, there is no denying that this limitation makes the process of Capsicumizing an existing application more complicated, and there is more work to be done in this area.
Finally, the use of Capsicum does incur a small performance overhead. This is minimal for capability rights checks – a few bitmask checks – but may be higher for capability mode policing, as each syscall may require additional checking. [The FreeBSD implementation was originally measured to only show a ~10% overhead, but the current Linux implementation is likely to be slower.]
Compare and Contrast
In this section we compare Capsicum with a variety of other Linux security technologies, attempting to highlight the pros and cons of Capsicum relative to each. The second half of the 2010 Usenix paper on Capsicum also compares Capsicum with other sandboxing technologies.
However, before moving on to individual comparisons it is worth pointing out that all of these different technologies can be composed, allowing for defense in depth. Capsicum on Linux is not implemented as an LSM, so it can interoperate with LSM-based MAC frameworks; capability mode is (mostly) implemented as a seccomp-BPF filter, and such filters can be combined.
POSIX.1e Capabilities
For clarity, the first thing to note is that Linux already includes a feature named capabilities, covering entirely different functionality. These existing Linux capabilities are based on a withdrawn POSIX.1e draft, and effectively divide up the privileges of root into distinct areas of functionality, which can be enabled and disabled independently on a process-wide basis.
As with Capsicum capabilities, this drives towards the principle of least authority: if a (setuid) program doesn't need root's full authority, it should drop the parts it doesn't need. However, the privileges that remain are still ambient authority for the program, potentially available for nefarious purposes should the program be compromised – and many escalations from one POSIX capability to full root authority have been observed. Also, in practice a large fraction of behaviour has ended up being controlled by the single CAP_SYS_ADMIN POSIX capability, making it almost as powerful as root's ambient authority, even without escalations.
Discretionary Access Control (DAC)
Traditional UNIX security is based around a discretionary access control (DAC) model: access to files and processes is policed according to the associated user and group IDs (and the POSIX.1e capabilities of the previous section are a more fine-grained example of this model).
Historically, this model was designed around a goal of protecting different users of the same system from each other, but a more significant problem for modern single-user systems is to protect the user against their own programs. Under a naive DAC model, a subverted program has access to everything that the user has access to – and consequently many systems use DAC in a more sophisticated way, running some (or even all, as in Android) applications under specially-created role accounts.
Capsicum is implemented as an additional layer of policing on top of the existing UNIX DAC model, rather than as a replacement for it. In concrete terms, this means that if a discretionary access check (such as a uid check) would prevent an operation, holding an appropriate Capsicum capability does not override the check, and the operation fails. This has the advantage that the existing security properties of the system are unaffected by Capsicum, but with the downside that the resulting mechanism is a less pure object-capability system.
Mandatory Access Control (MAC) frameworks
Linux includes the option to configure one of any number of mandatory access control (MAC) frameworks, such as SELinux, AppArmor, Smack or Tomoyo. Each of these frameworks is implemented using the kernel's Linux Security Module (LSM) hooks, which ensure that kernel code consults with the LSM at key points during kernel processing. An LSM-based MAC framework then typically consults its own configuration to decide whether the processing should continue, generate an error, or fail, based on factors like the path names being accessed, the program being run, the user/group IDs involved, and the operation to be performed.
One big advantage of MAC frameworks is that an application can be effectively sandboxed without requiring code changes in the application itself. An administrator can observe the behaviour of the application running normally (e.g. with tools like strace or lsof), and use this information to craft a MAC configuration that only allows the application's "normal" behaviours. Some MAC frameworks also include a learning mode, which helps automate the process of generating a MAC configuration for the application.
A tightly-specified MAC configuration can also achieve the de-fanging of RCE exploits, as described for Capsicum in the first section of this document. A configuration that denies all access to unexpected IP addresses, ports, files and other processes can implement roughly similar constraints to those imposed by Capsicum.
However, this separation of code and configuration can also be a problem. A policy that is not generated by the application developer may only be an approximation of the app's behaviour, covering the most common code paths. Such a policy is then brittle against the use of less common options and code paths, and is likely to drift as new versions of the application are developed (and this drift is usually in the direction of a more lenient policy rather than a stricter policy).
A deeper knowledge of the application is needed to apply Capsicum, because code changes are involved. However, the resulting changes are likely to be less brittle – partly because they are applied by someone with understanding of the code, and partly because the changes flow logically from the design of the application (what objects does the application access, and why?).
This alignment of the Capsicum sandbox with the design of the application also potentially allows for domain-specific protection, which is difficult or impossible to encode as a MAC policy. For example, a web server could use a different sub-process for different virtual domains, each with its own specific set of capabilities.
seccomp-bpf Syscall Sandboxing
Modern versions of Linux include seccomp-bpf, a secure computing framework that allows the creation of flexible sandboxes that police the specific system calls allowed for a process. The sandbox is specified as a Berkeley Packet Filter (BPF) program; this program is executed on every system call, receiving inputs of the system call number and arguments, and generating a return code that indicates whether the syscall should go ahead, fail with an error, be logged, or terminate the process.
These sandboxes can be extremely flexible, as the BPF program can restrict which syscalls are allowed, and with what explicit arguments. There are some limitations, however – in particular, user memory cannot be examined, so seccomp-BPF sandboxes cannot police pathnames or the internals of structures (such as struct sockaddr or struct msghdr) that are passed as pointer arguments to a syscall.
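The shape of such a filter, and the pointer limitation, can be shown with a toy model. This Python sketch is not real BPF and does not install anything in the kernel: SyscallData, toy_filter and the action names are hypothetical, and syscalls are identified by name here where a real filter sees a number.

```python
import socket
from collections import namedtuple

# Loosely modelled on what a filter is given: which syscall is being made
# and its raw integer arguments. Pointer arguments are just addresses.
SyscallData = namedtuple("SyscallData", ["name", "args"])

ALLOW, ERRNO, KILL = "allow", "errno", "kill"


def toy_filter(call):
    """Decide the fate of one system call, seccomp-style."""
    if call.name in ("read", "write", "exit_group"):
        return ALLOW
    if call.name == "socket":
        # Integer arguments can be inspected: only local sockets pass.
        return ALLOW if call.args[0] == socket.AF_UNIX else ERRNO
    if call.name == "openat":
        # args[1] is the address of the path string. The filter cannot
        # dereference it, so it must decide without knowing which file
        # is being named; this policy simply refuses every openat().
        return ERRNO
    return KILL
```

A call such as `toy_filter(SyscallData("socket", (socket.AF_INET, socket.SOCK_STREAM, 0)))` is refused on the strength of an integer argument, whereas no policy on openat() can distinguish one path from another.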
This flexibility in BPF specification allows for sandboxes that are extremely tight, and which drastically reduce the kernel attack surface exposed to the application. However, generating such a precise filter program for an existing application is a difficult job, and results in a sandbox configuration that needs effort to keep in sync with changes to the code. The process of applying a seccomp-bpf sandbox is much easier for applications that have been designed with compartmentalization in mind, for example Chrome's separation of renderer processes from the rest of the application – but that's also the case for many other security technologies, including Capsicum and MAC frameworks.
Applying a Capsicum sandbox to an existing application is generally easier than a seccomp-BPF sandbox, because the code changes flow from an investigation of the kernel objects that the application manipulates, rather than trying to enumerate every system call that the application (and the library code that it links to) uses.
Capsicum also reduces the kernel attack surface, as capability mode disables roughly half of the system calls on Linux. However, this is more of a side effect than a primary goal: capability mode's aim is to remove the ability to globally name objects, which happens to involve about half of the syscall attack surface.
Linux Namespaces
Another security-related technology available in recent versions of the Linux kernel is namespaces. Namespaces have a goal that is related to, but different from Capsicum's capability mode – where capability mode disallows access to global namespaces, Linux namespaces instead give individual processes the illusion that they are operating on a global namespace when in fact they are not.
This approach immediately has the advantage of requiring fewer code changes: existing code can continue to work under the illusion that it is able to enumerate the users on the system, or that it can access particular IP addresses and ports, while actually being contained within a tightly specified subset.
However, setting up these namespaces (potentially across six distinct categories), and the requisite mappings between in-namespace and outside identifiers, is complex, and typically involves configuration that is maintained separately from the application. Although recent developments in the world of Linux containers have helped with this, it is still an area that involves considerable effort.
[As a comparatively recent feature of the kernel, and a complicated one at that, namespaces also have the disadvantage that they expose a new, and comparatively un-hardened, area of kernel attack surface. However, this situation can only improve over time.]
Control Groups
The previous section mentioned Linux containers, which are built on top of two kernel features, namespaces and control groups. Control groups allow resource limits to be applied to groups of processes; memory, CPU, I/O operations etc. Although not directly a security feature, control groups allow the effects of denial-of-service style attacks to be limited, if the cgroup configuration for a vulnerable program is specified appropriately.
As such, control groups are potentially a useful feature to combine with Capsicum, which offers little protection against resource exhaustion attacks (as discussed above).
Capsicum brings another tool to the Linux security toolbox: one rooted in the concepts of capability-based security, with the aim of being a reasonable compromise between the ease of application, the tightness of the resulting protection, and the long-term maintainability of the result.
Capsicum applies particularly well to some classes of application, where the capabilities involved align naturally with the objects that the application manipulates – notably traditional-style UNIX command-line applications and applications which are compartmentalized into distinct security domains.
For objects and resources that are identified by file descriptors (which for UNIX is most of them), the capability approach also brings a security model that is simple to analyze: enumerating a program's capabilities tells you what it can do and who it can talk to. For the latter, (recursively) examining the capabilities of the program's communication peers then gives an overall list of the operations that the system can perform.