Linux Kernel 3.9 – The Kernel Column with Jon Masters
Jon Masters summarises the latest happenings in the Linux kernel community
Linus Torvalds has announced the first several 3.9 kernel release candidates, following the closing of the 3.9 ‘merge window’ (the period of time during which disruptive changes to the kernel and new features are merged). Merge windows typically last up to two weeks (and seldom longer), though Linus has taken great pains over the past few years to push developers not to post patches for inclusion at the very end of the window. Features merged into the kernel should instead have received heavy testing in the linux-next kernel and elsewhere, be largely complete, and be posted for inclusion as early as possible during the two-week window of frantic development for a given release cycle. This is the theory, at any rate.
In the mail that both closed the merge window and announced 3.9-rc1, Linus said, “I don’t know if it’s just me, but this merge window had more ‘Uhhuh’ moments than I’m used to. I stopped merging a couple of times, because we had bugs that looked really scary, but thankfully each time people were on them like paparazzi on Justin Bieber.” The features that made it into 3.9 include support for Synopsys’s 32-bit ‘ARC’ architecture (following a third posting of reviewed patches from Vineet Gupta), which is designed for use in embedded and DSP applications, particularly those in which the ability to extend the CPU pipeline with custom instructions is beneficial (though this author notes that butchering a CPU pipeline comes with its own hidden costs and so isn’t something recommended for non-embedded applications).
Another very interesting new feature in the 3.9 kernel is Intel’s ‘PowerClamp’ driver. PowerClamp aims to constrain the maximum amount of power used by the system by forcing the CPUs to enter an idle state for a certain percentage of their operating time. At first glance, this may seem to be less than useless. After all, having paid good money for a powerful modern CPU (with multiple cores), most users expect to get all the oomph they can out of it. There are some users, however, for whom this desire is balanced by overall power constraints – in particular in data centres, where there is a hard limit (often 10 or 15kW) on the power available to a given rack of server equipment. Exceeding the power available to a rack can cause all of the servers in it to shut down, which is generally not what data centre users want. Google and others have encountered such problems over the past few years, and have used their own custom solutions to (presumably) good effect.
PowerClamp offers a general solution to the problem of hard limits at the rack level by allowing an administrator or software management tool (‘agent’) to configure the system such that it will inject a certain number of idle states onto a given CPU. Typically, the kernel’s idle thread (idle task or idle process) will run only when there is nothing else to do. It calls a special machine instruction that will efficiently transition the processor into a lower power state from which it can be woken when there is work to do (usually through an external interrupt). In the case of PowerClamp, additional kidle_inject threads are created to run at specific times when there is a need to inject additional idle states (over and above the regular idle thread) so that the CPU remains idle overall for the percentage of time configured by the user. Typically, PowerClamp will be used with some higher-level management software that looks at the whole rack and dynamically tunes many different systems for optimal overall power use.
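The column does not describe how the idle percentage is actually set. As a minimal sketch, assuming the driver registers itself as a thermal cooling device of type intel_powerclamp under /sys/class/thermal (as the driver's kernel documentation describes), a management agent might request a target idle percentage roughly as follows. The function names are hypothetical, and writing to sysfs on a real machine requires root.

```python
from pathlib import Path

# Assumed location of the generic thermal sysfs tree.
THERMAL_ROOT = Path("/sys/class/thermal")


def find_powerclamp(root=THERMAL_ROOT):
    """Return the cooling device directory of type intel_powerclamp, or None."""
    for dev in sorted(Path(root).glob("cooling_device*")):
        try:
            if (dev / "type").read_text().strip() == "intel_powerclamp":
                return dev
        except OSError:
            continue  # device vanished or is unreadable; keep looking
    return None


def set_idle_percent(percent, root=THERMAL_ROOT):
    """Ask PowerClamp to keep the CPUs idle for roughly `percent` of the time."""
    dev = find_powerclamp(root)
    if dev is None:
        raise FileNotFoundError("no intel_powerclamp cooling device found")
    # max_state is the largest idle percentage the driver will accept.
    max_state = int((dev / "max_state").read_text())
    if not 0 <= percent <= max_state:
        raise ValueError(f"percent must be between 0 and {max_state}")
    (dev / "cur_state").write_text(str(percent))
    return dev


if __name__ == "__main__":
    print(set_idle_percent(25))
```

Passing the sysfs root as a parameter keeps the sketch easy to exercise against a scratch directory rather than a live system.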
David Howells announced that “The end is nigh!” for his ongoing UAPI work. UAPI is a near year-long effort by David to clean up the kernel’s internal header files (source code containing definitions and small inline code functions typically included into kernel or application code), splitting out those parts that pertain only to the API (application programming interface) used by non-kernel ‘user space’ applications, such as the Bash shell or Firefox web browser. A typical Linux system includes many such header files within the /usr/include/linux directory. These are exported from the kernel source tree by the ‘make headers_install’ target.
Until now, the process of building header files usable by user applications involved selective copying of a limited number of kernel headers (most are not intended to be used by application code) and judicious use of special conditionals within those files to ensure the right thing would happen when they were used by non-kernel code. After David’s work, the ‘user-space API’ is more clearly defined and these pieces are separated out into files specifically intended for direct use by non-kernel code. David’s latest email suggests that only a few changes pertaining to video framebuffers remain “now that the SCSI stuff has gone in”.
Miklos Szeredi posted patches implementing a new ‘overlayfs’ file system. Overlay and union file systems have been a topic of much debate for many years, particularly because they never seem to work quite right. The problem they generally attempt to solve is one of allowing several distinct file systems to be joined together, with the net result being a virtual file system (only existing as a whole at runtime) that contains selective pieces of each of its constituent parts. Typical use cases are embedded routers and live CDs. Both contain some storage (flash memory or optical media) that is read-only, and some storage (a RAM disk or a separate device such as a USB stick or another flash chip) that can be written to. A special file system is then used to present what appears to be a combination of the content of both of these underlying stores. As files from the read-only media are modified or deleted, deltas are written to the separate writable storage instead (including special markers that indicate a file has actually been deleted). Unlike other efforts to do overlays, Miklos’s code tries to be as small as possible by generally passing through operations on open files to the underlying file systems as quickly as possible. It will be interesting to see where this goes.
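The lookup rules described above can be illustrated with a toy in-memory model. This is only a sketch of the general overlay idea (a read-only lower layer, a writable upper layer, and deletion markers), not a description of how Miklos’s patches are implemented; the class and the WHITEOUT sentinel are hypothetical names chosen for illustration.

```python
# Sentinel stored in the upper layer to record "this file was deleted".
WHITEOUT = object()


class Overlay:
    """Toy overlay: a read-only lower layer plus a writable upper layer."""

    def __init__(self, lower):
        self.lower = dict(lower)  # never modified after construction
        self.upper = {}           # receives all writes and deletion markers

    def read(self, path):
        # The upper layer always wins; a whiteout hides the lower file.
        if path in self.upper:
            entry = self.upper[path]
            if entry is WHITEOUT:
                raise FileNotFoundError(path)
            return entry
        if path in self.lower:
            return self.lower[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Changes never touch the lower layer; they land in the upper one.
        self.upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if not currently visible
        if path in self.lower:
            self.upper[path] = WHITEOUT  # mask the read-only copy
        else:
            del self.upper[path]

    def listdir(self):
        names = set(self.lower) | set(self.upper)
        return sorted(n for n in names if self.upper.get(n) is not WHITEOUT)


if __name__ == "__main__":
    ov = Overlay({"/etc/motd": "hello", "/bin/sh": "binary"})
    ov.write("/etc/motd", "changed")
    ov.delete("/bin/sh")
    print(ov.listdir(), ov.read("/etc/motd"), ov.lower["/etc/motd"])
```

Note that deleting a file that exists on the read-only layer cannot remove it; the model can only record a marker that hides it, which is the role the special markers play in the description above.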
John Stultz (of Linaro) has posted an RFC patch-in-progress that would pull support for Android’s sync driver into the staging tree. Staging is a part of the kernel source tree where experimental and not-quite-baked drivers can sit inside the kernel source while they are being cleaned up. These drivers are available only if specifically configured for use. The sync driver provides a collection of synchronisation primitives (code routines that can be used by other code to ensure operations happen in the correct order) for use with drivers that provide different parts of the graphics pipeline used with the Android SurfaceFlinger compositor.
Anton Vorontsov posted the latest version of a memory control groups (memcg) patch implementing memory-pressure-level event support. With this patch, applications that “want to maintain the interactivity/memory allocation cost can use the new pressure-level notifications”. What this means is that it is possible for an application to be aware of the overall memory pressure a system is under. When Linux runs low on available free memory, it will do one of several things. These include failing new memory allocations, swapping out certain parts of applications to disk, and freeing up internal caches. All of these have a cost in terms of overall system performance (particularly when the system hits a point of swapping large amounts of data out to slow rotational disks and is said to enter a state of ‘thrashing’) that has not historically been easily visible to end software. Now, applications can easily monitor a special event file descriptor within a memory control group and be made aware of ‘low’, ‘medium’ and ‘critical’ levels of memory pressure, as well as set specific limits at which a notification will be given. This allows an application to take steps to release unneeded in-memory caches (such as web browser pages) before the system grinds to a crawl.
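As a rough sketch of what monitoring that event file descriptor might look like, the following assumes the registration mechanism documented for the original (v1) memory controller: the application creates an eventfd, opens memory.pressure_level, and writes the string "<event_fd> <pressure_fd> <level>" to cgroup.event_control in the same group. The cgroup path and the function names are hypothetical, and os.eventfd requires Python 3.10 or later on Linux.

```python
import os

LEVELS = ("low", "medium", "critical")


def registration_line(event_fd, pressure_fd, level):
    """Build the string written to cgroup.event_control to register interest."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")
    return f"{event_fd} {pressure_fd} {level}"


def wait_for_pressure(cgroup_dir, level="medium"):
    """Block until the kernel signals the given pressure level for the group."""
    efd = os.eventfd(0)  # Python 3.10+, Linux only
    pfd = os.open(os.path.join(cgroup_dir, "memory.pressure_level"), os.O_RDONLY)
    try:
        with open(os.path.join(cgroup_dir, "cgroup.event_control"), "w") as ctl:
            ctl.write(registration_line(efd, pfd, level))
        # Blocks until at least one event is posted; returns the event count.
        return os.eventfd_read(efd)
    finally:
        os.close(pfd)
        os.close(efd)


if __name__ == "__main__":
    # Hypothetical mount point and group name for a v1 memory cgroup.
    count = wait_for_pressure("/sys/fs/cgroup/memory/browser", "low")
    print(f"received {count} pressure event(s); dropping caches")
```

An application such as a web browser would typically run this wait in a background thread and release its caches each time it returns.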