The Kernel Column – Linux Kernel 3.8
Jon Masters summarises the latest happenings in the Linux kernel community, including the closing of the development ‘merge window’ for the 3.8 kernel
Linus Torvalds closed the 3.8 kernel ‘merge window’ (the period of time during which disruptive changes are allowed into the kernel, and are then stabilised before final release) just prior to the Christmas holiday. In his announcement of the first 3.8 ‘release candidate’, Linus said, “The longest night of the year is upon us (and by ‘us’ I mean mainly people in the same time zone and hemisphere as I am. Because I’m too self-centred to care about anybody else), and what better thing to do than get yourself some nice mulled wine, sit back, relax, and play with the most recent RC kernel?” Some readers might question whether this is truly the most relaxing course of action, but nobody can fault Linus for trying to motivate developers to spend some holiday time testing code.
The 3.8 merge window was, according to Linus himself, the biggest merge window in the 3.x kernel series so far (in terms of raw number of changes going into the kernel codebase). It will contain a number of new and exciting features. Two that interest this author in particular are the support for transparent huge zero pages, and newly added support for the AllWinner ‘A1X’ series of system-on-chip ARM processors. The latter are very popular and inexpensive, and are more capable (in terms of compute) than the chip used in the Raspberry Pi, while being used in systems of a similar price. It is possible, for example, to purchase one of the popular ‘MK802’ plug-in TV dongles built using the AllWinner A10 CPU for $35. That yields a full ARM-based Linux system running at 1GHz, with 1GB RAM, USB, Wi-Fi, an SD card interface and full HDMI output. Even more capable systems at a similar price point are appearing all the time, so the A1X will remain popular.
Transparent huge zero pages are another 3.8 kernel feature that will be popular with users, although if the feature is working correctly, those who benefit from it may never realise that it is even there. Huge pages are a hardware feature of modern CPUs in which the built-in CPU virtual memory translation caches, known as TLBs (translation lookaside buffers), support both the conventional smallest unit of virtual memory page size of (typically) 4KB, as well as a much larger ‘huge’ page of 2MB, 4MB or more. This is useful because the CPU has only a limited number of entries in these much faster TLB caches, which it uses to store previously looked-up (translated) virtual memory mappings from addresses used by applications to those of the underlying hardware. By using huge pages, the hardware can keep translations for far more memory cached and so improve performance. But there is a catch to using huge pages.
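To put rough numbers on why larger pages help, the arithmetic can be sketched as below. The TLB entry count used here is an assumed, illustrative figure rather than a value from any particular CPU; only the 4KB and 2MB page sizes come from the text above.

```python
# Rough arithmetic only: ENTRIES is a hypothetical figure chosen for
# illustration, not taken from any specific processor.
KIB = 1024
MIB = 1024 * KIB

def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of virtual memory covered by a fully populated TLB."""
    return entries * page_size

ENTRIES = 64  # assumed number of TLB entries

small = tlb_reach(ENTRIES, 4 * KIB)  # conventional 4KB pages
huge = tlb_reach(ENTRIES, 2 * MIB)   # 2MB huge pages

print(f"4KB pages: {small // KIB}KB covered")
print(f"2MB pages: {huge // MIB}MB covered")
print(f"ratio: {huge // small}x")
```

With the same number of cached translations, the amount of memory covered grows in proportion to the page size, which is where the performance gain comes from.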
Historically, huge pages had to be manually assigned, but support for ‘transparent’ or automatic huge pages was added to the kernel some time ago and has been present in distributions for a number of releases. With the introduction of transparent huge pages came the unintended side effect that some systems would actually waste memory in the process. This is because when allocating conventional pages, the kernel has the option of using the ‘zero’ page, a special page that is read-only and full of zeros. When applications attempt to write to it, a process known as copy-on-write actually allocates and sets up the real page entries in the kernel. The transparent huge pages code did not have a similar concept, so applications mapping large amounts of contiguous memory might have a large number of huge pages filled with zeros allocated that were never used. Linux 3.8 addresses this situation by sharing a ‘huge’ zero page, similar to the regular zero page.
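The zero-page idea described above can be illustrated with a toy model. This is only a user-space sketch of the concept, assuming a simple list standing in for a page table; the class and method names are hypothetical and it does not reflect how the kernel implements it.

```python
PAGE_SIZE = 4096  # conventional small page size in bytes

class ToyAddressSpace:
    """Toy model of zero-page sharing; not the kernel's implementation."""

    ZERO_PAGE = bytes(PAGE_SIZE)  # one shared, read-only page of zeros

    def __init__(self, num_pages: int):
        # Every virtual page starts out pointing at the shared zero page.
        self.pages = [self.ZERO_PAGE] * num_pages

    def read(self, page: int, offset: int) -> int:
        return self.pages[page][offset]

    def write(self, page: int, offset: int, value: int) -> None:
        # Copy-on-write: the first write gives the page private storage.
        if self.pages[page] is self.ZERO_PAGE:
            self.pages[page] = bytearray(PAGE_SIZE)
        self.pages[page][offset] = value

    def private_pages(self) -> int:
        """Number of pages that consume memory of their own."""
        return sum(1 for p in self.pages if p is not self.ZERO_PAGE)

space = ToyAddressSpace(1024)
space.read(10, 0)       # reads are served by the shared zero page
space.write(10, 0, 42)  # only now is a private page allocated
print(space.private_pages())
```

In this model, an application that maps a large region but touches little of it pays only for the pages it writes to, which is the behaviour the huge zero page brings to transparent huge pages.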
During the merge window, some changes to the Video4Linux (V4L) code were merged that broke a user-space application (PulseAudio) by altering the return codes passed by a system call. Linus got particularly angry about this, telling the kernel developer concerned to “SHUT THE F*** UP!” when responding to protests that the user-space application was doing something wrong. Linus reminded everyone of longstanding policy by saying, “[If] a change results in user programs breaking, it’s a bug in the kernel. We never EVER blame the user programs. How hard can this be to understand?” Strong responses aside, there is an established history of never questioning even the weirdest of application behaviour, and of always endeavouring to retain compatibility. The patch in question was reverted by Linus and an alternative was worked out.
With the merge window closed, development has returned to a combination of new patch development and refinement of the existing 3.8 release candidates, which are several weeks in as of this writing. There are typically seven or eight release candidate kernels (spread over several months’ duration) for a typical kernel, meaning that we can expect 3.8 final sometime in February.
This past month saw an interesting series of conversations around hash collisions in next-generation file systems. Many modern file systems use a hash-based approach to store the names of individual filename entries within normal directories. A given name, such as ‘passwd’ (as in /etc/passwd), is passed through a hashing algorithm which generates one of a finite number of possible numeric values. This value is then used internally within the file system to determine where in the ‘data structure’ within the file-system metadata the given entry will be stored. If multiple files hash to the same location, a list is created. Such lists (buckets) are not typically very large because the hashing algorithm does a good job of keeping hash ‘collisions’ to a minimum. Sometimes, however, these lists can be artificially enlarged by creating special file-system entries that are known to generate collisions. This is what Pascal Junod blogged about in December. He raised a number of known issues with Btrfs and discovered a new bug in the code, for which Chris Mason (the author) has now posted a patch.
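The bucket behaviour described above can be sketched with a toy example. The hash function here is deliberately weak and entirely hypothetical (it is not the one Btrfs or any other file system uses), chosen so that colliding names are trivial to construct.

```python
from collections import defaultdict

NUM_BUCKETS = 8  # deliberately tiny so collisions are easy to see

def toy_hash(name: str) -> int:
    """Deliberately weak hash: sum of byte values modulo the bucket count."""
    return sum(name.encode()) % NUM_BUCKETS

def build_directory(names):
    """Place each name in the bucket (list) chosen by its hash."""
    buckets = defaultdict(list)
    for name in names:
        buckets[toy_hash(name)].append(name)
    return buckets

def longest_chain(buckets) -> int:
    """Length of the largest bucket, i.e. the worst-case lookup list."""
    return max((len(chain) for chain in buckets.values()), default=0)

ordinary = ["passwd", "group", "hosts", "fstab", "shadow", "motd"]
# Anagrams share the same byte sum, so they all land in one bucket.
crafted = ["abc", "acb", "bac", "bca", "cab", "cba"]

print("ordinary names, longest chain:", longest_chain(build_directory(ordinary)))
print("crafted names, longest chain:", longest_chain(build_directory(crafted)))
```

Every crafted name ends up in the same list, so a lookup degrades from checking a handful of entries to walking all of them. A real attack needs more effort against a stronger hash, but the effect on the data structure is the same.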
A lot of memory (virtual and otherwise) work is ongoing. Minchan Kim has continued working on support for volatile memory mappings. Using special parameters, applications can explicitly mark regions of memory as being volatile (the kernel is allowed to trash them at will), and unmark them as volatile when they are needed again. The application is able to determine whether the volatile memory was actually destroyed in the meantime. Related work includes a user-space memory shrinker from Anton Vorontsov, which allows applications to use a mempressure cgroup to register reclaimable chunks. When the system is low on memory, the application will be asked to reclaim a number of chunks. It will then update the kernel as to what was reclaimed. Both this and the volatile work are useful, for example, in applications that retain large caches (eg of webpage content) that can easily be regenerated.
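The volatile-memory workflow can be sketched as a pure user-space simulation. Nothing below calls a real kernel interface: the class, its method names and the simulated memory-pressure step are all hypothetical stand-ins, intended only to show the mark, reclaim and check-on-unmark sequence described above.

```python
class ToyVolatileCache:
    """User-space simulation only; no real kernel interface is used."""

    def __init__(self):
        self.data = {}         # key -> cached bytes
        self.volatile = set()  # keys the simulated kernel may discard
        self.purged = set()    # keys discarded while marked volatile

    def put(self, key, value):
        self.data[key] = value
        self.purged.discard(key)

    def mark_volatile(self, key):
        """Tell the simulated kernel this entry may be thrown away."""
        if key in self.data:
            self.volatile.add(key)

    def unmark_volatile(self, key) -> bool:
        """Return True if the contents survived, False if they were purged."""
        self.volatile.discard(key)
        if key in self.purged:
            self.purged.discard(key)
            return False
        return key in self.data

    def simulate_memory_pressure(self):
        """Stand-in for the kernel reclaiming every volatile range."""
        for key in list(self.volatile):
            del self.data[key]
            self.purged.add(key)
        self.volatile.clear()

cache = ToyVolatileCache()
cache.put("page", b"<html>rendered content</html>")
cache.mark_volatile("page")       # not needed right now
cache.simulate_memory_pressure()  # the system runs low on memory
if not cache.unmark_volatile("page"):
    # The contents were destroyed, so regenerate them.
    cache.put("page", b"<html>rendered content</html>")
```

The point of the design is that the application never blocks on reclaim: it simply finds out, at the moment it wants the data again, whether it has to rebuild it.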
Linux 3.7 introduced support for ARM’s new ‘AArch64’ 64-bit ARM architecture. Previous kernel cycles have sometimes introduced more than one new architecture. Although it seems as if this won’t be the case for 3.8, it does look like 3.9 could have two new architectures. James Hogan posted pretty comprehensive support for Imagination’s ‘Meta’ processor cores (hybrid CPU/DSP cores capable of running multiple RTOSs and regular kernels on hardware threads at the same time), while Vineet Gupta reworked the older 3.2 kernel support for Synopsys’s ARC processors to bring it up to date. The latter is interesting because it is intended to be a highly configurable and extensible processor architecture. Those implementing the (licensable) ARC processor can customise the number of instructions, registers and many other features, in a manner described online as being “like Lego blocks”.
Finally this month, a number of kernel developers have been considering putting the ‘Kernel Hacking’ menu options within the kernel Kconfig “on a diet”. Dave Hansen, along with other developers, considers that 120 possible options is now too many and that a number of these should be removed or split out. In particular, Dave posted a cleanup patch to move ‘debugfs’ out into the file-systems menu.