Linux Kernel 3.15 development – the kernel column
Jon Masters summarises the latest happenings in the Linux kernel community, as Linux 3.14 final is released and the 3.15 merge window opens up
Linus Torvalds announced the release of Linux 3.14, saying that he was “feeling pretty good about it all”. The new 3.14 kernel includes a number of new features, among them deadline scheduling for real-time tasks. Traditional Linux systems have extended the concept of scheduling priorities to those special tasks that run in the real-time scheduling classes. Like their non-real-time brethren, real-time tasks are scheduled according to priority, with the highest-priority task receiving CPU time first. Unlike regular tasks, real-time tasks running with the SCHED_FIFO class are actually able to lock up a Linux system by hogging all of the available CPU time at maximum priority, which is one reason why real-time scheduling is a privileged operation.
There are alternative approaches to real-time scheduling that can be applied, including the concept of using deadlines in lieu of priorities. The new scheduling policy, SCHED_DEADLINE, aims to do just this. Tasks (processes) provide three parameters, known as ‘runtime’, ‘period’, and ‘deadline’. The Linux scheduler will then ‘guarantee’ (subject to various constraints) that a task receives ‘runtime’ microseconds of execution time in every ‘period’ microseconds, and that this time is made available within ‘deadline’ microseconds of the beginning of each scheduling period. Deadline scheduling is hardly new. It has been a topic of academic research for many years, but it is interesting that Linux now has some initial support for applying these concepts in real-world real-time systems. A full rundown of the features available in the 3.14 kernel is available over at the (always excellent) ‘Kernel Newbies’ website.
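The relationship between the three parameters can be sketched in a few lines of Python. This is a toy model rather than the kernel's admission control: the `DeadlineParams` class and the `admits()` helper are hypothetical names invented for illustration, and the check assumes only the ordering constraint runtime <= deadline <= period plus a simple cap on total utilisation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeadlineParams:
    """Hypothetical container for one task's parameters, all in microseconds."""
    runtime_us: int
    deadline_us: int
    period_us: int

    def is_valid(self) -> bool:
        # Ordering constraint: 0 < runtime <= deadline <= period.
        return 0 < self.runtime_us <= self.deadline_us <= self.period_us

    def utilization(self) -> float:
        # Fraction of one CPU that the task reserves in every period.
        return self.runtime_us / self.period_us


def admits(tasks, cpus=1):
    """Toy admission test: every task is well formed and the summed
    utilisation does not exceed the number of CPUs."""
    tasks = list(tasks)
    if not all(t.is_valid() for t in tasks):
        return False
    return sum(t.utilization() for t in tasks) <= cpus


# Illustrative values only: 10ms of runtime every 100ms, 2ms every 10ms.
video = DeadlineParams(runtime_us=10_000, deadline_us=30_000, period_us=100_000)
audio = DeadlineParams(runtime_us=2_000, deadline_us=5_000, period_us=10_000)
print(admits([video, audio]))
```

The two example tasks together reserve roughly 30 per cent of one CPU, so the toy test accepts them. A real scheduler applies further constraints before making its ‘guarantee’.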
With the release of Linux 3.14 came the opening of the merge window (the period of time during which disruptive changes to the kernel are allowed) for 3.15. Among the patches merged so far by Linus are those that implement the new ‘renameat2()’ system call we have covered in previous issues of this magazine. Using the new system call, developers can ensure that two files are exchanged on disk ‘atomically’. Thus replacing one file with another, or replacing a file with a directory, a directory with a symbolic link to another, or a similar operation, can now be made to appear as a single action. This is particularly useful in the implementation of various overlay filesystems, but there are other use cases as well. Additional high-profile features in 3.15 include ‘active/inactive list balancing’ (automatic adjustment of the ‘Least Recently Used’ page lists to reduce situations in which pages are swapped out and then swapped back in very shortly afterward), and the usual number of architecture-specific changes.
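As a rough illustration of the exchange operation, the sketch below calls renameat2() from Python through ctypes. It assumes a Linux system whose C library exports a renameat2() wrapper (glibc has done so since version 2.28) and a filesystem that supports the exchange flag. The `exchange()` helper is a hypothetical name; the two constants are the values from the kernel and C library headers.

```python
import ctypes
import os

AT_FDCWD = -100           # "relative to the current directory", from <fcntl.h>
RENAME_EXCHANGE = 1 << 1  # exchange flag, from <linux/fs.h>


def exchange(path_a, path_b):
    """Atomically swap two existing paths using the C library's renameat2()."""
    libc = ctypes.CDLL(None, use_errno=True)
    try:
        renameat2 = libc.renameat2
    except AttributeError:
        # Older or alternative C libraries may not provide the wrapper.
        raise NotImplementedError("C library has no renameat2() wrapper")
    renameat2.argtypes = [ctypes.c_int, ctypes.c_char_p,
                          ctypes.c_int, ctypes.c_char_p, ctypes.c_uint]
    renameat2.restype = ctypes.c_int
    result = renameat2(AT_FDCWD, os.fsencode(path_a),
                       AT_FDCWD, os.fsencode(path_b), RENAME_EXCHANGE)
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

After a successful `exchange("current", "next")`, each name refers to what the other referred to before, and no observer sees a moment at which either name is missing. Both paths must already exist.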
There is a tradition in the kernel community that every April 1 a convincing yet highly controversial April Fools’ posting is made. Some years, it’s a really good one, other years less so. This year fell more toward the former, more for the visceral reactions it generated than the idea itself. Chris Mason (who works over at Facebook) proposed a New Linux Patch Review Group, on Facebook (naturally), that would be used to perform all future Linux kernel patch review activity. Especially amusing was the suggestion that developers could now have “One-click patch or comment approval”, and “Comments enhanced with pictures and video”. Chris also proposed a new ‘Liked-by’ code tag to be used in future kernel patch submissions.
Blowing the kernel stack
There have been some renewed concerns of late around blowing the kernel’s (fixed) stack limit. Stacks are used by applications (including the kernel) to store local variables within functions. Typically, applications do not worry themselves about allocating the memory used for the stack. The stack is an automatically growing structure that is extended by the kernel whenever an application attempts to access beyond the current stack limit. Within the kernel itself, however, such a capability does not exist. Thus, the kernel uses a fixed (eg 8K) stack for each different kernel context. Depending upon how deeply nested the execution is within the kernel, the kernel may use more or less of this 8K. Some function call chains within the kernel can be extreme in terms of complexity, especially filesystem-related activity if the filesystem in question is sitting above an underlying device mapper or network volume. In such cases, it is already possible to come dangerously close to exceeding the kernel stack limit, which at a minimum will result in an oops.
Avoiding kernel stack overrun is a long-standing topic. Every so often a concerted effort is made to reduce the outlying worst-case call chain scenarios, but it is generally understood that there can be ways to tickle the kernel into exceeding the limit. Since kernel stacks are statically allocated, simply enlarging the stack is generally not preferred, although some architectures have increased this size (by a whole number of underlying memory pages) over the years as kernel complexity has grown. The kernel can actually be built with stack limit checking, in which case it will output those ‘used greatest stack depth’ kernel log messages showing how big the kernel stack became under certain situations.
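Those log messages report how many bytes of stack were left unused, so a little arithmetic turns them into usage figures. The sketch below is a minimal example that assumes the 8K stack size mentioned above and the ‘used greatest stack depth: N bytes left’ message format. The sample log line is invented for illustration, and `stack_reports()` is a hypothetical helper name.

```python
import re

THREAD_SIZE = 8 * 1024  # the 8K figure used in the text; it varies by architecture

# Optional "[ timestamp ]" prefix, then "comm (pid) used greatest stack depth: N bytes left".
PATTERN = re.compile(
    r"^(?:\[\s*[\d.]+\]\s+)?(?P<comm>.+?) \((?P<pid>\d+)\) "
    r"used greatest stack depth: (?P<left>\d+) bytes left"
)


def stack_reports(log_lines, thread_size=THREAD_SIZE):
    """Yield (command, pid, bytes_used) for each matching log line."""
    for line in log_lines:
        match = PATTERN.match(line)
        if match:
            left = int(match.group("left"))
            yield match.group("comm"), int(match.group("pid")), thread_size - left


# Invented sample input, not output from a real system.
sample = [
    "[   12.345678] kworker/0:1 (42) used greatest stack depth: 2104 bytes left",
    "[   12.400000] some unrelated message",
]
for comm, pid, used in stack_reports(sample):
    print(f"{comm} (pid {pid}) used about {used} of {THREAD_SIZE} bytes")
```

Feeding the output of `dmesg` through a filter like this gives a quick view of which tasks have come closest to the limit.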
The latest discussions around kernel stack limits began in reaction to a patch series from Eric W Biederman aimed at addressing a longstanding ‘DoS’ issue within Linux. The issue is that a file or directory that is a mount point within one namespace (that is, one with a filesystem mounted over it) can prevent the same file or directory from being removed in another namespace. The patches allow for ‘Detaching mounts on unlink’, but they also modify some code paths within the Linux Virtual Filesystem (VFS) layer in such a way as to make stack use dangerously high. Al Viro and Eric had a productive conversation and some tweaks were made to the patch (changing the mntput kernel function) to address these concerns.
Alejandra Morales posted about a project called Cryogenic that is concerned with ‘Enabling Power-Aware Applications on Linux’. The implementation is crude and modifies many kernel interfaces in ways that are bound to be impossible to get past upstream maintainers, but the fundamental idea is not a bad one. The intent of the work is to divide IO operations into those that are urgent and those that can wait a certain amount of time. Non-urgent IO operations are coalesced and scheduled to run alongside urgent ones, which trigger access to the underlying hardware. The idea is to save energy by powering down (or putting into a quiescent state) hardware that is not actually being used between IO ops.
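The coalescing idea can be modelled in a few lines. The sketch below is a toy queue, not Cryogenic's actual interface: the `CoalescingQueue` class, its method names and the notion of a per-operation `max_delay` are all assumptions made for illustration. Deferrable operations wait until an urgent operation arrives or until one of them runs out of slack, at which point everything pending is issued as one batch.

```python
class CoalescingQueue:
    """Toy model: hold deferrable operations until an urgent one arrives
    or until any deferred operation reaches its latest start time."""

    def __init__(self):
        self._pending = []  # list of (latest_start_time, name)

    def submit(self, name, now, max_delay=0.0):
        """Queue an operation and return the batch issued to the device,
        which is empty if the hardware can stay idle for now."""
        self._pending.append((now + max_delay, name))
        if max_delay == 0.0:
            # Urgent: wake the device and take everything pending along.
            return self._flush()
        return self.tick(now)

    def tick(self, now):
        """Flush if any deferred operation can wait no longer."""
        if any(latest <= now for latest, _ in self._pending):
            return self._flush()
        return []

    def _flush(self):
        batch = [name for _, name in self._pending]
        self._pending.clear()
        return batch


queue = CoalescingQueue()
print(queue.submit("log write", now=0.0, max_delay=5.0))     # deferred
print(queue.submit("stats upload", now=1.0, max_delay=5.0))  # deferred
print(queue.submit("user read", now=2.0))                    # urgent, flushes all three
```

In this model the device is touched once at time 2.0 rather than three times, which is where the hoped-for energy saving would come from.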
Matthew Wilcox posted the latest version of a patch that allows ext4 to gain direct access to NV-DIMMs as a backing storage device. NV (Non-Volatile) DIMMs are a memory technology that retains its contents when power is removed from a system, but in other ways behaves similarly to regular memory DIMMs (they are fast). A primary use for NV-DIMMs is to provide a block device upon which a filesystem can be mounted, but until now the kernel has featured wasteful overhead in managing reads and writes via the page cache. Willy’s patches allow the ext4 filesystem to bypass the page cache when mounted on NV-DIMMs.
Ulrich Drepper posted a question about NUMA pages. He wanted to know if the kernel provides an easy way for applications to determine which NUMA node a given page is located on. NUMA (Non-Uniform Memory Access) is an increasingly common technology, especially on high-end servers, in which certain portions of physical memory are seen as being “closer” to (having a stronger affinity with) particular processors. Memory is still coherent, meaning that the same physical memory is seen by all processors, along with any changes to it, but access is not uniform in terms of latency. When applications become aware of NUMA proximity information, they can help to make intelligent choices around which processors should touch certain memory. Normally the scheduler handles this process automatically by trying to keep processes local to their NUMA nodes and through in-progress work on automatic NUMA balancing, but there can be benefit to certain applications having a greater awareness of the underlying physical system topology.
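One existing source of this kind of information is the per-process `/proc/<pid>/numa_maps` file, which lists, for each memory mapping, how many pages currently sit on each node as `N<node>=<pages>` fields. The sketch below parses those fields. It is a minimal example: the sample line is invented for illustration, the helper names are hypothetical, and the file is only present on kernels built with NUMA support.

```python
import re

NODE_TOKEN = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(numa_maps_line):
    """Return {node: page_count} for one line of /proc/<pid>/numa_maps."""
    counts = {}
    for token in numa_maps_line.split():
        match = NODE_TOKEN.match(token)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
    return counts


def read_numa_maps(pid="self"):
    """Map each mapping's start address to its per-node page counts.
    Returns an empty dict if the file is missing or unreadable."""
    result = {}
    try:
        with open(f"/proc/{pid}/numa_maps") as maps:
            for line in maps:
                fields = line.split(maxsplit=1)
                if fields:
                    result[int(fields[0], 16)] = pages_per_node(line)
    except (OSError, ValueError):
        pass
    return result


# Invented sample line: ten anonymous pages, six on node 0 and four on node 1.
sample = "7f5c2a000000 default anon=10 dirty=10 N0=6 N1=4 kernelpagesize_kB=4"
print(pages_per_node(sample))
```

This reports placement per mapping rather than per page. For individual pages, the move_pages() system call can also be used purely as a query.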
Finally this month, a reminder that the Embedded Linux Conference (ELC) in San Jose is upon us. ELC generally features a large number of useful embedded kernel development talks. Videos will be available online after the event.