Kernel 3.15 merge window – the kernel column
Jon Masters summarises the latest goings on in the Linux Kernel community, including the Linux 3.15 merge window, and ongoing development
Linus Torvalds released the fourth ‘Release Candidate’ (RC) kernel for Linux 3.15, saying there was “[N]othing particularly unusual going on”, in that the majority of changes were now in driver and architecture code rather than core kernel code. This is to be expected at the midpoint between kernel releases. A typical Linux kernel development cycle completes at around RC7 or RC8, so we can expect 3.15 in time for the next issue.
The ‘cooling down’ period we are now in contrasts sharply with the hectic ‘merge window’ of just a few weeks ago, during which disruptive changes were pulled in for Linux 3.15. Among the changes were support for the new renameat2() system call (used to atomically swap two files, and as a building block for overlay filesystems), and a new instruction set for the JIT (Just In Time) compiler used by the in-kernel BPF (Berkeley Packet Filter, a concept borrowed from BSD) network packet filtering code. We’ll summarise Linux 3.15 in the next issue.
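As a rough illustration of the atomic swap that renameat2() makes possible, the sketch below calls it from Python through ctypes. It assumes a Linux 3.15 or later kernel, a C library that exports a renameat2() wrapper (glibc gained one in version 2.28) and a filesystem that supports the exchange flag. The helper name exchange() is our own invention, not part of any library.

```python
import ctypes
import os

AT_FDCWD = -100            # "relative to the current directory" (<fcntl.h>)
RENAME_EXCHANGE = 1 << 1   # atomically swap the two paths (<linux/fs.h>)


def exchange(path_a, path_b):
    """Atomically swap two existing paths (hypothetical helper name)."""
    libc = ctypes.CDLL(None, use_errno=True)
    try:
        renameat2 = libc.renameat2
    except AttributeError:
        # Older C libraries do not provide a wrapper for the system call.
        raise NotImplementedError("this C library has no renameat2() wrapper")
    renameat2.argtypes = [ctypes.c_int, ctypes.c_char_p,
                          ctypes.c_int, ctypes.c_char_p, ctypes.c_uint]
    renameat2.restype = ctypes.c_int
    result = renameat2(AT_FDCWD, os.fsencode(path_a),
                       AT_FDCWD, os.fsencode(path_b), RENAME_EXCHANGE)
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Unlike a sequence of three ordinary renames through a temporary name, there is no moment at which either path is missing: other processes see either the old arrangement or the new one.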
Smarter system suspend
Rafael J Wysocki recently posted a patch series aimed at improving how runtime-suspended devices are handled across a system suspend event. Linux supports several notions of hardware suspend, among them whole-system suspend and per-device runtime suspend. Whole-system suspend refers to the capability to power down every attached hardware device (such as the disk, screen, and USB devices on a laptop) in a certain order, and then place the whole system into a special low-power state from which it can be resumed later without a reboot. This is what enables you to close your laptop lid and wander off to your favourite coffee shop, only to reopen the laptop and resume from where you left off after acquiring an ice-cold Frappuccino. Whole-system suspend contrasts with per-device runtime suspend. The latter is concerned with using the same suspend code path to buy more battery runtime (or to generally save energy) by shutting down devices that are not being used at runtime. This might include powering down the disk in your laptop while it sits mostly idle displaying a webpage or text document.
Runtime suspend is a nifty feature that certainly improves the overall Linux-using experience, especially on laptops. But it introduces a complexity to the traditional whole-system suspend use case. When runtime-suspended devices enter the mix, Linux needs to determine what their impact is upon the process of placing the whole system into a lower-power state. For example, placing the whole system into a suspend state entails shutting off devices and later reinitialising them in a slightly more aggressive way that might require saving additional state information from those devices before they are powered off. But if those devices are already runtime suspended, it might be necessary to power them back up, only to shut them down again as the system suspends.
In many existing cases (PCI devices being the common example), the kernel avoids being too clever and instead wakes up every runtime-suspended device to ensure that its state has been properly saved. However, as Rafael points out, many devices do not in practice need different settings to be applied when waking from a whole-system suspend compared with what was already saved during runtime suspend. Thus, he adds new flags to various kernel data structures that allow device drivers to indicate that it is not necessary to runtime resume a device before a whole-system suspend event takes effect. This should reduce suspend times, and lead to more expeditious enjoyment of that ice-cold Frappuccino.
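The decision described above can be modelled in a few lines. This is only a toy sketch in Python, not kernel code: the Device class, the can_stay_suspended flag and the devices_to_wake() function are all hypothetical names chosen for illustration, and do not correspond to the actual fields added by the patch series.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    runtime_suspended: bool
    # Hypothetical flag: the driver says its runtime-suspend state is
    # already good enough for a whole-system suspend.
    can_stay_suspended: bool = False


def devices_to_wake(devices, honour_flag):
    """List the devices that must be runtime resumed before system suspend."""
    wake = []
    for dev in devices:
        if not dev.runtime_suspended:
            continue  # already awake, nothing to resume
        if honour_flag and dev.can_stay_suspended:
            continue  # driver opted out of the wake-up round trip
        wake.append(dev.name)
    return wake


laptop = [
    Device("disk", runtime_suspended=True, can_stay_suspended=True),
    Device("wifi", runtime_suspended=True),
    Device("screen", runtime_suspended=False),
]
old_behaviour = devices_to_wake(laptop, honour_flag=False)  # disk and wifi
new_behaviour = devices_to_wake(laptop, honour_flag=True)   # wifi only
```

In this model the conservative behaviour wakes both the disk and the wifi adapter, while honouring the opt-out flag wakes only the wifi adapter, which is where the reduction in suspend time would come from.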
Andy Lutomirski (of AMA Capital, who presumably has a desire for certain real-time behavioural characteristics in kernel performance due to financial trades or the like) posted a patch that modifies kernel entry and exit code on 64-bit Intel x86 platforms. Kernel entry and exit refers to the mechanical process of entering kernel code from userspace, switching to or from a hardware interrupt handler context, and so on. It involves saving and restoring various register state so that kernel and userspace can exist independently.
When the kernel itself is running on behalf of a user process (known as a ‘task’ within the kernel proper), usually because that process requested the kernel perform an activity through a system call, the kernel may receive a hardware interrupt or fault event that results in other kernel code running to service the cause of this interrupt while still within the kernel. Hardware interrupts are the main example case since the kernel itself is non-pageable, meaning that it won’t cause the kinds of memory faults that userspace will (though there are ways in which the kernel will generate such faults). The interrupt handler will diligently preserve state as part of running, and conventionally performs an ‘IRET’ machine instruction upon completion.
IRET is a context-synchronising, state-modifying instruction that causes the processor to return from the interrupt register context back to whatever it was doing before the interrupt occurred. But when the return is back into the kernel rather than into userspace, it can be faster to restore the register state directly in place of the IRET instruction (which is a lot more heavyweight in its action, and is not required in this instance). By changing this return path for in-kernel interrupts, Andy is able to save 100ns on each interrupt and trap that occurs within the kernel, which speeds up his kernel_pf microbenchmark by a whopping 17 per cent. Linus liked this optimisation, gave his sign-off, and requested review.
Luis Rodriguez, Stephen Hemminger and others debated the relative merits of the common practice of using network bridges to connect virtual machines to a physical network device. Often, Linux systems will create a virtual bridge during boot and then later add a physical network interface to the bridge (through which virtual machines will talk to the outside world, for example). When the bridge is first created, it will be assigned a randomised MAC address, which is then replaced with the MAC address of the network interface assigned to it. But if that network interface is removed, the bridge will have a MAC address filled with zeros rather than returning to the previous random one. This is a problem for a number of use cases (such as port blocking) which has led to the suggestion that the previous address be restored instead.
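The behaviour under discussion is easy to see in a toy model. The Python sketch below is not the kernel's bridge code: the Bridge class, its method names and the restore_on_remove switch are hypothetical, and it handles only a single port as in the scenario described above.

```python
import random

ZERO_MAC = "00:00:00:00:00:00"


def random_mac():
    """Generate a random, locally administered, unicast MAC address."""
    octets = [random.randrange(256) for _ in range(6)]
    octets[0] = (octets[0] | 0x02) & 0xFE  # set "local" bit, clear multicast bit
    return ":".join(f"{o:02x}" for o in octets)


class Bridge:
    def __init__(self, restore_on_remove=False):
        self.initial_mac = random_mac()  # assigned when the bridge is created
        self.mac = self.initial_mac
        self.restore_on_remove = restore_on_remove

    def add_port(self, port_mac):
        # The bridge takes on the address of the attached interface.
        self.mac = port_mac

    def remove_port(self):
        if self.restore_on_remove:
            self.mac = self.initial_mac  # the suggested behaviour
        else:
            self.mac = ZERO_MAC          # the behaviour described above


current = Bridge()
current.add_port("52:54:00:12:34:56")
current.remove_port()

proposed = Bridge(restore_on_remove=True)
proposed.add_port("52:54:00:12:34:56")
proposed.remove_port()
```

After the port is removed, the first bridge is left with an all-zero address, while the second returns to the random address it was given at creation.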
Peter Zijlstra posted an updated version of the sched_setattr manpage, which describes how to set the scheduling policies applied by the kernel. This now includes the new SCHED_DEADLINE option. Deadline scheduling in Linux is an implementation of Global Earliest Deadline First (GEDF), which aims to ensure that tasks do not overrun their specified scheduling budget while giving priority to those with time-critical requirements.
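The core idea of earliest-deadline-first scheduling can be sketched as a small simulation. The Python below is a simplified single-CPU model rather than the kernel's multi-CPU implementation: the Task class and edf_schedule() function are hypothetical names, time advances in whole ticks, each task's relative deadline is assumed equal to its period, and missed deadlines are not detected.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runtime: int  # budget per period, in ticks
    period: int   # also used as the relative deadline


def edf_schedule(tasks, ticks):
    """Return the name of the task run at each tick (None when idle)."""
    remaining = {t.name: 0 for t in tasks}
    deadline = {t.name: 0 for t in tasks}
    timeline = []
    for now in range(ticks):
        for t in tasks:
            if now % t.period == 0:  # a new job is released
                remaining[t.name] = t.runtime
                deadline[t.name] = now + t.period
        ready = [t for t in tasks if remaining[t.name] > 0]
        if not ready:
            timeline.append(None)
            continue
        # Run whichever ready task has the nearest absolute deadline.
        chosen = min(ready, key=lambda t: (deadline[t.name], t.name))
        remaining[chosen.name] -= 1  # budget is consumed one tick at a time
        timeline.append(chosen.name)
    return timeline


timeline = edf_schedule([Task("A", 1, 3), Task("B", 2, 5)], ticks=15)
```

Because a job stops being eligible once its budget reaches zero, no task can run for longer than its declared runtime within a period, and at every tick the task closest to its deadline goes first.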
Calls for papers have gone out in support of various events that will take place concurrently with the Linux Foundation’s LinuxCon conference in Chicago, in mid-August. These include James Morris’s call for papers for the Linux Security Summit, and Ted Ts’o’s call for papers for the Kernel Summit. The latter event is being run similarly to the revised format that was introduced last year: invitations will be provided to a small set of established contributors, while others will be able to nominate themselves to attend, especially in support of topics of great interest to attendees. As in previous years, the size of Kernel Summit is capped at 100 people.