The Road to 3.10 – The Kernel Column
Jon Masters describes some of the features coming in Linux 3.10, and summarises the past month of activity in the Linux kernel community
Ed’s note – this article was originally published in LUD 128, when Kernel 3.10 was still in development
Linus Torvalds has announced four release candidate (rc) 3.10 kernels since he closed the merge window (the period of time during which intrusive changes are taken into the kernel). The merge window for 3.10 was the busiest in Linux history, with over 12,000 changesets (collections of patches implementing a specific new feature, bug fix or other change) merged in two weeks. The fourth release candidate was the most recent as we went to press (there are typically seven or eight in total, one per week, during a kernel release cycle). In announcing 3.10-rc4, Linus said, “rc4 is smaller than rc3 (yay!). But it could certainly be smaller still (boo!).”
Features merged into 3.10 include the full ‘dynamic tick’ support that we discussed in a previous issue. This will allow Linux systems to fully disable the periodic (housekeeping) timer tick under special circumstances. The timer tick is typically used by the Linux kernel to cause special kernel code to run at a certain interval, although the kernel already supports the notion of disabling the timer when a CPU is idle. The full ‘nohz’ solution takes this a step further by allowing the timer tick (on any CPU other than the first ‘boot’ CPU) to be disabled if that CPU is running only one process. It is therefore possible for a system administrator or programmer to carefully manage a given CPU (perhaps using cgroups) such that it will not experience the added latency of a periodic timer interrupt (caveat: the tick will still run once per second, but not 100-1,000 times per second as is traditional).
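To put rough numbers on that caveat, the short sketch below works out how often a traditional tick interrupts a CPU and how many of those interrupts the full ‘nohz’ mode would avoid. The function names are purely illustrative, and the tick rates shown are simply common build-time choices within the 100-1,000 range mentioned above.

```python
def tick_period_ms(hz):
    """Time between two periodic timer interrupts, in milliseconds."""
    return 1000.0 / hz


def ticks_avoided_per_second(hz, residual_hz=1):
    """Interrupts saved per second when only `residual_hz` ticks remain."""
    return hz - residual_hz


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        print(f"HZ={hz}: one tick every {tick_period_ms(hz):g} ms, "
              f"{ticks_avoided_per_second(hz)} interrupts avoided per second")
```

At the top of the range the CPU would otherwise be interrupted once every millisecond, which is why latency-sensitive workloads stand to benefit the most.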
The second most interesting feature (to this author) is that of the ‘mempressure control group’ support that has been worked on for some time. The mempressure control group allows suitably modified applications to use a special new API in order to register to receive notifications from the kernel whenever it is experiencing memory pressure (ie it is low on available memory and is going into ‘reclaim’). Memory pressure occurs frequently as modern systems have huge amounts of memory, but also large numbers of in-memory caches that will grow to consume all of the space available. The kernel has visibility into – and control of – its own internal caches, which it will grow and shrink as needed, but it traditionally has had no idea how much memory applications were using for caches that could be voluntarily relinquished as necessary.
Using the new mempressure cgroup, an application will be able to inform the kernel about its use of memory in the form of the number of reclaimable chunks (of a specified unit size) that can be freed if necessary. The kernel will then use this information when it experiences a memory crunch, trying nicely to ask applications to free memory (based upon knowledge of how much each has to work with) before the situation becomes dire enough to force the invocation of the dreaded ‘out of memory’ killer sledgehammer.
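The idea can be illustrated with a toy user-space model. This is not the kernel interface: the class, function and application names below are hypothetical, and the policy of asking the application with the most reclaimable memory first is an assumption made for the sake of the sketch.

```python
class App:
    """An application that has told the 'kernel' how much it could give back."""

    def __init__(self, name, reclaimable_chunks, chunk_size):
        self.name = name
        self.reclaimable_chunks = reclaimable_chunks
        self.chunk_size = chunk_size  # bytes per chunk

    def release(self, chunks_requested):
        """Free up to `chunks_requested` chunks; return the bytes freed."""
        freed = min(chunks_requested, self.reclaimable_chunks)
        self.reclaimable_chunks -= freed
        return freed * self.chunk_size


def relieve_pressure(apps, bytes_needed):
    """Ask apps to free memory; return the shortfall that remains (0 if none)."""
    by_size = sorted(apps,
                     key=lambda a: a.reclaimable_chunks * a.chunk_size,
                     reverse=True)
    for app in by_size:
        if bytes_needed <= 0:
            break
        chunks = -(-bytes_needed // app.chunk_size)  # round up to whole chunks
        bytes_needed -= app.release(chunks)
    return max(bytes_needed, 0)


if __name__ == "__main__":
    apps = [App("cache", 10, 4096), App("browser", 4, 8192)]
    shortfall = relieve_pressure(apps, 50_000)
    print("shortfall after asking nicely:", shortfall)
```

A non-zero shortfall corresponds to the case where asking nicely was not enough and harsher measures, ultimately the ‘out of memory’ killer, would come into play.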
Will Deacon (ARM) posted a patch entitled ‘Remove any correlation between IPC and BogoMIPs value’, the purpose of which was to address an issue with the presented ‘BogoMIPs’ value on modern ARM-based systems. BogoMIPs is a value that is familiar to many Linux users because it is printed during system boot and is often used in discussing the ‘speed’ of a system. But it actually means very little, especially on modern computers. BogoMIPs is simply a number that measures how quickly the system can do nothing – it’s a value used in a delay loop calculation that determines how many non-operations the system can perform in a second, such that small delays can be implemented as loops that do nothing. A system might claim to have a very high BogoMIPs value and in fact be ‘slow’ compared to another system with a lower BogoMIPs score.
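For the curious, the calibration idea can be sketched in a few lines of user-space Python. This is only an illustration and not the kernel's code: the function names are made up, and an interpreted loop is far slower than the kernel's tight delay loop, so the figure it prints says nothing about the machine it runs on (which is rather the point). The division by 500,000 mirrors the way the kernel scales its loops-per-second count into the printed value.

```python
import time


def busy_loop(iterations):
    """Do nothing, `iterations` times."""
    i = iterations
    while i > 0:
        i -= 1


def calibrate(min_seconds=0.05):
    """Estimate how many do-nothing loops run per second."""
    loops = 1024
    while True:
        start = time.perf_counter()
        busy_loop(loops)
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return loops / elapsed
        loops *= 2  # run was too short to time reliably; try a longer one


def bogomips(loops_per_second):
    """Scale loops per second into a BogoMIPs-style figure."""
    return loops_per_second / 500_000


def delay_loops(loops_per_second, microseconds):
    """Number of do-nothing loops needed to burn the requested delay."""
    return int(loops_per_second * microseconds / 1_000_000)


if __name__ == "__main__":
    lps = calibrate()
    print(f"{bogomips(lps):.2f} BogoMIPs (interpreter loop, not meaningful)")
    busy_loop(delay_loops(lps, 100))  # a delay of roughly 100 microseconds
```

Once the loops-per-second figure is known, a short delay is just a matter of running the loop the right number of times, which is all the value was ever intended for.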
Recent work by Will had converted ARM systems over to what are known as ‘architected timers’, which can be used in place of such delay loops. But in so doing, Will had drastically reduced the numeric BogoMIPs value printed on such systems. Although this means nothing in practice, it did confuse some users and led to complaints and protests, usually from those who were unaware that BogoMIPs in fact means nothing useful. To prevent too many more complaints, Will posted a patch providing a configurable means to tune the BogoMIPs value presented to the user. Options include BOGOMIPS_SLOW, BOGOMIPS_MEDIUM, BOGOMIPS_FAST (which he describes as the ‘marketing’ option) and BOGOMIPS_RANDOM (which does what it implies, generating random values). These modify the presented value of BogoMIPs, with BOGOMIPS_MEDIUM being the default. As Will noted in sending this out in early May, the patches “may look a little over a month late [for April Fool’s], but there is a serious reason for posting it”. If doing so reduces confusion, then it’s probably worth the new options.
Peter Anvin pondered aloud whether it was time to remove the ‘cpuinit markup’ from the kernel. This is a special annotation, used to mark a function as being of use only during CPU init and safe to remove from kernel memory afterward (to free up kernel memory). But Peter drew a parallel with the recent removal of ‘devinit’ annotations (previously used to allow freeing of memory used for code only executed at device insertion), which had been removed because most Linux systems are now sufficiently dynamic that the very notion of not plugging and unplugging devices on the fly seems quaint. Similarly, CPUs are hot-plugged and hot-unplugged sufficiently often these days (especially on servers, but also on other devices where the hot-plug infrastructure is used during suspend, virtual machine migration etc) that all kernel features should presume that this is normal behaviour.
Jiannan Ouyang (of the University of Pittsburgh Computer Science Department) posted about new research into ‘Pre-emptible Ticket Spinlocks’ for virtual machine guests that implement a “spinlock algorithm that can adapt to pre-emption”. Readers will perhaps be aware that virtual machines leverage hardware and software capabilities of modern microprocessors in order to provide VMs with slices of CPU time, much as regular applications are multitasked rapidly to create the illusion of concurrently executing programs. Because VMs are themselves timesliced, a few optimisations are possible, such as detecting when a virtual machine CPU (vCPU) is spinning while it waits for another vCPU to release a spinlock. In this case, there is no need for the vCPU to actually spin: that vCPU can be descheduled in favour of another until the spinlock is released.
Jiannan is among those who are researching further optimisations, such as ticket spinlocks and the like. He says his particular algorithm “improves VM performance by 5.32X on average, when running on a non-paravirtual VMM, and by 7.91X when running on a VMM that supports a paravirtual locking interface (using a pv pre-emptible ticket spinlock), when executing a set of microbenchmarks as well as a realistic eCommerce benchmark”. His mail triggered an interesting conversation between Rik van Riel and Peter Zijlstra, and some suggestions on improvements. The paper is available online.
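For readers who have not met ticket spinlocks before, here is a minimal user-space sketch of the plain (non-pre-emptible) variant. It is emphatically not Jiannan's algorithm, nor the kernel's implementation: the class and function names are illustrative, and an ordinary mutex stands in for the atomic fetch-and-add instruction that real hardware would provide.

```python
import threading
import time


class TicketLock:
    """FIFO lock: waiters are served strictly in the order they took a ticket."""

    def __init__(self):
        self._draw = threading.Lock()  # stand-in for an atomic fetch-and-add
        self._next_ticket = 0          # next ticket to hand out
        self._now_serving = 0          # ticket currently allowed to proceed

    def acquire(self):
        with self._draw:
            my_ticket = self._next_ticket
            self._next_ticket += 1
        while self._now_serving != my_ticket:
            time.sleep(0)              # spin, yielding so other threads can run
        return my_ticket

    def release(self):
        self._now_serving += 1         # only the current holder writes this


def run_demo(threads=4, rounds=50):
    """Hammer the lock from several threads; return tickets in service order."""
    lock = TicketLock()
    served = []

    def worker():
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)      # critical section
            lock.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return served


if __name__ == "__main__":
    order = run_demo()
    print("served in ticket order:", order == sorted(order))
```

The strict ordering is what makes the lock fair, but it is also what hurts inside a virtual machine: if the vCPU holding the next ticket has been descheduled by the hypervisor, every waiter queued behind it must wait too, even though the lock itself is free.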
Federico Vaga proposed that some specific network Address Family numbers be reserved for local (development and test) use. These constants, whose names begin AF_ and include AF_UNIX (UNIX domain sockets), AF_INET (Internet Protocol version 4) and AF_INET6 (IPv6), are used by modern UNIX and UNIX-like operating systems to refer to the type of a network connection.
Qiaowei Ren posted an updated (third) version of an Intel ‘TXT’ (Trusted Execution Technology) driver. This “will provide higher assurance of system configuration and initial state as well as data reset protection”. Effectively, TXT leverages the presence of a hardware TPM (Trusted Platform Module) specific to a given system and provides additional hardware that can be used to verify the authenticity of an OS, as well as the environment that it is executing within. Therefore, TXT can be used to prevent tampering, providing a provable chain of trust from firmware, to bootloader, to the OS.
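Returning briefly to the address family numbers mentioned above, the short sketch below shows where they surface in everyday code. It uses Python's standard socket module on a UNIX-like system; the helper function names are illustrative, and the numeric values printed are whatever the local platform defines rather than anything taken from Federico's proposal.

```python
import socket


def describe_families():
    """Map a few well-known AF_ names to their numeric values on this system."""
    names = ("AF_UNIX", "AF_INET", "AF_INET6")
    return {name: int(getattr(socket, name)) for name in names}


def echo_over_unix_pair(payload):
    """Send bytes across a connected pair of UNIX domain sockets."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b:
        a.sendall(payload)
        return b.recv(len(payload))


if __name__ == "__main__":
    for name, number in describe_families().items():
        print(f"{name} = {number}")
    print(echo_over_unix_pair(b"ping"))
```

The family passed when a socket is created tells the kernel which protocol implementation should handle it, which is why experimental protocols need numbers of their own that will not collide with the official ones.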
Finally, several proposals for mini-summits at the Linux Plumbers Conference (to be held in September) have been announced. These include one dedicated to ‘fastboot’ (reducing bootup latencies), and others related to security (Secure Boot of the form introduced by Microsoft into UEFI), and further topics besides.