The kernel column #92 with Jon Masters
This month Linux kernel legend Jon Masters talks about the release period of 2.6.35 and the opening of the merge window. Also this month: old security vulnerabilities, AppArmor, SELinux and the ongoing suspend blockers debate…
The release last month of the 2.6.35 kernel brought with it the implicit opening of the ‘merge window’, the time following a kernel release during which Linus Torvalds will accept intrusive changes for what will become the 2.6.36 kernel, in another month or so.
The merge window began life as a loosely defined concept, back in the days when it was still assumed that there might be a 2.7 or 3.0 kernel release not too far down the line. As time went on, increasing numbers of changes fed into the 2.6 kernel and Linus (as well as others) sought to add quality controls aimed at reducing the quantity of regressions in functionality experienced by end-users.
One way to reduce regressions was to refuse to take intrusive changes once the merge window had been closed. Following a few releases in which Linus tested the waters, he came down hard on those who attempted to push big changes late into 2.6.35. And Linus was very satisfied with the results, so we can only expect to see more of this going forward. That’s actually a great thing for the Linux community.
Linux isn’t a toy operating system any more and there are real, actual users out there who depend upon it not randomly having horrible performance regressions and bugs. Even if users don’t directly run kernels released by Linus, the vendors which do ship those kernels must surely be thankful for extra quality control.
The 2.6.36 merge window
With the opening of the 2.6.36 merge window came the usual dozens of ‘pull’ requests from individual subsystem maintainers who have shiny new features tested and ready for Linus to pull into his official kernel tree – at least, for the most part. There were the usual rants, too. Picking on the SCSI tree maintained by James Bottomley has been the ‘cool’ thing to do for the past few releases. This time the complaint was not that the ‘pull’ request was last minute (it wasn’t), but that James had allegedly been sitting on some trivial fix for more than 24 hours. That seems excessively pedantic, even for those who are accustomed to unrealistically fast turnaround.
Moaning aside, there were some useful requests to merge features, such as Nick Piggin’s VFS (Virtual File System – the generic file system abstraction layer used by Linux) scalability patches, a new OOM (out-of-memory) Killer that is better able to isolate and target rogue processes when the system runs out of memory, the fanotify mechanism (useful for anti-virus and anti-malware companies that need to receive notifications of file status events), a switch in the ext3 file system from ‘writeback’ mode back to the safer ‘ordered’ mode default, and a request to pull the AppArmor Linux Security Module popularised on SUSE Linux systems after four years of effort.
The VFS scalability improvements provided a rare opportunity to see Linus Torvalds “tickled pink” (in his own words) about a new feature, enough to call it out even before he had really ‘opened’ the merge window for 2.6.36. The VFS enhancements include a new in-kernel lock type known as a ‘local global lock’ (lg_lock), which the VFS uses to allow relatively inexpensive (in terms of compute time) access to a per-CPU list of open files. The trade-off is a relatively more expensive access at close time if the file is closed on a different CPU from the one that first opened it. Still, in spite of Linus’s enthusiasm for the work, Nick asked that the bulk of the changes wait until 2.6.37. Some of the less intrusive bits made it into 2.6.36, but more will be following.
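To make the ‘local global lock’ idea a little more concrete, here is a minimal user-space sketch in Python. It is only a toy model, not the kernel’s lg_lock code: the class and method names (LocalGlobalLock, OpenFileTable and friends) are invented for illustration, and ‘CPUs’ are just list indices. It assumes one lock and one open-file list per CPU, so that the common operations touch a single lock while a walk over every list has to take them all.

```python
import threading
from contextlib import contextmanager


class LocalGlobalLock:
    """Toy 'local global lock': one ordinary lock per (pretend) CPU."""

    def __init__(self, ncpus):
        self._locks = [threading.Lock() for _ in range(ncpus)]

    @contextmanager
    def local(self, cpu):
        # Cheap path: take only the lock that belongs to one CPU.
        with self._locks[cpu]:
            yield

    @contextmanager
    def global_(self):
        # Expensive path: take every per-CPU lock, always in the same
        # order so that two global lockers cannot deadlock each other.
        for lock in self._locks:
            lock.acquire()
        try:
            yield
        finally:
            for lock in reversed(self._locks):
                lock.release()


class OpenFile:
    def __init__(self, name):
        self.name = name
        self.cpu = None  # which per-CPU list currently holds this file


class OpenFileTable:
    """Per-CPU lists of open files, protected by a LocalGlobalLock."""

    def __init__(self, ncpus):
        self.lock = LocalGlobalLock(ncpus)
        self.per_cpu = [[] for _ in range(ncpus)]

    def open_file(self, cpu, name):
        f = OpenFile(name)
        with self.lock.local(cpu):          # only this CPU's lock
            f.cpu = cpu
            self.per_cpu[cpu].append(f)
        return f

    def close_file(self, f):
        # May run on any CPU: it must take the lock of the CPU that
        # opened the file, which is the costlier cross-CPU case.
        with self.lock.local(f.cpu):
            self.per_cpu[f.cpu].remove(f)

    def all_open(self):
        with self.lock.global_():           # walk every list safely
            return [f.name for files in self.per_cpu for f in files]
```

In this model an open touches only the opening CPU’s lock, a close touches the lock of whichever CPU holds the file, and only a full walk pays for every lock.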
An old security vulnerability
Every once in a while (and sadly these days, that ‘once in a while’ seems to be becoming more frequent) the Linux community faces a nasty and embarrassing security exploit that requires immediate attention. We had one this month courtesy of Rafal Wojtczuk at ‘Invisible Things Lab’, who pointed out that Linux support for automatic stack extension (a common operating system feature) could be abused to escalate privileges via the X server, and via certain other software too.
The X server usually runs with root privilege, but allows untrusted clients to connect and store data within X heap memory. If it can be arranged for the X server stack (which automatically grows down from higher memory locations into lower ones) to grow down and run into the heap data stored by a client connecting to that server, then arbitrary code stored in that heap area can be executed with root privilege. The fix (which was deliberately made cryptic in the logs) was to add a ‘guard page’, a small piece of padding next to the stack, in order to help avoid a collision.
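The guard page idea can be shown with a little arithmetic. The sketch below is a deliberately simplified model rather than the kernel’s actual fix: the function name grow_stack, the 4 KB page size and the example addresses are all assumptions made for illustration. It treats the stack as growing down towards a neighbouring mapping, and refuses any growth that would leave less than the requested gap of unmapped pages between the two.

```python
PAGE = 4096  # assumed page size in bytes


def grow_stack(stack_bottom, heap_top, pages, guard_pages=1):
    """Return the new (lower) stack bottom after growing by `pages`.

    Raises MemoryError if fewer than `guard_pages` unmapped pages
    would remain between the stack and the mapping that ends at
    `heap_top`.
    """
    new_bottom = stack_bottom - pages * PAGE
    if new_bottom - guard_pages * PAGE < heap_top:
        raise MemoryError("stack would collide with neighbouring mapping")
    return new_bottom


# Hypothetical layout: the stack starts four pages above the heap.
heap_top = 0x10000
stack_bottom = 0x14000

# Growing by three pages still leaves one unmapped page as a gap.
print(hex(grow_stack(stack_bottom, heap_top, 3)))

# With no guard page the stack may grow until it touches the heap.
print(hex(grow_stack(stack_bottom, heap_top, 4, guard_pages=0)))
```

With the default of one guard page, asking for the fourth page raises MemoryError instead of letting the stack sit directly against the client-controlled data.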
Speaking of security, I mentioned earlier that AppArmor is finally to be merged into the kernel in 2.6.36. James Morris announced as much in his ‘Preview of changes to the Security subsystem for 2.6.36’, alongside mention that he planned to merge another security module named Yama. Written by Kees Cook at Canonical, Yama catches certain misuses of symbolic links in ‘sticky’ directories such as /tmp and attempts to enforce tighter controls on ptrace use. James was persuaded to change his mind about merging it because (in the words of Christoph Hellwig) Yama consists of “a random set of hacks” and isn’t a fully coherent security policy.
Somewhat painfully, Kees had originally been asked to write an LSM (problems with stacking this alongside full frameworks such as SELinux notwithstanding), but now his efforts were thwarted and Kees seemed positively confused at just where to go with the effort next. One hopes he doesn’t have to wait four years to have a resolution, as the AppArmor folks have done since first posting in 2006.
AppArmor now joins SELinux and TOMOYO as an official security framework. Unlike SELinux, AppArmor doesn’t attempt to provide the level of objects, rules and granularity that makes SELinux very powerful but also, on occasion, hard for mere mortals to understand. Instead, AppArmor uses path-based security to protect specific file system locations from attack.
Whereas SELinux doesn’t care that the system password database is stored in /etc/passwd, since it cares only that the file has the right context and rules associated with it (allowing for policy to be preserved on copy), AppArmor takes the view that most real-world systems care more about having protection on the specific path location /etc/passwd. Each solution has its devotees, as does TOMOYO, and more choice should provide even more chance for lengthy mailing-list argument about the ideal approach to system security.
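The difference between the two approaches can be sketched in a few lines of Python. This is a toy model only: it uses neither SELinux nor AppArmor policy syntax, the label name "passwd_label" is made up, and it assumes (as described above) that a copy keeps the original file’s label.

```python
from dataclasses import dataclass


@dataclass
class File:
    path: str
    label: str  # hypothetical security context attached to the file


PROTECTED_LABELS = {"passwd_label"}   # label-based rule (made-up label)
PROTECTED_PATHS = {"/etc/passwd"}     # path-based rule


def label_policy_protects(f):
    # Label-based view: only the context on the file matters.
    return f.label in PROTECTED_LABELS


def path_policy_protects(f):
    # Path-based view: only the location of the file matters.
    return f.path in PROTECTED_PATHS


def copy_preserving_label(f, new_path):
    # Assumption: the copy carries the original's label with it.
    return File(new_path, f.label)


passwd = File("/etc/passwd", "passwd_label")
backup = copy_preserving_label(passwd, "/tmp/passwd.bak")

# Both models protect the original file; only the label-based model
# still protects the copy sitting at a different path.
print(label_policy_protects(passwd), path_policy_protects(passwd))
print(label_policy_protects(backup), path_policy_protects(backup))
```

Neither answer is wrong in itself: the label follows the data wherever it goes, while the path rule protects whatever happens to live at /etc/passwd.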
Ongoing development: suspend blockers
The past month has seen lengthy debate continue on the best way for Linux to provide applications with finer-grained power management controls. The Android platform provides ‘suspend blockers’ – special kernel extensions that allow applications to specifically request that the system not go into a suspend state – that are used in combination with an aggressive default suspend policy that tries to put a system into a suspended state whenever possible (to save power).
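As a rough illustration of the policy just described, the following Python sketch models an aggressive ‘suspend whenever possible’ default that applications can veto. It is not Android’s kernel interface: the class SuspendPolicy, its methods and the blocker names are hypothetical.

```python
class SuspendPolicy:
    """Toy model: suspend is allowed only while no blocker is held."""

    def __init__(self):
        self._held = set()

    def block(self, name):
        # An application asks that the system stay awake.
        self._held.add(name)

    def unblock(self, name):
        # The application no longer needs the system awake.
        self._held.discard(name)

    def may_suspend(self):
        # Aggressive default: suspend whenever nothing objects.
        return not self._held


policy = SuspendPolicy()
policy.block("music-player")      # hypothetical blocker names
policy.block("download")
policy.unblock("music-player")
print(policy.may_suspend())       # "download" is still held
policy.unblock("download")
print(policy.may_suspend())
```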
There is a lot of support for some kind of mechanism like this in the official Linux kernel, but that support is contingent on there being a non-invasive way to do what Google achieves through making some fairly heavy changes to the internals of the kernel. Debate has largely centred around whether the kernel should provide a means to regulate the behaviour of applications, or whether applications should be trusted to make good choices – the former generally winning.
Paul McKenney – author of the RCU mechanism and a famous IBM engineer now working on the Linaro project – weighed in on the ‘suspend blockers’ debate with an effort to characterise the problem space in a series of informative emails entitled ‘Attempted summary of suspend-blockers LKML thread’.
Meanwhile, power management guru Rafael Wysocki worked on a related but distinct problem (targeting 2.6.36): tracking ‘wakeup events’ so that a system doesn’t lose track of events that should wake it up if they arrive while it is in the process of being suspended. We expect to see even more conversation on suspend blockers in the coming months.
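The race described above can be modelled with a simple event counter. This Python sketch is an illustration under stated assumptions rather than the kernel’s implementation: WakeupTracker and try_suspend are invented names, and the suspend ‘steps’ are plain callables standing in for the real work of putting devices to sleep.

```python
class WakeupTracker:
    """Counts wakeup events reported by (pretend) drivers."""

    def __init__(self):
        self.events = 0

    def report_event(self):
        self.events += 1


def try_suspend(tracker, steps):
    """Run each suspend step; abort if a wakeup event arrives meanwhile.

    Returns True if the suspend completed, False if it was aborted.
    """
    seen = tracker.events          # snapshot taken before suspending
    for step in steps:
        step()
        if tracker.events != seen:
            return False           # an event arrived: do not sleep
    return True


tracker = WakeupTracker()

# Nothing happens while suspending, so the suspend completes.
print(try_suspend(tracker, [lambda: None, lambda: None]))

# A wakeup event arrives halfway through, so the suspend is aborted.
print(try_suspend(tracker, [lambda: None, tracker.report_event]))
```

Comparing the counter against the snapshot is what stops an event that arrives mid-suspend from being silently lost.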