The kernel column #94 by Jon Masters
In his latest column, Jon Masters covers the new Tile architecture in the 2.6.36 kernel and scalability concerns in the IMA security subsystem, and looks ahead to 2.6.37 development…
This month saw the final release of kernel 2.6.36, and the closing of the following ‘merge window’ for new features to be merged into what will become the 2.6.37 kernel (more details about the latter in a moment). The 2.6.36 kernel features concurrency-managed workqueues, preliminary support for the fanotify mechanism discussed here in the past, final merging of the AppArmor security system used by some distributions for many years, and support for a new architecture, among many dozens of other significant improvements. The new kernel received patches from over 1,100 engineers for a total of nearly 11,000 changesets (collections of related changes to various kernel files) overall.
Workqueues are one of the not particularly sexy but necessary pieces of kernel infrastructure code that provide a means for developers to schedule execution of some function at a more convenient time. They run that function within a special kernel thread – visible using the ‘top’ and ‘ps’ commands – in what is known as ‘process’ context (Linux has historically drawn a sharp distinction between the limited ‘interrupt’ context and ‘process’ context in terms of which kernel features code may use at the time, but this is slowly changing with the arrival of newer threaded interrupt support). Over the years, workqueue use has become so profuse within drivers and other kernel code that a typical system may have many hundreds – or even thousands – of kernel threads dedicated to them, often competing at unfortunate times for the CPU, and at other times doing nothing other than existing and clogging up lists of running processes.
Tejun Heo has spent a lot of time and energy reworking workqueues into a more generic framework utilising a real thread-pool manager – really a form of special scheduling class combined with thread management – that creates and destroys special kernel threads as required. Workqueue items are scheduled to run within these worker threads, whose number varies dynamically with system load and available CPU resources, so that far fewer kernel threads now work together more smartly than before. Just as web and email servers benefit from thread pools, workqueue thread pools should yield better performance and improve scalability on very large systems, where there was more competition between individual worker threads in the past.
Tejun had to solve a number of hard problems in the process of reworking workqueues. One was the potential for nasty lock-ups when pending workers are waiting for others to complete, and those others are unable to run because they are blocked in memory reclaim operations. The answer was the addition of special ‘rescuer’ threads. Rather than have the system grind to a halt, Linux will recognise that workqueue worker threads aren’t making sufficient progress and kick off rescuers for those workqueues that have been defined using the WQ_RESCUER flag to indicate that they might need help in such situations.
Many Tiles, one processor
Perhaps one of the most interesting features in the latest kernel is support for yet another new kernel architecture. Architectures are fundamental types of computer from which entire families of related systems are built, using various processors that all support the same architecture. Intel’s x86 is one kind of architecture, from which the ‘PC’ platform is built; IBM’s POWER, used in servers, is another, while ARM Holdings’ ARM is popular in embedded devices.
An entirely new architecture requires a lot of time and monetary investment to develop, along with the accompanying physical hardware, as well as supporting tools to actually build code for the new architecture, which is why it doesn’t happen every day. Having said that, we’ve seen three new architectures added to Linux in the space of about a year, which might be an all-time high. These benefit from work done by Arnd Bergmann on his asm-generic reference ‘how-to’ example.
The latest architecture addition to Linux provides support for ‘Tile’ processors. These are highly scalable chips built from many ‘tiles’ that each function like a smaller computer. Each tile has the basic arithmetic and processing units, as well as a high-performance network-like router that allows it to communicate with other tiles and to share certain resources (such as a large virtual higher-level cache). The first public implementation has a total of 64 tiles. Although each tile runs at around 900MHz, the total 64-core processor can easily power through very complex and scalable computing problems. Tile is produced by Tilera Corporation, which was co-founded by MIT professor Anant Agarwal, who had previously developed highly scalable multiprocessors like the MIT ‘Alewife’ (allegedly named after a type of herring – or perhaps the Alewife subway station near MIT) back in the 1990s.
Even more impressive than the actual Tile architecture is the apparent quality of the work done to ‘port’ Linux to it, and the speed with which Tilera engineers made the changes needed to get the code accepted into the official kernel. I am very reliably informed that the Tile port should be regarded as an excellent example of how to do a new architecture port correctly. You can take a look at the code by browsing through the arch/tile subdirectory of the 2.6.36 kernel source, perhaps using the Cscope utility, or using an online tool such as LXR.
It’s about here that I usually like to insert something that went horribly wrong this month, typically of the nasty security gotcha variety. While there weren’t the horrible security vulnerabilities this month that we have seen in recent times, there was a security-related issue that bit a number of users. That came in the IMA (Integrity Measurement Architecture) code – a facility present in recent kernels that allows Linux to use the hardware ‘tamperproof’ TPM (security) chip present in modern systems to determine the correct value of file checksums and detect system intrusion attempts. One of the kernel.org folks noticed that a system running Fedora was wasting several GB of memory on SLAB caches – regions of kernel memory used for various data structures.
The problem was that the IMA design utilised a very inefficient data structure (a radix tree with very sparse nodes) that resulted in a lot of memory being wasted for every inode (a per-file object storing data about the file) in the system. For systems with millions of files, this wasted space would quickly add up, especially as it was consumed even if IMA itself had not been enabled at runtime. The latter issue arose because IMA might be enabled at a later time and would then need cached data for every file touched since boot, well before it was known that the user planned to turn IMA on. Kyle McMartin of the Fedora project posted a patch that helped with the immediate issue by adding boot-time options to disable IMA for those who never planned to use it, while others refactored the code to use an alternative data structure (rbtrees). It was remarked that this issue should never have slipped through code review.
The 2.6.37 kernel merge window closed a little early (being ten days in duration as opposed to the customary two weeks) as Linus Torvalds and many of the other kernel developers were on their way over here to Cambridge, Massachusetts, for the 2010 Linux Kernel Summit and Linux Plumbers Conference. Each year, the Kernel Summit (KS for short) gives developers spread across the globe (everywhere from Europe to the US, to Japan and all places in between) an opportunity to spend some time meeting face-to-face with other developers. There are over a thousand active developers working on the kernel today, and of these, the most active core set (usually fewer than a hundred) are invited based on merit. The agenda for this year’s KS included lofty sessions on ‘Core Vision’ and ‘What we do and don’t like about kernel development’.
Meanwhile, kernel development continues. This includes work on the newly open sourced (official) Broadcom brcm80211 wireless driver currently in the special ‘staging-next’ tree. It works for me (though not with suspend), but still needs work.
Other features being developed include the ongoing reduction of use of the Big Kernel Lock (and hopeful removal very shortly), a new feature for running work in a hardware interrupt context (kind of the opposite of workqueues), and of course many more things besides. We’ll cover some of those next time, and the outcome of Linus’s dunking in a ‘shark’ tank at the Linux Plumbers Conference.
For more info on improvements in the 2.6.36 kernel, as well as instructions for downloading and building it, visit kernelnewbies.org.