LCA 2008 Kernel Mini-conf lightning talks

The lightning talks gave 60 / n minutes to talk to anyone who volunteered. Any subject was allowed, but the talk had to be sans slideware. On this particular occasion n = 6 people took the opportunity and all had some very interesting stuff to say.

Grant Grundler from Google kicked off with the intention of countering perceptions about the “Google black hole” - the idea that free software goes into Google but none comes out. He described a number of the contributions that Google is making to the kernel and talked about which areas of development are important for Google and which are not.

First up were containers: Google is interested in them so that it doesn’t have to use a full virtualization solution such as Xen or KVM. Apparently full virtualization doesn’t provide any benefit to them, though exactly why wasn’t made clear.

Kernel filesystems are also a priority. Google uses ext2, but they would like to move to something else. Unfortunately ext3 performance isn’t good enough, and the journalling features of ext3 are redundant in Google’s environment because everything is mirrored anyway. Google have backported a few changes from ext3 to ext2, but Grant didn’t drop any hints about Google sponsoring a new filesystem anytime soon.

Grant mentioned Google’s role in fighting the good fight for Linux drivers - the company is constantly evaluating new technologies and pushing vendors for Linux support. Because of the volume of hardware Google buy, they have more leverage than most.

Google are also interested in CPU performance tools and are sponsoring perfmon2 development because oprofile is not adequate for their needs.

Matthew Wilcox then described a very interesting project he’s working on to eliminate un-killable tasks from Linux. This annoying situation is quite common and arises when a task calls down() to take a semaphore and then goes to sleep waiting for some event to occur. In this state the task cannot receive a signal, so if the expected event never occurs the task cannot be killed. For this reason it’s preferable to call down_interruptible() rather than down(), since a task sleeping in down_interruptible() can receive a signal, but there are quite a few situations where the task just can’t be interrupted.

Matthew’s patch adds a third variant of the function called down_killable(). Once a task is in this state, it will be interrupted only by fatal signals. After receiving such a signal, the task will die as soon as it returns to user-space, so it will never see the effect of the terminated system call.
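
To make the difference between the three variants concrete, here is a toy Python model (not kernel code) of which signals can end a sleep in each state. The enum and function names are invented for illustration, and a "fatal" signal is reduced to a simple boolean flag.

```python
from enum import Enum


class SleepState(Enum):
    """How a task is sleeping, labelled by the call that put it there."""
    UNINTERRUPTIBLE = "down()"
    INTERRUPTIBLE = "down_interruptible()"
    KILLABLE = "down_killable()"


def signal_wakes_task(state, fatal):
    """Return True if a pending signal ends the sleep in the given state.

    `fatal` is True for a signal that will kill the task (for example
    SIGKILL) and False for one the task could catch or ignore.
    """
    if state is SleepState.UNINTERRUPTIBLE:
        return False          # no signal gets through: the un-killable case
    if state is SleepState.INTERRUPTIBLE:
        return True           # any signal interrupts the wait
    return fatal              # KILLABLE: only fatal signals interrupt
```

In this model the killable state behaves like down() for ordinary signals and like down_interruptible() for fatal ones, which is why the caller never has to worry about user-space observing the aborted system call.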

The somewhat tedious task of implementing down_killable() for 22 architectures is now complete, but there is still the larger task of changing all the calls to down() (430 according to a helpful audience member) to down_killable(). In each case, the call has to be changed, the return code checked, and if a signal was received the task must unwind whatever it was doing gracefully. There are also the 449 calls to lock_kernel() that should be changed to lock_kernel_killable(). Although this adds up to a pile of work, it can be done incrementally as with moving away from the Big Kernel Lock.
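
The shape of each conversion (change the call, check the return code, unwind on failure) can be sketched in Python as a rough illustration. This is a user-space mock, not kernel code: down_killable_stub() and update_record() are hypothetical names, and the stub assumes the usual kernel convention of returning 0 on success and a negative errno (here -EINTR) when a fatal signal ends the wait.

```python
import errno


def down_killable_stub(fatal_signal_pending):
    """Hypothetical stand-in for down_killable().

    Returns 0 if the semaphore was taken, or -EINTR if a fatal signal
    arrived while waiting (assumed convention, see lead-in).
    """
    return -errno.EINTR if fatal_signal_pending else 0


def update_record(fatal_signal_pending, log):
    """A converted call site: every step taken before the wait is undone."""
    log.append("allocate buffer")
    ret = down_killable_stub(fatal_signal_pending)
    if ret != 0:
        # A fatal signal ended the wait: unwind and propagate the error.
        log.append("free buffer")
        return ret
    log.append("update under semaphore")
    log.append("up()")
    log.append("free buffer")
    return 0
```

With plain down() there is no return code and no error path at all, which is why each of the several hundred call sites has to be inspected individually.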

Matthew mentioned that both Ingo Molnar and Nick Piggin are in favour of the patch because they’re responsible for the OOM killer. This patch should allow the OOM killer to work more effectively because currently a task chosen for termination in an OOM situation may not actually be killable.

Next up was Zach Brown from Oracle who gave a brief teaser for his talk on Friday on the Coherent Remote Filesystem (CRFS). Zach described it as a new network filesystem that can be used in place of NFS "if you want it to be reliable and perform well", which drew a few weary laughs from the audience.

Zach is trying to drum up interest in and contributions to the filesystem, which is still under heavy development. Cool tricks such as a cache coherency protocol are being used, and it has groovy features such as checksums, snapshots and a unique way of handling filesystem metadata that gives big performance gains over NFS. And it doesn’t use the BKL! Zach has some preliminary performance data available in this blog post.

I usually enjoy filesystem talks (disks are such ornery beasts that the solutions people come up with are invariably interesting), so I think I’ll be attending Zach’s talk on Friday.

Val Henson took the mike next to muse about a pet theory she has about disk IO scheduling: that it’s possible to have much more information than is currently available about how to submit IO requests to a storage device. With such information, instead of needing multiple schedulers, it would be possible to have a generic scheduler with tuning parameters that could be tweaked for a specific device. Val pointed out that while the things people know about disk operating parameters (assumptions like “sequential IO is fast”) have been true for a long time, they are changing very quickly as large-capacity solid-state storage becomes more common.

Val suggested a few parameters that might be interesting:

  • The number of IOs the device prefers to have outstanding
  • The maximum possible IOs per second
  • Preferred size for writes/reads
  • The exact tradeoff for sequential vs random IO. Random IO still incurs a penalty with SSDs, but it’s not as severe as with magnetic drives
  • The time taken to switch between IO at two different addresses
  • The device’s preferred alignment.
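
As a rough sketch of how such parameters might drive a single generic scheduler, the Python below bundles the list above into a record and uses it to plan one dispatch batch. Everything here is hypothetical: the field names, the sort threshold and the two example profiles (whose numbers are made up for illustration) are assumptions, not anything Val proposed in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    queue_depth: int         # IOs the device prefers to have outstanding
    max_iops: int            # maximum possible IOs per second
    preferred_io_bytes: int  # preferred size for reads/writes
    random_penalty: float    # cost of random IO relative to sequential (1.0 = none)
    switch_time_us: float    # time to switch between two addresses
    alignment_bytes: int     # preferred alignment


def align_down(offset, profile):
    """Round an offset down to the device's preferred alignment."""
    return offset - (offset % profile.alignment_bytes)


def plan_dispatch(pending_offsets, profile, sort_threshold=2.0):
    """Pick the next batch: sort by address only if random IO is expensive."""
    if profile.random_penalty >= sort_threshold:
        ordered = sorted(pending_offsets)
    else:
        ordered = list(pending_offsets)   # keep arrival order
    return ordered[:profile.queue_depth]


# Illustrative, made-up profiles for a magnetic disk and an SSD.
DISK = DeviceProfile(queue_depth=2, max_iops=150, preferred_io_bytes=512 * 1024,
                     random_penalty=50.0, switch_time_us=8000.0, alignment_bytes=4096)
SSD = DeviceProfile(queue_depth=4, max_iops=20000, preferred_io_bytes=128 * 1024,
                    random_penalty=1.5, switch_time_us=100.0, alignment_bytes=4096)
```

The same plan_dispatch() code serves both devices; only the profile changes, which is the essence of the "one scheduler, tuned per device" idea.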

Val also speculated on how this information could be obtained – it could be specified in a configuration file or the kernel might determine the device parameters experimentally by profiling the device. Either way, the kernel currently has a very simple model for IO which could be improved greatly. Hopefully there will be some interesting developments around these ideas in the near future.
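
The experimental route could, in a very simplified form, look something like the Python sketch below. It assumes some profiling pass has already collected latency and throughput measurements; the function names and the choice of the median as the summary statistic are my own assumptions.

```python
from statistics import median


def estimate_random_penalty(sequential_us, random_us):
    """Ratio of typical random-IO latency to typical sequential-IO latency."""
    return median(random_us) / median(sequential_us)


def best_queue_depth(throughput_by_depth):
    """Pick the queue depth that gave the highest measured throughput.

    `throughput_by_depth` maps a queue depth to the throughput observed
    while keeping that many IOs outstanding.
    """
    return max(throughput_by_depth, key=throughput_by_depth.get)
```

A configuration file would simply supply the same numbers directly instead of deriving them from measurements.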

Paul McKenney next posed a question for architecture maintainers about a possible problem with RCU in situations where the system is returning from a low power state and an NMI or SMI handler performs a specific type of RCU operation. That’s what I managed to get, anyway: Paul talks fast and was not catering for the uninitiated. Most of the talk consisted of a high-bandwidth data exchange between Paul and Dave Miller, who thought the problem might occur on SPARC. I wish I had been able to follow more of what was said, but I don’t have the background knowledge.

Last up was Dave Miller himself who gave us an overview of what’s been going on with networking. He has just made a pull request to Linus for 2.6.25 which contains just under 1500 patches, 700 for non-driver changes. A large number of these are for the network namespaces feature which is required to support containers.

Dave outlined the recent changes in the data structures for NAPI (as described in this LWN article) that sever the one-to-one relationship between network devices and interrupt lines. Modern network devices have multiple transmit and receive channels and multiple interrupts, and the driver must support this for best performance. Dave mentioned a new Neptune device (presumably this) that allows 32 RX channels and 24 TX channels! It has a hardware packet classifier so that RX interrupts for certain packet types can be routed to specific CPUs.
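
The classifier idea can be illustrated with a small Python model in which a flow's addresses and ports are hashed to pick an RX channel, and each channel's interrupt is tied to a CPU. This is only a sketch under stated assumptions: the function names are invented, CRC32 stands in for whatever hash the hardware really uses, and the channel-to-CPU mapping is a simple modulo.

```python
import zlib


def rx_channel(src_ip, src_port, dst_ip, dst_port, n_channels=32):
    """Map a flow to one of the device's RX channels (CRC32 as a stand-in hash)."""
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % n_channels


def cpu_for_channel(channel, n_cpus):
    """Assumed static mapping from an RX channel's interrupt to a CPU."""
    return channel % n_cpus
```

Because the mapping depends only on the flow, every packet of a given connection lands on the same channel and therefore interrupts the same CPU, while different connections spread across CPUs.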

Unfortunately, handling multiple transmit channels is not so easy because of the presence of the packet scheduler layer: load balancing on transmit can break the prioritisation done by the scheduler. Fixing the problem may involve a change in the default queueing discipline.
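
A toy Python simulation shows how the ordering can go wrong. It assumes a strict-priority scheduler (lower number means higher priority) and hardware that drains its TX channels round-robin; both assumptions, and all the names, are mine rather than anything Dave described.

```python
from itertools import zip_longest


def strict_priority_order(packets):
    """Order (name, priority) packets as a strict-priority scheduler would."""
    return sorted(packets, key=lambda pkt: pkt[1])   # stable sort keeps arrival order


def multi_channel_wire_order(packets, channel_of, n_channels):
    """Spread scheduled packets over TX channels, then drain them round-robin."""
    channels = [[] for _ in range(n_channels)]
    for pkt in strict_priority_order(packets):
        channels[channel_of[pkt[0]]].append(pkt)
    wire = []
    for round_ in zip_longest(*channels):
        wire.extend(pkt for pkt in round_ if pkt is not None)
    return wire


packets = [("a", 2), ("b", 0), ("c", 1), ("d", 0)]
channel_of = {"a": 1, "b": 0, "c": 1, "d": 0}   # hypothetical load-balancing result
```

With a single queue the wire order is b, d, c, a. With two channels it becomes b, c, d, a: packet c (priority 1) leaves before packet d (priority 0), even though the scheduler ordered them correctly.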

Zack Brown asked if there were any automated mechanisms for assigning a process to the CPU where the packets destined for it are being received. Dave has queried Ingo Molnar about this, and apparently the scheduler will push processes to CPUs where their wakeup events occur. However, this is not a panacea, as a process will then lose locality: it will no longer be enjoying the benefits of a hot CPU cache.
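
For comparison, the manual alternative to an automated mechanism is to pin the process by hand. The Python sketch below uses os.sched_getaffinity() and os.sched_setaffinity(), which are available on Linux only; pin_to_cpu() is a hypothetical helper, and choosing the right CPU (the one servicing the relevant RX interrupt) is left to the caller.

```python
import os


def pin_to_cpu(cpu):
    """Pin the calling process to a single CPU.

    Returns the previous affinity set so the caller can restore it.
    """
    previous = os.sched_getaffinity(0)    # 0 means the calling process
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})
    return previous
```

Pinning keeps the process next to its packets, but it also stops the scheduler from moving the process when that CPU is busy, so it trades one locality problem for a potential load-balancing one.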

A question was also asked about that hardy LCA perennial, netchannels. Dave described the idea as "not dead but it is smoldering." Netchannels introduce some difficult problems with packet filtering, and it’s a very big change with no evolutionary path. From Dave’s response it seems unlikely we’ll see movement on this soon, notwithstanding the work done by Evgeniy Polyakov.

linux.conf.au: where too much kernel is barely enough!