Jan 31st, 2008 | Free Software | 1 Comment
The lightning talks gave anyone who volunteered 60 / n minutes to talk. Any subject was allowed, but the talk had to be sans slideware. On this particular occasion n = 6 people took the opportunity, and all had some very interesting stuff to say.
Grant Grundler from Google kicked off with the intention of countering perceptions about the “Google black hole” - the idea that free software goes into Google but none comes out. He described a number of the contributions that Google is making to the kernel and talked about which areas of development are important for Google and which are not.
First up are containers - Google is interested in these so that they don’t have to use a full virtualization solution such as Xen or KVM. Apparently full virtualization doesn’t provide any benefit to them, though exactly why wasn’t made clear.
Kernel filesystems are also a priority. Google uses ext2, but they would like to move to something else. Unfortunately ext3 performance isn’t good enough, and the journalling features of ext3 are redundant in Google’s environment because everything is mirrored anyway. Google have backported a few changes from ext3 to ext2, but Grant didn’t drop any hints about Google sponsoring a new filesystem anytime soon.
Grant mentioned Google’s role in fighting the good fight for Linux drivers - the company is constantly evaluating new technologies and pushing vendors for Linux support. Because of the volume of hardware Google buy, they have more leverage than most.
Google are also interested in CPU performance tools and are sponsoring perfmon2 development because oprofile is not adequate for their needs.
Matthew Wilcox then described a very interesting project he’s working on to eliminate un-killable tasks from Linux. This annoying situation is quite common and is caused when a task calls down() to take a semaphore and then goes to sleep waiting for some event to occur. In this state the task cannot receive a signal; if the expected event does not occur the task cannot be killed. For this reason it’s preferable to call down_interruptible() rather than down() - in this state a signal can be received - but there are quite a few situations where the task just can’t be interrupted.
Matthew’s patch adds a third variant of the function called down_killable(). Once a task is in this state, it will be interrupted only by fatal signals. After receiving such a signal, the task will die as soon as it returns to user-space, so it will never see the effect of the terminated system call.
The somewhat tedious task of implementing down_killable() for 22 architectures is now complete, but there is still the larger task of changing all the calls to down() (430 according to a helpful audience member) to down_killable(). In each case, the call has to be changed, the return code checked, and if a signal was received the task must unwind whatever it was doing gracefully. There are also the 449 calls to lock_kernel() that should be changed to lock_kernel_killable(). Although this adds up to a pile of work, it can be done incrementally as with moving away from the Big Kernel Lock.
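To make the conversion concrete, here is a toy user-space model of the pattern described above. It is written in Python so that it can actually be run. It is not kernel C and it is not Matthew’s patch: ToySemaphore, update_record() and the fatal_signal_pending flag are all invented for illustration, and the -EINTR return value is my assumption about what a killable wait would report.

```python
import errno


class ToySemaphore:
    """User-space stand-in for a kernel semaphore (illustrative only)."""

    def __init__(self):
        self.held = False

    def down_killable(self, fatal_signal_pending):
        # Model: if a fatal signal arrives while waiting, give up and tell
        # the caller, instead of sleeping until the event finally occurs.
        if fatal_signal_pending:
            return -errno.EINTR  # assumed error code
        self.held = True
        return 0

    def up(self):
        self.held = False


def update_record(sem, log, fatal_signal_pending=False):
    """Hypothetical caller converted from down() to down_killable()."""
    log.append("allocate buffer")  # work done before taking the semaphore
    ret = sem.down_killable(fatal_signal_pending)
    if ret != 0:
        # The semaphore was NOT taken: unwind the earlier work and bail out.
        log.append("free buffer")
        return ret
    try:
        log.append("update record")  # critical section
    finally:
        sem.up()
    log.append("free buffer")
    return 0
```

The point is the extra branch: with plain down() there is no return code and no failure path, so every converted call site needs its own unwind logic, which is why the conversion can’t be done mechanically.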
Matthew mentioned that both Ingo Molnar and Nick Piggin are in favour of the patch because they’re responsible for the OOM killer. This patch should allow the OOM killer to work more effectively because currently a task chosen for termination in an OOM situation may not actually be killable.
Next up was Zach Brown from Oracle who gave a brief teaser for his talk on Friday on the Coherent Remote Filesystem (CRFS). Zach described it as a new network filesystem that can be used in place of NFS "if you want it to be reliable and perform well", which drew a few weary laughs from the audience.
Zach is trying to drum up interest in and contributions to the filesystem, which is still under heavy development. Cool tricks such as a cache coherency protocol are being used and it has groovy features such as checksums, snapshots and a unique way of handling filesystem metadata that gives big performance gains over NFS. And doesn’t use the BKL! Zach has some preliminary performance data available in this blog post.
I usually enjoy filesystem talks - disks are such ornery beasts that the solutions people come up with are invariably interesting - so I think I’ll be attending Zach’s talk on Friday.
Val Henson took the mike next to muse about a pet theory she has about disk IO scheduling: that it’s possible to have much more information than currently available about how to submit IO requests to a storage device. With such information, instead of needing multiple schedulers, it would be possible to have a generic scheduler with tuning parameters that could be tweaked for a specific device. Val pointed out that while the things people know about disk operating parameters - e.g. assumptions like “sequential IO is fast” - have been true for a long time, they are changing very quickly as large-capacity solid-state storage becomes more common.
Val suggested a few parameters that might be interesting:
- The number of IOs the device prefers to have outstanding
- The maximum possible IOs per second
- Preferred size for writes/reads
- The exact tradeoff for sequential vs random IO. Random IO still incurs a penalty with SSDs, but it’s not as severe as with magnetic drives
- The time taken to switch between IO at two different addresses
- The device’s preferred alignment.
Val also speculated on how this information could be obtained – it could be specified in a configuration file or the kernel might determine the device parameters experimentally by profiling the device. Either way, the kernel currently has a very simple model for IO which could be improved greatly. Hopefully there will be some interesting developments around these ideas in the near future.
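As a rough sketch of what “one generic scheduler plus per-device tuning parameters” might look like, the snippet below turns Val’s list into a data structure and drives a single policy function from it. Everything here is hypothetical: the field names, the 1.5 threshold and the policy itself are mine, not anything Val proposed or that exists in the kernel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceProfile:
    """Hypothetical per-device tuning parameters (names are illustrative)."""
    preferred_queue_depth: int  # IOs the device prefers to have outstanding
    max_iops: float             # maximum possible IOs per second
    preferred_io_bytes: int     # preferred size for reads/writes
    random_penalty: float       # cost of random IO relative to sequential
    switch_time_s: float        # time to switch between two addresses
    alignment_bytes: int        # preferred alignment


def plan(profile, requests):
    """One generic policy, driven purely by the device profile.

    requests is a list of (offset, length) tuples. Returns batches to submit.
    """
    # Reorder by address only when random IO is costly enough to matter.
    if profile.random_penalty > 1.5:  # arbitrary illustrative threshold
        requests = sorted(requests)
    # Submit in batches no larger than the device's preferred queue depth.
    depth = profile.preferred_queue_depth
    return [requests[i:i + depth] for i in range(0, len(requests), depth)]
```

Fed a profile with a high random_penalty and a shallow queue (a magnetic disk, say), plan() sorts requests by address and issues small batches. Fed a profile with a low penalty and a deep queue (more like an SSD), it leaves the order alone. The profile values themselves would come from a configuration file or from profiling the device, as Val suggested.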
Paul McKenney next posed a question for architecture maintainers about a possible problem with RCU in situations where the system is returning from a low power state and an NMI or SMI handler performs a specific type of RCU operation. That’s what I managed to get anyway; Paul talks fast and was not catering for the uninitiated. Most of the talk consisted of a high-bandwidth data exchange between Paul and Dave Miller, who thought the problem might occur on SPARC. I wish I was able to follow more of what was said but I don’t have the background knowledge.
Last up was Dave Miller himself who gave us an overview of what’s been going on with networking. He has just made a pull request to Linus for 2.6.25 which contains just under 1500 patches, 700 for non-driver changes. A large number of these are for the network namespaces feature which is required to support containers.
Dave outlined the recent changes in the data structures for NAPI (as described in this LWN article) which sever the one-to-one relationship between network devices and interrupt lines. Modern network devices have multiple transmit and receive channels and multiple interrupts, and the driver must support this for best performance. Dave mentioned a new Neptune device (presumably this) that allows 32 RX channels and 24 TX channels! It has a hardware packet classifier so that RX interrupts for certain packet types can be routed to specific CPUs.
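For anyone who hasn’t seen receive-side steering before, the general idea can be sketched in a few lines. This is only a software illustration under my own assumptions: the real classifier lives in hardware and uses its own rules and hash, and rx_channel(), cpu_for_channel() and the CRC32-of-a-string hash are hypothetical stand-ins.

```python
import zlib


def rx_channel(flow, n_channels):
    """Map a flow tuple to an RX channel.

    flow is (src_ip, src_port, dst_ip, dst_port, protocol). Hashing the whole
    tuple keeps every packet of one connection on the same channel.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % n_channels


def cpu_for_channel(channel, n_cpus):
    """Each RX channel has its own interrupt, bound here to a CPU round-robin."""
    return channel % n_cpus
```

With 32 RX channels and, say, 8 CPUs, each CPU would service four channels, and a given connection would always interrupt the same CPU.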
Unfortunately, handling multiple transmit channels is not so easy because of the presence of the packet scheduler layer - load balancing on transmit can break the prioritisation done by the scheduler. Fixing the problem may involve a change in the default queueing discipline.
Zack Brown asked if there were any automated mechanisms for assigning a process to the CPU where the packets destined for it are being received. Dave has queried Ingo Molnar about this, and apparently the scheduler will push processes to CPUs where their wakeup events occur. However, this is not a panacea as a process will then lose locality - it will no longer be enjoying the benefits of a hot CPU cache.
A question was also asked about that hardy LCA perennial, netchannels. Dave described the idea as "not dead but it is smoldering." Netchannels introduce some difficult problems with packet filtering, and it’s a very big change with no evolutionary path. From Dave’s response it seems unlikely we’ll see movement on this soon, notwithstanding the work done by Evgeniy Polyakov.
linux.conf.au: where too much kernel is barely enough!
Jan 29th, 2008 | Free Software | 1 Comment
Full on. Today felt like about three conference days in one. Between the Distro Roundup, the Kernel Mini-conf Lightning talks, the Kernel Panel discussion and other sessions I must have heard close to 20 people speak.
A lifetime ago at 8:30 this morning I sat down at breakfast across the table from Paul McKenney. Now to me he seemed like just a J. Random Bearded Hacker, but he’s actually the main guy behind the RCU implementation in the Linux kernel. Val Henson introduced him this afternoon at the kernel lightning talks as “one of the best computer science researchers I know”. Apparently he’s already done too many talks on RCU, so to get on the conference schedule this year he’s talking on his involvement in adding concurrency to the terribly exciting C++0x standard. Now that’s one talk I will be attending. Don’t you wish you were at LCA now?
I was thus slightly late for the first session of the day as I lost track of time chanting “We are not worthy” while Mr McKenney was trying to eat his cereal. I wandered into the Distro Roundup where community members representing various distros gave an overview of the history and current status of their distribution. Representatives from Oracle, Mandriva and Gentoo gave useful reports in the time that I was present. Mr Debian spent some time talking about the difficult political/ideological issues that have caused friction within the Debian community - how to deal with firmware “binary blobs” and the status of documentation covered by the GNU Free Documentation License. Binary blobs are not just an issue for Debian, but because of the project’s strict adherence to the Debian Free Software Guidelines, they have taken the problem very seriously and now will not ship such non-free firmware. Similarly, Debian regards the GNU FDL as a non-free license. It was clear from the talk that not all members of the community agree with these decisions, so the controversy could continue in spite of the current policies.
After morning tea I stuck with the Distro summit to hear Shane Owenby, Senior Director for Linux and Open Source at Oracle, talk on “Why would a large corporate create their own distro?” I should probably have migrated to the Kernel Mini-conf at this point but Shane was an engaging speaker and it was interesting to hear about Oracle’s goals for their Linux products apart from making money. Oracle wants to promote the adoption of Linux in the data centre by lowering the barriers to entry, which given the size and scope of their customer base they’re uniquely positioned to do. Shane engaged in some lively discussion with Bdale Garbee on Oracle’s Premier Backporting service. Bdale’s question, I think, concerned how Oracle can backport fixes to stable releases when other ISVs will only guarantee their applications on certain (unpatched) Oracle Enterprise Linux versions. No clear answer was given to this.
These and other discussions made Shane’s talk go overtime, so Jonathan Oxer didn’t have time for the full version of his very useful talk on Release Monkey. Simplifying, this is a set of scripts to help build packages for more than one distribution. This is a very common problem for small ISVs who want to distribute their products for Linux, as the time and cost in building for multiple distros can be prohibitive. I’ve stumbled over Release Monkey before when I was looking for a solution to just this problem for one of my previous employers. We were attempting to distribute a single product for SUSE, Red Hat 9, Debian 3.0, etc, etc and it was not a pleasant experience. James cooked up a system that worked pretty well, but I think there is a real need for a ready-made, full-featured tool for this task.
Jonathan emphasised that one of the main problems when packaging for multiple distros is that there’s no good way to capture the metadata required - stuff like package dependencies, version numbers, build instructions, etc. Release Monkey has adopted the (hackish) solution of using the Debian metadata and munging it for other distros. In our case, we maintained separate files for each type of package - .spec files for building RPMs and control/rules files for building Debian packages. This obviously introduced some maintenance overhead. Jonathan suggested that the ideal solution would be to define a distro-agnostic metadata format, but little progress has been made on this so far.
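To give a flavour of what “munging Debian metadata for other distros” involves, here is a minimal sketch that reads a few fields from a Debian control stanza and emits the matching RPM spec header lines. It is not Release Monkey’s code, and the function names are mine. It also ignores the hard parts: package names differ between distros, Debian versions can contain revisions and epochs that RPM treats differently, and alternative dependencies (a | b) are not handled.

```python
import re

_DEP = re.compile(r"^\s*([^\s(]+)\s*(?:\(\s*(<<|<=|=|>=|>>)\s*([^)\s]+)\s*\))?\s*$")


def parse_control(text):
    """Parse one Debian control stanza into a dict, folding continuation lines."""
    fields = {}
    key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and key:
            fields[key] += "\n" + line.strip()
        else:
            key, _, value = line.partition(":")
            key = key.strip()
            fields[key] = value.strip()
    return fields


def deb_dep_to_rpm(dep):
    """Turn 'libfoo (>= 1.2)' into 'libfoo >= 1.2'. Names are NOT translated."""
    match = _DEP.match(dep)
    if match is None:
        raise ValueError(f"unsupported dependency syntax: {dep!r}")
    name, op, version = match.groups()
    if op is None:
        return name
    op = {"<<": "<", ">>": ">"}.get(op, op)  # Debian's strict operators
    return f"{name} {op} {version}"


def spec_header(fields):
    """Emit the RPM spec header lines corresponding to the control fields."""
    lines = [f"Name: {fields['Package']}", f"Version: {fields['Version']}"]
    summary = fields.get("Description", "").split("\n")[0]
    lines.append(f"Summary: {summary}")
    for dep in fields.get("Depends", "").split(","):
        if dep.strip():
            lines.append(f"Requires: {deb_dep_to_rpm(dep)}")
    return "\n".join(lines)
```

Even this toy shows why Jonathan called the approach hackish: the field mapping is easy, but the dependency names coming out the other end are still Debian’s.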
At this point I’d had my fill of distro-talk so I wandered over to the Kernel Mini-conf hoping to hear Arnd Bergmann talk on “How not to invent kernel interfaces”, but his talk had been moved to 9:15 so I was out of luck. Instead I listened to Jörn Engel speak on “Cache-efficient Data Structures”. This is a very interesting topic but since I missed the start of the talk, I couldn’t quite follow the comparative performance numbers he had on his slides. There were a few interesting comments from the audience, including from Dave Miller and Linus (no link required). Dave is the kernel networking maintainer and knows a few things about hash tables as they are used extensively in the network subsystem for stuff like holding socket descriptors. Discussion followed on the problems involved in resizing hash tables. Currently several (large) hash tables are allocated at kernel boot time in one of two sizes, depending on the memory installed on the system. Some thought has been given to making these resizable at runtime to allow for both minimal memory usage and best performance, but synchronization issues make this very difficult. It sounds like there’s a fun project here for anyone who’s game enough.
After lunch I stuck with the Kernel Mini-conf to hear Jesse Barnes from Intel’s Open Source Technology Center talk on “Enhancing Linux Graphics”, or alternatively “Why Graphics on Linux suck and what we are doing about it”. Jesse described some of the major enhancements that are taking place to rationalize the motley assortment of software components involved in graphics on a Linux system - the kernel fb layer, DRM, X, Mesa, DirectFB, etc. This work (described here) will enable graphics without X, since things like modesetting will be handled by the kernel. From comments made by Dave Airlie, this is something of a holy grail for the graphics guys. Perhaps more importantly, Jesse’s work will finally allow displaying a “Blue Penguin of Death” when a kernel oops occurs, the absence of which has long hampered Linux’s ability to compete with rival operating systems.
Next up was Joshua Root from Gelato UNSW talking on “The state of the Elevator I/O scheduling in Linux”. The Gelato guys want to create documentation to help system administrators choose and tune an IO scheduler. Obviously, the performance of the 4 different schedulers in the kernel varies greatly with different load profiles. In particular Gelato have been looking at IO scheduler performance when software and hardware RAID are in use. Along the way they have found (and fixed) a number of bugs in the schedulers.
One thing I didn’t realize is the number of tools available for doing this kind of performance analysis on Linux. The blktrace tool (built into the kernel) can record everything that is happening in the block layer for later analysis using btt, the block trace timeline tool. btreplay can replay an event trace recorded with blktrace, or iomkc can be used to generate a Markov chain model of the trace so that workloads can be reproduced (or emailed) in kB rather than GB. Joshua showed some graphs (Yay!) of his performance results. Interestingly, while the more complex schedulers (anticipatory and CFQ) give better throughput in most situations, the simpler schedulers can give much lower average latency in some tests. As with much performance analysis, “it depends”.
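The Markov chain trick is worth a quick sketch, because it explains how a multi-gigabyte trace can shrink to a few kilobytes: you keep only the probabilities of moving from one kind of event to the next. The code below is my own minimal illustration of a first-order chain, not iomkc’s actual algorithm or file format, and the event labels are made up.

```python
import random
from collections import Counter, defaultdict


def build_chain(trace):
    """Turn a list of event labels into first-order transition probabilities."""
    counts = defaultdict(Counter)
    for current, following in zip(trace, trace[1:]):
        counts[current][following] += 1
    return {
        state: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for state, c in counts.items()
    }


def generate(chain, start, length, seed=0):
    """Generate a synthetic workload by walking the chain from a start state."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    state, out = start, [start]
    while len(out) < length and state in chain:
        choices, weights = zip(*chain[state].items())
        state = rng.choices(choices, weights=weights)[0]
        out.append(state)
    return out
```

The chain’s size depends only on the number of distinct event types, not on the length of the original trace, so it is small enough to email. The generated workload is statistically similar to the recorded one rather than identical, which is the tradeoff against replaying the full trace with btreplay.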
This blog post has now dragged on far too long, and I still haven’t covered the very interesting kernel lightning talks or the kernel developers’ panel. I’ve got extensive notes on both, but they’ll have to wait.
Jan 28th, 2008 | Free Software | 3 Comments
I arrived at linux.conf.au 2008 at the University of Melbourne last night, but didn’t manage to register until this morning. Everything went smoothly as usual except for the friendly (female) registration person addressing me as “Madam”. No doubt it had been a stressful morning.
The conference swag was pretty good this year: the bag is a good size and I can never have too many Red Hat caps or Trolltech beer coolers. The t-shirt is also a great design, easily the best of the LCA shirts I have lying around. This one can actually be worn in public without looking too uncool, a considerable achievement. It makes sense what with Melbourne being Australia’s fashion capital.
I kicked off with a presentation by Stuart Middleton as part of the Embedded Mini-conf. Stuart is a type of geek previously unknown to me, a “robotics artist”. Hexapod creations are his speciality. He told us a great story about convincing the Wellcome Trust to give him $2 million to build a giant hexapod walking platform for Stelarc. The first version, costing $1 million, twice tore itself apart as soon as it was started because the design “wasn’t quite right.” Such expensive failures can be embarrassing, but apparently this is not too much of a problem because according to Stuart “being artists we can usually come up with some bullshit to explain it”. Very entertaining.
In the second session I stayed with the Embedded Mini-conf for Ben Leslie explaining how to port the OKL4 operating system to a new platform - in this case the Goldfish simulator provided with the Android SDK. I’ve seen Ben present before at SLUG and he always pulls off a slick talk. But he moves fast! This talk was a good introduction to both OKL4 and embedded programming in general.
I then jumped ship to the Security Mini-conf to hear Enno Davids talk on “Self Healing networking”. After a general introduction to network security threats and countermeasures he started talking about the most severe current threat to modern networks - DoS and DDoS attacks. There are currently few effective countermeasures available to deal with the huge botnets that are now being created for profit by well-organized criminal groups. Enno claimed that large botnets can now create aggregate data rates of up to 24 Gbps, which is more than the total bandwidth connecting Australia to the rest of the ‘net!
Enno presented some defensive strategies that use ICMP redirect packets to force the botnet zombies to redirect their traffic somewhere else (say 127.0.0.1), but this is not trivial to do and in any case not effective against the largest botnets. He also proposed some small extensions to ICMP that if implemented could help mitigate such attacks in the future. There was some discussion with the audience of the possibility of distributed responses to DDoS attacks, i.e. calling on friendly networks to help repel an attack. At some point this boils down to “my botnet versus your botnet”, which some wit announced is “coming soon to a Fox channel near you”.
All up a very interesting talk.
After lunch I headed to the Fedora Mini-conf to see Eugene Teo talk on “Writing SystemTap Scripts”. The talk was a good basic introduction to this very useful tool. I attended a similar talk last year at LCA in Sydney, and I’m sorry to say I haven’t actually used SystemTap in the intervening time. But I still think it’s way cool. Eugene also showed us some of the SystemTap scripts he’s been writing, which was fine, but I would have liked it better if he had used the scripts to generate some data suitable for munging into pretty graphs. But that’s just me, I really like graphs.
Next up I checked in on the Community Wireless Mini-conf to hear James Cameron speak on Wireless Design & Testing for the One Laptop Per Child project. James lives somewhere in rural and regional Australia and was sent some XO units to do wireless testing because of the quiet radio environment, which is similar to areas in the developing world where the XO will be deployed. He also tested an antenna extension gadget that seems to be still in development. James presented some numbers on the achievable range using the XO. With two machines 1.5m above the ground, they can communicate as far as 1.6km 95% of the time, which sounds pretty impressive. Unfortunately, for reasons known only to RF gurus, the range drops off significantly when the XOs are closer to the ground. Jim Gettys was in the audience which made for a great Q&A session as he could fill in any gaps in James’ knowledge of the project.
Following afternoon tea I saw Mikko Leppanen talk on “Adventures in Consumer Electronics with GStreamer” as part of the Multimedia Mini-conf. I should probably have spent more time in this mini-conf since I do multimedia stuff for a living now, but that’s just how it worked out. Mikko works for Nokia, specifically writing media playback software for the N810 Internet Tablet. GStreamer is used extensively in the product and Mikko is obviously a big fan, praising GStreamer for being popular, scalable, pluggable and hackable. During question time I asked Mikko how he would compare GStreamer to other multimedia frameworks that he’s used. He commented that the key to a good multimedia framework is a good codec abstraction, and compared to others he’s used such as Helix and the Symbian multimedia framework, GStreamer is clearly superior. He also claimed that OpenMAX has taken quite a few ideas from GStreamer, which he considers a strong endorsement of GStreamer’s design.
Last up in today’s open-source onslaught was Richard Keech from Red Hat talking on “Provisioning Red Hat/Fedora systems using custom builds and Kickstart” as part of the Fedora Mini-conf. Frankly this is not the sort of thing I do on a daily basis, but I like automation and packaging so I had to go. Richard laid out the considerable benefits of his approach - it becomes very easy to reproduce the same machine configuration for testing, development, disaster recovery, etc, but you still get much more flexibility than when creating HD images. During the talk he built, installed (on VMware) and booted a custom build of RHEL. This can be done quickly with a reduced number of packages in the installation.
All up the day was a strong start to what should be another fantastic LCA.