If only lawyers could write Perl modules

With so many powerful programming languages freely available, it’s very common for large software systems to use more than one. Write some C, use a scripting language and do some database access and there are three already. Even if the deployed code is only in one language, test scripts and harnesses often use another. Multiple languages are a good thing if it means the right tool is used for the right job.

But there are annoyances. In particular, violations of the Single Point of Truth (SPOT) rule are common. For example, here’s a C++ enum containing error codes:

enum FooErrors
{
    FOOERR_OK = 0,
    FOOERR_FILE_NOT_FOUND = 1,
    FOOERR_IT_JUST_BORKED = 2,
    // further constants follow
};

If the same constants are used by a part of the system written in a different language, the cheap and cheerful solution is just to declare them again:

class FooErrors
{
    public static final int FOOERR_OK = 0;
    public static final int FOOERR_FILE_NOT_FOUND = 1;
    public static final int FOOERR_IT_JUST_BORKED = 2;
    // contains all the same constants as in C++
}

This is fine as far as it goes, and I’ll admit that most developers are dealing with bigger problems than a few duplicate declarations. But if the constants are used by more than two languages, keeping everything in sync becomes a maintenance burden.

I’d like a simple script that acts as a sort of poor-man’s IDL compiler, reading a text file containing names and values and spitting out nicely-formatted declarations in a variety of languages. A new constant would be added by updating the text file, running the script and committing the modified source files to the nearest version control system.
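As a rough idea of what such a script might look like, here is a minimal sketch in Python. The input format (one NAME = VALUE pair per line, with # starting a comment) and the function names parse_constants, emit_cpp and emit_java are assumptions made up for illustration, not an existing tool. A real version would read the text file from disk and write the generated source files out.

```python
# Hypothetical input format: one "NAME = VALUE" pair per line; '#' starts a comment.
SPEC = """\
# error codes shared by every language in the system
FOOERR_OK = 0
FOOERR_FILE_NOT_FOUND = 1
FOOERR_IT_JUST_BORKED = 2
"""


def parse_constants(text):
    """Return a list of (name, value) tuples in file order."""
    constants = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        constants.append((name, int(value)))
    return constants


def emit_cpp(enum_name, constants):
    """Render the constants as a C++ enum."""
    body = "\n".join(f"    {name} = {value}," for name, value in constants)
    return f"enum {enum_name}\n{{\n{body}\n}};\n"


def emit_java(class_name, constants):
    """Render the constants as a Java class of static final ints."""
    body = "\n".join(
        f"    public static final int {name} = {value};" for name, value in constants
    )
    return f"class {class_name}\n{{\n{body}\n}}\n"


if __name__ == "__main__":
    constants = parse_constants(SPEC)
    print(emit_cpp("FooErrors", constants))
    print(emit_java("FooErrors", constants))
```

Supporting another language would mean adding one more emit function, while the text file stays the single point of truth.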

This seems like such an obvious thing to do that I was sure I’d find a Perl module or seven on CPAN to do it. But I couldn’t find anything.

What I did find were two patents describing almost exactly this idea. The first, US Patent 7143400, titled “Configuration description language value management method and system”, contains this in the summary:

… the present invention fills this need by providing a method and a system for centralizing the maintenance of name value pairs for defining constants and properties used by different portions of a program, where the different portions are of a different programming language type.

The second, US Patent 6964038, titled “Constant values in mixed language programming environments”, is described as:

a method of and apparatus for maintaining consistency between header files for differing computer program languages. More particularly, the invention relates to automatically generating one or more header files in a programming language based on a header file in a different programming language.

The assignee of the first patent is Sun Microsystems, Inc. and the assignee of the second is the Hewlett-Packard Development Company, L.P.

I haven’t read the patents thoroughly, so I guess there could be some patent-worthy ideas in them. Maybe. The thing that irks me is that for what it cost in lawyers to file these two patents you could build the finest constant-generating system the ‘Net has seen, supporting a bunch of languages with all the bells and whistles. And you might just get something useful for your money.

If you know of a good free tool (potentially infringing or not) for this simple problem, please comment.

Mascot deployed

Moe on my desk

Like 38% of software development organizations globally, we use characters from The Simpsons as host names for our servers and dev boxes. While perhaps not as cool as LOTR characters, The Simpsons are a good choice as their names are usually short (less typing) and there are several hundred of them, a number suspiciously close to the number of addresses on a /24 network.

My RHEL 4 dev box is called Moe, my 2nd favourite character (CBG is #1). For some reason I told my girlfriend this and rather than run screaming in the other direction she bought me a very cool Bobble-head Moe by Funko. He’s now cheering me on as I battle deeply-nested control structures on a daily basis.

Thanks babe!

C++ code metrics with cccc

After my last post I thought I should look a little deeper into code metrics. Unsurprisingly, a lot has been done in this area: researchers have been investigating metrics since at least the mid-70s. I’m not sure how active the field is today.

There are numerous commercial offerings of tools that will generate metrics for a codebase, but relatively few open source ones, at least for C and C++. Presumably this is because of the difficulty of developing a parser for the tortured syntax of C++. The best open-source tool I found was cccc, which unfortunately is no longer under active development. cccc was written by Tim Littlefair for his PhD at Edith Cowan University in Perth, making it home-grown open source. Cool! It uses PCCTS (The Purdue Compiler-Compiler Tool Set) as a parser and generates XML and HTML files containing the calculated metrics.

The range of metrics calculated is good, although the HTML output is fairly basic (sorry Tim), and there are no graphs. I ran cccc over my pet project Springysim; the resulting output is here.

The metrics produced by cccc are divided into three groups: procedural, object-oriented and structural:

Procedural metrics include Lines of Code (LOC), Lines of Comment (COM), McCabe’s cyclomatic complexity measure and various ratios of these numbers. The concept of cyclomatic complexity was introduced by McCabe in his 1976 paper and the cccc documentation has this to say about it:

The formal definition of cyclomatic complexity is that it is the count of linearly independent paths through a flow of control graph derived from a subprogram. A pragmatic approximation to this can be found by counting language keywords and operators which introduce extra decision outcomes. This can be shown to be quite accurate in most cases. In the case of C++, the count is incremented for each of the following tokens: ‘if’, ‘while’, ‘for’, ‘switch’, ‘break’, ‘&&’, ‘||’

This intuitively seems like a useful metric, although I’d like to read some studies validating it in practice.
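To make the token-counting approximation quoted above concrete, here is a small Python sketch. It is not how cccc does it (cccc uses a real parser). This version is a naive regular-expression scan that will also count tokens inside comments and string literals. Starting the count at 1 for the single straight-through path is my assumption, following the usual convention, and the function name is made up.

```python
import re

# The tokens listed in the cccc documentation quoted above.
DECISION_TOKENS = re.compile(r"\b(?:if|while|for|switch|break)\b|&&|\|\|")


def approx_cyclomatic_complexity(source):
    """Approximate McCabe's measure as 1 plus the number of decision tokens.

    Assumption: the base value of 1 stands for the single path through
    code that contains no decisions at all.
    """
    return 1 + len(DECISION_TOKENS.findall(source))


SNIPPET = """
int clamp(int x, int lo, int hi)
{
    if (x < lo || x > hi) {
        return x < lo ? lo : hi;
    }
    return x;
}
"""

if __name__ == "__main__":
    # One 'if' and one '||' give two decision tokens, so this prints 3.
    print(approx_cyclomatic_complexity(SNIPPET))
```

Note that the ternary operator in the snippet is not counted, simply because it is not in the quoted token list.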

Object-oriented metrics produced by cccc for each class include:

  • Weighted methods per class (WMC). In the simplest case the weighting of each method is just one. cccc also provides WMCv, which only counts public and protected methods.
  • Depth of inheritance tree (DIT)
  • Number of children (NOC)
  • Coupling between objects (CBO). This is the number of other classes that are coupled to a class either as clients or as suppliers.

All these metrics were originally proposed by Chidamber and Kemerer in their 1994 paper A Metrics Suite for Object Oriented Design. It’s not a bad read, but does spend quite some time proving that the proposed metrics satisfy various formal properties proposed by Weyuker in her 1988 paper Evaluating Software Complexity Measures; these parts might be a little dry for some. But it’s not all ivory tower stuff: they also evaluated the metrics by collecting empirical samples at two different software development organisations. However, no attempt was made to correlate the code metrics with project outcomes such as defect rates or maintenance costs.

Unfortunately cccc does not calculate the 5th and 6th metrics suggested by Chidamber and Kemerer. The 6th metric, Lack of Cohesion in Methods (LCOM), examines which instance variables are used by which methods of a class. A class with a single instance variable that is used by all methods has high cohesion, while a class with many instance variables each used by few methods will have low cohesion. This seems like an interesting metric for OO designers to know.
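Since cccc doesn’t compute LCOM, here is a minimal Python sketch of the calculation as I understand the 1994 definition: count the pairs of methods that share no instance variables, subtract the pairs that share at least one, and floor the result at zero. The function name and the dictionary input (method name mapped to the set of instance variables it uses) are assumptions for illustration. Extracting that mapping from real C++ is the hard part and is not attempted here.

```python
from itertools import combinations


def lcom(usage):
    """Lack of Cohesion in Methods for one class.

    usage maps each method name to the set of instance variables it uses
    (a hypothetical input format). A higher result means lower cohesion.
    """
    disjoint = sharing = 0
    for vars_a, vars_b in combinations(usage.values(), 2):
        if vars_a & vars_b:
            sharing += 1   # the two methods have a variable in common
        else:
            disjoint += 1  # the two methods touch unrelated state
    return max(disjoint - sharing, 0)


# Every method uses the single instance variable: fully cohesive.
COUNTER = {"get": {"count"}, "increment": {"count"}, "reset": {"count"}}

# Each method uses its own variable: no cohesion between methods.
RECORD = {"name": {"name"}, "age": {"age"}, "email": {"email"}}

if __name__ == "__main__":
    print(lcom(COUNTER))  # 0
    print(lcom(RECORD))   # 3
```

The name is easy to trip over: it measures a lack of cohesion, so the cohesive class scores 0 and the scattered one scores higher.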

The structural metrics calculated by cccc are:

  • Fan-in: The number of other modules that pass information into a module.
  • Fan-out: The number of other modules that a module passes information to.
  • An “Information Flow measure” calculated as the square of the product of the fan-in and fan-out of a single module.
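The arithmetic behind these three numbers is simple enough to sketch in a few lines of Python. The module names and the list of flows below are invented for illustration, and the helper names are hypothetical. Working out the real flows between modules is the job of a parser like the one in cccc.

```python
from collections import defaultdict


def fan_counts(flows):
    """Return {module: (fan_in, fan_out)} from (source, destination) pairs."""
    sources = defaultdict(set)  # module -> modules that pass information into it
    targets = defaultdict(set)  # module -> modules it passes information to
    for src, dst in flows:
        targets[src].add(dst)
        sources[dst].add(src)
    modules = set(sources) | set(targets)
    return {m: (len(sources[m]), len(targets[m])) for m in modules}


def information_flow(fan_in, fan_out):
    """Square of the product of fan-in and fan-out."""
    return (fan_in * fan_out) ** 2


# Hypothetical flows between four made-up modules.
FLOWS = [
    ("lexer", "parser"),
    ("parser", "symtab"),
    ("parser", "codegen"),
    ("symtab", "codegen"),
]

if __name__ == "__main__":
    for module, (fan_in, fan_out) in sorted(fan_counts(FLOWS).items()):
        print(module, fan_in, fan_out, information_flow(fan_in, fan_out))
```

Because the product is squared, a module that sits in the middle of many flows gets a much larger score than the modules at the edges, which score zero as soon as either count is zero.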

These metrics were proposed by Henry and Kafura in their 1981 paper Software Structure Metrics Based on Information Flow, which unfortunately does not seem to be freely available. This paper is super-cool as the code base they use for evaluating the metrics is UNIX, version 6. The Lions book is cited as a reference, which is even cooler!

Tragic fawning over old-school UNIX aside, the paper shows that the information flow measure described above is strongly correlated with the occurrence of changes in the UNIX sources. That is, modules with a high value of the metric also had many changes made to them. The number of changes in a module is used as a proxy for the number of errors in a module, on the assumption that these two measures are strongly correlated.

cccc looks like an interesting tool, or at least the beginning of one. To be useful during development, it would be nice to see how these metrics are changing over time, and cccc doesn’t provide any facilities for that.

Perl code statistics with PPI

Sometimes there’s nothing better to do on a Sunday morning than read the Google Testing Blog. A recent post suggested that methods (or functions, subroutines, etc.) should be made shorter to make testing easier: because a short method does less than a longer one, it’s usually easier to test.

Normally I would cite improving readability and flexibility as the main reasons to prefer short methods, but ease of testing seems just as good. Code that’s difficult to read or difficult to test will always be difficult and costly to maintain. In fact right now, all over the world, there are thousands of maintenance programmers bent over in prayer reciting their litany: Please, write code that can be read rather than deciphered…

The post made me speculate on how long my methods are. The only code I’ve written recently that’s wholly my own work is a test harness in Perl. This is the confidential intellectual property of my employer and is busy delivering sustainable competitive advantage, so I can’t post it on this blog. Hopefully some statistics on the code won’t dilute shareholder value too much.

It turns out there’s a Perl module for doing just this sort of thing: PPI. There’s a cool introduction to PPI by Adam Kennedy (the module author) on perl.com. PPI makes doing something basic like finding the average length of subroutines very easy, so I hacked up a script to do just that.

My test harness script is a bit less than 1000 lines:

$ wc -l foo.pl
914 foo.pl

Running the sublength.pl script on it gave the following results:

$ ./sublength.pl foo.pl
Number of subroutines: 46
Length of longest sub: 50
Length of shortest sub: 4
Average length of sub: 15.76

My subroutines are pretty short with an average length of just 15.76 lines, which includes the header and the opening and closing braces on separate lines. As I’ve said before, I find long, deeply-nested methods difficult to understand, so I just don’t write them like that.
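The sublength.pl script itself is Perl built on PPI and isn’t reproduced here. As a rough analogue only, the same four statistics can be gathered for Python source using the standard ast module (Python 3.8 or later for end_lineno). The function names and the sample source below are made up for illustration.

```python
import ast


def function_lengths(source):
    """Return the length in lines of every function defined in source."""
    tree = ast.parse(source)
    return [
        node.end_lineno - node.lineno + 1  # header line through last body line
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]


def summarise(lengths):
    """Compute the same four figures that sublength.pl reports."""
    return {
        "count": len(lengths),
        "longest": max(lengths),
        "shortest": min(lengths),
        "average": round(sum(lengths) / len(lengths), 2),
    }


# A tiny made-up sample; a real run would read a source file instead.
SAMPLE = '''\
def short():
    return 1


def longer(x):
    if x:
        return x
    return 0
'''

if __name__ == "__main__":
    print(summarise(function_lengths(SAMPLE)))
```

As with the Perl numbers above, the length includes the header line of each function.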

More generally, I like the idea of measuring the complexity or readability of a codebase with metrics. You could even analyse a repository commit-by-commit and see whether each commit has increased or decreased readability. Then you can have graphs showing the change in readability over time. Awesome! Sure, it sounds like overkill, but folks much smarter than me have emphasised the importance of readability:

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.