a blog about things that I've been thinking hard about

Extreme Negative Code Documentation

21 September, 2011
what is this line of code for? can I delete it?

Presumably every line of your code is there for some reason.

For any line of source code, you could describe, in a comment, what would happen if it wasn't there.

You could do that for every single line of code in your application.

The Basic Idea: What Is This Line Of Source Code Here For?

When you write software, you would normally expect that every line of code in your software is there for some reason.

And for each line, if it wasn't there, there would be some problem caused by it not being there.

The idea of Extreme Negative Code Documentation (ENCD) is that, for every line in your code, you write a comment explaining what would be wrong with the code if that line were missing.
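To make the idea concrete, here is a small sketch of what a negatively commented Ruby method might look like. It is a made-up illustration, not code from any real project: the method name `normalize_path` is hypothetical, and putting each `#N` comment on its own line directly above the code it describes is an assumption made for this sketch.

```ruby
# Hypothetical example: turn a user-supplied path into a clean relative path.
#N Without this method, every caller would have to clean up paths itself.
def normalize_path(path)
  #N Without this, leading and trailing whitespace would end up in the path.
  cleaned = path.strip
  #N Without this, Windows-style backslashes would not be treated as separators.
  cleaned = cleaned.gsub("\\", "/")
  #N Without this, a leading "/" would make the path absolute instead of relative.
  cleaned = cleaned.sub(%r{\A/+}, "")
  cleaned
end
```

Each `#N` comment answers the same question: what would go wrong if the line below it were deleted?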

First Example

As an experiment, I have applied ENCD to source files from my synqa project, which is a simple SSH-based file uploader for static websites.

For the sake of the experiment, I have applied ENCD on the branch master-neg, separate from the main master branch.

I have applied it to the following files:

The negative comments all start with #N, which distinguishes them from the "normal" comments, which don't have the N character at the beginning.

The original copies of the files on the master branch, without ENCD, are sample-rakefile, based.rb and synqa.rb.

Notes

Not All Lines are Negatively Commented

I did not write negative comments for all lines in the code. I did not write negative comments about "end" statements, because the comment would always be the same: "Without this 'end' statement, the block would not be closed." I also did not write negative comments for simple return statements at the end of a method, because about the most useful thing I could write would be "Without this, the result value would not be returned."

But apart from that, I negatively commented more or less every non-empty line of code in those three source files, even if it seemed, for a particular line, that the corresponding negative comment was somewhat trite and obvious.

Block or Statement?

In a program, some lines are complete statements, while others are only the beginning of a larger statement. For lines which start a statement, there are two choices:

  1. Comment on the line itself
  2. Comment on the structure started by the line

For "if" statements, I generally commented on the condition, i.e. what would happen if the condition wasn't checked before executing the block? For other block statements, including "for" statements, method definitions and class definitions, I commented on the block or definition as a whole, i.e. what would happen if the for loop wasn't executed, or what would happen if this class or this method wasn't defined?
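The difference between the two choices can be sketched as follows. This is a hypothetical example, not taken from synqa: the method name `files_to_upload` and its arguments are invented, and it uses an `each` block where other code might use a "for" statement. The comment on the loop describes the whole loop, while the comment on the "if" describes only the condition.

```ruby
# Hypothetical example: work out which local files differ from the remote copy.
#N Without this method, there would be no way to decide which files to upload.
def files_to_upload(local_sizes, remote_sizes)
  #N Without this, there would be nowhere to accumulate the result.
  names = []
  #N Without this loop, no local file would ever be considered for upload.
  local_sizes.each do |name, size|
    #N Without this check, files that are already up to date would be uploaded again.
    if remote_sizes[name] != size
      #N Without this, a file needing upload would not be included in the result.
      names << name
    end
  end
  names
end
```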

Observations

You Have to Be Familiar With The Code

It's difficult to negatively comment every line of a program if, to use an expression from an essay by Paul Graham, you don't hold the program in your head. To negatively comment on every line in a meaningful way, you have to know what every line of code is there for and what would be missing from the program if that line wasn't there.

If you recently wrote the whole program yourself (as is the case for my example), then you can probably do it. If it's a large program that someone else wrote, and you're just in charge of maintaining it, then good luck.

Having said that, attempting to negatively comment on someone else's code may be a good way of starting to get familiar with it, or at least to gain an idea of how ignorant you are about how that code works.

Negative Comments Often Describe Value

It's one thing to say what a method does; API documentation typically tells us that. But saying what would happen if a method wasn't there can tell us more. For example, if method X was missing, it might be that:

In other words, negative comments can tell us how much individual items of code matter, and why they matter.

There Is No Right Amount Of Documentation

The traditional view about comments in code is that there should be some comments, but not too much.

The problem with this view is that how much is the right amount is highly dependent on how familiar the reader already is with the code.

A contrary view is that a reader should have the option of asking for an unlimited amount of detail about any item in the code that they are reading, should they have any uncertainty about what that item is there for.

A Manifesto: Extreme Documentation

A naive view about programming languages is that if you know a particular programming language, such as Java, then you can read anything written in Java, just like you can read anything written in English if you know English.

But I would say, in practice, that most Java programmers cannot read most Java programs. A Java programmer may be able to read code that they wrote themselves from scratch, and they may be able to read other people's code that they have worked on. But they cannot read most of the Java programs in the world. At least not in the same way they can read unfamiliar prose written in English.

One reason for this is that most Java code is not written in Java. Most of the words in Java programs are not Java words. They are identifiers, which have been invented by the programmers writing the program. (And some of them are library identifiers, which are an in-between case, depending on how widely used the library is.)

Which leads to the concept of extreme documentation. If we want a Java program to be readable by most of the Java programmers in the world, we need to intensively translate the code into English (note I'm naively assuming that most Java programmers know English, which might be approximately true).

That means writing a whole lot of comments. It means writing too many comments, so many comments in fact that it clutters up the source code too much, and it's no longer possible to read the actual Java.

To-Do List (i.e. the "Road Map")

Filtered Viewing

In my Ruby examples, all the negative comments have a special marker, i.e. an "N" character immediately after the "#" comment start character, also followed by a space character. So by grepping on "#N ", it would be possible to filter out the negative comments, leaving only "normal" source code with "normal" comments included.
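A minimal sketch of such a filter in Ruby might look like this. It assumes that each negative comment sits on a line of its own (possibly indented); the names `NEGATIVE_COMMENT` and `strip_negative_comments` are invented for the sketch.

```ruby
# Matches a line that holds only a negative comment: optional indentation,
# then "#N", then either whitespace or the end of the line.
NEGATIVE_COMMENT = /\A\s*#N(\s|\z)/

# Returns the source text with every negative comment line removed.
# Normal comments (including ones like "#Note") are left untouched.
def strip_negative_comments(source)
  source.lines.reject { |line| line.match?(NEGATIVE_COMMENT) }.join
end
```

For example, `puts strip_negative_comments(File.read("synqa.rb"))` would print the file as it looks without its negative comments, provided those comments follow the one-per-line layout assumed here.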

A more sophisticated form of filtering would involve adjusted syntax highlighting (such as provided by Pygments), so that negative comments were hidden by default, e.g. by assigning a suitable CSS class in an HTML-ized version of the code, and perhaps showing the negative comment for a line only when mousing over some special icon or location associated with the line that it applies to.
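As a rough sketch of the HTML side of this idea, the following plain Ruby code (it does not use Pygments and does no real syntax highlighting) wraps each source line in a span and gives negative comment lines their own CSS class. The class names `neg-comment` and `code`, and the helper names, are assumptions made for this sketch.

```ruby
HTML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escapes the characters that are special in HTML text and attribute values.
def escape_html(text)
  text.gsub(/[&<>"]/, HTML_ESCAPES)
end

# Wraps every line in a <span>. Negative comment lines get the class
# "neg-comment", all other lines get the class "code". The newline is kept
# inside the span, so hiding a span also hides its line break.
def to_html(source)
  spans = source.lines.map do |line|
    css_class = line.match?(/\A\s*#N(\s|\z)/) ? "neg-comment" : "code"
    %(<span class="#{css_class}">#{escape_html(line.chomp)}\n</span>)
  end
  "<pre>#{spans.join}</pre>\n"
end
```

A stylesheet rule such as `.neg-comment { display: none; }` would then hide the negative comments by default, and a script or hover rule could reveal them on demand.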

Specialised Merging

In my examples I created the negatively commented versions of the source files in a different branch in my Git repository. However this is not the best approach going ahead. For example, if negative comments are useful, then they should form part of, or be associated with, the actual source code within the same branch.

Also, I have not attempted to alter any of those source files after creating the negatively commented versions of those files. If I did so, it might be possible to use Git merging to adjust the negatively commented versions to match any changes. However it would be better to use a merge tool which is specifically designed to deal with negative commentation, in particular ensuring that the rest of the source code in the merged negatively commented version exactly matches the source code in the non-negatively commented version (which standard Git merge strategies might or might not do), and highlighting which source code lines require new or updated negative comments.
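The two checks such a tool would need can be sketched in a few lines of Ruby. This is only an illustration of the checks, not a merge tool: it assumes one negative comment per line, placed directly above the code it describes, and the method name `check_negative_comments` and the shape of its result are invented for the sketch.

```ruby
NEG_LINE = /\A\s*#N(\s|\z)/

# Compares a plain source file with its negatively commented version.
# Returns a hash with:
#   :matches           - true if removing the negative comments gives back
#                        exactly the plain source
#   :uncommented_lines - line numbers (in the plain source) of code lines
#                        with no negative comment directly above them
def check_negative_comments(plain_source, commented_source)
  code_lines = []
  uncommented = []
  previous_was_negative = false
  commented_source.lines.each do |line|
    if line.match?(NEG_LINE)
      previous_was_negative = true
      next
    end
    code_lines << line
    stripped = line.strip
    # Blank lines, normal comments and "end" lines need no negative comment.
    exempt = stripped.empty? || stripped.start_with?("#") || stripped == "end"
    uncommented << code_lines.length unless previous_was_negative || exempt
    previous_was_negative = false
  end
  { matches: code_lines.join == plain_source, uncommented_lines: uncommented }
end
```

After a change to the plain source, a false `:matches` value would show that the negatively commented version has drifted, and `:uncommented_lines` would list the lines that still need a negative comment.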

Other Forms of Extreme Documentation

Negative Documentation requires a choice of an arbitrary level of granularity, i.e. for each line of source code, write one corresponding negative comment.

But there is no reason not to provide intensive documentation at all levels of granularity. For example, in an assignment statement, what does the LHS mean before the assignment, and what does it mean after the statement? And what does the RHS mean? And what does each sub-expression of the RHS mean?

Documentation of every item at every level of granularity in the source code would support arbitrary code inspection. That is, given a view of the source code, select a given statement or variable or expression, and the code inspector will display a hand-written description, in English, of that item.

A lot of the time, such a level of detail might seem like overkill. But for any developer reading any code, there will be situations where the developer thinks they know what something is, but they're not quite sure. Or the developer is just starting to read some code, and they have almost no prior context to help them understand what anything in the code actually does or what it means.

I Have A Dream ...

Ultimately, I dream of a world where we can vastly increase the number of people who can work on any individual item of code. This particularly matters for open source code, where in theory any item of code can be used and improved by any one of hundreds of millions of people who have access to a computer with an internet connection, but in practice, for most code, we're lucky if more than a few dozen people have enough understanding of the code to do anything with it other than install the whole application from the Windows installer and use it as is.
