a blog about things that I've been thinking hard about

Disorganized Incremental Software Development

15 September, 2005
use cryptographic hashes to name everything

"Naming" is the "second hardest problem" in software development.

Cryptographic hashing can automatically name everything with a unique name.

Problem solved! (OK, now we have the problem of how to read the code ...)

The Minimum Contribution Problem

One of the interesting features of Wikipedia is that anyone can make a very minimal contribution to an article, such as a word, or a sentence, or even a one sentence article, and still provide some positive benefit to the project.

Because Wikipedia has no barriers up front to contributors, there are some "contributors" whose contributions are less than useful, and Wikipedia relies on certain loosely defined social structures to ensure that, in the long run, beneficial contributions accummulate and contributions of negative value are discarded.

Does the advantage of reducing the barrier (more contributions) outweigh the disadvantage (negative contributions)? The proof is in the pudding, and the success of Wikipedia cannot be denied.

It would be nice if one could do something similiar with open-source software. A line of code here. A skeletal class definition there. Perhaps a couple of useful comments.

But in real life this doesn't happen. There seems to be a minimum amount of effort that you need to make to contribute to a project that a significant number of people will use.

If you contribute to an existing project, then the developers who are already working on that project will want to know that incorporating changes that you make is worth the risk. And if you start your own project, well, you are going to have to do a non-trivial amount of work. And you will have to spend time and effort promoting your project.

If you attempt to contribute to an existing project, those who already have something invested in it will want to know if your contribution adds value. Software is notoriously fragile, and there are many "improvements" that can be made to an application which seem good in the short term, but which cause more problems in the long term because they increase the complexity and entanglement of different aspects of the application's functionality. The final result can be a Big Ball of Mud, and the only cure for a Big Ball of Mud is to abandon it, and start all over again.

There is also the dreaded "fork". Given the open-sourceness of open-source software, you can take an existing project, add your one line of code, and declare the result to be a new and better version. But will any user other than you be inclined to use your version, and will any developer other than you want to develop on top of your change?

Some projects satisfy the urge for incremental development by providing for creation and installation of plugins and extensions. But this never allows for all possible improvements, and even plugins and extensions require some minimum level of effort to be useful to others.

Potential Advantages of Incremental Development

If there was some way whereby many different people could make arbitrarily small contributions to any software project, then both the quantity and quality of open-source software could increase by a significant factor. And we do need more quality and quantity in open-source software. Open-source software can still be frustrating to use. New versions can have more bugs than old versions. Whole projects never quite "make it", and the efforts of their authors are largely wasted, except perhaps that they provide a lesson in how things shouldn't have been done.

When we take a more abstract view of how software is constructed, we can see that most software is built up out of various "ideas", which come together in the mind of the developer or developers as source code. If there was some way that ideas could be "dropped into" a software application, just like you might drag and drop MP3s into your music collection, then incremental software development might be possible.

Merging, Naming and Collisions

What makes software development complicated is that each component of the source of a software application has meaning within some context, and at the same time it alters the context of other components of the source. So when you contribute some change to the code, two questions immediately arise:

Modern programming languages do recognise these dependence problems, and efforts have been made to allow for and guarantee independence between different software components. A major part of how this is done has to do with naming.

Java is one programming language that provides naming conventions to avoid conflicts between code written by different parties.

For example, the Apache Jakarta Commons project has a library called HttpClient, and this contains a class called org.apache.commons.httpclient.HttpClient, which is the main point of entry into the functionality of the library. We can see that the name of the component is multi-part. The first three parts – "org.apache.commons" – tell us something about the organisation responsible for maintaining the class. The last two components – "httpclient.HttpClient" – serve the purpose of describing the functionality of the class.

Now there might be some reason why you really didn't like the way that this class was implemented, and because the Jakarta Commons libraries are open-source, you can download the source, and alter it, and recompile it, and use an altered version of in your own applications, all while keeping the same name. But if you released a version of the library altered in this way, people would probably avoid using it, because it would obviously clash with the "official" version. If both your version and the official version were on the same Java classpath, the Java classloader would have to make some decision about which one to use, and ignore the other one.

You could also make copies of all the Java classes, and give them package names that reflected your own organisation, and then alter them. For example, if you were Joe Bloggs at example.com, you might rename the HttpClient class to com.example.bloggs.joe.httpclient.HttpClient. This avoids the collision of names, but your version is now completely different from the original version. Any new improvements to the Apache version won't automatically appear in your version.

You could go to Apache with your improvements, and try to get them included into the official version. But what seems like useful improvements to you might not seem so useful to someone else. And the Apache team will have to ask themselves if it is worth the risk and effort of incorporating your changes in return for the alleged benefits. Someone will have to examine your code, and determine what effect it has. And this might be as much work as if they wrote the code themselves.

Ownership and Arbitrariness

The problem with most naming systems designed to avoid collisions is that each part of the namespace has some ostensible "owner", and this owner can make arbitrary decisions about the meaning of the names in the namespace. As a result, the meanings of these names can change over time. This is what happens when components of a library change their functionality in subtle ways from one version to the next (perhaps because the functionality in the earlier version was "wrong" in some way). And if you decide to achieve reuse by making your library depend on someone else's library, then your library suffers "code rot" when the meanings of named components in the library you depend on are changed by the owner of that library.

Non-Arbitrary Names

If names were not arbitrary, then this problem would not occur. If a name was always exactly determined by the thing that it named, then it wouldn't matter who "owned" the name. If my version of a component of software had the same determined name as your version, then this would be OK, because the two components would necessarily be exactly the same.

The simplest way to absolutely guarantee that a name is determined by content is to have the name actually be the content. The only problem with this is that the names are too long. The whole point of having a name is that it is less effort to read and write than the entity that it references, and if the content is the name, then you lose that advantage.

There is a way to almost have names determined by content, and that is to use a cryptographic hash function. Such a function maps a sequence of bits of arbitrary length to a fixed size sequence of bits, in such a way that it is extremely difficult to ever find two different input values that generate the same output value. One well known hash function is SHA-1, which was designed by the USA's National Security Agency, and which maps any data value onto a "unique" 160 bit value. (The "security" of the "uniqueness" of this particular hash function is subject to continued revision, and although SHA-1 has not been properly "broken", attempts have come close enough that extended versions of the algorithm with slightly larger hash sizes are being recommended for future use.)

How would this be applied to computer software? Consider a function like square, written in some imaginary functional programming language:

square = lambda(x) multiply(x,x)

For the sake of illustration, let us assume that lambda is a keyword, but multiply is a function defined in some existing library. multiply will therefore have some "real" determined name, such as "b098AjkdA93rTDfW59JU941DfK0". So square will more precisely be defined as:

lambda(x) b098AjkdA93rTDfW59JU941DfK0(x,x)

What name do we give square? One way would be to replace the reference to "b098AjkdA93rTDfW59JU941DfK0" with the actual content which that hash value was derived from, and then compute the full hash of the expanded string. But full expansion would become more and more tedious as more and more definitions were layered on top of each other. It's simpler and quicker to assign a hash based on the function as it is defined, i.e. to compute the hash of the exact string "lambda(x) b098AjkdA93rTDfW59JU941DfK0"(x,x)". (One could also normalise bound variable names, to reduce arbitrary variation caused by choice of those names.)

Specification Names versus Implementation Names

If the name of a function is derived from the content of its implementation, what happens if you change the implementation without altering the functionality of the function? Even the tiniest changes in name would propagate all the way to the top of the software definition "tree". An alternative is to name software components by hashing their formal specifications. This has the advantage that the implementation can be changed without altering the name, as long as the changed implementation is exactly equivalent in its result to the original.

Descriptions and Comments

One problem with hash-based names is that they will be hard to read. Trying to remember that "b098AjkdA93rTDfW59JU941DfK0" means "multiply" is not that easy. When we looked at the example of org.apache.commons.httpclient.HttpClient, we found that the earlier portion of the name had to do with ownership and collision avoidance, but the rest of the name serves a descriptive function, and is written in something close to English.

In a word of collision-free names determined by cryptographic hash functions, there will still be a need for the English descriptions. There will still be "ownership" of these descriptions, and it will be up to people using formal specifications (and not wanting to analyse them in mathematical detail) to decide whose descriptions of software specifications with hash-based names are trustworthy.

A Vision of the Future

If hash-based naming is adopted widely, then the very nature of software development and distribution will change considerably:

(Updated 10 October 2005)
Vote for or comment on this article on Reddit or Hacker News ...