a blog about things that I've been thinking hard about

DSL Metaprogramming: Five Kinds of Source Code

24 September, 2006
grammar, compiler, source, compiled, libraries

1. The grammar of the Domain-Specific Language (DSL)

2. The compiler to compile the DSL

3. Source code written in the DSL

4. Compiled code (as compiled by the compiler)

5. Support libraries for the compiled code

Background

DSL Metaprogramming is the kind of metaprogramming where you create a Domain Specific Language which naturally describes the "meaning" of what your program is about.

A critical difference between metaprogramming via DSL and just programming in some very high-level programming language is that with DSL metaprogramming you not only write code in the DSL; you also have to define the DSL, and you have to manage the relationship between the DSL and the code, written in some real programming language, which actually runs the program.

One Kind of Source Code

In the "traditional" model of programming, you choose a programming language, such as Java, and then you write your program in that programming language. To make it run, you might invoke a compiler, or perhaps an interpreter. The only source code involved is the source code written in the chosen programming language.

There are various things that you as the programmer may have to do with your source code, including but not limited to reading it, writing it, committing it, compiling it, fixing compile errors, running it and debugging it.

I call this model "traditional", because I can remember a time in the past when a lot of programming was done with just one programming language. But that was definitely in the past – nowadays programming has become multi-language, and a typical Java application might have source code written in various additional languages including SQL, XML (and various languages implemented in XML), HTML, CSS, JSP, properties files, and so on. And for each one of those languages there is going to be source code, and there are going to be programmers reading, writing, committing, compiling, running and debugging code written in those languages.

Five Kinds of Source Code

If you create a DSL to describe how an application runs, then there is still a requirement to convert this DSL into code written in a general purpose programming language (GPPL), such as Java, that can be executed. (Sometimes DSLs are used in other ways, but even in these cases a lot of what I am going to say still applies.)

If you follow this path, then I submit that there are five kinds of source code that you have to contend with, which are the following:

  1. The definition of the DSL, i.e. its grammar.
  2. The application code written in the DSL itself.
  3. The code which compiles the DSL into the GPPL.
  4. The DSL code compiled into the GPPL.
  5. Support code written directly in the GPPL to provide functionality required by the previous item.
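
To make the five kinds concrete, here is a minimal sketch in Python of all five living side by side. Everything in it is invented for illustration: the toy "duration" DSL, the names GRAMMAR, DSL_SOURCE, to_seconds and compile_dsl, and the choice of Python as the GPPL. A real system would use a proper grammar tool rather than a single regular expression.

```python
import re

# 1. The grammar of a hypothetical toy DSL: "<name> = <number> <unit>" per line.
GRAMMAR = re.compile(r"^\s*(\w+)\s*=\s*(\d+)\s*(seconds|minutes)\s*$")

# 2. Application code written in the DSL itself.
DSL_SOURCE = """
timeout = 30 seconds
retry_delay = 2 minutes
"""

# 5. Support code, written by hand in the GPPL, that the generated code calls.
def to_seconds(amount, unit):
    return amount * (60 if unit == "minutes" else 1)

# 3. The compiler: turns DSL source text into GPPL (here Python) source text.
def compile_dsl(source):
    lines = []
    for number, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        match = GRAMMAR.match(line)
        if match is None:
            raise SyntaxError(f"line {number}: cannot parse {line!r}")
        name, amount, unit = match.groups()
        lines.append(f"{name} = to_seconds({amount}, {unit!r})")
    return "\n".join(lines) + "\n"

# 4. The generated code: produced by the compiler, never typed by hand.
GENERATED = compile_dsl(DSL_SOURCE)

# Run the generated code with the support library in scope.
namespace = {"to_seconds": to_seconds}
exec(GENERATED, namespace)
```

Even at this scale, each of the five pieces can be read, edited and debugged separately, and a mistake in any one of them shows up in a different way.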

Of these five items, it might seem that item 4, the generated code, is not really source code, because it is not written by the programmer. However, in practice, programmers often start wanting to treat generated code as if it were manually written source code. In other words, they want to read it, compile it, fix compiler errors, run it, debug it, and maybe even track its revision history.

It is almost an axiom of metaprogramming that the temptation to abandon code generation and just continue developing the generated code must be resisted (in political-military jargon this is "staying the course"). But even where programmers are fully committed to metaprogramming, there will be times when they want to manipulate generated source code directly. The only caveat is that they must "fold" their changes back into the code which generates the generated code (which may be either the DSL code or the DSL-to-GPPL conversion code).

Development System Quality

When we choose a traditional general purpose programming language to write a program, we take into consideration how easy it is to perform all the various source-code-related tasks, i.e. reading, writing, compiling, fixing compile errors, running, debugging etc.

By analogy, when deciding on a metaprogramming system, we need to know how easy it is to perform all of these tasks for each of the five kinds of source code. And if a metaprogramming system makes it hard to perform any of those tasks for any of the five kinds, then our metaprogramming efforts may fail.

Metaprogramming "failure" can take various forms. It can be straight-out failure, where the developed application simply fails to do what it is meant to do. But it can also "fail" in other ways; in particular, it can lose its "meta" and just become straight "programming". There are at least two ways that this can happen. One, as already mentioned, is that developers can start developing against the generated code, at which point re-generation of new code becomes difficult if not impossible. The other is that the DSL abstraction becomes too "leaky": developers think in terms of the generated code, and they end up manipulating the DSL code in order to generate the GPPL code which they really want to write.

Separation

In saying that there are five kinds of source code, the question arises as to whether these five kinds are properly separated from each other. One would hope so, but there are plenty of ways to not separate them.

In the first instance there can be a failure to separate DSL definition from implementation. This can happen especially if you go the route of embedding a DSL in a high-level programming language. For example, a DSL might be implemented by a set of Lisp macros, in which case the syntax of the DSL is implicit in the names of the macros and the structure (if given) of the macro arguments.
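
As a rough illustration of that kind of implicit grammar, here is a sketch of an embedded DSL in Python. Python has no macros, so this is only an analogy to the Lisp case, and the form and field functions are hypothetical names made up for this example. The point is that there is no grammar file anywhere: the "syntax" of the DSL is whatever the function names and their signatures happen to accept.

```python
# A hypothetical embedded DSL for describing forms. The grammar is implicit
# in the function names and in the shape of their arguments.

def field(name, kind="text", required=False):
    """One form field; 'kind' and 'required' are the only 'syntax' allowed."""
    return {"name": name, "kind": kind, "required": required}

def form(title, *fields):
    """A form is a title followed by any number of fields."""
    return {"title": title, "fields": list(fields)}

# "DSL source code", which is really just host-language code.
signup = form(
    "Sign up",
    field("email", required=True),
    field("age", kind="number"),
)
```

To find out what the DSL permits, a reader has to read the implementation, which is exactly the failure of separation described above.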

If you use a grammar tool like ANTLR, then the separation of grammar definition from grammar implementation is quite explicit. For example, see this Java 1.5 syntax definition, which defines the grammar without any implementation details.

In some cases the generated code may not exist as such. The objects described by a DSL may exist only as run-time objects. Or, as in the case of Lisp macros, the generated code may only exist in memory, where it is immediately compiled into the Lisp system's internal representation of executable Lisp code. (Of course we can and do use macroexpand and macroexpand-1 to view expanded macro invocations.)
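
The in-memory case can be sketched in Python, under the assumption that Python's built-in compile and exec stand in for the Lisp system's macro expansion and compilation. The make_class_source and build_class functions, and the show flag that plays roughly the role of macroexpand, are hypothetical names for this example.

```python
def make_class_source(class_name, fields):
    """Generate Python source for a class with one read-only property per field."""
    lines = [f"class {class_name}:"]
    lines.append(f"    def __init__(self, {', '.join(fields)}):")
    for name in fields:
        lines.append(f"        self._{name} = {name}")
    for name in fields:
        lines.append("    @property")
        lines.append(f"    def {name}(self):")
        lines.append(f"        return self._{name}")
    return "\n".join(lines) + "\n"

def build_class(class_name, fields, show=False):
    """Compile the generated source in memory; it is never written to a file."""
    source = make_class_source(class_name, fields)
    if show:
        print(source)  # rough analogue of macroexpand: inspect the expansion
    namespace = {}
    exec(compile(source, f"<generated {class_name}>", "exec"), namespace)
    return namespace[class_name]

Point = build_class("Point", ["x", "y"])
p = Point(3, 4)
```

Unless someone passes show=True, the generated code is never seen by anyone, which makes it hard to read or debug when something goes wrong inside it.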

Generated code may also be merged with manually coded "helper" code. This leads to source files with warnings like "Do not edit generated code within these comments".
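
Here is a minimal sketch of how such a merged file might be regenerated. The marker comments BEGIN and END and the regenerate function are assumptions of mine, not the convention of any particular tool. Only the text between the markers is replaced, and the hand-written helper code around it is left alone.

```python
BEGIN = "# BEGIN GENERATED (do not edit)"
END = "# END GENERATED"

def regenerate(file_text, new_generated):
    """Replace the text between the markers, keeping hand-written code intact."""
    lines = file_text.splitlines()
    try:
        start = lines.index(BEGIN)
        end = lines.index(END, start + 1)
    except ValueError:
        raise ValueError("generated-code markers not found") from None
    merged = lines[: start + 1] + new_generated.splitlines() + lines[end:]
    return "\n".join(merged) + "\n"

old = """\
def helper():
    return 42

# BEGIN GENERATED (do not edit)
LIMIT = 10
# END GENERATED
"""

new = regenerate(old, "LIMIT = 20\nNAME = 'demo'")
```

Any manual edit made between the markers is silently lost on the next regeneration, which is why the warning comments are needed in the first place.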

My own recommendation would be that if you are implementing a large application using DSL metaprogramming, then you should keep each of the five code types in separate files, and use appropriate linking and binding mechanisms to combine them when necessary at run-time.

The Future

In the open-source world there is no "standard" way to do DSL metaprogramming. "In-language" metaprogramming remains popular, for example using Lisp (the traditional language of choice), and now Ruby (which trades simplicity and robustness for the opportunity to use better syntactic sugar). But these languages unnecessarily constrain the design of your DSL to be part of the host language. The fact that doing DSL metaprogramming this way is popular suggests that it is still harder than it should be to invent your own language and write an interpreter or compiler for it.

ANTLR seems to be the most popular and easiest-to-use open-source grammar tool that is currently available (and the only one I have direct experience with). Older tools like lex and yacc impose too much work on the programmer to be useful for "casual" DSL invention.

I have written this blog entry because I have been thinking about doing some experiments of my own in creating a parsing system which is "meta-programmer friendly". One of my aims in designing such a system will be to make sure that the "system" is programmer-friendly with respect to each of the five kinds of source code that I have identified.
