From the distributor's point of view, the easiest delivery method is bare source code, since it requires no work other than making the code available. However, this does not make the problems of distribution go away; it just moves them to the user. In order to compile the program, the user needs to have a development system compatible with the developer's, including a compiler, translators, libraries, and tools such as make and yacc. And even with the proper tools, if the user's hardware or OS is different from the developer's, the user may need to do various porting work. The disadvantages of this method may be summarized by saying that source code is too dependent on the developer's hardware, OS, and development platform.
So binary distributions are too dependent on the target system and source code distributions are too dependent on the development system. These platform dependencies can be largely eliminated by delivering virtual machine (VM) binaries. This method has been popularized by Java and its class files, but it has been successfully used in other systems for decades. VM binaries are independent of the target system in the sense that virtually any computer can have a VM interpreter. Typically, VM implementations are not independent of the development system since they support only a single programming language, but there is no particular reason why this should be the case. In fact, although the Java VM was intended to execute a single language (Java), many other languages can now be translated into JVM class files (see http://grunge.cs.tu-berlin.de/~tolk/vmlanguages.html). Programs written in these languages can run on any machine that has a Java interpreter, making them relatively independent of both the target and development systems.
At first glance, it seems that Java class files might solve all of our software distribution problems. We can translate all programming languages into JVM class files and distribute all programs in that form. But the problem with interpreted software in general, and class files in particular, is that it is much slower and requires more memory than compiled software. JIT compilers can partially address the speed problem by generating machine code on the fly, but they cannot do serious optimization since that would take too long. This will prevent Java-style JIT implementations from ever really competing with native code solutions in performance.
Of course, there is no reason in principle why we could not deliver machine-independent class files and have the installation run an optimizing compiler on the class file to produce native code. This would generally produce better code than a JIT compiler could, but it would still not be comparable to what a high-quality optimizing compiler could do with the original source code. The reason is that much of the semantic information you need for effective optimization is lost in the translation to class file format. Another problem with this solution is that optimization can take a very long time, so running an optimizing compiler is impractical for applets and inconvenient for installing traditional applications.
So far, I have concluded that the most machine-independent form of software distribution is VM binaries, that it is necessary to optimize VM binaries, and that optimizing VM binaries on the user's machine is ineffective and inconvenient. The obvious alternative is to deliver VM binaries that have already been optimized. There is a problem with this as well: different target platforms require different optimizations. But there are some optimizations that can be done at a higher level. For example, it is always a win to evaluate constant expressions at translation time instead of at run time, and it is always a win to remove dead or unreachable code. What is needed is a VM that allows more extensive system-independent optimization in the VM code. This sort of preoptimized code could be loaded by a program like a JIT compiler and executed at optimized native code speeds. The challenge is to find a way of optimizing VM files without relying on machine-dependent optimization.
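To make the two machine-independent optimizations mentioned above concrete, here is a minimal sketch in Python. The nested-tuple program representation and the function name fold are assumptions made up for this illustration; they do not correspond to the class file format or to any real VM.

```python
# Hypothetical mini-IR built from nested tuples (an assumption for this
# sketch, not any real VM format):
#   ("const", n) | ("var", name) | ("add", a, b) | ("mul", a, b)
#   ("if", cond, then_branch, else_branch)   # cond is true when nonzero

def fold(node):
    """Return an equivalent node with constant subexpressions evaluated
    and branches guarded by a constant condition removed."""
    kind = node[0]
    if kind in ("const", "var"):
        return node
    if kind in ("add", "mul"):
        a, b = fold(node[1]), fold(node[2])
        if a[0] == "const" and b[0] == "const":
            # Both operands are known at translation time: compute now.
            value = a[1] + b[1] if kind == "add" else a[1] * b[1]
            return ("const", value)
        return (kind, a, b)
    if kind == "if":
        cond = fold(node[1])
        if cond[0] == "const":
            # Only one branch can ever run; the other is dead code.
            return fold(node[2]) if cond[1] != 0 else fold(node[3])
        return ("if", cond, fold(node[2]), fold(node[3]))
    raise ValueError(f"unknown node kind: {kind!r}")

# if (2 + -2) then x else y * (3 + 4)
program = ("if", ("add", ("const", 2), ("const", -2)),
           ("var", "x"),
           ("mul", ("var", "y"), ("add", ("const", 3), ("const", 4))))
optimized = fold(program)
print(optimized)
```

Neither rewrite depends on the target machine, which is why this kind of work could be done once by the distributor rather than on every user's machine.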
To see how we could approach this, consider the process of translating a source language to a VM language and then to machine language for native execution.
Source ------> VM ------> Machine
The source language is very different from the machine language, and the VM language is somewhere in between. The question is, where is it? Is it closer to the source language, closer to the machine language, or somewhere in the middle? At one end of the scale, the VM for a traditional compiler is the machine itself, and at the other end are common scripting languages where the VM is the source language. Java class files are near the middle. The closer a VM is to the source language the easier it is to do the source-to-VM translation. The closer the VM is to the machine language, the easier it is to do the VM-to-machine translation.
For fast JIT compilation, we want the fastest possible VM-to-machine translation, so this suggests moving the VM toward the machine and away from the source. But going against this is the fact that we want to remain machine-independent, and that moves us away from the machine. Still, there are commonalities between machines. For example, most modern machines are register-based, and this suggests that the VM should be register-based so we can do preliminary register allocation in the VM.
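As a rough illustration of what a register-based VM form might look like, the sketch below translates code for a stack-style VM into three-address code over virtual registers. The instruction names (push, load, add, mul) and the register naming scheme are hypothetical, chosen only for this example. The sketch hands out an unlimited supply of virtual registers; mapping them onto a real machine's finite register set would still be left to the VM-to-machine step.

```python
def stack_to_registers(stack_code):
    """Translate stack-VM instructions into three-address code over
    virtual registers r0, r1, ... (a hypothetical format)."""
    out = []
    stack = []      # names of the registers holding the current stack slots
    next_reg = 0
    for instr in stack_code:
        op = instr[0]
        if op == "push":                 # ("push", variable_name)
            reg = f"r{next_reg}"
            next_reg += 1
            out.append(("load", reg, instr[1]))
            stack.append(reg)
        elif op in ("add", "mul"):       # binary operators pop two, push one
            right = stack.pop()
            left = stack.pop()
            reg = f"r{next_reg}"
            next_reg += 1
            out.append((op, reg, left, right))
            stack.append(reg)
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return out

# Stack code for the expression a + b * c
stack_code = [("push", "a"), ("push", "b"), ("push", "c"), ("mul",), ("add",)]
register_code = stack_to_registers(stack_code)
for instr in register_code:
    print(instr)
```

In the register form every operand is named explicitly, so a later pass can decide which values stay in machine registers without first having to simulate the stack.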
We can view the process of translation more generally as one that involves many source languages and many machines, and the challenge is to find the point for the VM that allows for the most optimization and the simplest VM-to-machine translation for the largest set of machines. Let's call such a VM an Optimizable Portable Instruction Set or OPIS. I have studied these issues a little as part of implementing an optimizing compiler, and I am confident that a reasonably good OPIS can be designed. However, this is a research project that would require expertise in implementing many different programming languages and in writing compilers for many different machines.
Is this something the Open Source community is capable of? Can the community model be applied to a large research project or is research too different from development for the model to carry over? In a sense, the community model is very similar to the academic research model; the work is distributed over many researchers each doing what he or she is most interested in, and the result is often rapid progress in the state of knowledge.
I would like to hear from anyone who might be interested in participating in such a research project and anyone who knows of related work in the area.
Dave Gudeman (firstname.lastname@example.org) received his PhD in computer science from the University of Arizona in 1994. His research areas involved programming language design and implementation and his dissertation involved the design of an optimizing compiler for a concurrent constraint programming language. He is currently working for a small software company designing databases and GIS applications. His contributions to free software include the Janus Compiler (for research purposes only) and a Java XML reader/writer called Harp (the source code for Harp will be posted on SourceForge Real Soon Now).