Today, it is quite common for people from different countries to work together. Also, though every employee may have his or her own terminal, it's likely that common applications are provided by a single file server. This creates a need for multi-language operating systems. It sounds critical, but it isn't (yet). The commands are exactly the same (Chinese sysadmins also type "mount", though I can imagine language-dependent symlinks), and answers may vary only a little (e.g., "y/n" in German is "j/n"). Until now, all other tasks of internationalization have been avoided by the system, and applications have to take up the slack. If they don't (if, for example, the administrator forgot to install the appropriate .po files), you can be lost on a terminal full of pictograms that, to you, mean nothing.
The i18n movement that started some years ago solves a lot, but not everything. With it, only output is guaranteed to match the best gettext will find. What about input? Multibyte strings, produced by input parsers like kinput2 or ami in an 8-bit or 7-bit environment, are hard to handle and break easily (if you press the delete button, it removes only half a sign). kinput2 and ami cannot run together in one terminal, because their code pages intersect. Start and end sequences are one solution, but a bad one, and especially not one meant for the long run. Imagine a document full of different languages: if I want a function that gives the line length for this document, it will be hell, and I haven't even mentioned what will happen when new languages with new start and end sequences are introduced.
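The "half a sign" problem is easy to demonstrate. The sketch below (my own illustration, not code from any input parser) checks whether a UTF-8 buffer still ends on a character boundary; a naive one-byte backspace on a multibyte sign leaves the buffer in the broken state this function detects:

```c
#include <stddef.h>

/* Illustrative helper: returns 1 if buf holds only complete UTF-8
 * sequences, 0 if the last sequence was cut off (e.g. by a naive
 * one-byte delete). Continuation bytes look like 10xxxxxx. */
int utf8_is_complete(const unsigned char *buf, size_t len)
{
    if (len == 0)
        return 1;
    size_t i = len;
    /* walk back over trailing continuation bytes */
    while (i > 0 && (buf[i - 1] & 0xC0) == 0x80)
        i--;
    if (i == 0)
        return 0;                   /* nothing but continuation bytes */
    unsigned char lead = buf[i - 1];
    size_t cont = len - i;          /* continuation bytes present */
    size_t need;                    /* continuation bytes required  */
    if      ((lead & 0x80) == 0x00) need = 0;   /* ASCII            */
    else if ((lead & 0xE0) == 0xC0) need = 1;   /* 2-byte sequence  */
    else if ((lead & 0xF0) == 0xE0) need = 2;   /* 3-byte sequence  */
    else if ((lead & 0xF8) == 0xF0) need = 3;   /* 4-byte sequence  */
    else return 0;                              /* invalid lead byte */
    return cont == need;
}
```

A Japanese hiragana sign is three bytes here; delete one byte of it and the buffer is no longer valid, which is exactly what a per-byte delete key does.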
Also, we have so many applications that handle text and formatting. Integrating multiple language parsers into each of them may take five times more effort than implementing the problem-specific algorithms. I think something like Microsoft's IME, a central (system-wide) solution, is needed here. Unfortunately, IME is not Open Source, and is therefore un(sup)portable.
Next problem: character encoding. Oops, this discussion is as old as computers are. Every nation had its own coding scheme, using the same domains! What a crappy idea! How could somebody let this happen?! OK, you say we have Unicode. Unicode was a good idea, until they found that 16 bits are too little. Also, look at Yudit's encoding list; there's not one single Unicode, but many: UTF-7, UTF-8, UTF-16, etc. Furthermore, Unicode text files may begin with a starting sequence (the byte-order mark), and Windows saves Unicode in low-hi byte order, but POSIX systems don't. Java uses wide characters (16-bit) internally. Wow! That no longer means much: 16 bits are just too little; they were only for the short run.
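To see how many flavors of "Unicode" a loader already has to tell apart, here is a small sketch that classifies a file by its starting sequence. The byte patterns are the standard byte-order marks; the enum names are my own:

```c
#include <stddef.h>

/* Illustrative classifier for the "starting sequences" (byte-order
 * marks) that different Unicode encodings put at the front of a file. */
typedef enum {
    ENC_UNKNOWN, ENC_UTF8, ENC_UTF16LE, ENC_UTF16BE, ENC_UTF32LE, ENC_UTF32BE
} bom_enc;

bom_enc detect_bom(const unsigned char *b, size_t n)
{
    /* the 4-byte marks must be tested before the 2-byte ones,
     * because UTF-32LE starts with the same bytes as UTF-16LE */
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return ENC_UTF32LE;
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return ENC_UTF32BE;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return ENC_UTF16LE;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return ENC_UTF16BE;
    return ENC_UNKNOWN;             /* no mark: guess, and often guess wrong */
}
```

The last line is the real problem: a file without a mark forces every application to guess the encoding.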
Next problem: The console. Its fonts and behavior differ from those of X (which, ten years after the invention of TrueType fonts, still lacks correct handling; take a look at Abiword, and you will understand). If I were Chinese, I would want to also see Chinese on my console, but this is even harder than under X, not to mention input routines. But what's the difference for an input parser between X and the console?
Next problem: Somebody better stop me from complaining. We have to move on. We still use the old stuff, but are now saving in XML. This is not very revolutionary. I will try to take a step forward. I'd like to present a solution. It's time to think about an all-inclusive, simple, and working system design.
But first, again, a collection of the problems mentioned above:
Fortunately, there are now these advantages which we can use:
Especially when I think about points 5 and 6 of the advantages, I say: why bother? Let's give it a try. What I propose is, first, a new char type that is 32 bits wide. This will give us the security that in the future, no character of any language will be left out. The most low-level routines (those that write to the buses) will have to be changed. Upper-level APIs may stay the same (as user programs do), as long as they do not play with overflow (255 + 2 = 1) arithmetic. And, for heaven's sake, I propose to use only 7 bits of each of the 4 bytes. Still, we would have around 270 million signs available. You might say, "That's way too much; 3 bytes, like for my display, is enough!" Well, there are sound cards that process 24 bits, but the processor has to pack them into 32-bit packages to enhance speed, so in the end, there's no real advantage to 24 bits. Also, in another 100 years, there might be the need for more. Please throw away the idea that you will see 4 bytes when you open a terminal! A char (you could call it sign or foo or bar if you like) will be an atomic piece of data. This view also fits the modern multimedia processing arena, where sound data consists of 2-byte 2-channel or, for studio work, 24/32-bit multi-channel data structures. Binaries consist of 32-bit words, as do video streams, the most complex data we know today. Only the text file has hindered us from taking this revolutionary step.
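A minimal sketch of the proposed type, under my own naming (nothing here is an existing API): a 32-bit atomic char that uses only the low 7 bits of each byte, which gives 2^28, roughly 270 million, possible signs:

```c
#include <stdint.h>

/* Sketch of the proposed character type: one atomic 32-bit unit.
 * The name "wchar32" and the helpers are illustrative assumptions. */
typedef uint32_t wchar32;

/* only the low 7 bits of each of the 4 bytes may be used */
#define CHAR28_MASK 0x7F7F7F7Fu

/* pack four 7-bit groups (most significant first) into one sign */
wchar32 char28_pack(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return ((wchar32)(a & 0x7F) << 24) | ((wchar32)(b & 0x7F) << 16)
         | ((wchar32)(c & 0x7F) << 8)  |  (wchar32)(d & 0x7F);
}

/* 1 if the value respects the 7-bits-per-byte rule */
int char28_valid(wchar32 ch)
{
    return (ch & ~CHAR28_MASK) == 0;
}
```

Note that plain ASCII values pack unchanged into the lowest byte, so the transition cost for Western text is purely one of width, not of mapping.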
If you think this will blow up your filesystem, you are most likely wrong. Take the sizes of your text files, multiply them by 4 (or by 2 if you are using CJK-encoded text files), and compare them with the sizes of your wav, MP3, or DivX files. Those files will not get bigger. The 7-bit restriction is for old Internet routing hardware, but I think that in another 100 years, it won't still be there; then those domains may also be used. The encoding scheme is clear: a character is saved and loaded exactly as its number. No conversion. Hi-low byte order is preferred and seems more logical. We could do what Unicode did for 16 bits, seamlessly integrating languages while leaving enough room between their domains. Unlike UTF-8, we don't want different sizes for Western and Eastern characters; that makes programmers unhappy and software difficult to control. Also, UTF-8 perpetuates the historical Western domination of computing science, which is not very friendly. No start and end sequences -- that's it.
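"Saved and loaded exactly as its number" can be sketched in a few lines. Assuming hi-low (big-endian) order as preferred above, serialization is byte extraction and nothing more; there is no conversion step, no variable length, no mark at the front of the file:

```c
#include <stdint.h>

/* Illustrative sketch: store one 32-bit sign in hi-low byte order. */
void char32_store(uint32_t ch, unsigned char out[4])
{
    out[0] = (ch >> 24) & 0xFF;   /* highest byte first */
    out[1] = (ch >> 16) & 0xFF;
    out[2] = (ch >> 8)  & 0xFF;
    out[3] =  ch        & 0xFF;
}

/* and load it back: the file bytes ARE the number */
uint32_t char32_load(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Compare this with the table of shifts, masks, and length checks that a UTF-8 codec needs, and the "programmers unhappy" point above becomes concrete.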
Let's go on. There will still be a mapping between keyboard-sent codes and the 32-bit chars attached to them, as a preprocessing phase. The next step will be checking the user's input choice and sending the data to the parser. This parser will build buffers for input, syllable buffers, chosen readings, etc. The buffers will belong to the system. They will be cleared when switching between parsers, but we need the ability to see what we are typing as we type it. Under X, window managers may provide a small buffer dock app in which you could see the language symbol of what you're typing. Take a look at IME, and you'll know what I mean. It might be more difficult on the console, but with libraries like ncurses, there might be a way to give a better view of the writing.
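The preprocessing phase described above is, at its simplest, a lookup table from raw keyboard codes to 32-bit signs. A sketch, with made-up keycodes (real scancodes depend on the keyboard driver):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative keymap entry: raw keyboard code -> 32-bit char.
 * The codes and the table contents are invented for the example. */
struct keymap_entry {
    uint16_t keycode;
    uint32_t ch;
};

static const struct keymap_entry keymap[] = {
    { 30, 'a' },    /* hypothetical scancode for 'a' */
    { 48, 'b' },    /* hypothetical scancode for 'b' */
};

/* returns the mapped 32-bit char, or 0 for an unmapped code */
uint32_t keymap_lookup(uint16_t keycode)
{
    for (size_t i = 0; i < sizeof keymap / sizeof keymap[0]; i++)
        if (keymap[i].keycode == keycode)
            return keymap[i].ch;
    return 0;
}
```

Switching the user's input choice would simply mean swapping the table (or the parser behind it) while the system keeps the buffers.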
Also, shells might stop character echoing and write buffer contents instead: clear back to the last breakpoint, write the new sign, change buffers, and write them again after the new sign. I did this once with a small Japanese console learning app, and it's fast enough that you don't see it. When the enter key is pressed, I propose that only fully-typed input be used; otherwise, some shells would handle partial input one way and some another, giving no standard behavior.
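The echo trick above can be sketched as a function that builds the byte sequence a shell would send to the terminal: backspace over the old tail, then rewrite it with the new sign appended. This assumes the rewritten text is at least as long as the erased tail, so overwriting suffices; the function and its names are illustrative:

```c
#include <string.h>
#include <stddef.h>

/* Build the terminal byte sequence for one redraw step:
 * erase `tail_len` cells back to the last breakpoint, then write
 * the rewritten tail `with_new` (old tail plus the new sign).
 * `out` must hold at least tail_len + new_len + 1 bytes.
 * Returns the number of bytes written (excluding the NUL). */
size_t redraw_tail(size_t tail_len,
                   const char *with_new, size_t new_len, char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < tail_len; i++)
        out[n++] = '\b';                 /* move back over the old tail */
    memcpy(out + n, with_new, new_len);  /* overwrite with the new text */
    n += new_len;
    out[n] = '\0';
    return n;
}
```

Done per keystroke, this is a handful of bytes, which matches the observation that the redraw is too fast to see.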
Now I will think about one of the most-feared things in the computer world: the change of data structures and backward compatibility. First, a single machine holds its data -- text files, binaries, etc. It is most likely connected to other machines or the Internet, and that's where I begin. These new-generation operating systems that fully process 32-bit data from hard disk to whatever will be compiled and booted on machines, but will get data through FTP or other services that deliver tons of chars of the old type. Their buffer (char) will be a 32-bit array, provided by the system (because sizeof(char) returns 4!). The service then writes it to disk -- or rather, the system does it for us, because the system always fears that some application might damage the hardware. For the newly-installed machine, there is no difference in behavior except the file sizes of texts. When the new machine provides services, there will be the problem of sending too much information: if a client asks for the next byte (or char), there might be a problem when it gets a value above 255 (or 127 signed) and dumps an error message or disconnects.
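The receiving direction is the easy one, and a sketch makes that clear: widening a legacy 8-bit stream into the system's 32-bit buffer is a lossless per-byte copy (names illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: widen a legacy 8-bit byte stream (e.g. fetched
 * over FTP from an old machine) into the new 32-bit char buffer.
 * Every old byte becomes one 32-bit sign; nothing is lost. */
void widen_bytes(const unsigned char *old, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = old[i];
}
```

The reverse direction is where the trouble described above lives: narrowing back to 8 bits is only safe for values up to 255, so serving old clients needs an explicit down-conversion (or refusal), not a blind copy.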
In the end, there will be no progress without backward-compatibility problems. Network connections are a big advantage here. We should try to use them and finally throw away our fear, because the gain will be a clean character-processing solution that works world-wide, with no more hassles with encoding schemes and browser display problems, and a user-friendly, simple-to-use, speedy, and secure multi-language interface. Also, encoding information can be left out, which cleans up email and XML files. The first big step will be to change low-level system routines, and for that I wish us all some more courage towards a change of thinking.