Character Encodings: a Useful Overview and New Features For Software Of RP Photonics
Probably every computer user has encountered problems related to character encodings, for example special characters (examples: μ2, °) being corrupted when reading data from a text file. Such problems are less troublesome if you properly understand the matter, but few people do, because things can be rather complicated.
In this article, I give you two things: a hopefully useful, somewhat simplified introduction to the technical background, and an explanation of the substantial improvements to the handling of character encodings which I have just implemented in our software. (Hard work, I tell you!)
Many users may simply not use any special characters, thus avoiding encoding problems altogether. However, the software itself may produce some special characters, in particular the Greek μ used as the micro sign in formatted numerical outputs. In some countries, you may want to use many more, in particular when creating custom forms containing e.g. explanations in Chinese or Japanese, or some box drawing symbols.
General Explanations on Character Encodings
Computer memories and files do not store numbers or characters as such, but only bit sequences. Those are often interpreted as numbers or characters based on some encodings. Both for numbers and characters, very different kinds of encodings exist and are currently used.
Encoding vs. Fonts
One should not confuse the issues of character encodings and fonts; nowadays, those aspects are normally handled separately. An encoding determines what kind of character (e.g. a certain letter) is associated with a certain bit sequence, while a font determines how that character looks when it is displayed on a screen or printed. By using different fonts, one can have different graphical looks of the letter “A”, for example.
For characters, the ASCII encoding was used in the early days, providing only 128 usable different characters, represented with a single byte (a set of 8 bits), where only the lower 7 bits were used. That is of course only sufficient for the simplest purposes.
ANSI Code Pages
Due to the need for more different characters, one soon started to use extended character ranges where the full 256 characters which can be encoded with 8 bits were utilized. However, many different versions of such ANSI character sets have been used (and are still used); some of them contain primarily things like accented letters (e.g. for the French language), while others contain letters of other languages or more symbols of certain types, e.g. box drawing symbols. Basically, the problem is that people around the world need, taken together, far more than 256 different characters. In Windows, different ANSI character sets are identified by so-called code pages (originally introduced by IBM). For example, in parts of Europe it is common that Windows systems use code page 1252 “Western European”. Note that for countries like Japan, where more than 256 characters are needed, 8 bits are not sufficient for representing one character.
Although the use of different ANSI code pages made it possible to use a vast variety of characters, this approach has serious limitations. In particular, the content of a text file with such an encoding is displayed correctly only when the corresponding code page is known and correctly taken into account. Unfortunately, plain text files often do not contain any information on the code page used, so that additional information (e.g. manually transmitted by the user) is often needed to display such files properly; even if such information is available, it may not be correct. In addition, one can of course not convert files from one code page to another without information loss, unless the file contains only characters which occur in both code pages.
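The ambiguity described above can be illustrated with a small Python sketch. It is only an illustration, not part of our software; the helper names decode_with and can_encode are made up for this example, and the three code pages (1252, 437 and the Japanese 932) are just arbitrary picks from Python's standard codecs.

```python
import unicodedata

raw = bytes([0xB5])  # a single byte; nothing in it says which code page applies

def decode_with(code_pages, data):
    """Decode the same raw bytes under each of the given code pages."""
    return {cp: data.decode(cp) for cp in code_pages}

views = decode_with(["cp1252", "cp437", "cp932"], raw)
for code_page, text in views.items():
    # Print code point and Unicode name rather than the glyph itself,
    # so the output does not depend on the console encoding.
    print(f"{code_page}: U+{ord(text):04X} {unicodedata.name(text)}")

def can_encode(text, code_page):
    """True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

greek_mu = "\u03bc"  # GREEK SMALL LETTER MU, distinct from the micro sign U+00B5
print("Greek mu in cp1253:", can_encode(greek_mu, "cp1253"))
print("Greek mu in cp1252:", can_encode(greek_mu, "cp1252"))
```

The same byte turns out as the micro sign, a box drawing symbol or a half-width katakana letter, depending on the assumed code page. The second part shows the information loss on conversion: the Greek letter μ exists in the Greek code page 1253, but not in code page 1252.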
Unicode as a Universal Solution
In order to solve such problems, the Unicode system (i.e., Unicode character definitions in conjunction with various encoding systems) has been developed. Here, a huge number of different characters (each one represented by a so-called code point) can be encoded – essentially any characters which one would normally require.
Of course, one byte (8 bits) is not sufficient for encoding an arbitrary Unicode character. There are several different encoding schemes for handling Unicode characters in computer memory or files. A common scheme is UTF-16, where 16 bits (2 bytes) are normally used for one character, but some characters require two such 16-bit units (32 bits in total), called a surrogate pair. This scheme is quite practical for use in CPUs and main memories. Another common encoding scheme for Unicode characters is UTF-8, where the ASCII characters are encoded with a single byte (8 bits), many others (e.g. accented Latin letters and Greek letters) need two bytes, and the remaining ones (e.g. most Chinese and Japanese characters) need three or four bytes.
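To make the variable lengths concrete, here is a minimal Python sketch which counts the UTF-8 bytes and UTF-16 code units for a few sample characters and computes a surrogate pair by hand. The function names encoded_sizes and surrogate_pair are invented for this illustration.

```python
def encoded_sizes(ch):
    """Return (code point, UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per 16-bit unit
    return ord(ch), utf8_bytes, utf16_units

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if code_point < 0x10000:
        raise ValueError("code points up to U+FFFF need no surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits
    return high, low

# Latin A, micro sign, Greek mu, a Chinese character, an emoji
samples = ["A", "\u00b5", "\u03bc", "\u4e2d", "\U0001f600"]
for ch in samples:
    cp, n8, n16 = encoded_sizes(ch)
    print(f"U+{cp:04X}: {n8} byte(s) in UTF-8, {n16} unit(s) in UTF-16")

high, low = surrogate_pair(0x1F600)
print(f"U+1F600 -> surrogates {high:04X} {low:04X}")
```

Only the last sample lies outside the range which fits into a single 16-bit unit, so it is the only one needing a surrogate pair in UTF-16 and four bytes in UTF-8.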
The content of a computer file, for example, can only be handled correctly if the encoding used (e.g. ANSI or Unicode, and which particular scheme) is known. Unfortunately, that information is often lost or corrupted, e.g. by wrong assumptions about an encoding. This problem has been solved for many Unicode files by writing a so-called byte order mark (BOM), consisting of a few bytes, at the beginning of a file. A program finding such a BOM in a text file can use it to determine (a) that it is a Unicode file and (b) exactly which UTF encoding has been used. However, it is common, e.g. for web pages, to use Unicode without a BOM; the encoding is then indicated elsewhere (in an HTTP header transmitted before the page content). For a file which has no BOM and comes without such external information, it is often difficult or impossible for software to determine what encoding has been used; asking the user may be impractical or may not work because the user doesn't know that either.
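How a program may recognize a BOM can be sketched in a few lines of Python. This is only a simplified illustration (the function name sniff_bom is made up); it uses the BOM constants from Python's standard codecs module.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return (encoding name, BOM length in bytes), or (None, 0) without a BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # UTF-8 with BOM
print(sniff_bom(b"hello"))                    # no BOM: encoding remains unknown
```

Note that the absence of a BOM tells the program nothing: the file may be UTF-8 without BOM or use any ANSI code page.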
Of course, an encoding which does not use a fixed number of bytes for one character implies some technical difficulties. For example, in order to find the 23rd character in such a byte sequence, one has to scan it from the beginning in order to properly take into account the number of bytes for each previous character. Due to such difficulties, it is often not easy to modify old software such that it can handle Unicode data. That problem can in principle be solved by using UTF-32 encoding, where each character is represented by 32 bits (4 bytes), but that is rarely used because it does not present a memory-efficient way for storing text.
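The difference between scanning and direct access can be sketched as follows in Python. The function names are invented for this illustration, and the code assumes valid input data; it is meant to show the principle rather than to be a robust decoder.

```python
def utf8_char_length(lead_byte):
    """Number of bytes of a UTF-8 sequence, derived from its first byte."""
    if lead_byte < 0x80:
        return 1          # 0xxxxxxx: ASCII
    if lead_byte >> 5 == 0b110:
        return 2          # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3          # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4          # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

def nth_char_utf8(data, n):
    """Return the n-th character (counting from 0) by scanning from the start."""
    pos = 0
    for _ in range(n):                       # skip the n previous characters
        pos += utf8_char_length(data[pos])
    length = utf8_char_length(data[pos])
    return data[pos:pos + length].decode("utf-8")

def nth_char_utf32(data, n):
    """With UTF-32 (little-endian, no BOM) the byte position is simply 4 * n."""
    return data[4 * n:4 * n + 4].decode("utf-32-le")

text = "5 \u03bcm \u4e2d\U0001f600!"
assert nth_char_utf8(text.encode("utf-8"), 6) == text[6]
assert nth_char_utf32(text.encode("utf-32-le"), 6) == text[6]
print(len(text.encode("utf-8")), "bytes in UTF-8,",
      len(text.encode("utf-32-le")), "bytes in UTF-32")
```

The UTF-8 version needs a loop over all previous characters, while the UTF-32 version computes the position directly, at the price of a substantially larger byte sequence.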
The Handling of Encoding in Our Software
In earlier versions of software from RP Photonics (before 2017), Unicode in the form of UTF-16 was used internally throughout, but text files (e.g. scripts) were written with ANSI encoding based on the standard code page in the Windows system of the user. That approach is quite common, but it led to problems e.g. for users in Japan. Also, there were (rare) problems e.g. with the Greek μ (micro) character missing or encoded differently with some code pages. Therefore, the handling has now been thoroughly modernized, arriving at the following rules:
- Internally, UTF-16 is used throughout, allowing virtually all characters to be handled. The only limitation is that surrogate pairs are not specially treated in some places; that might cause problems, but hardly any user of our software will deal with such characters, and even then it will often work.
- When the software writes a plain text file (e.g. when the user saves a script or the settings of an interactive form), it now always uses UTF-8 encoding with a BOM (see above). This implies (a) that virtually all characters can now be stored in a file and (b) that one can determine automatically without doubt what encoding has been used for such a file.
- The software offers various functions for saving information in text files; for example, see the page describing the function open_file(). By default, it now uses UTF-8 also for such files, but the mentioned function offers the new feature that one can change the file encoding which is used for writing or assumed when reading a text file. Therefore, one can now easily write or read text files with any encoding – in the case of Unicode with or without BOM.
- All demo scripts and related files are now encoded with UTF-8 (with BOM).
- When a plain text file without a BOM is read, and the software is not explicitly told what the encoding is, it assumes ANSI encoding with a certain code page. That code page is by default 1252 (Western European), which leads to correct results e.g. when demo scripts from old versions are loaded. However, this assumption can lead to problems if a user having a different code page in his Windows system saved a file with an older version of the software. Therefore, one can change the assumed code page in the general settings (see Options | Options); one needs to set the correct code page there before opening such a file. One can thereafter just save the file in order to obtain it with UTF-8 encoding. So even in such rare cases you don't need an additional software utility for converting old files.
- A remaining question is how to enter special characters in the software. There is a new character map form which you get in the main menu with Edit | Character map. There, you can select a certain character block and click on any of the displayed characters in order to collect them for pasting into your script, for example. Alternatively, since you can paste any text into the script editor, you might use some external utility for entering special characters.
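The rules above for reading and writing plain text files can be mimicked outside the software with a short Python sketch. It is not the actual implementation; the functions read_text and write_text are hypothetical, and for simplicity only the UTF-8 BOM is checked.

```python
import codecs

def read_text(path, fallback_code_page="cp1252", encoding=None):
    """Read a text file: explicit encoding first, then BOM, then the assumed code page."""
    with open(path, "rb") as f:
        data = f.read()
    if encoding is not None:                  # the caller stated the encoding
        return data.decode(encoding)
    if data.startswith(codecs.BOM_UTF8):      # BOM found: certainly UTF-8
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(fallback_code_page)    # no BOM: assume an ANSI code page

def write_text(path, text):
    """Always write UTF-8 with BOM; Python's 'utf-8-sig' codec prepends the BOM."""
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write(text)
```

For converting an old file which was saved on a system with a different code page, one would call for example write_text(path, read_text(path, fallback_code_page="cp1253")), which corresponds to setting the assumed code page in the options, opening the file and saving it again.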
The approach taken implies that you can now use all kinds of Unicode characters, even in a single script or output file, and they will not be corrupted when you read such files later on. The encoding of text files written from now on will always be correctly recognized, so you will have no problems with corrupted special characters. So normally you don't need to care, and it just works. And with some care (see above), you can correctly load old files (even those containing special characters) and save them again with UTF-8 encoding.
By the way, it is now also possible to open a large number of text files at once, i.e. with a single File | Open action. This is useful for converting a bunch of files: just open all of them and save each one with Ctrl-S.
Most users of our software probably never had any encoding problems even with old versions, but if you would like to have a free update in order to enjoy the features described above, just tell me. That offer applies even to those using quite old versions.
This article is a posting of the RP Photonics Software News, authored by Dr. Rüdiger Paschotta. You may link to this page, because its location is permanent.
Note that you can also receive the articles in the form of a newsletter or with an RSS feed.