Command Prompt for Windows




Unicode and ANSI Text

What is Unicode?

Unicode is a character set that aims to include the characters of all the world's languages in a single character set. Unicode was designed to overcome the issues of working with multiple code pages. As of Unicode 6.3, the code space has room for just over 1 million code-points (1,114,112), of which roughly 110,000 are assigned to characters. Even today, Unicode is still being extended.

Unicode comes in various encodings, including:

  • UCS
  • UCS-2
  • UTF-7
  • UTF-8
  • UTF-16
  • UTF-32

The first three encodings are obsolete and are rarely used today. UTF-8 is used widely over the internet, while UTF-16 is the default encoding for Windows. UTF-32 is not widely used but still pops up here and there. If that is not confusing enough, there is also a byte order, or endianness, for UTF-16 and UTF-32. Intel processors use little-endian (LE) format while Motorola processors use big-endian (BE) format.
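The difference between the encodings and byte orders can be seen by encoding one character several ways. The following is a minimal sketch using Python's built-in codecs; the helper name encode_all is made up for this illustration.

```python
# Encode one character in several Unicode encodings and byte orders.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

def encode_all(text):
    """Return a dict mapping each encoding name to the bytes as a hex string."""
    return {name: text.encode(name).hex() for name in ENCODINGS}

# U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
for name, hex_bytes in encode_all("\u00e9").items():
    print(f"{name:10} {hex_bytes}")
```

The LE and BE forms contain the same bytes in reverse order, which is why a program must know the byte order before it can read UTF-16 or UTF-32 text.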

The algorithms for converting between UTF-8, UTF-16 and UTF-32 are well documented.
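As an illustration of one such algorithm, the sketch below converts a single code-point (its UTF-32 value) into UTF-16 by hand and checks the result against Python's own encoder. The function name to_utf16_units is an assumption made for this example only.

```python
def to_utf16_units(code_point):
    """Convert one code-point to a tuple of 16-bit UTF-16 values."""
    if code_point < 0x10000:
        # Code-points below U+10000 fit in a single 16-bit value.
        return (code_point,)
    # Higher code-points are split into a high and a low surrogate,
    # each carrying 10 bits of the value minus 0x10000.
    value = code_point - 0x10000
    high = 0xD800 + (value >> 10)
    low = 0xDC00 + (value & 0x3FF)
    return (high, low)

units = to_utf16_units(0x1F600)          # U+1F600 GRINNING FACE
print([f"{u:04X}" for u in units])       # ['D83D', 'DE00']

# Cross-check against the built-in UTF-16 LE encoder.
manual = b"".join(u.to_bytes(2, "little") for u in units)
print(manual == "\U0001F600".encode("utf-16-le"))
```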

When the WinOne® Command Prompt refers to Unicode, it is referring to the UTF-16 LE encoding. When there is a reason to refer to another Unicode encoding, the correct name will be used.

What is ANSI?

ANSI is another character encoding, typically used for English characters. It includes only 256 characters. ANSI was made more general by introducing code-pages. Different languages require different code-pages to represent their characters, and this is what makes it difficult to support other languages when using the ANSI character encoding.

The best way to think of a code-page is as a table of 256 character images, where each byte in a text file is simply an index into that table. Change the code-page and the displayed text changes. This is where the main issues with ANSI characters come from, as a text file contains no information to tell us which code-page was used to create it.

In Windows, there are different English code-pages for DOS/Console programs (CP-437) and for Windows GUI programs (CP-1252).
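The table-lookup idea can be demonstrated by looking up the same byte in both of these code-pages. This is a small sketch that uses Python's cp437 and cp1252 codecs as stand-ins for the Windows code-page tables; the helper name lookup is hypothetical.

```python
import unicodedata

def lookup(byte_value, code_page):
    """Return the character that one byte selects in the given code-page."""
    return bytes([byte_value]).decode(code_page)

# The same byte (0xE9) selects a different character in each table.
for code_page in ("cp437", "cp1252"):
    ch = lookup(0xE9, code_page)
    print(code_page, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The byte in the file never changes; only the table used to interpret it does.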

Windows sets a default code-page when you install your operating system, and this is what is used to display ANSI text. If the wrong characters are displayed, it is up to the user to set the correct code-page. See commands CHCP and ACS for more information.

Also, when converting from ANSI to Unicode, the main issue arises if the incorrect code-page is used for the conversion. The resulting Unicode text will display the wrong characters.
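A conversion with the wrong code-page might look like the sketch below. It assumes the text was saved by a program using CP-1252; the function name ansi_to_utf16le is invented for this example and is not a WinOne® command.

```python
def ansi_to_utf16le(data, code_page):
    """Interpret ANSI bytes with a code-page, then encode as UTF-16 LE."""
    return data.decode(code_page).encode("utf-16-le")

# "cafe" with an acute accent on the e, saved as CP-1252 bytes.
ansi_bytes = "caf\u00e9".encode("cp1252")

right = ansi_to_utf16le(ansi_bytes, "cp1252")
wrong = ansi_to_utf16le(ansi_bytes, "cp437")

# The last character differs: U+00E9 versus U+0398 (a Greek letter).
print(right.hex())
print(wrong.hex())
```

Once the wrong character has been written out as Unicode, nothing in the file records that a mistake was made.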

With Unicode, there are no code-pages, and each code-point represents the same character on all computers around the world. This is the main advantage of Unicode!

Not all console programs are designed to work with Unicode; many only work with ANSI characters. Console programs written by Microsoft generally support Unicode characters, while most third-party console programs support only ANSI characters.

Why is Unicode important?

Unicode is important because Unicode characters can easily end up in file names. As the internet is so widely used today, it is not uncommon to download files with foreign characters (i.e. Unicode characters) in the file name. Programs that only work with ANSI file names quite often fail to work correctly with such files.
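The sketch below shows roughly what happens when a Unicode file name is forced into an ANSI code-page. It uses Python's cp1252 codec with replacement as a stand-in for the conversion an ANSI-only program relies on; the exact Windows behaviour may differ, and the function name to_ansi_name is hypothetical.

```python
def to_ansi_name(name, code_page="cp1252"):
    """Encode a file name as ANSI, replacing unsupported characters with '?'."""
    return name.encode(code_page, errors="replace")

# A Cyrillic file name: four letters followed by ".txt".
unicode_name = "\u0444\u0430\u0439\u043b.txt"

print(to_ansi_name(unicode_name))   # b'????.txt'
print(to_ansi_name("report.txt"))   # b'report.txt'
```

The converted name no longer matches the file on disk, so any attempt to open it by that name fails.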

Similarly, Unicode can easily show up inside your text files for the same reason.

Valid Unicode character sequences

The Unicode standard includes a huge number of code-points that are not valid characters (for example, code-points that have not been assigned). The overhead of validating each Unicode character, or sequence of Unicode characters, is large, and doing so would result in noticeably slower software.

To complicate things even more, what is considered an invalid character in one version of Windows may be perfectly valid in another. Similarly, no one knows what characters will be added or removed in future versions of Windows.

Internally, the WinOne® Command Prompt does not check whether a Unicode character or a sequence of Unicode characters is valid. This is normal for the industry. The vast majority of software just assumes the characters it comes across are valid. For example, a text editor does not check a file it opens for invalid Unicode characters. It just opens the file and uses Windows to display whatever characters can be displayed.
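For readers who want to see what one kind of invalid sequence looks like, the sketch below checks UTF-16 LE data for unpaired surrogates using Python's strict decoder. It covers only that one class of error (it does not detect unassigned code-points), and the function name is_valid_utf16le is an assumption for this example.

```python
def is_valid_utf16le(data):
    """Return True if the bytes decode as UTF-16 LE without errors."""
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True

good = "A".encode("utf-16-le")
lone_high = b"\x3d\xd8\x41\x00"   # high surrogate D83D followed by "A"
lone_low = b"\x00\xde"            # low surrogate DE00 with nothing before it

print(is_valid_utf16le(good))
print(is_valid_utf16le(lone_high))
print(is_valid_utf16le(lone_low))
```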

Characters vs Code-points

As a single Unicode character can include multiple code-points, it can be confusing to know what the WinOne® Command Prompt documentation means by the word "character". When referring to ANSI text, a character is a single byte, but when the WinOne® Command Prompt documentation is referring to Unicode text, a character is a single code-point. This is an important difference!

For example, consider the sequence U+0061 U+030A (i.e. "LATIN SMALL LETTER A" + "COMBINING RING ABOVE"), which looks like the single character "å" when displayed. In the WinOne® Command Prompt documentation this is not a single character but 2 characters (as the word character really means code-point when referring to Unicode text). When there is a need to be specific, the documentation will use the specific term, that is, code-point. For example, commands CHARNEXT and CHARPREV can be used to determine the number of code-points that make up a single Unicode character.
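The same sequence can be inspected with a few lines of Python, shown here only as an illustration (Python strings, like this documentation, count code-points):

```python
import unicodedata

text = "a\u030a"   # U+0061 followed by U+030A

# One displayed character, but two code-points.
print(len(text))
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```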

Unicode Normalisation

In Unicode, there are situations where 2 or more sequences of characters look identical even though the actual code-points are different. This can make it difficult to compare strings, as they may appear identical when displayed but have different code-points internally. Windows overcomes this problem by using Normalisation. When 2 strings are normalised to the same form, they can be readily compared.

WinOne® includes a command called STRFOLD, which offers several different forms of Normalisation that you can convert your strings to before comparing them. See Command STRFOLD for more information.
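The effect of normalising before comparing can be sketched with Python's unicodedata module. This is a stand-in for illustration only: it does not show STRFOLD syntax, and the helper name same_text is made up.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalising both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

decomposed = "a\u030a"    # letter a + combining ring above
precomposed = "\u00e5"    # single code-point for a with ring above

print(decomposed == precomposed)           # False: code-points differ
print(same_text(decomposed, precomposed))  # True: equal once normalised
```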

Unicode Fonts

There are very few fixed-width fonts that implement the complete Unicode character set and are therefore suitable for use with console programs. Also, Windows has a built-in mechanism that searches for a substitute font when the selected font does not include the character that needs to be displayed. This is often called Automatic Font Substitution.

When it comes to console fonts, Automatic Font Substitution can have undesirable results, as the font Windows substitutes is often the wrong size or looks completely out of place. WinOne® will detect this and, if the font is not right, a question mark character is often displayed instead of the wrong font. When Windows cannot find a substitute font, it may display a square character, an upside-down triangle and so on.

One font that works well with the WinOne® Command Prompt is Unifont. Unifont is not supplied with the WinOne® Command Prompt, but is freely available to download from the internet. See http://unifoundry.com/unifont.html. Simply download and install the font into your Windows operating system, just as you would any other font.

Known bugs in Windows XP

WinOne® makes extensive use of the Win32 API calls CharNext() and CharPrev(). These functions are designed to allow a program to move forward or backward one character in a string. In Unicode, a character can be made up of one or more code-points. Windows stores Unicode text as 16-bit UTF-16 code units, and a code-point occupies one or two of them. This means that one character could be 16 bits, 32 bits or even more. CharNext() will determine the correct number of 16-bit code units to skip to get to the next character.
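The idea of stepping over a whole character can be sketched as follows. This is a simplified illustration and not the Win32 implementation: the helper names utf16_units and char_next are hypothetical, and only combining marks below U+10000 are recognised.

```python
import struct
import unicodedata

def utf16_units(text):
    """Return the text as a list of 16-bit UTF-16 values."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def char_next(units, index):
    """Return the index of the character that follows the one at index."""
    def step(i):
        # A high surrogate followed by a low surrogate is one code-point.
        if (0xD800 <= units[i] <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            return i + 2
        return i + 1

    i = step(index)
    # Also skip any combining marks attached to the base character.
    while i < len(units) and unicodedata.combining(chr(units[i])) != 0:
        i = step(i)
    return i

print(char_next(utf16_units("ab"), 0))             # 1
print(char_next(utf16_units("a\u030ab"), 0))       # 2
print(char_next(utf16_units("\U0001F600b"), 0))    # 2
```

A version that always returned index + 1 would stop in the middle of the second and third examples.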

Under Windows XP, CharNext() and CharPrev() always move by 1 (that is, a single 16-bit value), so the WinOne® Command Prompt may fail to move forward or backward by a complete character within a string under Windows XP. There is no known fix for this and, as far as is known, Microsoft has no plans to fix this bug.

This bug does not affect ANSI characters, as CharNext() and CharPrev() only need to move by 1 for ANSI characters and that is exactly what these two APIs do.

Even though this may sound like a huge issue, it is quite rare that this bug will actually result in a problem.

What version of Unicode does Windows Support?

This is a great question. No one knows for sure which version of Unicode is supported under which Windows operating system. The older the operating system, the less of the Unicode standard is implemented. This is also something to consider when using the WinOne® Command Prompt on older Windows operating systems. A Unicode character may exist on one operating system and not on another.

Many people working on older Windows operating systems are aware that the version of the Unicode standard supported on those operating systems is not up to date.

ANSI characters are fully supported on all Windows operating systems.