Convert ANSI file to Unicode

Trying to get your ANSI file into Unicode? Here are some steps for converting ANSI to Unicode in Windows using the free firstobject XML editor called "foxe" (download foxe here). This is practical information, not the place to learn historically fascinating details about the ANSI misnomer.

Which ANSI charset?

The ANSI charsets are all the same for ASCII characters such as the digits 0-9 and the English letters a-z and A-Z; the problems only become noticeable with non-ASCII characters. All text files that are not in a Unicode encoding can have problems when shared internationally, or simply when a machine's default language configuration changes.
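To make this concrete, here is a small Python sketch (Python is used only for illustration and is not part of foxe). It decodes the same two bytes under three Windows code pages; the helper name `decode_with_each` is made up for this example.

```python
import unicodedata


def decode_with_each(raw, charsets=("cp1252", "cp1251", "cp1253")):
    """Decode the same bytes under several single-byte code pages."""
    return {name: raw.decode(name) for name in charsets}


# 0x41 is ASCII "A" in every charset; 0xE9 is a non-ASCII byte whose
# meaning depends on which code page is assumed.
for name, text in decode_with_each(b"\x41\xe9").items():
    print(name, ascii(text), unicodedata.name(text[1]))
```

The ASCII byte decodes to the same letter each time, while the second byte becomes a Latin, Cyrillic, or Greek letter depending on the charset.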

You probably don't need to know exactly which ANSI charset you have if it is specified in the XML or HTML file or if it matches your system default. Many different names are used across the Internet in XML declaration encodings and HTML Content-Type charsets (explained below). Here is a small sample of the common single-byte (aka SBCS) ones:

  • Western European: Windows (Windows-1252, iso-ir-6); ISO (iso-8859-1, cp819, csISO, Latin1, ibm819, iso-ir-100); Mac (macintosh)
  • Cyrillic: Windows (Windows-1251, x-cp1251); ISO (iso-8859-5, csISOLatin5, cyrillic, iso-ir-144); Mac (x-mac-cyrillic); KOI8-R (koi8-r, csKOI8R, koi)
  • Greek: Windows (Windows-1253); ISO (iso-8859-7, csISOLatinGreek, ECMA-118); Mac (x-mac-greek)
  • Hebrew: Windows (Windows-1255); ISO-Visual (iso-8859-8, csISOLatinHebrew); ISO-Logical (iso-8859-8-i); Mac (x-mac-hebrew)
  • Others: Windows-1256, iso-8859-6, arabic, x-mac-arabic, Windows-1257, iso-8859-4, Windows-1250, iso-8859-2, iso-8859-3, iso-8859-15, Windows-1254, iso-8859-9, x-mac-turkish, Windows-1258
In Windows terminology, the ANSI charsets include the double-byte (aka DBCS) Far Eastern encodings. These are actually "multi-byte" (aka MBCS) with 1 byte for ASCII characters and 1 or 2 bytes for other characters:

  • Japanese: EUC (euc-jp, csEUCPkdFmtJapanese, x-euc-jp); Mac (x-mac-japanese); Shift-JIS (shift_jis, csShiftJIS, csWindows31J, ms_Kanji, x-ms-cp932, x-sjis)
  • Chinese Simplified: EUC (EUC-CN, x-euc-cn); GB2312 (gb2312, chinese, CN-GB, GB_2312-80, GBK, iso-ir-58); Mac (x-mac-chinesesimp)
  • Chinese Traditional: Big5 (big5, cn-big5); Mac (x-mac-chinesetrad)
  • Korean: ks_c_5601-1987, euc-kr, iso-ir-149; ISO (iso-2022-kr); EUC (euc-kr, csEUCKR); Mac (x-mac-korean)
  • Thai: Windows (Windows-874, iso-8859-11, TIS-620)
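
As a rough illustration of the "multi-byte" behaviour of the double-byte charsets, the following Python sketch counts the bytes each character takes in Shift-JIS. The helper `encoded_lengths` is a hypothetical name, and Python's `shift_jis` codec stands in for the Windows code page.

```python
def encoded_lengths(text, charset):
    """Return (character, byte count) pairs for each character in text."""
    return [(ch, len(ch.encode(charset))) for ch in text]


# "A" is ASCII, U+FF71 is a half-width katakana letter, U+65E5 is a kanji.
sample = "A\uff71\u65e5"
for ch, size in encoded_lengths(sample, "shift_jis"):
    print(ascii(ch), size)
```

The ASCII letter and the half-width katakana each take one byte, while the kanji takes two. This is why code that assumes one byte per character breaks on these charsets.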
    System default

    Every operating system or shell environment configures a default charset. In Windows, the "Language for non-Unicode programs" option in Regional Settings specifies the system locale code page. Most editors assume the default charset and do not give you a choice of ANSI charset when they open a file. Likewise, the Windows ANSI "A" APIs expect strings in that code page.
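
If you want to check the default from a script, a minimal Python sketch might look like the following. It assumes that the Win32 `GetACP` function (reached through `ctypes`) reports the system ANSI code page on Windows and that the locale's preferred encoding is a reasonable stand-in elsewhere. The function name `default_ansi_charset` is made up for this example.

```python
import locale
import sys


def default_ansi_charset():
    """Best-effort name of the default non-Unicode charset on this machine."""
    if sys.platform == "win32":
        import ctypes
        # GetACP returns the active ANSI code page number, e.g. 1252.
        return "cp%d" % ctypes.windll.kernel32.GetACP()
    # On other platforms, use the encoding the current locale prefers.
    return locale.getpreferredencoding(False)


print(default_ansi_charset())
```

On many modern non-Windows systems this simply reports UTF-8, in which case there is no ANSI default to worry about.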

    Which Unicode?

    You probably already know which Unicode encoding you need, but if not, UTF-8 is the ideal encoding for XML and HTML files. UTF-16 is also used, especially in Windows. Windows programs often refer to UTF-16LE (little endian) as simply "Unicode." Other platforms may use UTF-16BE (big endian), so if you are not on Windows you will need to find out whether you need LE or BE. UCS-2 is the precursor to UTF-16 and is the same except that UTF-16 can also represent characters beyond the first 65,536 code points (using pairs of 16-bit units), so just select UTF-16LE wherever you are dealing with UCS-2LE and UTF-16BE for UCS-2BE.
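
The difference between these encodings is easiest to see in the bytes. The following Python sketch (an illustration only; `unicode_forms` is a hypothetical helper) shows one character from the basic range and one beyond it in UTF-8, UTF-16LE, and UTF-16BE.

```python
def unicode_forms(ch):
    """Return the hex bytes of one character in three Unicode encodings."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-le": ch.encode("utf-16-le").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }


# U+00E9 fits in one 16-bit unit; U+1F600 needs a pair of 16-bit units
# in UTF-16, which is the part UCS-2 cannot express.
for ch in ("\u00e9", "\U0001F600"):
    print(ascii(ch), unicode_forms(ch))
```

Note that LE and BE contain the same 16-bit units with the two bytes of each unit swapped.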

    If you are saving to a UTF-8 file you have a choice of using the 3-byte UTF-8 preamble (aka BOM) or not. Some tools (e.g. Oracle SQL*Plus) don't accept the UTF-8 preamble. If your file is XML with UTF-8, don't use a preamble; it is not necessary. However, with UTF-16, the 2-byte BOM is generally expected.
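
For reference, this is what the preamble and BOM look like at the byte level. The sketch uses Python's standard `utf-8-sig` codec and the BOM constants from the `codecs` module. The sample text is arbitrary.

```python
import codecs

text = "<root/>"

utf8_plain = text.encode("utf-8")          # no preamble
utf8_preamble = text.encode("utf-8-sig")   # starts with EF BB BF
utf16le_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")  # starts with FF FE

print(utf8_plain[:3].hex(), utf8_preamble[:3].hex(), utf16le_bom[:2].hex())
```

Apart from the leading bytes, the UTF-8 file is identical with or without the preamble, which is why tools that do not expect the preamble see it as three stray characters at the start.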

    Corrupted Text

    If you don't handle your text conversion correctly, you will get corrupted text. Corrupted text displays as garbled text, strange symbols, boxes, and question marks.

    •à¾à šà¶

    A little box (a square, to be precise, aka ".notdef glyph") is commonly shown for each character not supported by the font. Text corruption often results in random code points not supported by available fonts. But if the only problem you see is boxes, it might not be corrupted text; it might simply be that the font does not support the characters even though they are being interpreted (decoded) correctly. If this is the case, try changing the font to a full Unicode font or a font that supports the language.
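
The two most common kinds of corruption can be reproduced in a few lines of Python. This is only a sketch of the mechanism; the sample word is arbitrary.

```python
original = "caf\u00e9"

# UTF-8 bytes misread as Windows-1252: each non-ASCII character turns
# into two or more strange symbols.
garbled = original.encode("utf-8").decode("cp1252")

# Windows-1252 bytes misread as UTF-8: the invalid byte is replaced with
# U+FFFD, which is often displayed as a question mark or a box.
replaced = original.encode("cp1252").decode("utf-8", errors="replace")

# The first kind is reversible as long as nothing was lost on the way.
repaired = garbled.encode("cp1252").decode("utf-8")

print(ascii(garbled), ascii(replaced), ascii(repaired))
```

The second kind is not reversible: once a byte has been replaced, the original character is gone, so it is worth keeping the original file until the conversion has been checked.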

    Opening your ANSI file in the editor

    The first step is to open the file properly in the firstobject XML editor (foxe) so there is no corruption. When you open the file (File menu, select Open, and find your file using the File Open dialog), watch the status bar at the bottom for information about the file encoding.

    Foxe tells you in the status bar which charset was used to interpret the text in the file. If it is wrong, you will see corrupted text in your document wherever there are non-ASCII characters. In this case you will need to close the file and re-open it, selecting the correct charset from the Encoding drop-down list in the File Open dialog.

    In many cases, foxe will open the file properly (even if it is EBCDIC or ISCII or DOS). If the file is Unicode, foxe can usually tell from the 2- or 3-byte BOM or preamble, or by auto-detecting UTF-8 sequences.
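
The general idea behind this kind of detection can be sketched in Python as follows. This is not foxe's actual algorithm, only an approximation under simple assumptions: UTF-32 is ignored, and `sniff_unicode` is a hypothetical name.

```python
import codecs


def sniff_unicode(data):
    """Return a codec name if the bytes look like Unicode, else None."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # 3-byte preamble EF BB BF
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # Python's utf-16 codec reads the 2-byte BOM itself
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None          # not valid UTF-8, so probably an ANSI charset
    # Pure ASCII is valid UTF-8 too, but gives no evidence either way.
    return "utf-8" if any(b >= 0x80 for b in data) else None


print(sniff_unicode("caf\u00e9".encode("utf-8")), sniff_unicode(b"caf\xe9"))
```

Detection without a BOM works because non-ASCII text in an ANSI charset very rarely happens to form valid UTF-8 sequences.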

    For XML files that are not Unicode, it will read the XML declaration at the top, e.g.:

    <?xml version="1.0" encoding="windows-1252"?>
    

    For HTML files that are not Unicode, it will read the Content-Type meta tag in the head, e.g.:

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252"/>
    

    If none of those clues are present, foxe assumes the default non-Unicode system code page. Again, if it is wrong, you will need to select the correct charset from the Encoding drop-down list in the File Open dialog.
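
A simplified version of reading these clues might look like the Python sketch below. It is an approximation, not foxe's implementation. The function name `declared_charset`, the 1024-byte limit, and the `cp1252` fallback (standing in for the system code page) are all assumptions for illustration.

```python
import re

# The declared name is pure ASCII in every ANSI charset, so the patterns
# can be matched against the raw bytes before the charset is known.
XML_DECL = re.compile(rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z0-9._:-]+)[\"']")
META_CHARSET = re.compile(rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
                          re.IGNORECASE)


def declared_charset(data, default="cp1252"):
    """Return the charset named in the document head, or the default."""
    head = data[:1024]  # declarations are expected near the top of the file
    for pattern in (XML_DECL, META_CHARSET):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii")
    return default


print(declared_charset(b'<?xml version="1.0" encoding="windows-1252"?><a/>'))
```

The declared name is only a claim made by the document, so the result is worth checking against how the text actually displays.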

    Saving to Unicode with the editor

    Once you've opened and viewed the text file successfully, the easy part is getting it into Unicode. If it is XML or HTML, you can select the UTF-8 or UTF-16 encoding from the Tools menu to set the XML declaration encoding or HTML meta charset in the content of your document. In XML, the encoding in the XML declaration is not necessary if the file is UTF-8 or UTF-16.
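
Outside the editor, the same change to the document content can be approximated with a small Python helper. This is a sketch that assumes the declaration is at the very start of the text and already has an encoding attribute. `set_declared_encoding` is a hypothetical name.

```python
import re

# Group 1 is everything up to "encoding=", group 2 is the quote character.
DECL = re.compile(r"^(<\?xml\b[^>]*?\bencoding\s*=\s*)([\"'])[^\"']*\2")


def set_declared_encoding(xml_text, new_encoding):
    """Rewrite the encoding named in an existing XML declaration."""
    def replace(match):
        quote = match.group(2)
        return match.group(1) + quote + new_encoding + quote
    # Text without a matching declaration is returned unchanged.
    return DECL.sub(replace, xml_text, count=1)


print(set_declared_encoding('<?xml version="1.0" encoding="windows-1252"?>\n<a/>', "UTF-8"))
```

Changing the declared name does not convert the file; it only has to agree with the encoding the file is eventually saved in.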

    Select Save As from the File menu, and pay attention to the Save As dialog's Encoding setting. It shows the encoding you are saving in and has a drop-down box to change it. If the right encoding is not available, the wrong encoding is specified in the declared XML encoding or HTML charset in the document itself, so you need to remove or change the setting in the document content, or go back to the Tools menu and set the encoding from there.
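
For batch jobs where opening each file in an editor is impractical, the open-then-save steps reduce to decoding with the source charset and encoding with the target one. Here is a minimal Python sketch; `convert_file` and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def convert_file(src, dst, src_charset, dst_encoding="utf-8"):
    """Decode src using src_charset and write the text to dst in dst_encoding."""
    # Strict decoding raises an error on bytes that are invalid in
    # src_charset, rather than silently writing corrupted text.
    text = Path(src).read_bytes().decode(src_charset)
    # Working on bytes keeps the original line endings untouched.
    Path(dst).write_bytes(text.encode(dst_encoding))


# Example usage with made-up file names:
# convert_file("report-ansi.txt", "report-utf8.txt", "cp1252")
```

Pass "utf-8-sig" as the target encoding if the UTF-8 preamble is wanted. If the document declares its charset in an XML declaration or meta tag, that declaration needs to be updated separately so that it matches the new encoding.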

    See Also:

    UTF-8 Files and the Preamble
    UTF-16 Files and the Byte Order Mark (BOM)
    ANSI and Unicode files and C++ strings
    CMarkup AToUTF8 Method
    Setting the XML Declaration With CMarkup