non-Unicode text handling in CMarkup

To convert text that is not Unicode and not ASCII, CMarkup can use Windows APIs, or iconv on Linux and OS X. If you don't need this, just define MARKUP_STDCONV.

Modes of charset conversion

It has taken a long time to develop the simplest design for handling this cross-platform issue in CMarkup, but I think I've gotten closer.

Update December 17, 2008: With CMarkup release 10.1, CMarkup has 3 compile-time modes for character set conversions and multibyte functions:

  • Windows API mode: default for Visual C++, MARKUP_WINCONV
  • iconv mode: default for GNUC, MARKUP_ICONV
  • standard C mode: use MARKUP_STDCONV to turn other modes off

In VC++ or g++ you can add MARKUP_STDCONV to your preprocessor definitions to force standard C mode.

    Windows API

    The Windows API based mode MARKUP_WINCONV is implemented using WideCharToMultiByte and MultiByteToWideChar, where CP_ACP represents the Windows system locale code page (the same one returned by GetACP). In MBCS builds, MARKUP_WINCONV mode uses _mbclen to determine character length. _mbclen is a Visual C++ function that uses the same Windows system locale code page as CP_ACP.

    The MARKUP_WINCONV mode is automatically selected in VC++. With any other compiler on Windows, add MARKUP_WINCONV to your project preprocessor definitions. For example, compile the CMarkup test program with g++ in cygwin as follows:

    g++ main.cpp Markup.cpp MarkupTest.cpp -DMARKUP_WINCONV

    iconv

    The iconv API mode MARKUP_ICONV on Linux and OS X sometimes requires an extra step to link into your program. On OS X with g++ I had to specify -liconv on the command line. I compiled the test program as follows:

    g++ main.cpp Markup.cpp MarkupTest.cpp -liconv

    On some systems you may need to choose between libiconv and iconv. CMarkup uses only the most basic functionality of the iconv API to try to avoid inconsistencies between implementations.
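For readers unfamiliar with it, the basic iconv API consists of three calls: iconv_open, iconv and iconv_close. The sketch below shows a typical use of those calls to convert a whole string between two charsets. It is an illustration only, not CMarkup's own conversion code; ConvertCharset is a hypothetical helper name, and the sketch assumes the glibc/libiconv style prototype in which the input pointer is a char** (a few older platforms declare it as const char**).

```cpp
#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts input from charset fromCode to charset toCode.
// Throws std::runtime_error if the conversion is unsupported or the input is invalid.
std::string ConvertCharset(const std::string& input,
                           const char* fromCode, const char* toCode)
{
    // Note the argument order: destination charset first, then source charset.
    iconv_t cd = iconv_open(toCode, fromCode);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output;
    char buffer[256];
    // iconv takes a non-const pointer here, but it does not modify the input bytes.
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        size_t outLeft = sizeof(buffer);
        size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = errno;  // save before any other call can change it
        // Keep whatever was converted in this pass.
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed on invalid or incomplete input");
        }
    }
    // A stateful target encoding would also need a final flush call; omitted here.
    iconv_close(cd);
    return output;
}
```

For example, ConvertCharset("caf\xE9", "ISO-8859-1", "UTF-8") turns the single Latin-1 byte for "é" into its two-byte UTF-8 form. On systems where iconv is a separate library, this still needs -liconv at link time as described above.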

    The MARKUP_ICONV mode is currently automatically selected in g++ based on the __GNUC__ predefined macro. Again, you can turn off iconv usage by adding MARKUP_STDCONV to your project preprocessor definitions, or on the command line with -DMARKUP_STDCONV.

    g++ main.cpp Markup.cpp MarkupTest.cpp -DMARKUP_STDCONV

     

    comment posted CMarkup 10.1 compiler problem

    Eric 23-Dec-2008

    My OS is HP-UX 11iv3; my C++ compiler is gcc 4.2.3. I could compile CMarkup 10.0 with no error. When I compiled my program with CMarkup 10.1, it ran into these error messages:

    ld: Unsatisfied symbol "libiconv_open" in file /var/tmp//ccbA9y5m.o
    ld: Unsatisfied symbol "libiconv_close" in file /var/tmp//ccbA9y5m.o
    ld: Unsatisfied symbol "libiconv" in file /var/tmp//ccbA9y5m.o

    When I link iconv like this:

    g++ -o ca ComputerEstate.cpp Markup.cpp -liconv

    it shows:

    ld: Can't find library or mismatched ABI for -liconv
    Fatal error.

    This compiles without error:

    g++ -o  ca ComputerEstate.cpp Markup.cpp -DMARKUP_STDCONV

    With MARKUP_STDCONV you are excluding iconv. It is usually fine to do without a full conversion API (the Standard C section below explains how setlocale lets standard C mode handle Far Eastern or ANSI files that are not in the system locale charset). But if you need the iconv mode MARKUP_ICONV in your program, you might have to download and install libiconv before compiling.

    Standard C

    Standard C mode MARKUP_STDCONV supports ANSI conversion to and from Unicode if you call setlocale to initialize your character set. On Windows, CMarkup will be sensitive to the setlocale charset if and only if it is in standard C MARKUP_STDCONV mode.

    You can use setlocale to select a charset other than the system locale code page and the UTF8ToA and AToUTF8 functions will work for that charset.

    Note that as a process-wide setting, setlocale has potential disadvantages for your program as a whole. If another part of your program depends upon or uses setlocale there could be unintended conflicts. And in a multi-threaded program there are additional implications. You have to decide if setlocale is appropriate in your case. The firstobject XML editor used setlocale with success until release 2.3.1. At the time of writing (12/2008) the editor is being switched over to CMarkup 10.1 and the Windows API conversion mode.
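One way to limit the reach of a setlocale change is to save the current locale, switch, and restore the old value when the conversion is done. The class below is a minimal sketch of that pattern, assuming a single-threaded section of code; ScopedLocale is a hypothetical name and is not part of CMarkup. It does nothing about the multi-threading concerns mentioned above, because the locale is still changed for the whole process while the object is alive.

```cpp
#include <clocale>
#include <string>

// Hypothetical helper: switches the C locale for the lifetime of the object
// and restores the previous locale in the destructor.
class ScopedLocale
{
public:
    ScopedLocale(int category, const char* locale)
        : m_category(category), m_ok(false)
    {
        // Passing NULL queries the current locale without changing it.
        const char* current = std::setlocale(category, NULL);
        if (current)
            m_saved = current;  // copy: later setlocale calls may overwrite the buffer
        m_ok = std::setlocale(category, locale) != NULL;
    }

    ~ScopedLocale()
    {
        if (!m_saved.empty())
            std::setlocale(m_category, m_saved.c_str());
    }

    // True if the requested locale was accepted by the C runtime.
    bool ok() const { return m_ok; }

private:
    ScopedLocale(const ScopedLocale&);             // non-copyable
    ScopedLocale& operator=(const ScopedLocale&);

    int m_category;
    std::string m_saved;
    bool m_ok;
};
```

A typical use would be to create a ScopedLocale(LC_ALL, "") in a block around the conversion calls, so that the previous locale comes back as soon as the block ends.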

    So if you are using standard C mode and converting a file or string to/from a single-byte or double-byte encoding, or using an MBCS build, you must call setlocale. If you are using just the system locale charset, call setlocale in your program initialization to prime the C multibyte functions for the system/user locale charset.

    #include <locale.h>
    setlocale(LC_ALL, "");

    Note that on Windows, setlocale will default to the *user* locale charset which is usually but not always the same as the *system* locale charset (Regional Settings for non-Unicode programs). To make sure you are setting it to the system locale code page on Windows you can do something like this:

    char szACP[10];
    sprintf( szACP, ".%u", GetACP() ); // GetACP returns an unsigned code page number, giving e.g. ".1252"
    setlocale(LC_ALL, szACP);

    Standard C conversions use mbtowc and wctomb, and MBCS builds use mblen to determine character length. All of these standard C functions use the code page specified in setlocale, which is not necessarily the same as CP_ACP.
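To make the role of mblen concrete, here is a small sketch that walks a multibyte string one character at a time using the charset selected by setlocale. It only illustrates the standard C function and is not CMarkup's internal code; CountMultibyteChars is a hypothetical name.

```cpp
#include <clocale>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: counts characters (not bytes) in a multibyte string,
// interpreted in the charset of the current C locale.
// Returns -1 if the string contains an invalid or incomplete sequence.
long CountMultibyteChars(const char* text)
{
    std::mblen(NULL, 0);  // reset any shift state left over from earlier calls
    size_t remaining = std::strlen(text);
    long count = 0;
    while (remaining > 0) {
        // mblen reports how many bytes the next character occupies.
        int len = std::mblen(text, remaining);
        if (len <= 0)
            return -1;
        text += len;
        remaining -= static_cast<size_t>(len);
        ++count;
    }
    return count;
}
```

The result depends on the locale in effect: the same bytes can count as a different number of characters after a different setlocale call, which is why the setlocale initialization described above matters in this mode.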

    See Also:

    ANSI and Unicode files and C++ strings