Non-Unicode text handling in CMarkup
To convert text that is not Unicode and not ASCII, CMarkup can use Windows APIs, or iconv on Linux and OS X. If you don't need this, just define
Modes of charset conversion
It has taken a long time to develop the simplest design for this cross-platform issue in CMarkup, but I think I've gotten closer.
Update December 17, 2008: With CMarkup release 10.1, CMarkup has 3 compile-time modes for character set conversions and multibyte functions:
MARKUP_STDCONV to turn other modes off
In VC++ or g++ you can add
MARKUP_STDCONV to your preprocessor definitions to force it to standard C mode.
Windows API based mode
MARKUP_WINCONV is implemented using
MultiByteToWideChar where CP_ACP represents the Windows system locale code page (the same as
MARKUP_WINCONV mode uses
_mbclen to determine character length.
_mbclen is a Visual C++ function that uses the same Windows system locale code page as
MARKUP_WINCONV mode is automatically selected in VC++; otherwise, if you are on Windows, you should add
MARKUP_WINCONV to your project preprocessor definitions. For example, compile the CMarkup test program with g++ in cygwin as follows:
g++ main.cpp Markup.cpp MarkupTest.cpp -DMARKUP_WINCONV
The iconv API mode
MARKUP_ICONV on Linux and OS X sometimes requires an extra step to link into your program. On OS X with g++ I had to specify
-liconv on the command line. I compiled the test program as follows:
g++ main.cpp Markup.cpp MarkupTest.cpp -liconv
On some systems you may need to choose between libiconv and iconv. CMarkup uses only the most basic functionality of the iconv API to try to avoid inconsistencies between implementations.
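To give an idea of what "the most basic functionality" of iconv looks like, here is a minimal sketch that uses only iconv_open, iconv and iconv_close. It is not CMarkup's own code: the helper name convertCharset is hypothetical, the charset names are just examples, and the cast assumes the POSIX declaration of iconv (some older libiconv headers declare the input pointer as const char** instead).

```cpp
#include <cassert>
#include <cerrno>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of CMarkup: converts `input` from charset
// `from` to charset `to` using only iconv_open, iconv and iconv_close.
std::string convertCharset(const std::string& input, const char* from, const char* to)
{
    assert(from != nullptr && to != nullptr);

    iconv_t cd = iconv_open(to, from);   // note the order: target charset first
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open: conversion not supported");

    std::string output;
    char buffer[256];
    // Assumes the POSIX signature, which takes char** for the input pointer.
    char* in = const_cast<char*>(input.data());
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char* out = buffer;
        size_t outLeft = sizeof(buffer);
        size_t rc = iconv(cd, &in, &inLeft, &out, &outLeft);
        int err = errno;                 // save before any other call can change it
        output.append(buffer, sizeof(buffer) - outLeft);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }
    // Stateful encodings would need one more iconv call with a null input
    // to flush the shift state; that step is left out of this sketch.
    iconv_close(cd);
    return output;
}
```

For example, convertCharset("\xE9", "ISO-8859-1", "UTF-8") should yield the two bytes 0xC3 0xA9, provided the local iconv implementation knows both charset names.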
MARKUP_ICONV mode is currently automatically selected in g++ based on the
__GNUC__ predefined macro. Again, you can turn off iconv usage by adding
MARKUP_STDCONV to your project preprocessor definitions, or on the command line with
g++ main.cpp Markup.cpp MarkupTest.cpp -DMARKUP_STDCONV
With MARKUP_STDCONV you are excluding iconv. It is usually fine to do without a full conversion API (see below how in standard C mode you can use
setlocale even when you have Far Eastern or ANSI files not in the system locale charset). But if you need iconv
MARKUP_ICONV mode in your program, you might have to download and install libiconv before compiling.
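The selection rules described so far (an explicit define wins, VC++ defaults to the Windows API mode, g++ defaults to iconv) can be sketched with a few preprocessor conditionals. This is only an illustration and not the actual contents of Markup.h: the ConvMode enum and the selectedMode function are hypothetical names, and falling back to standard C mode for any other compiler is an assumption.

```cpp
// Illustrative only: mirrors the selection rules described in the text,
// not the real Markup.h.
#if !defined(MARKUP_STDCONV) && !defined(MARKUP_WINCONV) && !defined(MARKUP_ICONV)
  #if defined(_MSC_VER)
    #define MARKUP_WINCONV   // VC++: Windows API mode by default
  #elif defined(__GNUC__)
    #define MARKUP_ICONV     // g++: iconv mode by default
  #else
    #define MARKUP_STDCONV   // assumption: any other compiler gets standard C mode
  #endif
#endif

// Hypothetical names used to report which mode the sketch ended up with.
enum class ConvMode { StdC, WinApi, Iconv };

constexpr ConvMode selectedMode()
{
#if defined(MARKUP_STDCONV)
    return ConvMode::StdC;     // an explicit MARKUP_STDCONV turns the other modes off
#elif defined(MARKUP_WINCONV)
    return ConvMode::WinApi;
#else
    return ConvMode::Iconv;
#endif
}
```

Compiling this with -DMARKUP_STDCONV makes selectedMode() report ConvMode::StdC regardless of the compiler, which matches the override behavior described above.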
Standard C mode
MARKUP_STDCONV supports ANSI conversion to and from Unicode if you call
setlocale to initialize your character set. On Windows, CMarkup will be sensitive to the
setlocale charset if and only if it is in standard C
Note that as a process-wide setting,
setlocale has potential disadvantages for your program as a whole. If another part of your program depends upon or uses
setlocale there could be unintended conflicts. And in a multi-threaded program there are additional implications. You have to decide if
setlocale is appropriate in your case. The firstobject XML editor used
setlocale with success until release 2.3.1. At the time of writing (12/2008) the editor is being switched over to CMarkup 10.1 and the Windows API conversion mode.
So if you are using standard C mode and converting a file or string to/from a single-byte or double-byte encoding, or using an
MBCS build, you must call
setlocale. If you are using just the system locale charset, call
setlocale in your program initialization to prime the C multibyte functions for the system/user locale charset.
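As a rough illustration of what priming the C multibyte functions means, the sketch below calls setlocale once and then relies on standard C functions that follow the setlocale charset. The helper names (initLocaleFromEnvironment, toWide, countChars) are hypothetical and not part of CMarkup, and calling mbstowcs with a null destination to get the required length is an assumption that holds on POSIX systems and in Visual C++.

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: call once at startup so the C multibyte functions
// use the charset of the user's default locale.
bool initLocaleFromEnvironment()
{
    return std::setlocale(LC_ALL, "") != nullptr;
}

// Hypothetical helper: converts a multibyte string in the current setlocale
// charset to a wide string. Returns an empty string on an invalid sequence.
std::wstring toWide(const char* text)
{
    assert(text != nullptr);
    size_t needed = std::mbstowcs(nullptr, text, 0);   // length query only
    if (needed == (size_t)-1)
        return std::wstring();
    std::wstring wide(needed, L'\0');
    std::mbstowcs(&wide[0], text, needed);
    return wide;
}

// Hypothetical helper: counts characters rather than bytes using mblen,
// stopping at the first sequence that is invalid in the current charset.
size_t countChars(const char* text)
{
    assert(text != nullptr);
    std::mblen(nullptr, 0);                            // reset any shift state
    size_t count = 0;
    size_t remaining = std::strlen(text);
    while (remaining > 0) {
        int len = std::mblen(text, remaining);
        if (len <= 0)
            break;
        text += len;
        remaining -= static_cast<size_t>(len);
        ++count;
    }
    return count;
}
```

Because both helpers depend on the process-wide locale, their results for non-ASCII input change with whatever charset was last passed to setlocale.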
Note that on Windows,
setlocale will default to the *user* locale charset which is usually but not always the same as the *system* locale charset (Regional Settings for non-Unicode programs). To make sure you are setting it to the system locale code page on Windows you can do something like this:
char szACP[16];
sprintf( szACP, ".%u", GetACP() );
setlocale( LC_ALL, szACP );
Standard C conversions use
wctomb, and in
MBCS builds it uses
mblen to determine character length. All of these standard C functions use the code page specified in
setlocale, not necessarily the same as