wchar_t string on Linux, OS X and Windows
In doing the wchar_t work on Linux, OS X and Windows for CMarkup release 10.1, I learned a couple of humbling lessons, and I expect I'll be posting more here as I get feedback. To me the term
wchar_t string is the same as C++ wide string, C++ wide char, C++ wchar, C++ wide character string, etc, which all come down to an array of
wchar_t. The STL
std::wstring class based on
wchar_t characters is the wide version of the
std::string class based on char.
Using a wchar_t string (and STL
std::wstring) on POSIX (Linux and OS X) has few advantages if any since nowadays a regular
char string is in Unicode UTF-8 by default, and so, I assume, are most system functions, file paths, and programming interfaces. Using wide strings therefore means an extra layer of UTF-8 to UTF-32 conversion on many operations. Nevertheless, I went ahead and implemented and tested wide char "
MARKUP_WCHAR" support in CMarkup since a) it was there for Windows
UNICODE builds, and b) a customer expressed interest in doing a wide char build for Mac.
Note that the gcc 3.4.4 "cygming" compiler that comes with cygwin 1.5.25-15 doesn't seem to have
std::wstring or even
wprintf, though it does have
wchar_t. Since CMarkup requires a
wchar_t based string class, a wide char build is not supported here.
Compiling for wide char vs char
I took my cue from VC++
_T macros such as
_tcscpy which switch based on the character set selected for the build. With CMarkup, you define MARKUP_WCHAR (or, on Windows,
UNICODE) to compile for wide strings; otherwise it compiles for
char strings. A set of macros is defined accordingly with the wide or non-wide versions of functions and types. Here are example defines for the character type, the constant character pointer and string copy, which differ based on MARKUP_WCHAR:
    #if defined(MARKUP_WCHAR)
    #define MCD_CHAR wchar_t
    #define MCD_PCSZ const wchar_t*
    #define MCD_PSZCPY wcscpy
    ... other wide functions
    #else // not MARKUP_WCHAR
    #define MCD_CHAR char
    #define MCD_PCSZ const char*
    #define MCD_PSZCPY strcpy
    ... other non-wide functions
    #endif
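To show how code written against such macros stays identical in both builds, here is a minimal sketch. The names `MCD_PSZLEN`, `MCD_T`, `MCD_STR`, `COPY_BUFFER_SIZE` and `CopyName` are assumptions made up for this illustration, not necessarily what CMarkup defines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

// Same switch as the defines above. MCD_PSZLEN and MCD_T are assumed
// names added for this sketch (string length and string literal wrapper).
#if defined(MARKUP_WCHAR)
#define MCD_CHAR wchar_t
#define MCD_PCSZ const wchar_t*
#define MCD_PSZCPY wcscpy
#define MCD_PSZLEN wcslen
#define MCD_T(s) L##s
#else // not MARKUP_WCHAR
#define MCD_CHAR char
#define MCD_PCSZ const char*
#define MCD_PSZCPY strcpy
#define MCD_PSZLEN strlen
#define MCD_T(s) s
#endif

// Assumed string type: std::wstring in a wide build, std::string otherwise
typedef std::basic_string<MCD_CHAR> MCD_STR;

const std::size_t COPY_BUFFER_SIZE = 64;

// Copies a C string through a fixed buffer using only the switched macros,
// so the same source compiles for char and for wchar_t
MCD_STR CopyName(MCD_PCSZ pszName)
{
    MCD_CHAR szBuffer[COPY_BUFFER_SIZE];
    assert(MCD_PSZLEN(pszName) < COPY_BUFFER_SIZE); // must fit with terminator
    MCD_PSZCPY(szBuffer, pszName);
    return MCD_STR(szBuffer);
}
```

A call such as `CopyName(MCD_T("item"))` compiles unchanged whether or not `MARKUP_WCHAR` is defined, which is the whole point of the macro layer.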
Unlike Windows UTF-16 2-byte wide chars,
wchar_t on Linux and OS X is 4 bytes UTF-32 (gcc/g++ and Xcode). On cygwin it is 2 bytes (cygwin uses Windows APIs).
At first I used runtime if statements like
if ( sizeof(wchar_t) == 4 ), but aside from being bad style, that led to compiler warnings in the code meant for the other size of
wchar_t. I wanted a way to automatically determine the size of
wchar_t at compile time based on predefined macros (you can list g++ predefined macros with the command
cpp -dM followed by Ctrl+D to end the input). I settled on using
__SIZEOF_WCHAR_T__ or even better
__WCHAR_MAX__ which is provided by gcc on Linux, OS X, and cygwin.
    #if ! defined(MARKUP_SIZEOFWCHAR)
    #if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
    #define MARKUP_SIZEOFWCHAR 4
    #else
    #define MARKUP_SIZEOFWCHAR 2
    #endif
    #endif
I left the option of setting it explicitly by defining
MARKUP_SIZEOFWCHAR if the predefined macros aren't available.
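Since the value can also be set by hand, it may be worth verifying it against the real `sizeof(wchar_t)`. The following is only a sketch of one way to do that: `CheckWideCharSize` and `WideEncodingLabel` are hypothetical helpers, and the `static_assert` branch assumes a C++11 compiler.

```cpp
#include <cassert>

// Detection as described above; can be overridden with -DMARKUP_SIZEOFWCHAR=n
#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Hypothetical helper: fails the build (C++11) or asserts at run time
// (older compilers) when the macro disagrees with the compiler
void CheckWideCharSize()
{
#if __cplusplus >= 201103L
    static_assert(sizeof(wchar_t) == MARKUP_SIZEOFWCHAR,
        "MARKUP_SIZEOFWCHAR does not match sizeof(wchar_t)");
#else
    assert(sizeof(wchar_t) == MARKUP_SIZEOFWCHAR);
#endif
}

// Hypothetical helper: names the encoding implied by the detected size
const char* WideEncodingLabel()
{
#if MARKUP_SIZEOFWCHAR == 4
    return "UTF-32";
#else
    return "UTF-16";
#endif
}
```

The preprocessor treats undefined macros as 0 inside `#if`, so on a compiler that provides neither predefined macro the detection quietly falls back to 2, which is exactly the case where a check like this helps.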
Of course, everywhere you do conversions to and from
wchar_t strings, you have to be aware of whether it is UTF-16 or UTF-32. So I differentiate as follows:
    #if MARKUP_SIZEOFWCHAR == 4 // sizeof(wchar_t) == 4
    ... treat wchar_t string as UTF-32
    #else // sizeof(wchar_t) == 2
    ... treat wchar_t string as UTF-16
    #endif
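As a concrete case of that split, appending a single Unicode code point to a wide string takes one unit in UTF-32 but possibly a surrogate pair in UTF-16. This is a sketch under that assumption; `AppendCodePoint` is a hypothetical helper and not a CMarkup function.

```cpp
#include <cassert>
#include <string>

// Detection as described above
#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Hypothetical helper: appends one Unicode code point to a wide string
void AppendCodePoint(std::wstring& str, unsigned long nCodePoint)
{
    assert(nCodePoint <= 0x10FFFF); // highest valid Unicode code point
#if MARKUP_SIZEOFWCHAR == 4
    // UTF-32: every code point is exactly one wchar_t
    str += static_cast<wchar_t>(nCodePoint);
#else
    // UTF-16: code points above the BMP need a surrogate pair
    if (nCodePoint >= 0x10000)
    {
        nCodePoint -= 0x10000;
        str += static_cast<wchar_t>(0xD800 + (nCodePoint >> 10));   // high surrogate
        str += static_cast<wchar_t>(0xDC00 + (nCodePoint & 0x3FF)); // low surrogate
    }
    else
        str += static_cast<wchar_t>(nCodePoint);
#endif
}
```

Appending U+0041 and then U+1F600 gives a string of length 2 where `wchar_t` is 4 bytes and length 3 where it is 2 bytes, so any code that counts characters by `wchar_t` units has to know which case it is in.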
sprintf wchar_t with "%ls"
In VC++, you can use
"%s" in the format string of
swprintf (or wprintf, fwprintf) to insert a wide string. But on POSIX you have to use
"%ls". This may be compiler dependent rather than operating system dependent.
|type|meaning in sprintf|meaning in swprintf|
|---|---|---|
|ls or lS|wide string|wide string|
The only way to switch between
sprintf and its wide char version
swprintf on POSIX seamlessly would be to use a macro in the middle of your format string. I was able to concatenate strings instead and avoid the whole issue of
swprintf for strings.
Note also that gcc uses a safe form of swprintf with the extra argument to specify the length of the receiving buffer (VC++ 2005 and up has the safe string version
swprintf_s). And I also was confused when I accidentally googled
wsprintf (first two letters swapped) which appears to be a version of this function only on Windows.
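For anyone who does want the macro-in-the-format-string approach, here is roughly what it could look like. This is only a sketch: all `MY_` names and `FormatGreeting` are made up for illustration, and it assumes the standard `swprintf` signature with a buffer length, paired with `snprintf` in the char build so both take the same arguments.

```cpp
#include <cassert>
#include <cstdio>
#include <cwchar>
#include <string>

// Hypothetical macros: the string conversion and the printf function
// both switch with the build's character type
#if defined(MARKUP_WCHAR)
#define MY_CHAR wchar_t
#define MY_T(s) L##s
#define MY_PCT_S L"%ls"      // wide string inside a wide format string
#define MY_SPRINTF swprintf  // (buffer, length, format, ...)
#else
#define MY_CHAR char
#define MY_T(s) s
#define MY_PCT_S "%s"
#define MY_SPRINTF snprintf  // same argument order as swprintf
#endif

typedef std::basic_string<MY_CHAR> MyString;

// Formats "Hello <name>!" in whichever character type the build uses
MyString FormatGreeting(const MY_CHAR* pszName)
{
    assert(pszName != 0);
    MY_CHAR szBuffer[100];
    // Adjacent string literals are joined by the compiler, which is what
    // lets the macro sit in the middle of the format string
    int nLen = MY_SPRINTF(szBuffer, 100,
        MY_T("Hello ") MY_PCT_S MY_T("!"), pszName);
    if (nLen < 0)
        return MyString(); // formatting failed
    return MyString(szBuffer);
}
```

Compared with plain string concatenation this is harder to read, which is a fair argument for avoiding `swprintf` with strings altogether when there is a choice.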
no wide filenames on POSIX
There is no wide
fopen on POSIX like
_wfopen on Windows (same goes for
stat). Filenames, whether received from system APIs or composed by your program, should be kept in "filesystem representation" (UTF-8). You should avoid doing encoding conversions on pathnames because differences in Unicode decomposition implementations could subtly modify the pathname.
Therefore I had to implement special filename macros for filenames to be passed to the CMarkup functions without wide strings even in a wide char build.
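The idea can be sketched as follows, with the caveat that the `MY_FILENAME_` macro names and `CanOpenFile` are invented for this example and are not CMarkup's actual macros. Filenames are wide only in a Windows wide build and stay plain `char` everywhere else.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical filename macros: wide filenames only exist on Windows,
// so a wide char build on POSIX still uses char for paths
#if defined(MARKUP_WCHAR) && defined(_WIN32)
#define MY_FILENAME_CHAR wchar_t
#define MY_FILENAME_T(s) L##s
#define MY_FOPEN _wfopen
#else
#define MY_FILENAME_CHAR char
#define MY_FILENAME_T(s) s
#define MY_FOPEN fopen
#endif

// Returns true if the file exists and can be opened for reading
bool CanOpenFile(const MY_FILENAME_CHAR* pszPath)
{
    assert(pszPath != 0);
    FILE* fp = MY_FOPEN(pszPath, MY_FILENAME_T("rb"));
    if (!fp)
        return false;
    fclose(fp);
    return true;
}
```

On Linux or OS X a call like `CanOpenFile(MY_FILENAME_T("doc.xml"))` passes the path bytes straight through to `fopen`, so no encoding conversion ever touches the pathname.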
iconv on OS X doesn't support "WCHAR_T"
This is more an issue of using
iconv across different platforms and configurations, but I found that although
iconv_open did not complain about
"WCHAR_T" on OS X, it did not convert properly. So I switched to explicitly using
"UTF-32" or "UTF-16" depending on
MARKUP_SIZEOFWCHAR. I can't say I understand all of the iconv vs libiconv issues, but the way I used iconv in g++ was with the -liconv flag.
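Here is a minimal sketch of a UTF-8 to wide string conversion that names the encoding explicitly instead of using "WCHAR_T". The helpers `WideEncodingName` and `Utf8ToWide` are hypothetical, and the "LE"/"BE" suffix is my own assumption to keep iconv from writing a byte order mark; it is not something the text above prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <string>

// Hypothetical helper: explicit iconv name for wchar_t on this machine.
// The byte order suffix (an assumption of this sketch) avoids a BOM.
const char* WideEncodingName()
{
    const unsigned short nProbe = 1;
    bool bLittleEndian =
        (*reinterpret_cast<const unsigned char*>(&nProbe) == 1);
    if (sizeof(wchar_t) == 4)
        return bLittleEndian ? "UTF-32LE" : "UTF-32BE";
    return bLittleEndian ? "UTF-16LE" : "UTF-16BE";
}

// Hypothetical helper: converts UTF-8 to a wide string,
// returning an empty string on any failure
std::wstring Utf8ToWide(const std::string& strUtf8)
{
    if (strUtf8.empty())
        return std::wstring();
    iconv_t cd = iconv_open(WideEncodingName(), "UTF-8");
    if (cd == (iconv_t)-1)
        return std::wstring();
    // One wchar_t per input byte is always enough room for the output
    std::wstring strWide(strUtf8.size(), L'\0');
    char* pIn = const_cast<char*>(strUtf8.data());
    std::size_t nInLeft = strUtf8.size();
    char* pOut = reinterpret_cast<char*>(&strWide[0]);
    std::size_t nOutLeft = strWide.size() * sizeof(wchar_t);
    std::size_t nResult = iconv(cd, &pIn, &nInLeft, &pOut, &nOutLeft);
    iconv_close(cd);
    if (nResult == (std::size_t)-1)
        return std::wstring(); // invalid or incomplete input
    assert(nOutLeft % sizeof(wchar_t) == 0);
    // Trim the unused tail of the output buffer
    strWide.resize(strWide.size() - nOutLeft / sizeof(wchar_t));
    return strWide;
}
```

Whether `-liconv` is needed at link time depends on the platform: with glibc on Linux iconv is part of the C library, while a separate libiconv needs the flag.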