wchar_t string on Linux, OS X and Windows

While making wchar_t work on Linux, OS X and Windows for CMarkup release 10.1, I learned a couple of humble lessons, and I expect I'll be posting more here as I get feedback. To me the term wchar_t string is the same as C++ wide string, C++ wide char, C++ wchar, C++ wide character string, etc., which all come down to an array of wchar_t. The STL std::wstring class, based on wchar_t characters, is the wide version of the std::string class, which is based on char characters.

Why wchar?

Using a wchar_t string (and STL std::wstring) on POSIX (Linux and OS X) has few advantages, if any, since nowadays a regular char string is Unicode UTF-8 by default, and so, I assume, are most system functions, file paths, and programming interfaces. Using wide strings therefore means an extra layer of UTF-8 to UTF-32 conversion on many operations. Nevertheless, I went ahead and implemented and tested wide char "MARKUP_WCHAR" support in CMarkup since a) it was already there for Windows UNICODE builds, and b) a customer expressed interest in doing a wide char build for Mac.

Note that the gcc 3.4.4 "cygming" compiler that comes with cygwin 1.5.25-15 doesn't seem to have std::wstring or even wprintf, though it does have wchar_t. Since CMarkup requires a wchar_t based string class, a wide char build is not supported there.

Compiling for wide char vs char

I took my cue from the VC++ _T macros such as _tcscpy, which switch based on the character set selected for the build. With CMarkup, you define MARKUP_WCHAR (or UNICODE) to compile for wide strings; otherwise it compiles for char strings. A set of macros is defined accordingly with the wide or non-wide versions of functions and types. Here are examples of the defines for character, constant character pointer and string copy, which differ based on MARKUP_WCHAR:

#if defined(MARKUP_WCHAR)
#define MCD_CHAR wchar_t
#define MCD_PCSZ const wchar_t*
#define MCD_PSZCPY wcscpy
// ... other wide functions
#else // not MARKUP_WCHAR
#define MCD_CHAR char
#define MCD_PCSZ const char*
#define MCD_PSZCPY strcpy
// ... other non-wide functions
#endif
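To show how code written against such macros compiles unchanged in both builds, here is a minimal sketch. The MY_* names and the make_tag helper are made up for illustration and are not CMarkup's own definitions; only the MARKUP_WCHAR switch comes from the text above.

```cpp
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical macro names (MY_*), modeled on the MCD_* pattern above.
#if defined(MARKUP_WCHAR)
#define MY_CHAR wchar_t
#define MY_T(s) L##s            // turns "x" into the wide literal L"x"
#define MY_STRLEN std::wcslen
#else
#define MY_CHAR char
#define MY_T(s) s               // leaves the narrow literal as is
#define MY_STRLEN std::strlen
#endif

// One string type that follows the character type of the build.
typedef std::basic_string<MY_CHAR> MyString;

// Hypothetical helper: wraps a name in angle brackets.
// The body contains no #if, yet it works for char and wchar_t builds.
inline MyString make_tag(const MY_CHAR* name)
{
    MyString tag(MY_T("<"));
    tag.append(name, MY_STRLEN(name));
    tag += MY_T(">");
    return tag;
}
```

Calling make_tag(MY_T("doc")) yields a string equal to MY_T("<doc>") in either build; compiling with -DMARKUP_WCHAR switches every macro at once.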

sizeof wchar_t

Unlike the 2-byte UTF-16 wide chars on Windows, wchar_t on Linux and OS X is 4 bytes and holds UTF-32 (gcc/g++ and Xcode). On cygwin it is 2 bytes (cygwin uses Windows APIs).

At first I used runtime if statements like if ( sizeof(wchar_t) == 4 ), but aside from being bad style, that led to compiler warnings in the code written for the other size of wchar_t. I wanted a way to determine the size of wchar_t automatically at compile time based on predefined macros (you can list the g++ predefined macros by running cpp -dM and pressing Ctrl+D). I settled on using __SIZEOF_WCHAR_T__ or, even better, __WCHAR_MAX__, which is provided by gcc on Linux, OS X, and cygwin.

#if ! defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

I left the option of setting it explicitly by defining MARKUP_SIZEOFWCHAR if the predefined macros aren't available.

Of course, everywhere you do conversions to and from wchar_t strings, you have to be aware of whether it is UTF-16 or UTF-32. So I differentiate as follows:

#if MARKUP_SIZEOFWCHAR == 4 // sizeof(wchar_t) == 4
  ... treat wchar_t string as UTF-32
#else // sizeof(wchar_t) == 2
  ... treat wchar_t string as UTF-16
#endif
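As a concrete case of that split, here is a rough sketch of appending one Unicode code point to a wide string. The append_code_point helper is hypothetical, not a CMarkup function, and the static_assert (which needs C++11) is my own addition to catch a wrong manual MARKUP_SIZEOFWCHAR setting. The detection block is repeated so the sketch stands alone.

```cpp
#include <string>

// Same detection as in the text, repeated so this compiles on its own.
#if !defined(MARKUP_SIZEOFWCHAR)
#if __SIZEOF_WCHAR_T__ == 4 || __WCHAR_MAX__ > 0x10000
#define MARKUP_SIZEOFWCHAR 4
#else
#define MARKUP_SIZEOFWCHAR 2
#endif
#endif

// Fails the build if the macro disagrees with the real size (C++11).
static_assert(sizeof(wchar_t) == MARKUP_SIZEOFWCHAR,
              "MARKUP_SIZEOFWCHAR does not match sizeof(wchar_t)");

// Hypothetical helper: append one code point (assumed <= 0x10FFFF and
// not itself a surrogate) to a wide string.
inline void append_code_point(std::wstring& out, unsigned long cp)
{
#if MARKUP_SIZEOFWCHAR == 4
    // UTF-32: every code point fits in a single wchar_t.
    out += static_cast<wchar_t>(cp);
#else
    // UTF-16: code points above the BMP become a surrogate pair.
    if (cp >= 0x10000UL) {
        cp -= 0x10000UL;
        out += static_cast<wchar_t>(0xD800 + (cp >> 10));    // high surrogate
        out += static_cast<wchar_t>(0xDC00 + (cp & 0x3FF));  // low surrogate
    } else {
        out += static_cast<wchar_t>(cp);
    }
#endif
}
```

Appending U+0041 and then U+1F600 gives a string of length 2 where wchar_t is 4 bytes and length 3 where it is 2 bytes, which is exactly why length-based code has to know which encoding it is dealing with.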

sprintf wchar_t with "%ls"

In VC++, you can use "%s" in the format string of swprintf (or wprintf, fwprintf) to insert a wide string. But in POSIX you have to use "%ls". This may be compiler dependent rather than operating system dependent.

type	`sprintf` on Windows	`sprintf` on POSIX	`swprintf` on Windows	`swprintf` on POSIX
ls or lS	`wchar_t`	`wchar_t`	`wchar_t`	`wchar_t`
s	`char`	`char`	`wchar_t`	`char`
S	`wchar_t`	`wchar_t`	`char`	`wchar_t`

The only way to switch seamlessly between sprintf and its wide char version swprintf on POSIX would be to use a macro in the middle of your format string. I was able to concatenate strings instead and avoid the whole issue of swprintf for strings.

Note also that swprintf in gcc takes an extra argument that specifies the length of the receiving buffer, as the C standard requires (VC++ 2005 and up has the safe string version swprintf_s). I was also confused when I accidentally googled wsprintf (first two letters swapped), which appears to be a version of this function that exists only on Windows.
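If you do need swprintf, "%ls" is the one specifier for a wide string that the table lists as wchar_t everywhere. Here is a minimal sketch using the standard form with the buffer length argument; the format_pair helper and the 256-character buffer are arbitrary choices for illustration.

```cpp
#include <cwchar>
#include <string>

// Hypothetical helper: formats "name=value" into a wide string.
inline std::wstring format_pair(const wchar_t* name, const wchar_t* value)
{
    wchar_t buf[256];
    // Standard swprintf: the second argument is the buffer size in
    // wchar_t units, including room for the terminating null.
    int n = std::swprintf(buf, sizeof(buf) / sizeof(buf[0]),
                          L"%ls=%ls", name, value);
    if (n < 0)
        return std::wstring();   // did not fit, or formatting failed
    return std::wstring(buf, static_cast<std::size_t>(n));
}
```

Unlike snprintf, swprintf returns a negative value when the output does not fit, so the sketch treats any negative result as failure rather than trying to compute a required size.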

no wide filenames on POSIX

There is no wide fopen on POSIX like _wfopen on Windows (the same goes for open and stat). Filenames, whether received from system APIs or composed by your program, should be kept in "filesystem representation" (UTF-8). You should avoid doing encoding conversions on pathnames because you could be subject to differences between Unicode decomposition implementations that subtly modify the pathname.

Therefore I had to implement special filename macros for filenames to be passed to the CMarkup functions without wide strings even in a wide char build.
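The idea can be sketched roughly as follows. The MY_* names and the file_readable helper are hypothetical and are not the actual CMarkup macros; the assumption is that only a Windows wide char build uses wide paths, while everything else keeps paths as char strings.

```cpp
#include <cstdio>

// Hypothetical filename macros (MY_*); CMarkup's real ones may differ.
#if defined(_WIN32) && defined(MARKUP_WCHAR)
typedef wchar_t MY_FILENAME_CHAR;                 // wide paths for _wfopen
#define MY_T_FILENAME(s) L##s
#define MY_FOPEN(path, mode) _wfopen(path, L##mode)
#else
typedef char MY_FILENAME_CHAR;                    // paths stay char strings
#define MY_T_FILENAME(s) s
#define MY_FOPEN(path, mode) std::fopen(path, mode)
#endif

// Hypothetical helper: true if the file can be opened for reading.
// The path type follows the filename macros, not the MARKUP_WCHAR
// string type, so no encoding conversion touches the pathname.
inline bool file_readable(const MY_FILENAME_CHAR* path)
{
    std::FILE* fp = MY_FOPEN(path, "rb");
    if (!fp)
        return false;
    std::fclose(fp);
    return true;
}
```

On POSIX, a wide char build then still passes a plain char path such as MY_T_FILENAME("data.xml") straight through to fopen.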

iconv on OS X doesn't support "WCHAR_T"

This is more an issue of using iconv across different platforms and configurations, but I found that although iconv_open did not complain about "WCHAR_T" on OS X, it did not convert properly. So I switched to explicitly using "UTF-32" or "UTF-16" depending on MARKUP_SIZEOFWCHAR. I can't say I understand all of the iconv vs libiconv issues, but the way I used iconv in g++ was with the -liconv flag.
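Here is a rough sketch of naming the encoding explicitly instead of passing "WCHAR_T". It is not CMarkup's code: the utf8_to_wide helper is hypothetical, and I use the endian-specific names ("UTF-32LE" and so on) as an assumption of my own so that iconv does not emit a byte order mark. Some older iconv headers declare the input pointer as const char**, in which case the call needs a cast adjusted to match.

```cpp
#include <iconv.h>
#include <string>

// Hypothetical helper: convert UTF-8 to a wide string with iconv.
// Returns false if the conversion cannot be opened or the input is invalid.
inline bool utf8_to_wide(const std::string& utf8, std::wstring& out)
{
    if (utf8.empty()) {
        out.clear();
        return true;
    }

    // Pick the target encoding from sizeof(wchar_t) and host byte order.
    const unsigned short probe = 1;
    const bool little = *reinterpret_cast<const unsigned char*>(&probe) == 1;
    const char* to = (sizeof(wchar_t) == 4)
        ? (little ? "UTF-32LE" : "UTF-32BE")
        : (little ? "UTF-16LE" : "UTF-16BE");

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        return false;

    // A UTF-8 sequence never produces more wchar_t units than it has
    // bytes, so utf8.size() units is always enough room.
    std::wstring buf(utf8.size(), L'\0');
    char* src = const_cast<char*>(utf8.data());
    size_t src_left = utf8.size();
    char* dst = reinterpret_cast<char*>(&buf[0]);
    size_t dst_left = buf.size() * sizeof(wchar_t);

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return false;

    // Trim the unused tail of the buffer.
    buf.resize(buf.size() - dst_left / sizeof(wchar_t));
    out.swap(buf);
    return true;
}
```

With glibc, iconv is part of the C library, so no extra link flag is needed; where iconv comes from a separate libiconv, the -liconv flag mentioned above applies.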