
dev net revision
29 July 2008
 

 
 

UTF-16 Files and the Byte Order Mark (BOM)

UTF-16 (RFC 2781) is a Unicode encoding that is sometimes used for files on Windows operating systems, where it matches the in-memory "wide char" representation. It is not the usual choice for files: even programs compiled for _UNICODE (wide char) normally store text in a single-byte or multi-byte encoding (ANSI or UTF-8). See ANSI and Unicode Files for more about this.

As with UTF-8, a UTF-16 file does not need an XML Declaration specifying the encoding (see the XML 1.0 Spec, 4.3.3). While UTF-8 and ANSI text look the same when the characters are in the ASCII range, UTF-16 is unmistakably different because every character takes at least two bytes, so ASCII text stored as UTF-16 has a zero byte paired with each character.

Support for UTF-16? Scott Wilson 11-Feb-2004
Does CMarkup support UTF-16? I've found a sample file that it thinks is not well-formed. The firstobject XML Editor 1.2 didn't like the file either. The only related documentation I could find on your site is Encodings in the CMarkup article. The first two bytes of the UTF-16 file are 0xFF 0xFE. Do you know why these characters are there?

UTF-16 files are now supported in the release 6.6 developer version. See below. In MarkupDlg.cpp of release 6.5, in the CMarkupDlg::OnButtonParse function, you'll see a test for the BOM in if ( pBuffer[0] == 0xff && pBuffer[1] == 0xfe ). You can change the code as follows to perform the conversion to UTF-8. The WideCharToMultiByte function supports UTF-8 conversion on NT 4.0, Windows 2000 and XP (the Advanced CMarkup Developer code includes conversion functions in full source for other platforms).

	// Windows Unicode file is detected if starts with FFFE
	if ( pBuffer[0] == 0xff && pBuffer[1] == 0xfe )
	{
		// Contains byte order mark, so assume wide char
#if defined( _UNICODE )
		csText = (LPCWSTR)(&pBuffer[2]);
#else
		// Perform UCS-2 (wide char) to UTF-8 conversion
		int nUTF8Len = WideCharToMultiByte(CP_UTF8,0,
			&((LPWSTR)pBuffer)[1],nFileLen/2-1,NULL,0,
			NULL,NULL);
		nUTF8Len = WideCharToMultiByte(CP_UTF8,0,
			&((LPWSTR)pBuffer)[1],nFileLen/2-1,
			csText.GetBuffer(nUTF8Len),nUTF8Len+1,
			NULL,NULL);
		csText.ReleaseBuffer( nUTF8Len );
#endif
	}

UTF-16 detection is only in CMarkup Developer.

CMarkup looks for the Byte Order Mark (BOM) at the beginning of the file indicating that it is a UTF-16 file (LE "Little Endian" only). This BOM consists of two bytes with the hexadecimal values ff and fe. They would be in the opposite order (fe then ff) if the UTF-16 encoding were "BE" (big endian) instead of "LE" (little endian).
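The byte-order check described above can be sketched as follows. This is a minimal illustration, not CMarkup's code: the `Utf16Bom` enum and the `DetectUtf16Bom` function are hypothetical names invented for this example.

```cpp
#include <cstddef>

// Hypothetical names for this sketch; they are not part of CMarkup.
enum class Utf16Bom { None, LittleEndian, BigEndian };

// Inspect the first two bytes of a file buffer for a UTF-16 byte order mark.
// Note: a UTF-32LE file also starts with FF FE (followed by 00 00), which
// this simple check does not distinguish.
Utf16Bom DetectUtf16Bom(const unsigned char* pBuffer, std::size_t nLen)
{
    if (pBuffer == nullptr || nLen < 2)
        return Utf16Bom::None;
    if (pBuffer[0] == 0xff && pBuffer[1] == 0xfe)
        return Utf16Bom::LittleEndian; // U+FEFF stored low byte first
    if (pBuffer[0] == 0xfe && pBuffer[1] == 0xff)
        return Utf16Bom::BigEndian;    // U+FEFF stored high byte first
    return Utf16Bom::None;
}
```

The buffer is taken as unsigned char so that the comparisons with 0xff and 0xfe behave the same whether or not plain char is signed on the compiler in use.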

Updated September 27, 2004: If the UTF-16 BOM is discovered on Load, the MDF_UTF16LEFILE CMarkup flag is set, and the 2 bytes of the BOM are not loaded into the document. In the Save method, if that flag is set, the document is stored in the file in UTF-16LE encoding with the BOM inserted at the start of the file. By setting or unsetting this flag, you can control whether the file is saved with UTF-16 encoding.

xml.SetDocFlags( xml.GetDocFlags() | xml.MDF_UTF16LEFILE ); // on
xml.SetDocFlags( xml.GetDocFlags() & ~xml.MDF_UTF16LEFILE ); // off

This flag is also supported in the ReadTextFile and WriteTextFile functions.

Before Windows 2000 and XP, these wide char files were usually called UCS-2. UTF-16 is an extension of UCS-2 such that all UCS-2 files are legitimate UTF-16 files. UCS-2 is a very simple Unicode encoding: each two-byte integer is simply the Unicode value for the character. The only problem is that it cannot represent values above 65535 (ffff), and the Unicode standard has expanded beyond that limit. UTF-16 is the same as UCS-2 except that it takes advantage of a special range of values between d800 and dfff that are not assigned to Unicode characters. UTF-16 uses "surrogate" values in that range: a high surrogate (d800 to dbff) followed by a low surrogate (dc00 to dfff) together represent a single Unicode value above ffff, up to 10ffff.
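The surrogate arithmetic can be sketched as below. The `EncodeUtf16` helper is a hypothetical name used only for illustration and is not a CMarkup function. It subtracts 10000 (hex) from the Unicode value, then splits the remaining 20 bits into two 10-bit halves that are added to d800 and dc00.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Returns an empty vector for values that cannot be encoded.
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return units; // beyond Unicode, or inside the reserved surrogate range
    if (cp < 0x10000) {
        // Same as UCS-2: the code unit is the Unicode value itself
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        std::uint32_t v = cp - 0x10000; // 20 bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}
```

For example, the value 1D11E minus 10000 leaves D11E, whose upper 10 bits are 034 and lower 10 bits are 11E, giving the pair D834 DD1E.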

Below are examples of characters in three different Unicode encodings. The last row is an example Unicode value beyond ffff that requires a surrogate pair in UTF-16, and it will most likely appear as a little square because only rare fonts support it.

character | Unicode | UTF-8 | UCS-2 | UTF-16
€ (Euro) | 20AC (8364) | E2 82 AC | 20AC | 20AC
中 (middle) | 4E2D (20013) | E4 B8 AD | 4E2D | 4E2D
𝄞 (Treble Clef) | 1D11E (119070) | F0 9D 84 9E | N/A | D834 DD1E

In a _UNICODE (wide char) build of CMarkup, a UTF-16 file does not need to be converted. In a normal build, it is converted to UTF-8 on Load and back to UTF-16 on Save. CMarkup uses its own UTF-16/UTF-8 conversion functions (UTF8To16, UTF16To8) because the MultiByteToWideChar and WideCharToMultiByte Win32 APIs are not consistently supported across Windows operating systems and are not available at all on other platforms. These APIs are in Windows CE 1.01 but not Windows CE 1.0, and on many versions of Windows they exist but do not support UTF-8 (Windows 95/98/ME/NT3.5 and versions of CE).
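To give a sense of what a platform-independent conversion involves, here is a minimal sketch of a UTF-16 to UTF-8 converter. It is not CMarkup's UTF16To8 source: the `Utf16ToUtf8Sketch` name, its signature, and the choice to replace unpaired surrogates with U+FFFD are all assumptions made for this illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical sketch, not CMarkup's UTF16To8: convert n UTF-16 code units
// (already in native byte order, BOM removed) to a UTF-8 string.
std::string Utf16ToUtf8Sketch(const std::uint16_t* p, std::size_t n)
{
    std::string out;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t cp = p[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n
            && p[i + 1] >= 0xDC00 && p[i + 1] <= 0xDFFF) {
            // Combine a high and low surrogate into one value above ffff
            cp = 0x10000 + ((cp - 0xD800) << 10) + (p[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate: assumption, use replacement char
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Passing the code units from the table above should reproduce its UTF-8 column: 20AC yields the bytes E2 82 AC, and the pair D834 DD1E yields F0 9D 84 9E.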

 
 


©Copyright 2008 First Objective Software, Inc. All rights reserved.