UTF-16 Files and the Byte Order Mark (BOM)

UTF-16 (RFC 2781) is a Unicode encoding (the same as "wide char") that is sometimes used in files on Windows operating systems. It is only used sometimes, and even in programs that are compiled for

As with UTF-8, a UTF-16 file does not need an XML Declaration specifying the encoding (see the XML 1.0 Spec, 4.3.3). While UTF-8 and ANSI characters appear the same when they are in the ASCII range, UTF-16 is unmistakably different because it uses at least two bytes per character.
UTF-16 files are now supported in the release 6.6 developer version. See below.

In MarkupDlg.cpp of release 6.5, in the

    // Windows Unicode file is detected if starts with FFFE
    if ( pBuffer[0] == 0xff && pBuffer[1] == 0xfe )
    {
        // Contains byte order mark, so assume wide char
    #if defined( _UNICODE )
        // Skip the two BOM bytes and use the rest as wide char text
        csText = (LPCWSTR)(&pBuffer[2]);
    #else
        // Perform UCS-2 (wide char) to UTF-8 conversion
        // First call: measure the required UTF-8 length
        int nUTF8Len = WideCharToMultiByte(CP_UTF8,0,
            &((LPWSTR)pBuffer)[1],nFileLen/2-1,NULL,0,
            NULL,NULL);
        // Second call: convert into the CString buffer
        nUTF8Len = WideCharToMultiByte(CP_UTF8,0,
            &((LPWSTR)pBuffer)[1],nFileLen/2-1,
            csText.GetBuffer(nUTF8Len),nUTF8Len+1,
            NULL,NULL);
        csText.ReleaseBuffer( nUTF8Len );
    #endif
    }
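As a portable illustration of the same check and conversion, here is a minimal sketch in standard C++ that does not rely on the Windows API. The function names HasUtf16LeBom, AppendUtf8 and Utf16LeToUtf8 are hypothetical and are not part of CMarkup. Replacing unpaired surrogates with U+FFFD is an assumption of this sketch, not documented CMarkup behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Returns true if the buffer starts with the UTF-16LE byte order mark (FF FE).
// Bytes are taken as unsigned so the comparison with 0xFF cannot fail on
// platforms where plain char is signed.
bool HasUtf16LeBom(const unsigned char* data, std::size_t len)
{
    return len >= 2 && data[0] == 0xFF && data[1] == 0xFE;
}

// Appends one Unicode code point to a UTF-8 string (1 to 4 bytes).
void AppendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts a UTF-16LE byte buffer to UTF-8, skipping a leading BOM if present.
// Assumption of this sketch: unpaired surrogates become U+FFFD and a trailing
// odd byte is ignored.
std::string Utf16LeToUtf8(const unsigned char* data, std::size_t len)
{
    std::size_t i = HasUtf16LeBom(data, len) ? 2 : 0;
    std::string out;
    while (i + 1 < len) {
        // Assemble one 16-bit code unit from two little-endian bytes
        std::uint32_t unit = data[i] | (data[i + 1] << 8);
        i += 2;
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow = unit >= 0xDC00 && unit <= 0xDFFF;
        if (isHigh && i + 1 < len) {
            const std::uint32_t next = data[i] | (data[i + 1] << 8);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Valid surrogate pair: combine into one code point
                unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                i += 2;
            } else {
                unit = 0xFFFD; // high surrogate without a low surrogate
            }
        } else if (isHigh || isLow) {
            unit = 0xFFFD; // lone surrogate
        }
        AppendUtf8(out, unit);
    }
    return out;
}
```

With this sketch, a buffer holding the bytes FF FE 3C 00 61 00 2F 00 3E 00 is recognized as UTF-16LE and converts to the UTF-8 text <a/>.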
CMarkup looks for the Byte Order Mark (BOM) at the beginning of the file indicating that it is a UTF-16 file (LE "Little Endian" only). This BOM consists of two bytes with the hexadecimal values FF and FE.

Updated September 27, 2004

If the UTF-16 BOM is discovered on Load, the

    xml.SetDocFlags( xml.GetDocFlags() | xml.MDF_UTF16LEFILE ); // on
    xml.SetDocFlags( xml.GetDocFlags() & ~xml.MDF_UTF16LEFILE ); // off

This flag is also supported in the ReadTextFile and WriteTextFile functions.

Before Windows 2000 and XP, these wide char files were usually called UCS-2. UTF-16 is an extension of UCS-2 such that all UCS-2 files are legitimate UTF-16 files. UCS-2 is a very simple Unicode encoding: each two-byte integer is simply the Unicode value for the character. The only problem is that it can only handle values up to 65535.

Below are examples of three different Unicode encodings. The last one is an example Unicode value beyond 65535.
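To make the relationship between UCS-2 and UTF-16 concrete, the following sketch encodes a single Unicode value as UTF-16 code units and then as the little-endian bytes found in a UTF-16LE file. The names EncodeUtf16 and ToLittleEndianBytes are hypothetical helpers written for this illustration and are not part of CMarkup. Values up to 65535 come out as one unit, exactly as in UCS-2, while larger values need a surrogate pair.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes one Unicode code point as UTF-16 code units.
// Values up to 0xFFFF take a single unit (identical to UCS-2); larger values
// take a surrogate pair. Returns an empty vector for values that are not
// valid code points (surrogates themselves or anything above 0x10FFFF).
std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp)
{
    std::vector<std::uint16_t> units;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return units;
    }
    if (cp <= 0xFFFF) {
        units.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000; // 20 significant bits remain
        units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));   // high surrogate
        units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return units;
}

// Serializes code units low byte first, as they appear in a UTF-16LE file.
std::vector<unsigned char> ToLittleEndianBytes(const std::vector<std::uint16_t>& units)
{
    std::vector<unsigned char> bytes;
    for (std::size_t i = 0; i < units.size(); ++i) {
        bytes.push_back(static_cast<unsigned char>(units[i] & 0xFF));
        bytes.push_back(static_cast<unsigned char>(units[i] >> 8));
    }
    return bytes;
}
```

For example, U+0041 yields the single unit 0x0041 (bytes 41 00), while U+1D11E yields the pair 0xD834 0xDD1E (bytes 34 D8 1E DD).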
Posted May 3, 2004; updated September 27, 2004. ©Copyright 2008 First Objective Software, Inc. All rights reserved.