Unicode Character Encoding
This is a concise yet thorough description of Unicode as used for software internationalization and localization.
Character | Unicode | Dec. | Rev. |
---|---|---|---|
a | \u0250 | ɐ | ɐ |
b | - | - | q |
c | \u0254 | ɔ | ɔ |
e | \u01DD | ǝ | ǝ |
f | \u025F | ɟ | ɟ |
g | \u0183 | ƃ | ƃ |
h | \u0265 | ɥ | ɥ |
i | \u0131 | ĭ | ĭ |
j | \u027E | ɾ | ɾ |
k | \u029E | ʞ | ʞ |
l | \u05DF | ן | ן |
m | \u026F | ɯ | ɯ |
n | - | - | u |
r | \u0279 | ɹ | ɹ |
t | \u0287 | ʇ | ʇ |
v | \u028C | ʌ | ʌ |
w | \u028D | ʍ | ʍ |
y | \u028E | ʎ | ʎ |
This may be "useful" for people in the Southern Hemisphere (Australians, etc.)? (ha ha)
Upper-case letters are converted to lower-case letters (and numbers are not flipped) because the current Unicode set doesn't have upside-down glyphs for all capital letters (nor for numbers).
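The mapping in the table can be applied programmatically. The following Python sketch lowercases the input, swaps each letter for its table entry, and reverses the string so that it reads correctly when turned upside down. The name flip_text, the reversal step, and the extra pairs d/p, q/b, and u/n (which the table does not list) are assumptions made for illustration.

```python
# Letter-to-glyph mapping taken from the table above.
FLIP_TABLE = {
    "a": "\u0250", "b": "q", "c": "\u0254", "e": "\u01DD",
    "f": "\u025F", "g": "\u0183", "h": "\u0265", "i": "\u0131",
    "j": "\u027E", "k": "\u029E", "l": "\u05DF", "m": "\u026F",
    "n": "u", "r": "\u0279", "t": "\u0287", "v": "\u028C",
    "w": "\u028D", "y": "\u028E",
}
# Assumption: these pairs are not in the table but follow the same idea.
FLIP_TABLE.update({"d": "p", "p": "d", "q": "b", "u": "n"})


def flip_text(text: str) -> str:
    """Return text as it would read when turned upside down."""
    lowered = text.lower()  # capitals have no flipped glyphs, so fold them
    flipped = [FLIP_TABLE.get(ch, ch) for ch in lowered]  # digits pass through
    return "".join(reversed(flipped))


if __name__ == "__main__":
    print(flip_text("Bat 42"))
```

Characters without an entry, such as digits and spaces, are passed through unchanged; only their position in the string changes.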
The Unicode Standard, developed by the international Unicode Consortium, defines hexadecimal code values (prefixed with U+) that consistently identify the world's characters and symbols. Code values are assigned in named ranges called blocks.
The range U+10000 to U+10FFFF is divided into 16 supplementary planes of 65,536 code points each. As of Unicode 3.1, only three of these planes had been used, primarily for supplementary characters needed to encode historical and classical literary documents from the rich heritage of the Chinese, Korean, and Japanese (Asian) languages.
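Because each plane holds 65,536 (hex 0x10000) code points, the plane number of a code point is simply its value shifted right by 16 bits. A minimal Python sketch (the function name plane_of is made up for this example):

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane number (0 to 16) of a code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    return code_point >> 16  # each plane spans 0x10000 code points


if __name__ == "__main__":
    for cp in (0x0394, 0x1D11E, 0x20000, 0x10FFFF):
        print(f"U+{cp:04X} is in plane {plane_of(cp)}")
```

Plane 0 is the Basic Multilingual Plane; planes 1 through 16 cover the range U+10000 to U+10FFFF.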
For storage and interchange outside the computer's memory, Unicode text is written in one of the Unicode Transformation Formats (UTF). UTF-8 is defined by IETF's RFC 3629. The ISO/IEC 10646 Annex D standard also uses the term "UCS Transformation Format" for UTF.
The sample below shows the Greek capital letter Δ (Delta, U+0394, also found in code page 1253) in each of the three UTF formats.
Sample | Format |
---|---|
SGML/HTML entity code for Δ | The ASCII/ISO 8859-1 "Latin-1" character set is not Unicode. It is a fixed-width, single-byte set of 256 characters, and Δ is not one of them. |
0xCE 0x94 | UTF-8 data consists of a variable number of 8-bit bytes, using as many bytes as needed to encode each character. UTF-8 remains a simple, single-byte, ASCII-compatible encoding method for characters at or below 127 (which does not include the Euro currency € at 128 in Windows code page 1252, the Pound currency £ at 163, nor the copyright © character at 169). 2 bytes are used for characters at or below 2047 (hex 0x07FF). 3 bytes are used for code points up to 65535 (hex 0xFFFF), which are mostly Asian characters (nicknamed "magic numbers"). 4 bytes are used for supplementary code points up to hex 0x10FFFF. Windows 2000/XP/2003 use UTF-16 internally, so use of a UTF-8 storage format in the database would require many extra conversions. Although SQL Server 2005 does not store data in UTF-8 format, it supports UTF-8 for handling XML data. |
0x0394 | UTF-16 uses 16-bit code units. Like the older UCS-2 double-byte representation, it encodes each character in the range U+0000 to U+FFFF as a single code unit, but it is not strictly fixed-width: characters above U+FFFF are represented as surrogate pairs of two code units, which extend the character set to over a million characters. UTF-16 is the native string type for Java, Visual Basic, COM, and Windows NT/2000/XP/2003. The Windows Component Object Model (COM) supports only UTF-16/UCS-2 in its APIs and interfaces. The last two values, U+FFFE and U+FFFF, and the 32 values from U+FDD0 to U+FDEF represent noncharacters. |
0x00000394 | UTF-32 is a fixed-width 32-bit encoding form, like UCS-4 (4 bytes per character). Its major advantage is that it expresses all characters uniformly, so that they are easy to handle in arrays. |
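The byte sequences for Δ can be reproduced with a few lines of Python. This sketch encodes a code point to UTF-8 by hand, following the 1-, 2-, 3-, and 4-byte ranges, and splits a supplementary code point into a UTF-16 surrogate pair. The function names utf8_encode and utf16_units are made up for illustration, and the sketch skips validation (for example, it does not reject surrogate code points or values above 0x10FFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand using the UTF-8 length rules."""
    if code_point < 0x80:        # 1 byte: ASCII, 0 to 127
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: up to U+07FF
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: up to U+FFFF
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 4 bytes: up to U+10FFFF
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf16_units(code_point: int) -> list:
    """Return the UTF-16 code units: one unit, or a surrogate pair."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


if __name__ == "__main__":
    delta = "\u0394"
    print(utf8_encode(ord(delta)).hex(" "))     # ce 94
    print(delta.encode("utf-16-be").hex(" "))   # 03 94
    print(delta.encode("utf-32-be").hex(" "))   # 00 00 03 94
    print([f"{u:04X}" for u in utf16_units(0x1D11E)])  # ['D834', 'DD1E']
```

The hand-written encoder agrees with Python's built-in codecs for valid code points, which makes it a convenient way to check the byte counts quoted in the table.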
To alert software to the fact that a file contains Unicode characters, the first bytes should contain a byte-order-mark (BOM) to declare that file's encoding format.
Bytes (in Hex) | Encoding Form |
---|---|
EF BB BF | UTF-8 |
00 00 FE FF | UTF-32, big-endian |
FF FE 00 00 | UTF-32, little-endian |
FE FF | UTF-16, big-endian |
FF FE | UTF-16, little-endian |
Microsoft's Notepad and many other text editors recognize the BOM. But some older text editors do not, and instead treat it as an ordinary Zero Width Non-Breaking Space (ZWNBSP) character.
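Software that reads such files can check the leading bytes against the BOM table. The Python sketch below uses the BOM constants from the standard codecs module; the function name detect_bom and the returned labels are assumptions chosen to match the table. The UTF-32 marks are tested before the UTF-16 marks because FF FE 00 00 also begins with FF FE.

```python
import codecs

# Longer marks come first: the UTF-32 LE mark starts with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def detect_bom(data: bytes):
    """Return the encoding form named by a leading BOM, or None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


if __name__ == "__main__":
    print(detect_bom("\ufeffhello".encode("utf-8")))
    print(detect_bom(b"plain ASCII"))
```

A file without a BOM yields None, in which case the reader has to fall back on a declared or guessed encoding.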
When using Microsoft Excel 2007, open a CSV file only from within Excel's File > Open menu, with the file type set to "Text Files (*.prn, *.txt, *.csv)", rather than by double-clicking the file in Windows Explorer. This is because Excel recognizes only the UTF-16 (but not the UTF-8) BOM. To display Unicode in Excel, specify "Arial Unicode MS" or another Unicode-capable font installed on your machine.
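A file for Excel can therefore be written as UTF-16 with a leading BOM. The Python sketch below builds such a file in memory. The function name to_utf16_tsv, the little-endian byte order, and the tab delimiter are assumptions made for illustration, and the output has not been checked against Excel 2007 itself.

```python
import codecs
import csv
import io


def to_utf16_tsv(rows) -> bytes:
    """Serialize rows as tab-separated text in UTF-16 LE with a BOM."""
    text = io.StringIO(newline="")  # keep the writer's line endings as-is
    writer = csv.writer(text, delimiter="\t", lineterminator="\r\n")
    writer.writerows(rows)
    # Prepend the BOM explicitly so the byte order is unambiguous.
    return codecs.BOM_UTF16_LE + text.getvalue().encode("utf-16-le")


if __name__ == "__main__":
    data = to_utf16_tsv([["name", "symbol"], ["Delta", "\u0394"]])
    print(data[:2].hex(" "))  # ff fe
```

The returned bytes can be written to disk in binary mode, for example with open(path, "wb").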
Encoding specifications need to be at the beginning of files because they control how the rest of the file is handled.
The text editor is supposed to insert the BOM automatically when saving a file with Encoding set to "Unicode".
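Internet Explorer has logic to guess at the encoding, but declaring the character set in the page head, for example with <meta http-equiv="Content-Type" content="text/html; charset=utf-8">, helps pages load faster and more accurately.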
XML parsers require the encoding specification to be on the first line, such as: <?xml version="1.0" encoding="UTF-8"?>
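The effect of an XML encoding declaration can be seen with Python's standard xml.etree.ElementTree parser. In this sketch the document is stored as ISO-8859-1 bytes, and the declaration on the first line is what lets the parser decode the accented character correctly. The element name and sample text are made up for the example.

```python
import xml.etree.ElementTree as ET

# The declaration on the first line tells the parser how to decode the bytes.
document = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>\n'
    "<name>caf\u00e9</name>"
).encode("iso-8859-1")

root = ET.fromstring(document)  # parse from bytes, not from a decoded string
print(root.text)
```

Passing bytes rather than an already decoded string matters here: the parser can only honor the declaration if it does the decoding itself.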
WGL4 is called "Pan-European" because it covers several codepages used in Europe.
To use UTF-16 in C++ on Windows, declare strings as data type wchar_t ("wide char") instead of char, and use the wcs functions instead of the str functions (for example, wcscat and wcslen instead of strcat and strlen). On most other platforms wchar_t is 32 bits wide.
To create a wide-character string literal in C code (UCS-2 on Windows), put an L before it, like so: L"Hello".
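The width of wchar_t depends on the platform: it is 16 bits on Windows and 32 bits on most Unix-like systems. One way to check this without compiling C code is Python's standard ctypes module, whose c_wchar type maps to the platform's wchar_t. The sketch below is only an illustration of that size difference, not a substitute for the C declarations.

```python
import ctypes

# c_wchar maps to the platform's C wchar_t type.
WCHAR_BYTES = ctypes.sizeof(ctypes.c_wchar)

# A NUL-terminated wide-character buffer, similar to wchar_t s[] = L"Hello";
buffer = ctypes.create_unicode_buffer("Hello")

print(f"wchar_t is {WCHAR_BYTES} bytes on this platform")
print(f"buffer holds {len(buffer)} wide characters including the terminator")
print(f"buffer occupies {ctypes.sizeof(buffer)} bytes")
```

On a platform with a 32-bit wchar_t, wide strings hold UTF-32 code units rather than UTF-16, so code that assumes 16-bit units is not portable.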
In Java, instances of BreakIterator are not created with a constructor, but with a static factory method that returns a BreakIterator object for each type of textual element: getCharacterInstance(), getWordInstance(), getSentenceInstance(), and getLineInstance().
Look up the JavaDoc for "DataInputStream" to read about Java's variation of UTF-8 (called modified UTF-8).
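Java's modified UTF-8 differs from standard UTF-8 in two ways: the NUL character U+0000 is written as the two bytes C0 80, and a supplementary character is written as two 3-byte sequences, one per UTF-16 surrogate, instead of a single 4-byte sequence. The Python sketch below mimics that encoding. The function name modified_utf8 is made up, and the sketch omits the 2-byte length prefix that Java's writeUTF adds.

```python
def modified_utf8(text: str) -> bytes:
    """Encode text the way Java's modified UTF-8 encodes characters."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    # Walk the string one 16-bit UTF-16 code unit at a time, as Java does.
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if 0x0001 <= unit <= 0x007F:
            out.append(unit)
        elif unit <= 0x07FF:
            # Includes U+0000, which becomes the two bytes C0 80.
            out += bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
        else:
            # Surrogates land here too, so each half takes three bytes.
            out += bytes([0xE0 | (unit >> 12),
                          0x80 | ((unit >> 6) & 0x3F),
                          0x80 | (unit & 0x3F)])
    return bytes(out)


if __name__ == "__main__":
    print(modified_utf8("\x00").hex(" "))         # c0 80
    print(modified_utf8("\U0001D11E").hex(" "))   # ed a0 b4 ed b4 9e
```

For text that contains neither NUL nor supplementary characters, the result is identical to standard UTF-8.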
To create database tables in Unicode format, use the multi-lingual data types nvarchar, nchar, and ntext.
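A sample SQL INSERT query format, built here by C# string concatenation (the variable name sql is illustrative; production code should pass these values as query parameters, because concatenating request input into SQL text invites SQL injection):
string sql = "INSERT INTO SomeMultiLangTable (userfname, userlname, userlangid) VALUES (N'" + Request.Form["txtFName"] + "', N'" + Request.Form["txtLName"] + "', '" + Request.QueryString["lang"] + "')";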
A sample SQL query to retrieve data:
SELECT * FROM SomeMultiLangTable WHERE userfname=N'some Unicode Data'
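The same insert and lookup can be written with query parameters, which keeps Unicode values out of the SQL text and avoids quoting problems. The Python sketch below uses the standard sqlite3 module purely as a stand-in so that it runs anywhere; that substitution, the column types, and the sample names are assumptions. SQLite stores text as Unicode, so the N prefix and the nvarchar type used by SQL Server do not appear here.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SomeMultiLangTable "
    "(userfname TEXT, userlname TEXT, userlangid TEXT)"
)

# Placeholders keep the Unicode values out of the SQL text itself.
conn.execute(
    "INSERT INTO SomeMultiLangTable (userfname, userlname, userlangid) "
    "VALUES (?, ?, ?)",
    ("Δημήτρης", "山田", "el"),
)

rows = conn.execute(
    "SELECT userfname, userlname FROM SomeMultiLangTable WHERE userfname = ?",
    ("Δημήτρης",),
).fetchall()
print(rows)
```

With SQL Server, the equivalent step is to bind each value as an nvarchar parameter instead of embedding an N'...' literal in the query string.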
International Features in Microsoft SQL Server 2005
IE7 enables you to click your way to inserting Unicode control characters.
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard by Richard Gillam