Re: CodeSets?

Tuesday, 5 June 2001

        Andy Piper <andyp(a)bea.com> writes:

...
 Can anyone give me a quick run down on the features and differences
 of UTF-16, UCS-2 and UTF-8? 
Instead of referring you to a proper resource, I'll give you the
definitions according to my flawed understanding of the terms.
Mwuahahahhahahahahha!  I expect a big followup by Stephen J.  :-)

UTF-8 -- a way of encoding Unicode characters that preserves ASCII
text intact.  Each Unicode character is represented by one to four UTF-8
bytes.  In spirit similar to Mule internal encoding.  In fact, UTF-8
technically needn't be associated with Unicode -- a UTF-8 encoder can
encode any sequence of non-negative integers that fit its range, not
just Unicode code points.
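To make the "one to four bytes" claim concrete, here is a small Python sketch. The sample characters and the helper name utf8_length are picked for illustration only; they are not part of the original discussion.

```python
# Sample code points chosen for illustration (an assumption, not from the post):
# U+0041 (ASCII letter), U+00E9 (Latin 1 range), U+20AC (euro sign),
# U+1D11E (a character outside the 16-bit range).
samples = ["\u0041", "\u00e9", "\u20ac", "\U0001d11e"]


def utf8_length(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print("U+%04X -> %d byte(s): %s" % (ord(ch), utf8_length(ch), encoded.hex()))

# ASCII text comes out byte-for-byte identical.
ascii_roundtrip = "plain ASCII".encode("utf-8")
```

The last line shows the "preserves ASCII" property: the encoded bytes are exactly the ASCII bytes.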

Features: very compact for mostly-ASCII text.  Apparently the
preferred method for I18N text interchange (default for XML, used by
the new GTK, Mozilla, etc.)

Disadvantages: conversion between byte and character indices requires
scanning the string from its start.  Incompatible with Latin 1.  (Old
protocols that assumed that chars in 160-255 range are Latin 1 will
misread UTF-8 streams.)  In fact, I'm being bitten by this
incompatibility at my current job.
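Both disadvantages are easy to reproduce.  The following Python sketch is only an illustration: the function name char_index_to_byte_offset and the sample strings are made up, and the function assumes its input is well-formed UTF-8.

```python
def char_index_to_byte_offset(data, char_index):
    """Find the byte offset at which the char_index-th character starts.

    Assumes `data` is well-formed UTF-8.  Bytes of the form 10xxxxxx are
    continuation bytes; every other byte starts a new character, so we
    have to walk the bytes from the beginning and count the starts.
    """
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:
            if seen == char_index:
                return offset
            seen += 1
    if seen == char_index:
        return len(data)  # index one past the last character
    raise IndexError("character index out of range")


# "naive cafe" with two accented letters: 10 characters, 12 bytes.
data = "na\u00efve caf\u00e9".encode("utf-8")
print(char_index_to_byte_offset(data, 3))  # 'v' is character 3 but byte 4

# A Latin 1 reader turns each byte of a UTF-8 sequence into its own char.
misread = "caf\u00e9".encode("utf-8").decode("latin-1")
print(ascii(misread))
```

The second part shows the Latin 1 problem: the two-byte UTF-8 sequence for e-acute is read back as two separate Latin 1 characters.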

UCS-2 -- a two-byte representation of some Unicode characters.  Each
character corresponds to exactly two bytes.  You don't get the
whole Unicode character range, but you get what Unicode thinks are the
"more frequent" chars.  Many applications used to consider this as a
good compromise.  Many applications are switching to UTF-16, UCS-4, or
UTF-8.

Features: easy conversion from byte to char indices; today most people
consider one-char-two-bytes to be an acceptable (mis)use of memory.

Disadvantages: totally incompatible with ASCII and Latin 1.  (Simple
stuff like splitting the text into *lines* no longer works because
a line delimiter such as \n is now a two-byte sequence.)  Problems with
endianness.  You don't get all characters, which bites you sooner or
later.
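A short Python sketch of the line-splitting and endianness problems.  Python has no codec named UCS-2, so this uses "utf-16-le" and "utf-16-be" as stand-ins; that substitution is an assumption of the example, and it holds only because every character in the sample text fits in 16 bits.

```python
text = "ab\ncd"

# Same text, two byte orders.  Each character takes exactly two bytes.
le = text.encode("utf-16-le")  # 61 00 62 00 0a 00 63 00 64 00
be = text.encode("utf-16-be")  # 00 61 00 62 00 0a 00 63 00 64

# Byte-oriented line splitting cuts the two-byte delimiter in half and
# leaves the second piece misaligned (an odd number of bytes).
naive_pieces = le.split(b"\n")
print(naive_pieces)

# Decoding first and splitting on characters gives the expected lines.
decoded_lines = le.decode("utf-16-le").split("\n")
print(decoded_lines)

# Reading big-endian data as little-endian yields unrelated characters.
misread = be.decode("utf-16-le")
print([hex(ord(ch)) for ch in misread])
```

This is why two-byte text usually needs either an agreed byte order or a byte order mark at the start of the stream.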

UTF-16 -- a special encoding of Unicode characters that represents
UCS-2-representable characters as UCS-2, and represents the other
characters as a pair of 16-bit units (a "surrogate pair").  A way to
think about it is: UTF-16 relates
to UCS-2 as UTF-8 relates to ASCII.  For this reason some consider
UTF-16 to be a good transition from UCS-2 -- UCS-2 readers can still
"see" UTF-16 data, and simply ignore the chars they don't understand.
But others disagree, because UTF-16 destroys the constant character
size property of UCS-2, so a transition to either UTF-8 or UCS-4 might
be more desirable.
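The "sequence" used for characters outside the 16-bit range is a pair of reserved 16-bit values.  A minimal Python sketch of the arithmetic follows; the function name to_utf16_units is invented for this example, and the result is cross-checked against Python's own "utf-16-be" codec.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit units that UTF-16 uses for one code point."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]           # same single unit as UCS-2
    if code_point > 0x10FFFF:
        raise ValueError("beyond the range UTF-16 can represent")
    offset = code_point - 0x10000     # 20 bits remain
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


units = to_utf16_units(0x1D11E)
print([hex(u) for u in units])

# Cross-check against the standard library codec.
expected = "\U0001d11e".encode("utf-16-be")
actual = b"".join(u.to_bytes(2, "big") for u in units)
```

A UCS-2 reader sees the two units as two unknown characters, which is the sense in which it can still "see" the data, and also why character counts stop matching unit counts.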

Features: "compatible" with UCS-2

Disadvantages: incompatible with UCS-2
