On Wed, 2003-01-29 at 11:48, Stephen J. Turnbull wrote:
>>>>> "Terje" == Terje Bless <link(a)pobox.com> writes:
Terje> If the signature you are referring to is the UNICODE
Terje> Byte-Order Mark then XML 1.0 Second Edition contained an
Thank you for clarifying the standard.
But, uhm, ZERO-WIDTH NO-BREAK SPACE, if you please. ;-) There is no
BOM in UTF-8; not even Microsoft could get away with advocating
little-endian UTF-8.
No, but I think the term "BOM" is a synonym/alias for "Encoding
Signature". "ZERO-WIDTH NO-BREAK SPACE" is the actual character when
used for its normal purpose, but this character has a dual function;
when it appears as the very first thing in an entity it takes on the
role of an Encoding Signature which for hysterical raisins is called a
"Byte-Order Mark".
IOW, AFAICT, when discussing the usage as an encoding signature, it is
appropriate to refer to it as either the "Byte-Order Mark" or the
"Encoding Signature" and not "ZERO-WIDTH NO-BREAK SPACE", despite
"Byte-Order Mark" being something of a misnomer, since both "BOM" and
"Encoding Signature" refer to the _role_ and not the character itself.
It's there in UTF-8 to allow a heuristic parser to identify this as a
flavour of UNICODE, and to distinguish UTF-8 from the other
transformation formats (such as UTF-16).
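
To make the "same character, different bytes" point concrete, here is a
small Python sketch. The variable names are mine, purely for
illustration; the only assumption is Python's standard codecs,
including "utf-8-sig", which treats a leading U+FEFF as a signature and
drops it.

```python
# U+FEFF is one character; each transformation format serialises it
# differently, which is what lets a parser tell the formats apart.
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE

utf8_sig = ZWNBSP.encode("utf-8")         # EF BB BF
utf16_be_sig = ZWNBSP.encode("utf-16-be")  # FE FF
utf16_le_sig = ZWNBSP.encode("utf-16-le")  # FF FE

data = utf8_sig + b"hello"

# Plain "utf-8" keeps U+FEFF as an ordinary character in the text.
as_character = data.decode("utf-8")

# "utf-8-sig" treats a leading U+FEFF as an encoding signature and strips it.
as_signature = data.decode("utf-8-sig")
```

So in UTF-8 the three bytes carry no byte-order information at all;
they only say "this is UTF-8".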
For example, Appendix F of the XML 1.0 Recommendation specifies an algorithm for
automatically determining the encoding of the entity by sniffing for the
encoding signature and falling back on various bit patterns matching
"<?xml" in the various candidate encodings if a signature is not
present.
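
A rough sketch of that kind of sniffing, in Python. This is a
simplified illustration and not the full Appendix F table: the function
name sniff_xml_encoding is hypothetical, the unusual UCS-4 byte orders
and EBCDIC are left out, and for the ASCII-compatible "<?xm" case a real
parser would go on to read the encoding declaration rather than assume
UTF-8.

```python
import codecs

# Signatures, with the 4-byte forms first so that UTF-32 LE
# (FF FE 00 00) is not mistaken for UTF-16 LE (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Fallback: the first four bytes of "<?xml" in each candidate encoding.
_DECL_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "utf-8"),  # ASCII-compatible family; assumption: treat as UTF-8
]


def sniff_xml_encoding(data: bytes):
    """Return (encoding_guess, signature_length) for the start of an entity."""
    for signature, name in _SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    for pattern, name in _DECL_PATTERNS:
        if data.startswith(pattern):
            return name, 0
    # No signature and no recognisable declaration: default to UTF-8.
    return "utf-8", 0
```

For instance, sniff_xml_encoding(b"\xef\xbb\xbf<?xml version='1.0'?>")
gives ("utf-8", 3), so the caller knows to skip three bytes before
handing the rest to the decoder.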
Then again, this all tends to give me a headache so I'm probably just
hopelessly confused. :-)