a lot of what you say here is confusing so i'm going to make some
general points and ask some questions.
we need a way of representing named character classes. this is what the
\p syntax does.
here is the relevant part of `man perlunicode':
o Named Unicode properties, scripts, and block ranges may be used like
character classes via the "\p{}" "matches property" construct and the
"\P{}" negation, "doesn't match property".
For instance, "\p{Lu}" matches any character with the Unicode "Lu"
(Letter, uppercase) property, while "\p{M}" matches any character with
an "M" (mark--accents and such) property. Brackets are not required for
single letter properties, so "\p{M}" is equivalent to "\pM". Many
predefined properties are available, such as "\p{Mirrored}" and
"\p{Tibetan}".
The official Unicode script and block names have spaces and dashes as
separators, but for convenience you can use dashes, spaces, or
underbars, and case is unimportant. It is recommended, however, that
for consistency you use the following naming: the official Unicode
script, property, or block name (see below for the additional rules
that apply to block names) with whitespace and dashes removed, and the
words "uppercase-first-lowercase-rest". "Latin-1 Supplement" thus
becomes "Latin1Supplement".
You can also use negation in both "\p{}" and "\P{}" by introducing a
caret (^) between the first brace and the property name: "\p{^Tamil}"
is equal to "\P{Tamil}".
NOTE: the properties, scripts, and blocks listed here are as of Unicode
3.2.0, March 2002, or Perl 5.8.0, July 2002. Unicode 4.0.0 came out in
April 2003, and Perl 5.8.1 in September 2003.
Here are the basic Unicode General Category properties, followed by
their long form. You can use either; "\p{Lu}" and
"\p{UppercaseLetter}", for instance, are identical.
Short Long
L Letter
LC CasedLetter
Lu UppercaseLetter
Ll LowercaseLetter
Lt TitlecaseLetter
Lm ModifierLetter
Lo OtherLetter
M Mark
Mn NonspacingMark
Mc SpacingMark
Me EnclosingMark
N Number
Nd DecimalNumber
Nl LetterNumber
No OtherNumber
P Punctuation
Pc ConnectorPunctuation
Pd DashPunctuation
Ps OpenPunctuation
Pe ClosePunctuation
Pi InitialPunctuation
(may behave like Ps or Pe depending on usage)
Pf FinalPunctuation
(may behave like Ps or Pe depending on usage)
Po OtherPunctuation
S Symbol
Sm MathSymbol
Sc CurrencySymbol
Sk ModifierSymbol
So OtherSymbol
Z Separator
Zs SpaceSeparator
Zl LineSeparator
Zp ParagraphSeparator
C Other
Cc Control
Cf Format
Cs Surrogate (not usable)
Co PrivateUse
Cn Unassigned
Single-letter properties match all characters in any of the two-letter
sub-properties starting with the same letter. "LC" and "L&" are special
cases, which are aliases for the set of "Ll", "Lu", and "Lt".
Because Perl hides the need for the user to understand the internal
representation of Unicode characters, there is no need to implement the
somewhat messy concept of surrogates. "Cs" is therefore not supported.
Because scripts differ in their directionality--Hebrew is written
right to left, for example--Unicode supplies these properties in
the BidiClass class:
Property Meaning
L Left-to-Right
LRE Left-to-Right Embedding
LRO Left-to-Right Override
R Right-to-Left
AL Right-to-Left Arabic
RLE Right-to-Left Embedding
RLO Right-to-Left Override
PDF Pop Directional Format
EN European Number
ES European Number Separator
ET European Number Terminator
AN Arabic Number
CS Common Number Separator
NSM Non-Spacing Mark
BN Boundary Neutral
B Paragraph Separator
S Segment Separator
WS Whitespace
ON Other Neutrals
For example, "\p{BidiClass:R}" matches characters that are normally
written right to left.
Scripts
The script names which can be used by "\p{...}" and "\P{...}", such as
in "\p{Latin}" or "\p{Cyrillic}", are as follows:
Arabic
Armenian
Bengali
Bopomofo
Buhid
CanadianAboriginal
Cherokee
Cyrillic
Deseret
Devanagari
Ethiopic
Georgian
Gothic
Greek
Gujarati
Gurmukhi
Han
Hangul
Hanunoo
Hebrew
Hiragana
Inherited
Kannada
Katakana
Khmer
Lao
Latin
Malayalam
Mongolian
Myanmar
Ogham
OldItalic
Oriya
Runic
Sinhala
Syriac
Tagalog
Tagbanwa
Tamil
Telugu
Thaana
Thai
Tibetan
Yi
Extended property classes can supplement the basic properties, defined
by the PropList Unicode database:
ASCIIHexDigit
BidiControl
Dash
Deprecated
Diacritic
Extender
GraphemeLink
HexDigit
Hyphen
Ideographic
IDSBinaryOperator
IDSTrinaryOperator
JoinControl
LogicalOrderException
NoncharacterCodePoint
OtherAlphabetic
OtherDefaultIgnorableCodePoint
OtherGraphemeExtend
OtherLowercase
OtherMath
OtherUppercase
QuotationMark
Radical
SoftDotted
TerminalPunctuation
UnifiedIdeograph
WhiteSpace
and there are further derived properties:
Alphabetic Lu + Ll + Lt + Lm + Lo + OtherAlphabetic
Lowercase Ll + OtherLowercase
Uppercase Lu + OtherUppercase
Math Sm + OtherMath
ID_Start Lu + Ll + Lt + Lm + Lo + Nl
ID_Continue ID_Start + Mn + Mc + Nd + Pc
Any Any character
Assigned Any non-Cn character (i.e. synonym for \P{Cn})
Unassigned Synonym for \p{Cn}
Common Any character (or unassigned code point)
not explicitly assigned to a script
For backward compatibility (with Perl 5.6), all properties mentioned so
far may have "Is" prepended to their name, so "\P{IsLu}", for example,
is equal to "\P{Lu}".
Blocks
In addition to scripts, Unicode also defines blocks of characters. The
difference between scripts and blocks is that the concept of scripts is
closer to natural languages, while the concept of blocks is more of an
artificial grouping based on groups of 256 Unicode characters. For
example, the "Latin" script contains letters from many blocks but does
not contain all the characters from those blocks. It does not, for
example, contain digits, because digits are shared across many scripts.
Digits and similar groups, like punctuation, are in a category called
"Common".
For more about scripts, see the UTR #24:
http://www.unicode.org/unicode/reports/tr24/
For more about blocks, see:
http://www.unicode.org/Public/UNIDATA/Blocks.txt
Block names are given with the "In" prefix. For example, the Katakana
block is referenced via "\p{InKatakana}". The "In" prefix may be
omitted if there is no naming conflict with a script or any other
property, but it is recommended that "In" always be used for block
tests to avoid confusion.
These block names are supported:
InAlphabeticPresentationForms
InArabic
InArabicPresentationFormsA
InArabicPresentationFormsB
InArmenian
InArrows
InBasicLatin
InBengali
InBlockElements
InBopomofo
InBopomofoExtended
InBoxDrawing
InBraillePatterns
InBuhid
InByzantineMusicalSymbols
InCJKCompatibility
InCJKCompatibilityForms
InCJKCompatibilityIdeographs
InCJKCompatibilityIdeographsSupplement
InCJKRadicalsSupplement
InCJKSymbolsAndPunctuation
InCJKUnifiedIdeographs
InCJKUnifiedIdeographsExtensionA
InCJKUnifiedIdeographsExtensionB
InCherokee
InCombiningDiacriticalMarks
InCombiningDiacriticalMarksforSymbols
InCombiningHalfMarks
InControlPictures
InCurrencySymbols
InCyrillic
InCyrillicSupplementary
InDeseret
InDevanagari
InDingbats
InEnclosedAlphanumerics
InEnclosedCJKLettersAndMonths
InEthiopic
InGeneralPunctuation
InGeometricShapes
InGeorgian
InGothic
InGreekExtended
InGreekAndCoptic
InGujarati
InGurmukhi
InHalfwidthAndFullwidthForms
InHangulCompatibilityJamo
InHangulJamo
InHangulSyllables
InHanunoo
InHebrew
InHighPrivateUseSurrogates
InHighSurrogates
InHiragana
InIPAExtensions
InIdeographicDescriptionCharacters
InKanbun
InKangxiRadicals
InKannada
InKatakana
InKatakanaPhoneticExtensions
InKhmer
InLao
InLatin1Supplement
InLatinExtendedA
InLatinExtendedAdditional
InLatinExtendedB
InLetterlikeSymbols
InLowSurrogates
InMalayalam
InMathematicalAlphanumericSymbols
InMathematicalOperators
InMiscellaneousMathematicalSymbolsA
InMiscellaneousMathematicalSymbolsB
InMiscellaneousSymbols
InMiscellaneousTechnical
InMongolian
InMusicalSymbols
InMyanmar
InNumberForms
InOgham
InOldItalic
InOpticalCharacterRecognition
InOriya
InPrivateUseArea
InRunic
InSinhala
InSmallFormVariants
InSpacingModifierLetters
InSpecials
InSuperscriptsAndSubscripts
InSupplementalArrowsA
InSupplementalArrowsB
InSupplementalMathematicalOperators
InSupplementaryPrivateUseAreaA
InSupplementaryPrivateUseAreaB
InSyriac
InTagalog
InTagbanwa
InTags
InTamil
InTelugu
InThaana
InThai
InTibetan
InUnifiedCanadianAboriginalSyllabics
InVariationSelectors
InYiRadicals
InYiSyllables
o The special pattern "\X" matches any extended Unicode sequence--"a
combining character sequence" in Standardese--where the first character
is a base character and subsequent characters are mark characters that
apply to the base character. "\X" is equivalent to "(?:\PM\pM*)".
so what would you call your user-defined character classes? and what api?
and no, many of these classes in no way, shape or form are similar to
iso-8859-anything. "word character" will spread across the whole
unicode space, for example.
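to illustrate the spread, here is a rough sketch in python (not elisp) that counts how many 256-character pages contain at least one letter. it uses the General Category "L" as a stand-in for "word character", which is an assumption: perl's notion of a word character is broader. `matches_property` and `pages_containing` are made-up names for this illustration, and the exact count depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def matches_property(ch, prop):
    """Rough analogue of \\p{prop}, for General Category names only.

    A one-letter prop such as "L" matches every two-letter
    sub-category ("Lu", "Ll", ...), as in the perlunicode excerpt.
    """
    return unicodedata.category(ch).startswith(prop)


def pages_containing(prop):
    """Return the set of 256-code-point pages with at least one match."""
    pages = set()
    for cp in range(sys.maxunicode + 1):
        if matches_property(chr(cp), prop):
            pages.add(cp >> 8)
    return pages


letter_pages = pages_containing("L")
total_pages = (sys.maxunicode + 1) >> 8
print(len(letter_pages), "of", total_pages, "pages contain at least one letter")
```

nothing like one page of ASCII plus one page of national characters falls out of this; the letters are scattered over hundreds of pages.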
as for regexps, i didn't say we should use the pcre library, but we
should make our regexps work like perl's; they are far more powerful
than ours, anyway.
what api for new regexps would you propose?
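as one example of the perl behavior in question, the "\X" construct quoted above from perlunicode, "(?:\PM\pM*)", can be approximated in a few lines. this is a minimal sketch in python, not a proposal for the elisp api; `is_mark` and `combining_sequences` are hypothetical names. one deliberate deviation: a mark with no preceding base character starts its own sequence here, whereas the perl pattern would not match it at all.

```python
import unicodedata


def is_mark(ch):
    """True if ch has a General Category starting with "M" (like \\pM)."""
    return unicodedata.category(ch).startswith("M")


def combining_sequences(text):
    """Split text into base-character-plus-marks sequences."""
    sequences = []
    for ch in text:
        if is_mark(ch) and sequences:
            # A mark attaches to the sequence that precedes it.
            sequences[-1] += ch
        else:
            # A non-mark (or a leading, baseless mark) starts a new sequence.
            sequences.append(ch)
    return sequences
```

for instance, "e" followed by U+0301 (combining acute accent) and then "a" comes back as two sequences, the first of which is two code points long.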
>Does it matter? What are the programmatic applications for these
>things?
what do you mean?
it does matter because with the wrong implementation we either [a] take
a humongous amount of space or [b] potentially make our regexps slower
than they should be.
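to make the trade-off concrete, here is a minimal sketch of the two candidate representations of a character class, sorted ranges and page-table-style lookup. it is python standing in for C, `RangeClass` and `PageClass` are hypothetical names, and the ranges passed in are assumed not to overlap.

```python
from bisect import bisect_right


class RangeClass:
    """Character class stored as sorted, non-overlapping (start, end) ranges."""

    def __init__(self, ranges):
        self.ranges = sorted(ranges)
        self.starts = [start for start, _ in self.ranges]

    def __contains__(self, cp):
        # Binary search for the last range starting at or before cp.
        i = bisect_right(self.starts, cp) - 1
        return i >= 0 and cp <= self.ranges[i][1]


class PageClass:
    """Same class as a two-level table: page number -> 256-bit bitmap."""

    def __init__(self, ranges):
        self.pages = {}
        for start, end in ranges:
            for cp in range(start, end + 1):
                page, bit = cp >> 8, cp & 0xFF
                self.pages[page] = self.pages.get(page, 0) | (1 << bit)

    def __contains__(self, cp):
        # Two direct lookups: the page, then the bit within it.
        return bool((self.pages.get(cp >> 8, 0) >> (cp & 0xFF)) & 1)


# Hebrew and Arabic blocks, used only as sample data.
sample = [(0x0590, 0x05FF), (0x0600, 0x06FF)]
by_range = RangeClass(sample)
by_page = PageClass(sample)
```

the range form stays small however many characters a range covers but needs a search per lookup; the page form answers in constant time but stores a bitmap for every page the class touches, which adds up for classes scattered across the whole code space.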
as for fonts, i'm not sure what is so wrong about a char->font mapping,
and what you propose instead. the intention is that `put-char-table'
can take a character class as well as a single character, and sets a
value for that whole class. this seems quite natural to me -- usually,
you want to specify, e.g., the "Traditional Arabic" font for Arabic
characters, the "MS Mincho" font for Japanese, etc.
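a minimal sketch of that idea, in python rather than elisp: a table whose setter accepts a whole class (given here as a list of code point ranges) as well as a single range. `CharTable`, `put_range`, `put_class` and `get` are hypothetical names for illustration only, not the existing `put-char-table' interface, and the linear scan is chosen for brevity, not speed.

```python
class CharTable:
    """Maps characters to values; later assignments override earlier ones."""

    def __init__(self, default=None):
        self.default = default
        self.entries = []  # list of (start, end, value)

    def put_range(self, start, end, value):
        self.entries.append((start, end, value))

    def put_class(self, ranges, value):
        """Set one value for every range in a character class."""
        for start, end in ranges:
            self.put_range(start, end, value)

    def get(self, ch):
        cp = ord(ch)
        # Scan newest-first so the most recent assignment wins.
        for start, end, value in reversed(self.entries):
            if start <= cp <= end:
                return value
        return self.default


# Sample classes as block ranges (assumed data for the example).
ARABIC = [(0x0600, 0x06FF)]
KANA = [(0x3040, 0x309F), (0x30A0, 0x30FF)]

fonts = CharTable(default="Courier")
fonts.put_class(ARABIC, "Traditional Arabic")
fonts.put_class(KANA, "MS Mincho")
```

with this, `fonts.get` returns "Traditional Arabic" for an Arabic letter, "MS Mincho" for hiragana or katakana, and the default for everything else.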
ben
Stephen J. Turnbull wrote:
>>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
>
> Ben> -- [2] charsets should be generalized so that they encompass
> Ben> an arbitrary set of unicode chars.
>
>Once we have Unicode inside, charsets as currently understood make no
>sense at all. It's not a generalization, it's something completely
>different.
>
>Don't worry about it, nobody outside of the Mule Lab and Kyoto
>University will notice, let alone care.
>
>Of course we will need character classes for regular expressions and
>to describe font repertoires. But let's not call them charsets; not
>only does that clash with Mule, it clashes with MIME usage too.
>
> Ben> -- [3] we should add unicode-compatible charsets. the names
> Ben> should be such that they programmatically map onto
> Ben> perl-compatible (used with regexp \p, see below) charset
> Ben> names.
>
>Just use abstract character classes and have a standard table mapping
>names to predefined classes. In other words, I don't think this
>requires design, we Just Do It when we need it.
>
> Ben> -- [4] the perl regexp \p syntax should be adopted for
> Ben> referencing charsets.
>
>Uh, what syntax is that? And what is perl? :-)
>
> Ben> the big problem here is \( and (, which are backwards.
>
>I assume you mean all of the things we have that are backwards (like
>|,{,} at least). But who cares? Regexps in use are by their nature
>compiled objects. I don't see why we can't have multiple languages
>that target that virtual machine.
>
>We absolutely need an Emacs LISP-compatible compiler, of course.
>
> Ben> the only reasonable solutions i can see are [a] a global variable
> Ben> to control which kinds of regexps are used;
>
>That won't work, we don't have namespaces/packages to protect
>unsuspecting legacy Elisp libraries from PCREs. We're going to need
>to provide multiple interfaces if we accept multiple syntaxes. Note
>that Python used to use the PCRE library, but they trashed that years
>ago in favor of a Python implementation. I'm not sure we could afford
>that for stuff like fontlock, but surely we don't have to limit
>ourselves to PCRE or rx or whatever that has a C library in the
>experimental stage.
>
>Going back to the "regexps are compiled objects" theme, isn't it time
>we gave up on the creaky JIT-compilation-and-caching strategy with
>global match objects? Of course we want an interface that allows
>one-shot regexps to be specified as strings for convenience, but
>backwards compatibility (where you can also spell backwards G-N-U) is
>the only reason to not return match objects.
>
>Yes, Python is the "right" model here, because this is LISP. We
>_can't_ provide Perl-style syntactic support for the global match
>object, so we may as well return first-class objects instead of
>booleans or indices that we can only use by feeding them to another
>function anyway.
>
> Ben> -- [5] char tables need to be changed. their current
> Ben> implementation is heavily tied to the current mule character
> Ben> structure. we will also need to change `map-char-table'.
> Ben> unfortunately, this will be incompatible with its current
> Ben> workings. fortunately, only three packages use
> Ben> map-char-table, and from looking at these three, it's not
> Ben> clear anything will break. also, the new map-char-table will
> Ben> work like GNU Emacs.
>
>I don't think this is anything to worry about.
>
> Ben> -- [6] at some point, font objects should be changed to
> Ben> include a char table so that different charsets can have
> Ben> different fonts.
>
>Huh? You can't win that way, it's a multivalued mapping in both
>directions. Not to mention fundamentally a non-algorithmic issue. We
>should stop thinking in terms of predefined charsets, period. We need
>to think in terms of a user-specified default repertoire, and the
>repertoire of the current document. (Of course we should make a
>standard set of repertoires available by name for user convenience,
>but that can come on an as-needed or as-the-fit-takes-someone basis.)
>
>BTW I'm in the process of converting everything in the Xft branch to
>use the fontconfig API for internal representation of font
>_instantiators_; I think it's nuts to do anything else. It's not
>perfect, but it can represent any external name (MS Windows, Mac,
>XLFD) and any internal set of properties. Of course instantiated
>fonts will use the platform representation, but I don't see any good
>reason why we shouldn't translate "Courier-12:weight=bold",
>"Courier:12:bold", and "-*-courier-bold-r-*-*-*-120-*-*-*-*-*-*" all
>to the same font regardless of what platform we're on.
>
>We can also use the fontconfig library itself to handle listing,
>sorting, and filtering on some platforms, but that's not what I'm
>talking about here. Getting that for free in Xft is of course a bonus
>from my point of view, but it also meets your "vector representation"
>desideratum.
>
> Ben> also, in [5] we have an implementation choice. either we use
> Ben> sorted ranges or we use page-table-style lookups, [...]
> Ben> should we allow the user to control this, with range tables
> Ben> the default?
>
>Does it matter? What are the programmatic applications for these
>things?
>
>In terms of what kind of character classes to expect, with the
>exception of Han and Hangul users, basically I think we can expect
>mostly things isomorphic to ISO 8859: a half page for ASCII, and a
>full page (at most) for other national characters, with everything
>else in Unicode a "don't-care". On the other hand, for the Han and
>Hangul users we can often expect on the order of a quarter-plane (16K)
>of usage due to the arbitrary (ie, phonetic and stroke count)
>arrangements of those subsets within Unicode. Neither of those
>subsets has much locality of reference as far as I know.
>