Character Description Language

字形描述语言 (字描语)


An XML application for rendering and indexing Han (CJKV) characters


Technology Overview

• Wénlín Institute’s CDL, the Character Description Language, is the world’s most powerful new font technology.

• CDL is the engine (C source code) behind CJK Unicode megafonts, breaking the 64K glyph barrier! (A CDL font can contain an unlimited number of glyphs.)

• CDL is an XML application, a standards-based font and encoding technology designed for precise and compact description, rendering, and indexing of all 漢/汉 Han (Chinese, Japanese, Korean, and Vietnamese = CJKV) characters, encoded and unencoded.

• CDL is a font database with descriptions for nearly 100,000 characters, including complete Unicode 7.0 CJK character support, and more!

• CDL means consistent stroke/component analyses, built-in indexing and variant mappings, and high-quality graphic images as outlines convertible to SVG, PostScript, MetaFont, and more.

• CDL is a compressed binary with an incredibly small memory footprint (1.4 MB: 1,443,206 bytes), suitable for use in limited-memory mobile devices that want full Unicode CJK support.

• CDL technology has applications for machine learning, for handwriting recognition and input methods, for optical character recognition (OCR), and most importantly for human language-learning.

• The basic elements of CDL are a flexible two-dimensional coordinate space, and a set of basic stroke types. Using these simple elements, CDL provides a framework for describing characters and components, and for (recursive) reuse of character and component descriptions in the descriptions of other characters and components.

• CDL adds new dimensions to the UCS code space, with a variant mechanism for associating an unlimited number of CDL descriptions with any Unicode codepoint.

• Each CDL description can be associated with zero or more Unicode code points, making CDL the ideal tool for extending The Unicode Standard.

• CDL has applications beyond CJK, for organizing information underlying the rendering of any complex script.
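
The recursive reuse mentioned in the overview above can be illustrated with a minimal Python sketch. This is not Wénlín's C implementation and not the CDL XML syntax: the class names (`Stroke`, `Component`, `Description`), the unit-square coordinate range, the stroke labels, and the stroke coordinates are all assumptions chosen for illustration. The sketch only shows the general idea that a description may contain strokes or references to other descriptions, each placed in a box of the parent's coordinate space.

```python
from dataclasses import dataclass

Point = tuple[float, float]
Box = tuple[float, float, float, float]  # (x0, y0, x1, y1) in the parent's space


@dataclass
class Stroke:
    kind: str            # illustrative stroke-type label, e.g. "h" for a horizontal
    points: list[Point]  # skeleton trajectory in a unit square (assumed range)


@dataclass
class Component:
    name: str  # name of another description to reuse
    box: Box   # where that description is placed inside the parent


@dataclass
class Description:
    parts: list  # a mix of Stroke and Component objects


def place(point, box):
    """Map a point from the unit square into the given box."""
    x0, y0, x1, y1 = box
    px, py = point
    return (x0 + px * (x1 - x0), y0 + py * (y1 - y0))


def flatten(name, library, box=(0.0, 0.0, 1.0, 1.0)):
    """Recursively expand a description into plain strokes in the outer space."""
    strokes = []
    for part in library[name].parts:
        if isinstance(part, Stroke):
            strokes.append(Stroke(part.kind, [place(p, box) for p in part.points]))
        else:
            # Map the component's box into the current box, then recurse.
            child_box = place(part.box[:2], box) + place(part.box[2:], box)
            strokes.extend(flatten(part.name, library, child_box))
    return strokes


# Hypothetical mini-library: 木 is described by strokes, 林 and 森 only by reuse.
LIBRARY = {
    "木": Description([
        Stroke("h", [(0.0, 0.25), (1.0, 0.25)]),
        Stroke("s", [(0.5, 0.0), (0.5, 1.0)]),
        Stroke("p", [(0.5, 0.25), (0.0, 1.0)]),
        Stroke("n", [(0.5, 0.25), (1.0, 1.0)]),
    ]),
    "林": Description([
        Component("木", (0.0, 0.0, 0.5, 1.0)),
        Component("木", (0.5, 0.0, 1.0, 1.0)),
    ]),
    "森": Description([
        Component("木", (0.25, 0.0, 0.75, 0.5)),
        Component("林", (0.0, 0.5, 1.0, 1.0)),
    ]),
}

for name in ("木", "林", "森"):
    print(name, len(flatten(name, LIBRARY)), "strokes")
```

In this sketch only 木 stores any stroke data; 林 and 森 are a few component references each, which is one way to see why descriptions built by reuse can stay compact.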


Core CDL Resources

  • A draft DTD (Document Type Definition) defines the CDL tags (elements and attributes)
  • An equivalent (~) XML Schema (as generated by this program)

CDL Status and News

News in brief (reverse chronological):

  • A total of 93,850 CDL descriptions! (V=0: 81,360) [2014-09-17: CJK Ext. E in progress.]
  • ✮ New CDL Community Group launched in W3C [2014-08-15]
  • A total of 86,416 CDL descriptions! (V=0: 81,360) [2013-02-26]
  • ☆ Publication of The Unicode Standard Version 6.1 – Core Specification: Appendix F: “CJK Strokes Documentation” (all CJK glyphs in this appendix were created by the CDL team using CDL, and all text derives from the CDL Spec. and from CJK Strokes work in WG2/IRG:N3063) [2012-01-31]
  • Complete coverage of Unicode 6.1 CJK repertory! (including all Unihan 6.1; a total of 84,044 CDL descriptions, including variants and non-CJK) [2011-08-16]
  • Complete coverage of Unicode 6.0 CJK repertory! (including all Unihan 6.0, Radical blocks, selected non-Unihan, and thousands of variants) [2010-06-30]
  • A total of 82,127 CDL descriptions! (including ALL Extension D) [2010-06-30T13:22:40PDT]
  • A total of 81,866 CDL descriptions! (including ALL Extension C) [2010-06-26T21:28:23PDT]
  • ☆ The CDL Team receives a three-year Scholarly Editions and Translations grant from the U.S. National Endowment for the Humanities (NEH) [RQ-50525-10; 2010-2013], in support of continuing work on 《說文解字》Shuō Wén Jiě Zì [June, 2010]
  • A total of 78,254 CDL descriptions! (including ALL Extension B!!!) [2010-04-02T18:17:23PDT]
  • A total of 73,254 CDL descriptions! (including most of Ext B & beginning of Ext C) [2009-07-29]
  • Another CDL presentation at IRG (#30), Pusan ROK [June 2008] (PDF)
  • ☆ Wénlín CDL web prototype is now online! [June 2008] (CGI)
  • Complete CJK Strokes block now in Unicode 5.1 [April 4, 2008] (PDF)
  • Another CDL presentation at IRG (#29) at Adobe, Inc. [Nov. 2007] (PDF)
  • CDL presentation at IUC (#31), Adobe Systems, San Jose [October 2007] (HTML)
  • “A character description language for CJK.” [Bishop & Cook, 2007] Multilingual, #91, Volume 18, Issue 7 (pp. 62-68); October/November 2007 (PDF [final draft, full text]).
  • ☆ The CDL Team receives a two-year Digital Humanities Start-up Grant from the U.S. National Endowment for the Humanities (NEH) [HD-50012-07; 2007-2009] (read the funded proposal) [March, 2007]
  • CJK Strokes block extension approved! (36 strokes total, U+31C0 .. U+31E3)
  • CJK Strokes block extended at IRG 25 in 美國加州大學柏克萊分校 Berkeley, California
  • ☆ New CJK Strokes block created, and seeded with 16 HKSCS strokes (U+31C0 .. U+31CF); read the Unicode 5.0 CJK Strokes block description (PDF)
  • Proposal (N1096, N1097) to encode a block of CJK Stroke Types presented at IRG 23 on 濟州島 Jeju Island, S. Korea
  • Notes towards a Chinese translation of the CDL specification (漢語翻譯)
  • ☆ Publication of draft CDL specification [2003-10-31] (see above)
  • Presentation of CDL specification (N985, N986, N987) at IRG 21 in 桂林 Guilin, China
  • Proposals for IRG, Unicode, and public use of CDL
  • Unicode Technical Note on CDL in preparation
  • Proposal to encode a block of CJK Stroke Types in preparation
  • CDL database now (as of Spring 2003) contains descriptions for more than 56,000 CJK ideographs, including the whole BMP and more than half of Ext. B
  • Remaining Extension B descriptions (still!) in the pipeline
  • XML form of CDL is now primary in unpublished Wenlin
  • Wenlin now supports 4 billion variants for each codepoint
  • Wenlin is now using all of Unicode Plane 15 as PUA Variant Selectors for input of variants
  • CDL description of a complete 《說文解字》 Shuō Wén Jiě Zì text
  • CDL description of all 《宋本廣韻》 Sòng Běn Guǎng Yùn entries and 反切 fǎnqiè
  • 《漢語大字典》 Hànyǔ Dà Zìdiǎn variant mapping table nearly complete
  • Mapping tables developed with CDL technology are now available in Unicode’s public Unihan database [since 2000]
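
Two items in the list above mention roughly 4 billion variants per codepoint and the use of Unicode Plane 15 as private-use variant selectors. The page does not say how Wenlin actually encodes a variant number, so the following Python sketch is purely hypothetical: it assumes a base character followed by two Plane 15 private-use code points (U+F0000 to U+FFFFD), each acting as one base-65,534 digit. The function names and the two-selector layout are illustrative assumptions, shown only to make the arithmetic concrete.

```python
# Supplementary Private Use Area-A occupies Plane 15.
PLANE15_FIRST = 0xF0000
PLANE15_LAST = 0xFFFFD  # U+FFFFE and U+FFFFF are noncharacters
DIGITS = PLANE15_LAST - PLANE15_FIRST + 1  # 65,534 usable code points


def encode_variant(base_char: str, variant: int) -> str:
    """Hypothetical scheme: base character plus two Plane 15 'digits'."""
    if not 0 <= variant < DIGITS * DIGITS:
        raise ValueError("variant number out of range for two selectors")
    high, low = divmod(variant, DIGITS)
    return base_char + chr(PLANE15_FIRST + high) + chr(PLANE15_FIRST + low)


def decode_variant(text: str) -> tuple[str, int]:
    """Inverse of encode_variant for a three-code-point sequence."""
    base_char, high, low = text
    variant = (ord(high) - PLANE15_FIRST) * DIGITS + (ord(low) - PLANE15_FIRST)
    return base_char, variant


print(f"{DIGITS:,} selectors per position")
print(f"{DIGITS * DIGITS:,} variant numbers with two selectors")
print(f"{2 ** 32:,} values in a 32-bit variant field, for comparison")
```

Under this assumed layout, two selectors give a little over 4.29 billion distinct variant numbers per base character, the same order of magnitude as the figure quoted in the list.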

Related Links

  • [Hobby84] John D. Hobby and G. Gu, “A Chinese Meta-Font”, TUGboat, the TeX Users Group Newsletter, 5(2), 1984 (also Stanford Report STAN-CS-83-974; Stanford CS-TR-83-974).

Jargon Notes

On “CJK Unified Ideograph”: an apology

In Unicode/ISO parlance, certain blocks of 漢 Hàn characters are called “CJK Unified Ideographs”. CJK (a trademark of the RLG) stands for “Chinese, Japanese, and Korean”, and is sometimes extended to CJKV “Chinese, Japanese, Korean and Vietnamese” (and it could be extended further, to include all IRG contributors). Scripts in all of these locales make use of CJKV (Chinese-derived) characters. These characters are “Chinese-derived” in that the principles for character creation originated in China (more than 3,000 years ago). These characters are sometimes also termed 漢 (“Hàn” as in the name of Unicode’s Hàn database [a.k.a. UniHan]), reflecting the legacy of the influential 東漢 Dōng Hàn ‘Eastern Hàn’ Dynasty script analyses (《說文解字》, c. 121 AD). These characters are “Unified” in that (many though not all) locale-specific differences in character forms (stylistic conventions, typeface expectations) have been ignored (as non-distinctive) for encoding purposes. Of course, there are characters in all locales which are unique to those locales, and so unification also involves superset definition.

Like Hàn, the term ideograph (sometimes also [mis-]written “ideogram”) is today used in information-technology (info-tech) circles to signify ‘the uniquely CJKV script entity’, which is to say, “CJKV ideographs” constitute a certain subset of the “characters” to be found in Asian texts. Japanese Kana (elements of Hiragana and Katakana syllabaries) are also “characters” in Japanese, but are not termed “ideographs” (though they derive from Chinese-derived “ideographs”). The English term “Han” has some advantages over “ideograph”, but reliance on a specific pronunciation of the character “漢” (which varies by locale) presents its own challenges to general acceptance as a cover term. Why English “Han” and not “Kan”? Let’s just chalk this up to emphasis on the Chinese-derived principles governing character formation, and deference to Modern Standard Chinese (北京話 Běijīnghuà = 官話 Guānhuà = 普通話 Pǔtōnghuà ‘Mandarin’) pronunciation of the character 漢 Hàn (as in 東漢).

Though perhaps more politically acceptable than Han as a cover term for CJK characters, the term ideograph is nevertheless something of a misnomer, if “ideograph” primarily means ‘completely pure idea writing, not conveying the sounds of words, but conveying only ideas independent of specific verbal communication’. As originally applied to Chinese writing (by early missionaries), such usage may reflect misunderstanding, a sense that the writing did not convey sound values at all. Of course, Chinese characters are used to write spoken language, not simply wordless ideas or ideas unconnected with specific spoken language. However, if early sinologists were not confused about the meaning of “ideogram”, then they were guilty of hyperbole, and specifically chose to extend the meaning to include also relatively imprecise writing of specific speech sounds. Perhaps the impression was that the degree of phonological imprecision was such as to present almost a total lack of phonological information. Certainly, a first glance at phonological variation (diachronic and synchronic) in character readings within and across CJKV locales can only aggravate the opinion that if the writing is not completely aphonographic, then it is either extremely complex or completely and utterly chaotic. And lacking readily apparent systematic phonological information, what remains but semantics or the basic idea to be conveyed? Indeed, for casual purposes the ideas conveyed by CJK characters are often intelligible across locales, though the spoken languages are not themselves mutually intelligible. Certain pronunciation details may be irrecoverable from even a close phonetic transcription, but CJK writing is an especially lossy means of phonological information storage.

Although info-tech usage of the term ideograph may be an unfortunate neologism derived from an original misnomer, it seems to have arisen as an informed compromise in a specific usage context, for lack of a more precise pre-existing English word. Terms such as logograph (‘one-to-one relation between sign and “word” [itself not terribly well-defined]’; not all CJK syllables are words, and CJK words are not always monosyllabic) or morphosyllabograph (a specialist Greco- mouthful!) would have been similarly imprecise and clumsy. They might have been preferable for not perpetuating a misconception of “pure disembodied idea writing”, but they might have promoted some other misconceptions.

So, the term ideograph in modern info-tech usage might best be understood (or rationalized ex post facto, along with early sinological usage) as indicative of a difference of degree: that the phonological information conveyed is somewhat limited relative to more fully phonographic scripts (those using alphabets and isographic syllabaries to convey specific sound values).

“A syllabary being a system for writing the elements of the syllable canon of a language, the syllabograph would be a graphical element of a syllabary. When there is a one-to-one correspondence between syllable type and syllabograph, this is an isographic syllabary. In that it sometimes has multiple representations of a given syllable type, the Chinese writing system might be termed an imperfect or heterographic syllabary. Chinese characters, the elements of a heterographic syllabary, might be termed heterographic syllabographs, or heterosyllabographs. No matter what they are called, there is clearly some degree of imprecision in the Chinese script, in terms of its ability to convey specific sound values.” [Cook, 2003:195]

Many Asian languages (and CJK languages in particular) are termed monosyllabic. Of Chinese languages (or dialects) in particular, this means that the syllable (phoneme cluster with tone-bearing vocalic nucleus) bears much functional weight, and is traditionally extremely well-defined, both phonologically (phonemically) and morphologically (morphemically), presumably in natural speech as in orthography and lexicographic descriptions. Most syllables are associated with (one or more) distinct units of meaning (morpheme + syllable = morphosyllable) apparent in and productively used in formation of polysyllabic words. If a syllable has multiple clearly distinct meanings, each of these would “ideally” be written with a distinct character, though actual orthographic practices reflect multiple layers of subjective and irregular development. The syllable has status similar to that of the phoneme in other languages, as evident in meaning-driven “transcription” of foreign words (nativizing morpho-analytic re-syllabification). Characters have been associated with single syllables for a long time, presumably from the beginning of Chinese writing, and this implies that the language before writing was also monosyllabic. It might, however, be more accurately termed “sesquisyllabic”, since historical studies suggest that syllable boundaries are rather fluid, and that prominent syllabic nuclei may assimilate adjacent relatively unstressed (or destressed) elements over time. Though the earliest specific evidence comes from no earlier than the earliest 反切 fǎnqiè ‘sound glosses’ (perhaps in the 7th c. AD), the traditional opinion is that the character-to-syllable connection goes back to the earliest writing (this is reflected in traditional monosyllabic reconstructions of Old Chinese phonology).

Syllables have long had reality for native speakers/writers as functional nuggets of meaning+sound, as is evident e.g. in ancient Chinese character lexicons organized into (homophonic) syllable classes, with both sound and meaning glosses. Of the six major traditional Chinese character types (六書:象形,指事,會意,形聲,轉注,假借), by far the most common is the so-called 形聲 xíngshēng ‘sematophonic’ compound, combining one semantic determiner with one phonographic component. Thus, homophonic (heteromorphic, heteronymous) characters may be written with the same phonographic component, but are semantically differentiated by means of a (non-shared) semantic determiner. But the phonographic component does not always give a very consistent indication of the pronunciation, due to local variation and historical changes. For example, the character 皆 (MC /kei/, c. 1000 AD) is today commonly pronounced jiē, but when it is used as a component the resulting compound is not always pronounced jiē (e.g. 階 jiē, 諧 xié, 偕 xié, 揩 kāi, 楷 kǎi). Such variation is sometimes regular, but at other times it seems unpredictable in the light of available historical evidence. The phonographic component is sometimes said also to have a (non-phonographic) semantic function, and such characters (simultaneously 形聲 xíngshēng ‘sematophonic’ and “會意” huìyì ‘semantic complexes’) are termed “亦聲” yìshēng [lit.] ‘also phonetic’. Thus, even in phonographic writings the phonographic component is sometimes not entirely devoid of (non-phonologic) semantic value, and serves in combination with the determiner to specify the meaning of the syllable. Morphosyllabographs are fraught with meaning, but often only a small part of that meaning is clearly phonological, and sometimes none of it is at all.
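
The 皆 example above can be laid out as a small table in Python. The characters and modern readings are the ones given in the paragraph; the semantic determiners (阝, 言, 亻, 扌, 木) are supplied here for illustration and are not listed in the passage, and the variable names are arbitrary. The sketch groups the compounds by reading to show that one shared phonographic component corresponds to several different pronunciations.

```python
from collections import defaultdict

PHONETIC = "皆"  # shared phonographic component, modern reading jiē

# character -> (semantic determiner, modern reading)
COMPOUNDS = {
    "階": ("阝", "jiē"),
    "諧": ("言", "xié"),
    "偕": ("亻", "xié"),
    "揩": ("扌", "kāi"),
    "楷": ("木", "kǎi"),
}

# Group the compounds by their modern reading.
by_reading = defaultdict(list)
for char, (determiner, reading) in COMPOUNDS.items():
    by_reading[reading].append(char)

for reading, chars in by_reading.items():
    print(reading, " ".join(chars))

print(len(COMPOUNDS), "compounds,", len(by_reading), "distinct readings")
```

Five compounds built on the same component fall into four reading groups, only one of which matches the reading of 皆 itself, which is the imprecision the paragraph describes.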

At any rate, even if the issues surrounding Chinese character typologies are complex, and usages of the term ideograph are imprecise, contradictory, confused and/or confusing, the term ideograph is most certainly not used in info-tech to indicate “pure idea writing” (bypassing graphic representation of speech), nor is it used to indicate any of the traditional classes of Chinese-derived characters such as 象形 xiàngxíng ‘pictographic’ characters, nor the very small traditional set of so-called 指事 zhǐshì ‘indicative of the deed’ characters. All Chinese-derived characters indicate syllabic speech units, and though the spoken languages of their speakers may not be mutually intelligible, the writings sometimes are.

A word on the character of the word “character”

In naming CDL, we use “character” with (one of) its common English meaning(s), intentionally avoiding the uncommonly understood (or commonly misunderstood) information-technology terms “ideograph” and “glyph”. (No, info-tech glyph does not mean we’re suddenly talking about Mayan here, as Pulleyblank once complained in the early 1990s.) Arguably, there are some other good reasons not to call CDL a “CJK Ideographic Glyph Description Language”.

  1. In terms of a “character” vs. “glyph” distinction: CDL descriptions lie somewhere between abstract character (script entity class) and concrete glyph (instantiated member of a script entity class). In info-tech-speak: character is to glyph as class [set] is to class-member [element]. The underlying stroke-based CDL description is rather abstract in that it specifies only a skeleton trajectory to be fleshed out by the CDL interpreter (rather than a complete outline). Interpreted and rasterized, the abstract CDL character becomes a concrete CDL glyph.
  2. In terms of an “ideograph” vs. “character” (e.g. Kanji vs. Kana) distinction: CDL is truly a “character description language” in that its basic principles are applicable to any script entity in any script (not simply so-called “ideographs” in a CJK script). Wenlin’s own stroked Latin font, for example, is implemented using CDL technology: base characters and diacritics are composed of curved and straight segments; these basic components may then be combined to describe precomposed Latin script entities. The difference is that where the set of Sòng stroke types is really the set of key distinctive features for CJK (due to the importance of Sòng calligraphic standards in modern orthography and standardization), for Latin letters there is not really the same emphasis on a particular style or a corresponding set of stroke types. Although we can have a set of graphical primitives for Latin, these do not necessarily correspond to the way that people actually write and index Latin letters.
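
The step from abstract skeleton to concrete glyph in point 1 can be pictured with a toy Python sketch. It is an assumption-laden illustration, not the CDL interpreter: the real interpreter fleshes skeletons out into outlines, whereas this sketch simply samples each straight segment onto a small character grid. The function name `rasterize`, the unit-square coordinates, and the two-stroke skeleton for 十 are all hypothetical.

```python
def rasterize(skeleton, size):
    """Sample each skeleton segment onto a size-by-size grid of characters."""
    grid = [[" "] * size for _ in range(size)]
    steps = 2 * size  # enough samples that no grid cell along a segment is skipped
    for points in skeleton:
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            for i in range(steps + 1):
                t = i / steps
                col = round((x0 + t * (x1 - x0)) * (size - 1))
                row = round((y0 + t * (y1 - y0)) * (size - 1))
                grid[row][col] = "#"
    return ["".join(row) for row in grid]


# One abstract description: the skeleton of 十 in a unit square (assumed coordinates).
SKELETON = [
    [(0.0, 0.5), (1.0, 0.5)],  # horizontal stroke
    [(0.5, 0.0), (0.5, 1.0)],  # vertical stroke
]

# Two concrete "glyphs" instantiated from the same description.
for size in (5, 9):
    print("\n".join(rasterize(SKELETON, size)))
    print()
```

The single skeleton plays the role of the class, and each rasterized grid is one member of it, which mirrors the class and class-member analogy in point 1.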

Finally, if you are not bothered by the jargon and prefer to think of CDL as a C(IG)DL “CJK (Ideographic Glyph) Description Language”, please feel free to do so.

Last modified: 2014-09-17

The home page of Wenlin Institute

Copyright © 2003-2014 Wenlin Institute, Inc. All Rights Reserved.
