• Wénlín Institute’s CDL, the Character Description Language, is the world’s most powerful new font technology.
• CDL is the engine (C source code) behind CJK Unicode megafonts, breaking the 64K glyph barrier! (A CDL font can contain an unlimited number of glyphs.)
• CDL is an XML application, a standards-based font and encoding technology designed for precise and compact description, rendering, and indexing of all 漢/汉 Han (Chinese, Japanese, Korean, and Vietnamese = CJKV) characters, encoded and unencoded.
• CDL is a font database with descriptions for nearly 100,000 characters, including complete Unicode 7.0 CJK character support, and more!
• CDL means consistent stroke/component analyses, built-in indexing and variant mappings, and high-quality graphic images as outlines convertible to SVG, PostScript, MetaFont, and more.
• CDL is a compressed binary with an incredibly small memory footprint (1.4 MB: 1,443,206 bytes), suitable for use in limited-memory mobile devices that want full Unicode CJK support.
• CDL technology has applications for machine learning, for handwriting recognition and input methods, for optical character recognition (OCR), and most importantly for human language-learning.
• The basic elements of CDL are a flexible two-dimensional coordinate space, and a set of basic stroke types. Using these simple elements, CDL provides a framework for describing characters and components, and for (recursive) reuse of character and component descriptions in the descriptions of other characters and components.
• CDL adds new dimensions to the UCS code space, with a variant mechanism for associating an unlimited number of CDL descriptions with any Unicode codepoint.
• Each CDL description can be associated with zero or more Unicode code points, making CDL the ideal tool for extending The Unicode Standard.
• CDL has applications beyond CJK, for organizing information underlying the rendering of any complex script.
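As an illustration of the recursive reuse and variant association described in the list above, here is a minimal Python sketch. It is not CDL's actual XML syntax, nor the Wenlin C engine. Every name in it (Stroke, Component, Description, flatten, descriptions_for, the stroke-type labels, the "木#1" variant key, and all coordinates) is a hypothetical stand-in chosen for this example. It assumes a unit square as the coordinate space, describes 木 with four strokes, and builds 林 and 森 by placing existing descriptions into sub-boxes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


@dataclass
class Stroke:
    kind: str            # hypothetical stroke-type label
    points: List[Point]  # control points in the description's own unit square


@dataclass
class Component:
    name: str  # key of another description to reuse
    box: Box   # placement inside the parent's unit square


@dataclass
class Description:
    elements: List[Union[Stroke, Component]]
    codepoints: Tuple[int, ...] = ()  # zero or more Unicode code points


def flatten(name: str, db: Dict[str, Description],
            box: Box = (0.0, 0.0, 1.0, 1.0)) -> List[Stroke]:
    """Recursively expand a description into strokes scaled into `box`."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    strokes: List[Stroke] = []
    for el in db[name].elements:
        if isinstance(el, Stroke):
            pts = [(x0 + px * w, y0 + py * h) for px, py in el.points]
            strokes.append(Stroke(el.kind, pts))
        else:
            # A component: map its box into the parent's box, then recurse.
            cx0, cy0, cx1, cy1 = el.box
            child = (x0 + cx0 * w, y0 + cy0 * h, x0 + cx1 * w, y0 + cy1 * h)
            strokes.extend(flatten(el.name, db, child))
    return strokes


def descriptions_for(codepoint: int, db: Dict[str, Description]) -> List[str]:
    """List every description key associated with a code point."""
    return [key for key, d in db.items() if codepoint in d.codepoints]


# Three strokes shared by both illustrative forms of 木 (coordinates invented).
TREE = [
    Stroke("h", [(0.1, 0.35), (0.9, 0.35)]),
    Stroke("s", [(0.5, 0.05), (0.5, 0.95)]),
    Stroke("p", [(0.5, 0.35), (0.1, 0.85)]),
]
db = {
    "木": Description(TREE + [Stroke("n", [(0.5, 0.35), (0.9, 0.85)])],
                      (ord("木"),)),
    # Hypothetical variant for the left-hand position: last stroke is a dot.
    "木#1": Description(TREE + [Stroke("d", [(0.6, 0.45), (0.8, 0.6)])],
                        (ord("木"),)),
    "林": Description([Component("木#1", (0.0, 0.0, 0.5, 1.0)),
                       Component("木", (0.5, 0.0, 1.0, 1.0))], (ord("林"),)),
    # 森 reuses 林, which in turn reuses both forms of 木.
    "森": Description([Component("木", (0.25, 0.0, 0.75, 0.5)),
                       Component("林", (0.0, 0.5, 1.0, 1.0))], (ord("森"),)),
}

print(len(flatten("森", db)))           # total strokes after recursive expansion
print(descriptions_for(ord("木"), db))  # descriptions mapped to one code point
```

Expanding 森 in this sketch yields twelve strokes, four from each 木, without any stroke being described more than once. Two descriptions share the code point of 木, and a description left with an empty `codepoints` tuple would stand for an unencoded character or component.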
On “CJK Unified Ideograph”: an apology
In Unicode/ISO parlance, certain blocks of 漢 Hàn characters are called “CJK Unified Ideographs”. CJK (a trademark of the RLG) stands for “Chinese, Japanese, and Korean”, and is sometimes extended to CJKV “Chinese, Japanese, Korean and Vietnamese” (and it could be extended further, to include all IRG contributors). Scripts in all of these locales make use of CJKV (Chinese-derived) characters. These characters are “Chinese-derived” in that the principles for character creation originated in China (more than 3,000 years ago). These characters are sometimes also termed 漢 (“Hàn” as in the name of Unicode’s Hàn database [a.k.a. UniHan]), reflecting the legacy of the influential 東漢 Dōng Hàn ‘Eastern Hàn’ Dynasty script analyses (《說文解字》, c. 121 AD). These characters are “Unified” in that (many though not all) locale-specific differences in character forms (stylistic conventions, typeface expectations) have been ignored (as non-distinctive) for encoding purposes. Of course, there are characters in all locales which are unique to those locales, and so unification also involves superset definition.
Like Hàn, the term ideograph (sometimes also [mis-]written “ideogram”) is today used in information-technology (info-tech) circles to signify ‘the uniquely CJKV script entity’, which is to say, “CJKV ideographs” constitute a certain subset of the “characters” to be found in Asian texts. Japanese Kana (elements of Hiragana and Katakana syllabaries) are also “characters” in Japanese, but are not termed “ideographs” (though they derive from Chinese-derived “ideographs”). The English term “Han” has some advantages over “ideograph”, but reliance on a specific pronunciation of the character “漢” (which varies by locale) presents its own challenges to general acceptance as a cover term. Why English “Han” and not “Kan”? Let’s just chalk this up to emphasis on the Chinese-derived principles governing character formation, and deference to Modern Standard Chinese (北京話 Běijīnghuà = 官話 Guānhuà = 普通話 Pǔtōnghuà ‘Mandarin’) pronunciation of the character 漢 Hàn (as in 東漢).
Though perhaps more politically acceptable than Han as a cover term for CJK characters, the term ideograph is nevertheless something of a misnomer, if “ideograph” primarily means ‘completely pure idea writing, not conveying the sounds of words, but conveying only ideas independent of specific verbal communication’. As originally applied to Chinese writing (by early missionaries), such usage may reflect misunderstanding, a sense that the writing did not convey sound values at all. Of course, Chinese characters are used to write spoken language, not simply wordless ideas or ideas unconnected with specific spoken language. However, if early sinologists were not confused about the meaning of “ideogram”, then they were guilty of hyperbole, and specifically chose to extend the meaning to include also relatively imprecise writing of specific speech sounds. Perhaps the impression was that the degree of phonological imprecision was such as to present almost a total lack of phonological information. Certainly, a first glance at phonological variation (diachronic and synchronic) in character readings within and across CJKV locales can only aggravate the opinion that if the writing is not completely aphonographic, then it is either extremely complex or completely and utterly chaotic. And lacking readily apparent systematic phonological information, what remains but semantics or the basic idea to be conveyed? Indeed, for casual purposes the ideas conveyed by CJK characters are often intelligible across locales, though the spoken languages are not themselves mutually intelligible. Certain pronunciation details may be irrecoverable from even a close phonetic transcription, but CJK writing is an especially lossy means of phonological information storage.
Although info-tech usage of the term ideograph may be an unfortunate neologism derivative of an original misnomer, it seems to have arisen as an informed compromise in a specific usage context, in clear lack of a more precise preferable pre-existing English word. Terms such as logograph (‘one-to-one relation between sign and “word” [itself not terribly well-defined]’; not all CJK syllables are words, and CJK words are not always monosyllabic) or morphosyllabograph (a specialist Greco- mouthful!) might have been similarly imprecise and clumsy, and might have been preferable for not perpetuating a misconception of “pure disembodied idea writing”, but might promote some other misconceptions.
So, the term ideograph in modern info-tech usage might best be understood (or rationalized ex post facto, along with early sinological usage) as indicative of a difference of degree: that the phonological information conveyed is somewhat limited relative to more fully phonographic scripts (those using alphabets and isographic syllabaries to convey specific sound values).
“A syllabary being a system for writing the elements of the syllable canon of a language, the syllabograph would be a graphical element of a syllabary. When there is a one-to-one correspondence between syllable type and syllabograph, this is an isographic syllabary. In that it sometimes has multiple representations of a given syllable type, the Chinese writing system might be termed an imperfect or heterographic syllabary. Chinese characters, the elements of a heterographic syllabary, might be termed heterographic syllabographs, or heterosyllabographs. No matter what they are called, there is clearly some degree of imprecision in the Chinese script, in terms of its ability to convey specific sound values.” [Cook, 2003:195]
Many Asian languages (and CJK languages in particular) are termed monosyllabic. Of Chinese languages (or dialects) in particular, this means that the syllable (phoneme cluster with tone-bearing vocalic nucleus) bears much functional weight, and is traditionally extremely well-defined, both phonologically (phonemically) and morphologically (morphemically), presumably in natural speech as in orthography and lexicographic descriptions. Most syllables are associated with (one or more) distinct units of meaning (morpheme + syllable = morphosyllable) apparent in and productively used in formation of polysyllabic words. If a syllable has multiple clearly distinct meanings, each of these would “ideally” be written with a distinct character, though actual orthographic practices reflect multiple layers of subjective and irregular development. The syllable has status similar to that of the phoneme in other languages, as evident in meaning-driven “transcription” of foreign words (nativizing morpho-analytic re-syllabification). Characters have been associated with single syllables for a long time, presumably from the beginning of Chinese writing, and this implies that the language before writing was also monosyllabic. It might, however, be more accurately termed “sesquisyllabic”, since historical studies suggest that syllable boundaries are rather fluid, and that prominent syllabic nuclei may assimilate adjacent relatively unstressed (or destressed) elements over time. Though the earliest specific evidence comes from no earlier than the earliest 反切 fǎnqiè ‘sound glosses’ (perhaps in the 7th c. AD), the traditional opinion is that the character-to-syllable connection goes back to the earliest writing (this is reflected in traditional monosyllabic reconstructions of Old Chinese phonology).
Syllables have long had reality for native speakers/writers as functional nuggets of meaning+sound, as is evident e.g. in ancient Chinese character lexicons organized into (homophonic) syllable classes, with both sound and meaning glosses. Of the six major traditional Chinese character types (六書:象形,指事,會意,形聲,轉注,假借), by far the most common is the so-called 形聲 xíngshēng ‘sematophonic’ compound, combining one semantic determiner with one phonographic component. Thus, homophonic (heteromorphic, heteronymous) characters may be written with the same phonographic component, but are semantically differentiated by means of a (non-shared) semantic determiner. But the phonographic component does not always give a very consistent indication of the pronunciation, due to local variation and historical changes. For example, the character 皆 (MC /kei/, c. 1000 AD) is today commonly pronounced jiē, but when it is used as a component the resulting compound is not always pronounced jiē (e.g. 階 jiē, 諧 xié, 偕 xié, 揩 kāi, 楷 kǎi). Such variation is sometimes regular, but at other times it seems unpredictable in the light of available historical evidence. The phonographic component is sometimes said also to have a (non-phonographic) semantic function, and such characters (simultaneously 形聲 xíngshēng ‘sematophonic’ and “會意” huìyì ‘semantic complexes’) are termed “亦聲” yìshēng [lit.] ‘also phonetic’. Thus, even in phonographic writings the phonographic component is sometimes not entirely devoid of (non-phonologic) semantic value, and serves in combination with the determiner to specify the meaning of the syllable. Morphosyllabographs are fraught with meaning, but often only a small part of that meaning is clearly phonological, and sometimes none of it is at all.
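The 皆 series above can be tabulated in a few lines. The following Python sketch uses only the readings cited in this paragraph. The helper names (strip_tone, group_by_reading) are hypothetical and introduced here for illustration, and the tone-stripping step assumes Pinyin without ü, since it removes every combining mark.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List

# Modern readings as cited in the paragraph above.
PHONETIC = ("皆", "jiē")
COMPOUNDS = {"階": "jiē", "諧": "xié", "偕": "xié", "揩": "kāi", "楷": "kǎi"}


def strip_tone(pinyin: str) -> str:
    """Remove tone diacritics by dropping combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_by_reading(readings: Dict[str, str],
                     ignore_tone: bool = False) -> Dict[str, List[str]]:
    """Group characters by reading, optionally ignoring tone."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for char, reading in readings.items():
        key = strip_tone(reading) if ignore_tone else reading
        groups[key].append(char)
    return dict(groups)


# Compounds whose reading matches that of the bare component 皆.
same_as_phonetic = [c for c, r in COMPOUNDS.items() if r == PHONETIC[1]]

print(same_as_phonetic)
print(group_by_reading(COMPOUNDS, ignore_tone=True))
```

Of the five compounds cited, only 階 shares the reading of 皆 itself. Ignoring tone, the series still falls into three distinct syllables, and the syllable xié is written with two different characters, a small instance of the heterographic behaviour discussed earlier.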
At any rate, even if the issues surrounding Chinese character typologies are complex, and usages of the term ideograph are imprecise, contradictory, confused and/or confusing, the term ideograph is most certainly not used in info-tech to indicate “pure idea writing” (bypassing graphic representation of speech), nor is it used to indicate any of the traditional classes of Chinese-derived characters such as 象形 xiàngxíng ‘pictographic’ characters, nor the very small traditional set of so-called 指事 zhǐshì ‘indicative of the deed’ characters. All Chinese-derived characters indicate syllabic speech units, and though the spoken languages of their speakers may not be mutually intelligible, the writings sometimes are.
A word on the character of the word “character”
In naming CDL, we use “character” with (one of) its common English meaning(s), intentionally avoiding the uncommonly understood (or commonly misunderstood) information-technology terms “ideograph” and “glyph”. (No, info-tech glyph does not mean we’re suddenly talking about Mayan here, as Pulleyblank once complained in the early 1990s.) Arguably, there are some other good reasons not to call CDL a “CJK Ideographic Glyph Description Language”.
Finally, if you are not bothered by the jargon and prefer to think of CDL as a C(IG)DL “CJK (Ideographic Glyph) Description Language”, please feel free to do so.
Last modified: 2014-09-17
Copyright © 2003-2014 Wenlin Institute, Inc. All Rights Reserved.