The Han language (commonly known as Chinese, though it is only one of over fifty languages used within the borders of China) is spoken in its seven major and innumerable minor dialects by over one and a quarter billion people. It has been estimated that up to the end of the eighteenth century, more books had been published in Chinese than in all the rest of the languages of the world put together. Over 100,000 new titles in Chinese are now being published annually, mainly in the People's Republic of China, Hong Kong and Taiwan, but also by the Chinese diaspora throughout the world. These facts sufficiently attest to the importance of Chinese and the need for effective bibliographic control. This article presents a brief description of the special features of Chinese, with particular reference to its use for cataloguing purposes.

As is well known, the Chinese script is non-alphabetic, consisting of ideographs (usually called "Chinese characters"), pictographic in origin but with arbitrarily assigned phonetic values as well as symbolic significance. The symbolic function of the characters is independent of their phonetic value, in much the same way as the Arabic numerals are understood identically in every European country but pronounced differently by speakers of the various European languages. For this reason the Chinese script has been able to serve as a unifying factor in both time and space. Dialects which are often mutually unintelligible are recorded using the same script, and a poem written at the time of Beowulf appears to the eye as if written yesterday, because the script is unaffected by historical changes in the spoken language. This also explains how Chinese characters could be borrowed to write languages wholly unrelated to Chinese, such as Japanese and Korean.

The largest dictionary of Chinese so far published contains 84,000 different characters, but 80% of these are of extremely rare occurrence, and a working knowledge of four to five thousand suffices for most everyday purposes. By contrast, the phonetic system of all dialects of spoken Chinese is remarkable for its simplicity (disregarding for the moment the question of tones). Standard Chinese, also called Mandarin, has only 420 syllables. The potential for ambiguity when each of 84,000 distinct characters must be read as one of only 420 syllables is obvious. In practice the difficulty is mitigated to a great extent by the use of fixed combinations of syllables in speech and their corresponding characters in writing, though there are no unambiguous rules for word-formation and there is not even agreement as to what constitutes a "word" in Chinese.
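The scale of the ambiguity can be made concrete with a little arithmetic on the figures quoted above. The following is only a rough sketch: it assumes characters are spread evenly over the syllables, which is not true of the real language, and the figure of 4,500 for everyday use is simply the midpoint of the range given, chosen for illustration.

```python
# Figures quoted in the text above.
TOTAL_CHARACTERS = 84_000
SYLLABLES = 420

# Assumption for illustration: the midpoint of the "four to five thousand" range.
WORKING_CHARACTERS = 4_500


def average_homophones(characters: int, syllables: int = SYLLABLES) -> float:
    """Mean number of characters per syllable, assuming an even spread."""
    return characters / syllables


full = average_homophones(TOTAL_CHARACTERS)
everyday = average_homophones(WORKING_CHARACTERS)

print(f"Whole dictionary: about {full:.0f} characters per syllable")
print(f"Everyday use: about {everyday:.1f} characters per syllable")
```

Even on the smaller everyday figure, a bare syllable corresponds on average to roughly ten candidate characters, which is why a purely phonetic rendering needs context to be read unambiguously.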

The most common transcription or romanisation systems are Wade-Giles (named after its inventor Sir Thomas Wade, first Professor of Chinese at Cambridge, and his successor Dr. Herbert Giles) and Pinyin (which simply means "phonetic spelling"), the official system promulgated by the government of the People's Republic of China in 1958 and now gaining international acceptance. For bibliographic purposes purely phonetic renderings are inadequate unless there is sufficient context to remove ambiguity, and if the underlying characters are not known the correct transcription of a Chinese personal name or book title will always be in doubt.

The nature of the Chinese script causes certain difficulties for bibliographers, one of the most important of which is the very basic question of how to sort the characters in a consistent order. Since the script is not alphabetic, there is no obvious "alphabetical order". The nearest equivalent to such a method is to sort the characters by their phonetic transcriptions, but this means that each character must be linked, both physically and in the mind of the reader, to a particular syllable in the standard dialect. Many characters, however, have more than one possible reading (depending on context), and it is beyond the capacity of the human brain to memorise the reading of every character. Another serious disadvantage of phonetic sorting is that many speakers of Chinese dialects have an imperfect grasp of the standard pronunciation.
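The dependence of phonetic sorting on the choice of reading can be sketched as follows. The reading table is a tiny hand-made sample using toneless Pinyin, and the names `READINGS` and `sort_phonetically` are hypothetical; a real catalogue would draw its readings from a dictionary or authority file.

```python
# Hypothetical sample table: toneless Pinyin readings for four characters.
# Two of them have more than one reading depending on context.
READINGS = {
    "木": ["mu"],
    "魚": ["yu"],
    "行": ["xing", "hang"],
    "長": ["chang", "zhang"],
}


def sort_phonetically(chars, pick=0):
    """Sort characters by one chosen reading.

    `pick` selects which reading to file under; characters with fewer
    readings fall back to their last available one.
    """
    def key(char):
        readings = READINGS[char]
        reading = readings[min(pick, len(readings) - 1)]
        return (reading, char)  # the character itself breaks ties

    return sorted(chars, key=key)


def ambiguous(chars):
    """Return the characters whose filing position depends on the reading."""
    return [c for c in chars if len(READINGS[c]) > 1]


sample = ["魚", "行", "木", "長"]
print(sort_phonetically(sample, pick=0))  # ['長', '木', '行', '魚']
print(sort_phonetically(sample, pick=1))  # ['行', '木', '魚', '長']
print(ambiguous(sample))                  # ['行', '長']
```

The same four characters file in a different order depending on which reading is chosen, so a reader who does not know the reading the cataloguer used may look in the wrong place.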

Many attempts have been made to devise methods of arranging the characters in a logical sequence by analysing their structure. The most common method, first used in a dictionary published in 100 AD, identifies certain components (the so-called "radicals") which occur in characters of related meaning and enable those characters to be sorted in sense-groups. For example, all characters to do with fishes, fishing, etc., contain the "fish" radical 魚; all those concerning wood, trees, etc., contain the "tree" radical 木, and so on. This system produces a kind of taxonomy of meaning, the characters being arranged within their respective groups according to the number of written strokes in addition to the radical. Unfortunately there are no universally agreed rules as to how this should be done, and many variations exist.
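The radical-and-stroke arrangement described above amounts to a two-part sort key. The sketch below assumes one particular convention, the traditional numbering of 214 radicals (in which "tree" is no. 75 and "fish" is no. 195), and uses a small hand-made table; the names `ENTRIES` and `radical_stroke_key` are hypothetical. Other dictionaries assign radicals and count strokes differently.

```python
# Hypothetical sample table: character -> (radical number, additional strokes).
# Radical numbers follow the traditional 214-radical list (an assumption;
# as the text notes, other conventions exist).
ENTRIES = {
    "木": (75, 0),   # the "tree" radical itself
    "林": (75, 4),   # tree + 4 further strokes
    "森": (75, 8),   # tree + 8 further strokes
    "魚": (195, 0),  # the "fish" radical itself
    "鮮": (195, 6),  # fish + 6 further strokes
    "鯉": (195, 7),  # fish + 7 further strokes
}


def radical_stroke_key(char):
    """Sort key: radical first, then strokes beyond the radical, then code point."""
    radical, extra_strokes = ENTRIES[char]
    return (radical, extra_strokes, ord(char))


def sort_by_radical(chars):
    """Arrange characters in radical groups, ordered within each group by stroke count."""
    return sorted(chars, key=radical_stroke_key)


shuffled = ["鯉", "森", "魚", "林", "鮮", "木"]
print(sort_by_radical(shuffled))  # ['木', '林', '森', '魚', '鮮', '鯉']
```

The final element of the key (the code point) is only a tie-breaker chosen for this sketch; printed dictionaries use their own conventions for ordering characters that share both radical and stroke count.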

Another method assigns numbers to the various different writing strokes used to make up the characters (0 for a dot, 1 for a horizontal stroke, etc.), and uses the occurrences of these strokes at various positions in the characters, according to more or less complex rules, to generate a unique numerical value for each character. Literally hundreds of systems of this kind have been devised, especially since the advent of computerisation created a demand for ways of directly inputting characters by sequences of keystrokes. Some of them are very quick and efficient when used by trained operators, but they require much time and practice to learn thoroughly.
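A toy version of such a stroke-coding scheme is sketched below. Only the digits 0 (dot) and 1 (horizontal) come from the description above; the other digit assignments, the stroke sequences table and the function names are hypothetical, and the rule used here (simply concatenating the digits of every stroke in writing order) is far cruder than any scheme in real use.

```python
from collections import defaultdict

# Digits 0 and 1 follow the text; 2-4 are assumptions made for this sketch.
STROKE_DIGITS = {
    "dot": "0",
    "horizontal": "1",
    "vertical": "2",
    "left_falling": "3",
    "right_falling": "4",
}

# Hand-made sample: strokes of a few simple characters in writing order.
STROKES = {
    "二": ["horizontal", "horizontal"],
    "十": ["horizontal", "vertical"],
    "大": ["horizontal", "left_falling", "right_falling"],
    "丈": ["horizontal", "left_falling", "right_falling"],
    "木": ["horizontal", "vertical", "left_falling", "right_falling"],
}


def stroke_code(char):
    """Concatenate the digit for each stroke, in writing order."""
    return "".join(STROKE_DIGITS[s] for s in STROKES[char])


def group_by_code(chars):
    """Map each code to the characters that produce it."""
    groups = defaultdict(list)
    for char in chars:
        groups[stroke_code(char)].append(char)
    return dict(groups)


print(stroke_code("木"))            # 1234
print(group_by_code(["大", "丈"]))  # {'134': ['大', '丈']}
```

In this toy scheme 大 and 丈 receive the same code, which shows that a simple rule does not by itself yield a unique value for every character; taking account of where the strokes occur, as the schemes described above do, is one way of separating such cases.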

Until recently the complexity of the Chinese characters was a serious obstacle to their use in automated bibliographic systems, but the situation has been transformed by the availability of cheap and reliable Chinese software packages and multi-script web technology. Large quantities of full Chinese MARC records now exist, notably in the China National Bibliography Retrospective Database on CD-ROM, which contains over one million records and is steadily growing.

Cambridge University Library is one of the consortium of libraries participating in the RSLP UK Database of Chinese Research Materials project which began in November 1999. As part of the project many thousands of the existing romanised-only Chinese records on the Cambridge OPAC have been upgraded to full Chinese MARC records, and the process will continue as more new records become available from China.

By Charles Aylmer, Former Head of Chinese Department, Cambridge University Library

Reprinted from: Cambridge University Libraries Information Bulletin (ISSN 0307-7284), New Series No. 47, Michaelmas Term 2000