A question about the frequency of pinyin letters in Chinese was raised in May 2005. Whereas frequency tables for characters are available on-line, and much has been written about them, the frequency of the pinyin letters of a romanised text does not seem to have been investigated. In this paper, we attempt to analyse a text taken from a webpage in Chinese.
This study differs from the initial request on sci.lang in that we do not study letter frequency as such; instead we examine the Chinese syllables, composed of initials, medials and rimes, together with their tones. It was found that the distribution of some letters is influenced by certain function words used in Chinese.
For any useful statistical analysis, a text has to be fairly long so that it provides a representative sample of Chinese. We selected a 73 kilobyte text from this webpage: http://www.bignews.org/20050524.txt .
The characters on the webpage are in the simplified GB Chinese character set. The page was converted to traditional Big5 characters before any attempt to analyse the text was made. The machine conversion of GB to Big5 may be lossy: where multiple traditional characters map to a single simplified character, inappropriate characters may sometimes have been substituted automatically.
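The one-to-many problem can be illustrated with a small sketch. The table below holds only a few well-known cases and is not the table used by the converter in this study; the name ambiguous_characters is hypothetical.

```python
# A few well-known simplified-to-traditional mappings, for illustration only.
# A real converter needs a complete table; this is NOT the one used in the study.
CANDIDATES = {
    "发": ["發", "髮"],   # "to emit" / "hair"
    "后": ["後", "后"],   # "after" / "empress"
    "国": ["國"],         # unambiguous
}

def ambiguous_characters(text):
    """Return the simplified characters in text that have more than one
    traditional candidate, i.e. where automatic conversion may go wrong."""
    return sorted({ch for ch in text if len(CANDIDATES.get(ch, [])) > 1})

flagged = ambiguous_characters("中国头发以后")
```

Characters flagged in this way are the ones where a context-free converter has to guess, which is the source of the inappropriate substitutions mentioned above.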
Ignoring punctuation and non-Chinese words, there were 30328 Chinese characters (and therefore syllables) found. They were then run through an annotator to produce a list of pinyin syllables, and 837 unique Chinese syllables were found.
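A minimal sketch of the counting step follows. It assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF); the ranges actually counted in this study are not recorded, and the function name is hypothetical.

```python
import re
from collections import Counter

# Assumption: only the basic CJK Unified Ideographs block counts as Chinese.
# Punctuation, digits and Latin letters fall outside it and are skipped.
HAN = re.compile(r"[\u4e00-\u9fff]")

def count_characters(text):
    """Return a Counter of the Chinese characters in text."""
    return Counter(HAN.findall(text))

sample = "我的书在桌子上，不是他的。OK?"
counts = count_characters(sample)
total = sum(counts.values())   # running characters (one syllable each)
unique = len(counts)           # distinct characters
```

Here total corresponds to the figure of 30328 running characters, and unique to the number of distinct characters reported in the results.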
The study is limited by two important points
We have separated the apical vowels written i in zi, ci, si from those of zhi, chi, shi, and both from all other syllables with a single i vowel (e.g. pi, bi, mi etc). For the set z, c, s and x the vowel is represented by a dot (.), and for the set zh, ch, sh and r it is represented by a colon (:). The letter v represents the u umlaut ü, which is found in the syllable nü (woman). This sound also occurs elsewhere, but we have not altered the syllables to account for "üan".
We now sort the statistics for the number of times the initials, medials, rimes, tones and medial+rime combinations appear, and for how many characters there are and their frequency of use.
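The recoding just described can be sketched as follows. The input is assumed to be a toneless pinyin syllable; the function name recode is hypothetical, and the dot set includes x because the description above lists it.

```python
# Assumption: input is a toneless pinyin syllable such as "zhi" or "nü".
COLON_INITIALS = ("zh", "ch", "sh", "r")   # vowel rewritten as ":"
DOT_INITIALS = ("z", "c", "s", "x")        # vowel rewritten as "."

def recode(syllable):
    """Recode a syllable with the '.', ':' and 'v' symbols used in this study."""
    syllable = syllable.replace("ü", "v")
    # Only a bare initial + i is recoded; "xian", "zhao" etc. are left alone.
    # The exact comparison also keeps "zhi" from being treated as z + "hi".
    for initial in COLON_INITIALS:
        if syllable == initial + "i":
            return initial + ":"
    for initial in DOT_INITIALS:
        if syllable == initial + "i":
            return initial + "."
    return syllable
```

For example, recode("shi") gives "sh:", recode("si") gives "s.", and recode("pi") is returned unchanged.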
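One way such a tally could be produced is sketched below. It assumes syllables written with a trailing tone digit (e.g. guo2), treats y and w as initials as in the discussion of the results, and uses a simple rule for medials; the segmentation actually used for this study may differ in detail, and the function names are hypothetical.

```python
from collections import Counter

# Two-letter initials come first so that "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
VOWELS = "aeiouv"

def split_syllable(syllable):
    """Split e.g. 'guo2' into (initial, medial, rime, tone)."""
    body, tone = syllable[:-1], syllable[-1]
    initial = next((i for i in INITIALS if body.startswith(i)), "")
    rest = body[len(initial):]
    medial = ""
    # Assumed rule: i, u or v is a medial only when another vowel follows.
    if len(rest) > 1 and rest[0] in "iuv" and rest[1] in VOWELS:
        medial, rest = rest[0], rest[1:]
    return initial, medial, rest, tone

def tally(syllables):
    """Count initials, medials, rimes and tones over a list of syllables."""
    counters = {name: Counter() for name in ("initial", "medial", "rime", "tone")}
    for s in syllables:
        for name, part in zip(counters, split_syllable(s)):
            counters[name][part] += 1
    return counters

stats = tally(["de5", "zai4", "shi4", "guo2", "de5"])
ranked_initials = stats["initial"].most_common()   # sorted by frequency
```

An empty string in a counter stands for a syllable with no initial or no medial.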
Discussion of Results
The running text of just over 30,000 characters yielded fewer than 1900 unique characters, and 838 unique syllables (with tone-differentiated syllables counted separately). Of the initials, a relatively high proportion of syllables (11.09%) began with d. This is anomalous, and arises because the Chinese particle 的, which begins with d, accounted for 4.26% of all characters in the running text. This is also reflected in the rime /e/, which accounts for 11.48% of all rimes.
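The size of this effect can be estimated from the two percentages alone. The sketch below assumes that every occurrence of 的 was annotated with a reading beginning with d, so that its whole share is contained in the d figure.

```python
# Figures quoted in the text, as percentages of all syllables.
share_d_initial = 11.09
share_de_particle = 4.26

# Assumption: every 的 received a d- reading, so its share is part of the d share.
d_without_particle = share_d_initial - share_de_particle        # percentage points
particle_fraction_of_d = share_de_particle / share_d_initial    # fraction of d- syllables

print(f"d without the particle: {d_without_particle:.2f} points")
print(f"share of d- syllables due to the particle: {particle_fraction_of_d:.0%}")
```

On these figures the particle alone supplies well over a third of all syllables beginning with d.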
Among the first ten of the most frequently used characters in this study,
的 (possessive) 在 (at) 是 (copula) 不 (negative) 和 (peace, and)
are used as function words (prepositions, negation particle, etc) which feature greatly in Chinese. The first ten characters account for nearly 15% of the text.
The letter i in pinyin is split into three vowels, i, . and :, which together appear in approximately 19% of syllables as non-medials. When i is used as a medial, it forms around 10% of syllables (not including y-initialed characters, which form approximately 8.5% of all readings in the text surveyed).
The letter u represents both a medial and an open vowel. As a medial, the words beginning with w in pinyin must also be taken into account: we find that w accounts for 3.62% of syllables, whilst the medial u is found in 9.85% of syllables. As an open vowel, u is found in just over 9% of syllables.
The letter v, as defined above, appears in just 12 of the 30000+ syllables of text. However, the phonetic value is not reflected in pinyin, as the sound also occurs in üan and üe.
The phonetic value [u] is also bound to the phonemic transcription <ong> which is pronounced [uŋ].
With regard to the tones, tones 1 and 2 are the Yin Ping and Yang Ping (or upper and lower 'level') tones respectively. Tone 3 is the Shang (or rising) tone, whilst tone 4 is the Qu (or departing) tone. Tone 5 is the unstressed tone, used in sandhied syllables. Together the Ping tones account for over 46% of syllables, whilst the Shang tone has around 17% of syllables and the Qu category makes up 32% of the syllable tones. The smallest category, at around 5%, is the fifth, unstressed tone. As mentioned above, the annotator program used to convert the characters into romanisation was not selective in its choice of syllables, so the figures here may not reflect an accurate pinyin transcription. One could interpret the 5% of the 30,000 syllables that are unstressed as being due to the formation of syllable pairs in which one member is sandhied. In view of the preceding reason, it would be erroneous to assume that this is the case.
Looking at this sample of text, we find only around 840 syllables, reflecting the limited phonological inventory of Mandarin Chinese. The paucity of sounds, combined with the frequency of function words such as de 的, contributes to an unusually high occurrence of the initial d.
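The grouping of tone numbers into these categories can be sketched as below, assuming syllables that end in a tone digit from 1 to 5. The sample list is invented for illustration and is not taken from the text analysed.

```python
from collections import Counter

# Tone digit to traditional category, as described above.
CATEGORY = {"1": "Ping", "2": "Ping", "3": "Shang", "4": "Qu", "5": "unstressed"}

def tone_shares(syllables):
    """Return the percentage of syllables falling in each tone category."""
    counts = Counter(CATEGORY[s[-1]] for s in syllables)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

# Invented sample, for illustration only.
shares = tone_shares(["de5", "zai4", "shi4", "guo2", "zhong1", "wo3", "bu4", "he2"])
```

Because the percentages depend entirely on the readings chosen by the annotator, a tally of this kind inherits the limitation described above.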
© Dylan Sung 2005 This page was created on Thursday 26th May 2005