SOAS University of London

Tibetan Studies at SOAS

Tibetan in Digital Communication - Workflow

Text Normalization

In some cases Unicode offers two ways to encode the same Tibetan character. To simplify the statistical models, we normalize to the more common encoding. Before uploading new texts, check that the following changes have been made:

  • "༎" changed to "།།"
  • "༌" (typical after ང and before śad) changed to "་"
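The two replacements above can be sketched in Python as follows. This is a minimal illustration, not the project's own code: the function name `normalize` and the constant names are assumptions made for this example. Only the two substitutions listed above are applied.

```python
# Unicode code points for the characters named in the list above.
NYIS_SHAD = "\u0f0e"    # ༎ TIBETAN MARK NYIS SHAD
SHAD = "\u0f0d"         # ། TIBETAN MARK SHAD
TSHEG_BSTAR = "\u0f0c"  # ༌ TIBETAN MARK DELIMITER TSHEG BSTAR
TSHEG = "\u0f0b"        # ་ TIBETAN MARK INTERSYLLABIC TSHEG

# Translation table: each key is a single character, each value its replacement.
_NORMALIZATION_TABLE = str.maketrans({
    NYIS_SHAD: SHAD * 2,   # "༎" changed to "།།"
    TSHEG_BSTAR: TSHEG,    # "༌" changed to "་"
})


def normalize(text: str) -> str:
    """Return text with the two replacements applied (hypothetical helper)."""
    return text.translate(_NORMALIZATION_TABLE)
```

Running a text through such a function before upload would leave all other characters untouched, so it should be safe to apply repeatedly.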

Word Segmentation

One approach recasts word segmentation as a problem of syllable classification. Each syllable receives one of eight tags. The lone syllable of a single-syllable word is tagged "S". In multisyllabic words, the initial syllable is tagged "X" and the final syllable is tagged "E". Any syllables in between are tagged "Y", "Z", or "M", in that order, with "M" repeated as often as needed: X-Y-E, X-Y-Z-E, X-Y-Z-M-E, X-Y-Z-M-M-E, and so on.

Two additional tags are used for Tibetan's complex syllables, in which a single syllable contains more than one word. The tag "SS" is used for two single-syllable words that are joined together, such as འདིའི་. The tag "ES" is used for a word-final syllable that is joined together with the following single-syllable word, such as the པོས་ in ཆེན་པོས་. For the purposes of word segmentation, these complex syllables must be split into separate words: འདིའི་|SS becomes འདི + འི་, and ཆེན་|X པོས་|ES becomes ཆེན་པོ + ས་.
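The tagging scheme can be illustrated with a short Python sketch that maps words to tag sequences and turns a tagged syllable sequence back into words. It is a sketch under stated assumptions, not the project's segmenter: the function names are invented for this example, a two-syllable word is assumed to be tagged X-E (which follows from the initial/final rule), and the list of endings used to split complex syllables is an illustrative assumption, since the rule for locating the split point is not given here.

```python
TSHEG = "\u0f0b"  # ་ intersyllabic tsheg

# ASSUMPTION: endings that may be split off a complex syllable.
# This list is illustrative only; longer endings are tried first.
CLITIC_ENDINGS = ("འི", "འོ", "འམ", "འང", "ས", "ར")


def tags_for_word(n_syllables):
    """Tag sequence for a simple (non-complex) word of n_syllables syllables."""
    if n_syllables < 1:
        raise ValueError("a word needs at least one syllable")
    if n_syllables == 1:
        return ["S"]
    middle = ["Y", "Z"][: n_syllables - 2] + ["M"] * max(0, n_syllables - 4)
    return ["X"] + middle + ["E"]


def split_complex(syllable):
    """Split a complex syllable into (stem, clitic), keeping the tsheg on the clitic."""
    body = syllable.rstrip(TSHEG)
    tail = syllable[len(body):]
    for ending in CLITIC_ENDINGS:
        if body.endswith(ending) and len(body) > len(ending):
            return body[: -len(ending)], ending + tail
    raise ValueError(f"no known ending to split off in {syllable!r}")


def segment(tagged_syllables):
    """Turn (syllable, tag) pairs into a list of words."""
    words, current = [], ""
    for syllable, tag in tagged_syllables:
        if tag == "S":
            words.append(syllable)
        elif tag == "X":
            current = syllable
        elif tag in ("Y", "Z", "M"):
            current += syllable
        elif tag == "E":
            words.append(current + syllable)
            current = ""
        elif tag == "SS":
            words.extend(split_complex(syllable))
        elif tag == "ES":
            stem, clitic = split_complex(syllable)
            words.extend([current + stem, clitic])
            current = ""
        else:
            raise ValueError(f"unknown tag {tag!r}")
    return words
```

With the two examples from the text, `segment([("འདིའི་", "SS")])` yields འདི and འི་, and `segment([("ཆེན་", "X"), ("པོས་", "ES")])` yields ཆེན་པོ and ས་.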

Tag Set

The tag set used in the project is described in E. Garrett et al. (2015), 'The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries', Revue d'Etudes Tibétaines, 32, pp. 51-86.

The Classical corpus is the authoritative reference for our tag set, so it should be the first point of reference. As noted above, the Saint Petersburg corpus comes via Pavel Grokhovsky, who used a different tag set. We have been converting his tag set to our system, but this work is not yet complete. Finally, the Berlin corpus is only partially tagged, and so includes a great many words with dummy tags (for example, xxx).

Pre Tagger

The aim of pre-tagging is to present the human tagger with a reduced set of choices when tagging a text. The pre-tagger can leave difficult tagging decisions to the human, but it should make every effort not to eliminate possible tags. Human taggers download the pre-tagged output and then upload their corrections back to the system.

In this interface, pages that have not yet been hand-tagged are pre-tagged by applying the current best segmenter followed by the rule tagger. Pages that have already been hand-tagged are also pre-tagged, so that the performance of the segmenter and rule tagger can easily be assessed alongside the correct tagging.
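The idea of narrowing choices without eliminating possible tags can be sketched as follows. Everything in this example is hypothetical: the lexicon, the tag names ("adj", "noun", "case"), the rule, and the function names are placeholders chosen for illustration and are not the project's tag set, rules, or code. The sketch assumes each word starts with a set of candidate tags and that rules may only narrow that set.

```python
from typing import Callable, Dict, List, Set

# A rule looks at position i and returns the tags it allows there (hypothetical format).
Rule = Callable[[List[str], int, List[Set[str]]], Set[str]]


def pre_tag(words: List[str], lexicon: Dict[str, Set[str]],
            rules: List[Rule], all_tags: Set[str]) -> List[Set[str]]:
    """Return one set of candidate tags per word, narrowed by the rules."""
    # Unknown words keep every tag, so the human tagger decides.
    candidates = [set(lexicon.get(word, all_tags)) for word in words]
    for rule in rules:
        for i in range(len(words)):
            kept = rule(words, i, candidates) & candidates[i]
            # Safety net: a rule that would remove every option is ignored.
            if kept:
                candidates[i] = kept
    return candidates


# Placeholder data, for illustration only.
ALL_TAGS = {"adj", "noun", "case"}
LEXICON = {"ཆེན་པོ": {"adj"}, "ས་": {"case", "noun"}}


def no_noun_after_adjective(words, i, candidates):
    """Illustrative rule: drop 'noun' directly after an unambiguous adjective."""
    if i > 0 and candidates[i - 1] == {"adj"}:
        return candidates[i] - {"noun"}
    return candidates[i]
```

The design choice worth noting is the safety net: because an over-eager rule is skipped rather than allowed to empty a candidate set, the human tagger is always left with at least one option.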

Tag Suggestions

After hand-tagged texts are fed back into the system, they are checked using the tag suggestions mechanism. For each corpus, the system generates a list of tag suggestions, which draw attention to those cases where the rule tagger gives a different answer from the human tagger.

The purpose of this step is twofold. On the one hand, the machine's correct suggestions bring attention to mistakes or inconsistencies in the human tagging. On the other hand, the machine's incorrect suggestions point us to rules that need to be revised or removed.
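The comparison behind tag suggestions can be sketched as a simple disagreement report. This is an assumption-laden illustration rather than the system's implementation: the `Suggestion` record, the function name, and the idea that the rule tagger returns a set of permitted tags per word are all hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Suggestion(NamedTuple):
    """One disagreement between the human tag and the rule tagger (hypothetical record)."""
    position: int
    word: str
    human_tag: str
    rule_tags: FrozenSet[str]


def tag_suggestions(words: List[str], human_tags: List[str],
                    rule_tags: List[Set[str]]) -> List[Suggestion]:
    """List every word whose human tag is not among the rule tagger's tags."""
    if not (len(words) == len(human_tags) == len(rule_tags)):
        raise ValueError("inputs must be aligned word by word")
    suggestions = []
    for i, (word, human, allowed) in enumerate(zip(words, human_tags, rule_tags)):
        if human not in allowed:
            suggestions.append(Suggestion(i, word, human, frozenset(allowed)))
    return suggestions
```

Each item in the resulting list then needs a human decision: either the hand tagging is corrected, or the rule that produced the disagreement is revised or removed.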