SOAS University of London

Tibetan Studies at SOAS

Tibetan in Digital Communication - POS Taggers

Lex Tagger

The lex tagger takes a segmented text as input, and assigns to each word every possible part of speech tag it can have. Lex tagging outputs are used as inputs to the regex and cg taggers. They can also be used as baselines against which to compare the performance of other taggers.
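The lookup step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon, the tag names, the `UNKNOWN` fallback and the `word/tag|tag` horizontal layout are all hypothetical and are not taken from the project's own lexicon or file formats.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: each word maps to every tag it may carry.
TOY_LEXICON: Dict[str, List[str]] = {
    "ka": ["N", "V"],
    "kha": ["N"],
    "ga": ["CASE", "V"],
}


def lex_tag(words: List[str], lexicon: Dict[str, List[str]],
            unknown_tag: str = "UNKNOWN") -> List[Tuple[str, List[str]]]:
    """Assign to each segmented word all of its possible tags."""
    return [(w, list(lexicon.get(w, [unknown_tag]))) for w in words]


def to_horizontal(tagged: List[Tuple[str, List[str]]]) -> str:
    """Render one line of text; the '/' and '|' separators are assumed."""
    return " ".join(f"{w}/{'|'.join(tags)}" for w, tags in tagged)


tagged = lex_tag(["ka", "kha", "ga"], TOY_LEXICON)
print(to_horizontal(tagged))
```

No disambiguation happens at this stage: ambiguous words simply keep all of their tags, which is what makes the output usable as a baseline.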

Various lex tagged texts are available for download here, in horizontal and VISL CG format. VISL CG format is a kind of vertical file required as input to the CG3 tagger (see documentation).
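As a rough illustration of the vertical layout, the sketch below writes each word as a CG-3 cohort: a word-form line of the shape `"<word>"` followed by one indented reading line per possible tag. Using the word itself as the base form, and the toy tag names, are assumptions made for this example; the files distributed here may differ in detail.

```python
from typing import List, Tuple


def to_cg_cohorts(tagged: List[Tuple[str, List[str]]]) -> str:
    """Render (word, tags) pairs as CG-3 style cohorts, one reading per tag."""
    lines = []
    for word, tags in tagged:
        lines.append(f'"<{word}>"')           # word-form line
        for tag in tags:
            lines.append(f'\t"{word}" {tag}')  # reading: base form plus tag
    return "\n".join(lines) + "\n"


# Hypothetical input: one ambiguous and one unambiguous word.
print(to_cg_cohorts([("ka", ["N", "V"]), ("kha", ["N"])]))
```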

Rule Tagger

If every word is in the lexicon, then the lex tagger will be 100% accurate when applied to an input text, in the sense that the correct tag is always among those assigned. However, the output of the lex tagger is still highly ambiguous, since many words have more than one tag. This is where the rule tagger comes in. Its job is to use contextual rules to eliminate impossible tags and thereby reduce ambiguity, while retaining near-perfect accuracy. First, the lexicon is used to assign to each word of a text all of its possible tags. Then, the rules are applied in order, stripping off tags that are not possible given the surrounding context.
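The two-stage process can be sketched as follows. This is a minimal illustration, not the project's implementation: the rule shown, the tag names and the safeguard that never removes a word's last remaining tag are all assumptions made for the example.

```python
from typing import Callable, List, Set, Tuple

Cohort = Tuple[str, List[str]]               # (word, possible tags)
Rule = Callable[[List[Cohort], int], Set[str]]  # returns tags to strip at position i


def no_verb_after_unambiguous_det(cohorts: List[Cohort], i: int) -> Set[str]:
    """Hypothetical rule: strip V when the previous word can only be DET."""
    if i > 0 and cohorts[i - 1][1] == ["DET"]:
        return {"V"}
    return set()


def apply_rules(cohorts: List[Cohort], rules: List[Rule]) -> List[Cohort]:
    """Apply each rule in order, removing tags ruled out by the context."""
    result = [(w, list(tags)) for w, tags in cohorts]
    for rule in rules:
        for i in range(len(result)):
            word, tags = result[i]
            banned = rule(result, i)
            kept = [t for t in tags if t not in banned]
            if kept:  # assumed safeguard: never leave a word with no tag
                result[i] = (word, kept)
    return result


lex_output = [("the", ["DET"]), ("run", ["N", "V"])]
print(apply_rules(lex_output, [no_verb_after_unambiguous_det]))
```

Because rules only ever remove tags, a later rule can benefit from ambiguity already eliminated by an earlier one, which is why the order of application matters.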

Each rule package includes a background explanation along with a concise statement of the rule itself, with the latter forming the basis of its implementation. End users wishing to understand the intended purpose and function of a rule can ignore the code and focus instead on the rule background and statement.

For detailed discussion of an earlier version of the rule tagger see E. Garrett et al. (2014), 'A Rule-based Part-of-speech Tagger for Classical Tibetan', Himalayan Linguistics, 13 (1), pp. 9-57.

Regular Expressions

The rule tagger was first implemented using regular expressions. While useful for prototyping, this first tagger has proven brittle. It is difficult to maintain, and easy to corrupt. Moreover, regular expressions tend to exclude non-technical users, diminishing any realistic hope of getting others involved in the process of refining the tagger.

We continue to update the regular expressions tagger, but we anticipate that it will eventually be entirely superseded by the constraint grammar tagger. Constraint grammar was specifically designed for use by linguists and computer professionals for language analysis, with rules that are much easier to read and maintain.

Please note that the rules below and the taggers available for download from this site are the most up-to-date versions that we have. Occasionally, a rule modification will break the tagger. We usually fix broken rules quickly, so if the tagger doesn't work for you, check back later and download a new tagger.

The regex tagger consists of a sequence of rules, applied in order to a horizontal text. Each rule consists of two parts, the pattern (before the > symbol) and the replacement (after the > symbol).

PATTERN > REPLACE
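To make the rule format concrete, here is a small Python sketch that reads rules of this shape and applies them in sequence. It is an illustration only: the `word/tag|tag` horizontal layout, the sample rule, the tag names and the use of Python's regular expression dialect (with `\1`-style backreferences) are assumptions, and the actual tagger may use a different syntax.

```python
import re
from typing import List, Pattern, Tuple


def parse_rule(line: str) -> Tuple[Pattern[str], str]:
    """Split 'PATTERN > REPLACE' at the first ' > ' separator."""
    pattern, replacement = line.split(" > ", 1)
    return re.compile(pattern), replacement


def apply_regex_rules(text: str, rule_lines: List[str]) -> str:
    """Apply each rule in order; later rules see earlier rules' output."""
    for line in rule_lines:
        pattern, replacement = parse_rule(line)
        text = pattern.sub(replacement, text)
    return text


# Hypothetical rule: after a word tagged only DET, reduce N|V to N.
RULES = [r"(\S+/DET) (\S+)/N\|V > \1 \2/N"]

print(apply_regex_rules("the/DET run/N|V fast/ADV", RULES))
```

Even this toy rule shows why the approach is brittle: the pattern has to spell out the exact tag string it expects, so any change to tag order or separators silently stops it from matching.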

The horizontal taggings available for download here are the result of applying the regex tagger to hand-segmented text. One use of these outputs might be as input to a statistical tagger charged with the task of further reducing POS tagging ambiguity.

Constraint Grammar

VISL CG is a C++ application that should be compiled and built for your specific platform. To install the software on your machine, follow these detailed instructions.

Next, download and compile the grammar using the cg-comp command.

cg-comp 2014-10-31-cg3-tagger.txt cg3-tagger.cg

Finally, use the vislcg3 command to apply the compiled tagger to a lex tagged output in VISL CG format. For example, the following command will apply the tagger to the lex tagged མཛངས་བླུན་ཞེས་བྱ་བའི་མདོ།, assuming the tagger, the input file, and the output file are all in your current working directory.

vislcg3 -g cg3-tagger.cg -I lex_vislcg_74.txt -O cg3-74.txt
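For batch use, the two commands above can be wrapped in a short script. The sketch below assumes that `cg-comp` and `vislcg3` are installed and on your PATH; the helper names `build_commands` and `run_pipeline` are hypothetical, and the file names are simply those used in the examples above.

```python
import subprocess
from typing import List, Tuple


def build_commands(grammar_src: str, compiled: str,
                   lex_input: str, output: str) -> Tuple[List[str], List[str]]:
    """Build the compile and tagging commands as argument lists."""
    compile_cmd = ["cg-comp", grammar_src, compiled]
    tag_cmd = ["vislcg3", "-g", compiled, "-I", lex_input, "-O", output]
    return compile_cmd, tag_cmd


def run_pipeline(grammar_src: str, compiled: str,
                 lex_input: str, output: str) -> None:
    """Compile the grammar, then tag; raises CalledProcessError on failure."""
    for cmd in build_commands(grammar_src, compiled, lex_input, output):
        subprocess.run(cmd, check=True)


# Example call (requires the CG-3 tools and the files to be present):
# run_pipeline("2014-10-31-cg3-tagger.txt", "cg3-tagger.cg",
#              "lex_vislcg_74.txt", "cg3-74.txt")
```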

The CG POS tagger is provided in two flavours. The word tagger takes as input a sequence of word cohorts. The syllable tagger takes as input a sequence of syllable cohorts.