SOAS University of London

Tibetan Studies at SOAS

Tibetan in Digital Communication - Corpora and Lexicon

Tibetan Corpora

The corpus consists of three distinct collections. First, there is the Classical corpus. Second, there is the Saint Petersburg corpus, consisting of texts assembled and tagged by Pavel Grokhovski of Saint Petersburg State University. Some of these texts have been retagged to make them consistent with our scheme. Third, there is the Berlin corpus, kindly provided to us by Michael Balk. This corpus comprises the entire Tibetan catalogue of the Berlin State Library. It came to us fully segmented, but not tagged. We have converted the corpus into Unicode and made some effort to adjust its conventions to match ours. Together, these three collections constitute the "Complete corpus".

Texts are available in horizontal and vertical formats. In horizontal format, a single space marks the boundary between words, and line breaks separate sentences. Each word consists of a word form followed by a tag, with the pipe character in between. Whitespace is not permitted within words. For example:

Corpus image 1

In our files, each line of a horizontal file corresponds to a page of text rather than to a sentence. We have made no attempt to ensure that page breaks correspond to sentence breaks; therefore, logical sentences are often split over two lines. However, we do not allow page breaks within words. Where a page break fell inside a word, we have moved it to the nearest word boundary.
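As a rough illustration of the horizontal format described above, the following Python sketch splits one line into pairs of word form and tag. The function name and the tags NOUN and VERB are hypothetical placeholders, not part of our tag set, and the choice to split on the last pipe character in each word is an assumption.

```python
def parse_horizontal_line(line):
    """Split one horizontal-format line into (word_form, tag) pairs.

    Assumes words are separated by whitespace and that each word has the
    shape form|tag. Splitting on the LAST pipe is an assumption.
    """
    pairs = []
    for token in line.split():
        form, sep, tag = token.rpartition("|")
        if not sep:
            raise ValueError(f"no pipe character in token: {token!r}")
        pairs.append((form, tag))
    return pairs


# Placeholder tags; the real tag set is not reproduced here.
example = "\u0f50\u0f7c\u0f42\u0f0b|NOUN \u0f61\u0f7c\u0f51|VERB\n"
pairs = parse_horizontal_line(example)
```

Because a line corresponds to a page rather than a sentence, a caller that needs whole sentences would have to join consecutive lines before segmenting them.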

In vertical format, each word occurs on a separate line, with sentence breaks indicated either by a blank line or by special word forms. Word forms are separated from their tags by a tab; therefore, word forms are permitted to contain the single space character. The part-of-speech tag is usually followed by the lemma to which the word form belongs when it has that tag. Since we are not tracking lemmas, we follow convention and put a dash in the lemma field. Here are the same two sentences in vertical format:

Corpus image 2
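A minimal sketch of reading the vertical format might look like the following. It assumes that the lemma field, when present, is separated from the tag by a further tab, and it treats only blank lines as sentence breaks; the special word forms that can also mark a sentence break are not handled. The function name and the tags are hypothetical placeholders.

```python
def parse_vertical(lines):
    """Group vertical-format lines into sentences of (form, tag, lemma).

    Assumptions: fields are tab-separated; a blank line ends a sentence;
    a missing tag (as in untagged text) becomes None; a missing lemma
    becomes the conventional dash.
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        form = fields[0]
        tag = fields[1] if len(fields) > 1 else None
        lemma = fields[2] if len(fields) > 2 else "-"
        current.append((form, tag, lemma))
    if current:
        sentences.append(current)
    return sentences


# Placeholder tags; the real tag set is not reproduced here.
sample = [
    "\u0f50\u0f7c\u0f42\u0f0b\tNOUN\t-\n",
    "\u0f61\u0f7c\u0f51\tVERB\t-\n",
    "\n",
    "\u0f50\u0f7c\u0f42\tNOUN\t-\n",
]
sentences = parse_vertical(sample)
```

Splitting only on tabs keeps any single space inside a word form intact, which is the reason the vertical format uses a tab as the separator.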

Tibetan Lexicon

Numerous Tibetan lexicons are available for download from this site. First, we make use of a processed and somewhat modified version of Nathan W. Hill's A Lexicon of Tibetan Verb Stems as Reported by the Grammatical Tradition (Munich: Bayerische Akademie der Wissenschaften, 2010). Using this lexicon, we have been able to pre-tag many verb stems and verbal nouns whose tags would otherwise not be known.

In addition to the verb lexicon, we also generate mini-lexicons for each text or text collection that is being tagged. Finally, the complete lexicon combines the entries drawn from the main corpus with the verb lexicon.

Lexicons are stored and distributed in vertical format. Each word form has its own line, with tabs separating possible readings. Each reading has two parts: a part-of-speech tag, and a lemma to which the word form belongs when it has that tag. Since we are not tracking lemmas, the lemma field is always left empty; by convention, this is indicated with a dash.

Corpus image 3
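The lexicon layout described above can be read with a short routine such as the one below. The text does not say how the tag and the lemma are separated inside a reading; this sketch assumes a single space, which is a guess rather than a documented fact. The function name and the tags are hypothetical placeholders.

```python
def parse_lexicon_line(line):
    """Split one lexicon line into (word_form, [(tag, lemma), ...]).

    Assumptions: readings are tab-separated after the word form, and
    inside a reading the tag and lemma are separated by one space.
    An absent lemma is returned as the conventional dash.
    """
    form, *readings = line.rstrip("\n").split("\t")
    parsed = []
    for reading in readings:
        tag, _, lemma = reading.partition(" ")
        parsed.append((tag, lemma or "-"))
    return form, parsed


# Placeholder tags; the real tag set is not reproduced here.
entry = parse_lexicon_line("\u0f50\u0f7c\u0f42\u0f0b\tNOUN -\tVERB -\n")
```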

Note that our system treats word forms with and without tsheg (e.g. ཐོག་ and ཐོག) as separate lexical entries, because the two forms may have different distributions.
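To make the point about tsheg concrete, the sketch below builds a small dictionary in which the form with a final tsheg (U+0F0B) and the form without it are stored under different keys. The tags TAG_A and TAG_B are hypothetical placeholders, and a plain dictionary is only one possible way to hold a lexicon.

```python
TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG

without_tsheg = "\u0f50\u0f7c\u0f42"   # the syllable without a final tsheg
with_tsheg = without_tsheg + TSHEG     # the same syllable with a final tsheg

# Keys are compared as exact strings, so no normalisation is applied and
# the two forms remain separate entries.
lexicon = {}
lexicon.setdefault(with_tsheg, []).append(("TAG_A", "-"))
lexicon.setdefault(without_tsheg, []).append(("TAG_B", "-"))
```

Any lookup code that strips a trailing tsheg before searching would merge these two entries, so it would not match the behaviour described here.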