NELRALEC Tagset: A Part-of-speech Tagset for Nepali Language

 

Part-of-speech tags are word classes or syntactic categories of words. They carry important information about words, their neighbours and how they relate to each other. A part-of-speech tag also indicates which morphological affixes a given word can take. Part-of-speech tagging is an important task in natural language processing. In this blog post we’ll discuss the NELRALEC tagset compiled under the Bhasa Sanchar or NELRALEC project.

 

In linguistics, words can be divided into two kinds of classes: closed and open. A closed class is a part-of-speech label for which it is possible to list all the word-forms belonging to it, such as postposition, pronoun, conjunction, and interjection. For an open class such as noun, verb, adjective and adverb, an exhaustive listing of the word-forms is not possible. The NELRALEC tagset contains 112 part-of-speech tags that have been compiled with reference to various publications on Nepali grammar.

 

Nouns

 

Nepali nouns inflect for number and case, but the inflected forms are not marked with separate part-of-speech labels. This is because the case and plural markers are separated from the word and analyzed separately. While gender distinctions also occur in Nepali nouns, they are not included in the tags because they are not grammatical gender.

 

The NELRALEC tagset has two part-of-speech labels for nouns: common noun (NN) and proper noun (NP).

 

Pronouns

 

Nepali pronouns show lexical variation for person, number and grade of honorifics. The NELRALEC tagset contains thirty-nine part-of-speech labels for pronouns.

 

There are eleven labels for lexical variations that Nepali pronouns show for grade of honorifics. Lexical variants of first person pronouns and reflexive pronouns take four different labels each.

Some pronouns are derived by adding -ai. When the morpheme -ai is added to a possessive pronoun, the category of the possessive pronoun changes. Unlike the original possessive pronoun, the newly formed pronoun has no number or gender agreement. The tagset has four different categories of such pronouns.
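The label counts given so far can be checked against the way the pronoun tags in Table 1 are composed. The Python sketch below is only an illustration: the tag strings come from Table 1, while the variable names are made up for this post and do not belong to any NELRALEC tool.

```python
# Illustrative sketch: rebuild the pronoun labels from their parts.
# Tag strings come from Table 1; all variable names are hypothetical.

# Bases that have possessive forms: first person, non-honorific and
# medial-honorific second person, and reflexive.
POSSESSIVE_BASES = ["PMX", "PTN", "PTM", "PRF"]

# Pronouns that have a single label and no possessive forms in the tagset.
SINGLE_LABELS = ["PTH", "PXH", "PXR"]

# Possessive agreement suffixes: masculine, feminine, other.
AGREEMENT_SUFFIXES = ["KM", "KF", "KO"]

plain = []
for base in POSSESSIVE_BASES:
    plain.append(base)
    plain.extend(base + suffix for suffix in AGREEMENT_SUFFIXES)
plain.extend(SINGLE_LABELS)

# Forms derived with -ai carry no agreement and end in KX.
derived_with_ai = [base + "KX" for base in POSSESSIVE_BASES]

print(len(plain))        # 19
print(derived_with_ai)   # ['PMXKX', 'PTNKX', 'PTMKX', 'PRFKX']
```

The nineteen plain labels are the eleven honorific-related labels plus four each for first person and reflexive pronouns. Adding the four -ai forms gives twenty-three, and the remaining sixteen of the thirty-nine pronoun labels are the pronoun determiners.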

 

Pronoun determiners in Nepali are used at the start of a noun phrase to reference the noun phrase or give it context. The tagset contains sixteen labels for pronoun determiners, four of which are for marked and unmarked demonstratives. Similarly, marked and unmarked interrogatives, relatives and other general determiners take four labels each.
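The determiner tags in Table 1 follow a regular three-letter pattern: D, then a letter for the determiner type, then a letter for the agreement. Here is a small Python sketch that decodes such a tag. The function and dictionary names are hypothetical, and the letter meanings are read off Table 1 rather than taken from an official specification.

```python
# Illustrative sketch: decode a pronoun-determiner tag from Table 1.
# The dictionaries and the function name are hypothetical helpers.

DETERMINER_TYPE = {
    "D": "demonstrative",
    "K": "interrogative",
    "J": "relative",
    "G": "general",
}

AGREEMENT = {
    "M": "masculine",
    "F": "feminine",
    "O": "other agreement",
    "X": "unmarked",
}

def decode_determiner(tag):
    """Return (type, agreement) for a three-letter determiner tag."""
    if len(tag) != 3 or tag[0] != "D":
        raise ValueError(f"not a pronoun-determiner tag: {tag!r}")
    if tag[1] not in DETERMINER_TYPE or tag[2] not in AGREEMENT:
        raise ValueError(f"not a pronoun-determiner tag: {tag!r}")
    return DETERMINER_TYPE[tag[1]], AGREEMENT[tag[2]]

# Every combination of type and agreement gives one of the sixteen labels.
ALL_DETERMINER_TAGS = ["D" + t + a for t in DETERMINER_TYPE for a in AGREEMENT]

print(decode_determiner("DKF"))   # ('interrogative', 'feminine')
print(len(ALL_DETERMINER_TAGS))   # 16
```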

 

The question word के (ke) is a special case among the interrogative pronouns. के (ke) receives the question marker (QQ) label when it is not used to refer to an unknown entity.

 

Adjectives

 

There are five part-of-speech labels for adjectives in the NELRALEC tagset. Two of the tags are for the gendered forms of the adjectives i.e. masculine and feminine adjectives.

 

There is one label for each of the following categories of adjectives:

  • Unmarked adjectives
  • Adjectives with other agreements
  • Comparative or superlative adjectives derived from Sanskrit

 

 

Verbs

 

Nepali verbs are highly inflected. Compounding of two or more verbs is also common in Nepali. When tagging a compound verb, the tagging model assigns a tag on the basis of the last identifiable verb in the compound word.

 

There are twenty-nine tags for verbs in the NELRALEC tagset. Thirteen of the tags are used for different categories of non-finite verb forms, which lack marking for person. The remaining labels are used for finite verb forms.
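The split between the two groups of verb tags can be written down as a simple lookup. The sketch below assumes that the thirteen non-finite tags are the verb tags in Table 1 whose definitions do not name a person; the post does not list them one by one, so treat the set as a reading of the table. The helper name is hypothetical.

```python
# Illustrative sketch: separate non-finite from finite verb tags.
# Assumption: the thirteen non-finite tags are the Table 1 verb tags
# that do not name a person. The helper name is hypothetical.

NON_FINITE_TAGS = {
    "VI",                        # infinitive
    "VDM", "VDF", "VDO", "VDX",  # d-participles
    "VE", "VN",                  # e(ko)- and ne-participles
    "VQ",                        # sequential participle-converb
    "VCN", "VCM", "VCH",         # command forms
    "VS",                        # subjunctive/conditional e-form
    "VR",                        # i-form
}

def is_finite(tag):
    """True when a verb tag is not one of the thirteen non-finite tags."""
    if not tag.startswith("V"):
        raise ValueError(f"not a verb tag: {tag!r}")
    return tag not in NON_FINITE_TAGS

print(len(NON_FINITE_TAGS))   # 13
print(is_finite("VVMX1"))     # True
print(is_finite("VQ"))        # False
```

Under this reading, the sixteen finite tags are the ones beginning with VV or VO (optative) in Table 1.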

 

Adverbs

 

Adverbs in Nepali are open class words, but a subset of Nepali adverbs falls into a closed category. This closed category consists of adverb-determiners, which are morphologically related to the pronoun-determiners.

 

The tagset contains four labels for adverbs, three of which are for adverb-determiners: demonstratives, interrogatives and relatives. The remaining label is for the open class adverbs.

 

Postpositions

 

A postposition occurs as an element of a word and not as an independent word. The postposition is separated from the word before tagging.
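To make the separation step concrete, here is a toy Python sketch that peels postpositions off the end of a word. It is an assumption-laden illustration, not the project's tokenizer: the three markers (हरू, ले, लाई) and their pairing with the IH, IE and IA labels from Table 1 are supplied here as examples, and the function name is hypothetical.

```python
# Toy sketch: peel postpositions off the end of a word before tagging.
# Assumption: the three markers below and their tags are examples added
# for illustration; the lookup table is far from complete.

POSTPOSITION_TAGS = {
    "हरू": "IH",   # plural-collective
    "ले": "IE",    # ergative-instrumental
    "लाई": "IA",   # accusative-dative
}

def split_postpositions(word):
    """Return [stem, postposition, ...] by stripping known endings."""
    suffixes = []
    stripped = True
    while stripped:
        stripped = False
        for marker in POSTPOSITION_TAGS:
            # Only strip when some stem is left over.
            if word.endswith(marker) and len(word) > len(marker):
                suffixes.insert(0, marker)
                word = word[: -len(marker)]
                stripped = True
                break
    return [word] + suffixes

tokens = split_postpositions("केटा" + "हरू" + "ले")
print(tokens)                                   # ['केटा', 'हरू', 'ले']
print([POSTPOSITION_TAGS[t] for t in tokens[1:]])   # ['IH', 'IE']
```

Plain suffix matching like this will also split words that merely happen to end in the same letters, so a real tokenizer needs a lexicon or a trained model to decide when an ending is actually a postposition.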

 

There are eight labels for postpositions in the NELRALEC tagset. Three of the labels are for the genitive postposition को (ko) and its inflected forms. There is one label for each of the following:

  • Plural-collective postposition
  • Ergative-instrumental postposition
  • Accusative-dative postposition
  • Other postpositions
  • Unmarked genitive postposition derived using -ai

 

 

Numerals and Numeral Classifiers

 

The tagset contains five labels for numerals. One of the labels is for cardinal numbers, covering both Nepali digits and Devnagari numbers. The remaining four labels are for ordinal numbers: three for marked ordinal numbers and one for unmarked ordinal numbers.

 

There are four different categories for numeral classifiers in the tagset. Three of the categories are for different marked numeral classifiers and one is for the unmarked numeral classifiers.

 

Conjunctions

 

Conjunctions can be of two types: coordinating and subordinating. A subordinating conjunction in Nepali can appear before or after the clause it subordinates.

 

There are three different labels for conjunctions in the NELRALEC tagset: two for subordinating conjunctions and one for coordinating conjunctions.

 

Punctuation

 

The NELRALEC tagset contains four labels for the different categories of punctuation listed below.

  • Punctuation that occurs at the end of sentences
  • Punctuation that occurs in the middle of sentences
  • Quotations
  • Brackets

 

 

Particles and Interjections

 

Particles are a small closed class of uninflected word forms. The NELRALEC tagset contains one label for the particles.

 

Interjections are also a closed class: they are independent particles that function as reduced but syntactically complete sentences. There is one part-of-speech label for interjections in the tagset.

 

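There are six labels in the NELRALEC tagset for non-Nepali words and other tokens that fit none of the categories above, such as abbreviations and mathematical formulae.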

 

Table 1 lists all the part-of-speech labels defined in the NELRALEC tagset.

 

Table 1: NELRALEC Tagset

| Category | Category Definition | POS Tag |
|---|---|---|
| Noun | Common Noun | NN |
| | Proper Noun | NP |
| Pronouns | First Person Pronoun | PMX |
| | First Person Possessive Pronoun with Masculine Agreement | PMXKM |
| | First Person Possessive Pronoun with Feminine Agreement | PMXKF |
| | First Person Possessive Pronoun with Other Agreement | PMXKO |
| | Non-honorific Second Person Pronoun | PTN |
| | Non-honorific Second Person Possessive Pronoun with Masculine Agreement | PTNKM |
| | Non-honorific Second Person Possessive Pronoun with Feminine Agreement | PTNKF |
| | Non-honorific Second Person Possessive Pronoun with Other Agreement | PTNKO |
| | Medial-honorific Second Person Pronoun | PTM |
| | Medial-honorific Second Person Possessive Pronoun with Masculine Agreement | PTMKM |
| | Medial-honorific Second Person Possessive Pronoun with Feminine Agreement | PTMKF |
| | Medial-honorific Second Person Possessive Pronoun with Other Agreement | PTMKO |
| | High-honorific Second Person Pronoun | PTH |
| | High-honorific Unspecified-person Pronoun | PXH |
| | Royal-honorific Unspecified-person Pronoun | PXR |
| | Reflexive Pronoun | PRF |
| | Possessive Reflexive Pronoun with Masculine Agreement | PRFKM |
| | Possessive Reflexive Pronoun with Feminine Agreement | PRFKF |
| | Possessive Reflexive Pronoun with Other Agreement | PRFKO |
| Pronouns Derived using -ai | First Person Possessive Pronoun without Agreement | PMXKX |
| | Non-honorific Second Person Possessive Pronoun without Agreement | PTNKX |
| | Medial-honorific Second Person Possessive Pronoun without Agreement | PTMKX |
| | Possessive Reflexive Pronoun without Agreement | PRFKX |
| Pronoun Determiners | Masculine Demonstrative Determiner | DDM |
| | Feminine Demonstrative Determiner | DDF |
| | Other-agreement Demonstrative Determiner | DDO |
| | Unmarked Demonstrative Determiner | DDX |
| | Masculine Interrogative Determiner | DKM |
| | Feminine Interrogative Determiner | DKF |
| | Other-agreement Interrogative Determiner | DKO |
| | Unmarked Interrogative Determiner | DKX |
| | Masculine Relative Determiner | DJM |
| | Feminine Relative Determiner | DJF |
| | Other-agreement Relative Determiner | DJO |
| | Unmarked Relative Determiner | DJX |
| | Masculine General Determiner-pronoun | DGM |
| | Feminine General Determiner-pronoun | DGF |
| | Other-agreement General Determiner-pronoun | DGO |
| | Unmarked General Determiner-pronoun | DGX |
| Question Marker | Question Marker | QQ |
| Adjective | Masculine Adjective | JM |
| | Feminine Adjective | JF |
| | Other-agreement Adjective | JO |
| | Unmarked Adjective | JX |
| | Sanskrit-derived Comparative or Superlative Adjective | JT |
| Verb | Infinitive Verb | VI |
| | Masculine d-participle Verb | VDM |
| | Feminine d-participle Verb | VDF |
| | Other-agreement d-participle Verb | VDO |
| | Unmarked d-participle Verb | VDX |
| | e(ko)-participle Verb | VE |
| | ne-participle Verb | VN |
| | Sequential Participle-converb | VQ |
| | Command-form Verb, Non-honorific | VCN |
| | Command-form Verb, Mid-honorific | VCM |
| | Command-form Verb, High-honorific | VCH |
| | Subjunctive/Conditional e-form Verb | VS |
| | i-form Verb | VR |
| | First Person Singular Verb | VVMX1 |
| | First Person Plural Verb | VVMX2 |
| | Second Person Non-honorific Singular Verb | VVTN1 |
| | Second Person Plural (or Medial-honorific Singular) Verb | VVTX2 |
| | Third Person Non-honorific Singular Verb | VVYN1 |
| | Third Person Plural (or Medial-honorific Singular) Verb | VVYX2 |
| | Feminine Second Person Non-honorific Singular Verb | VVTN1F |
| | Feminine Second Person Medial-honorific Singular Verb | VVTM1F |
| | Feminine Third Person Non-honorific Singular Verb | VVYN1F |
| | Feminine Third Person Medial-honorific Singular Verb | VVYM1F |
| | First Person Singular Optative Verb | VOMX1 |
| | First Person Plural Optative Verb | VOMX2 |
| | Second Person Non-honorific Singular Optative Verb | VOTN1 |
| | Second Person Plural (or Medial-honorific Singular) Optative Verb | VOTX2 |
| | Third Person Non-honorific Singular Optative Verb | VOYN1 |
| | Third Person Plural (or Medial-honorific Singular) Optative Verb | VOYX2 |
| Adverb | Adverb | RR |
| | Demonstrative Adverb | RD |
| | Interrogative Adverb | RK |
| | Relative Adverb | RJ |
| Postposition | Postposition | II |
| | Plural-collective Postposition | IH |
| | Ergative-instrumental Postposition | IE |
| | Accusative-dative Postposition | IA |
| | Masculine Genitive Postposition | IKM |
| | Feminine Genitive Postposition | IKF |
| | Other-agreement Genitive Postposition | IKO |
| | Unmarked Genitive Postposition derived using -ai | IKX |
| Numerals | Cardinal Number | CD |
| | Masculine Ordinal Number | MOM |
| | Feminine Ordinal Number | MOF |
| | Other-agreement Ordinal Number | MOO |
| | Unmarked Ordinal Number | MOX |
| Numeral Classifiers | Masculine Numeral Classifier | MLM |
| | Feminine Numeral Classifier | MLF |
| | Other-agreement Numeral Classifier | MLO |
| | Unmarked Numeral Classifier | MLX |
| Conjunctions | Coordinating Conjunction | CC |
| | Subordinating Conjunction appearing after the clause it subordinates | CSA |
| | Subordinating Conjunction appearing before the clause it subordinates | CSB |
| Punctuation | Sentence-final Punctuation | YF |
| | Sentence-medial Punctuation | YM |
| | Quotation Marks | YQ |
| | Brackets | YB |
| Particle | Particle | TT |
| Interjection | Interjection | UU |
| Others | Foreign Word in Devnagari | FF |
| | Foreign Word not in Devnagari | FS |
| | Abbreviation | FB |
| | Mathematical Formula | FO |
| | Letter of the Alphabet | FZ |
| | Unclassifiable | FU |
| Null Tag | Null Tag | NULL |
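Because the tags are built from shared prefixes, a fine-grained tag can be collapsed to its broad category with a short lookup. The following Python sketch is one possible way to do this and is not part of the tagset itself: the function name is hypothetical, and the two pronoun groups of Table 1 are merged into a single "Pronouns" category for simplicity.

```python
# Illustrative sketch: map a fine-grained tag to its Table 1 category.
# The function name is hypothetical; prefixes are read off Table 1.

# Tags that must be matched whole because their first letter is shared
# with another category (for example CD versus CC, NULL versus NN).
EXACT = {
    "NULL": "Null Tag",
    "CD": "Numerals",
    "QQ": "Question Marker",
    "TT": "Particle",
    "UU": "Interjection",
}

# Checked in order, so two-letter prefixes come before one-letter ones.
PREFIXES = [
    ("MO", "Numerals"),
    ("ML", "Numeral Classifiers"),
    ("N", "Noun"),
    ("P", "Pronouns"),
    ("D", "Pronoun Determiners"),
    ("J", "Adjective"),
    ("V", "Verb"),
    ("R", "Adverb"),
    ("I", "Postposition"),
    ("C", "Conjunctions"),
    ("Y", "Punctuation"),
    ("F", "Others"),
]

def coarse_category(tag):
    """Return the broad category for a tag, or raise ValueError."""
    if tag in EXACT:
        return EXACT[tag]
    for prefix, category in PREFIXES:
        if tag.startswith(prefix):
            return category
    raise ValueError(f"unknown tag: {tag!r}")

print(coarse_category("VVTN1F"))   # Verb
print(coarse_category("CD"))       # Numerals
print(coarse_category("CSA"))      # Conjunctions
```

The lookup only inspects prefixes, so it does not check that a tag actually appears in Table 1. A stricter version would validate against the full list of labels.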

 
