A New and Reduced Part-of-Speech Tagset for Nepali

 

A well-chosen tagset is very important in part-of-speech tagging. The NELRALEC tagset contains 112 tags, which is a large number. Using such a large tagset is not always efficient, especially when only a limited amount of annotated data is available. In this blog post we’ll discuss a new tagset for Nepali, which is a reduced version of the NELRALEC tagset.

 

The reduced tagset is designed to eliminate the errors that a part-of-speech tagger makes because of the sparseness of annotated data: with fewer tags, each tag has more training examples.

 

In the new tagset:

  • All personal pronouns are grouped together and distinctions in grades of honorifics are not considered.
  • All possessive pronouns are grouped together and distinctions in grades of honorifics, gender and other inflectional forms are not considered.
  • Sixteen different tags for pronoun determiners and three different tags for adverb determiners are re-grouped into three new tag groups.
  • The distinctions for inflected forms such as gender, number, honorifics and person are not considered for verbs.
  • All the inflected forms of adjectives, ordinal numbers, numeral classifiers and genitive postpositions are grouped together.
  • Different categories of foreign words are grouped together.
  • All subordinating conjunctions are given a single label.

 

Table 1: Reduced Tagset for Nepali

| Category | Category Definition | POS Tag | NELRALEC Tags |
|---|---|---|---|
| Noun | Common Noun | NN | NN |
| | Proper Noun | NP | NP |
| Pronouns | Personal Pronoun | PP | PMX, PTN, PTM, PTH, PXH, PXR |
| | Possessive Pronoun | PPP | PMXKM, PMXKF, PMXKO, PTNKM, PTNKF, PTNKO, PTMKM, PTMKF, PTMKO, PRFKM, PRFKF, PRFKO, PMXKX, PTNKX, PTMKX, PRFKX |
| | Reflexive Pronoun | PRF | PRF |
| Determiner | Marked | DTM | DDM, DDF, DKM, DKF, DJM, DJF, DGM, DGF, DDO, DKO, DJO, DGO |
| | Unmarked | DTX | DDX, DKX, DJX, DGX |
| | Others | DTO | RD, RK, RJ |
| Verb | Finite Verb | VF | VVMX1, VVMX2, VVTN1, VVTX2, VVYN1, VVYX2, VVTN1F, VVTM1F, VVYN1F, VVYM1F, VOMX1, VOMX2, VOTN1, VOTX2, VOYN1, VOYX2 |
| | Infinitive Verb | VBI | VI |
| | Prospective Verb | VBN | VN |
| | Aspect Verb | VBKO | VDM, VDF, VDO, VDX |
| | Others | VBO | VE, VQ, VCN, VCM, VCH, VS, VR |
| Adjective | Marked | JJM | JM, JF, JO |
| | Unmarked | JJX | JX |
| | Degree | JJD | JT |
| Adverb | Adverb | RR | RR |
| Postposition | Postposition | II | II |
| | Plural-collective Postposition | IH | IH |
| | Ergative-instrumental Postposition | IE | IE |
| | Accusative-dative Postposition | IA | IA |
| | Genitive Postposition | IKO | IKM, IKO, IKF, IKX |
| Numerals | Cardinal Number | MM | MM |
| | Marked Ordinal Number | MOM | MOM, MOF, MOO |
| | Unmarked Ordinal Number | MOX | MOX |
| Classifier | Marked | MLM | MLM, MLF, MLO |
| | Unmarked | MLX | MLX |
| Conjunction | Coordinating Conjunction | CC | CC |
| | Subordinating Conjunction | CS | CSA, CSB |
| Interjection | Interjection | UU | UU |
| Question Marker | Question Marker | QQ | QQ |
| Particle | Particle | TT | TT |
| Punctuation | Sentence-final Punctuation | YF | YF |
| | Sentence-medial Punctuation | YM | YM |
| | Quotation Marks | YQ | YQ |
| | Brackets | YB | YB |
| Foreign Word | Foreign Word | FW | FF, FS, FO, FZ |
| Unclassifiable | Unclassifiable | FU | FU |
| Abbreviation | Abbreviation | FB | FB |
| Null Tag | Null Tag | NULL | NULL |
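
To make Table 1 easier to apply to a corpus that is already annotated with NELRALEC tags, here is a minimal Python sketch of the mapping. The tag groups are transcribed from the table, with the last finite-verb tag read as VOYX2. The names `REDUCED_TAGSET`, `NELRALEC_TO_REDUCED`, `reduce_tag` and `reduce_tagged_sentence` are made up for this illustration and do not come from any existing tool, and the tokens `w1` to `w5` in the usage note are placeholders rather than real Nepali words.

```python
# Reduced tag -> space-separated NELRALEC tags it replaces (from Table 1).
REDUCED_TAGSET = {
    "NN": "NN", "NP": "NP",
    "PP": "PMX PTN PTM PTH PXH PXR",
    "PPP": ("PMXKM PMXKF PMXKO PTNKM PTNKF PTNKO PTMKM PTMKF PTMKO "
            "PRFKM PRFKF PRFKO PMXKX PTNKX PTMKX PRFKX"),
    "PRF": "PRF",
    "DTM": "DDM DDF DKM DKF DJM DJF DGM DGF DDO DKO DJO DGO",
    "DTX": "DDX DKX DJX DGX",
    "DTO": "RD RK RJ",
    "VF": ("VVMX1 VVMX2 VVTN1 VVTX2 VVYN1 VVYX2 VVTN1F VVTM1F VVYN1F "
           "VVYM1F VOMX1 VOMX2 VOTN1 VOTX2 VOYN1 VOYX2"),
    "VBI": "VI", "VBN": "VN",
    "VBKO": "VDM VDF VDO VDX",
    "VBO": "VE VQ VCN VCM VCH VS VR",
    "JJM": "JM JF JO", "JJX": "JX", "JJD": "JT",
    "RR": "RR",
    "II": "II", "IH": "IH", "IE": "IE", "IA": "IA",
    "IKO": "IKM IKO IKF IKX",
    "MM": "MM",
    "MOM": "MOM MOF MOO", "MOX": "MOX",
    "MLM": "MLM MLF MLO", "MLX": "MLX",
    "CC": "CC", "CS": "CSA CSB",
    "UU": "UU", "QQ": "QQ", "TT": "TT",
    "YF": "YF", "YM": "YM", "YQ": "YQ", "YB": "YB",
    "FW": "FF FS FO FZ",
    "FU": "FU", "FB": "FB", "NULL": "NULL",
}

# Invert the table: NELRALEC tag -> reduced tag.
NELRALEC_TO_REDUCED = {}
for reduced, originals in REDUCED_TAGSET.items():
    for original in originals.split():
        if original in NELRALEC_TO_REDUCED:
            # Each NELRALEC tag should belong to exactly one reduced tag.
            raise ValueError(f"{original} is listed under two reduced tags")
        NELRALEC_TO_REDUCED[original] = reduced


def reduce_tag(nelralec_tag):
    """Return the reduced tag for a single NELRALEC tag."""
    try:
        return NELRALEC_TO_REDUCED[nelralec_tag]
    except KeyError:
        raise ValueError(f"Unknown NELRALEC tag: {nelralec_tag!r}") from None


def reduce_tagged_sentence(tagged_tokens):
    """Relabel a list of (token, NELRALEC tag) pairs with reduced tags."""
    return [(token, reduce_tag(tag)) for token, tag in tagged_tokens]
```

For example, `reduce_tagged_sentence([("w1", "PMX"), ("w2", "IKM"), ("w3", "NN"), ("w4", "VVYN1"), ("w5", "YF")])` relabels the five tokens as PP, IKO, NN, VF and YF. Raising an error on an unknown tag, rather than passing it through silently, is a design choice that makes typos in the source annotation visible.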
