Nepali is a highly inflectional and derivational language, so a single word can take many grammatical forms and meanings. For example, the verb root लेख् (lekh) appears in forms such as लेख्छु (lekh-chu), लेख्छस् (lekh-chas), लेख्छेस् (lekh-ches), लेख्छ (lekh-cha), लेखी (lekh-i), लेख्यो (lekh-yo) and लेखे (lekh-e).
Stemming is the process of reducing inflectional (or sometimes derivational) forms of words to their respective stems or roots by removing affixes. It is a pre-processing step in many natural language processing applications, such as information retrieval, machine translation, spell checking and text summarization.
There are various approaches to stemming, such as look-up tables, statistical methods and affix rule lists. In this blog post we'll discuss a rule-based stemmer that strips suffixes from words iteratively.
The Stemmer and its Suffix Rules
Stemming is a simple process: it strips affixes from words without analysing the word's context. Derivation is a more complex word-formation process, so a crude suffix-stripping algorithm is not effective at mapping derivational variants to their stems. The stemming algorithm discussed here therefore handles inflectional variants only.
A rule-based stemmer uses language-dependent affix rules to map a word to its stem. Inflections in Nepali are marked by suffixes only, so the algorithm is an iterative model that uses suffix rules based on nominal and verbal inflections in Nepali.
The iterative stemming algorithm is computationally inexpensive and works in three parts. The first part reads from a file the suffixes that occur at the very end of a word and strips them. The suffixes in this list are nominal inflections only.
#suffixes for the first part ले को सँग लाई
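As a rough illustration of the first part, here is a minimal Python sketch. It is not the itrstem code: the function name `strip_end_suffix`, the in-memory list (standing in for the rule file) and the `min_stem_len` guard are all assumptions made for the example. Only the four sample suffixes shown above are used.

```python
# Sample end-of-word suffixes from the post; the real list is read from a file.
END_SUFFIXES = ["ले", "को", "सँग", "लाई"]


def strip_end_suffix(word, suffixes=END_SUFFIXES, min_stem_len=2):
    """Remove at most one end-only suffix, trying longer suffixes first.

    min_stem_len is a hypothetical guard (counted in code points) that keeps
    a word from being stripped down to nothing.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


print(strip_end_suffix("रामले"))   # राम
print(strip_end_suffix("रामलाई"))  # राम
print(strip_end_suffix("सँग"))     # सँग (unchanged: nothing would be left)
```

Because these suffixes only occur at the very end of a word, a single pass is enough here, and no iteration is needed.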
The second and third parts of the algorithm work in iterations. The second part handles vowel modifiers and symbols that need to be stripped only in special cases. For example, ँ is stripped only when it is preceded by "ो", "ु", "उ", "े" or "ोै".
#suffixes for the second part ँ ं ै
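The conditional rule for ँ could be sketched as follows. This is an assumption-laden illustration, not the stemmer's actual code: the names `strip_chandrabindu` and `STRIP_AFTER` are made up, and the sketch covers only the four single-character contexts from the example (ो, ु, उ, े). Unicode escapes are used because bare combining marks are hard to read in source code.

```python
CHANDRABINDU = "\u0901"  # ँ

# Characters after which a trailing ँ is stripped (hypothetical subset).
STRIP_AFTER = {
    "\u094b",  # ो
    "\u0941",  # ु
    "\u0909",  # उ
    "\u0947",  # े
}


def strip_chandrabindu(word):
    """Drop a final ँ only when the preceding character allows it."""
    if len(word) >= 2 and word[-1] == CHANDRABINDU and word[-2] in STRIP_AFTER:
        return word[:-1]
    return word


stripped = strip_chandrabindu("\u0917\u0930\u0947\u0901")  # गरेँ: ँ follows े
print(stripped == "\u0917\u0930\u0947")                    # True (गरे)

kept = strip_chandrabindu("\u0939\u093e\u0901")            # हाँ: ँ follows ा
print(kept == "\u0939\u093e\u0901")                        # True (unchanged)
```

The other symbols in this part (ं and ै) would need their own context checks, which are not shown here.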
The third part of the algorithm also handles inflections but, unlike the first part, it deals with both nominal and verbal inflections. Inflectional suffix markers in Nepali can also encode gender, grade of honorifics, number and so on. As a result, a suffix can occur between the root and other inflection markers, in any order. Listing every possible combination of inflectional markers would be tedious, so the third part of the stemmer strips suffixes in iterations instead.
#suffixes for the third part हरु देखि ेको ेछु
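The iterative idea can be sketched like this: strip one matching suffix, then check the shortened word again, until no rule applies. The function name `strip_iteratively`, the longest-match-first ordering and the `min_stem_len` guard are assumptions for this example, and only the four sample suffixes above are included, so this is a sketch rather than the itrstem implementation.

```python
# Sample suffixes from the post; escapes are used where a suffix starts
# with a combining vowel sign.
ITER_SUFFIXES = [
    "हरु",
    "देखि",
    "\u0947\u0915\u094b",  # ेको
    "\u0947\u091b\u0941",  # ेछु
]


def strip_iteratively(word, suffixes=ITER_SUFFIXES, min_stem_len=2):
    """Strip suffixes one at a time until none matches.

    Each pass shortens the word, so the loop always terminates.
    min_stem_len is a hypothetical guard counted in code points.
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                changed = True
                break  # restart the scan on the shortened word
    return word


print(strip_iteratively("केटाहरुदेखि"))  # केटा (strips देखि, then हरु)
print(strip_iteratively("गरेको"))        # गर
```

Stripping one marker per pass means the rule list only needs the individual markers, not every combination and ordering of them.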
itrstem is based on the suffix-stripping rules described above. The stemmer and the complete rule list are available at Nepali NLP Group.
Over and Under Stemming
The stemmer still makes errors due to over-stemming (stripping too much of a word) and under-stemming (stripping too little). The error rate can be greatly reduced by improving the rules fed into the second and third parts of the system, and we are currently working on it.
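One simple way to track these two error types while tuning the rules is to compare stemmer output against a hand-checked stem list. The sketch below is only a heuristic under stated assumptions: the function names are made up, the prefix-based classification is a simplification, and the three pairs are toy inputs for illustration, not evaluation results.

```python
from collections import Counter


def classify(predicted, gold):
    """Label one stemmer output relative to the expected (gold) stem."""
    if predicted == gold:
        return "correct"
    if gold.startswith(predicted):
        return "over"    # too much was stripped
    if predicted.startswith(gold):
        return "under"   # a suffix was left behind
    return "other"


def error_counts(pairs):
    """Count labels over (predicted, gold) pairs."""
    return Counter(classify(predicted, gold) for predicted, gold in pairs)


# Toy (predicted, gold) pairs, made up for illustration.
toy_pairs = [
    ("केटा", "केटा"),      # correct
    ("के", "केटा"),        # over-stemmed
    ("केटाहरु", "केटा"),   # under-stemmed
]
counts = error_counts(toy_pairs)
print(counts["correct"], counts["over"], counts["under"])  # 1 1 1
```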
Tags: Text Analysis, NLP, Pre Processing