Removing stop words is a common and important practice when working with text analysis applications. So, what are stop words and why filter them out during pre-processing?
Stop words are words that mainly serve to structure sentences. They are among the most frequent words in a corpus, but they contribute little to the outcome of an analysis and only add processing time. Removing them during pre-processing lets us focus on the words that carry meaning for the analysis, and it often helps classification models perform better. For these reasons, stop words are usually filtered out during the pre-processing phase.
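As a minimal sketch of the idea, filtering can be written as a set lookup over whitespace-separated tokens. The function name `remove_stop_words` and the tiny stop word set are illustrative assumptions, not part of any particular library, and a real pipeline may need a proper tokenizer.

```python
# Illustrative stop word set; a real list would be much longer.
STOP_WORDS = {"र", "तर", "यो", "पनि"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the tokens of `text` that are not in `stop_words`."""
    # Split on whitespace and keep only tokens that are not stop words.
    return [token for token in text.split() if token not in stop_words]

# "यो किताब र कलम" (this book and pen): यो and र are filtered out.
tokens = remove_stop_words("यो किताब र कलम")
```

Using a set keeps each membership check cheap, which matters because the check runs once per token in the corpus.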
Nepali Stop Word List
There is no standard stop word list available for Nepali, so we first build one. It is also generally better to build a custom stop word list specific to the domain and type of application you are working on, as it will be more effective. Stop words can include language-specific determiners, conjunctions and postpositions. They can also include other common words such as names, places and temporal words.
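One possible way to start a domain-specific list, sketched here under assumptions, is to surface the most frequent tokens in your own corpus as candidates and then merge the ones you accept with hand-curated categories. The helper `frequent_words` and the toy corpus are hypothetical; the category words are examples from this article.

```python
from collections import Counter

def frequent_words(documents, top_n=10):
    """Return the `top_n` most frequent whitespace tokens across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return [word for word, _ in counts.most_common(top_n)]

# Hand-curated categories (examples taken from this article).
determiners = {"यो", "यी", "त्यो", "ती"}
conjunctions = {"र", "तथा", "तर"}

# Hypothetical toy corpus, only for illustration.
corpus = ["यो घर र त्यो घर", "राम र गणेश"]
candidates = frequent_words(corpus, top_n=2)

# Frequency alone also surfaces content words (here "घर", house),
# so candidates should be reviewed by hand before being added.
stop_words = determiners | conjunctions
```

The frequency pass only proposes candidates; the final list remains a manual, domain-dependent choice.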
Determiners come at the start of noun phrases. A determiner is used to reference and provide context to the noun phrase. They can be of different types.
Demonstratives are determiners used to point out something within sight. These words express how near or far away an object is.
#proximity
this - यो(yo)
these - यी(yi)
#distant
that - त्यो(tyo)
those - ती(ti)
A quantifier gives information about the quantity of the noun phrase it modifies. It can be either definite or indefinite.
#definite
each - हरेक(harek), प्रत्येक(pratyek)
#indefinite
a little - अलिकति(alikati), थोरै(thorai)
all - सबै(sabai)
some - केहि(kehi)
Cardinal numbers and ordinal numbers are determiners used with count nouns. A number determiner gives information about the number of the noun phrase it modifies.
#cardinal number
one - एक(ek)
two - दुई(duii)
three - तिन(tin)
#ordinal number
first - प्रथम(pratham)
second - द्वितीय(dwitiya)
third - तृतीय(tritiya)
Interrogative determiners are used to ask questions about noun phrases. In Nepali, these words begin with क् (k).
who - को(ko)
which - कुन(kun)
how much - कति(kati)
A relative determiner shows the relation between two clauses. In Nepali, these words begin with ज् (j).
who - जो(jo)
how - जसरी(jasari)
which - जुन(jun)
Conjunctions are particles used to connect/conjoin two words, phrases or sentences. The two different types of conjunctions are discussed below.
Coordinating conjunctions conjoin two words, phrases or sentences of equal rank.
and - र(ra), तथा(tatha)
but - तर(tara), किन्तु(kintu), परन्तु(parantu)
A subordinating conjunction introduces a subordinate clause. Some subordinating conjunctions occur at the beginning of the subordinate clause, such as because - किनकि(kinaki), and some occur at the end, such as although - पनि(pani).
Postpositions occur at the end of the nominals they determine. Different categories of postpositions are discussed below.
The plural marker हरु(haru) occurs with nouns and pronouns in Nepali to indicate their plural forms.
Case markers are postpositions attached to nominals to express different case relations with the verb.
ले(le) बाट(bat) लाई(lai:) द्वारा(dwara)
Like other postpositions, adverbial postpositions occur with nominals. They carry adverb-like meanings, hence the name ‘adverbial postposition’.
this side - वारि(wari)
other side - पारि(pari)
before - अघाडी(aghadi)
NOTE: Postpositions in Nepali usually attach to the nominal rather than occurring as separate words. Only postpositions that occur independently are removed by the StopWordRemoval class.
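The consequence of this note can be illustrated with a short sketch. The function below is a hypothetical stand-in that matches whole tokens only; it is not the actual StopWordRemoval class, whose implementation is not shown here.

```python
# Hypothetical stand-in for illustration; not the actual StopWordRemoval class.
POSTPOSITIONS = {"ले", "बाट", "लाई", "द्वारा"}

def remove_independent(tokens, stop_words=POSTPOSITIONS):
    """Drop tokens that exactly match a stop word; attached forms are kept."""
    return [t for t in tokens if t not in stop_words]

# "घरबाट" (from the house) carries the postposition as a suffix, so it is kept.
# The stand-alone token "द्वारा" matches exactly and is removed.
result = remove_independent(["घरबाट", "राम", "द्वारा"])
```

Handling attached postpositions would require a separate step, such as stemming or suffix stripping, which exact-match filtering does not provide.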
Other Common Words
Other common words include common names, places and temporal words. The common names list contains names of things or people that occur frequently in a corpus. The places list contains names of countries, zones, districts and other commonly occurring place names. The temporal word list contains time-related words such as months and days. Some of these words are listed below.
#common names
कुमारी गणेश चित्रकार राम
#places
कोशी गण्डकी नवलपरासी नुवाकोट ग्रीनल्याण्ड नर्वे ग्रीस
#temporal words
बैशाख जेष्ठ आषाढ श्रावण
The complete stop word list is available at Nepali NLP Group.