This is the first post in the series “Clustering Text Documents”. In this post, we’ll mathematically define the TF-IDF algorithm, work through an example, and implement it in Python.   TF-IDF is a popular method used in text mining and information retrieval to evaluate the importance of a word in a text […]

  Negation is used to express the opposite meaning of affirmative sentences. Negation in Nepali verbs is formed through affixation (suffixation and prefixation). The negative case marker न (na) is either prefixed or suffixed to verb roots or verb forms to express negation.   Negation due to Prefixation   Negation in some verb forms occurs when […]

  One of the challenges faced by statistical part-of-speech taggers is the presence of words in the test dataset that do not exist in the training dataset. Such words are called unknown words. In this blog post, we’ll look into different ways to tag unknown words.   A statistical part-of-speech tagger uses an annotated training dataset to […]

  Part-of-Speech tagging is a common sequence-tagging problem in natural language processing. It is the process of assigning a single word class label to each token in the input sentence. For example, for input: इराक सीमाबाट सेना हटाइने।, the output of the tagger is इराक-कुवेत/NN सीमा/NN बाट/II सेना/NN हटाइने/NN ।/YF.   There are various approaches […]

  A well-chosen tagset is very important in part-of-speech tagging. The NELRALEC tagset contains 112 tags, which is a large number. Using such a large tagset is not always efficient, especially when only limited annotated data is available. In this blog post, we’ll discuss a new tagset for Nepali, which is a reduced version of […]