Blog

Part-of-speech tags are word classes, or syntactic categories, of words. They carry important information about a word, its neighbours, and how they relate to each other. A word's part of speech also indicates which morphological affixes it can take. Part-of-speech tagging is an important task in natural language processing. In this blog post […]

Source: Devanagari (Unicode block)

Unicode is a standard for representing the characters of different languages as numbers called code points, usually written in hexadecimal. Each character is associated with a unique code point. In Python, a code point of up to four hexadecimal digits, which covers the Devanagari block, is written as \uXXXX, where \u indicates a Unicode escape and XXXX is the four-digit hexadecimal number. Nepali […]
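A minimal sketch of how the \uXXXX notation behaves in Python 3, using only the standard library. The helper name `code_point` is hypothetical and chosen for illustration; it is not taken from the original post.

```python
import unicodedata

# DEVANAGARI LETTER KA written as a four-digit \u escape
ka = "\u0915"

def code_point(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(ka, code_point(ka), unicodedata.name(ka))

# A word is a sequence of code points; dependent vowel signs count separately.
word = "\u0928\u0947\u092a\u093e\u0932\u0940"  # नेपाली
print(word, [code_point(ch) for ch in word])
```

Note that `len(word)` counts code points, not the syllables a reader sees, because each dependent vowel sign has its own code point.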

Text analysis applications require frequent pattern matching and searching, so regular expressions play an important role in text analysis. Regular expressions are special sequences of characters that describe patterns to search for in text. They can also be used to split and modify text. Some common text processing tasks where regular expressions can […]
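The three uses mentioned above (searching, splitting, and modifying) can be sketched with Python's standard `re` module. The sample sentence and the `<NUM>` placeholder are made up for illustration.

```python
import re

text = "Order 12 shipped, 3 items; total 450 rupees."

# Search: collect every run of digits in the text
numbers = re.findall(r"\d+", text)

# Split: break the text at commas or semicolons, dropping trailing spaces
parts = re.split(r"[,;]\s*", text)

# Modify: replace every number with a placeholder token
masked = re.sub(r"\d+", "<NUM>", text)

print(numbers)
print(parts)
print(masked)
```

In Python 3, `\d` applied to a `str` matches any Unicode decimal digit, so the same pattern would also match Devanagari digits.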

Because Nepali is a highly inflectional and derivational language, a single word can appear in many grammatical forms and meanings. For example, the verb root लेख् (lekh) can take forms such as लेख्छु (lekh-chu), लेख्छस् (lekh-chas), लेख्छेस् (lekh-ches), लेख्छ (lekh-cha), लेखी (lekh-i), लेख्यो (lekh-yo), and लेखे (lekh-e). Stemming is the process of reducing inflectional (or sometimes derivational) forms of words to their respective stems/roots by eliminating […]
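As a rough illustration of the idea, the sketch below strips the longest matching suffix from a word. It is a toy, not the method of the original post: the suffix list is an assumption derived only from the example forms of लेख् above, and dropping a trailing halant (virama) so that all forms map to the same string is likewise an assumed normalization step.

```python
VIRAMA = "\u094d"  # the halant sign that ends a bare consonant

# Hypothetical suffix list, taken only from the example forms above
SUFFIXES = ["छेस्", "छस्", "छु", "यो", "छ", "ी", "े"]

def stem(word: str) -> str:
    """Strip the longest matching suffix, then any trailing halant."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        # Require something to remain after stripping the suffix
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[:-len(suffix)]
            break
    return word.rstrip(VIRAMA)

forms = ["लेख्छु", "लेख्छस्", "लेख्छेस्", "लेख्छ", "लेखी", "लेख्यो", "लेखे"]
for form in forms:
    print(form, "->", stem(form))
```

A list this small will over-stem many real words (for instance, any word that merely ends in one of the vowel signs), so a practical stemmer needs a much fuller suffix inventory and exception handling.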

Removing stop words is a common and important step in text analysis applications. So, what are stop words, and why filter them out during pre-processing? Stop words are the words that define the structure of sentences. They are the most frequent words in a corpus, but they do not provide any value to the […]
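A minimal sketch of stop-word filtering, assuming whitespace tokenization and a tiny hand-picked English stop list. Both `STOP_WORDS` and `remove_stop_words` are hypothetical names for illustration; a real application would use a fuller, language-specific list.

```python
from collections import Counter

# Illustrative stop list only; not a standard or complete list
STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

text = "The capital of Nepal is Kathmandu and the city is in a valley"
tokens = text.split()

# The most frequent tokens in this sentence are stop words
print(Counter(t.lower() for t in tokens).most_common(2))

filtered = remove_stop_words(tokens)
print(filtered)
```

Using a set for the stop list keeps each membership check constant-time, which matters once the corpus is large.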