by Ingroj Shrestha on Sept. 4, 2017


Text analysis applications require frequent pattern matching and searching. For this reason, regular expressions play an important role in text analysis. Regular expressions are special sequences of characters that are useful for searching text; they can also be used to split and modify a given text. Some common text processing tasks where regular expressions come in handy are tokenization, chunking and stemming. In this blog post we'll look at how regular expressions in Python can be used to match and modify Nepali text.


We'll be using the re module to work with regular expressions in Python 3.

	import re

Pattern Matching


Regular expressions are built from character classes. A character class can be a single character or a set of characters enclosed within square brackets []. Some of the common character classes are listed in Table 1.


Table 1: Character Classes

	Character Class       Description
	\w  [a-zA-Z0-9_]      matches alphanumeric characters and underscore
	\W  [^a-zA-Z0-9_]     matches non-alphanumeric characters
	\d  [0-9]             matches decimal digits
	\D  [^0-9]            matches non-digit characters
	\s  [ \t\n\r\f\v]     matches whitespace characters
	\S  [^ \t\n\r\f\v]    matches non-whitespace characters
	[०-९]                 matches Devanagari digits (Python 3, Unicode strings)
	[क-ज्ञ]                matches Devanagari letters (Python 3, Unicode strings)
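A quick illustration of the Devanagari classes from Table 1 (the sample sentence here is only illustrative). Since character-class ranges operate on single code points, referencing the whole Devanagari block by code point is a convenient alternative for matching letters:

```python
import re

# Devanagari digits occupy the contiguous code points U+0966-U+096F,
# so [०-९] behaves just like [0-9] does for ASCII digits.
text = 'सन् २०७४ सालमा १२ दिन'
print(re.findall('[०-९]+', text))            # ['२०७४', '१२']

# The entire Devanagari block (U+0900-U+097F) can also be referenced
# by code point, which sidesteps the question of exactly which
# characters a range such as क-ज्ञ covers.
print(re.findall('[\u0900-\u097F]+', text))  # ['सन्', '२०७४', 'सालमा', '१२', 'दिन']
```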


All characters match themselves except for the special metacharacters . ^ $ * + ? { } [ ] \ | ( ) ; each of these metacharacters has a special meaning. You can read about the meanings of these metacharacters at: Python Regular Expressions
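When a metacharacter needs to be matched literally, it can be escaped with a backslash; a small sketch:

```python
import re

# A backslash strips a metacharacter of its special meaning, so \.
# matches a literal dot instead of "any character".
print(re.split(r'\.', 'रौतहट.घर.एक'))   # ['रौतहट', 'घर', 'एक']

# re.escape() escapes every metacharacter in a string, which is
# handy when a pattern is built from arbitrary text.
pattern = re.escape('हुन्छ?')
print(re.search(pattern, 'परिश्रम नगरी हुन्छ?') is not None)   # True
```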


Common Re Functions


The re module defines several functions. We'll only discuss the functions commonly used in text analysis.


re.split(pattern, string, maxsplit=0, flags=0) splits a given string at every occurrence of the specified pattern.

	#splitting at comma ","
	print (re.split(',', 'रौतहटमा, चलचित्रमा, घरको'))

	OUTPUT:
	['रौतहटमा', ' चलचित्रमा', ' घरको']
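The maxsplit parameter limits how many splits are made, and widening the pattern can clean up the leading spaces seen in the output above; a brief sketch:

```python
import re

# maxsplit limits the number of splits; the remainder of the string
# is returned unsplit as the last element.
print(re.split(',', 'रौतहटमा, चलचित्रमा, घरको', maxsplit=1))
# ['रौतहटमा', ' चलचित्रमा, घरको']

# Splitting on the comma plus any following whitespace drops the
# leading spaces from the resulting pieces.
print(re.split(r',\s*', 'रौतहटमा, चलचित्रमा, घरको'))
# ['रौतहटमा', 'चलचित्रमा', 'घरको']
```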

re.match(pattern, string, flags=0) matches the given pattern at the beginning of the string.

	wordlist = ['रौतहट','घरको','चलचित्र']
	print ([w for w in wordlist if re.match('च',w)])

	OUTPUT:
	['चलचित्र']

re.search(pattern, string, flags=0) searches for the given pattern anywhere in the string.

	wordlist = ['रौतहट','घर', 'एक']
	print ([w for w in wordlist if re.search('र',w)])

	OUTPUT:
	['रौतहट', 'घर']

To search the text for words ending with a certain pattern, include $ after the pattern.

	wordlist = ['रौतहट','घर', 'एक']
	print ([w for w in wordlist if re.search('र$',w)])

	OUTPUT:
	['घर']

To search the text for words starting with a certain pattern, include ^ before the pattern.

	wordlist = ['रौतहट','घर', 'एक']
	print ([w for w in wordlist if re.search('^र',w)])

	OUTPUT:
	['रौतहट']
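Since re.match always anchors at the start of the string, the ^-anchored search above is equivalent to a plain re.match; combining both anchors restricts the match to whole words. A small sketch:

```python
import re

# re.search('^र', w) and re.match('र', w) select the same words,
# because re.match is implicitly anchored at the start.
wordlist = ['रौतहट', 'घर', 'एक']
print([w for w in wordlist if re.match('र', w)])       # ['रौतहट']

# With both anchors, only an exact whole-word match succeeds.
print([w for w in wordlist if re.search('^घर$', w)])   # ['घर']
```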

re.findall(pattern, string, flags=0) returns all non-overlapping matches of the given pattern in the string.

	wordlist = ['रौतहटमा','चलचित्रमा', 'घरको']

	for w in wordlist:
		if re.findall(r'^.*मा$', w):
			print (re.findall(r'^.*मा$', w))

	OUTPUT: 
	['रौतहटमा']
	['चलचित्रमा']

re.sub(pattern, repl, string, count=0, flags=0) replaces occurrences of pattern with repl; by default every occurrence is replaced, but a positive count limits the number of replacements.

	text = 'रौतहट, घर, एक'
	text = re.sub(',', '?', text)
	print (text)

	OUTPUT:
	रौतहट? घर? एक
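The count parameter can be seen in action on the same text; a brief sketch:

```python
import re

# count limits the number of replacements; count=0 (the default)
# replaces every occurrence.
text = 'रौतहट, घर, एक'
print(re.sub(',', '?', text, count=1))   # रौतहट? घर, एक
```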

Stemming using Regular Expressions


Stemming is the process of eliminating affixes from a word to obtain its stem. In Iterative Rule-based Stemming in Nepali we discussed an iterative rule-based stemmer for Nepali that strips suffixes from words using a set of hand-written suffix rules. Here, we discuss how regular expressions can be used to extract stems.

	wordlist = ['रौतहटमा', 'चलचित्रमा', 'घरको']

	for w in wordlist:
		if re.findall(r'^.*(मा|को)$', w):
			print (re.findall(r'^.*(मा|को)$', w))

	OUTPUT: 
	['मा']
	['मा']
	['को']

Even though the regular expression r'^.*(मा|को)$' matches the entire word, the output contains only the suffix part. This is because, when a pattern contains groups, re.findall returns the captured groups rather than the whole match. To get the entire text containing the listed suffixes, ?: needs to be added immediately after the opening parenthesis, which makes the group non-capturing.

	wordlist = ['रौतहटमा','चलचित्रमा', 'घरको']

	for w in wordlist:
		if re.findall(r'^.*(मा|को)$', w):
			print (re.findall(r'^.*(?:मा|को)$', w))

	OUTPUT: 
	['रौतहटमा']
	['चलचित्रमा']
	['घरको']

Since the actual requirement in stemming is the stem itself, .* in the above code is enclosed in () so that findall returns the captured stem instead.

	wordlist = ['रौतहटमा','चलचित्रमा', 'घरको']

	for w in wordlist:
		if re.findall(r'^.*(मा|को)$', w):
			print (re.findall(r'^(.*)(?:मा|को)$', w))

	OUTPUT: 
	['रौतहट']
	['चलचित्र']
	['घर']

Note: .* performs greedy repetition and matches as many characters as possible. It can be changed to .*? to make the repetition lazy, matching as few characters as possible.
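The greedy/lazy difference only shows up when the suffix alternation can match in more than one way. A sketch using a hypothetical compound suffix कोमा (को + मा) on the illustrative form घर + को + मा:

```python
import re

# With overlapping suffix alternatives, greedy and lazy repetition
# pick different stems for the same word.
word = 'घरकोमा'   # illustrative: घर + को + मा

# Greedy: .* keeps as much as possible, so the shortest suffix wins.
print(re.findall(r'^(.*)(?:कोमा|को|मा)$', word))    # ['घरको']

# Lazy: .*? keeps as little as possible, so the longest suffix wins.
print(re.findall(r'^(.*?)(?:कोमा|को|मा)$', word))   # ['घर']
```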


Tokenization using Regular Expressions


Tokenization is the process of splitting a given text into tokens. A token is a sequence of characters; it can be a sentence, a word or any other unit of text that is useful in text analysis.

#splitting text at sentence end punctuation

	text = 'परिश्रम नगरी हुन्छ? परिश्रम सफलताको एक मात्र बाटो हो। जो परिश्रम गर्छ, उही सफल हुन्छ।'
	print (re.split('(?<=[।?!]) +', text))

	OUTPUT:
	['परिश्रम नगरी हुन्छ?', 'परिश्रम सफलताको एक मात्र बाटो हो।', 'जो परिश्रम गर्छ, उही सफल हुन्छ।']

+ in the above pattern matches one or more spaces, and (?<=[।?!]) is a lookbehind assertion: it matches the position just after a sentence-ending punctuation mark without consuming it, so the punctuation is retained while splitting.
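To see why the lookbehind matters, compare with splitting on the punctuation itself, which discards it; a brief sketch:

```python
import re

# Without the lookbehind, the punctuation is consumed by the split
# pattern and disappears from the result; the trailing । also
# produces a final empty string.
text = 'परिश्रम नगरी हुन्छ? परिश्रम सफलताको एक मात्र बाटो हो।'
print(re.split('[।?!] *', text))
# ['परिश्रम नगरी हुन्छ', 'परिश्रम सफलताको एक मात्र बाटो हो', '']
```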

#splitting at word level

	text = 'राम आयो र भन्यो, "दाइ र दिदि, यति दिन कता हुनुहुन्थ्यो?"'
	text = re.sub(r'[,"\'(){}\[\]!‘’“”:?।/—-]', ' ', text)
	print (text.split())

	OUTPUT:
	['राम', 'आयो', 'र', 'भन्यो', 'दाइ', 'र', 'दिदि', 'यति', 'दिन', 'कता', 'हुनुहुन्थ्यो']
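The same token list can be produced in a single step with findall, by matching runs of characters that are neither whitespace nor one of the punctuation marks above; a sketch:

```python
import re

# Negated character class: match runs of anything that is not
# whitespace or a listed punctuation mark.
text = 'राम आयो र भन्यो, "दाइ र दिदि, यति दिन कता हुनुहुन्थ्यो?"'
print(re.findall(r'[^\s,"\'(){}\[\]!‘’“”:?।/—]+', text))
# ['राम', 'आयो', 'र', 'भन्यो', 'दाइ', 'र', 'दिदि', 'यति', 'दिन', 'कता', 'हुनुहुन्थ्यो']
```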

References


  • re - Regular expression operations

  • Python Regular Expressions

