Processing Unicode (Devanagari) in Python

Source: Devanagari (Unicode block)

 

 

Unicode is a standard that assigns every character, in every language, a unique number called a code point. Code points are written in hexadecimal and range from U+0000 to U+10FFFF. In Python, a code point of up to four hexadecimal digits is written as \uXXXX, where \u indicates Unicode and XXXX is the four-digit hexadecimal number; code points beyond U+FFFF use the eight-digit form \UXXXXXXXX.

 

Nepali text is written in the Devanagari script. The Unicode code points for the Devanagari block range from \u0900 to \u097F.
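The range above can be turned into a simple membership check. This is a minimal Python 3 sketch; the helper name is_devanagari and the two constants are made up for illustration and do not come from any library.

```python
# Python 3 sketch. is_devanagari is a hypothetical helper, not a library function.
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def is_devanagari(ch):
    """Return True if the single character ch lies in the Devanagari block."""
    return DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END


word = 'नेपाली'
flags = [is_devanagari(ch) for ch in word]
print(flags)                # one boolean per code point in the word
print(is_devanagari('a'))   # an ASCII letter is outside the block
```

Note that the check works on single code points, so consonants and vowel signs are tested separately.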

 

Find the Unicode table for Devanagari text at: Unicode/UTF-8

 

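Nepali text is commonly encoded in UTF-8, one of the most widely used character encodings. UTF-8 is a variable-width encoding: each character is represented by one to four 8-bit bytes. ASCII characters take one byte, while Devanagari characters take three bytes each.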
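To see how many bytes UTF-8 actually uses, you can encode a character and count the result. The snippet below is a small Python 3 sketch; the variable names are only illustrative.

```python
# Python 3 sketch: count the bytes UTF-8 needs for an ASCII letter
# and for a Devanagari letter.
samples = ['a', 'क']
byte_lengths = {}
for ch in samples:
    encoded = ch.encode('utf-8')      # str -> bytes
    byte_lengths[ch] = len(encoded)
    print(repr(ch), len(encoded), encoded)
```

The first byte of the encoded क is 0xe0, which is the byte that shows up in the Python 2 error messages quoted later in this post.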

 

Handling UTF-8 Encoded Texts in Python

 

The default source encoding for Python 2.x is ASCII. So, when working with non-ASCII text, declare the encoding, in this case utf-8, in the source header (the first or second line of the file).

	# -*- coding: utf-8 -*-

 

If the header is not included in the source then you will get a SyntaxError:

	SyntaxError: Non-ASCII character '\xe0' in file example.py on line 3, 
	but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

 

You don’t need the declaration in Python 3 source because the default encoding for Python 3.x is utf-8.

 

String Representations in Python

 

Python 2.x supports two different string types: str, which is an 8-bit (byte) string, and unicode, which holds Unicode strings.

 

In Python 2.x unicode strings are prefixed with u and byte strings are written as normal strings.

	#python 2.x
	s = u'unicode string in Python 2'

	b = 'byte string in Python 2'

 

A Unicode character can also be written in a string using an escape sequence.

	a = u'\u0905'
	print (a)

	OUTPUT:
	अ

 

Python 3.x has one text string type, str, which holds Unicode data, and two byte types, bytes and bytearray. In Python 3.x Unicode strings are written as normal strings and byte strings are prefixed with b.

	#python 3.x
	s = 'unicode string in Python 3'

	b = b'byte string in Python 3'	

 

Encoding and Decoding

 

When characters are stored, they are stored as bytes, not characters. This is where the concept of encoding fits in. The process of mapping characters into bytes is called encoding and decoding is the process of mapping the bytes back into characters.

 

In Python 2.x, the str() function returns a byte string. Passing it a Unicode string that contains non-ASCII characters raises a UnicodeEncodeError, because Python implicitly encodes the string with the default ASCII codec, which can represent only the code points 0-127.

	str(u'हरुले')

 

	UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-4: 
	ordinal not in range(128)

 

The Unicode string must be encoded in utf-8 by using the encode() function before passing it to the str() function. In this stage the Unicode text is converted to bytes.

	utfstring = u'हरुले'
	bytestring = utfstring.encode("utf8")

 

The unicode() function takes an 8-bit string and converts it to Unicode using the specified encoding. If no encoding is specified, the default ASCII encoding is used, and a UnicodeDecodeError occurs when the string contains bytes outside the ASCII range (0-127).

	utfstring = u'हरुले'
	bytestring = utfstring.encode("utf8")
	unicode(bytestring)

 

	UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: 
	ordinal not in range(128)

 

The UnicodeDecodeError can be fixed by calling decode() with the correct encoding. Here, the byte string is converted back to Unicode code points.

	utfstring = u'हरुले'
	bytestring = utfstring.encode("utf8")
	unicode(bytestring.decode("utf8"))	

 

In Python 3.x, str is a Unicode type, so every string literal in Python 3.x source code is a Unicode string; the default source encoding is utf-8.
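For comparison, here is a rough Python 3 version of the encode and decode round trip shown above. It is only a sketch: the variable names are illustrative, and the ASCII decode is attempted on purpose to trigger the same kind of error as in Python 2.

```python
# Python 3 sketch of the encode/decode round trip.
text = 'हरुले'                      # str: a sequence of Unicode code points
data = text.encode('utf-8')         # bytes: the UTF-8 representation
print(type(data), len(data))

restored = data.decode('utf-8')     # bytes -> str with the matching codec
print(restored == text)

# Decoding with the wrong codec fails, just as the implicit ASCII
# decoding does in Python 2.
try:
    data.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError as err:
    ascii_failed = True
    print(err)
```

Unlike Python 2, Python 3 never converts between str and bytes implicitly, so the codec is always named explicitly.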

 

Inspect Unicode Properties

 

Properties of Unicode characters can be inspected using Python's unicodedata module. The module provides access to the Unicode Character Database, which contains information about every Unicode code point.

	#-*- coding: utf-8 -*-
	import unicodedata

 

To find the code point of a Unicode character as a decimal number, use the ord() function.

	print ord(u"क")

	OUTPUT:
	2325

 

To find a Unicode character's name, use the unicodedata.name() function.

	print unicodedata.name(u"क")

	OUTPUT:
	DEVANAGARI LETTER KA
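The same module also reports the general category of a character through unicodedata.category(). The Python 3 sketch below lists the code point, name and category of a consonant and a vowel sign; the list name rows is just for illustration.

```python
# Python 3 sketch: inspect a consonant and a vowel sign.
import unicodedata

rows = []
for ch in 'का':                      # क followed by the vowel sign ा
    row = ('U+%04X' % ord(ch), unicodedata.name(ch), unicodedata.category(ch))
    rows.append(row)
    print(row)
```

The consonant is reported as a letter (category Lo), while the vowel sign is a combining mark (category Mc). That difference matters when regular expression character classes are applied to Nepali text.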

 

Unicode in re

 

The re module provides regular expression support for both 8-bit and Unicode strings. Python 3.x matches Unicode strings by default. Nepali text processing using re is explained in detail in Applications of Regular Expression in Text Analysis. Python 2.x takes a different approach to using regular expressions with Unicode strings.

 

In Python 2.x, a string literal becomes a Unicode string, in which \uXXXX escapes are interpreted, when it is prefixed with u.

	string = u'नपाली कंग्रेसका'

 

To search and match Unicode strings, prefix the regular expression pattern with ur so that it becomes a raw Unicode string. The ur prefix is valid only in Python 2.x; in Python 3.x use the r prefix instead.

 

Unicode strings can be processed using re module.

	import re

 

One common string processing task is replacing characters in a string.

	string = u'नपाली कंग्रेसका'
	replaced_string = re.sub(ur'[\u0928-\u0929]+', u'ने', string)
	print replaced_string

	OUTPUT:
	नेपाली कंग्रेसका

 

In the above example, न, which lies within the Unicode range \u0928-\u0929, is replaced with ने.

 

Finding patterns in strings is another common task in string processing. The re.compile() function compiles a regular expression pattern into a regular expression object, and its findall() method returns all the matches of the pattern in a Unicode string.

	regex = re.compile(ur'[^\u092E-\u092E]+')
	result = regex.findall(u'अमला')
	print result

	OUTPUT:
	[u'\u0905', u'\u0932\u093e'] // \u0905 = अ and \u0932\u093e = ला

 

The match() function looks for the pattern only at the beginning of the string, while the search() function looks for it anywhere in the string.

	wordlist = [u'चार', u'पाँच', u'छ']
	print ([w for w in wordlist if re.match(ur'\u091A', w)])

	OUTPUT:
	[u'\u091a\u093e\u0930'] // \u091a\u093e\u0930 = चार

	wordlist = [u'चार', u'पाँच', u'छ', u'एक']
	print ([w for w in wordlist if re.search(ur'\u093e\u0901',w)])
	# \u093e\u0901 = ाँ

	OUTPUT:
	[u'\u092a\u093e\u0901\u091a'] // \u092a\u093e\u0901\u091a = पाँच
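The examples above use Python 2 syntax. As a rough Python 3 equivalent, the sketch below repeats the substitution, match and search steps with plain r'' patterns, relying on the re module to interpret the \uXXXX escapes inside the pattern. The variable names are illustrative.

```python
# Python 3 sketch of the sub/match/search examples.
import re

# Replace the consonant न (U+0928) with ने.
fixed = re.sub(r'[\u0928-\u0929]+', 'ने', 'नपाली कंग्रेसका')
print(fixed)

wordlist = ['चार', 'पाँच', 'छ', 'एक']

# match() anchors at the start of each word: words beginning with च (U+091A).
starts_with_cha = [w for w in wordlist if re.match(r'\u091A', w)]
print(starts_with_cha)

# search() scans the whole word: words containing ाँ (U+093E U+0901).
contains_aan = [w for w in wordlist if re.search(r'\u093e\u0901', w)]
print(contains_aan)
```

In Python 3, print() shows the Devanagari text directly instead of the \uXXXX escapes that Python 2 prints for lists of Unicode strings.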

 

The regular expression character classes, such as \w, \W, \b, \B, \d, \D, \s and \S, can also be used in Unicode processing. To make these classes Unicode-aware in Python 2.x, set the re.U (re.UNICODE) flag.

	text= re.compile('[\W]+',re.UNICODE)
	print text.findall(u'नपाली')

	OUTPUT:
	[u'\u093e', u'\u0940'] // \u093e = ा and \u0940 = ी

 

The \W character class matches non-alphanumeric characters. In the above example, however, \W treats the vowel signs ा and ी as non-alphanumeric, which is wrong for Nepali text, so test the predefined character classes before relying on them.
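The same behaviour can be reproduced in Python 3, where patterns are Unicode-aware by default. One possible workaround, sketched below, is to match on the explicit Devanagari range instead of \w. The pattern name devanagari_word is made up for this example, and the range also covers Devanagari digits and the danda punctuation marks, so treat it as a starting point rather than a finished tokenizer.

```python
# Python 3 sketch: \W splits off vowel signs, an explicit range does not.
import re

word = 'नपाली'
non_word = re.findall(r'\W+', word)   # the vowel signs are reported as non-word
print(non_word)

# Hypothetical workaround: treat every code point in the Devanagari
# block (U+0900 to U+097F) as part of a word.
devanagari_word = re.compile(r'[\u0900-\u097F]+')
sentence = 'नेपाली कांग्रेसका नेता'
tokens = devanagari_word.findall(sentence)
print(tokens)
```

With the explicit range, each space-separated word comes back as a single token with its vowel signs attached.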

 

Reading and Writing Unicode

 

Reading and writing unicode data involves reading from or writing into a file with a particular encoding. We can read from or write into an encoded file using the codecs(encoders and decoders) module in Python 2.x.

 

The codecs module provides functions for encoding and decoding text in any supported encoding. A text file with a particular encoding can be opened for reading, writing or both using the codecs.open() function. By default, the file is opened in read mode ('r').

	import codecs

	#default mode
	f= codecs.open('test.txt', encoding='utf-8')

 

The codecs.open() function returns Unicode text, so in Python 2.x each line read from f should be encoded with a suitable encoding before it is printed.

	for line in f:
		line = line.strip()
		print (line.encode("utf8"))

 

To print the text in \uXXXX form, encode it using the Python-specific encoding called unicode_escape, which converts all non-ASCII characters into their respective \uXXXX forms. Code points above the ASCII range (0-127) and below 256 are represented in the two-digit form \xXX.

	INPUT:
	तराई–मधेसको

	OUTPUT:
	\u0924\u0930\u093e\u0908\u2013\u092e\u0927\u0947\u0938\u0915\u094b   //python 2

	b'\\u0924\\u0930\\u093e\\u0908\\u2013\\u092e\\u0927\\u0947\\u0938\\u0915\\u094b' //python 3
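The post does not show the call that produces this output, so the following is only a Python 3 sketch of one way to get it. The input is written with an explicit \u2013 (en dash) between the two words, which is what the escaped output above contains.

```python
# Python 3 sketch: convert text to its \uXXXX representation and back.
text = 'तराई\u2013मधेसको'                  # the two words joined by an en dash

escaped = text.encode('unicode_escape')     # bytes made of ASCII characters only
print(escaped)                              # bytes repr, backslashes doubled
print(escaped.decode('ascii'))              # the plain \uXXXX form

# Code points from 128 to 255 use the shorter \xXX form.
short_form = 'é'.encode('unicode_escape')
print(short_form)

# The escaping is reversible.
round_trip = escaped.decode('unicode_escape')
```

Decoding the bytes as ASCII before printing removes the b'...' wrapper and the doubled backslashes seen in the Python 3 output above.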

 

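The write() function writes Unicode text to a file opened with codecs.open(); the text is encoded as utf-8 on the way out.

	f = codecs.open('test', encoding='utf-8', mode='w+')
	f.write(u'\u092a\u0930\u093f रुपायन')
	f.close()  # flush the buffered text to disk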

 

The codecs module still works in Python 3.x, but it is no longer needed because the built-in open() function in Python 3.x accepts an encoding argument.

	with open('test.txt', encoding='utf-8') as f:
		for line in f:
			print(repr(line))

 

In Python 3.x, the write() function writes the text to a file opened with the built-in open() function.

	with open('test', encoding='utf-8', mode='w+') as f:
		f.write('\u092a\u0930\u093f रुपायन')
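To check that the text survives a trip through the file system, you can write it, read it back and compare. The Python 3 sketch below does that in a temporary directory so it does not touch your own files; the file name and variable names are placeholders.

```python
# Python 3 sketch: write Devanagari text to a UTF-8 file and read it back.
import os
import tempfile

text = '\u092a\u0930\u093f रुपायन'
path = os.path.join(tempfile.mkdtemp(), 'test.txt')   # throwaway location

with open(path, mode='w', encoding='utf-8') as f:
    f.write(text)                    # str is encoded to UTF-8 on write

with open(path, encoding='utf-8') as f:
    read_back = f.read()             # bytes are decoded back to str

with open(path, mode='rb') as f:
    raw = f.read()                   # the bytes actually stored on disk

print(read_back == text)
print(len(text), len(raw))           # code points versus bytes
```

Passing encoding='utf-8' explicitly on both open() calls avoids depending on the platform's default encoding.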

 
