One of the challenges faced by statistical part-of-speech taggers is the presence of words in test datasets that do not exist in the training dataset. Such words are called unknown words. In this blog post we’ll look into different ways to tag unknown words.
A statistical part-of-speech tagger uses an annotated training dataset to perform tagging. The tagger draws on statistical information from the training corpus when tagging words in test data. However, it has no statistical information for unknown words.
Emission Probability
Emission probability is the probability of observing a word given a particular part-of-speech tag. It is estimated from the training data as:
$$b_i(o_t) = P(o_t|q_i) = \frac{C(o_t,q_i)}{C(q_i)}$$
Where,
- $b_i(o_t)$ is the probability of an observation $o_t$ being generated from a state $q_i$
- $C(o_t,q_i)$ is the number of times the word $o_t$ occurs with the part-of-speech tag $q_i$
- $C(q_i)$ is the number of times the part-of-speech tag $q_i$ occurs in the training dataset
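As a minimal sketch of this estimate, the two counts can be collected from a tagged corpus with `collections.Counter`. The toy corpus and the function name `emission_probability` are illustrative assumptions, not part of any library.

```python
from collections import Counter

# Toy tagged corpus of (word, tag) pairs (illustrative data only).
train_data = [('घटना', 'NN'), ('हरु', 'IH'), ('नेपाल', 'NN'), ('मा', 'IE')]

# C(o_t, q_i): how often each word occurs with each tag.
pair_counts = Counter(train_data)
# C(q_i): how often each tag occurs.
tag_counts = Counter(tag for _, tag in train_data)


def emission_probability(word, tag):
    """Maximum-likelihood estimate of P(word | tag)."""
    if tag_counts[tag] == 0:
        return 0.0  # the tag never occurs in the training data
    return pair_counts[(word, tag)] / tag_counts[tag]


print(emission_probability('नेपाल', 'NN'))  # 0.5
print(emission_probability('नेपाल', 'IE'))  # 0.0
```

The tag `NN` occurs twice and the word `नेपाल` carries it once, so the estimate is 1/2.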
Let $x_1 \dots x_n$ be a sentence that contains an unknown word $o_t$. Since an unknown word does not occur in the training dataset, the value of $C(o_t,q_i)$ is zero for the word under every tag $q_i$, i.e.,
$$C(o_t,q_i) = 0$$
Therefore, the emission probability for the word is also zero: $b_i(o_t) = 0$.
As a result, every possible part-of-speech tag sequence for an observation sequence containing an unknown word has probability 0, and the tagger cannot choose between the sequences.
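The effect can be seen in a small sketch that scores every candidate tag sequence for a sentence as a product of emission probabilities. Transition probabilities are left out for brevity (an assumption of this sketch), since a zero emission factor makes the product zero regardless of them. The corpus and the unknown word `काठमाडौं` are illustrative.

```python
from collections import Counter
from itertools import product

# Toy tagged corpus (illustrative data only).
train_data = [('घटना', 'NN'), ('हरु', 'IH'), ('नेपाल', 'NN'), ('मा', 'IE')]
pair_counts = Counter(train_data)
tag_counts = Counter(tag for _, tag in train_data)
tags = sorted(tag_counts)


def emission(word, tag):
    """Unsmoothed estimate of P(word | tag)."""
    return pair_counts[(word, tag)] / tag_counts[tag]


def sequence_score(words, tag_sequence):
    """Product of emission probabilities (transitions omitted)."""
    score = 1.0
    for word, tag in zip(words, tag_sequence):
        score *= emission(word, tag)
    return score


sentence = ['काठमाडौं', 'मा']  # the first word is not in train_data
scores = {seq: sequence_score(sentence, seq)
          for seq in product(tags, repeat=len(sentence))}
print(all(score == 0.0 for score in scores.values()))  # True
```

All nine candidate sequences score 0, so nothing distinguishes the plausible taggings from the implausible ones.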
Tagging Unknown Words
Unknown words are a serious problem in part-of-speech tagging, especially for low-resource languages like Nepali. There are a number of ways to handle unknown words; some of them are discussed below.
Laplace/Additive Smoothing
Laplace (add-one) smoothing is a popular technique for smoothing categorical data. For a vocabulary $V$ of the distinct words in the training data, an observation (a word in a sentence) $o_t$, and a state (tag) $q_i$, Laplace smoothing gives:
$$P(o_t|q_i) = \frac{C(o_t,q_i) + 1}{C(q_i) + C(V) + 1}$$
Here $C(V)$ is the size of the vocabulary, and the extra 1 in the denominator accounts for the unknown word. An unknown word therefore receives a small probability of $\frac{1}{C(q_i) + C(V) + 1}$ under every tag, so the possible tag sequences for an observation sequence with one or more unknown words have non-zero probability.
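The smoothed estimate can be sketched as follows. This assumes that $V$ is the set of distinct training words and that the extra 1 in the denominator reserves one slot shared by all unseen words; the corpus and the name `laplace_emission` are illustrative.

```python
from collections import Counter

# Toy tagged corpus (illustrative data only).
train_data = [('घटना', 'NN'), ('हरु', 'IH'), ('नेपाल', 'NN'), ('मा', 'IE')]
pair_counts = Counter(train_data)
tag_counts = Counter(tag for _, tag in train_data)
# Assumption: V is the set of distinct words seen in training.
vocabulary = {word for word, _ in train_data}


def laplace_emission(word, tag):
    """Add-one smoothed P(word | tag); +1 in the denominator covers unseen words."""
    return (pair_counts[(word, tag)] + 1) / (tag_counts[tag] + len(vocabulary) + 1)


print(laplace_emission('नेपाल', 'NN'))     # (1 + 1) / (2 + 4 + 1)
print(laplace_emission('काठमाडौं', 'NN'))  # (0 + 1) / (2 + 4 + 1)
```

Under these assumptions the probabilities of the four known words plus the single unknown slot sum to 1 for each tag.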
Most Frequent Part-of-Speech Tag
One way of handling unknown words is to assume that they always take the most frequent part-of-speech tag. For all unknown words, $P(o_t|q_i) = 1$ for the most frequent tag and $P(o_t|q_i) = 0$ for the remaining tags. The most frequent part-of-speech tag in an annotated corpus can be obtained using FreqDist( ), assumed here to be the class provided by NLTK.
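    from nltk.probability import FreqDist  # FreqDist is provided by NLTK

    # Toy annotated corpus of (word, tag) pairs.
    train_data = [('घटना', 'NN'), ('हरु', 'IH'), ('नेपाल', 'NN'), ('मा', 'IE')]

    # Collect the tag of every token.
    tags = [tag for _, tag in train_data]

    # Count how often each tag occurs.
    fdist = FreqDist(tags)

    # Print the most frequent tag together with its count.
    for tag, frequency in fdist.most_common(1):
        print('{} {}'.format(tag, frequency))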
OUTPUT: NN 2
FreqDist( ) returns a dictionary-like object whose keys are part-of-speech tags and whose values are the number of words carrying each tag.
Overall Part-of-Speech Distribution
The overall part-of-speech distribution of known words can be used to estimate and assign part-of-speech tags for unknown words: if the probability of a known word having tag $q_i$ is $x$, the probability of an unknown word having tag $q_i$ is also taken to be $x$.
We can also restrict the distribution to open-class words only. The members of closed word classes can usually be listed exhaustively, so the chance that an unknown word belongs to a closed class is very slim.
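Both variants can be sketched with one helper that optionally filters by tag. The extra corpus entries, the tag label `JJ`, and the choice of `OPEN_CLASS_TAGS` are hypothetical and only serve the illustration; a real tagset would define its own open classes.

```python
from collections import Counter

# Toy corpus; words and tag labels are illustrative assumptions.
train_data = [('घटना', 'NN'), ('हरु', 'IH'), ('नेपाल', 'NN'),
              ('मा', 'IE'), ('देश', 'NN'), ('राम्रो', 'JJ')]
OPEN_CLASS_TAGS = {'NN', 'JJ'}  # hypothetical open-class set


def tag_distribution(data, allowed_tags=None):
    """Relative frequency of each tag, optionally limited to allowed_tags."""
    counts = Counter(tag for _, tag in data
                     if allowed_tags is None or tag in allowed_tags)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


overall = tag_distribution(train_data)
open_only = tag_distribution(train_data, OPEN_CLASS_TAGS)
print(overall['NN'], open_only['NN'])  # 0.5 0.75
```

Restricting to open classes moves the probability mass of the closed-class tags onto the tags an unknown word is more likely to have.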
Morphological Features
Words in Nepali are typically formed by affixation, compounding, repetition of a word or part of a word, and phonetic similarity. These morphological processes are useful for guessing the part-of-speech tags of unknown words.
Affixation is very common in Nepali. Prefixes and suffixes do more than form new words; they also carry information about the categories the words belong to. Thus, affix rules and their probability distributions can be used to predict the part-of-speech tags of unknown words.
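A rough sketch of a suffix-based guesser is given below. It learns a tag distribution for each word ending seen in training and looks up the longest matching ending of an unknown word. Suffixes are taken as raw Unicode code-point suffixes, which is a simplification; the corpus, the tag labels `VB` and `JJ`, and the names `MAX_SUFFIX_LEN` and `guess_tag_distribution` are illustrative assumptions rather than validated rules of Nepali morphology.

```python
from collections import Counter, defaultdict

# Toy corpus; words and tag labels are illustrative assumptions.
train_data = [('जान्छ', 'VB'), ('खान्छ', 'VB'), ('गर्छ', 'VB'),
              ('राम्रो', 'JJ'), ('ठूलो', 'JJ'), ('सानो', 'JJ')]
MAX_SUFFIX_LEN = 2  # longest suffix (in code points) to consider

# Count the tags observed with every suffix of length 1..MAX_SUFFIX_LEN.
suffix_tag_counts = defaultdict(Counter)
for word, tag in train_data:
    for n in range(1, min(MAX_SUFFIX_LEN, len(word)) + 1):
        suffix_tag_counts[word[-n:]][tag] += 1


def guess_tag_distribution(word):
    """Tag distribution of the longest known suffix of word, or {} if none matches."""
    for n in range(min(MAX_SUFFIX_LEN, len(word)), 0, -1):
        counts = suffix_tag_counts.get(word[-n:])
        if counts:
            total = sum(counts.values())
            return {tag: count / total for tag, count in counts.items()}
    return {}


print(guess_tag_distribution('हुन्छ'))  # {'VB': 1.0}
```

An empty result means no learned suffix matched, in which case one of the other strategies in this post could serve as a fallback.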
Compounding is a very productive process in Nepali verbs. Many combinations are possible, which is why compound words contain multiple stem morphemes. Nepali verb compounds are tagged according to the last verb in the compound.
The existing morphological rules for part-of-speech taggers in Nepali do not cover words formed by repetition of a word or part of a word, or by phonetic similarity, which leaves room for further research in Nepali linguistics.
References