Part-of-speech tagging is a common sequence-tagging problem in natural language processing. It is the process of assigning a single word-class label to each token in the input sentence. For example, for the input इराक सीमाबाट सेना हटाइने।, the output of the tagger is इराक/NN सीमा/NN बाट/II सेना/NN हटाइने/NN ।/YF.
There are various approaches to part-of-speech tagging, including rule-based linguistic, stochastic, and machine-learning-based approaches.
In this blog post, we’ll discuss part-of-speech tagging using Hidden Markov Models (HMMs) and the Viterbi algorithm.
Parameters of HMMs
Hidden Markov Models are probabilistic sequence models/classifiers. Given a sequence, a Hidden Markov Model assigns a class label to each token in the sequence. An HMM has the following parameters, illustrated with a small code sketch after the list:
- A defined start state and end state: $q_0$, $q_F$
- A set of states: $Q = q_1 q_2 \dots q_N$
- A set of transition probabilities: $A = a_{01} a_{02} \dots a_{n1} \dots a_{nn}$. Each $a_{ij}$ represents the probability of moving from state i to state j.
- A set of observation likelihoods or emission probabilities: $B = b_i(o_t)$. Each probability expresses the likelihood of an observation $o_t$ being generated from state i.
- An initial probability distribution: $\pi = \pi_1, \pi_2, \dots, \pi_N$. $\pi_i$ represents the probability that the HMM will start in state i. If a state, say j, cannot be an initial state, then $\pi_j = 0$.
- A set of legal accepting states: $Q_A = q_x, q_y, \dots$
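To make the notation concrete, the sketch below stores these parameters as plain Python dictionaries. The tags (NN, II, YF) come from the examples in this post, but the probability values are made up for illustration, and the start state is represented by a hypothetical `<s>` row in the transition table.

```python
# A minimal sketch of HMM parameters as plain Python dictionaries.
# The tags are taken from the examples in this post; the probability
# values are invented purely for illustration.

states = ["NN", "II", "YF"]  # Q: the set of states (POS tags)

# A: transition probabilities, A[prev_tag][tag] = P(tag | prev_tag).
# The "<s>" row plays the role of the start state q0 / the initial distribution pi.
A = {
    "<s>": {"NN": 0.8, "II": 0.1, "YF": 0.1},
    "NN":  {"NN": 0.4, "II": 0.4, "YF": 0.2},
    "II":  {"NN": 0.7, "II": 0.1, "YF": 0.2},
    "YF":  {"NN": 0.5, "II": 0.3, "YF": 0.2},
}

# B: emission probabilities, B[tag][word] = P(word | tag).
B = {
    "NN": {"सीमा": 0.03, "सेना": 0.05, "घर": 0.2},
    "II": {"बाट": 0.4},
    "YF": {"।": 0.9},
}
```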
Training an HMM for part-of-speech tagging requires an annotated corpus. Transition probabilities and emission probabilities are computed while training the model. Then, the best sequence of labels is determined in the decoding stage using the Viterbi algorithm.
POS Tagging with HMMs
Given the observation sequence/sentence of n words $w_1^n$, the goal of the HMM tagger is to find the most probable tag sequence $t_1^n$.
$$\hat{t}_1^n = argmax_{t_1^n} P(t_1^n|w_1^n)$$
The above equation can be simplified using Bayes’ rule.
$$\hat{t}_1^n = argmax_{t_1^n} \frac{P(w_1^n|t_1^n)P(t_1^n)}{P(w_1^n)}$$
The denominator $P(w_1^n)$ is dropped from the equation because its value is constant for all tag sequences.
$$\hat{t}_1^n = argmax_{t_1^n} {P(w_1^n|t_1^n)P(t_1^n)}$$
To simplify the calculations further, HMM taggers make two assumptions. The first is that the probability of a word appearing depends only on its own tag and is independent of neighbouring words and tags.
$$P(w_1^n|t_1^n) \approx \prod_{i=1}^n P(w_i|t_i)$$
The second is the bigram assumption: the probability of a tag appearing depends only on the previous tag.
$$P(t_1^n) \approx \prod_{i=1}^n P(t_i|t_{i-1})$$
So the simplified equation for a bigram tagger is:
$$\hat{t}_1^n = argmax_{t_1^n} {P(w_1^n|t_1^n)P(t_1^n)} \approx argmax_{t_1^n} \prod_{i=1}^n P(w_i|t_i) P(t_i|t_{i-1}) $$
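To see what the tagger is maximizing, the simplified equation can be evaluated directly for one candidate tag sequence by multiplying emission and transition probabilities along the sentence. A minimal sketch, reusing the hypothetical A and B dictionaries from the earlier parameter sketch; the decoder described later picks the tag sequence for which this score is highest.

```python
def score_tag_sequence(words, tags, A, B, start="<s>"):
    """Score P(w_1^n | t_1^n) * P(t_1^n) under the two HMM assumptions:
    the product over i of P(w_i | t_i) * P(t_i | t_{i-1})."""
    prob = 1.0
    prev = start
    for word, tag in zip(words, tags):
        prob *= A.get(prev, {}).get(tag, 0.0) * B.get(tag, {}).get(word, 0.0)
        prev = tag
    return prob

# Score one candidate tagging of a three-token fragment.
print(score_tag_sequence(["सीमा", "बाट", "सेना"], ["NN", "II", "NN"], A, B))
```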
The probabilities $P(t_i|t_{i-1})$ and $P(w_i|t_i)$ are calculated in the training phase of the tagger by counting on the annotated corpus. The transition probability is calculated using:
$$P(t_i|t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$
For example, if the tag NN occurs 10000 times in an annotated corpus and is followed by the plural-collective postposition tag IH 5000 times, then
$$P(IH|NN) = \frac{C(NN, IH)}{C(NN)} = \frac{5000}{10000} = 0.5 $$
The emission probability is calculated using:
$$P(w_i|t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$
If the word घर is tagged NN 2000 times out of those 10000 occurrences of NN, then
$$P(घर|NN) = \frac{C(NN, घर)}{C(NN)} = \frac{2000}{10000} = 0.2$$
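Since both probabilities are just relative frequencies, training reduces to counting tag bigrams and (tag, word) pairs over the annotated corpus. The sketch below assumes the corpus is given as a list of sentences, each a hypothetical list of (word, tag) pairs.

```python
from collections import defaultdict

def train_hmm(tagged_sentences, start="<s>"):
    """Estimate P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1}) and
    P(w_i | t_i) = C(t_i, w_i) / C(t_i) by counting."""
    tag_count = defaultdict(int)
    bigram_count = defaultdict(lambda: defaultdict(int))
    emit_count = defaultdict(lambda: defaultdict(int))

    for sentence in tagged_sentences:
        prev = start
        tag_count[start] += 1
        for word, tag in sentence:
            bigram_count[prev][tag] += 1
            emit_count[tag][word] += 1
            tag_count[tag] += 1
            prev = tag

    A_mle = {p: {t: c / tag_count[p] for t, c in row.items()}
             for p, row in bigram_count.items()}
    B_mle = {t: {w: c / tag_count[t] for w, c in row.items()}
             for t, row in emit_count.items()}
    return A_mle, B_mle

# A tiny hypothetical training corpus of (word, tag) pairs.
corpus = [
    [("घर", "NN"), ("बाट", "II"), ("।", "YF")],
    [("सेना", "NN"), ("हरू", "IH"), ("।", "YF")],
]
A_mle, B_mle = train_hmm(corpus)
print(A_mle["NN"], B_mle["NN"])
```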
The Viterbi Algorithm
There is more than one possible tag sequence for a given observation sequence, and the number of tag sequences grows exponentially with the length of the input. So, an efficient algorithm is required to find the most likely tag sequence, i.e. the tag sequence with the maximum probability. For this purpose we use a dynamic programming algorithm called the Viterbi algorithm.
Given an observation sequence (words) $O = o_1, o_2, \dots, o_T$ and an HMM $\lambda = (A, B)$, the Viterbi algorithm finds the most probable sequence of states (tags) $q_0, q_1, \dots, q_T, q_F$. Here, A is an N$\times$N matrix where $a_{ij}$ represents the probability of transitioning from $q_i$ to $q_j$, and B is the table of observation likelihoods, where $b_i(o_t)$ is the probability that tag i generates the word observed at time t.
The Viterbi algorithm maintains two dynamic programming tables, viterbi[N+2, T] and backpointer[N+2, T]. To compute the best tag sequence, the algorithm creates a table with one column for each observation and one row for each state. The algorithm then processes the observation sequence from left to right and computes the Viterbi value for each cell. After each cell of the first column/observation has been filled, the algorithm computes the Viterbi values for the second observation, and so on.
For the first word/observation of the input sentence, i.e. for the cells in the first column of the viterbi[] matrix, the only possible previous state is the start state, so the previous maximum probability $v_{t-1}$ is taken to be 1 and the backpointer is set to 0.
$$v_1(j) = a_{0j} b_j(o_1)$$ $$bt_1(j) = 0$$
For the remaining words/observations, the algorithm calculates the Viterbi value for each cell using:
$$v_t(j) = max_{i=1}^N v_{t-1}(i) a_{ij} b_j(o_t)$$
Similarly, each cell of the backpointer matrix stores the previous state with the maximum path probability, which is calculated using:
$$bt_t(j) = argmax_{i=1}^N v_{t-1}(i) a_{ij} b_j(o_t)$$
Where,
- $v_{t-1}(i)$ is the Viterbi path probability from the previous time step
- $a_{ij}$ is the transition probability from state i to state j
- $b_j(o_t)$ is the emission probability of the observation $o_t$ at time t, given the current state j
The best path is derived by following the backpointers from $(q_F, T)$ back to $q_0$.
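Putting the initialization, recursion, and backtrace together, the sketch below is one possible Viterbi decoder over the dictionary-based A and B from the earlier parameter sketch. It indexes cells by tag name rather than building an explicit (N+2) $\times$ T matrix, and it omits the explicit end state $q_F$ by simply taking the best cell in the last column, but the per-cell computation is the same as in the equations above.

```python
def viterbi_decode(words, tags, A, B, start="<s>"):
    """Return the most probable tag sequence for `words` under an HMM with
    transition probabilities A and emission probabilities B (dicts as above)."""
    T = len(words)
    # viterbi[t][tag]     = best path probability ending in `tag` at position t
    # backpointer[t][tag] = previous tag on that best path
    viterbi = [{} for _ in range(T)]
    backpointer = [{} for _ in range(T)]

    # Initialization: the only predecessor of the first column is the start state.
    for tag in tags:
        viterbi[0][tag] = A[start].get(tag, 0.0) * B.get(tag, {}).get(words[0], 0.0)
        backpointer[0][tag] = start

    # Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, T):
        for tag in tags:
            best_prev, best_prob = tags[0], 0.0
            for prev in tags:
                p = viterbi[t - 1][prev] * A.get(prev, {}).get(tag, 0.0)
                if p > best_prob:
                    best_prev, best_prob = prev, p
            viterbi[t][tag] = best_prob * B.get(tag, {}).get(words[t], 0.0)
            backpointer[t][tag] = best_prev

    # Termination and backtrace: start from the best cell in the last column
    # and follow the backpointers to the beginning of the sentence.
    best_last = max(viterbi[T - 1], key=viterbi[T - 1].get)
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return list(reversed(path))

# Decode a three-token fragment with the hypothetical A and B defined earlier.
print(viterbi_decode(["सीमा", "बाट", "सेना"], ["NN", "II", "YF"], A, B))
```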