by Ingroj Shrestha on Dec. 13, 2017

This blog post is the first post in the series "Clustering Text Documents". In this blog post, we'll mathematically define the TF-IDF algorithm along with an example and its python implementation.

TF-IDF is a popular method used in text mining and information retrieval to evaluate the importance of a word in a text document. Evaluating the importance of word in a document is an important step especially when working with a large textual data because such words give a meaningful insight of what the text document is about.

## TF-IDF: Basics

TF-IDF uses the product of Term Frequency and Inverse Document Frequency to weight the terms/words in a document.

$$tf-idf(t,d) = tf(t,d) * idf(t)$$

Term Frequency, $f_{t,d}$, is the local frequency of a term in the document. It is the number of times a word/term, 't', occurs in document, 'd'. For the weighted Term Frequency matrix tf(d,t), we may use the raw count of the terms $f_{t,d}$ or we may use one of the following weighting schemes.

• Binary: $tf(d,t) = 1$, if t occurs in document d. Else, $tf(d,t) = 0$.

• Log Normalization: $tf(d,t) = log(1 + f_{t,d})$, if t occurs in document d. Else, $tf(d,t) = 0$.

• Dividing raw frequency by total words in the document: $tf(d,t) = \frac{f_{d,t}}{total words in the document}$

• Terms that are frequent in a particular document, are important in identifying what the document is about. However, high frequency of a term alone cannot guarantee that it is more important than other less frequent words. If the frequent term appears in many other documents in high number then it is less important and is not indicative of the overall topic.

For example, यो, त्यो, and हो are some high frequency words in all Nepali text documents. Since they appear in all the documents in high number they are less indicative of the main theme or idea the text talks about. While the Term Frequency(TF) incorrectly assigns higher weights to such words, the ti-idf method provides a global parameter which is the Inverse Document Frequency (IDF) to correct this problem by weighing down the frequent terms and scaling up the rare and important ones.

$$idf_t = log_e(\frac{N}{df_t})$$

Where,

• N is the total number of documents in the collection.

• $df_t$ is the document frequency of term, 't'. It is the number of documents containing the term, 't'.

• ## TF-IDF: Example

The document space we'll be using for the example is given below.

d1 = "त्यो घर रातो छ"
d2 = "यो निलो कलम हो"
d3 = "भाईको घर हो"


After pre-processing (tokenization and stemming) the given documents: d1, d2, and d3, we have the five distinct terms or index vocabulary.

['घर': 0, 'रातो': 1, 'निलो':2, 'कलम': 3, 'भाई': 4]


Now, we'll represent the above documents in the vector space model by using a term-document matrix and term-frequency.

\begin{matrix} ~ & t_0 & t_1 & t_2 & t_3 & t_4 \\ d1 & 1 & 1 & 0 & 0 & 0 \\ d2 & 0 & 0 & 1 & 1 & 0 \\ d3 & 1 & 0 & 0 & 0 & 1 \\ \end{matrix}

We have the term-frequency matrix as:

$$M_{tf} = \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ \end{bmatrix}$$

Now, we'll calculate the inverse document frequency (idf) for each term above. We will be using the smoothing enabled idf, so we'll calculate idf by using the following formula.

$$idf = ln(\frac{N + 1}{df + 1}) + 1$$

$$idf(t_0) = ln(\frac{4}{3}) + 1 = 1.28768207$$

$$idf(t_1) = ln(\frac{4}{2}) + 1 = 1.69314718$$

$$idf(t_2) = ln(\frac{4}{2}) + 1 = 1.69314718$$

$$idf(t_3) = ln(\frac{4}{2}) + 1 = 1.69314718$$

$$idf(t_4) = ln(\frac{4}{2}) + 1 = 1.69314718$$

We can represent the above idf values using a vector as:

$$\vec{idf} = (1.28768207, 1.69314718, 1.69314718, 1.69314718, 1.69314718)$$

The idf matrix, $M_{idf}$, is a square diagonal matrix whose dimensions are equal to that of the vector $\vec{idf}$. So, we have the inverse document frequency matrix as:

$$M_{idf} = \begin{bmatrix} 1.28768207 & 0 & 0 & 0 & 0 \\ 0 & 1.69314718 & 0 & 0 & 0 \\ 0 & 0 & 1.69314718 & 0 & 0 \\ 0 & 0 & 0 & 1.69314718 & 0 \\ 0 & 0 & 0 & 0 & 1.69314718 \\ \end{bmatrix}$$

Finally, we get the tf-idf matrix, $M_{tf-idf}$, matrix by multiplying $M_{tf}$ and $M_{idf}$.

$$M_{tf-idf} = M_{tf} * M_{idf}$$

$$= \begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ \end{bmatrix} * \begin{bmatrix} 1.28768207 & 0 & 0 & 0 & 0 \\ 0 & 1.69314718 & 0 & 0 & 0 \\ 0 & 0 & 1.69314718 & 0 & 0 \\ 0 & 0 & 0 & 1.69314718 & 0 \\ 0 & 0 & 0 & 0 & 1.69314718 \\ \end{bmatrix}$$

$$= \begin{bmatrix} 1.28768207 & 1.69314718 & 0 & 0 & 0 \\ 0 & 0 & 1.69314718 & 1.69314718 & 0 \\ 1.28768207 & 0 & 0 & 0 & 1.69314718 \\ \end{bmatrix}$$

Using the Euclidean norm (L2-norm) we'll normalize the $M_{tf-idf}$ matrix. So we have,

$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}}$$

$$M_{tf-idf} = \frac{M_{tf-idf}}{||M_{tf-idf}||_2}$$

$$= \begin{bmatrix} 0.60534851 & 0.79596054 & 0 & 0 & 0 \\ 0 & 0 & 0.70710678 & 0.70710678 & 0 \\ 0.60534851 & 0 & 0 & 0 & 0.79596054 \\ \end{bmatrix}$$

## TF-IDF: Implementation using Scikit-learn

Scikit-learn provides two different methods to get the tf-idf matrix, $M_{tf-idf}$, for a document collection. We'll use the above document space for our code as well.

The first method we are going to discuss works in two parts. First, a term-frequency matrix is created by the CountVectorizer() class. The class converts a collection of text documents into a matrix of term counts i.e. the number of times the terms occur in each document. Along with term frequency counting, CountVectorizer() also implements stopwords removal and tokenization.

The second part of the method implements the TfidfTransformer() class multiplies the term frequency with idf to generate the tf-ifd matrix. By default, TfidfTransformer() uses Euclidean norm (L2-norm) with smoothing enabled.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from stem.itrstem import IterativeStemmer
from tokenize.tokenizer import Tokenizer
from rmvstopwords import StopWordRemover

stemmer = IterativeStemmer()

d1 = "त्यो घर रातो छ"
d2 = "यो निलो कलम हो"
d3 = "भाईको घरमा हो"
documents = [d1, d2, d3]

class StemmedCountVectorizer(CountVectorizer):
def build_analyzer(self):
analyzer = super(StemmedCountVectorizer, self).build_analyzer()
return lambda doc: (stemmer.stem(w) for w in analyzer(doc))

print("------------------- Method-I -------------------")
vectorizer = StemmedCountVectorizer(stop_words=StopWordRemover().get_stopwords(),
tokenizer=lambda text: Tokenizer().word_tokenize(text=text), analyzer='word')
tf_matrix = vectorizer.fit_transform(documents)
print("========= Vocabulary =========")
print(vectorizer.vocabulary_)

print(" ========= Term Frequency ===========")
print(tf_matrix.todense())
print()

tfidf = TfidfTransformer(norm="l2")
tfidf.fit(tf_matrix)

print("========= Inverse Document Frequency =========")
print(tfidf.idf_)
print()

tf_idf_matrix = tfidf.transform(tf_matrix)
print("========= TF-IDF Matrix =========")
print(tf_idf_matrix.todense())
print()


OUTPUT:

------------------- Method-I -------------------
========= Vocabulary =========
{'घर': 0, 'रातो': 1, 'निलो':2, 'कलम': 3, 'भाई': 4}
========= Term Frequency ===========
[[1 1 0 0 0]
[0 0 1 1 0]
[1 0 0 0 1]]

========= Inverse Document Frequency =========
[ 1.28768207  1.69314718  1.69314718  1.69314718  1.69314718]

========= TF-IDF Matrix =========
[[ 0.60534851  0.79596054  0.          0.          0.        ]
[ 0.          0.          0.70710678  0.70710678  0.        ]
[ 0.60534851  0.          0.          0.          0.79596054]]


The second method uses TfidfVectorizer() class, which performs term frequency counting and tf-idf weighting in a single step. It combines CountVectorizer() and TfidfTransformer() in a single model.

from sklearn.feature_extraction.text import TfidfVectorizer
from stem.itrstem import IterativeStemmer
from tokenize.tokenizer import Tokenizer
from rmvstopwords import StopWordRemover

d1 = "त्यो घर रातो छ"
d2 = "यो निलो कलम हो"
d3 = "भाईको घर हो"
documents = [d1, d2, d3]

class StemmedTfidfVectorizer(TfidfVectorizer):
def build_analyzer(self):
analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
return lambda doc: (stemmer.stem(w) for w in analyzer(doc))

print(" ------------------- Method-II -------------------")
print("========= Vocabulary =========")
print(vectorizer.vocabulary_)
tfid_vectorizer = StemmedTfidfVectorizer(stop_words=StopWordRemover().get_stopwords(),
tokenizer=lambda text: Tokenizer().word_tokenize(text=text), analyzer='word')
tf_idf_matrix = tfid_vectorizer.fit_transform(documents)

print(tf_idf_matrix.todense())

OUTPUT:

------------------- Method-II -------------------
========= Vocabulary =========
{'घर': 0, 'रातो': 1, 'निलो':2, 'कलम': 3, 'भाई': 4}

[[ 0.60534851  0.79596054  0.          0.          0.        ]
[ 0.          0.          0.70710678  0.70710678   0.        ]
[ 0.60534851  0.          0.          0.           0.79596054]]



In the next blog post of this series we'll discuss cosine similarity and its implementation in details.

## References

• Feature extraction - Scikit-learn

• TF-IDF - ML Wiki

• Machine Learning :: Text feature extraction (tf-idf)

• Tags: Data Mining , Information Retrieval , Machine Learning