Demystify TF-IDF in Indexing and Ranking

Ted Mei
Dec 14, 2019


Image source: http://filotechnologia.blogspot.com/2014/01/a-simple-java-class-for-tfidf-scoring.html

TF-IDF (term frequency-inverse document frequency) can be thought of as a numerical metric that reflects how important a word is to a document in a collection of documents (a corpus). Words that are frequent in one document but rare across the collection tend to have high TF-IDF scores.

Mathematically:

Equation (1): TF_{ij} = f_{ij} / (total number of terms in document j), where f_{ij} is the number of times term i appears in document j. In other words, TF is the raw count normalized by the document length.

Equation (2): IDF_i = log(N / n_i), where N is the total number of documents and n_i is the number of documents that contain term i.

Equation (3): TF-IDF_{ij} = TF_{ij} * IDF_i.

Notice that if we compute the TF-IDF of every word in every document, the values form a matrix with shape number_of_words * number_of_documents. So TF-IDF is a single value (or score, or weight) for one word in one document, but a matrix of values when we consider all words across all documents.

Next let’s go through a simple example to see how TF-IDF can be used in indexing and query-document ranking. A Python implementation is also provided at the end.

Say we have 3 documents:

Document 1: Machine learning teaches machine how to learn

Document 2: Machine translation is my favorite subject

Document 3: Term frequency and inverse document frequency is important

First we compute f_{ij}, the raw frequency (count) of term i in document j that appears in Equation (1) above, purely by counting. So we have:

f_{i1}, the count of each term i in document 1: machine: 2, learning: 1, teaches: 1, how: 1, to: 1, learn: 1

f_{i2}, the count of each term i in document 2: machine: 1, translation: 1, is: 1, my: 1, favorite: 1, subject: 1

f_{i3}, the count of each term i in document 3: term: 1, frequency: 2, and: 1, inverse: 1, document: 1, is: 1, important: 1
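As a quick sanity check, these raw counts can be reproduced with a few lines of Python (a minimal sketch; the names docs and raw_counts are just illustrative, and lower-casing is applied before counting):

from collections import Counter

# the three example documents, lower-cased; no stop-word removal or stemming yet
docs = [
    "machine learning teaches machine how to learn",
    "machine translation is my favorite subject",
    "term frequency and inverse document frequency is important",
]

# f_ij: raw count of each term i in document j
raw_counts = [Counter(doc.split()) for doc in docs]
for j, counts in enumerate(raw_counts, start=1):
    print(f"f_i{j}:", dict(counts))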

Next we compute the normalized term frequency TF_{ij} using Equation (1) above. Basically, the count of each word in a document is divided by the total number of words in that document.
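Continuing the counting sketch above, the normalization in Equation (1) takes one line per document (the exact values will differ from the post’s tables if stop words are removed before counting):

# TF_ij per Equation (1): raw count divided by the document length
tf = []
for counts in raw_counts:
    total = sum(counts.values())
    tf.append({term: count / total for term, count in counts.items()})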

Next we compute the IDF of each term i based on Equation (2) above.
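A sketch of the IDF computation, continuing the snippets above (log base 2 is assumed here; the choice of base only rescales the scores):

import math

# IDF_i per Equation (2): log of (number of documents / number of documents containing term i)
N = len(docs)
vocabulary = set(term for counts in raw_counts for term in counts)
idf = {
    term: math.log2(N / sum(1 for counts in raw_counts if term in counts))
    for term in vocabulary
}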

Notice that TF has two dimensions, word i and document j, but IDF has only one dimension, word i.

Next based on Equation (3), we can compute TF-IDF for each word i in document j:

Table 1: TF-IDF value of each word in every document
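Equation (3) then just multiplies the two quantities computed in the snippets above; the exact numbers in Table 1 depend on the preprocessing and log-base conventions used:

# TF-IDF_ij = TF_ij * IDF_i per Equation (3)
tf_idf = [{term: tf_j[term] * idf[term] for term in tf_j} for tf_j in tf]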

This can be stored as a TF-IDF index: for each document, we store the TF-IDF value of each term. In practice, each document also needs to go through tokenization (splitting sentences into words), stop-word filtering (removing words such as “the” and “is”), and then stemming (for instance, in our case “learning” and “learn” should be reduced to the same stem “learn”).
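A minimal preprocessing sketch is shown below; the stop-word list here is a tiny hand-picked set just for this example (a real system would use a larger list such as nltk.corpus.stopwords), and NLTK’s Porter stemmer is used for stemming:

from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "is", "to", "how", "my", "and"}  # illustrative only
stemmer = PorterStemmer()

def preprocess(text):
    # tokenize by whitespace, lower-case, drop stop words, then stem
    tokens = text.lower().split()
    return [stemmer.stem(tok) for tok in tokens if tok not in STOP_WORDS]

print(preprocess("Machine learning teaches machine how to learn"))
# ['machin', 'learn', 'teach', 'machin', 'learn'] (the Porter stemmer maps "machine" to "machin")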

Also, as a side note, sklearn’s TfidfVectorizer generates a TF-IDF matrix of shape number_of_documents * vocabulary_size. If a word in the vocabulary does not appear in a document, the corresponding entry is 0, so the matrix it generates is very sparse. A code snippet using the library is as follows:

from sklearn.feature_extraction.text import TfidfVectorizer
# data['post'], X_train and X_test are assumed to be collections of raw text
# (the full corpus and its train/test splits)
# max_features specifies the vocabulary size
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(data['post'])
xtrain_tfidf = tfidf_vect.transform(X_train)
xvalid_tfidf = tfidf_vect.transform(X_test)
# type(xtrain_tfidf) gives scipy.sparse.csr.csr_matrix
# you can view the content of the TF-IDF matrix by doing: xtrain_tfidf.todense()
# you can view the vocabulary by doing tfidf_vect.get_feature_names()
# (renamed to get_feature_names_out() in scikit-learn >= 1.0)

Below is a minimal TF-IDF indexing sketch in Python that follows Equations (1)-(3) above (function and variable names are illustrative; log base 2 is assumed):
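import math
from collections import Counter

def build_tfidf_index(documents):
    """Build a TF-IDF index from a list of token lists.

    Assumes documents are already preprocessed (tokenized, stop words removed,
    stemmed). Returns ({doc_id: {term: tf-idf}}, {term: idf}).
    """
    n_docs = len(documents)
    # document frequency: the number of documents each term appears in
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log2(n_docs / count) for term, count in df.items()}

    index = {}
    for doc_id, doc in enumerate(documents):
        counts = Counter(doc)
        index[doc_id] = {term: (c / len(doc)) * idf[term] for term, c in counts.items()}
    return index, idf

# the three example documents after preprocessing
docs = [
    ["machine", "learn", "teach", "machine", "learn"],
    ["machine", "translation", "favorite", "subject"],
    ["term", "frequency", "inverse", "document", "frequency", "important"],
]
index, idf = build_tfidf_index(docs)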

That’s all about TF-IDF in indexing. Let’s now look at how, given a query, we can rank documents using TF-IDF.

Given a query, say “machine learning”, we can compute the TF-IDF of the query in the same way, treating the query itself as a short document.

The resulting TF-IDF vector, [0.7, 1.05], can be treated as a feature representation of the query “machine learning”.
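Continuing the indexing sketch above, one way to map a query into the same space is to weight its raw term counts by the collection IDF (the resulting numbers will only match [0.7, 1.05] under the same TF and log conventions as the post):

def vectorize_query(query_tokens, idf):
    # weight the query's raw term counts by the collection IDF;
    # terms unseen in the collection get weight 0
    counts = Counter(query_tokens)
    return [counts[term] * idf.get(term, 0.0) for term in query_tokens]

query_vector = vectorize_query(["machine", "learn"], idf)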

Similarly we get the TF-IDF value of each query word in every document as shown in Table 1 above:

So the vector [0.4, 0.3] is the feature representation of document 1 given this query; [0.23, 0] is the representation of document 2; and [0.0, 0.0] is the representation of document 3.

In order to rank the documents, we compute the query-document similarity using cosine similarity, cos(q, d) = (q · d) / (|q| |d|). Plugging in the vectors above gives a similarity of about 0.94 for document 1 and about 0.55 for document 2.

As can be seen, the query-document 3 similarity is undefined, because document 3’s vector is all zeros and the cosine denominator becomes zero. In practice, we can simply ignore this data point (or define its similarity as 0), or use a different similarity measure, such as Euclidean distance, that remains defined for zero vectors.

In the end, we can see that document 1 is more similar to our query than document 2, which also matches human judgment.

A code sketch of the query-document similarity computation, using the feature vectors above, is shown below (variable names are illustrative, and all-zero vectors are assigned similarity 0 here):
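from math import sqrt

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one common convention: all-zero vectors (e.g. document 3) get similarity 0
    return dot / (norm_a * norm_b)

query = [0.7, 1.05]
doc_vectors = {"doc1": [0.4, 0.3], "doc2": [0.23, 0.0], "doc3": [0.0, 0.0]}
ranked = sorted(doc_vectors, key=lambda d: cosine_similarity(query, doc_vectors[d]), reverse=True)
print(ranked)  # ['doc1', 'doc2', 'doc3']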

That’s all about the usage of TF-IDF in indexing and query-document ranking.
