TF-IDF, which stands for Term Frequency–Inverse Document Frequency, is a statistical method for evaluating the significance of a word in a collection of documents. Term frequency (tf) measures how many times a given term appears in a document. Inverse document frequency (idf) measures the weight of the word across the collection, i.e. whether the word is common or rare across all the documents. The intuition behind tf-idf is that terms which appear in many documents are less informative than terms which are rare across the collection. Tf-idf represents text documents using the vector space model, and it is used in document classification, text summarization and recommender systems, among other use cases. In this post we look at feature extraction with tf-idf, its application in text classification, and how it can be implemented using Python-based libraries.

Feature Extraction with TF-IDF

TF-IDF combines two statistical measures: the term frequency and the inverse document frequency. The term frequency, denoted tf(t,d), is the number of times a given term t appears in the document d divided by the total number of terms in that document. The more often a term occurs in a document, the higher its tf. The inverse document frequency, denoted idf(t,D), is a measure of how much information the word provides: it weights the word by how common or rare it is across the whole collection of documents D.
Term frequency–inverse document frequency is the product of the term frequency tf(t,d) and the inverse document frequency idf(t,D):

tfidf(t,d,D) = tf(t,d) * idf(t,D)

A term gets a high tf-idf weight when it occurs often in a document (high tf) but appears in few documents overall (high idf); a term that occurs in most documents has a low idf, which pulls its tf-idf towards zero.
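As a quick illustration, below is a minimal from-scratch sketch of these formulas in Python. The function names are our own, documents are plain whitespace-separated strings, and a base-10 logarithm is assumed, matching the worked example later in this post.

import math

def tf(term, document):
    # term frequency: occurrences of the term divided by the total terms in the document
    words = document.split()
    return words.count(term) / len(words)

def idf(term, documents):
    # inverse document frequency: log of (total documents / documents containing the term);
    # assumes the term occurs in at least one document
    containing = sum(1 for doc in documents if term in doc.split())
    return math.log10(len(documents) / containing)

def tfidf(term, document, documents):
    # tf-idf is the product of the two measures
    return tf(term, document) * idf(term, documents)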

Recommended tf–idf weighting schemes

Below is a list of recommended weighting schemes for tf-idf:

[Image: recommended tf-idf weighting schemes]
Calculating TF-IDF

Given two documents, d1 and d2, with terms and their term counts as shown below:

d1 = {"the": 1, "quick": 1, "brown": 1, "fox": 1, "jumped": 1}
d2 = {"over": 1, "the": 1, "lazy": 3, "dog": 1}

The total number of terms in d1 is 5, and in d2 it is 6.

Suppose you want to calculate the tf-idf for the term "the". Below is how we achieve this.
term frequency tf:
tf(t,d) = (count of t in d) / (total terms in d)
tf("the",d1) = 1/5 = 0.2
tf("the",d2) = 1/6 ≈ 0.1667

inverse document frequency idf:
idf(t,D) = log(total documents / documents containing t)
idf("the",D) = log(2/2) = 0

term frequency–inverse document frequency tf-idf:
tf-idf(t,d,D) = tf(t,d) * idf(t,D)
tf-idf("the",d1,D) = 0.2 * 0 = 0
tf-idf("the",d2,D) = 0.1667 * 0 = 0

Using the word "the" we get a tf-idf of zero, which implies that the word is not important, since it occurs in both documents.

Now let's use the word "lazy":

tf(t,d) = (count of t in d) / (total terms in d)
tf("lazy",d1) = 0/5 = 0
tf("lazy",d2) = 3/6 = 0.5

idf(t,D) = log(total documents / documents containing t)
idf("lazy",D) = log(2/1) ≈ 0.301 (using a base-10 logarithm)

tf-idf(t,d,D) = tf(t,d) * idf(t,D)
tf-idf("lazy",d1,D) = 0 * 0.301 = 0
tf-idf("lazy",d2,D) = 0.5 * 0.301 ≈ 0.1505
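These hand calculations can be checked with a short Python script. This is a minimal sketch that represents each document as the term-count dictionaries above and uses a base-10 logarithm, as in the worked example:

import math

d1 = {"the": 1, "quick": 1, "brown": 1, "fox": 1, "jumped": 1}
d2 = {"over": 1, "the": 1, "lazy": 3, "dog": 1}
corpus = [d1, d2]

def tfidf(term, doc, docs):
    tf = doc.get(term, 0) / sum(doc.values())        # count of the term / total terms
    containing = sum(1 for d in docs if term in d)   # documents containing the term
    idf = math.log10(len(docs) / containing)         # base-10 log, as above
    return tf * idf

print(tfidf("the", d1, corpus))   # 0.0
print(tfidf("the", d2, corpus))   # 0.0
print(tfidf("lazy", d1, corpus))  # 0.0
print(tfidf("lazy", d2, corpus))  # ~0.1505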

TF-IDF in scikit-learn

Using the TfidfVectorizer method
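Scikit-learn's TfidfVectorizer handles tokenization, counting and tf-idf weighting in a single step. Below is a minimal sketch on the two example documents; note that scikit-learn's default formula applies smoothing and L2 normalisation, so the weights will differ slightly from the hand-calculated values above.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the quick brown fox jumped",
    "over the lazy lazy lazy dog",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix of shape (documents, terms)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(tfidf_matrix.toarray())              # tf-idf weight of each term in each document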

Using TfidfVectorizer and CountVectorizer
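One way to see what the tf-idf weighting adds is to put CountVectorizer (raw term counts) side by side with TfidfVectorizer (weighted counts). Again a minimal sketch, assuming the same two example documents:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "the quick brown fox jumped",
    "over the lazy lazy lazy dog",
]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)    # raw term counts per document

tfidf_vectorizer = TfidfVectorizer()
weights = tfidf_vectorizer.fit_transform(docs)   # the same counts, reweighted by tf-idf

print(count_vectorizer.get_feature_names_out())
print(counts.toarray())   # plain frequencies treat "the" like any other word
print(weights.toarray())  # tf-idf gives "the", which occurs in both documents, a lower weight

CountVectorizer followed by TfidfTransformer produces the same matrix as TfidfVectorizer with default settings, so the two-step variant is handy when the raw counts are needed as well.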

Conclusion

Term frequency–inverse document frequency is an important text processing technique for identifying the most important words in a given document. Words that appear in many documents, such as "the", "and", "at", "is" and "to", are deemed to be of less importance, while rare words are considered to be more meaningful. Tf-idf is commonly used in document classification and text summarization, among other text processing and analysis tasks.

What’s Next

In this post we have looked at term frequency–inverse document frequency; in the next post we will look at the n-gram model.
