TF-IDF, which stands for Term Frequency – Inverse Document Frequency, is a statistical method for evaluating the significance of a word in a collection of documents. Term frequency (tf) measures how many times a given term appears in a document. Inverse document frequency (idf) measures the weight of the word across the collection, i.e. whether the word is common or rare among all documents. The intuition behind tf-idf is that terms appearing in many documents are less important than terms that are rare across the collection. TF-IDF uses the vector space model for text document representation, and it is applied in document classification, text summarization and recommender systems, among other use cases. In this post we look at feature extraction with tf-idf, its application in text classification, and how it can be implemented using Python-based libraries.

**Feature Extraction with TF-IDF**

TF-IDF combines two statistical measures: the term frequency and the inverse document frequency. The term frequency, denoted *tf(t,d)*, is the number of times a given term *t* appears in document *d* divided by the total number of terms in the document; the more often a term occurs in a document, the higher its tf. The inverse document frequency, denoted *idf(t,D)*, is a measure of how much information the word provides: it shows how common or rare a given word is across all documents in the collection *D*,

*idf(t,D) = log(N / df(t))*

where *N* is the total number of documents and *df(t)* is the number of documents that contain the term *t*. Term frequency–inverse document frequency is the product of the two:

*tfidf(t,d,D) = tf(t,d) · idf(t,D)*

A term that occurs often in one document but rarely across the collection gets a high tf-idf; a term that occurs in many documents gets a low idf, which pulls its tf-idf towards zero.
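As a minimal sketch of these definitions (plain Python, using log base 10 to match the worked example below), the two measures can be implemented directly:

```python
import math

def tf(term, doc_tokens):
    # term frequency: occurrences of the term divided by total tokens in the document
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # inverse document frequency: log10(total documents / documents containing the term)
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    # tf-idf is simply the product of the two measures
    return tf(term, doc_tokens) * idf(term, corpus)
```

Note that `idf` as written assumes the term occurs in at least one document; real implementations smooth the denominator to avoid division by zero.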


**Calculating TF-IDF**

Suppose you have two documents, *d1* and *d2*, with terms and their term counts as shown below:

d1 = *{"the": 1, "quick": 1, "brown": 1, "fox": 1, "jumped": 1}*

d2 = *{"over": 1, "the": 1, "lazy": 3, "dog": 1}*

The total number of terms is 5 in *d1* and 6 in *d2*.

Suppose you want to calculate the tf-idf for the term "*the*"; here is how to do it.

Term frequency tf:

*tf(t,d) = (count of t in d) / (number of terms in d)*

*tf("the", d1) = 1/5 = 0.2*

*tf("the", d2) = 1/6 ≈ 0.1667*

Inverse document frequency idf, where *N* is the total number of documents and *df(t)* is the number of documents containing the term *t*:

*idf(t,D) = log(N / df(t))*

*idf("the", D) = log(2/2) = 0*

Term frequency – inverse document frequency tf-idf:

*tf-idf(t,d,D) = tf(t,d) · idf(t,D)*

*tf-idf("the", d1, D) = 0.2 · 0 = 0*

*tf-idf("the", d2, D) = 0.1667 · 0 = 0*

For the word "*the*" the tf-idf is zero in both documents: because it occurs in every document its idf is zero, so the word carries no discriminating information.

Now let's use the word "*lazy*":

*tf(t,d) = (count of t in d) / (number of terms in d)*

*tf("lazy", d1) = 0/5 = 0*

*tf("lazy", d2) = 3/6 = 0.5*

*idf(t,D) = log(N / df(t))*

*idf("lazy", D) = log(2/1) ≈ 0.301*

*tf-idf(t,d,D) = tf(t,d) · idf(t,D)*

*tf-idf("lazy", d1, D) = 0 · 0.301 = 0*

*tf-idf("lazy", d2, D) = 0.5 · 0.301 ≈ 0.1505*
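These hand calculations are easy to verify in a couple of lines of Python (`math.log10`, since the example uses base-10 logarithms):

```python
import math

n_docs = 2
# "the" occurs in both documents, so its idf (and hence its tf-idf) is zero
print((1 / 5) * math.log10(n_docs / 2))            # 0.0
# "lazy" is 3 of the 6 terms in d2 and occurs in only one document
print(round((3 / 6) * math.log10(n_docs / 1), 4))  # 0.1505
```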

**TF-IDF in scikit-learn**

**Using TfidfVectorizer method**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

document = "The quick brown fox jumped over the lazy dog"
vectorizer = TfidfVectorizer()
results = vectorizer.fit_transform([document])
print(results)
```

Output:

```
  (0, 7)	0.6030226891555273
  (0, 6)	0.30151134457776363
  (0, 0)	0.30151134457776363
  (0, 2)	0.30151134457776363
  (0, 3)	0.30151134457776363
  (0, 5)	0.30151134457776363
  (0, 4)	0.30151134457776363
  (0, 1)	0.30151134457776363
```
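With only one document in the corpus every term receives the same idf, so the scores above are simply the l2-normalized term counts: "the" appears twice (index 7) and the other seven words appear once each. A quick sanity check in plain Python:

```python
import math

counts = [2, 1, 1, 1, 1, 1, 1, 1]             # "the" twice, seven other words once
l2 = math.sqrt(sum(c * c for c in counts))    # l2 norm = sqrt(11)
print(round(2 / l2, 6))  # 0.603023 -> the score for "the"
print(round(1 / l2, 6))  # 0.301511 -> the score for every other word
```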

**Using TfidfVectorizer and CountVectorizer**

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import normalize
import pandas as pd

document1 = ["The quick brown fox jumped"]
document2 = ["over the lazy dog"]
doc = document1 + document2

# raw term counts, l1-normalized per document to give term frequencies
counts = CountVectorizer().fit_transform(doc)
norm_count = normalize(counts, norm='l1', axis=1)

# fit the tf-idf vectorizer to obtain the idf weights
tfidf = TfidfVectorizer()
tfidf.fit(doc)

# tf-idf = normalized term frequency multiplied by the idf weights
tf = norm_count.multiply(tfidf.idf_)

feature_names = tfidf.get_feature_names_out()
df = pd.DataFrame(tf.T.todense(), index=feature_names, columns=doc)

print("Text : \n", doc)
print("\nTerm frequency * idf : \n", tf)
print("\nInverse document frequency : \n", tfidf.idf_)
print("\nTerm frequency - inverse document frequency : \n", df)
```

Output:

```
Text : 
 ['The quick brown fox jumped', 'over the lazy dog']

Term frequency * idf : 
  (0, 0)	0.2810930216216329
  (0, 2)	0.2810930216216329
  (0, 3)	0.2810930216216329
  (0, 6)	0.2810930216216329
  (0, 7)	0.2
  (1, 1)	0.3513662770270411
  (1, 4)	0.3513662770270411
  (1, 5)	0.3513662770270411
  (1, 7)	0.25

Inverse document frequency : 
 [1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511
 1.40546511 1.        ]

Term frequency - inverse document frequency : 
        The quick brown fox jumped  over the lazy dog
brown                     0.281093           0.000000
dog                       0.000000           0.351366
fox                       0.281093           0.000000
jumped                    0.281093           0.000000
lazy                      0.000000           0.351366
over                      0.000000           0.351366
quick                     0.281093           0.000000
the                       0.200000           0.250000
```
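The idf values printed above differ from the textbook formula because, with its default `smooth_idf=True`, scikit-learn computes idf as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the document frequency of the term. A quick check reproduces the two values in the output:

```python
import math

n_docs = 2
# sklearn's default (smooth_idf=True): idf = ln((1 + n) / (1 + df)) + 1
idf_rare   = math.log((1 + n_docs) / (1 + 1)) + 1  # terms occurring in one document
idf_common = math.log((1 + n_docs) / (1 + 2)) + 1  # "the", occurring in both documents
print(round(idf_rare, 8))    # 1.40546511
print(round(idf_common, 8))  # 1.0
```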

**Conclusion**

Term frequency – inverse document frequency is an important text-processing technique for identifying the most significant words in a given document. Words that appear frequently across documents, such as "*the, and, at, is, to*", are deemed less important, while rare words are considered more meaningful. Tf-idf is commonly used in document classification, text summarization and other text processing and analysis tasks.

**What’s Next**

In this post we have looked at term frequency – inverse document frequency; in the next post we will look at the n-gram model.