Bag of Words (BoW) is a feature extraction method used to represent text in machine learning. It is a simple, easy-to-implement model that is widely used in natural language processing tasks. In the Bag of Words approach, each document is treated as a "bag" of words. However, the BoW approach does not preserve word order or the semantic relationships between words; it only records how often each word occurs in the document or sentence. The BoW model has been widely used in tasks such as text classification and computer vision. In this post we look at the Bag of Words model, how it works, its implementation in Python, and its strengths and weaknesses.

Bag of Words Model

When working with text data in machine learning, we need to convert the text into a format that machine learning algorithms can work with, which is usually numeric vectors. The Bag of Words model is one of the simplest feature extraction techniques for text data. It treats a document or sentence as an unordered collection of words, so word order and grammatical structure are ignored.

Given the following sentences d1, d2 and d3, also referred to as documents, we can build the Bag of Words representation as follows:

d1=I love AI
d2=I love Data
d3=I love Computer

We make a list of all the unique words across the documents (the vocabulary):

"I", "love", "AI", "Data", "Computer"

Then we convert each document into a vector with one entry per vocabulary word, where each entry counts how many times that word appears in the document:

d1=I love AI : [1,1,1,0,0]
d2=I love Data : [1,1,0,1,0]
d3=I love Computer : [1,1,0,0,1]

Note that the BoW model does not factor in the word order.
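
To make the point about word order concrete, here is a minimal sketch using Python's standard library. The two example sentences are made up for illustration and are not part of the documents above; the sketch only shows that two sentences with the same words in a different order produce the same bag.

```python
from collections import Counter

# Two hypothetical sentences with the same words in a different order
sentence_a = "dog bites man"
sentence_b = "man bites dog"

# A Counter maps each word to its frequency and keeps no positional information
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

# The bags compare equal even though the sentences mean different things
same_bag = bag_a == bag_b
print(same_bag)
```

Both sentences map to the same counts, which is exactly the information a BoW vector stores.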

Continuous Bag of Words (CBOW) and Skip-gram Models

Continuous Bag of Words (CBOW) is a word2vec model that predicts a target word from its surrounding context words; like BoW, it ignores the order of those context words. The Skip-gram model does the reverse and predicts the context words given a target word. Both models will be covered together with the Word Embedding model in the next post.
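
As a rough illustration of the difference between the two models, the sketch below builds the training pairs each one would learn from. The function name `training_pairs` and the window size of 1 are assumptions made for this example; it only prepares the data and does not train any model.

```python
def training_pairs(tokens, window=1):
    """Build (context, target) pairs for CBOW and (target, context) pairs for Skip-gram.

    `window` is the number of words taken on each side of the target
    (a hypothetical default chosen for this illustration).
    """
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        # Words within `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        cbow.append((context, target))                  # context words -> target
        skipgram.extend((target, c) for c in context)   # target -> each context word
    return cbow, skipgram


cbow_pairs, skipgram_pairs = training_pairs("I love AI".split(), window=1)
print(cbow_pairs)
print(skipgram_pairs)
```

For the target "love", CBOW sees the context ["I", "AI"] as a single input, while Skip-gram produces two separate pairs, ("love", "I") and ("love", "AI").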

Implementing Bag of Words Model in Python

BoW using collections
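
A minimal sketch of the worked example above, using only `collections.Counter` from the standard library. It assumes simple whitespace tokenization with no lowercasing or punctuation handling, and the helper name `bag_of_words` is chosen for this illustration.

```python
from collections import Counter

documents = ["I love AI", "I love Data", "I love Computer"]

# Whitespace tokenization (an assumption; real text usually needs more cleaning)
tokenized = [doc.split() for doc in documents]

# Build the vocabulary in order of first appearance
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)


def bag_of_words(tokens, vocabulary):
    """Return the count of each vocabulary word in `tokens`."""
    counts = Counter(tokens)
    # Counter returns 0 for words that are missing from the document
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for doc, vector in zip(documents, vectors):
    print(doc, ":", vector)
```

The vocabulary comes out as I, love, AI, Data, Computer, and the three vectors match the ones worked out by hand earlier in the post.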

BoW using sklearn

Advantages of Bag-of-Words Model

  • Simple and easy to implement
  • It is effective in tasks such as document classification

Limitations of Bag of Words

  • The BoW model does not preserve word order or the semantic relationships between words, so some meaning is lost.
  • The vectors become high-dimensional and highly sparse as the vocabulary grows, which makes them harder to model.

Conclusion

The Bag of Words model is a simple yet powerful feature extraction technique. It has had great success in natural language processing and document classification, among other tasks. The same idea also carries over to more advanced models such as the Continuous Bag of Words and Skip-gram models.

What’s Next

In this post we have looked at the Bag of Words model. In the next post we will look at the Word Embedding model.
