An n-gram is a sequence of n consecutive items in text or speech. The n refers to the number of items, which can be words, characters, DNA codes, phonemes or any other sequential data. The n-gram model captures language structure with simple statistical techniques and can be used to predict the most likely word or character that follows the current one. Each n-gram size has its own naming conventions; for example, a one-item model is called a unigram, one-gram or monomer. N-gram models are widely used in natural language processing because of their simplicity and because they scale easily to different orders of words or characters. They are applied in computational linguistics for natural language processing tasks, in computational biology for DNA sequence analysis, and in information theory. In this post we will look at what n-grams are, their applications, and how to implement them in Python.

Introduction to N-grams

An n-gram is a group of words or characters that follow one another in a given text. An n-gram of one word or character is called a unigram, two a bigram, three a trigram, and so forth. Numbers can also be used to refer to the groups, such as 1-gram, 2-gram, 3-gram, 4-gram, etc.
Given the sentence “The quick brown fox jumped over the lazy dog”, we can extract the n-grams as shown below.

1-gram: “The”, “quick”, “brown”, “fox”, “jumped”, “over”, “the”, “lazy”, “dog”
2-gram: “The quick”, “quick brown”, “brown fox”, “fox jumped”, “jumped over”, “over the”, “the lazy”, “lazy dog”
3-gram: “The quick brown”, “quick brown fox”, “brown fox jumped”, “fox jumped over”, “jumped over the”, “over the lazy”, “the lazy dog”
4-gram: “The quick brown fox”, “quick brown fox jumped”, “brown fox jumped over”, “fox jumped over the”, “jumped over the lazy”, “over the lazy dog”
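
These lists can be reproduced with a few lines of plain Python. The sketch below splits the sentence on whitespace and slides a window of n tokens across it (the helper name extract_ngrams is purely illustrative):

def extract_ngrams(text, n):
    # Slide a window of n tokens across the whitespace-split text.
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumped over the lazy dog"
for n in range(1, 5):
    print(f"{n}-gram:", extract_ngrams(sentence, n))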

Implementing N-grams in Python

There are many ways and tools for implementing n-grams in Python; in this post we will look at the Natural Language Toolkit (NLTK) and TextBlob libraries.

N-grams with NLTK

Character n-grams (tri-grams) with NLTK
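
A minimal sketch of character tri-grams with NLTK, using nltk.util.ngrams, which slides over any Python sequence, including the characters of a string (the example sentence is the one used above):

from nltk.util import ngrams

text = "The quick brown fox jumped over the lazy dog"

# A string is a sequence of characters, so ngrams() yields character tuples.
char_trigrams = ["".join(gram) for gram in ngrams(text, 3)]
print(char_trigrams[:5])  # ['The', 'he ', 'e q', ' qu', 'qui']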

Word n-grams (tri-grams) with NLTK
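
For word tri-grams, the text is first split into tokens and the same ngrams() helper is applied to the token list. A minimal sketch (word_tokenize may require downloading NLTK's punkt tokenizer models once):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# import nltk; nltk.download('punkt')  # run once if the tokenizer models are missing

text = "The quick brown fox jumped over the lazy dog"
tokens = word_tokenize(text)
word_trigrams = list(ngrams(tokens, 3))
print(word_trigrams[:3])
# [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped')]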

N-grams with TextBlob

Word n-grams (tri-grams) with TextBlob
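
TextBlob exposes n-grams directly through the ngrams() method of a TextBlob object. A minimal sketch (TextBlob tokenizes with NLTK under the hood, so its corpora may need a one-time download with python -m textblob.download_corpora):

from textblob import TextBlob

text = "The quick brown fox jumped over the lazy dog"
blob = TextBlob(text)

# ngrams(n=3) returns each word tri-gram as a WordList of three words.
for gram in blob.ngrams(n=3):
    print(gram)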

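Beyond extracting n-grams, the introduction noted that n-gram counts can be used to predict the most likely next word. The sketch below illustrates that idea with bigram frequencies over a tiny, purely illustrative corpus:

from collections import Counter, defaultdict

corpus = "the quick brown fox jumped over the lazy dog and the quick cat".split()

# Count how often each word follows each preceding word (bigram counts).
followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

# The most frequent follower of "the" is the predicted next word.
print(followers["the"].most_common(1))  # [('quick', 2)]
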
Applications of N-grams

Here are some of the applications of n-grams:

  • Natural language processing
  • Communication theory
  • Information retrieval

Advantages of N-grams

  • N-grams are simple to implement
  • They are scalable
  • They can be used to model the probability of out-of-vocabulary words
  • They form the basis of more complex NLP techniques

Limitations of N-grams

  • They require a proper choice of n
  • They are less powerful than more sophisticated machine learning models

Conclusion

N-grams are simple techniques used in data mining and natural language processing. They are groups of consecutively occurring items, such as words or characters, within a specified window. N-grams can be 1-grams (unigrams), 2-grams (bigrams), 3-grams (trigrams), etc. An n-gram model is a probabilistic language model for predicting the next item in a sequence (Wikipedia).

What’s Next

In this post we have looked at n-grams; in the next post we will look at the Bag of Words (BoW) model.
