Word embedding is a vector representation of the words in a document or sentence. In a word embedding space, similar words are placed close to each other while dissimilar words are placed far apart, and words with related meanings are grouped together, such as King-Queen and Man-Woman. Unlike the Bag-of-Words (BoW) model, which discards word order and grammar, word embeddings preserve the semantic relationships between words. Word embeddings have revolutionized NLP tasks and improved the performance of machine learning algorithms. They can be used for automatic feature extraction, document classification, dimensionality reduction and syntactic parsing, among other important use cases. In this post we are going to look at what word embeddings are, their applications, how they are implemented in the most widely used word embedding models such as Word2Vec, GloVe and FastText using Gensim, and why they are a strong choice for many machine learning tasks.

Word Embedding

In word embedding we represent words as vectors of real numbers. There are different approaches to learning the embeddings, including neural networks, probabilistic models and the skip-gram approach, with cosine similarity commonly used to compare the resulting vectors. Word embeddings use vector space models to map semantically related words close to each other. The vector space model (VSM) builds on the distributional hypothesis and is realised through count-based models such as Latent Semantic Analysis (LSA) and predictive models such as the neural probabilistic language model. Because word embedding models are distributional semantic models, they preserve the semantic relationships between words. Now, let's look at the most commonly used word embedding models, Word2Vec, GloVe and FastText, and implement them using Gensim.

Working With Gensim

Gensim is an open-source Python library for vector space and topic modeling developed by Radim Řehůřek. It leverages the power of NumPy, SciPy and Cython in its operations. Gensim can work with large text corpora, supports data streaming and provides efficient incremental algorithms. It ships with implementations of tf-idf, word2vec, latent semantic analysis and latent Dirichlet allocation, among others. In this post we are going to see how to implement word2vec, GloVe and FastText using Gensim.

1. Word2Vec

Word2Vec is an algorithm used to learn word embeddings. Developed by Tomas Mikolov and colleagues at Google in 2013, Word2Vec uses the Continuous Bag-of-Words (CBOW) model and the Skip-Gram model to create the embeddings. The Continuous Bag-of-Words model predicts the target word given the context words, while the Skip-Gram model predicts the context words given the target word.
The CBOW model is faster to train and gives good representations for words that occur frequently, while the Skip-Gram model is better suited to small data sets and rare words. Word2Vec is essentially a shallow, two-layer neural network that takes a corpus of text as input and outputs a vector for each word. The diagrams below show the CBOW and Skip-Gram architectures.

Continuous Bag of Words (CBOW)

Figure: CBOW model architecture

Skip-Gram Architecture

Figure: Skip-Gram model architecture

Training our Word2Vec model

We will be using the data set from the Kaggle Quora Insincere Questions Classification competition. You can download the training data here after you register and log in.

Import required libraries and load data
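
A minimal sketch of this step, assuming the competition's train.csv file is in the working directory and has a question_text column (both are assumptions about the download); Gensim's simple_preprocess is used for tokenization:

```python
import pandas as pd
from gensim.utils import simple_preprocess

# Load the Quora Insincere Questions training data (path is an assumption).
df = pd.read_csv("train.csv")

# Tokenize each question into a list of lowercase word tokens.
sentences = [simple_preprocess(text) for text in df["question_text"]]
```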

Train our Word2Vec model
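
A sketch of the training step using the Gensim 4.x API (the hyperparameters below are illustrative, not tuned):

```python
from gensim.models import Word2Vec

# Train a skip-gram Word2Vec model on the tokenized questions.
model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore words that appear fewer than 5 times
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=4,
)
```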

Querying our trained embeddings

Computing word vectors for word uk
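
A sketch, assuming the `model` trained above:

```python
# Look up the learned 100-dimensional vector for the word "uk".
vector_uk = model.wv["uk"]
print(vector_uk.shape)   # (100,)
```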

Find similar words with vectors close to the word “uk”
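
A sketch, again using the `model` trained above:

```python
# Words whose vectors are closest (by cosine similarity) to "uk".
print(model.wv.most_similar("uk", topn=10))
```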

Find the next word in the sequence (texas – london) + uk = ?
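
This analogy is answered with vector arithmetic; a sketch:

```python
# (texas - london) + uk
print(model.wv.most_similar(positive=["texas", "uk"], negative=["london"], topn=5))
```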

Find words with similar vectors to the word india
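
A sketch:

```python
print(model.wv.most_similar("india", topn=10))
```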

Find words with vectors similar to the word “bonj”, which is not in the training vocabulary
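
Plain Word2Vec has no sub-word information, so looking up a word that never appeared in the training data fails; a sketch:

```python
try:
    model.wv.most_similar("bonj")
except KeyError as err:
    # Word2Vec cannot build vectors for words it never saw during training.
    print("Out-of-vocabulary word:", err)
```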

Compute the similarity between “uk” and “us”
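
A sketch:

```python
# Cosine similarity between the vectors for "uk" and "us".
print(model.wv.similarity("uk", "us"))
```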

Save our Pre-trained Word2Vec model
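
A sketch (the file name is an arbitrary choice):

```python
# Persist the trained model (and its vocabulary) to disk.
model.save("quora_word2vec.model")
```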

Load our Pre-trained Word2Vec model
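
A sketch, assuming the file saved in the previous step:

```python
from gensim.models import Word2Vec

# Reload the saved model; training can even be resumed from here.
model = Word2Vec.load("quora_word2vec.model")
```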

Using Wiki-News Word2Vec Pre-trained model

You can download the pre-trained wiki-news-300d-1M vectors (distributed in word2vec text format) here.

Load pre-trained wiki-news embeddings
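
Because the file is in plain-text word2vec format, it can be loaded as keyed vectors; a sketch (file path is an assumption):

```python
from gensim.models import KeyedVectors

# Load the 1M-word, 300-dimensional wiki-news vectors.
wv_news = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
```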

Find similar words with vectors close to the word “india”
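
A sketch, using the `wv_news` vectors loaded above:

```python
print(wv_news.most_similar("india", topn=10))
```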

Find the next word in the sequence (king – man) + woman = ?
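
A sketch of the classic analogy; "queen" is typically among the top results:

```python
# (king - man) + woman
print(wv_news.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```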

Find words with vectors similar to the word “bonj”, which is not in the vocabulary
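
These vectors also have a fixed vocabulary, so if "bonj" is absent the lookup fails; a sketch:

```python
try:
    wv_news.most_similar("bonj")
except KeyError as err:
    print("Out-of-vocabulary word:", err)
```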

Compute the similarity between woman and man
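
A sketch:

```python
print(wv_news.similarity("woman", "man"))
```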

2. GloVe

GloVe, which stands for Global Vectors, is a word embedding algorithm developed at Stanford.
GloVe is a log-bilinear model with a weighted least-squares objective. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space (Stanford NLP).

Using GloVe for Word Embeddings

You can download the GloVe glove.840B.300d embeddings here.

Convert the GloVe embedding to Word2Vec embedding in Gensim
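
GloVe text files lack the "vocab_size dimensions" header line that the word2vec text format expects, so Gensim provides a small conversion script; a sketch (file names are assumptions):

```python
from gensim.scripts.glove2word2vec import glove2word2vec

# Write a copy of the GloVe file with the word2vec header prepended.
glove2word2vec("glove.840B.300d.txt", "glove.840B.300d.word2vec.txt")
```

In recent Gensim versions you can skip the conversion and load the GloVe file directly with `KeyedVectors.load_word2vec_format(..., no_header=True)`.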

Load the Word2Vec version of GloVe model
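
A sketch; note that this is a very large file, so loading takes a while and needs plenty of RAM:

```python
from gensim.models import KeyedVectors

glove_wv = KeyedVectors.load_word2vec_format("glove.840B.300d.word2vec.txt")
```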

Find the next word in the sequence (king – man) + woman = ?
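
The same analogy against the GloVe vectors; a sketch:

```python
print(glove_wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```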

Find similar words with vectors close to the word “india”
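
A sketch:

```python
print(glove_wv.most_similar("india", topn=10))
```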

Compute the similarity between woman and man
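
A sketch:

```python
print(glove_wv.similarity("woman", "man"))
```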

3. FastText

FastText is an embedding algorithm that uses sub-words rather than whole words as in the Word2Vec model. FastText is an extension of the Word2Vec algorithm developed at Facebook in 2016, and pre-trained vectors are available for 157 languages, trained on Wikipedia and Common Crawl. FastText uses a character n-gram approach to divide a given word into sub-words that are mapped into the vector space. This makes it effective at handling the out-of-vocabulary (OOV) problem, which occurs when certain words are not present in the training set, so most models cannot generate vectors for such words.

Building our Word Embedding with FastText

Load our data
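
The same Quora training file and tokenization as in the Word2Vec section; a sketch with the same assumptions about the file path and column name:

```python
import pandas as pd
from gensim.utils import simple_preprocess

df = pd.read_csv("train.csv")
sentences = [simple_preprocess(text) for text in df["question_text"]]
```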

Train our fastText model on our data set
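
A sketch using the Gensim 4.x API; the character n-gram range and other hyperparameters are illustrative:

```python
from gensim.models import FastText

# Train a fastText model; character n-grams give it sub-word information.
ft_model = FastText(
    sentences,
    vector_size=100,
    window=5,
    min_count=5,
    min_n=3,          # shortest character n-gram
    max_n=6,          # longest character n-gram
    workers=4,
)
```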

Find similar words with vectors close to the word “india”
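
A sketch, using the `ft_model` trained above:

```python
print(ft_model.wv.most_similar("india", topn=10))
```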

Find words with vectors similar to the word “bonj”, which is not in the training vocabulary
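
Unlike Word2Vec, fastText composes a vector for an unseen word from its character n-grams, so this query succeeds; a sketch:

```python
# "bonj" never appeared in training, but its n-grams still map it into the space.
print(ft_model.wv.most_similar("bonj", topn=10))
```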

Compute the similarity between woman and man
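
A sketch:

```python
print(ft_model.wv.similarity("woman", "man"))
```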

Find the next word in the sequence (king – man) + woman = ?
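
A sketch:

```python
print(ft_model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```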

Working with FastText Pre-trained embeddings

Download the fasttext embeddings from here.

Load pre-trained fastText embeddings for Simple English
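
A sketch, assuming the Simple English Wikipedia model in fastText's native binary format has been downloaded (the file name wiki.simple.bin is an assumption); loading the .bin keeps the sub-word information:

```python
from gensim.models.fasttext import load_facebook_vectors

ft_wv = load_facebook_vectors("wiki.simple.bin")
```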

Find similar words with vectors close to the word “india”
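
A sketch, using the `ft_wv` vectors loaded above:

```python
print(ft_wv.most_similar("india", topn=10))
```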

Compute the similarity between woman and man
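
A sketch:

```python
print(ft_wv.similarity("woman", "man"))
```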

Find the next word in the sequence (king – man) + woman = ?
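
A sketch:

```python
print(ft_wv.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```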

Plotting Word Vectors with PCA
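
A sketch that projects a hand-picked set of vectors from the Word2Vec model trained earlier onto two dimensions with scikit-learn's PCA and plots them with matplotlib; the word list and the `model` variable are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# A small set of words to visualise (assumes they are all in the vocabulary).
words = ["uk", "us", "india", "london", "texas", "man", "woman", "king", "queen"]
vectors = [model.wv[w] for w in words]

# Reduce the 100-dimensional vectors to 2 principal components.
points = PCA(n_components=2).fit_transform(vectors)

plt.figure(figsize=(8, 6))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.title("Word vectors projected onto 2 principal components")
plt.show()
```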

Output

Figure: 2-D PCA projection of the word vectors

Conclusion

Natural Language Processing is one of the more challenging tasks for computers. Luckily, advances in machine learning and word embeddings have improved the task of predicting words and their contexts in natural language processing. Word embedding is a type of machine learning model that represents text data in an equivalent vector space while attempting to preserve the semantic relationships between words. We have looked at the most widely used word embedding algorithms and how to implement them with the Gensim library.

What’s Next

In this post we have looked at the word embedding models and how to implement them in Gensim. In the next post we are going to look at Recurrent Neural Networks.
