A large amount of the data being generated today is text. Unlike numeric data, which requires minimal processing, text data takes more effort and more complex tools to examine, organize and analyse.
Text data is often unstructured, riddled with errors such as grammatical mistakes, and comes in different languages. There are various ways of processing text data, including regular expressions, manual processing and text analysis tools such as the Natural Language Toolkit (NLTK), TextBlob, spaCy and many others.
Most text processing tools use Natural Language Processing (NLP) techniques. In this post we are going to look at text processing with NLTK.

Text Processing With NLTK

In text processing we are interested in the syntactic manipulation of text data, not in its semantic representation. NLTK is a widely used open source NLP platform for developing natural language products. It has an easy-to-use interface with many features for natural language processing tasks, and it is applied in fields such as empirical linguistics, cognitive science, artificial intelligence and machine learning.
Before we use NLTK we need to install it and the relevant packages. NLTK is cross-platform and can be installed using the pip command.
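For example, from a terminal or command prompt:

```shell
pip install nltk
```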

If you are using Anaconda distribution you can use conda command from the Anaconda Prompt i.e.
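The equivalent conda command is:

```shell
conda install nltk
```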

These commands install NLTK; however, you need to install additional libraries and corpora before you can make full use of it.

Open your favourite machine learning editor such as Jupyter Notebook, Spyder, etc. and run the following snippet.
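The snippet launches the NLTK Downloader. Note that it opens an interactive window, so it is best run from a notebook or console session:

```python
import nltk

nltk.download()  # opens the interactive NLTK Downloader window
```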

A window will pop up prompting you to install additional packages. Click the package you want to install, or for simplicity install all the libraries that come with the platform by clicking the Download button. Now that you have installed NLTK with all its packages, let's do some text processing tasks.

NLTK comes with many corpora. To access one of them, for example the inbuilt Brown corpus, simply type the code below.

1.0 Tokenization

Tokenization is the process of splitting text into sentences or words. Let’s see how tokenization is achieved in NLTK.

Sentence Tokenization

Word Tokenization

Tokenizing Non-English text

2.0 Stemming

Stemming is the process of removing affixes from words. It strives to shorten words to their root forms. For example, a word such as running can be stemmed to run, and sleeping can be reduced to sleep.

Stemming English words
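A minimal example with the Porter stemmer (the example words are my own):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "sleeping", "increases"]:
    print(word, "->", stemmer.stem(word))
# running -> run, sleeping -> sleep, increases -> increas
```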

Check the languages supported by NLTK
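The Snowball stemmer exposes its supported languages as a class attribute:

```python
from nltk.stem import SnowballStemmer

# languages with a Snowball stemming algorithm available in NLTK
print(SnowballStemmer.languages)
```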

Stemming Non-English words 
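Picking one of the supported languages, for example French (the sample word is my own):

```python
from nltk.stem import SnowballStemmer

french_stemmer = SnowballStemmer("french")
stem = french_stemmer.stem("manger")
print(stem)
```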

3.0 Lemmatization

Lemmatizing is similar to stemming; however, the result of lemmatizing is a real word. For example, stemming the word “increases” gives “increas” instead of “increase”. With lemmatization we get a grammatically correct word. To lemmatize words ending with “ing” we need to provide a Part-Of-Speech (POS) argument to the lemmatize function. We can pass verb, noun, adjective or adverb as the POS argument.

Lemmatizing words

Lemmatizing Verb words

Stemming has lower accuracy because it does not consider the context of words; however, it is faster than lemmatization. The choice between stemming and lemmatization depends on whether accuracy or speed matters more for your task.

4.0 Word synonyms

A synonym is a word or phrase that means exactly or nearly the same as another word or phrase in the same language. We use the synsets function from the WordNet corpus to get words that are similar.

synsets function

getting the synonyms of a given word

5.0 Word Antonyms

An antonym is a word opposite in meaning to another (e.g. bad and good).

6.0 Stop Words

Stop words are the most common words in a language, e.g. “the”, “and”, etc., and they are considered irrelevant in some NLP tasks such as search engine indexing.

List of stopwords from the nltk library

List of stop words in a text

Removing Stop Words

7.0 Part-Of-Speech (POS) Tagging

The parts of speech explain how a word is used in a sentence (http://www.grammar.cl/english/parts-of-speech.htm). There are eight major parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. These can be further divided into subclasses; for example, nouns can be divided into proper nouns, common nouns, concrete nouns, etc.

POS tagging is very useful for extracting both the syntactic and semantic representation of words.

8.0 Named Entity Recognition

Named entity recognition (NER) is the process of identifying entities such as names, locations, dates or organizations in a given text. NER is useful for information retrieval, information classification, chatbots and recommendation systems.

Conclusion

NLTK simplifies text processing tasks to a great extent with its easy-to-use libraries. Cleaning text data is one of the preliminary steps in text analytics, and it affects the quality of the end results. In text processing we use NLP techniques, which improve the quality of analysis. We have focused on the NLTK platform in this post; however, there are other tools and techniques for preprocessing text data, such as regular expressions, manual processing and other machine learning tools, which we will cover later.

What’s Next

In this post we have looked at text processing with NLTK, in the next post we are going to look at text processing with TextBlob.
