Explain the steps required before text classification?

Asked by NehaTambe in Data Science on Dec 26, 2019

Before text data can be classified, it needs to be cleaned through the following preprocessing steps:

Lowercasing

Lowercasing is one of the simplest and most effective forms of text preprocessing. It applies to most text mining and NLP problems, and it is especially helpful when the dataset is not very large, because it maps differently capitalized variants of a word to the same token and so keeps the expected output consistent.

[Figure: example of lowercased output]
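As a minimal sketch of the idea, the snippet below lowercases a few sample strings and counts how many distinct strings remain. The sample documents and the `lowercase` helper are invented for illustration and are not part of the original answer.

```python
# Hypothetical sample documents, invented for illustration.
docs = ["The Movie was GREAT", "the movie was great", "THE MOVIE WAS GREAT"]


def lowercase(text: str) -> str:
    """Return the text with every character lowercased."""
    return text.lower()


lowered = [lowercase(d) for d in docs]

# Count the distinct strings before and after lowercasing.
unique_before = len(set(docs))
unique_after = len(set(lowered))
print(unique_before, unique_after)
```

Three surface forms collapse into one, which is why lowercasing helps most when there is too little data for a model to learn that the variants mean the same thing.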

Stemming

Stemming is a crude method for reducing related words to a common form: it chops letters off the end of a word until the stem is reached.

This works fairly well for English in most cases, but there are exceptions where a more sophisticated process is required.

The two most commonly used stemmers in the NLTK library are:

a) Porter Stemmer

b) Snowball Stemmer

Of the two, the Snowball stemmer was developed later by the same author as the Porter stemmer (Martin Porter). It refines the logic of the original and is generally regarded as somewhat faster.

[Figure: flowchart of how stemming works]
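To make the "chop letters off the end" idea concrete, here is a deliberately crude suffix stripper. It is a toy written for this explanation, not the Porter or Snowball algorithm, and the suffix list and minimum stem length are assumptions. In practice one would more likely use the stemmer classes that NLTK provides.

```python
# Toy suffix list (an assumption for illustration), checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["jumps", "jumped", "jumping", "boxes", "ran"]
stems = [crude_stem(w) for w in words]
print(stems)  # ['jump', 'jump', 'jump', 'box', 'ran']
```

The regular forms all reduce to a shared stem, but "ran" is left alone rather than mapped to "run", and `crude_stem("running")` gives "runn". These are the kinds of exceptions where a more sophisticated process is needed.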

Lemmatization

In contrast to stemming, lemmatization goes beyond simple word reduction: it considers a language’s full vocabulary and applies morphological analysis to each word.

For instance, the lemma of ‘was’ is ‘be’ and the lemma of ‘mice’ is ‘mouse’. For a word like ‘meeting’, the lemma could be either ‘meet’ (verb) or ‘meeting’ (noun), so the result depends on how the word is used in the sentence.

A lemmatizer looks at the surrounding text to determine a given word’s part of speech; it does not categorize phrases.

[Figure: flowchart of how lemmatization works]
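The dependence on part of speech can be sketched with a small lookup table keyed by (word, part of speech). The table, the tag names and the `lemmatize` helper are hypothetical and cover only the examples above. A real lemmatizer, such as NLTK's `WordNetLemmatizer` with its `pos` argument, relies on a full vocabulary instead.

```python
# Hypothetical mini-vocabulary: (word, part of speech) -> lemma.
LEMMAS = {
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word and tag; fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("meeting", "VERB"))  # meet     ("We are meeting tomorrow")
print(lemmatize("meeting", "NOUN"))  # meeting  ("The meeting ran late")
print(lemmatize("Mice", "NOUN"))     # mouse
```

The same surface form yields different lemmas depending on its tag, which is why a part-of-speech tagger usually runs before lemmatization.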

Stopwords

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is” and “are”. The intuition behind removing stop words is that, by dropping low-information words from the text, we can focus on the important words instead.

Stop word removal, while effective in search and topic extraction systems, has been shown to be non-critical in classification systems. However, it does reduce the number of features under consideration, which helps keep models reasonably small.
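A rough sketch of stop word removal, using only the four example words listed above as the stop list. The whitespace tokenizer and the sample sentence are assumptions made for illustration. Real stop lists, such as the one shipped with NLTK, are much longer.

```python
# Stop list limited to the examples given in the text.
STOP_WORDS = {"a", "the", "is", "are"}


def remove_stop_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop tokens found in the stop list."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


sentence = "The cats are on a mat"  # hypothetical sample sentence
tokens = remove_stop_words(sentence)
print(tokens)  # ['cats', 'on', 'mat']
```

Six tokens shrink to three, so a bag-of-words model built on this text would have fewer features. "on" survives only because it is missing from this tiny list.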

Normalization

A highly overlooked preprocessing step is text normalization. Text normalization is the process of transforming text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed into “good”, their canonical form. Another example is mapping near-identical words such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.

Text normalization is important for noisy texts such as social media comments, text messages and comments on blog posts, where abbreviations, misspellings and out-of-vocabulary (OOV) words are prevalent.
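One simple way to approximate this is a hand-written mapping combined with a regular expression, as in the sketch below. The variant table, the noisy spellings "gooood" and "gud", and the rule that collapses repeated characters are all assumptions chosen for illustration. Production systems often rely on spelling correction or learned models instead.

```python
import re

# Hypothetical variant -> canonical form table.
VARIANTS = {
    "gud": "good",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}


def normalize(text: str) -> str:
    """Lowercase, collapse elongated characters, then map known variants."""
    text = text.lower()
    # Collapse any character repeated three or more times down to two.
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    for variant, canonical in VARIANTS.items():
        # Word boundaries keep the replacement from firing inside other words.
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    return text


print(normalize("Gooood"))                     # good
print(normalize("gud"))                        # good
print(normalize("stop-words and stop words"))  # stopwords and stopwords
```

The repeated-character rule is a heuristic: it fixes "gooood" but would turn "soooo" into "soo", so the mapping table still has to cover whatever the rule gets wrong.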
