Text data that requires preprocessing contains non-alphanumeric symbols. How can they be removed?

Asked by SnehaPandey in Data Science on Nov 12, 2019

To remove non-alphanumeric symbols from text data, use a regular expression (regex). In Python, re.compile() compiles a pattern that matches the unwanted characters, and the compiled pattern's sub() method replaces every match with a space, leaving the words separated by spaces.

Suppose a text contains the following:

A B C D ,5 .. AAA55AAA aaa.bbb.ccc

And the output should be (letters kept; digits and punctuation dropped):

'A' 'B' 'C' 'D' 'AAA' 'AAA' 'aaa' 'bbb' 'ccc'

This can be implemented with Python's re module:

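import re

# Compile a pattern that matches any character that is not an ASCII letter
rx = re.compile(r'[^a-zA-Z]')

# Replace every matched character with a space
res = rx.sub(" ", "AAA BB2BB")

print(res)  # AAA BB BB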

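Here, [^a-zA-Z] matches every character that is not an ASCII letter, so digits are removed along with punctuation, and sub() replaces each match with a space. To keep digits and remove only non-alphanumeric characters, use [^a-zA-Z0-9] instead.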
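To get the individual words from the sample text shown earlier, the substitution can be followed by split(). The following is a minimal sketch, not a definitive implementation. The names extract_words, NON_LETTERS and NON_ALNUM are hypothetical and do not come from the original answer. The second pattern is an assumption about what you would want if digits should be kept.

```python
import re

# Hypothetical names for illustration; not part of the original answer.
NON_LETTERS = re.compile(r'[^a-zA-Z]')    # any character that is not an ASCII letter
NON_ALNUM = re.compile(r'[^a-zA-Z0-9]')   # any character that is not an ASCII letter or digit


def extract_words(text, pattern=NON_LETTERS):
    """Replace every character matched by `pattern` with a space, then split on whitespace."""
    return pattern.sub(" ", text).split()


sample = "A B C D ,5 .. AAA55AAA aaa.bbb.ccc"

print(extract_words(sample))
# ['A', 'B', 'C', 'D', 'AAA', 'AAA', 'aaa', 'bbb', 'ccc']

print(extract_words(sample, NON_ALNUM))
# ['A', 'B', 'C', 'D', '5', 'AAA55AAA', 'aaa', 'bbb', 'ccc']
```

Calling split() with no arguments collapses runs of spaces, so the consecutive spaces left by the substitution do not produce empty strings in the result.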
