Some text data needs preprocessing because it contains non-alphanumeric symbols. How can they be removed?
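To remove unwanted symbols from text data, use a regular expression. re.compile() builds a reusable pattern object, and that object's sub() method replaces every match with a space, so the remaining words end up separated by spaces. Note that the pattern used here keeps letters only, so digits are removed along with the symbols.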
Suppose the text contains the following:
A B C D ,5 .. AAA55AAA aaa.bbb.ccc
And the output should be:
'A' 'B' 'C' 'D' 'AAA' 'AAA' 'aaa' 'bbb' 'ccc'
A minimal implementation with Python's re module:
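import re
# Match any single character that is not an ASCII letter.
rx = re.compile(r'[^a-zA-Z]')
# Replace each match with a space.
res = rx.sub(" ", "AAA BB2BB")
print(res)  # AAA BB BB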
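Here, [^a-zA-Z] matches any single character that is not an ASCII letter (digits, punctuation, and whitespace included), and sub() replaces each match with a space. To keep digits as well, the character class would be [^a-zA-Z0-9].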
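To get the separate words from the sample text rather than one space-padded string, one possible sketch is to split the substituted string on whitespace. The helper name letter_tokens and the constant NON_LETTER are illustrative assumptions, not part of the original answer.

```python
import re

# Hypothetical name: matches any single character that is not an ASCII letter.
NON_LETTER = re.compile(r'[^a-zA-Z]')

def letter_tokens(text):
    """Return the runs of ASCII letters in text as a list of strings."""
    # Replace each non-letter with a space, then split on runs of whitespace.
    # str.split() with no argument discards the empty strings that repeated
    # spaces would otherwise produce.
    return NON_LETTER.sub(" ", text).split()

sample = "A B C D ,5 .. AAA55AAA aaa.bbb.ccc"
print(letter_tokens(sample))
# ['A', 'B', 'C', 'D', 'AAA', 'AAA', 'aaa', 'bbb', 'ccc']
```

The same list could also be obtained with re.findall(r'[a-zA-Z]+', sample), which collects the letter runs directly instead of replacing everything else.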