A user has fitted a CountVectorizer to some documents in scikit-learn and would like to see every term and its frequency in the text corpus in order to select stop words. For example: 'and' 123 times, 'to' 100 times, 'for' 90 times, and so on. Is there a built-in function for this?

Asked by GayatriJaiteley in Data Science, asked on Dec 11, 2019
Answered by Gayatri Jaiteley

If cv is the fitted CountVectorizer and X is the vectorized corpus, then the following code should work:



It will return a list of (term, frequency) pairs, one for each distinct term that the CountVectorizer extracted from the corpus.

However, the pairs will not be sorted by frequency. Another way of doing it, which also sorts the result, is given below:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]

vec = CountVectorizer().fit(texts)  # learn the vocabulary

bag_of_words = vec.transform(texts)  # sparse document-term count matrix

sum_words = bag_of_words.sum(axis=0)  # 1 x n_terms matrix of total counts per term

words_freq = [(word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()]

words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)  # most frequent first

print(words_freq)

The above code gives output similar to the following; terms with equal counts may appear in a different order:

[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
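Since the goal in the question is to select stop words, the sorted frequencies can feed straight back into a new vectorizer. The following is only a sketch under stated assumptions: top_n is a hypothetical cutoff chosen for illustration, and treating the most frequent terms as stop words is one possible heuristic, not something the original answer prescribes.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]

vec = CountVectorizer().fit(texts)
sum_words = vec.transform(texts).sum(axis=0)

# (term, total count) pairs, most frequent first
words_freq = sorted(
    ((word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

top_n = 1  # hypothetical cutoff: how many of the most frequent terms to drop
stop_words = [word for word, _ in words_freq[:top_n]]

# Refit with the chosen terms excluded from the vocabulary
filtered_vec = CountVectorizer(stop_words=stop_words).fit(texts)
remaining_terms = sorted(filtered_vec.vocabulary_)
```

On this tiny corpus the only term dropped is 'world'; on a real corpus the list would normally be reviewed by hand before use.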

