A user has fitted a CountVectorizer to some documents in scikit-learn. He would like to see all the terms and their corresponding frequencies in the text corpus, in order to select stop-words. For example: 'and' 123 times, 'to' 100 times, 'for' 90 times, and so on. Is there a built-in function for this?
If cv is the CountVectorizer and X is the vectorized corpus, then the following code should work:
import numpy as np
list(zip(cv.get_feature_names_out(),  # use cv.get_feature_names() on scikit-learn < 1.0
         np.asarray(X.sum(axis=0)).ravel()))
It returns a list of (term, frequency) pairs, one for each distinct term the CountVectorizer extracted from the corpus (zip gives an iterator in Python 3, hence the list() call).
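If you only need to look up the count of a particular candidate stop-word, those same pairs can be turned into a dictionary. A minimal sketch, assuming cv and X are the fitted vectorizer and vectorized corpus as above:
import numpy as np

# total count of each term over the whole corpus, keyed by term
term_counts = dict(zip(cv.get_feature_names_out(),  # cv.get_feature_names() on scikit-learn < 1.0
                       np.asarray(X.sum(axis=0)).ravel()))
print(term_counts.get('and', 0))  # how often 'and' occurs; 0 if it was never seen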
However, the pairs are not sorted by frequency. Another way of doing it, which sorts the terms by their total count, is given below:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Hello world", "Python makes a better world"]

vec = CountVectorizer().fit(texts)
bag_of_words = vec.transform(texts)                 # document-term matrix
sum_words = bag_of_words.sum(axis=0)                # total count of each term over all documents
words_freq = [(word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq)
Printing words_freq gives output like the following (the relative order of terms with equal counts may vary):
[('world', 2), ('python', 1), ('hello', 1), ('better', 1), ('makes', 1)]
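Since the point of the exercise is to pick stop-words, the most frequent terms can then be fed back into a new CountVectorizer through its stop_words parameter. A minimal sketch building on the variables above; the cutoff of more than one occurrence is only an illustration for this tiny corpus:
# treat every term that occurs more than once as a stop-word (illustrative threshold)
custom_stop_words = [word for word, count in words_freq if count > 1]

vec_filtered = CountVectorizer(stop_words=custom_stop_words).fit(texts)
print(vec_filtered.get_feature_names_out())  # 'world' no longer appears in the vocabulary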