A user is working on the DonorsChoose dataset and has converted the categorical, numerical, and text features into vectors. They want to find the top 20 of the 5095 features using the absolute values of the feature_log_prob_ attribute of MultinomialNB, and also get the names of those features. How can this be done?
This can be done in the following steps.
First, get the vectorized data matrices:
print("Final Data-matrix:")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)
Now we will train the model with MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Use the optimal alpha value you found during tuning
NBModel = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
NBModel.fit(X_tr, y_train)
Now we get the feature indices sorted by the log-probability of each feature.
# .argsort() returns the feature indices in ascending order of log-probability,
# so the most probable features are at the end of each array
# For the positive class
sorted_prob_class_1_ind = NBModel.feature_log_prob_[1, :].argsort()
# For the negative class
sorted_prob_class_0_ind = NBModel.feature_log_prob_[0, :].argsort()
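To see how the ordering behaves, here is a minimal sketch on made-up numbers (the probabilities below are hypothetical, not taken from the dataset). Since log-probabilities are never positive, the feature with the largest log-probability is also the one with the smallest absolute value, so sorting the absolute values in ascending order gives the same ranking as sorting the raw values in descending order.

```python
import numpy as np

# Hypothetical log-probabilities for five features of one class
log_prob = np.log(np.array([0.10, 0.40, 0.05, 0.30, 0.15]))

# Ascending order: the least probable feature comes first
order = log_prob.argsort()

# Reverse the order, then keep the three most probable features
top3 = order[::-1][:3]

# Ranking by absolute value (ascending) gives the same three features
top3_abs = np.abs(log_prob).argsort()[:3]

print(order)     # [2 0 4 3 1]
print(top3)      # [1 3 4]
print(top3_abs)  # [1 3 4]
```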
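Now we build the list of all feature names by concatenating the feature names of the previously fitted vectorizers, in the same order in which their matrices were stacked.
# Note: in scikit-learn 1.0 and later, get_feature_names() is replaced by
# get_feature_names_out(), which returns an array; wrap each call in list()
# before concatenating with +
features_lst = list(vectorizer_essay_tfidf.get_feature_names() + vectorizer_state.get_feature_names() +
                    vectorizer_prefix.get_feature_names() + vectorizer_grade.get_feature_names() +
                    ["teacher_number_of_previously_posted_projects"] + vectorizer_cat.get_feature_names() +
                    vectorizer_subcat.get_feature_names() + ["Price"])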
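Before mapping indices to names, it can help to confirm that there is exactly one name per column of the stacked matrix. The following is a small sketch of such a check; the helper check_feature_names and the toy name lists are hypothetical and only illustrate the idea.

```python
def check_feature_names(feature_names, n_columns):
    """Return True if there is exactly one name per matrix column, else raise."""
    if len(feature_names) != n_columns:
        raise ValueError(
            f"{len(feature_names)} names for {n_columns} columns: "
            "the name list and the stacked matrix are out of sync"
        )
    return True


# Toy stand-ins for the vectorizers' vocabularies (hypothetical values)
text_names = ["books", "pencils", "tablets"]
state_names = ["ca", "ny"]
toy_names = text_names + state_names + ["Price"]

# Six names for a six-column matrix, so the check passes
ok = check_feature_names(toy_names, n_columns=6)
```

With the variables used in this answer, the call would be check_feature_names(features_lst, X_tr.shape[1]). Matching lengths do not prove that the order is right, but a mismatch does show that something was left out or added twice.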
Now we get the names from the indices
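Most_imp_words_1 = []
Most_imp_words_0 = []
# Reverse the ascending order and keep the first 20 indices, so the most
# probable feature comes first (a slice of [-20:-1] would return only 19)
for index in sorted_prob_class_1_ind[::-1][:20]:
    Most_imp_words_1.append(features_lst[index])
for index in sorted_prob_class_0_ind[::-1][:20]:
    Most_imp_words_0.append(features_lst[index])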
print("20 most imp features for positive class:\n")
print(Most_imp_words_1)
print("\n" + "-"*100)
print("\n20 most imp features for negative class:\n")
print(Most_imp_words_0)