A user is working on the DonorsChoose dataset and has converted the categorical, numerical, and text features into vectors. They want to find the top 20 of the 5095 features using the absolute values of the feature_log_prob_ attribute of MultinomialNB, and also get the names of those features. How can this be done?

Asked by ReannaHuitt in Data Science on Dec 18, 2019

This can be done in the following steps.

First, get the vectorized data matrices and check their shapes:

print("Final Data-matrix:")
print(X_tr.shape, y_train.shape)
print(X_cr.shape, y_cv.shape)
print(X_te.shape, y_test.shape)

Next, train the model with MultinomialNB:

from sklearn.naive_bayes import MultinomialNB

# Use the optimal alpha value you found during hyperparameter tuning
NBModel = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
NBModel.fit(X_tr, y_train)
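Before sorting anything, it can help to see what the fitted model exposes. The following is a minimal sketch on a made-up count matrix (X_toy, y_toy, and toy_model are hypothetical names, not part of the original data). It shows that feature_log_prob_ has one row per class and one column per feature, and that each row holds log-probabilities that exponentiate to a distribution over the features.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count matrix: 4 samples, 3 features
X_toy = np.array([[2, 0, 1],
                  [3, 1, 0],
                  [0, 4, 1],
                  [1, 3, 2]])
y_toy = np.array([1, 1, 0, 0])

toy_model = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
toy_model.fit(X_toy, y_toy)

# Shape is (n_classes, n_features); rows follow the order of toy_model.classes_
log_prob = toy_model.feature_log_prob_
print(log_prob.shape)

# Each row is a log-probability distribution over the features
row_sums = np.exp(log_prob).sum(axis=1)
print(row_sums)
```

Because the rows follow classes_, row 1 corresponds to the positive class only when the labels are 0 and 1, which is what the steps here assume.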

Next, get the feature indices sorted by the log-probability of each feature:

# .argsort() returns the feature indices sorted by ascending log-probability
# For the positive class
sorted_prob_class_1_ind = NBModel.feature_log_prob_[1, :].argsort()
# For the negative class
sorted_prob_class_0_ind = NBModel.feature_log_prob_[0, :].argsort()
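As a small illustration of what argsort returns, here is a sketch on a hypothetical array of five log-probabilities (the values are invented for the example). The result is a list of positions, not values, ordered from the lowest log-probability to the highest, so the most probable features sit at the end.

```python
import numpy as np

# Hypothetical log-probabilities for five features of one class
log_prob = np.array([-3.2, -0.7, -5.1, -1.9, -2.4])

# Indices in ascending order of log-probability (least probable first)
order = log_prob.argsort()

# The two most probable features, most probable first
top_2 = order[-2:][::-1]

print(order)
print(top_2)
```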

Now build the list of all feature names by concatenating the feature names of the previously fitted vectorizers, in the same order in which the feature blocks were stacked into X_tr:

# On scikit-learn 1.0 and later, use list(v.get_feature_names_out()) in place of v.get_feature_names()
features_lst = list(vectorizer_essay_tfidf.get_feature_names() + vectorizer_state.get_feature_names() +
                    vectorizer_prefix.get_feature_names() + vectorizer_grade.get_feature_names() +
                    ["teacher_number_of_previously_posted_projects"] + vectorizer_cat.get_feature_names() +
                    vectorizer_subcat.get_feature_names() + ["Price"])
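The name list is only useful if its order matches the column order of the stacked matrix. The sketch below shows this on two hypothetical toy columns, assuming the blocks were combined with scipy.sparse.hstack (the original post does not show how X_tr was built). The names helper is an illustrative function that falls back to get_feature_names() on older scikit-learn releases.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy columns standing in for the essay and state features
essays = ["students need books", "students need laptops"]
states = ["ca", "ny"]

vec_essay = CountVectorizer()
vec_state = CountVectorizer()
X_essay = vec_essay.fit_transform(essays)
X_state = vec_state.fit_transform(states)

# Stack the blocks in a fixed order: essay columns first, then state columns
X = hstack((X_essay, X_state)).tocsr()

def names(vectorizer):
    # get_feature_names_out() exists in scikit-learn >= 1.0;
    # older releases only provide get_feature_names()
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())

# Concatenate the names in the same order as the stacked blocks
features = names(vec_essay) + names(vec_state)

# One name per column, otherwise the indices would point at the wrong names
assert len(features) == X.shape[1]
print(features)
```

A length check like the final assert is a cheap way to catch a vectorizer that was left out or added in the wrong place.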

Finally, map the indices back to feature names:

Most_imp_words_1 = []
Most_imp_words_0 = []

# The last 20 indices of the ascending sort are the 20 highest log-probabilities
for index in sorted_prob_class_1_ind[-20:]:
    Most_imp_words_1.append(features_lst[index])

for index in sorted_prob_class_0_ind[-20:]:
    Most_imp_words_0.append(features_lst[index])

print("20 most imp features for positive class:\n")
print(Most_imp_words_1)
print("\n" + "-"*100)
print("\n20 most imp features for negative class:\n")
print(Most_imp_words_0)
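The whole procedure can be tried end to end on a tiny example. This is a minimal sketch under stated assumptions: the corpus, the labels, and the top_features helper are hypothetical and only stand in for the DonorsChoose features, and k is 2 instead of 20.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus; label 1 = positive class, 0 = negative class
docs = ["books books reading",
        "books library reading",
        "tablet games",
        "games games games tablet"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# get_feature_names_out() exists in scikit-learn >= 1.0
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = list(vectorizer.get_feature_names())

model = MultinomialNB(alpha=1.0).fit(X, labels)

def top_features(model, class_index, names, k):
    """Return the k feature names with the highest log-probability, best first."""
    order = model.feature_log_prob_[class_index, :].argsort()
    return [names[i] for i in order[-k:][::-1]]

top_pos = top_features(model, 1, feature_names, 2)
top_neg = top_features(model, 0, feature_names, 2)

print("Top features for positive class:", top_pos)
print("Top features for negative class:", top_neg)
```

Note that log-probabilities are never positive, so ranking by their absolute values reverses the order: the largest absolute value belongs to the least probable feature of that class. The steps above rank by the log-probabilities themselves, which selects the most probable features.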
