In R, a user wants to use a naive Bayes classifier to make some predictions. The real data has about 100 features. What would be the best way to select the most important features for naive Bayes classification? Below is the code:

Asked by NatashaKadam in Data Science on Dec 20, 2019

Answered by Natasha Kadam

## Test KNN Classification

train = dtm_control_tfidf_treino # train set from 1:7

test = dtm_control_tfidf_teste # test set from 8:10

cl = factor(dtm_control_tfidf_treino$class[1:7])

x = knn(train, test, cl, k = 3, prob = TRUE)

attributes(.Last.value)

The user receives the following error. How can it be fixed?

> x = knn(train, test, cl, k = 3, prob = TRUE)

Error in knn(train, test, cl, k = 3, prob = TRUE) :

'train' and 'class' have different lengths

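The problem is that when a DTM is built separately from each subset of the corpus, each DTM contains a different set of terms, so the train and test matrices do not share the same columns. Both sets need a common term list: build the DTM from all documents first, then subset that DTM to create the train and test sets. Note also that knn() raises this particular error when the number of labels in cl differs from the number of rows in train, so cl must contain exactly one label per training document. Here's an example using a built-in data set.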

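library(tm) # VCorpus, DirSource, DocumentTermMatrix

library(class) # knn

reut21578 <- system.file("texts", "crude", package = "tm")

cc <- VCorpus(DirSource(reut21578), readerControl = list(reader = readReut21578XMLasPlain))

dtm <- DocumentTermMatrix(cc) # one DTM built from all documents

train <- dtm[1:7, ] # train set from 1:7

test <- dtm[8:10, ] # test set from 8:10

cl <- factor(letters[1:7]) # placeholder labels, one per training document

knn(train, test, cl, k = 3, prob = TRUE)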
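The same idea can be sketched without the Reuters files. The documents, labels, and index variables below are hypothetical and exist only for illustration. The sketch assumes the tm and class packages are installed. It builds one DTM for all documents, splits it by row, and takes the class labels from the same rows as the training set, so that both conditions knn() depends on (matching columns, one label per training row) hold.

```r
library(tm)     # corpus and document-term matrix tools
library(class)  # knn()

# Hypothetical toy documents and labels, for illustration only
docs <- c("oil prices rise", "crude oil output falls", "opec cuts oil supply",
          "team wins the match", "striker scores late goal", "coach praises the team",
          "oil demand grows", "team loses final match")
labels <- factor(c("oil", "oil", "oil", "sport", "sport", "sport", "oil", "sport"))

# Build ONE document-term matrix over all documents so that
# train and test share the same columns (terms)
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
m <- as.matrix(dtm)

train_idx <- 1:6
test_idx <- 7:8

train <- m[train_idx, , drop = FALSE]
test <- m[test_idx, , drop = FALSE]
cl <- labels[train_idx]  # one label per training row

# Fail early if the label count or the term columns do not line up
stopifnot(length(cl) == nrow(train), ncol(train) == ncol(test))

pred <- knn(train, test, cl, k = 3, prob = TRUE)
print(pred)
```

Keeping the labels in a separate vector that is indexed with the same row indices as the training matrix makes a length mismatch hard to introduce by accident.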
