A user runs the following script and receives an error. How can it be fixed?

Asked by SnehaPandey in Data Science, asked on Nov 4, 2019
Answered by Sneha Pandey

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

The call to tree.fit fails with an error (a ValueError reporting that a string could not be converted to float).


This happens because scikit-learn estimators such as DecisionTreeClassifier expect numeric input and do not accept string-valued categorical columns directly. The categorical variables must first be encoded as numbers, for example by one-hot encoding every categorical variable in the data set.
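To see the cause concretely, the sketch below lists the feature columns that are not numeric and confirms that fitting on them fails. It is only an illustration built on the question's data; the names features, string_columns and fit_failed are introduced here and are not part of the original script.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same data as in the question, built in one step.
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

features = data[['A', 'B', 'C']]

# Columns that are not numeric hold strings and need encoding before fitting.
string_columns = features.select_dtypes(exclude='number').columns.tolist()
print(string_columns)

# Fitting on the raw features fails because 'a' and 'b' cannot be cast to float.
try:
    DecisionTreeClassifier().fit(features, data['Class'])
    fit_failed = False
except ValueError as exc:
    fit_failed = True
    print(exc)
```

Only A and B are reported as non-numeric, so those are the columns that need encoding; C can be passed through as it is.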

The following code can solve the above issue.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()

# One-hot encode the string columns A and B; the numeric column C is left unchanged
one_hot_data = pd.get_dummies(data[['A','B','C']], drop_first=True)

# Fit on the encoded, fully numeric features
tree.fit(one_hot_data, data['Class'])

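Here, pd.get_dummies converts the string columns A and B into dummy (indicator) columns and leaves the numeric column C unchanged. Setting drop_first=True drops the dummy column for the first category of each encoded variable, so a variable with k categories yields k-1 columns; this avoids perfectly collinear (redundant) columns.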
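As a rough illustration of what drop_first changes, the sketch below encodes the same features with and without it and compares the resulting column names, then fits the tree on the reduced set. The variable names full, reduced and predictions are introduced here for illustration only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

features = data[['A', 'B', 'C']]

# Without drop_first: one indicator column per category of A and B.
full = pd.get_dummies(features)
print(list(full.columns))

# With drop_first=True: the first category of each variable (A_a, B_a) is dropped.
reduced = pd.get_dummies(features, drop_first=True)
print(list(reduced.columns))

# The reduced frame is fully numeric, so the classifier can be fitted on it.
tree = DecisionTreeClassifier()
tree.fit(reduced, data['Class'])
predictions = tree.predict(reduced)
print(list(predictions))
```

When predicting on new data, the new frame needs the same encoded columns in the same order as the frame used for fitting.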


