A user runs the following script and receives an error. How can it be fixed?

Asked by SnehaPandey in Data Science on Nov 4, 2019

Answered by Sneha Pandey

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])

The call to tree.fit raises an error, typically a ValueError reporting that a string could not be converted to float.

This happens because sklearn's DecisionTreeClassifier expects numeric feature values and does not handle categorical (string) features directly. Columns A and B contain strings, so they must be encoded as numbers before fitting. One way to do that is to one-hot encode all categorical variables in the data set.
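To make the cause concrete, here is a minimal sketch that lists the non-numeric feature columns and confirms that fitting on them fails. The variable names `non_numeric` and `fit_failed` are made up for this illustration, and the exact error message may differ between library versions.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

features = data[['A', 'B', 'C']]

# Feature columns that are not numeric; these are the ones that need encoding.
non_numeric = features.select_dtypes(exclude='number').columns.tolist()
print(non_numeric)

# Fitting on the raw features fails because 'A' and 'B' hold strings.
try:
    DecisionTreeClassifier().fit(features, data['Class'])
    fit_failed = False
except ValueError as exc:
    fit_failed = True
    print(type(exc).__name__, exc)
```

Note that the string labels in the Class column are not the problem: the classifier accepts string targets, only the features have to be numeric.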

The following code can solve the above issue.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']

tree = DecisionTreeClassifier()

# One-hot encode the string columns A and B; the numeric column C is left as-is
one_hot_data = pd.get_dummies(data[['A','B','C']],drop_first=True)

# Fit on the encoded, fully numeric features
tree.fit(one_hot_data, data['Class'])

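Here, pd.get_dummies converts the string columns A and B into dummy (indicator) columns and leaves the numeric column C unchanged. drop_first=True drops the first level of each encoded variable, so a variable with k levels yields k-1 columns. This removes the redundant column that would otherwise make the dummies for a variable perfectly collinear.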
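As an illustration of what the encoding does to this data set, the sketch below compares the column names with and without drop_first and then fits the tree on the reduced set. The names `full`, `reduced` and `predictions` are chosen only for this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

features = data[['A', 'B', 'C']]

# One indicator column per level of each string column.
full = pd.get_dummies(features)
print(list(full.columns))     # ['C', 'A_a', 'A_b', 'B_a', 'B_b']

# With drop_first=True the first level of each variable ('a') is dropped.
reduced = pd.get_dummies(features, drop_first=True)
print(list(reduced.columns))  # ['C', 'A_b', 'B_b']

# The encoded frame is fully numeric, so fitting now succeeds.
tree = DecisionTreeClassifier()
tree.fit(reduced, data['Class'])
predictions = tree.predict(reduced).tolist()
print(predictions)
```

When predicting on new data, the new frame has to be encoded into the same columns in the same order as the one used for fitting.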
