A user ran the following code and received an error:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
data = pd.DataFrame()
data['A'] = ['a','a','b','a']
data['B'] = ['b','b','a','b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n','n','y','n']
tree = DecisionTreeClassifier()
tree.fit(data[['A','B','C']], data['Class'])
The user received the following error:
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/site-packages/sklearn/tree/tree.py", line 154, in fit
X = check_array(X, dtype=DTYPE, accept_sparse="csc")
File "/usr/local/lib/python2.7/site-packages/sklearn/utils/validation.py", line 377, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: b
How can this be fixed?
scikit-learn estimators, including DecisionTreeClassifier, expect numeric input and do not accept string-valued categorical variables directly, which is why the conversion to float fails on the value b. To handle categorical variables, either convert the strings to numbers with LabelEncoder or replace each categorical column with dummy variables.
For example, with LabelEncoder:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
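This maps each string to an integer. The fitted classes are sorted alphabetically, so amsterdam, paris and tokyo become 0, 1 and 2, and the transform call returns array([2, 2, 1]). The operation can also be inverted to recover the original strings: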
list(le.inverse_transform([2, 2, 1]))
This converts the numbers back to the original strings: tokyo, tokyo, paris.
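Applied to the data from the question, both approaches might look like the sketch below. It is a minimal illustration rather than the only way to do this: the names X_dummies, X_codes, encoders and predictions are made up for the example, and random_state=0 is only there to make the run repeatable.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Same data as in the question.
data = pd.DataFrame({
    'A': ['a', 'a', 'b', 'a'],
    'B': ['b', 'b', 'a', 'b'],
    'C': [0, 0, 1, 0],
    'Class': ['n', 'n', 'y', 'n'],
})

# Option 1: dummy variables. Each string column is replaced by one
# indicator column per category (A_a, A_b, B_a, B_b); C is left as is.
X_dummies = pd.get_dummies(data[['A', 'B', 'C']], columns=['A', 'B'])

# Option 2: integer codes, with one LabelEncoder per string column so
# that each column can be decoded again with inverse_transform.
X_codes = data[['A', 'B', 'C']].copy()
encoders = {}  # hypothetical helper dict: column name -> fitted encoder
for col in ['A', 'B']:
    encoders[col] = LabelEncoder()
    X_codes[col] = encoders[col].fit_transform(X_codes[col])

# The features are now numeric, so fit no longer raises ValueError.
# The string target column is accepted by the classifier unchanged.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_dummies, data['Class'])
predictions = tree.predict(X_dummies)
```

The scikit-learn documentation describes LabelEncoder as intended for target labels, and OrdinalEncoder is the equivalent meant for feature columns. Integer codes also imply an ordering between categories, which dummy variables avoid, so dummy variables are often the safer choice when the categories have no natural order.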