A user is trying to run logistic regression on their data (6 categorical variables, 1 integer) using scikit-learn. They are following the scikit-learn documentation, but fitting the model raises the ValueError shown below.
# Below are the variables of the data.
train_data.dtypes
OUTPUT
TripType category
VisitNumber category
Weekday category
Upc category
ScanCount int64
DepartmentDescription category
FinelineNumber category
dtype: object
from sklearn import linear_model

X = train_data.loc[:, 'VisitNumber':'FinelineNumber']
Y = train_data['TripType']  # 1-D target column
logreg = linear_model.LogisticRegression()
logreg.fit(X, Y)
**ValueError: could not convert string to float: GROCERY DRY GOODS**
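The error occurs because the dataset contains categorical variables whose values are strings. scikit-learn's logistic regression expects numeric input, so category names cannot be used directly as features. They first have to be converted into encoded vectors (dummy variables). A variable with 6 categories can be represented with 5 dummy variables, because the remaining category acts as the baseline and is implied when all 5 dummies are 0.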
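One way to apply this encoding before fitting is sketched below. It is only an illustration under stated assumptions: the `toy` frame and its values are hypothetical (only the column names come from the question), and a `OneHotEncoder` inside a `ColumnTransformer` is one possible choice, not necessarily what the original poster ended up using.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Weekday": ["Friday", "Friday", "Saturday", "Sunday", "Saturday", "Sunday"],
    "DepartmentDescription": ["GROCERY DRY GOODS", "DAIRY", "DAIRY",
                              "GROCERY DRY GOODS", "PRODUCE", "PRODUCE"],
    "ScanCount": [1, 2, 1, 3, 1, 2],
    "TripType": [8, 9, 9, 8, 9, 8],
})

categorical_cols = ["Weekday", "DepartmentDescription"]
X = toy[categorical_cols + ["ScanCount"]]
y = toy["TripType"]  # 1-D target

# drop="first" keeps k-1 dummy columns per categorical variable;
# the integer column ScanCount is passed through unchanged.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(drop="first"), categorical_cols)],
    remainder="passthrough",
)

model = make_pipeline(encoder, LogisticRegression(max_iter=1000))
model.fit(X, y)  # no "could not convert string to float" error
predictions = model.predict(X)
```

Keeping the encoder inside the pipeline means the same category-to-column mapping is reused at prediction time, which avoids mismatched columns between training and test data.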
As an example of changing a variable into dummies, consider a gender column with two categories: it is converted into a single dummy variable that takes the values 0 and 1.
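A minimal sketch of the gender example, assuming pandas is available. The `people` frame and its four rows are hypothetical and stand in for whatever data the original example used.

```python
import pandas as pd

# Hypothetical data with a two-category column.
people = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# drop_first=True keeps one dummy column for two categories;
# "female" (first alphabetically) becomes the baseline encoded as 0.
dummies = pd.get_dummies(people, columns=["gender"], drop_first=True, dtype=int)

print(dummies)
```

The result has a single column, `gender_male`, which is 1 for rows that were "male" and 0 for rows that were "female".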