How to handle string data in ML classification - python

Hello, I am a beginner in machine learning. I have previously worked on some binary ML tasks where the data was numerical. Now I have to predict the probability of a particular combination, and I cannot disclose the dataset or the code at this point. My data is a dataframe of 10 columns: I have to train my model on 8 columns and predict the combination of the last 2 columns, i.e. my labels are a combination of the last 2 columns. The problem I am facing is that these column values are not numerical. I have tried everything I came across but can't find a suitable way of converting them to numerical values. I have tried LabelEncoder from sklearn, which works for the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are in the form '2be74fad-4d4'. Any suggestions on how to handle this issue would be highly appreciated.

To convert categorical data to numerical, you can try these approaches in sklearn:
- Label Encoding
- Label Binarizer
- OneHot Encoding
For your problem, you can use LabelEncoder, but there is a catch. With other sklearn transformers you can declare the object once and then use it to fit and transform any number of columns. With LabelEncoder, you have to fit_transform the encoder on one column in the train data and then transform the same column in the test data, then repeat the process for the next categorical column.
You can iterate over a list of categorical columns to keep this simple. Consider the snippet below:
from sklearn.preprocessing import LabelEncoder

cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']

enc = LabelEncoder()
for col in cat_cols:
    # Cast to str first so mixed or missing values don't break the encoder
    train[col] = train[col].astype('str')
    test[col] = test[col].astype('str')
    # Fit on the train column, then reuse that same fit for the test column
    train[col] = enc.fit_transform(train[col])
    test[col] = enc.transform(test[col])
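One caveat with this pattern: enc.transform raises a ValueError if a test column contains a label that never appeared in the corresponding train column. A minimal sketch of a workaround, assuming the same train/test frames as above, is to fit the encoder on the union of both columns first:

import pandas as pd

for col in cat_cols:
    # Learn the full label set from train and test combined
    enc.fit(pd.concat([train[col], test[col]]).astype('str'))
    train[col] = enc.transform(train[col].astype('str'))
    test[col] = enc.transform(test[col].astype('str'))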

You can create a dictionary with the mapping from string to integer. Then you use one-hot encoding or just feed the integer to the neural network. If the characters have some meaning, you could also do this on a per-character basis instead of word-based, but that depends on the task. If the string is just a unique identifier of the row or similar, leave it out and don't feed it to your model.
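A minimal sketch of the dictionary approach, assuming a pandas column of ID-like strings (the column and variable names here are hypothetical):

import pandas as pd

df = pd.DataFrame({'code': ['2be74fad-4d4', 'a1b2c3d4-9f0', '2be74fad-4d4']})

# Build the string -> integer mapping from the unique values
mapping = {value: idx for idx, value in enumerate(df['code'].unique())}
df['code_int'] = df['code'].map(mapping)

# One-hot encode the integers if the model shouldn't read an ordering into them
one_hot = pd.get_dummies(df['code_int'], prefix='code')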

Related

Python data not being converted

I created a decision tree model in Python by training on the data set, but the conversion from string to float did not happen.
Even after trying to convert to float manually, it still complains that some arrays cannot be converted to float. Any solutions?
I have practiced with this dataset before, and I think what is going wrong for you is shifting days before you extract the 'Close' column as a dataframe. Try:
df = df[['Close']]
before you shift days (which is the 45th cell in your screenshots). It could do the trick.
(Next time, please add code as text instead of screenshots.)
Your x_train and y_train are not supposed to be strings; they should be of type numpy.ndarray. Can you check, or show us the code where you split the data?
This is occurring due to something that was done wrong earlier; we need more insight into the code.
Your string data needs some pre-processing before it can be converted to float. You can convert your data to categorical variables (if you haven't already done so). For example, using pandas:

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

x_train = pd.get_dummies(x_train)   # one-hot encode the categorical columns
tree = DecisionTreeRegressor().fit(x_train, y_train)
# more actions

Furthermore, I can see from the error that you have datetime data. You should convert these to timestamps:

x_train['Date'] = pd.to_datetime(x_train['Date'])

The rest of the preprocessing is up to you; there is a plethora of relevant tutorials.
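One follow-up worth noting: pd.to_datetime produces datetime64 values, which a decision tree still cannot consume directly. A minimal sketch of one way to make them numeric (nanoseconds since the epoch; other encodings such as separate year/month/day columns may suit the task better):

x_train['Date'] = pd.to_datetime(x_train['Date']).astype('int64')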

Feature Selection in Scikit-learn Encounters Problems with Mixed Variable Types

I'm currently trying to do feature selection for a dataset I have. There are about 50 variables, 35 of which are categorical, each either binary or with fewer than 5 possible values. I'm trying to get down to ~15 input variables, counted before the preprocessing.
I'm trying to use Recursive Feature Elimination with Cross-Validation (RFECV) in scikit-learn. Because there is a mix of continuous and categorical variables, I'm having some problems when I one-hot encode the categoricals, and I have two questions:
Will the RFE still work with the one-hot encodings, and will it be accurate?
How can I tell which pre-encoding columns the selected features correspond to? For example, if it tells me to keep column 20, how do I know which column that corresponds to before preprocessing, so I can keep it as an original input variable?
I'm not going to include the preprocessing, but all it does is impute and one-hot encode, with no columns dropped.
Here's the two RFECV objects I have:
from sklearn.svm import SVC
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

clf = SVC(kernel="linear")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv.fit(x_train, y_train)

clf2 = ExtraTreesClassifier(random_state=RANDOM_SEED)
rfecv2 = RFECV(estimator=clf2, step=1, cv=StratifiedKFold(10), scoring="balanced_accuracy")
rfecv2.fit(x_train, y_train)
One-hot encoding turns your categorical features into discrete binary features, and RFE will work just fine with them. You should ask yourself whether RFE works with categorical data in general (answer: it depends on the estimator), but it will be fine with binary features; in the end, a one-hot encoding is just a group of binary features. The accuracy should be fine even with one-hot encoding.
Unfortunately, there is no "automatic" way to recover the original columns. You'll have to do it manually in some way. The best semi-automated way I can think of is to save a mapping and then use it, for example a dict: my_dict = {"Food_Pizza": "Food", "Food_Pasta": "Food"}. Then you just call orig_column = my_dict[new_column] to get the original column. The other option depends on how your features are named after one-hot encoding: if every encoded column is named "FeatureName_value" (like pandas dummies), you can just parse the name and take everything before the "_" character.
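A minimal sketch of the name-parsing approach (the column names here are hypothetical, using pandas-style "Feature_value" dummies):

import pandas as pd

df = pd.DataFrame({"Food": ["Pizza", "Pasta"], "Size": ["S", "L"]})
encoded = pd.get_dummies(df)   # columns: Food_Pasta, Food_Pizza, Size_L, Size_S

# Map each dummy column back to its source column by splitting on the last "_"
# (rsplit keeps feature names that themselves contain underscores intact)
orig_of = {col: col.rsplit("_", 1)[0] for col in encoded.columns}
orig_of["Food_Pizza"]   # -> "Food"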

SyntaxError while trying to perform RobustScaler on Pandas Dataframe

I am working with the House Prices Kaggle dataset. I am trying to use the RobustScaler from sklearn only on the numerical features in the dataset (LotFrontage, LotArea, etc.). First, I fit the scaler to the numerical values of my dataframe by calling select_dtypes(exclude=['object']). Once the transformer has been fit to those values, I call the transform function, trying to write the transformed values back by assigning to the same select_dtypes call. When I attempt that, I get the following error message:
SyntaxError: can't assign to function call
The data has already been cleared of null values. What does work is assigning the transform result to some variable, but then I get the results back as a numpy.ndarray:
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df_train.select_dtypes(exclude=['object']))
df_train.select_dtypes(exclude=['object']) = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This doesn't work
test = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This DOES work, but not in the format I need
All I want is for the transformed attributes to go back into the original pandas data frame at their corresponding locations. Is there some workaround I can implement if I can't convert the original dataframe results directly?
I managed to get it to work. Not sure how Pythonic this solution is, but it got me back on track. The SyntaxError happens because the left-hand side of the assignment is a function call (df_train.select_dtypes(...)), and Python cannot assign to the result of a function call; you have to index into the dataframe by column instead:
df_train[list(df_train.select_dtypes(exclude=['object']).columns)] = RobustScaler().fit_transform(df_train[list(df_train.select_dtypes(exclude=['object']).columns)])
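An equivalent, more readable version of the same fix (same behavior, assuming the df_train from the question):

num_cols = df_train.select_dtypes(exclude=['object']).columns
df_train[num_cols] = RobustScaler().fit_transform(df_train[num_cols])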

Create LabeledPoint from rdd data which has both strings and numbers - PySpark

I have lines like this in my data:
0,tcp,http,SF,181,5450,0,0,0.5,normal.
I want to use the decision tree algorithm for training. I couldn't create LabeledPoints, so I wanted to try HashingTF for the strings, but I couldn't get it to work. "normal" is my target label. How can I create a LabeledPoint RDD to use in pyspark? Also, the label of a LabeledPoint requires a double; should I just assign some double values to the labels, or should they be hashed?
I came up with a solution.
First of all, Spark's decision tree classifier already has a parameter for this: categoricalFeaturesInfo. From the pyspark API documentation:
categoricalFeaturesInfo - Map from categorical feature index to number of categories. Any feature not in this map is treated as continuous.
However, before doing this, we should first simply replace the strings with numbers so that pyspark can understand them.
Then, for the example data above, we create categoricalFeaturesInfo as in the definition, like this:
categoricalFeaturesInfo = {1: len(feature1), 2: len(feature2), 3: len(feature3)}
Simply put, the keys are the indexes of the categorical features and the values are the numbers of categories in those features. (The label column, index 9 in the raw line, becomes the LabeledPoint label rather than a feature, so it does not go into this map.)
Note that converting strings to numbers is enough for the training algorithm, but if you declare the categorical features like this, it will train faster.
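To make this concrete, here is a minimal sketch of the whole flow (the file path and variable names are hypothetical; it assumes an active SparkContext sc and raw lines shaped like the example above):

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree

rows = sc.textFile("data.csv").map(lambda line: line.strip().rstrip('.').split(','))

# Build a string -> index mapping for each categorical column and for the label
def build_index(col):
    return rows.map(lambda r: r[col]).distinct().zipWithIndex().collectAsMap()

protocols = build_index(1)   # e.g. tcp
services = build_index(2)    # e.g. http
flags = build_index(3)       # e.g. SF
labels = build_index(9)      # e.g. normal

def to_labeled_point(r):
    features = [float(r[0]),
                float(protocols[r[1]]), float(services[r[2]]), float(flags[r[3]])] + \
               [float(x) for x in r[4:9]]
    return LabeledPoint(float(labels[r[9]]), features)

data = rows.map(to_labeled_point)

categoricalFeaturesInfo = {1: len(protocols), 2: len(services), 3: len(flags)}
# maxBins must be at least as large as the biggest category count
model = DecisionTree.trainClassifier(data, numClasses=len(labels),
                                     categoricalFeaturesInfo=categoricalFeaturesInfo,
                                     maxBins=max(32, len(protocols), len(services), len(flags)))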

Patsy: New levels in categorical fields in test data

I am trying to use Patsy (with sklearn, pandas) for creating a simple regression model. The R style formula creation is a major draw.
My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear in only one of the sets. A code snippet is given below:

from patsy import dmatrices, build_design_matrices

df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line throws the following error:
patsy.PatsyError: Error converting data to categorical: observation
with value 'Kolkata' does not match any of the expected levels
I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
Is there any way I can make this work with Patsy?
The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.
One way is by using the levels= argument to C(...), like:
# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)
Another option, if you're using pandas's categorical support: record the set of possible values when you set up your data frame. If patsy detects that the object you've passed it is a pandas Categorical, then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.
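A minimal sketch of that option (assuming the all_cities list from above and a 'ship_city' column as in the question):

import pandas as pd

df_train["ship_city"] = pd.Categorical(df_train["ship_city"], categories=all_cities)
df_test["ship_city"] = pd.Categorical(df_test["ship_city"], categories=all_cities)
# patsy will now read the full category set from the column's dtype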
I ran into a similar problem, and I worked around it by building the design matrices prior to splitting the data:

from patsy import dmatrices
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
    train_test_split(df_X, df_Y, test_size=test_size)

Then, as an example of applying a fit:

model = smf.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)

Technically I haven't built a test case, but I haven't run into the "Error converting data to categorical" error again since implementing the above.
