MinMaxScaler + DecisionTree classifier with numerical and categorical data - python

I would like to know how I should manage the following situation:
I have a labeled dataset that I need to analyze and run a classification task on. Some features are numerical and others are categorical (non-ordinal), and my problem is that I don't know how to handle the categorical ones.
Before classifying, I usually apply a MinMaxScaler, but I can't do that on this particular dataset because of the categorical features.
I've read about one-hot encoding, but I don't understand how to apply it to my case: my dataset has some numerical features and 10 categorical features, and one-hot encoding generates extra columns in the dataframe, so I don't know how to prepare the resulting dataframe before sending it to the decision tree classifier.
In order to clarify the situation the code I'm using so far is the following:
y = df['class']
X = df.drop(['class'], axis=1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# call the DecisionTree classifier
When the df has categorical features I get the following error: TypeError: data type not understood. And if I apply one-hot encoding I get a dataframe with many columns, and I don't know whether the DecisionTree classifier will understand the real structure of my data. I mean, how can I tell the classifier that a group of columns belongs to one original feature? Am I misunderstanding the whole situation? Sorry if this is a confusing question, but I am a newbie and I feel pretty confused about how to handle this.

I don't have enough reputation to comment, but note that decision tree classifiers don't require their inputs to be scaled. So if you're using a decision tree classifier, you can skip the MinMaxScaler entirely; you still need to encode the categorical columns (e.g. with one-hot encoding), since scikit-learn trees don't accept string features.
If you're using a method that requires feature scaling, then you should probably do one-hot-encoding and feature scaling separately - see this answer: https://stackoverflow.com/a/43798994/9988333
Alternatively, you could use a method that handles categorical variables 'out of the box', such as LGBM.
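For example, here is a minimal sketch of the no-scaling route (it assumes df and the 'class' label column from the question; pd.get_dummies expands only the categorical columns and leaves the numerical ones untouched):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
y = df['class']
X = pd.get_dummies(df.drop(['class'], axis=1))  # one-hot encode only the categorical columns
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)  # no scaling needed for tree-based models
print(clf.score(X_test, y_test))
The tree does not need to be told that several dummy columns came from one original categorical feature; it simply splits on whichever columns turn out to be informative.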

Related

Correct use of LinearSVC

I am trying to implement a machine learning algorithm which detects irregular ECG signals. I extracted some features, but I am not sure how to build the correct input for the classifier.
I have 20k different ECG signals, and each signal has 1000 values. They are all labeled as correct or incorrect.
I chose, for example, the two features heart_rate and xposition_of_3_highest_peaks, but how do I feed them into the classifier?
Below you can see my attempt, but every time I add the second feature the score decreases. Why?
clf = svm.SVC()
#[64,70,48,89...74,58]
X_train_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = StandardScaler().fit_transform(fe.get_avg_heart_rate(X_test))
#[[23,56,89],[24,45,78],...,[21,58,90]]
X_train_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = StandardScaler().fit_transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate,X_train_3_peaks),axis =1)
X_te = np.concatenate((X_test_heartRate,X_test_3_peaks),axis =1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("real Solution:", Y_test)
print(clf.score(X_te,Y_test))
I am not sure whether the StandardScaler().fit_transform is necessary or whether the np.concatenate is correct. Maybe there is even a better classifier for this use case?
Sorry I am a complete beginner, please be kind :)
When you apply any pre-processing transformation, you must apply the same transformation, with the same statistics, to the training data and to the validation / test data, because you are assuming the validation / test data come from the same distribution. Therefore, create an object that stores the transformation fitted on the training data, then apply it to the training and test data equally. Your decreased performance comes from scaling the two datasets with separate means and standard deviations, which can produce out-of-distribution inputs if your sample size isn't large enough.
Therefore, call fit_transform on the training data, then just transform on the validation / test data. fit_transform finds the scaling parameters for each column, applies them to the input data and returns the transformed data; transform assumes an already fitted scaler (such as the one produced by fit_transform) and applies that scaling. I sometimes like to separate the operations: fit on the training data first, then transform the training and validation / test data afterwards. This is a common source of confusion for new practitioners. You also need to keep the scaler object around so you can apply it to your validation / test data later.
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler
clf = svm.SVC()
# [64,70,48,89...74,58]
heartRate_scaler = StandardScaler()
X_train_heartRate = heartRate_scaler.fit_transform(fe.get_avg_heart_rate(X_train))
X_test_heartRate = heartRate_scaler.transform(fe.get_avg_heart_rate(X_test))
# [[23,56,89],[24,45,78],...,[21,58,90]]
three_peaks_scaler = StandardScaler()
X_train_3_peaks = three_peaks_scaler.fit_transform(fe.get_intervalls(X_train))
X_test_3_peaks = three_peaks_scaler.transform(fe.get_intervalls(X_test))
X_tr = np.concatenate((X_train_heartRate, X_train_3_peaks), axis=1)
X_te = np.concatenate((X_test_heartRate, X_test_3_peaks), axis=1)
clf.fit(X_tr, Y_train)
print("Prediction:", clf.predict(X_te))
print("Real solution:", Y_test)
print(clf.score(X_te, Y_test))
Take note that you could also concatenate the features you want first and apply the StandardScaler afterwards, because the method standardizes each feature/column independently. Scaling the different sets of features and concatenating them afterwards is no different from concatenating the features first and then scaling, as in the sketch below.
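For illustration, a sketch of that equivalent ordering (it assumes fe.get_avg_heart_rate and fe.get_intervalls return 2-D arrays, as the snippets above imply):
import numpy as np
from sklearn.preprocessing import StandardScaler
# Concatenate the raw feature blocks first...
train_features = np.concatenate((fe.get_avg_heart_rate(X_train), fe.get_intervalls(X_train)), axis=1)
test_features = np.concatenate((fe.get_avg_heart_rate(X_test), fe.get_intervalls(X_test)), axis=1)
# ...then fit a single scaler on the training block and reuse its statistics on the test block.
scaler = StandardScaler()
X_tr = scaler.fit_transform(train_features)
X_te = scaler.transform(test_features)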
Minor Note
I forgot to ask about the fe object. What is it doing under the hood? Does it use the training data in any way to compute your features? If it does, you must make sure that it is fitted on the training data only and that the same fitted statistics are then applied to the test data, just like the pre-processing discussed above. You haven't specified what it does internally, so I will assume the happy path: it either reuses the training data's statistics for both sets, or it is an independent, stateless transformation.
Possible Improvement
Consider using a decision tree-based algorithm such as a Random Forest Classifier, which does not require scaling of the input features: its job is to partition the feature space of your data into axis-aligned N-dimensional boxes, with N being the number of features in your dataset (if N=2 these are 2D rectangles, if N=3 3D boxes, and so on). Depending on how your data is distributed, tree-based algorithms can do better, and they are among the first things to try in Kaggle competitions.
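A minimal sketch of that suggestion, reusing the unscaled train_features / test_features arrays from the concatenation sketch above (no scaler involved):
from sklearn.ensemble import RandomForestClassifier
# Tree ensembles split on raw feature values, so unscaled inputs are fine.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(train_features, Y_train)
print(rf.score(test_features, Y_test))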

Error "Unknown label type: 'continuous'" when I use IterativeImputer with KNeighborsClassifier

I want to do a multiple imputation with IterativeImputer.
Here is the dataset (the original is from https://www.kaggle.com/jboysen/mri-and-alzheimers):
alz_df_imp_categorical
The variables to impute are "educ" and "ses". As they are categorical, I've chosen to use a classifier (KNeighborsClassifier from sklearn). The predictors are continuous (except "sex").
This is the code :
from sklearn.experimental import enable_iterative_imputer  # needed before importing IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsClassifier
# calling the MICE-style imputer with a KNN classifier as estimator
KNN_class_estimator = KNeighborsClassifier()
mice_imputer = IterativeImputer(random_state=0, estimator=KNN_class_estimator, initial_strategy='mean')
# imputing the missing values with the MICE imputer
alz_df_imp_categorical = mice_imputer.fit_transform(alz_df_imp_categorical)
And the error is:
"Unknown label type: 'continuous'"
In fact, fit_transform() converts the dataframe to an array in which every variable becomes a float, so the variables to predict are no longer categorical after this conversion. Furthermore, an array only accepts one dtype, so I can't keep just the variables to predict as categorical and leave the others as float. Since the target variables are floats, the classifier can't work. So I understand the error... but I don't know how to solve it.
I thought that maybe a KNN classifier can't be applied when there are several types of predictors (continuous and categorical). However, I have no problem doing this with a KNN classifier in R.
Do you have some ideas to solve this problem?
Thank you.
I just understood why it does not work: IterativeImputer only supports continuous variables. So, apparently, you can't apply multiple imputation to categorical variables with IterativeImputer.
There is discussion about this here.
I saw that it is possible to do simple imputation of categorical variables in Python (see the sketch below). However, it does not seem possible to do multiple imputation with this type of variable (at least, I did not find a way).
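For reference, a minimal sketch of that simple imputation, filling the two categorical columns with their most frequent value (the column names come from the question; the plain mode strategy is my assumption):
from sklearn.impute import SimpleImputer
cat_cols = ['educ', 'ses']
mode_imputer = SimpleImputer(strategy='most_frequent')
alz_df_imp_categorical[cat_cols] = mode_imputer.fit_transform(alz_df_imp_categorical[cat_cols])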

What is right time to perform train_test_split when building a model with text and categorical features?

I am trying to train a model which takes a mixture of numerical, categorical and text features.
My question is: which of the following should I do to vectorize my text and categorical features?
1. Split the data into train, CV and test first and fit the vectorizers on the training set only, i.e. vectorizer.fit(train), then vectorizer.transform(cv) and vectorizer.transform(test).
2. Use vectorizer.fit_transform on the entire data.
My goal is to hstack all the above features and apply Naive Bayes. I think I should split my data into train/test before this point, in order to find the optimal hyperparameters for NB.
Please share some thoughts on this. I am new to data science.
If you are going to fit anything like an imputer or a StandardScaler to the data, I recommend doing that after the split, since this way you avoid any of the test dataset leaking into your training set. However, things like formatting, simple transformations of the data, and one-hot encoding can usually be done safely on the entire dataset without issue, and doing so avoids some extra work.
I think you should go with the 2nd option, i.e. vectorizer.fit_transform on the entire data, because if you split the data first, some categories that occur only in the test set will not be seen during fitting, and those classes would remain unrecognised.
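For what it's worth, here is a sketch of the first option (fit only on the training split, then reuse the fitted objects on the other splits). The column names, CountVectorizer, OneHotEncoder and MinMaxScaler choices are illustrative assumptions, not taken from the question:
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
text_vec = CountVectorizer()
cat_enc = OneHotEncoder(handle_unknown='ignore')  # unseen cv/test categories become all-zero rows
num_scaler = MinMaxScaler(clip=True)  # MultinomialNB needs non-negative inputs (clip=True needs sklearn >= 0.24)
# Fit on the training split only, then transform the other splits with the same objects.
X_train_vec = hstack([text_vec.fit_transform(train['text']),
                      cat_enc.fit_transform(train[['category']]),
                      num_scaler.fit_transform(train[['num_feature']])])
X_cv_vec = hstack([text_vec.transform(cv['text']),
                   cat_enc.transform(cv[['category']]),
                   num_scaler.transform(cv[['num_feature']])])
clf = MultinomialNB().fit(X_train_vec, train['label'])
print(clf.score(X_cv_vec, cv['label']))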

Issues with imbalanced dataset in case of binary classification

I have a binary classification problem where the class distribution is roughly {0: 85%, 1: 15%}. I have tried re-weighting with class_weights and other sampling approaches, but all the approaches I have used give me unsatisfactory results like these:
My dataset has shape (91125, 57).
Accuracy: 1
F1-Score: 1
F2-Score: 1
Precision: 1
Recall: 1
AUCROC: 1
Kappa: 1
Is there any other method I can use to handle such a situation?
Make sure you're dropping the target variable from your features before feeding the data to the classifier:
X = df.drop('target',axis=1)
y = df['target']
I'd also check whether some independent variables are highly correlated with the target. It may give you an idea of what causes the unrealistically perfect classification:
import seaborn as sns
# use the full dataframe (which still contains the target) so feature-target correlations show up
sns.heatmap(df.corr())
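Or, as a quick numeric check (assuming the target column is named 'target' as above):
print(df.corr()['target'].sort_values(ascending=False))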

Getting correct shape for datapoint to predict with a Regression model after using One-Hot-Encoding in training

I am writing an application which uses linear regression, in my case sklearn.linear_model.Ridge. I have trouble bringing the datapoint I'd like to predict into the shape that Ridge expects. I will briefly describe my two applications and how the problem turns up:
1ST APPLICATION:
My datapoints have just 1 feature each, and these are all strings, so I am using one-hot encoding to be able to use them with Ridge. After that, the datapoints (X_hotEncoded) have 9 features each:
import pandas as pd
X_hotEncoded = pd.get_dummies(X)
After fitting Ridge to X_hotEncoded and labels y I save the trained model with:
from sklearn.externals import joblib
joblib.dump(ridge, "ridge.pkl")
2ND APPLICATION:
Now that I have a trained model saved on disk, I would like to retrieve it in my 2nd application and predict y (the label) for just one datapoint. That's where I encounter the above-mentioned problem:
# X = one datapoint I like to predict y for
ridge= joblib.load("ridge.pkl")
X_hotEncoded = pd.get_dummies(X)
ridge.predict(X_hotEncoded) # this should give me the prediction
This gives me the following Error in the last line of code:
ValueError: shapes (1,1) and (9,) not aligned: 1 (dim 1) != 9 (dim 0)
Ridge was trained with 9 features because of the one-hot encoding applied to all the training datapoints. Now, when I want to predict just one datapoint (with just 1 feature), I have trouble bringing this datapoint into the correct shape for Ridge to handle. One-hot encoding has no effect on just one datapoint with just one feature: it produces only a single column for the single category it sees.
Does anybody know a neat solution to this problem?
A possible solution might be to write the column names to disk in the 1st application, retrieve them in the 2nd, and then rebuild the datapoint there with the same columns. The column names of the one-hot-encoded dataframe could be retrieved as described here: Reversing 'one-hot' encoding in Pandas
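That column-name idea would look roughly like this (a sketch; the file name and X_new are placeholders):
# 1st application: remember the training columns after one-hot encoding
joblib.dump(list(X_hotEncoded.columns), "columns.pkl")
# 2nd application: rebuild the same column layout for a new datapoint
columns = joblib.load("columns.pkl")
X_new_hotEncoded = pd.get_dummies(X_new).reindex(columns=columns, fill_value=0)
ridge.predict(X_new_hotEncoded)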
What happens here is the following:
During the training phase, you decided on an encoding that transforms a single categorical feature into 9 numerical ones (one-hot). You trained your regression algorithm on this encoding. So, in order to use it on unknown (test) data, you have to transform that data in exactly the same way as you did during training.
Unfortunately, I don't think you can save the encoding used by pd.get_dummies and reuse it. You should use sklearn.preprocessing.OneHotEncoder() instead. So during training:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_hotEncoded = enc.fit_transform(X)
fit_transform() first fits the encoder to your training data and then uses it to transform that data. The difference from pd.get_dummies() is that you now have an encoder object which you can save and reuse later:
joblib.dump(enc, "encoder.pkl")
During testing you can apply the same encoding used during training like this:
enc = joblib.load("encoder.pkl")
X_hotEncoded = enc.transform(X)
Note that you don't want to fit the encoder again (this is what pd.get_dummies() would do) because it is crucial that the same encoding is used for the training and test data.
Watch out:
You will run into problems if the test data contains values which were not present in the training data (because then the encoder does not know how to encode these unknown values). To avoid this, you can:
provide OneHotEncoder() with the categories argument, passing it a list of all your categories.
provide OneHotEncoder() with the handle_unknown argument set to 'ignore'. This avoids the error and simply sets all of that feature's columns to zero.
perform one-hot encoding before splitting the data into training and test set.
provide OneHotEncoder() with the n_values argument telling the encoder how many different categories to expect for each input feature [edit: deprecated since version 0.20].
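Putting the pieces together, a minimal end-to-end sketch (the "color" column, its values and the file names are made up for illustration):
import joblib
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import OneHotEncoder
# --- training application ---
X = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
y = [1.0, 2.0, 3.0, 1.5]
enc = OneHotEncoder(handle_unknown="ignore")  # unseen categories encode as all zeros
ridge = Ridge().fit(enc.fit_transform(X), y)
joblib.dump(enc, "encoder.pkl")
joblib.dump(ridge, "ridge.pkl")
# --- prediction application ---
enc = joblib.load("encoder.pkl")
ridge = joblib.load("ridge.pkl")
X_new = pd.DataFrame({"color": ["green"]})  # one datapoint, one feature
print(ridge.predict(enc.transform(X_new)))  # shape now matches the training data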
