Unexpectedly getting different standardized data with sklearn StandardScaler - python

I am getting different standardized values from two scalers built on the same dataset using scikit-learn's StandardScaler class.
I built a StandardScaler object using scikit-learn on a training data set with 52 features. Let's call it Scaler1. I then used that scaler to standardize the training data set and trained different models on the standardized data. This led to a best model with 26 of the 52 features selected. To implement a predictor class that uses the model: (1) I grabbed only the columns from the original (non-standardized) data set that correspond to the 26 selected features; then (2) I created and saved (with joblib) a new StandardScaler object by fitting it on the newly created data set. Let's call it Scaler2. Below is a simple outline of my implementation.
from sklearn.preprocessing import StandardScaler
from sklearn.externals import joblib  # plain `import joblib` on newer scikit-learn

scaler = StandardScaler()
scaler.set_params(**parameters)   # `parameters` is defined elsewhere
scaler.fit(data)                  # `data` contains the 26 selected columns
joblib.dump(scaler, destination)
Contrary to my expectation, when standardizing the original data set, Scaler2 gives me different values than Scaler1 for the same data points, for each of the 26 features. Is that behaviour normal? Doesn't the standardization happen independently for each feature? How can I fix this issue?
Best,
Yannick

This issue was fixed. It is important to make sure the order in which the features are processed remains the same, because the scaler stores its statistics positionally and does not save the names of the features.
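The behaviour is easy to demonstrate: StandardScaler stores mean_ and scale_ per column position, not per column name, so passing the columns in a different order silently applies the wrong statistics. A minimal sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler().fit(X)

print(scaler.transform(X))           # correct: each column gets its own mean/std
print(scaler.transform(X[:, ::-1]))  # columns swapped: wrong values, no error raised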

Related

Why does an xgboost BDT model built with the histogram tree method depend on the training data ordering?

I was using Python xgboost to train some models (with binary logistic) on some data (50k events in total), using the histogram tree method (tree_method="hist"). I shuffled the events in the data before training. It turned out that the resulting models differ slightly depending on the order of the events, and some results based on the corresponding predictions on a validation set (disjoint from the training set) could vary by up to 5%. As a cross-check I also tried lightgbm, and the effect is present there as well. It seems to be a problem of the histogram method, because if I use the exact method in xgboost (tree_method="exact") the problem disappears.
Does anyone know why a BDT model based on the histogram method depends on the event order?
I tried to look for the reference paper but was totally lost.
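One way to isolate the effect on synthetic data (a sketch only; the size of the discrepancy depends heavily on the dataset):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

def val_predictions(order, tree_method):
    model = xgb.XGBClassifier(tree_method=tree_method, n_estimators=100,
                              random_state=0)
    model.fit(X_tr[order], y_tr[order])
    return model.predict_proba(X_val)[:, 1]

identity = np.arange(len(X_tr))
shuffled = np.random.RandomState(1).permutation(len(X_tr))
for method in ("hist", "exact"):
    diff = np.abs(val_predictions(identity, method)
                  - val_predictions(shuffled, method)).max()
    print(method, diff)  # "hist" may show a nonzero difference, "exact" should not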

How to implement incremental learning using the Naive Bayes algorithm in Python?

I have implemented an ML model using the Naive Bayes algorithm, and I want to add incremental learning. The issue I am facing: when I train my model, preprocessing generates 1500 features. A month later, via a feedback mechanism, I want to train the model on new data, which might contain new features, so the count may be less than or more than the previous 1500. If I use fit_transform to extract the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # refitting replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X, y, classes=classes)  # first call must pass all class labels
# fails on later batches because the new feature count does not equal the
# previous feature set count
You could use just transform() on the CountVectorizer and then partial_fit() for Naive Bayes, as below, for incremental learning. Remember, transform() extracts the same set of features that you learned from the training dataset.
X = cv.transform(corpus).toarray()  # reuse the fitted vocabulary; GaussianNB needs a dense array
classifier.partial_fit(X, y)
But you cannot revamp the features from scratch and continue the incremental learning: the number of features needs to stay consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means creating a new model for all the available data. You can adopt this approach if your dataset is small enough to keep in memory!
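For example (a sketch; old_corpus/new_corpus and old_y/new_y are placeholder names for your two batches of texts and labels):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X_all = cv.fit_transform(old_corpus + new_corpus).toarray()  # vocabulary over all data
y_all = list(old_y) + list(new_y)

classifier = GaussianNB()
classifier.fit(X_all, y_all)  # one fresh model covering the full vocabulary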
You cannot do this with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.
You can instead use a different preprocessor (in place of CountVectorizer) that maps inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which the scikit-learn authors recommend for exactly the scenario you mention. When initializing it, you specify the number of features you want; in most cases the default value is enough to avoid collisions between the hashes of different words, but you can experiment with different numbers. Try it and check the performance; if it is not on par with CountVectorizer, you can do what @AI_Learning suggests and build a new model on the whole data (old + new).
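A sketch of that setup (old_corpus/new_corpus, old_y/new_y and the class labels are placeholders; n_features is kept small here for illustration, the default of 2**20 is usually safer against collisions):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

hv = HashingVectorizer(n_features=2**10, alternate_sign=False)
classifier = GaussianNB()

# First call to partial_fit must list every class that will ever appear.
X_old = hv.transform(old_corpus).toarray()
classifier.partial_fit(X_old, old_y, classes=[0, 1])  # adjust to your label set

# A month later: texts with unseen words still hash into the same 2**10 columns.
X_new = hv.transform(new_corpus).toarray()
classifier.partial_fit(X_new, new_y)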

xgboost feature importance of categorical variable

I am using XGBClassifier for training in Python, and there are a handful of categorical variables in my training dataset. Originally, I planned to convert each of them into a few dummies before feeding in my data, but then the feature importance is calculated for each dummy, not for the original categorical variables. Since I also need to rank all of my original variables (numerical + categorical) by importance, I am wondering how to get the importance of the original variables. Is it simply a matter of adding up the dummies' importances?
You could probably get by with summing the individual categories' importances back into their original parent feature. But unless these features are high-cardinality, my two cents would be to report them individually. I tend to err on the side of being more explicit when reporting model performance/importance measures.
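If you do decide to add them up, a minimal sketch (model and feature_names are assumed to be your fitted XGBClassifier and its dummy column names; the prefix logic depends on how your dummies were named):

import pandas as pd

importances = pd.Series(model.feature_importances_, index=feature_names)

def parent(col):
    # "color_red" -> "color"; columns without "_" are left as-is (numeric features)
    return col.rsplit("_", 1)[0] if "_" in col else col

grouped = importances.groupby(parent).sum().sort_values(ascending=False)
print(grouped)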

Getting correct shape for datapoint to predict with a Regression model after using One-Hot-Encoding in training

I am writing an application which uses linear regression, in my case sklearn.linear_model.Ridge. I have trouble bringing the datapoint I would like to predict into the correct shape for Ridge. I will briefly describe my two applications and how the problem turns up:
1ST APPLICATION:
My datapoints have just one feature each, which is a string, so I am using one-hot encoding to be able to use them with Ridge. After that, the datapoints (X_hotEncoded) have 9 features each:
import pandas as pd
X_hotEncoded = pd.get_dummies(X)
After fitting Ridge to X_hotEncoded and the labels y, I save the trained model with:
from sklearn.externals import joblib
joblib.dump(ridge, "ridge.pkl")
2ND APPLICATION:
Now that I have a trained model saved on disk, I would like to retrieve it in my 2nd application and predict y (the label) for just one datapoint. That's where I encounter the above-mentioned problem:
# X = one datapoint I would like to predict y for
ridge = joblib.load("ridge.pkl")
X_hotEncoded = pd.get_dummies(X)
ridge.predict(X_hotEncoded)  # this should give me the prediction
This gives me the following error in the last line of code:
ValueError: shapes (1,1) and (9,) not aligned: 1 (dim 1) != 9 (dim 0)
Ridge was trained with 9 features because of the one-hot encoding applied to all the datapoints. Now, when I want to predict just one datapoint (with just one feature), I have trouble bringing this datapoint into the correct shape for Ridge to handle. One-hot encoding has no effect on just one datapoint with just one feature.
Does anybody know a neat solution to this problem?
A possible solution might be to write the column names to disk in the 1st application, retrieve them in the 2nd, and then rebuild the datapoint there. The column names of one-hot-encoded arrays can be retrieved as described here: Reversing 'one-hot' encoding in Pandas
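That idea can be implemented with DataFrame.reindex (a sketch building on the snippets above; X_hotEncoded, X and ridge are the objects from the question):

import pandas as pd
from sklearn.externals import joblib  # plain `import joblib` on newer scikit-learn

# 1st application: persist the training column order next to the model.
joblib.dump(list(X_hotEncoded.columns), "columns.pkl")

# 2nd application: encode the new datapoint, then realign it to the training
# columns; dummies missing for this datapoint are filled with 0.
train_columns = joblib.load("columns.pkl")
X_new = pd.get_dummies(X).reindex(columns=train_columns, fill_value=0)
ridge.predict(X_new)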
What happens here is the following:
During the training phase, you decided on an encoding to transform a single categorical feature into 9 numerical ones (one-hot). You trained your regression algorithm on this encoding. So in order to use it for unknown (test) data, you have to transform that data in exactly the same way as you did during training.
Unfortunately, I don't think you can save the encoding used by pd.get_dummies and reuse it. You should use sklearn.preprocessing.OneHotEncoder() instead. So during training:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
X_hotEncoded = enc.fit_transform(X)
fit_transform() first fits the encoder to your training data and then uses it to transform the data. The difference from pd.get_dummies() is that you now have an encoder object which you can save and reuse later:
joblib.dump(enc, "encoder.pkl")
During testing you can apply the same encoding used during training like this:
enc = joblib.load("encoder.pkl")
X_hotEncoded = enc.transform(X)
Note that you don't want to fit the encoder again (that is what pd.get_dummies() would do), because it is crucial that the same encoding is used for the training and the test data.
Watch out:
You will run into problems if the test data contains values that were not present in the training data (the encoder would not know how to encode these unknown values). To avoid this, you can:
provide OneHotEncoder() with the categories argument, passing it a list of all your categories.
provide OneHotEncoder() with the handle_unknown argument set to "ignore". This avoids the error and just sets all columns to zero (see the sketch after this list).
perform One Hot Encoding before splitting the data into training and test set.
provide OneHotEncoder() with the n_values argument telling the encoder how many different categories to expect for each input feature [edit: deprecated since version 0.20].
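The handle_unknown option from the list above, as a minimal sketch (assuming scikit-learn 0.20+, where OneHotEncoder accepts strings directly):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["red"], ["green"], ["blue"]])

# Categories are stored sorted: blue, green, red.
print(enc.transform([["red"], ["purple"]]).toarray())
# [[0. 0. 1.]    "red" maps to its own column
#  [0. 0. 0.]]   unseen "purple" becomes all zeros instead of raising an error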

Scikit-learn classification

Is there a straightforward way to view the top features of each class, based on tf-idf?
I am using the KNeighbors classifier, SVC (linear kernel), and MultinomialNB.
Secondly, I have been searching for a way to view the documents that were not classified correctly. I can view the confusion matrix, but I would like to see the specific documents, to understand which features cause the misclassification.
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
classifier = SVC(kernel='linear')
counts = tfidf_vectorizer.fit_transform(data['text'].values).toarray()
targets = data['class'].values
classifier.fit(counts, targets)
# transform (not fit_transform) so the test data maps into the training vocabulary
counts = tfidf_vectorizer.transform(test['text'].values).toarray()
predictions = classifier.predict(counts)
EDIT: I have added the code snippet where I create a single tfidf vectorizer and use it to train the classifier.
As the previous comments suggest, a more specific question would get a better answer, but I use this package all the time, so I will try to help.
I. Determining the top features for classification classes in sklearn really depends on the individual tool you are using. For example, many ensemble methods (like RandomForestClassifier and GradientBoostingClassifier) come with a .feature_importances_ attribute which scores each feature by its importance. In contrast, most linear models (like LogisticRegression or RidgeClassifier) have a regularization penalty on the size of the coefficients, so the coefficient magnitudes are somewhat a reflection of feature importance (although you need to keep the numeric scales of individual features in mind); they can be accessed through the model's .coef_ attribute.
In summary, almost all sklearn models have some method to extract the feature importances but the methods are different from model to model. Luckily the sklearn documentation is FANTASTIC so I would read up on your specific model to determine your best approach. Also, make sure to read the User Guide associated with your problem type in addition to the model specific API.
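For the linear SVC in the question, a minimal sketch of pulling the top-weighted terms per class (assumes the fitted tfidf_vectorizer and classifier from the snippet; on older scikit-learn use get_feature_names() instead of get_feature_names_out()):

import numpy as np

feature_names = np.array(tfidf_vectorizer.get_feature_names_out())

def top_features(weights, n=10):
    return feature_names[np.argsort(weights)[-n:][::-1]]

# For a binary SVC, coef_ has one row; for multiclass it holds one row per
# one-vs-one class pair, not one per class.
for i, row in enumerate(classifier.coef_):
    print(i, top_features(row))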
II. There is no out-of-the-box sklearn method that returns the misclassified records, but if you are using a pandas DataFrame (which you should) to feed the model, it can be accomplished in a few lines of code like this:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier  # note: ensemble, not linear_model

df = pd.DataFrame(data)
x = df[[<list of feature columns>]]
y = df[<target column>]
mod = RandomForestClassifier()
mod.fit(x.values, y.values)
df['predict'] = mod.predict(x.values)
incorrect = df[df['predict'] != df[<target column>]]
The resultant incorrect DataFrame will contain only records which are misclassified.
Hope this helps!
