I'm trying to figure out how to use the inverse_transform method of LabelEncoder(). For example, in the code below,
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Actual'])
If I want to reverse it, I can simply call:
le.inverse_transform(df['Label'])
However, I need to apply that same transformation/inverse to a new dataset, which might be predicted from the model above. That is, it is being done in a new notebook, so it seems like I have to store the labels. Any ideas how to do this? My only idea is to export a dataframe with 2 columns and use pd.merge.
Make a dictionary containing the inverse transform of the LabelEncoder that you used in the 1st notebook, and then use that dictionary to remap the values in the 2nd notebook.
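For example, a minimal sketch of that idea (the pickle file name and new_df are only illustrative):
import pickle
from sklearn.preprocessing import LabelEncoder

# --- 1st notebook ---
le = LabelEncoder()
df['Label'] = le.fit_transform(df['Actual'])

# mapping from encoded integer back to the original label
inverse_mapping = dict(zip(le.transform(le.classes_), le.classes_))
with open('label_mapping.pkl', 'wb') as f:
    pickle.dump(inverse_mapping, f)

# --- 2nd notebook ---
with open('label_mapping.pkl', 'rb') as f:
    inverse_mapping = pickle.load(f)
new_df['Actual'] = new_df['Label'].map(inverse_mapping)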
I would like to take the logarithm of specific columns of my dataframe.
I created a new transformer object:
import numpy as np
from sklearn.preprocessing import FunctionTransformer
log_transformer = FunctionTransformer(np.log10, validate=True)
and it works nicely on one specific column of my dataframe:
log_transformer.transform(df[['_column']])
One could also overwrite the specific column of the original dataframe like:
df[['_column']]=log_transformer.transform(df[['_column']])
However, this operation then changes the original dataframe and wouldn't be useful in a pipeline.
When I try to include this transformer object into ColumnTransformer, I get an error message:
columnTransformer = ColumnTransformer([('log_transform', log_transformer.transform(), [0, 5])], remainder='passthrough')
How should I pass the custom-defined transformer object to ColumnTransformer? (The same syntax works nicely for built-in transformers, as suggested in this article: https://towardsdatascience.com/columntransformer-in-scikit-for-labelencoding-and-onehotencoding-in-machine-learning-c6255952731b)
Thank you for your help!
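For reference, ColumnTransformer expects the transformer object itself rather than the result of calling its transform method. A minimal sketch under that assumption (the column indices [0, 5] are purely illustrative):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log10, validate=True)

# pass the transformer instance, not log_transformer.transform()
columnTransformer = ColumnTransformer(
    [('log_transform', log_transformer, [0, 5])],
    remainder='passthrough'
)
# transformed = columnTransformer.fit_transform(df)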
Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in a way that feature names aren't lost (such as using pd.get_dummies on the data frame columns).
I have trained an XGBoost model using preprocessed data (centred and scaled with MinMaxScaler), so I am in a similar situation where the feature names are lost.
For instance:
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
my_model_name = XGBClassifier()
my_model_name.fit(X, Y)
where X and Y are the training data and labels, respectively. The scaling above returns a 2D NumPy array, thereby discarding the feature names from the pandas DataFrame.
Thus, when I try to use plot_importance(my_model_name), it produces a feature importance plot, but only with generic names such as f0, f1, f2, etc., and not the actual feature names from the original data set.
Is there a way to map the feature names from the original training data to the feature importance plot generated, so that the original feature names are plotted in the graph? Any help in this regard is highly appreciated.
You can get the feature names with:
model.get_booster().feature_names
You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In such a case, calling model.get_booster().feature_names is not useful, because the returned names are in the form [f0, f1, ..., fn] and these are the names shown in the output of the plot_importance method as well.
But there are several ways to achieve what you want, provided you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = X.columns if X was a pandas DataFrame.
Then you should be able to:
change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then use the plot_importance method, which should pick up the updated names and show them on the plot
or, since this method returns a matplotlib ax, you can modify the labels using plot_importance(model).set_yticklabels(orig_feature_names) (but you have to get the order of your features right)
or you can take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself, as in the sketch after this list)
similarly, you can also use the model.get_booster().get_score() method and combine it with your feature names
or you can try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training since I usually use the Scikit-Learn API)
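A minimal sketch of the third option above (pairing feature_importances_ with the stored names and plotting them yourself), assuming model and orig_feature_names as defined earlier:
import matplotlib.pyplot as plt
import pandas as pd

# pair each importance value with its original feature name and plot manually
importances = pd.Series(model.feature_importances_, index=orig_feature_names)
importances.sort_values().plot(kind='barh')
plt.xlabel('feature importance')
plt.show()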
EDIT:
Thanks to @Noob Programmer (see comments below), there might be some "inconsistencies" depending on which feature importance method you use. These are the most important ones:
xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
model.get_booster().get_score() also uses "weight" as the default (see get_score)
model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems the result is normalized to sum to 1 (see this comment)
For more info on this topic, look at How to get feature importance.
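A quick way to compare those defaults side by side, assuming the same model as above:
booster = model.get_booster()

# both calls use the Booster.get_score API; "weight" is its default importance type
print(booster.get_score(importance_type='weight'))
print(booster.get_score(importance_type='gain'))

# the scikit-learn wrapper's normalized importances (controlled by model.importance_type)
print(model.feature_importances_)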
I tried the above answers, and they didn't work when loading the model after training.
So, the code that worked for me is:
model.feature_names
It returns a list of the feature names.
I think it is best to turn the NumPy array back into a pandas DataFrame. E.g.:
import pandas as pd
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

X_df = pd.read_csv("train.csv")
Y = label  # target labels, defined elsewhere
orig_feature_names = list(X_df.columns)

scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)

my_model_name = XGBClassifier(max_depth=2, n_estimators=2)
my_model_name.fit(X_scaled_df, Y)
xgb.plot_importance(my_model_name)
plt.show()
This will show the original names.
I am working with the House Prices Kaggle dataset. I am trying to use the RobustScaler from sklearn only on the numerical features in the dataset (LotFrontage, LotArea, etc.). First, I fit the scaler to the numerical columns of my dataframe, obtained by calling select_dtypes(exclude=['object']). Once the transformer has been fit to those values, I call the transform function and try to assign the result back by setting the select_dtypes(...) call equal to the transformed values. When I attempt that, I get the following error message:
SyntaxError: can't assign to function call
The data has already been rid of null values. What does work is assigning the transform result to a variable, in which case I get the results back as a numpy.ndarray:
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df_train.select_dtypes(exclude=['object']))
df_train.select_dtypes(exclude=['object']) = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This doesn't work
test = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This DOES work, but not in the format I need
All I want is for the transformed attributes to go back into the original pandas data frame at their corresponding locations. Is there some workaround I can implement if I can't assign to the original dataframe directly?
I managed to get it to work. Not sure how Pythonic this solution is, but it got me back on track:
df_train[list(df_train.select_dtypes(exclude=['object']).columns)] = RobustScaler().fit_transform(df_train[list(df_train.select_dtypes(exclude=['object']).columns)])
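The same idea in a slightly more readable form (just a restructuring of the one-liner above, not a different method):
from sklearn.preprocessing import RobustScaler

# select the numerical columns once, then scale them in place
num_cols = df_train.select_dtypes(exclude=['object']).columns
df_train[num_cols] = RobustScaler().fit_transform(df_train[num_cols])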
Hello, I am a beginner in machine learning; I have previously worked on some binary ML tasks where the data was numerical. Now I am facing an issue where I have to find the probability of a particular combination, and I cannot disclose the dataset or the code at this point. My data is a dataframe of 10 columns. I have to train my model on 8 columns and predict the possibility of the last 2 columns; that is, my labels are a combination of the last 2 columns. The problem I am facing is that these column values are not numerical. I have tried everything I came across but can't find any suitable means of converting them to numerical values. I have tried LabelEncoder from sklearn, which works on the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are in the form '2be74fad-4d4'. Any suggestions about how to handle this issue would be highly appreciated.
To convert categorical data to numerical, you can try these approaches in sklearn:
Label Encoding
Label Binarizer
OneHot Encoding
Now, for your problem, you can use LabelEncoder, but there is a catch. With other sklearn transformers, you can declare them once and then use them to fit and transform a number of columns at once.
With LabelEncoder, you have to fit_transform on one column of the train data and then transform the same column of the test data, and then repeat the process for the next categorical column.
You can iterate over a list of categorical columns to make it simple. Consider the snippet below:
from sklearn.preprocessing import LabelEncoder

cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']

enc = LabelEncoder()
for col in cat_cols:
    train[col] = train[col].astype('str')
    test[col] = test[col].astype('str')
    train[col] = enc.fit_transform(train[col])
    test[col] = enc.transform(test[col])
You can create a dictionary with the mapping from string to integer. Then you use one-hot encoding or just feed the integer to the neural network. If the characters have some meaning, you could also do it on a per-character basis instead of word-based, but that depends on the task. If this string is just a unique identifier of the column, leave it out and don't feed it to your model.
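A minimal sketch of the dictionary approach (the dataframe df and column name 'uid' are hypothetical):
import pandas as pd

# build a string -> integer mapping from the values seen in training
mapping = {value: idx for idx, value in enumerate(df['uid'].unique())}
df['uid_code'] = df['uid'].map(mapping)

# optionally one-hot encode the integer codes instead of feeding them directly
one_hot = pd.get_dummies(df['uid_code'], prefix='uid')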
How do I specify multiple target_column values in the tflearn.data_utils.load_csv method?
According to the TFLearn docs, load_csv takes target_column as an integer.
I tried passing my target columns as a list to load_csv and, as expected, got a TypeError: 'list' object cannot be interpreted as an integer traceback.
Any solutions for this?
Thanks
That's not how regression works. You must have only one column as a target. That's why the tensorflow API only allows one column to be the target of regression, specified with an integer.
After going through the TFLearn GitHub repo, it turns out we can't directly use load_csv to specify multiple targets.
You need to read the CSV file with pandas as a dataframe, convert it to a matrix using dataframe.as_matrix(), and specify the data and label columns.
For example, I had 34 data columns and 9 target columns in my CSV file.
Here's what I did:
import tflearn
from tflearn.optimizers import SGD
import pandas as pd
import numpy as np

csvfile = pd.read_csv('path/to/file.csv')
data = csvfile[['my', 'data', 'columns']]    # the 34 feature columns
label = csvfile[['target', 'columns']]       # the 9 target columns

data = data.as_matrix()
label = label.as_matrix()
And, as mentioned in this post https://datascience.stackexchange.com/questions/16890/neural-network-for-multiple-output-regression, just set the number of units in the output layer to the number of targets and use a linear activation (i.e. no activation), and you're done.
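A minimal sketch of such a network (layer sizes, learning rate, and epoch count are illustrative, not from the original run):
# 34 input features, 9 regression targets, linear (no) activation on the output
net = tflearn.input_data(shape=[None, 34])
net = tflearn.fully_connected(net, 64, activation='relu')
net = tflearn.fully_connected(net, 9, activation='linear')
net = tflearn.regression(net, optimizer=SGD(learning_rate=0.01), loss='mean_square')

model = tflearn.DNN(net)
model.fit(data, label, n_epoch=10, batch_size=32, show_metric=True)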
Hope this helps others in a similar situation.