How to apply Target Encoding to a test dataset? - python

I am working on a project where I had to apply target encoding to 3 categorical variables:
merged_data['SpeciesEncoded'] = merged_data.groupby('Species')['WnvPresent'].transform(np.mean)
merged_data['BlockEncoded'] = merged_data.groupby('Block')['WnvPresent'].transform(np.mean)
merged_data['TrapEncoded'] = merged_data.groupby('Trap')['WnvPresent'].transform(np.mean)
I received the results and ran the model. Now the problem is that I have to apply the same model to test data that has columns Block, Trap, and Species, but doesn't have the values of the target variable WnvPresent (which has to be predicted).
How can I transfer my encoding from training sample to the test? I would greatly appreciate any help.
P.S. I hope it makes sense.

You need to save the mapping between each feature value and the target mean if you want to apply it to the test dataset.
Here is a possible solution:
# Build the category -> target-mean mappings on the training data
species_encoding = df.groupby(['Species'])['WnvPresent'].mean().to_dict()
block_encoding = df.groupby(['Block'])['WnvPresent'].mean().to_dict()
trap_encoding = df.groupby(['Trap'])['WnvPresent'].mean().to_dict()
# Apply them to the training data...
merged_data['SpeciesEncoded'] = df['Species'].map(species_encoding)
merged_data['BlockEncoded'] = df['Block'].map(block_encoding)
merged_data['TrapEncoded'] = df['Trap'].map(trap_encoding)
# ...and the same mappings to the test data
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding)
test_data['BlockEncoded'] = test_data['Block'].map(block_encoding)
test_data['TrapEncoded'] = test_data['Trap'].map(trap_encoding)
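One caveat (my addition): categories that appear in the test set but not in the training set have no entry in the mapping, so map returns NaN for them. A common fallback is the global target mean:
global_mean = df['WnvPresent'].mean()
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding).fillna(global_mean)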
This answers your question, but I want to add that this approach can be improved: directly using the mean of the target can make the model overfit the data.
There are many approaches to improve target encoding, one of them is smoothing, here is a link to an example: https://maxhalford.github.io/blog/target-encoding/
Here is an example:
m = 10
mean = df['WnvPresent'].mean()
# Compute the number of values and the mean of each group
agg = df.groupby('Species')['WnvPresent'].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']
# Compute the "smoothed" means
species_encoding = ((counts * means + m * mean) / (counts + m)).to_dict()
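The same computation works for the other two columns; a small helper (my wording, same logic as above) avoids repeating it:
def smoothed_encoding(df, col, target='WnvPresent', m=10):
    # Weighted blend of the per-category mean and the global mean
    mean = df[target].mean()
    agg = df.groupby(col)[target].agg(['count', 'mean'])
    return ((agg['count'] * agg['mean'] + m * mean) / (agg['count'] + m)).to_dict()

species_encoding = smoothed_encoding(df, 'Species')
block_encoding = smoothed_encoding(df, 'Block')
trap_encoding = smoothed_encoding(df, 'Trap')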

There are two open-source Python libraries that offer this functionality off the shelf: Feature-engine and Category Encoders.
Assuming that we have a training and a testing set...
With Feature-engine it would work as follows:
from feature_engine.encoding import MeanEncoder
# set up the encoder
encoder = MeanEncoder(variables=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
We find the replacement values in the encoding_dict_ attribute as follows:
encoder.encoding_dict_
With category encoders it works as follows:
from category_encoders.target_encoder import TargetEncoder
# set up the encoder
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
The replacement values can be found in the attribute mapping:
encoder.mapping
More details in the respective documentation:
MeanEncoder
TargetEncoder
Category Encoders' TargetEncoder also offers smoothing out of the box, as suggested by @andrey-lukyanenko.
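For example (the smoothing value below is an arbitrary choice, not a recommendation):
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'], smoothing=10)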

Related

Strange Behavior in Emukit When Combining Multifidelity and Experimental Design

I have a Python code that aims to build several multi-fidelity models (one for each of several variables) and use Emukit's experimental design functions to update them iteratively. I am using simple uncertainty acquisition (ModelVariance) and the multi-fidelity-wrapped gradient optimizer as shown in the examples here and here.

I started by applying this technique to only one of my several variables. When doing that I noticed that 1) all update points (x_new) seemed to be selected from the LF model and 2) the variance dropped precipitously everywhere after adding only a single update point. I shrugged this off initially and applied the technique to all my variables (using a loop over a dictionary to do each variable in turn).

When I did that, I discovered that the mean predictions (new model points) seemed perfectly reasonable, but the reported variances using .predict() for ALL the models of ALL the variables were exactly the same, and were in fact what I had been given by the program when just doing the single variable. Something seems to be going very wrong finding and updating the variances after adding a new training point and using .set_data to update the model, and I am not sure what or where the problem is. Is there an Emukit bug? Am I using an incorrect setting? Is the problem with my dictionaries or for-loops? I am at a loss. Can anyone offer some insight?
Here is the code I currently have, somewhat redacted. I am sorry that it's such a long read....
# SKIPPING GENERAL IMPORTS
def make_mf(x, y, kernel, fidels):
    # Generic multifidelity model builder.
    # Returns a multifidelity model built based on the training points (x and y),
    # kernels, and number of fidelities
    mf_lin_model = GPyLinearMultiFidelityModel(x, y, kernel, n_fidelities=fidels)
    # set up loop to fix noise to 0 for all fidelities, indicating training points are exact
    for i in range(fidels):
        if i == 0:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise.fix(0)"
        else:
            caller = "mf_lin_model.mixed_noise.Gaussian_noise_" + str(i) + ".fix(0)"
        eval(caller)
    ## Wrap the model using the given 'GPyMultiOutputWrapper'
    mf_model = GPyMultiOutputWrapper(mf_lin_model, 2, n_optimization_restarts=5, verbose_optimization=False)
    # Fit the model
    mf_model.optimize()
    # Return the final model to the calling procedure
    return mf_model
np.random.seed(20)
# list of y (result variables)
yvars=["VAR1","VAR2","VAR3"]
#list of x (input) variables
xvars=["XVAR"]
# list of fidelity levels. levels should be in order of ascending fidelity (0=lowest)
levels=["lf","hf"]
# list of what we'll need to store for each variable and level
# these are the model itself, the predicted values for plotting,
# and the predicted values at the training points
contents=['surrogate','y_plot','y_train']
# list of medium_fidelity variables
# these are the training coordinates, the model, predicted values for plotting,
# predicted variances, the maximum and mean variance, and predicted
# values at the training points
multifivars=['y_plot','variance','varmax','varmean','pl_train']
mainvars=['model','x_train','y_train']
# set up a dictionary to store the models and related results for each y-variable
# and each fidelity
MyModels={key:{lkey:{ckey:None for ckey in contents} for lkey in levels} for key in yvars}
# Set up a dictionary for the multi-fidelity models
MultiFidelity={key:{vkey: None for vkey in mainvars}for key in yvars}
for key in MultiFidelity.keys():
    for level in levels:
        MultiFidelity[key][level] = {mkey: None for mkey in multifivars}
#set up a dictionary to easily access data
MyData={key:None for key in levels}
# set up dictionaries to easily access training and plotting points
x_train={key:None for key in levels}
Y_plot={key:None for key in levels}
T_plot={key:None for key in levels}
# Number of initial points evaluated at each fidelity level
npoints=[5,2]
MyPoints={levels[i]:npoints[i] for i in range(len(levels))}
## SKIPPED THE SECTION WHERE I READ IN THE RAW DATA
# High sampling of models for plotting of functions
x_plot = np.linspace(2, 16, 200)[:, None]
# set up points for plotting and retrieving MF model
X_plot = convert_x_list_to_array([x_plot, x_plot])
for i in range(len(levels)):
    Y_plot[levels[i]] = X_plot[i*len(x_plot):(i+1)*len(x_plot)]
Y_plot_h = X_plot[len(x_plot):]
# Sampling for training for multi-fidelity analysis
x_train[levels[0]] = np.atleast_2d(np.random.rand(MyPoints[levels[0]])*14+2).T
for i in range(1, len(levels)):
    x_train[levels[i]] = np.atleast_2d(np.random.permutation(x_train[levels[i-1]])[:MyPoints[levels[i]]])
#x_train_h = np.atleast_2d([3, 9.5, 11, 15]).T
# set up points for plotting mf result at training points
X_train = convert_x_list_to_array([x_train[levels[0]], x_train[levels[0]]])
for i in range(len(levels)):
    T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
#print(X_train)
# combine the training points of all fidelity levels into a list of arrays
xtemp = []
for level in levels:
    xtemp.append(x_train[level])
kernels = [GPy.kern.RBF(1), GPy.kern.RBF(1)]
lin_mf_kernel = emukit.multi_fidelity.kernels.LinearMultiFidelityKernel(kernels)
for var in MyModels.keys():
    ytemp = []
    for level in levels:
        # use SciPy interpolate to build surrogate for given variable and fidelity level
        MyModels[var][level]['surrogate'] = interpolate.interp1d(MyData[level]['Coll'], MyData[level][var])
        # find y-values for training MF points and append to a list of arrays
        MyModels[var][level]['y_train'] = MyModels[var][level]['surrogate'](x_train[level])
        ytemp.append(MyModels[var][level]['y_train'])
        MyModels[var][level]['y_plot'] = MyModels[var][level]['surrogate'](x_plot)
    ## Convert lists of arrays to ndarrays augmented with fidelity indicators
    MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'] = convert_xy_lists_to_arrays(xtemp, ytemp)
    # Build the multi-fidelity model
    ## Construct a linear multi-fidelity model
    MultiFidelity[var]['model'] = make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel, len(levels))
    # Get multifidelity model values and variances at plotting points
    for level in levels:
        MultiFidelity[var][level]['y_plot'], MultiFidelity[var][level]['variance'] = MultiFidelity[var]['model'].predict(Y_plot[level])
        # find maximum and average variance to measure the accuracy of the MF model
        MultiFidelity[var][level]['varmax'] = np.amax(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['varmean'] = np.mean(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
for key in MyModels.keys():
    for level in levels:
        print(key, level, MultiFidelity[key][level]['varmax'], MultiFidelity[key][level]['varmean'])
# set up the parameter space. we are scanning in x between 2 and 16 to match the range of my input
parameter_space = ParameterSpace([ContinuousParameter('x', 2, 16), InformationSourceParameter(len(levels))])
# set up how we will look for the target of our search
optimizer = MultiSourceAcquisitionOptimizer(GradientAcquisitionOptimizer(parameter_space), parameter_space)
# Plot each variable vs X for BEFORE any new points are added
for var in yvars:
    plot_vars(var, 0)
# Note: right now I am basing the acquisition function on the first variable ONLY. I intend to
# build a more complex function later when I get these bugs worked out.
acquisition=ModelVariance(MultiFidelity[yvars[0]]['model'])
# perform optimization to find the target point
x_new, val = optimizer.optimize(acquisition)
# x_new=np.atleast_2d(0)
# x_new[0][0]=np.random.rand()*14+2
print('first update points is',x_new)
# I want to manually specify that I add one HF training point and 4 LF training points,
# hence the way the following code is built. This could be a source of problems?
# construct our own version of the new data point because we will want it from the HF surrogate model
# (hence the value 1 in the final column)
new_point_x_hi = [[x_new[0][0],1.]]
# also, since this is an HF point, we include it as a training point in the LF model
new_point_x_lo = [[x_new[0][0],0.]]
# # we also append the new x-value to the training point x-array
x_train[levels[0]]=np.append(x_train[levels[0]],[[x_new[0][0]]],axis=0)
x_train[levels[1]]=np.append(x_train[levels[1]],[[x_new[0][0]]],axis=0)
# next, prepare points to allow the plotting of the training points on each model
X_train=convert_x_list_to_array([x_train[levels[0]],x_train[levels[0]]])
for i in range(len(levels)):
    T_plot[levels[i]] = X_train[i*len(x_train[levels[0]]):(i+1)*len(x_train[levels[0]])]
for var in yvars:
    # Now, for every variable in our list we add training points and update the models
    # find the corresponding y-values from the respective surrogates
    new_point_y_hi = np.atleast_2d(MyModels[var]['hf']['surrogate'](x_new[0][0]))
    new_point_y_lo = np.atleast_2d(MyModels[var]['lf']['surrogate'](x_new[0][0]))
    # Note that, as usual, we make these into 2D arrays to match EMUKit's formatting
    # now append the new point to our model's training data arrays
    MultiFidelity[var]['x_train'] = np.append(MultiFidelity[var]['x_train'], new_point_x_hi, axis=0)
    MultiFidelity[var]['y_train'] = np.append(MultiFidelity[var]['y_train'], new_point_y_hi, axis=0)
    MultiFidelity[var]['x_train'] = np.append(MultiFidelity[var]['x_train'], new_point_x_lo, axis=0)
    MultiFidelity[var]['y_train'] = np.append(MultiFidelity[var]['y_train'], new_point_y_lo, axis=0)
    # now we use .set_data to update the model based on the extended training data
    # MultiFidelity[var]['model'] = make_mf(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'], lin_mf_kernel, len(levels))
    MultiFidelity[var]['model'].set_data(MultiFidelity[var]['x_train'], MultiFidelity[var]['y_train'])
    # and finally, re-calculate the values and variances at our plotting points to create an updated plot
    # MultiFidelity[var]['lf']['y_plot'], MultiFidelity[var]['lf']['variance'] = MultiFidelity[var]['model'].predict(Y_plot['lf'])
    # MultiFidelity[var]['hf']['y_plot'], MultiFidelity[var]['hf']['variance'] = MultiFidelity[var]['model'].predict(Y_plot['hf'])
    # MultiFidelity[var]['hf']['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot['hf'])
    # not forgetting to update the maximum and average variances
    for level in levels:
        # get new plotting points
        MultiFidelity[var][level]['y_plot'], MultiFidelity[var][level]['variance'] = MultiFidelity[var]['model'].predict(Y_plot[level])
        MultiFidelity[var][level]['pl_train'], _ = MultiFidelity[var]['model'].predict(T_plot[level])
        # find maximum and average variance to measure the accuracy of the MF model
        MultiFidelity[var][level]['varmax'] = np.amax(MultiFidelity[var][level]['variance'])
        MultiFidelity[var][level]['varmean'] = np.mean(MultiFidelity[var][level]['variance'])
        # report maximum and average variance
        print(var, level, 'max = ', MultiFidelity[var][level]['varmax'], 'mean = ', MultiFidelity[var][level]['varmean'])
    # Plot each variable vs Coll for rcas, helios and the low and high-fidelity models after HF point added
    plot_vars(var, 1)
# NOW DID THE SAME THING FOR A SEQUENCE OF 4 LF POINTS
I have tried using different acquisition functions and got the same behavior. I have also tried rebuilding the model from scratch using model.optimize() and only got stranger behavior.

How to take confidence interval of statsmodels.tsa.holtwinters-ExponentialSmoothing Models in python?

I did time series forecasting analysis with ExponentialSmoothing in python. I used statsmodels.tsa.holtwinters.
model = ExponentialSmoothing(df, seasonal='mul', seasonal_periods=12).fit()
pred = model.predict(start=df.index[0], end=122)
plt.plot(df_fc.index, df_fc, label='Train')
plt.plot(pred.index, pred, label='Holt-Winters')
plt.legend(loc='best')
I want to take the confidence interval of the model result. But I couldn't find any function for this in "statsmodels.tsa.holtwinters - ExponentialSmoothing". How do I do that?
From this answer from a GitHub issue, it is clear that you should be using the new ETSModel class, and not the old (but still present for compatibility) ExponentialSmoothing.
ETSModel includes more parameters and more functionality than ExponentialSmoothing.
To calculate confidence intervals, I suggest you to use the simulate method of ETSResults:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
import pandas as pd
# Build model.
ets_model = ETSModel(
    endog=y,  # y should be a pd.Series
    seasonal='mul',
    seasonal_periods=12,
)
ets_result = ets_model.fit()
# Simulate predictions.
n_steps_prediction = y.shape[0]
n_repetitions = 500
df_simul = ets_result.simulate(
    nsimulations=n_steps_prediction,
    repetitions=n_repetitions,
    anchor='start',
)
# Calculate confidence intervals.
upper_ci = df_simul.quantile(q=0.9, axis='columns')
lower_ci = df_simul.quantile(q=0.1, axis='columns')
Basically, calling the simulate method you get a DataFrame with n_repetitions columns, and with n_steps_prediction steps (in this case, the same number of items in your training data-set y).
Then, you calculate the confidence intervals with DataFrame quantile method (remember the axis='columns' option).
You could also calculate other statistics from the df_simul.
I also checked the source code: simulate is internally called by the forecast method to predict steps in the future. So, you could also predict steps in the future and their confidence intervals with the same approach: just use anchor='end', so that the simulations will start from the last step in y.
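A minimal sketch of that (the number of steps is an arbitrary choice):
# Simulate 24 steps beyond the end of the training series.
df_simul_future = ets_result.simulate(
    nsimulations=24,
    repetitions=n_repetitions,
    anchor='end',
)
upper_ci_future = df_simul_future.quantile(q=0.9, axis='columns')
lower_ci_future = df_simul_future.quantile(q=0.1, axis='columns')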
To be fair, there is also a more direct approach to calculate the confidence intervals: the get_prediction method (which uses simulate internally). But I do not really like its interface, it is not flexible enough for me, I did not find a way to specify the desired confidence intervals. The approach with the simulate method is pretty easy to understand, and very flexible, in my opinion.
If you want further details on how this kind of simulations are performed, read this chapter from the excellent Forecasting: Principles and Practice online book.
Complementing the answer from @Enrico, we can use the get_prediction in the following way:
ci = model.get_prediction(start = forecast_data.index[0], end = forecast_data.index[-1])
preds = ci.pred_int(alpha = .05) #confidence interval
limits = ci.predicted_mean
preds = pd.concat([limits, preds], axis = 1)
preds.columns = ['yhat', 'yhat_lower', 'yhat_upper']
preds
Implemented answer (by myself), following @Enrico's suggestion to use get_prediction:
from statsmodels.tsa.exponential_smoothing.ets import ETSModel
#---sales:pd.series, time series data(index should be timedate format)
#---new advanced holt's winter ts model implementation
HWTES_Model = ETSModel(endog=sales, trend= 'mul', seasonal='mul', seasonal_periods=4).fit()
point_forecast = HWTES_Model.forecast(16)
#-------Confidence Interval forecast calculation start------------------
alpha_1 = 0.05  # significance level for the interval (not defined in the original snippet)
ci = HWTES_Model.get_prediction(start=point_forecast.index[0],
                                end=point_forecast.index[-1])
lower_conf_forecast = ci.pred_int(alpha=alpha_1).iloc[:, 0]
upper_conf_forecast = ci.pred_int(alpha=alpha_1).iloc[:, 1]
#-------Confidence Interval forecast calculation end-----------------
To complement the previous answers, here is a function to plot the CI on top of the forecast.
def ets_forecast(model, h=8):
    # Simulate predictions.
    n_steps_prediction = h
    n_repetitions = 1000
    yhat = model.forecast(h)
    df_simul = model.simulate(
        nsimulations=n_steps_prediction,
        repetitions=n_repetitions,
        anchor='end',
    )
    # Calculate confidence intervals.
    upper_ci = df_simul.quantile(q=0.975, axis='columns')
    lower_ci = df_simul.quantile(q=0.025, axis='columns')
    plt.plot(yhat.index, yhat.values)
    plt.fill_between(yhat.index, lower_ci, upper_ci, color='blue', alpha=0.1)
    return yhat
plt.plot(y)
ets_forecast(model2, h=8)
plt.show()

Investigating the features' importance and weights evolution in given DL model

I apologize for a longer than usual intro, but it is important for the question:
I've recently been assigned to work on an existing project, which uses Keras+Tensorflow to create a Fully Connected Net.
Overall the model has 3 fully connected layers with 500 neurons each and 2 output classes. The first layer's 500 neurons are connected to 82 input features. The model is used in production and is retrained weekly, using that week's information generated by an outside source.
The engineer who designed the model is no longer working here, and I'm trying to reverse engineer and understand its behavior.
Couple of objectives I have defined for myself are:
Understand the feature selection process and feature importance.
Understand and control the weekly re-training process.
In order to try and answer both of them, I've implemented an experiment where I feed my code with two models: one from the previous week and the other from the current week:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from keras.models import model_from_json
path1 = 'C:/Model/20190114/'
path2 = 'C:/Model/20190107/'
model_name1 = '0_10.1'
model_name2 = '0_10.2'
models = [path1 + model_name1, path2 + model_name2]
features_cum_weight = {}
I then take each feature and sum the absolute values of all the weights that connect it to the first hidden layer.
This way I create two vectors of 82 values:
for model_name in models:
    structure_filename = model_name + "_structure.json"
    weights_filename = model_name + "_weights.h5"
    with open(structure_filename, 'r') as model_json:
        model = model_from_json(model_json.read())
    model.load_weights(weights_filename)
    in_layer_weights = model.layers[0].get_weights()[0]
    in_layer_weights = abs(in_layer_weights)
    features_cum_weight[model_name] = in_layer_weights.sum(axis=1)
I then plot them, using MatplotLib:
# Plot the Evolvement of Input Neuron Weights:
keys = list(features_cum_weight.keys())
weights_1 = features_cum_weight[keys[0]]
weights_2 = features_cum_weight[keys[1]]
fig, ax = plt.subplots(nrows=2, ncols=2)
width = 0.35  # the width of the bars
n_plots = 4
batch = int(np.ceil(len(weights_1)/n_plots))
for i in range(n_plots):
    start = i*(batch+1)
    stop = min(len(weights_1), start + batch + 1)
    cur_w1 = weights_1[start:stop]
    cur_w2 = weights_2[start:stop]
    ind = np.arange(len(cur_w1))
    cur_ax = ax[i//2][i%2]
    cur_ax.bar(ind - width/2, cur_w1, width, color='SkyBlue', label='Current Model')
    cur_ax.bar(ind + width/2, cur_w2, width, color='IndianRed', label='Previous Model')
    cur_ax.set_ylabel('Sum of Weights')
    cur_ax.set_title('Sum of all weights connected by feature')
    cur_ax.set_xticks(ind)
    cur_ax.legend()
    cur_ax.set_ylim(0, 30)
plt.show()
Resulting in the following plot:
[Matplotlib figure: grouped bar charts comparing the per-feature weight sums of the two models]
I then try to compare the vectors to deduce:
If the vectors have been changed drastically - there might be some major change in the training data or some problem while retraining the model.
If some value is close to zero the model might have recognized this feature as not important.
I want your opinion and insights on the following:
The overall approach to this experiment.
Advice on other ideas on reverse engineering on a given model.
Insights on the output I provide here.
Thank you all, I am open to any suggestions and critic!
This type of deduction is not entirely valid: the combination of the features is not linear. It is true that a weight of exactly 0 means the feature does not matter at that connection, but the feature may be recombined in another way in a deeper layer.
Your reasoning would hold if your model were linear. In fact, this is how PCA works, where it searches for linear relationships through the covariance matrix; the eigenvalue indicates the importance of each component.
I think that there are several ways to confirm your suspicions:
Eliminate features that you think are not important to train again and see the result. If it is similar, your suspicions are correct.
Apply the current model: take an example to evaluate (we will call it a pivot), significantly perturb the features that you consider irrelevant, and create many such examples. Repeat this for several pivots. If the results are similar, that feature should not matter. Example (I consider the first feature to be irrelevant; see the evaluation sketch after the snippet):
data = np.array([[0.5, 1, 0.5], [1, 2, 5]])
range_values = 50
new_data = []
for i in range(data.shape[0]):
    sample = data[i]
    # We create new samples, perturbing only the (supposedly irrelevant) first feature
    for j in range(1000):
        noise = np.random.rand() * range_values
        new_sample = sample.copy()
        new_sample[0] += noise
        new_data.append(new_sample)
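A minimal sketch of the comparison step (my addition; it assumes model is a trained Keras-style model accepting these 3-feature rows):
new_data = np.array(new_data)   # shape: (2 pivots x 1000 copies, 3)
preds = model.predict(new_data)
# If the first feature really is irrelevant, predictions should barely
# vary across the perturbed copies of each pivot.
for p in range(2):
    block = preds[p*1000:(p+1)*1000]
    print('pivot', p, 'std of predictions:', block.std(axis=0))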

Spark: how to get cluster's points (KMeans)

I'm trying to retrieve the data points belonging to a specific cluster in Spark. In the following piece of code, the data is made up, but I actually obtain the predicted cluster.
Here is the code I have so far:
import numpy as np
from pyspark.mllib.clustering import KMeans
# Example data
flight_routes = np.array([[1, 3, 2, 0],
                          [4, 2, 1, 4],
                          [3, 6, 2, 2],
                          [0, 5, 2, 1]])
flight_routes = sc.parallelize(flight_routes)
model = KMeans.train(rdd=flight_routes, k=500, maxIterations=10)
route_test = np.array([[0,2,3,4]])
test = sc.parallelize(route_test)
prediction = model.predict(test)
cluster_number_predicted = prediction.collect()
print(cluster_number_predicted)  # it returns [100] <-- COOL!!
Now, I'd like to have all the data points belonging to the cluster number 100. How do I get those ?
What I want achieve is something like the answer given to this SO question: Cluster points after Means (Sklearn)
Thank you in advance.
If you want both the record and the prediction (and are not willing to switch to Spark ML) you can zip the RDDs:
predictions_and_values = model.predict(test).zip(test)
and filter afterwards (the prediction is the first element of each pair):
predictions_and_values.filter(lambda x: x[0] == 100)
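To pull out just the points assigned to cluster 100, a short follow-up (my addition, using the same RDD API):
cluster_points = (predictions_and_values
                  .filter(lambda x: x[0] == 100)   # keep pairs predicted in cluster 100
                  .map(lambda x: x[1])             # drop the prediction, keep the data point
                  .collect())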

Relating column names to model parameters in pySpark ML

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.
If my column names are continuous and categorical, where the first is a column of floats and the second is a column of strings denoting (in this case) 8 different categories:
string_indexer = StringIndexer(inputCol='categorical',
                               outputCol='categorical_index')
encoder = OneHotEncoder(inputCol='categorical_index',
                        outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
                            outputCol='indep_vars')
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)
Everything works fine to this point, and I run the model:
glm = GeneralizedLinearRegression(family='gaussian',
                                  link='identity',
                                  labelCol='dep_var',
                                  featuresCol='indep_vars')
model = glm.fit(df)
model.params
Which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)
The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
Which gives me a small dataframe relating the string names to the numerical names, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow, though, when you consider the scale of the data.
Is there a better way to do this?
For PySpark, here is the solution to map feature index to feature name:
First, train your model:
pipeline = Pipeline().setStages([label_stringIdx,assembler,classifier])
model = pipeline.fit(x)
Transform your data:
df_output = model.transform(x)
Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.
numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')
merge_list = numeric_metadata + binary_metadata
OUTPUT:
[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
....
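From there, pairing each feature name with its coefficient is a one-liner; a sketch, assuming the final pipeline stage is the fitted classifier and that the coefficient order follows the feature indices:
fitted_clf = model.stages[-1]  # the classifier stage of the fitted pipeline
coef_by_name = {attr['name']: float(fitted_clf.coefficients[attr['idx']])
                for attr in merge_list}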
I also came across the exact same problem, and I've got a solution for you :)
This is based on the Scala version here:
How to map variable names to features after pipeline
# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)
# extract features metadata
meta = [f.metadata
        for f in best_pred.schema.fields
        if f.name == 'features'][0]
# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
                    meta['ml_attr']['attrs']['binary']
print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
I didn't investigate the previous versions, but in Spark 2.4.3 it is possible to retrieve a lot of information about the features just by using the summary attribute of a GeneralizedLinearRegressionModel.
Printing the summary results in something like this:
Coefficients:
Feature Estimate Std Error T Value P Value
(Intercept) -0.1742 0.4298 -0.4053 0.6853
x1_enc_(-inf,5.5] -0.7781 0.3661 -2.1256 0.0335
x1_enc_(5.5,8.5] 0.1850 0.3736 0.4953 0.6204
x1_enc_(8.5,9.5] -0.3937 0.4324 -0.9106 0.3625
x45_enc_1-10-7-8-9 -0.5382 0.2718 -1.9801 0.0477
x45_enc_2-3-4-ND 0.5187 0.2811 1.8454 0.0650
x45_enc_5 -0.0456 0.3353 -0.1361 0.8917
x33_enc_1 0.6361 0.4043 1.5731 0.1157
x33_enc_10 0.0059 0.4083 0.0145 0.9884
x33_enc_2-3-4-8-ND 0.6121 0.1741 3.5152 0.0004
x102_enc_(-inf,4.5] 0.5315 0.1695 3.1354 0.0017
(Dispersion parameter for binomial family taken to be 1.0000)
Null deviance: 937.7397 on 666 degrees of freedom
Residual deviance: 858.8846 on 666 degrees of freedom
AIC: 880.8846
The Feature column can be constructed by accessing an internal Java object:
In [131]: glm.summary._call_java('featureNames')
Out[131]:
['x1_enc_(-inf,5.5]',
'x1_enc_(5.5,8.5]',
'x1_enc_(8.5,9.5]',
'x45_enc_1-10-7-8-9',
'x45_enc_2-3-4-ND',
'x45_enc_5',
'x33_enc_1',
'x33_enc_10',
'x33_enc_2-3-4-8-ND',
'x102_enc_(-inf,4.5]']
The Estimate column can be constructed by the following concatenation:
In [134]: [glm.intercept] + list(glm.coefficients)
Out[134]:
[-0.17419580191414719,
-0.7781490190325139,
0.1850214800764976,
-0.3936963366945294,
-0.5382255101657534,
0.5187453074755956,
-0.045649677050663987,
0.6360647167539958,
0.00593020879299306,
0.6121475986933201,
0.531510974697773]
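Putting the two together (a sketch; it assumes the coefficient order matches featureNames, which the table above suggests, and it leaves the intercept out):
names = glm.summary._call_java('featureNames')
coef_by_name = dict(zip(names, glm.coefficients))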
PS.: This line shows why the column Features can be retrieved by using an internal Java object.
Sorry, this is a very late answer and you may have already figured it out, but anyway: I recently did the same implementation of StringIndexer, OneHotEncoder and VectorAssembler, and as far as I understand, the following code will do what you are looking for.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline

for categoricalCol in categoricalColumns:
    # Category Indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Using OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")
    # Adding the stages so that they will be run all at once later
    stages += [stringIndexer, encoder]

# convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]

# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Creating a Pipeline for Training
pipeline = Pipeline(stages=stages)

# Running the feature transformations.
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
