Relating column names to model parameters in pySpark ML - python

I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.
If my column names are continuous and categorical where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:
string_indexer = StringIndexer(inputCol='categorical',
outputCol='categorical_index')
encoder = OneHotEncoder(inputCol ='categorical_index',
outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
outputCol='indep_vars')
pipeline = Pipeline(stages=string_indexer+encoder+assembler)
model = pipeline.fit(df)
df = model.transform(df)
Everything works fine to this point, and I run the model:
glm = GeneralizedLinearRegression(family='gaussian',
link='identity',
labelCol='dep_var',
featuresCol='indep_vars')
model = glm.fit(df)
model.params
Which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)
The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
Which gives me a small dataframe relating the the string names to the numerical names, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow though, when you consider the scale of the data.
Is there a better way to do this?

For PySpark, here is the solution to map feature index to feature name:
First, train your model:
pipeline = Pipeline().setStages([label_stringIdx,assembler,classifier])
model = pipeline.fit(x)
Transform your data:
df_output = model.transform(x)
Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.
numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')
merge_list = numeric_metadata + binary_metadata
OUTPUT:
[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
....

I also came across the exact problem and I've got your solution :)
This is based on the Scala version here:
How to map variable names to features after pipeline
# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)
# extract features metadata
meta = [f.metadata
for f in best_pred.schema.fields
if f.name == 'features'][0]
# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
meta['ml_attr']['attrs']['binary']
print features_name_ind[:2]
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]

I didn't investigate the previous versions, but in Spark 2.4.3 it is possible to retrieve a lot of information about the features just by using the summary attribute of a GeneralizedLinearRegressionModel.
Printing summary results in something like this:
Coefficients:
Feature Estimate Std Error T Value P Value
(Intercept) -0.1742 0.4298 -0.4053 0.6853
x1_enc_(-inf,5.5] -0.7781 0.3661 -2.1256 0.0335
x1_enc_(5.5,8.5] 0.1850 0.3736 0.4953 0.6204
x1_enc_(8.5,9.5] -0.3937 0.4324 -0.9106 0.3625
x45_enc_1-10-7-8-9 -0.5382 0.2718 -1.9801 0.0477
x45_enc_2-3-4-ND 0.5187 0.2811 1.8454 0.0650
x45_enc_5 -0.0456 0.3353 -0.1361 0.8917
x33_enc_1 0.6361 0.4043 1.5731 0.1157
x33_enc_10 0.0059 0.4083 0.0145 0.9884
x33_enc_2-3-4-8-ND 0.6121 0.1741 3.5152 0.0004
x102_enc_(-inf,4.5] 0.5315 0.1695 3.1354 0.0017
(Dispersion parameter for binomial family taken to be 1.0000)
Null deviance: 937.7397 on 666 degrees of freedom
Residual deviance: 858.8846 on 666 degrees of freedom
AIC: 880.8846
The Feature column can be constructed by accessing an internal Java object:
In [131]: glm.summary._call_java('featureNames')
Out[131]:
['x1_enc_(-inf,5.5]',
'x1_enc_(5.5,8.5]',
'x1_enc_(8.5,9.5]',
'x45_enc_1-10-7-8-9',
'x45_enc_2-3-4-ND',
'x45_enc_5',
'x33_enc_1',
'x33_enc_10',
'x33_enc_2-3-4-8-ND',
'x102_enc_(-inf,4.5]']
The Estimate column can be constructed by the following concatenation:
In [134]: [glm.intercept] + list(glm.coefficients)
Out[134]:
[-0.17419580191414719,
-0.7781490190325139,
0.1850214800764976,
-0.3936963366945294,
-0.5382255101657534,
0.5187453074755956,
-0.045649677050663987,
0.6360647167539958,
0.00593020879299306,
0.6121475986933201,
0.531510974697773]
PS.: This line shows why the column Features can be retrieved by using an internal Java object.

Sorry, this seems to be a very late answer and maybe you might have already figured it out but wth, anyways. I recently did the same implementation of String Indexer, OneHotEncoder and VectorAssembler and as far as I have understood, the following code will present what you are looking for.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
categoricalColumns = ["one_categorical_variable"]
stages = [] # stages in the pipeline
for categoricalCol in categoricalColumns:
# Category Indexing with StringIndexer
stringIndexer = StringIndexer(inputCol=categoricalCol,
outputCol=categoricalCol+"Index")
# Using OneHotEncoder to convert categorical variables into binary
SparseVectors
encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
outputCol=categoricalCol+"classVec")
# Adding the stages so that they will be run all at once later
stages += [stringIndexer, encoder]
# convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol = "Service_Level", outputCol =
"label")
stages += [label_stringIdx]
# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = map(lambda c: c + "classVec", categoricalColumns) +
numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
# Creating a Pipeline for Training
pipeline = Pipeline(stages=stages)
# Running the feature transformations.
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)

Related

Is there a way to print a list of the most important features of an Light GBM Classifier model?

Running an LGBM Classifier model and I'm able to use lgbm.plot_importance to plot the most important features but I would prefer having a list of these features instead, does anybody know how to go about doing this?
The lightgbm.Booster object has a method .feature_importance() which can be used to access feature importances.
That method returns an array with one importance value per feature, and supports two types of importance, based on the value of importance_type:
"gain" = "cumulative gain of all splits using this feature"
"split" = "number of splits this feature was used in"
You can explore this using the following code. I ran this with lightgbm==3.3.0, numpy==1.21.0, pandas==1.2.3, and scikit-learn==0.24.1, using Python 3.8.
import lightgbm as lgb
import pandas as pd
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
data = lgb.Dataset(X, label=y)
# train model
bst = lgb.train(
params={"objective": "binary"},
train_set=data,
num_boost_round=10
)
# compute importances
importance_df = (
pd.DataFrame({
'feature_name': bst.feature_name(),
'importance_gain': bst.feature_importance(importance_type='gain'),
'importance_split': bst.feature_importance(importance_type='split'),
})
.sort_values('importance_gain', ascending=False)
.reset_index(drop=True)
)
print(importance_df)
Here's an example of the output.
feature_name importance_gain importance_split
0 Column_22 1051.204456 8
1 Column_23 862.363854 10
2 Column_27 262.272097 19
3 Column_7 161.842017 13
4 Column_21 66.431762 24
This is saying that, for example, feature Column_21 was used in more splits than other top features, but the improvement those splits provided were much less impactful than the 8 splits using Column_22.
Seems like you are using Sklearn API for Lightgbm. This should help.
General idea:
LGBMClassifier.feature_importances_
Particular case:
model_name.feature_importances_
Full code snippet (assuming pandas dataframe was used for training):
features = train_x.columns
importances = model.feature_importances_
feature_importance = pd.DataFrame({'importance':importances,'features':features}).sort_values('importance', ascending=False).reset_index(drop=True)
feature_importance
Also you can plot importances:
lgb.plot_importance(model_name)

How to apply Target Encoding in test dataset?

I am working on a project, where I had to apply target encoding for 3 categorical variables:
merged_data['SpeciesEncoded'] = merged_data.groupby('Species')['WnvPresent'].transform(np.mean)
merged_data['BlockEncoded'] = merged_data.groupby('Block')['WnvPresent'].transform(np.mean)
merged_data['TrapEncoded'] = merged_data.groupby('Trap')['WnvPresent'].transform(np.mean)
I received the results and ran the model. Now the problem is that I have to apply the same model to test data that has columns Block, Trap, and Species, but doesn't have the values of the target variable WnvPresent (which has to be predicted).
How can I transfer my encoding from training sample to the test? I would greatly appreciate any help.
P.S. I hope it makes sense.
You need to same the mapping between the feature and the mean value, if you want to apply it to the test dataset.
Here is a possible solution:
species_encoding = df.groupby(['Species'])['WnvPresent'].mean().to_dict()
block_encoding = df.groupby(['Block'])['WnvPresent'].mean().to_dict()
trap_encoding = df.groupby(['Trap'])['WnvPresent'].mean().to_dict()
merged_data['SpeciesEncoded'] = df['Species'].map(species_encoding)
merged_data['BlockEncoded'] = df['Block'].map(species_encoding)
merged_data['TrapEncoded'] = df['Trap'].map(species_encoding)
test_data['SpeciesEncoded'] = df['Species'].map(species_encoding)
test_data['BlockEncoded'] = df['Block'].map(species_encoding)
test_data['TrapEncoded'] = df['Trap'].map(species_encoding)
This would answer your question, but I want to add, that this approach can be improved. Directly using mean values of targets could make the models overfit on the data.
There are many approaches to improve target encoding, one of them is smoothing, here is a link to an example: https://maxhalford.github.io/blog/target-encoding/
Here is an example:
m = 10
mean = df['WnvPresent'].mean()
# Compute the number of values and the mean of each group
agg = df.groupby('Species')['WnvPresent'].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']
# Compute the "smoothed" means
species_encoding = ((counts * means + m * mean) / (counts + m)).to_dict()
There are 2 open source Python libraries that offer this functionality off-the-shelf: Feature-engine and Category encoders.
Assuming that we have a train and a testing set...
With Feature engine it would work as follows:
from feature_engine.encoding import MeanEncoder
# set up the encoder
encoder = MeanEncoder(variables=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
We find the replacement values in the encoding_dict_ attribute as follows:
encoder.encoding_dict_
With category encoders it works as follows:
from category_encoders.target_encoder import TargetEncoder
# set up the encoder
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
The replacement values can be found in the attribute mapping:
encoder.mapping
More details in the respective documentation:
MeanEncoder
TargetEncoder
Category encoders' TargetEncoder also offers smoothing as suggested by #andrey-lukyanenko out-of-the-box.

How to do feature selection/feature importance using PySpark?

I am trying to get feature selection/feature importances from my dataset using PySpark but I am having trouble doing it with PySpark.
This is what I have done using Python Pandas to do it but I would like to accomplish it using PySpark:
cols = [col for col in new_result.columns if col not in ['treatment']]
data = new_result[cols]
target = new_result['treatment']
model = ExtraTreesClassifier()
model.fit(data,target)
print(model.feature_importances_)
feat_importances = pd.Series(model.feature_importances_, index=data.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
This is what I have tried but I don't feel the code for PySpark have achieved what I wanted. I know the model is different but I would like to get the same result as what I did for Pandas please:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
assembler = VectorAssembler(
inputCols=['Primary_ID',
'Age',
'Gender',
'Country',
'self_employed',
'family_history',
'work_interfere',
'no_employees',
'remote_work',
'tech_company',
'benefits',
'care_options',
'wellness_program',
'seek_help',
'anonymity',
'leave',
'mental_health_consequence',
'phys_health_consequence',
'coworkers',
'supervisor',
'mental_vs_physical',
'obs_consequence',
'mental_issue_in_tech'],
outputCol="features")
output = assembler.transform(new_result)
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="treatment", outputCol="treatment_index")
output_fixed = indexer.fit(output).transform(output)
final_data = output_fixed.select("features",'treatment_index')
train_data,test_data = final_data.randomSplit([0.7,0.3])
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="treatment", seed=42)
model = rf.fit(output)
model.featureImportances
Return result of SparseVector(23, {2: 0.0961, 5: 0.1798, 6: 0.3232, 11: 0.0006, 14: 0.1307, 22: 0.2696}) What does this mean? Please advise and thank you in advance for all the help!
Vectors are represented in 2 flavours internally in the spark.
DenseVector
This takes more memory as all the elements are stored as Array[Double]
SparseVector
This is memory efficient way of storing the vector. representation having 3 parts-
size of vector
array of indices - It contains only those indices which has value other than 0.
array of values - it contains actual values associated with the indices.
Example -
val sparseVector = SparseVector(4, [1, 3], [3.0, 4.0])
println(sparseVector.toArray.mkString(", "))
// 0.0, 3.0, 0.0, 4.0
all the missing values are considered as 0
Regarding your problem-
you can map your sparse vector having feature importance with vector assembler input columns. Please note that size of feature vector and the feature importance are same.
val vectorToIndex = vectorAssembler.getInputCols.zipWithIndex.map(_.swap).toMap
val featureToWeight = rf.fit(trainingData).featureImportances.toArray.zipWithIndex.toMap.map{
case(featureWeight, index) => vectorToIndex(index) -> featureWeight
}
println(featureToWeight)
The similar code should work in python too

Tensorflow: Inspecting Hash buckets for Categorical and Feature Columns

I am fitting Linear Classifier for pretty wide and sparse data using number of Categorical Columns with hash bucket and Crossed Feature Columns as Feature Columns.
Later I want to use the weights/coefficients of the model in a custom serving infrastructure. I know how to extract the weights from the model, but obviously, for aforementioned columns, they come for an already hashed feature values.
I can reconstruct a Hashtable (value -> hashed value) for a simple categorical columns using tf.string_to_hash_bucket_fast, but I am getting trouble doing that for Crossed Feature Columns.
For a pair of values of two categorical columns building up a Crossed Column - how can I understand which bucket they will get into?
After inspecting the source code I found out that the simplest way would be to construct an Input Layer for input data consisting of the all the distinct values (or their combinations) in the column.
As a result you get a DenseTensor consisting of 0 and 1, each row corresponds to a distinct value and where 1s are sitting in the columns corresponding to the actual hash bucket number (I've verified that for Categorical Columns, should be the same for CrossedColumns).
Here is the example code (for both Categorical Column and Crossed Column):
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc
actual_sex = {'sex': tf.Variable(['male', 'female', 'female', 'male'], tf.string)}
actual_nationality = {'nationality': tf.Variable(['belgian', 'french', 'belgian', 'belgian'], tf.string)}
actual_sex_nationality = dict(actual_sex, **actual_nationality)
# hashed_column
sex_hashed_raw = fc.categorical_column_with_hash_bucket("sex", 10)
sex_hashed = fc.indicator_column(sex_hashed_raw)
# crossed column
crossed_sn_raw = fc.crossed_column(['sex', 'nationality'], hash_bucket_size = 20)
crossed_sn = fc.indicator_column(crossed_sn_raw)
layer_s = tf.feature_column.input_layer(actual_sex_nationality, sex_hashed)
layer_sn = tf.feature_column.input_layer(actual_sex_nationality, crossed_sn)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(layer_s))
print(sess.run(layer_sn))

How to find the features names of the coefficients using scikit linear regression?

I use scikit linear regression and if I change the order of the features, the coef are still printed in the same order, hence I would like to know the mapping of the feature with the coeff.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print model_1.coef_
print model_2.coef_
print model_3.coef_
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
#Robin posted a great answer, but for me I had to make one tweak on it to work the way I wanted, and it was to refer to the dimension of the 'coef_' np.array that I wanted, namely modifying to this: model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great but what personally worked for me was this, as the feature names I needed were the columns of my train_date dataframe:
pd.DataFrame(data=model_1.coef_,columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store the column name and their coefficient value in a dictionary.
model.fit(X_train,y)
# assuming all the columns except last one is used in training
columns = data.iloc[:,-1].columns
coef_dict = {}
for i in range(0,len(columns)):
coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
data=model_1.coef_,
index=model_1.feature_names_in_
)
A minimally reproducible example
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
data=np.random.random((10, 3)),
columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coeficients and their respective feature names
coef_series = pd.Series(
data=lr.coef_,
index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64

Categories