I am fitting a linear classifier on fairly wide and sparse data, using a number of categorical columns with hash buckets and crossed feature columns as my feature columns.
Later I want to use the weights/coefficients of the model in a custom serving infrastructure. I know how to extract the weights from the model, but for the aforementioned columns they obviously correspond to already-hashed feature values.
I can reconstruct a hashtable (value -> hashed value) for a simple categorical column using tf.string_to_hash_bucket_fast, but I am having trouble doing that for crossed feature columns.
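For a plain categorical column, the reconstruction I have in mind is a minimal sketch like this (TF 1.x):
import tensorflow as tf
values = tf.constant(['male', 'female'])
# the same hashing that categorical_column_with_hash_bucket applies
bucket_ids = tf.string_to_hash_bucket_fast(values, num_buckets=10)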
For a pair of values from the two categorical columns building up a crossed column - how can I determine which bucket they will fall into?
After inspecting the source code I found that the simplest way is to construct an input layer over input data consisting of all the distinct values (or combinations of values) in the column.
The result is a dense tensor of 0s and 1s, where each row corresponds to a distinct value and the 1 sits in the column matching the actual hash bucket number (I've verified this for categorical columns; it should work the same way for crossed columns).
Here is the example code (for both Categorical Column and Crossed Column):
import tensorflow as tf
from tensorflow.python.feature_column import feature_column as fc
actual_sex = {'sex': tf.Variable(['male', 'female', 'female', 'male'], dtype=tf.string)}
actual_nationality = {'nationality': tf.Variable(['belgian', 'french', 'belgian', 'belgian'], dtype=tf.string)}
actual_sex_nationality = dict(actual_sex, **actual_nationality)
# hashed_column
sex_hashed_raw = fc.categorical_column_with_hash_bucket("sex", 10)
sex_hashed = fc.indicator_column(sex_hashed_raw)
# crossed column
crossed_sn_raw = fc.crossed_column(['sex', 'nationality'], hash_bucket_size = 20)
crossed_sn = fc.indicator_column(crossed_sn_raw)
layer_s = tf.feature_column.input_layer(actual_sex_nationality, sex_hashed)
layer_sn = tf.feature_column.input_layer(actual_sex_nationality, crossed_sn)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(layer_s))
print(sess.run(layer_sn))
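As a small follow-up sketch (reusing the session above), the bucket index per row can be recovered directly with argmax over the indicator columns:
# position of the 1 in each indicator row = the hash bucket for that row
print(sess.run(tf.argmax(layer_s, axis=1)))   # bucket per 'sex' value
print(sess.run(tf.argmax(layer_sn, axis=1)))  # bucket per (sex, nationality) pair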
I am working on a project, where I had to apply target encoding for 3 categorical variables:
merged_data['SpeciesEncoded'] = merged_data.groupby('Species')['WnvPresent'].transform(np.mean)
merged_data['BlockEncoded'] = merged_data.groupby('Block')['WnvPresent'].transform(np.mean)
merged_data['TrapEncoded'] = merged_data.groupby('Trap')['WnvPresent'].transform(np.mean)
I received the results and ran the model. Now the problem is that I have to apply the same model to test data that has columns Block, Trap, and Species, but doesn't have the values of the target variable WnvPresent (which has to be predicted).
How can I transfer my encoding from training sample to the test? I would greatly appreciate any help.
P.S. I hope it makes sense.
You need to save the mapping between the feature values and the mean target values if you want to apply it to the test dataset.
Here is a possible solution:
species_encoding = merged_data.groupby('Species')['WnvPresent'].mean().to_dict()
block_encoding = merged_data.groupby('Block')['WnvPresent'].mean().to_dict()
trap_encoding = merged_data.groupby('Trap')['WnvPresent'].mean().to_dict()
merged_data['SpeciesEncoded'] = merged_data['Species'].map(species_encoding)
merged_data['BlockEncoded'] = merged_data['Block'].map(block_encoding)
merged_data['TrapEncoded'] = merged_data['Trap'].map(trap_encoding)
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding)
test_data['BlockEncoded'] = test_data['Block'].map(block_encoding)
test_data['TrapEncoded'] = test_data['Trap'].map(trap_encoding)
This answers your question, but I want to add that this approach can be improved. Directly using the mean of the target can make the model overfit the data.
There are many approaches to improving target encoding; one of them is smoothing. Here is a link to an example: https://maxhalford.github.io/blog/target-encoding/
Here is an example:
m = 10  # smoothing factor: weight given to the global mean
mean = merged_data['WnvPresent'].mean()  # global target mean
# Compute the number of values and the mean of each group
agg = merged_data.groupby('Species')['WnvPresent'].agg(['count', 'mean'])
counts = agg['count']
means = agg['mean']
# Compute the "smoothed" means
species_encoding = ((counts * means + m * mean) / (counts + m)).to_dict()
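The smoothed dictionary is then applied exactly like the plain one above; as a small sketch, unseen test categories can fall back to the global mean:
merged_data['SpeciesEncoded'] = merged_data['Species'].map(species_encoding)
# categories unseen during training get the global mean
test_data['SpeciesEncoded'] = test_data['Species'].map(species_encoding).fillna(mean)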
There are 2 open source Python libraries that offer this functionality off-the-shelf: Feature-engine and Category encoders.
Assuming that we have a training and a testing set...
With Feature-engine it would work as follows:
from feature_engine.encoding import MeanEncoder
# set up the encoder
encoder = MeanEncoder(variables=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
We find the replacement values in the encoding_dict_ attribute as follows:
encoder.encoding_dict_
With category encoders it works as follows:
from category_encoders.target_encoder import TargetEncoder
# set up the encoder
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'])
# fit the encoder - finds the mean target value per category
encoder.fit(X_train, X_train['WnvPresent'])
# transform data
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)
The replacement values can be found in the attribute mapping:
encoder.mapping
More details in the respective documentation:
MeanEncoder
TargetEncoder
Category encoders' TargetEncoder also offers smoothing, as suggested by @andrey-lukyanenko, out of the box.
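For example, a minimal sketch (the parameter values here are arbitrary):
from category_encoders.target_encoder import TargetEncoder
# min_samples_leaf and smoothing control how strongly small categories
# are shrunk towards the global target mean
encoder = TargetEncoder(cols=['Species', 'Block', 'Trap'],
                        min_samples_leaf=10, smoothing=1.0)
X_train_enc = encoder.fit_transform(X_train, X_train['WnvPresent'])
X_test_enc = encoder.transform(X_test)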
I would like to get a dataframe of important features. With the code below I got the shap_values, and I am not sure what the values mean. My df has 142 features and 67 experiments, but I got an array with roughly 2500 values.
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")
I have tried to store them in a df:
rf_resultX = pd.DataFrame(shap_values, columns = ['shap_values'])
but got: ValueError: Shape of passed values is (18, 142), indices imply (18, 1)
142 is the number of features.
18 - I have no idea where it comes from.
I believe it works as follows: the shap_values need to be averaged and then paired with the feature names, e.g. pd.DataFrame(feature_names, columns=['feature_names']).
Does anybody have experience with how to interpret shap_values?
At first I thought that the number of values was the number of features times the number of rows.
Combining the other two answers like this worked for me.
import numpy as np
import pandas as pd

feature_names = X_train.columns
rf_resultX = pd.DataFrame(shap_values, columns=feature_names)
vals = np.abs(rf_resultX.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)),
                               columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'],
                            ascending=False, inplace=True)
shap_importance.head()
shap_values have (num_rows, num_features) shape; if you want to convert it to dataframe, you should pass the list of feature names to the columns parameter: rf_resultX = pd.DataFrame(shap_values, columns = feature_names).
Each sample has its own shap value for each feature; the shap value tells you how much that feature has contributed to the prediction for that particular sample; this is called a local explanation. You could average shap values for each feature to get a feeling of global feature importance, but I'd suggest you take a look at the documentation since the shap package itself provides much more powerful visualizations/interpretations.
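For example, a quick sketch of a local explanation for a single sample (assuming shap_values with shape (num_rows, num_features) and feature_names as above):
import pandas as pd
# contribution of each feature to the prediction for the first sample,
# largest absolute contributions first
local = pd.Series(shap_values[0], index=feature_names)
print(local.abs().sort_values(ascending=False).head())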
From https://github.com/slundberg/shap/issues/632
vals = np.abs(shap_values.values).mean(0)
feature_names = train_x.columns
feature_importance = pd.DataFrame(list(zip(feature_names, vals)),
columns=['col_name','feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'],
ascending=False, inplace=True)
feature_importance.head()
I wrote a short function for this which also works for multi-class classifications. It expects the data as a pandas DataFrame, a list of shap value arrays with one array for each class, and optionally a list of columns for which you want the average shap values.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
import numpy as np
import pandas as pd

def shap_feature_ranking(data, shap_values, columns=None):
    # If no columns are given, take all columns of the dataframe
    if columns is None:
        columns = data.columns.tolist()
    # Get column locations for the desired columns in the given dataframe
    c_idxs = [data.columns.get_loc(column) for column in columns]
    if isinstance(shap_values, list):
        # shap_values is a list of arrays, i.e. one array per class:
        # compute the mean absolute shap value per class, then sum over classes
        means = [np.abs(shap_values[class_][:, c_idxs]).mean(axis=0)
                 for class_ in range(len(shap_values))]
        shap_means = np.sum(np.column_stack(means), 1)
    else:
        # Otherwise there is a single 2D array of shap values
        assert len(shap_values.shape) == 2, 'Expected two-dimensional shap values array.'
        shap_means = np.abs(shap_values).mean(axis=0)
    # Put into a dataframe, sort by mean shap value, and reset the index to get a ranking
    df_ranking = (pd.DataFrame({'feature': columns, 'mean_shap_value': shap_means})
                  .sort_values(by='mean_shap_value', ascending=False)
                  .reset_index(drop=True))
    df_ranking.index += 1
    return df_ranking
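For example, calling it on the values computed above gives a ranked dataframe:
ranking = shap_feature_ranking(X, shap_values)
print(ranking.head(10))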
For version 0.40.0 and later, where the explainer returns an Explanation object:
feature_names = shap_values.feature_names
shap_df = pd.DataFrame(shap_values.values, columns=feature_names)
vals = np.abs(shap_df.values).mean(0)
shap_importance = pd.DataFrame(list(zip(feature_names, vals)), columns=['col_name', 'feature_importance_vals'])
shap_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
I need to use Spark with Python, and I need to perform binary classification. After some research (I'm new to Spark) I found MultilayerPerceptronClassifier, but I don't understand some things.
What kind of type must labelCol and featuresCol be? Can they be a simple integer (0 or 1) and a vector (the output of a VectorAssembler)?
PySpark takes your feature columns and assembles them into a single column containing a vector of features. The label column is the target variable, which is binary in your case.
Here is an example of how the vector assembler works.
from pyspark.ml.feature import VectorAssembler
# The input data frame consists of four columns: id, hour, mobile, clicked.
dataset = spark.createDataFrame(
    [(0, 18, 1.0, 1.0)],
    ["id", "hour", "mobile", "clicked"])
# We take the first two features, hour and mobile, and create a vector of
# features. This column is named "features".
assembler = VectorAssembler(
    inputCols=["hour", "mobile"],
    outputCol="features")
# We use the assembler to transform our dataset into a "data" data frame,
# which is the same as dataset but with an additional column "features".
data = assembler.transform(dataset)
After this step, you can apply your prediction models.
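To tie this back to the question: labelCol can point at a plain 0/1 numeric column and featuresCol at the assembled vector column. A minimal sketch (the layer sizes are arbitrary, and "clicked" stands in as the binary label):
from pyspark.ml.classification import MultilayerPerceptronClassifier
# input layer size = number of features (2 here),
# output layer size = number of classes (2 for binary classification)
mlp = MultilayerPerceptronClassifier(labelCol="clicked", featuresCol="features",
                                     layers=[2, 4, 2], seed=42)
model = mlp.fit(data)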
I'm trying to implement k-means clustering in PySpark. I am using MNIST as my dataset, which has hundreds of columns with integer values.
After creating a data frame, when I try to create a features column to be used in the clustering, I don't know what to pass as the inputCols parameter for VectorAssembler. Below is my code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

sc = SparkContext('local')
spark = SparkSession(sc)
df = spark.read.csv('mnist_train.csv')
df.show()
df_feat = df.select(*(df[c].cast("float").alias(c) for c in df.columns[0:]))
df_feat.show()
vecAssembler = VectorAssembler(inputCols = ???????, outputCol = "features")
What should I put as parameter for inputCols for this large integer valued data that I am using?
The VectorAssembler needs a list of column names to create the feature vector, so for the MNIST dataset you can give it everything except the label column. For example:
# I assume that df_feat.columns[0] is the column which contains the labels
cols = df_feat.columns[1:]
vecAssembler = VectorAssembler(inputCols = cols, outputCol = "features")
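After that, the clustering itself might look like this (a sketch; k=10 assumes the ten MNIST digit classes):
from pyspark.ml.clustering import KMeans
df_features = vecAssembler.transform(df_feat)
kmeans = KMeans(k=10, featuresCol="features", seed=1)
model = kmeans.fit(df_features)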
I'm running a model using GLM (using ML in Spark 2.0) on data that has one categorical independent variable. I'm converting that column into dummy variables using StringIndexer and OneHotEncoder, then using VectorAssembler to combine it with a continuous independent variable into a column of sparse vectors.
If my column names are continuous and categorical where the first is a column of floats and the second is a column of strings denoting (in this case, 8) different categories:
string_indexer = StringIndexer(inputCol='categorical',
outputCol='categorical_index')
encoder = OneHotEncoder(inputCol ='categorical_index',
outputCol='categorical_vector')
assembler = VectorAssembler(inputCols=['continuous', 'categorical_vector'],
outputCol='indep_vars')
pipeline = Pipeline(stages=[string_indexer, encoder, assembler])
model = pipeline.fit(df)
df = model.transform(df)
Everything works fine to this point, and I run the model:
glm = GeneralizedLinearRegression(family='gaussian',
link='identity',
labelCol='dep_var',
featuresCol='indep_vars')
model = glm.fit(df)
model.coefficients
Which outputs:
DenseVector([8440.0573, 3729.449, 4388.9042, 2879.1802, 4613.7646, 5163.3233, 5186.6189, 5513.1392])
Which is great, because I can verify that these coefficients are essentially correct (via other sources). However, I haven't found a good way to link these coefficients to the original column names, which I need to do (I've simplified this model for SO; there's more involved.)
The relationship between column names and coefficients is broken by StringIndexer and OneHotEncoder. I've found one fairly slow way:
df[['categorical', 'categorical_index']].distinct()
Which gives me a small dataframe relating the string names to the numerical indices, which I think I could then relate back to the keys in the sparse vector? This is very clunky and slow, though, when you consider the scale of the data.
Is there a better way to do this?
For PySpark, here is the solution to map feature index to feature name:
First, train your model:
pipeline = Pipeline().setStages([label_stringIdx,assembler,classifier])
model = pipeline.fit(x)
Transform your data:
df_output = model.transform(x)
Extract the mapping between feature index and feature name. Merge numeric attributes and binary attributes into a single list.
numeric_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = df_output.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')
merge_list = numeric_metadata + binary_metadata
OUTPUT:
[{'name': 'variable_abc', 'idx': 0},
{'name': 'variable_azz', 'idx': 1},
{'name': 'variable_azze', 'idx': 2},
{'name': 'variable_azqs', 'idx': 3},
....
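A possible follow-up sketch (assuming the classifier stage exposes a coefficients vector, e.g. a logistic regression): pair each coefficient with its feature name via the idx field.
coefs = model.stages[-1].coefficients  # the classifier is the last pipeline stage
name_by_idx = {m['idx']: m['name'] for m in merge_list}
named_coefs = [(name_by_idx[i], float(c)) for i, c in enumerate(coefs)]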
I also came across the exact same problem and I've got your solution :)
This is based on the Scala version here:
How to map variable names to features after pipeline
# transform data
best_model = pipeline.fit(df)
best_pred = best_model.transform(df)
# extract features metadata
meta = [f.metadata
for f in best_pred.schema.fields
if f.name == 'features'][0]
# access feature name and index
features_name_ind = meta['ml_attr']['attrs']['numeric'] + \
meta['ml_attr']['attrs']['binary']
print(features_name_ind[:2])
# [{'name': 'feature_name_1', 'idx': 0}, {'name': 'feature_name_2', 'idx': 1}]
I didn't investigate the previous versions, but in Spark 2.4.3 it is possible to retrieve a lot of information about the features just by using the summary attribute of a GeneralizedLinearRegressionModel.
Printing the summary results in something like this:
Coefficients:
Feature Estimate Std Error T Value P Value
(Intercept) -0.1742 0.4298 -0.4053 0.6853
x1_enc_(-inf,5.5] -0.7781 0.3661 -2.1256 0.0335
x1_enc_(5.5,8.5] 0.1850 0.3736 0.4953 0.6204
x1_enc_(8.5,9.5] -0.3937 0.4324 -0.9106 0.3625
x45_enc_1-10-7-8-9 -0.5382 0.2718 -1.9801 0.0477
x45_enc_2-3-4-ND 0.5187 0.2811 1.8454 0.0650
x45_enc_5 -0.0456 0.3353 -0.1361 0.8917
x33_enc_1 0.6361 0.4043 1.5731 0.1157
x33_enc_10 0.0059 0.4083 0.0145 0.9884
x33_enc_2-3-4-8-ND 0.6121 0.1741 3.5152 0.0004
x102_enc_(-inf,4.5] 0.5315 0.1695 3.1354 0.0017
(Dispersion parameter for binomial family taken to be 1.0000)
Null deviance: 937.7397 on 666 degrees of freedom
Residual deviance: 858.8846 on 666 degrees of freedom
AIC: 880.8846
The Feature column can be constructed by accessing an internal Java object:
In [131]: glm.summary._call_java('featureNames')
Out[131]:
['x1_enc_(-inf,5.5]',
'x1_enc_(5.5,8.5]',
'x1_enc_(8.5,9.5]',
'x45_enc_1-10-7-8-9',
'x45_enc_2-3-4-ND',
'x45_enc_5',
'x33_enc_1',
'x33_enc_10',
'x33_enc_2-3-4-8-ND',
'x102_enc_(-inf,4.5]']
The Estimate column can be constructed by the following concatenation:
In [134]: [glm.intercept] + list(glm.coefficients)
Out[134]:
[-0.17419580191414719,
-0.7781490190325139,
0.1850214800764976,
-0.3936963366945294,
-0.5382255101657534,
0.5187453074755956,
-0.045649677050663987,
0.6360647167539958,
0.00593020879299306,
0.6121475986933201,
0.531510974697773]
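Since the two lists line up one-to-one once the intercept is prepended, they can simply be zipped to reproduce the Feature and Estimate columns of the summary:
names = ['(Intercept)'] + glm.summary._call_java('featureNames')
estimates = [glm.intercept] + list(glm.coefficients)
coef_table = list(zip(names, estimates))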
P.S.: The Spark source shows why the Feature column has to be retrieved through an internal Java object: featureNames is not exposed in the Python API.
Sorry, this seems to be a very late answer and maybe you have already figured it out, but anyway. I recently did the same implementation of StringIndexer, OneHotEncoder and VectorAssembler, and as far as I have understood, the following code will do what you are looking for.
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
categoricalColumns = ["one_categorical_variable"]
stages = []  # stages in the pipeline
for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol,
                                  outputCol=categoricalCol + "Index")
    # Use OneHotEncoder to convert categorical variables into binary SparseVectors
    encoder = OneHotEncoder(inputCol=stringIndexer.getOutputCol(),
                            outputCol=categoricalCol + "classVec")
    # Add the stages so that they will all be run at once later
    stages += [stringIndexer, encoder]
# Convert the label into label indices using StringIndexer
label_stringIdx = StringIndexer(inputCol="Service_Level", outputCol="label")
stages += [label_stringIdx]
# Transform all features into a vector using VectorAssembler
numericCols = ["continuous_variable"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]
# Creating a Pipeline for Training
pipeline = Pipeline(stages=stages)
# Running the feature transformations.
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)