I'm trying to build a simple preprocessing Pipeline for a clustering model that uses K-Means and export it to PMML format.
I managed to make the pipeline work, but I can't export it to PMML.
I have divided the pipeline into two steps: handling numerical data and handling categorical data.
import sklearn_pandas
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml import PMMLPipeline

numeric_features = ['column1', 'column2', 'column3']
categorical_features = ['column4', 'column5']

# default=None keeps the unselected columns, so the second mapper can still see them
num_mapper = sklearn_pandas.DataFrameMapper(
    [([numeric_column], SimpleImputer(strategy='median')) for numeric_column in numeric_features],
    df_out=True, default=None)
categorical_mapper = sklearn_pandas.DataFrameMapper(
    [([categorical_column], LabelBinarizer()) for categorical_column in categorical_features],
    df_out=True, default=None)

pipeline = PMMLPipeline(steps=[
    ('num_mapper', num_mapper),
    ('cat_mapper', categorical_mapper)
])
Note that I have set default to None in the first DataFrameMapper, since it makes the output dataframe preserve the columns that haven't been selected (columns that will be needed by the second mapper).
These workarounds work fine; the problem comes later, when I try to export the pipeline to PMML:
sklearn2pmml.sklearn2pmml(pipeline,'mypath')
This line of code yields the following error
java.lang.IllegalArgumentException: Attribute 'sklearn_pandas.dataframe_mapper.DataFrameMapper.default' has a missing (None/null) value
at org.jpmml.sklearn.PyClassDict.get(PyClassDict.java:46)
at org.jpmml.sklearn.PyClassDict.getObject(PyClassDict.java:97)
I know this error is probably caused by the fact that I'm setting default to None in both DataFrameMappers, but it was the only workaround I found to preserve the columns needed by the second mapper.
Is there any other workaround I could use? I know I could do all the transformations in the first DataFrameMapper, but I don't like that idea since I want to keep the numerical transformations separate from the categorical ones.
I recently got to grips with FeatureUnion and realized it could be an elegant solution.
Create the same mappers:
import sklearn2pmml
from sklearn.pipeline import FeatureUnion

numeric_features = ['column1', 'column2', 'column3']
categorical_features = ['column4', 'column5']

# Each mapper now only handles its own columns, so df_out/default are not needed
num_mapper = sklearn_pandas.DataFrameMapper(
    [([numeric_column], SimpleImputer(strategy='median')) for numeric_column in numeric_features])
categorical_mapper = sklearn_pandas.DataFrameMapper(
    [([categorical_column], LabelBinarizer()) for categorical_column in categorical_features])

preprocessing = FeatureUnion(transformer_list=[
    ('num_mapper', num_mapper),
    ('cat_mapper', categorical_mapper)])

pipeline = PMMLPipeline(steps=[
    ('preprocessing', preprocessing)
])
sklearn2pmml.sklearn2pmml(pipeline, 'mypath')
With this workaround I even managed to avoid using the df_out and default flags in the mapper calls.
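For completeness, here is a minimal sketch of how the clustering step and the export could be attached to this preprocessing; the KMeans parameters and the output path are placeholders rather than values from the original setup:

from sklearn.cluster import KMeans

pipeline = PMMLPipeline(steps=[
    ('preprocessing', preprocessing),
    ('clustering', KMeans(n_clusters=3, random_state=0))  # hypothetical cluster count
])
pipeline.fit(df[numeric_features + categorical_features])  # df is the training dataframe
sklearn2pmml.sklearn2pmml(pipeline, 'kmeans_pipeline.pmml')  # hypothetical output path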
I have a heterogeneous dataframe (note: the dataset being used is the IEEE Fraud Detection one from Kaggle).
I have used ColumnTransformer to implement multiple transformations:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

# merging transformations into one pipeline with ColumnTransformer
process_pipe = ColumnTransformer(
[
(
"most_frequent_imputer",
PandasSimpleImputer(strategy="most_frequent"),
impute_freq
),
(
"aggregate_high_cardinality_features",
AggregateCategorical(high_cardinality_cats),
high_cardinality_cats
),
(
"get_categorical_codes",
FunctionTransformer(convert_to_category),
cat_codes_cols
),
(
"mean_imputer",
PandasSimpleImputer(strategy="mean"),
continuous_features
)
],
remainder="passthrough",
verbose_feature_names_out=False
)
Note: PandasSimpleImputer is a wrapper class I created so that sklearn's SimpleImputer returns a pandas dataframe. AggregateCategorical is a class I created to reduce the cardinality of a high-cardinality feature.
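The wrapper itself isn't shown; a minimal sketch of what such a PandasSimpleImputer could look like (this body is an assumption, not the poster's actual code, and it assumes no column is entirely missing):

import pandas as pd
from sklearn.impute import SimpleImputer

class PandasSimpleImputer(SimpleImputer):
    """SimpleImputer that returns a pandas DataFrame with the input's column names."""

    def transform(self, X):
        # Delegate to SimpleImputer, then re-wrap the ndarray as a DataFrame
        return pd.DataFrame(super().transform(X), columns=X.columns, index=X.index)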
However, the ColumnTransformer creates new columns for each step, with the name of the transformation prefixed at the front. The problem is that I have features which need multiple transformations applied in different steps of the pipeline: after one transformation is applied to a feature, I want another transformation to be applied to the already-transformed feature (e.g. impute missing values --> get category codes).
At the moment, my pipeline imputes the missing values, then references the original feature (not the imputed one) and gets the codes from that. This is not the behaviour I want.
The most straightforward solution I could think of is to refer to the multiply-transformed features by their prefixed names, so that the later transformations are applied to the already-transformed columns, and then delete the columns I do not want from the dataframe. That involves a lot of manual work. Is there a faster way?
Similar question but different scenario -> How to apply multiple transforms to the same columns using ColumnTransformer in scikit-learn
If I understand your question properly, you want to perform multiple stacked transformations on a column, but at the moment you get multiple outputs for that column, one for each transform applied individually.
To get the behaviour you want, I think you need to write a pipeline for each combination of transformations, just like the accepted answer in the similar question you linked to. Your column transformer will then consist of multiple pipelines, one per combination of transforms, for example as sketched below.
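Here is a minimal sketch of that idea, reusing the step names from the question; impute_then_code is a hypothetical list of the columns that need both transformations, and convert_to_category and continuous_features are assumed to be defined as in the question:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Columns that need impute -> category codes are listed once, inside a nested
# pipeline, so the second step sees the output of the first
impute_then_code = ["C1", "C2", "C3"]  # hypothetical column names

process_pipe = ColumnTransformer(
    [
        ("impute_and_code", Pipeline([
            ("most_frequent_imputer", SimpleImputer(strategy="most_frequent")),
            ("get_categorical_codes", FunctionTransformer(convert_to_category)),
        ]), impute_then_code),
        ("mean_imputer", SimpleImputer(strategy="mean"), continuous_features),
    ],
    remainder="passthrough",
)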
This is a perfectly acceptable solution, but you can quickly end up with a lot of code to maintain. Because I kept running into this problem, I wrote a package called skdag to try to make this sort of task simpler. You can read the full documentation at https://skdag.readthedocs.io/, but here's a quick demo based on your question:
from skdag import DAGBuilder
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
dag = (
DAGBuilder(infer_dataframe=True)
.add_step("input", "passthrough")
.add_step(
"cat_imputer",
SimpleImputer(strategy="most_frequent"),
deps={"input": ["C1", "C2", "C3"]}
)
.add_step(
"get_categorical_codes",
FunctionTransformer(convert_to_category),
deps={"cat_imputer": ["C1", "C2", "C3"], "input": ["C4"]}
)
.add_step(
"mean_imputer",
SimpleImputer(strategy="mean"),
deps={"input": ["N1", "N2", "N3", "N4"]}
)
.add_step(
"pca",
PCA(n_components=2),
deps=["mean_imputer"]
)
.add_step(
"numerics",
"passthrough",
deps={"pca": ["pca1", "pca2"], "input": ["N5", "N6"]}
)
.add_step(
"output",
"passthrough",
deps=["get_categorical_codes", "numerics"]
)
.make_dag()
)
dag.fit_transform(X, y)
Note that there's no need for any custom dataframe wrappers any more. skdag deals with it all natively for you if you set the infer_dataframe option.
You could also add a predictor (or multiple predictors) to the end of the graph and then call fit_predict instead if you like.
This is quite a big, complex workflow now so it may become hard to keep track of from looking at the code alone. If you want to visualise the graph, you can call dag.show() in an interactive environment like Jupyter Notebooks, or dag.draw() to produce an image or text file:
dag.show()
This hopefully makes it easier to understand the workflow. We have four categorical features, three of which are run through a most-frequent imputer first; then all four go through the function transformer to convert them to codes. We have four numerical features which are mean-imputed and then run through PCA, and two more numerical features that are passed through without any modification.
I would like to take the logarithm of specific columns of my dataframe.
I created a new transformer object:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log10, validate=True)
and it works nicely on one specific column of my dataframe:
log_transformer.transform(df['_column'])
One could also overwrite the specific column of the original dataframe like:
df[['_column']]=log_transformer.transform(df[['_column']])
However, this operation then changes the original dataframe and wouldn't be useful in a pipeline.
When I try to include this transformer object into ColumnTransformer, I get an error message:
columnTransformer = ColumnTransformer([('log_transform', log_transformer.transform(), [0, 5])], remainder='passthrough')
How should I pass the custom-defined transformer object to ColumnTransformer? (The same syntax works nicely for built-in transformers, as suggested in this article: https://towardsdatascience.com/columntransformer-in-scikit-for-labelencoding-and-onehotencoding-in-machine-learning-c6255952731b)
Thank you for your help!
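For reference, a minimal sketch of how a FunctionTransformer is normally placed inside a ColumnTransformer: the transformer object itself goes into the tuple, not the result of calling its transform method, and the columns are given as a list (the column indices below are placeholders):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log10, validate=True)

# Pass log_transformer, not log_transformer.transform(); [0, 5] are placeholder indices
columnTransformer = ColumnTransformer(
    [('log_transform', log_transformer, [0, 5])],
    remainder='passthrough')
# transformed = columnTransformer.fit_transform(df)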
When creating a sklearn2pmml pipeline, I use the following code to do a custom mapping and then use PMMLLabelBinarizer to create the dummy variables. The thing is, I want to avoid the dummy variable trap. Is there a way to do that with PMMLPipeline while avoiding any custom FunctionTransformer functions? (I eventually want to convert the pipeline to a PMML file.)
I couldn't find a way to drop the last column using a readily available PMML-compatible function. (DataFrameMapper is a sklearn_pandas class.)
DataFrameMapper([
    ('Merchant', [
        CategoricalDomain(missing_values=[None, np.nan]),
        LookupTransformer(map_dict, 'ZZ'),
        PMMLLabelBinarizer()
    ])
])
You can use sklearn.compose.ColumnTransformer to limit the number of columns; the idea is to specify ColumnTransformer.remainder = "drop".
For example, if your pipeline starts with a DataFrameMapper that is producing a 5-column matrix, but you want to retain only the first four columns:
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([...])),
    ("slicer", ColumnTransformer([
        ("keep", "passthrough", [0, 1, 2, 3])
    ], remainder = "drop")),
    ("estimator", ...)
])
Support for ColumnTransformer is available starting from SkLearn2PMML version 0.42.0, so you might need to upgrade first.
I am doing text classification and I have built a model using the pipeline method. I have created the RF classifier object and set the features column and the label column that I obtained in my previous steps (steps not shown).
I am fitting my training data, which I have created as a dataframe with the columns "label" and "sentence". The labels are different question types. The DF looks like:
training = sqlContext.createDataFrame([
("DESC:manner", "How did serfdom develop in and then leave Russia ?"),
("DESC:def", "What does '' extended definition '' mean and how would one a paper on it ? "),
("HUM:ind", " Who was The Pride of the Yankees ?")
], ["label", "sentence"])
The code for the pipeline is,
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[pos, tokenizer, hashingTF, idf, indexer,rf])
model = pipeline.fit(training)
So now I can get the predictions by using the following code,
prediction = model.transform(test)
selected = prediction.select("sentence","prediction")
I can do the select() operation to get the predicted labels.
But for my use case, there is a stream of data coming in from Kinesis, and it contains only sentences (plain strings). For each sentence, I have to predict the label. However, I can't find any predict() function when I do dir(model). Why is there no predict() method for the RandomForestClassifier obtained from pyspark.ml, and if there isn't one, how can I accomplish my use case? What ML algorithm should I use if not RF? Am I doing anything wrong? Any help is appreciated. My environment is Spark 1.6 and Python 2.7.
So I figured out that there is no predict() method that can be used. Instead, we need to use the transform() method to make predictions. Just leave out the label column and create a new dataframe. For example, in my case, I did:
pred = sqlContext.createDataFrame([("What are liver enzymes ?" ,)], ["sentence"])
prediction = model.transform(pred)
And then we can read the prediction using the select() method. At least for now, this solution has worked for me. Please let me know if there is any correction or a better approach.
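As a side note, the prediction column produced by the pipeline holds the StringIndexer indices rather than the original label strings. A hedged sketch of mapping them back with IndexToString, assuming the indexer is the second-to-last stage of the fitted pipeline (as in the stage lists above) and that your PySpark version exposes the labels attribute on the fitted StringIndexerModel:

from pyspark.ml.feature import IndexToString

# The fitted StringIndexerModel sits right before the classifier in the stage list
indexer_model = model.stages[-2]

# Map the numeric prediction back to the original string label
label_converter = IndexToString(inputCol="prediction",
                                outputCol="predictedLabel",
                                labels=indexer_model.labels)
labeled = label_converter.transform(prediction)
labeled.select("sentence", "predictedLabel").show()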
I am also working on the same problem. Can you tell me what "pos" (part of speech) is in your pipeline stages and how you are getting it? Also, how are you preparing the test data? Below is my code:
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(training)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="idxlabel")
rf = RandomForestClassifier().setFeaturesCol("features").setLabelCol("idxlabel")
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, indexer, rf])
model = pipeline.fit(training)
Please tell me if I am doing anything wrong.
I am trying to use Patsy (with sklearn and pandas) to create a simple regression model. The R-style formula creation is a major draw.
My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:
from patsy import dmatrices, build_design_matrices

df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line throws the following error:
patsy.PatsyError: Error converting data to categorical: observation
with value 'Kolkata' does not match any of the expected levels
I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
Is there any way I can make this work with Patsy?
The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.
One way is by using the levels= argument to C(...), like:
# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)
Another option if you're using pandas's default categorical support is to record the set of possible values when you set up your data frame; if patsy detects that the object you've passed it is a pandas categorical then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.
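A minimal sketch of that second option (the column and level names are placeholders):

import pandas as pd
from patsy import dmatrices

# Declare the full set of levels up front; patsy then uses the categories
# attribute instead of inferring the levels from the data it happens to see
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
df_train["Cities"] = pd.Categorical(df_train["Cities"], categories=all_cities)

y, X = dmatrices("y ~ Cities", data=df_train, return_type="dataframe")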
I ran into a similar problem and I built the design matrices prior to splitting the data.
df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
train_test_split(df_X, df_Y, test_size=test_size)
Then as an example of applying a fit:
model = smf.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)
Technically I haven't built a test case, but I haven't run into the "Error converting data to categorical" error again since implementing the above.