Writing a custom predict method using MLFlow and pyspark - python

I am having trouble writing a custom predict method using MLFlow and pyspark (2.4.0). What I have so far is a custom transformer that changes the data into the format I need.
from pyspark.ml import Transformer
from pyspark.sql.functions import explode, split

class CustomGroupBy(Transformer):
    def __init__(self):
        super(CustomGroupBy, self).__init__()

    def _transform(self, dataset):
        # explode the comma-separated widgetid string into one row per widget
        df = dataset.select("userid", explode(split("widgetid", ",")).alias("widgetid"))
        return df
Then I built a custom estimator to run one of the pyspark machine learning algorithms
from pyspark.ml import Estimator
from pyspark.ml.fpm import FPGrowth
from pyspark.ml.param.shared import HasInputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

class PipelineFPGrowth(Estimator, HasInputCol, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, inputCol=None, minSupport=0.005, minConfidence=0.01):
        super(PipelineFPGrowth, self).__init__()
        # store the column name passed to the constructor
        if inputCol is not None:
            self._set(inputCol=inputCol)
        self.minSupport = minSupport
        self.minConfidence = minConfidence

    def setInputCol(self, value):
        return self._set(inputCol=value)

    def _fit(self, dataset):
        c = self.getInputCol()
        fpgrowth = FPGrowth(itemsCol=c, minSupport=self.minSupport,
                            minConfidence=self.minConfidence)
        model = fpgrowth.fit(dataset)
        return model
This runs in a Spark ML Pipeline.
pipeline = Pipeline(stages=[CustomGroupBy(), PipelineFPGrowth(inputCol="widgetid")]).fit(df)
This all works. If I create a new pyspark dataframe with new data to predict on, I get predictions.
newDF = spark.createDataFrame([(123456,['123ABC', '789JSF'])], ["userid", "widgetid"])
pipeline.stages[1].transform(newDF).show(3, False)
# How to access the frequent itemsets.
pipeline.stages[1].freqItemsets.show(3, False)
Where I run into problems is writing a custom predict. I need to append the frequent itemsets that FPGrowth generates to the end of the predictions. I have written the logic for that, but I am having a hard time figuring out how to put it into a custom method. I tried adding it to my custom estimator, but that didn't work. Then I wrote a separate class to take in the returned model and produce the extended predictions. That was also unsuccessful.
Eventually I need to log and save the model so I can Dockerize it, which means I will need a custom flavor and to use mlflow.pyfunc. Does anyone have a hint on how to extend the predict method and then log and save the model?
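For reference, the direction I have been experimenting with is wrapping everything in an mlflow.pyfunc.PythonModel. This is only a rough sketch: rules_pdf and freq_items_pdf are hypothetical pandas DataFrames collected from the fitted model's associationRules and freqItemsets, and the prediction logic is a simplified stand-in for what FPGrowth's transform does.

import mlflow.pyfunc
import pandas as pd

class FPGrowthWrapper(mlflow.pyfunc.PythonModel):
    # rules_pdf / freq_items_pdf are hypothetical pandas versions of
    # model.associationRules and model.freqItemsets (collected with .toPandas()).
    def __init__(self, rules_pdf, freq_items_pdf):
        self.rules_pdf = rules_pdf
        self.freq_items_pdf = freq_items_pdf

    def predict(self, context, model_input):
        # model_input: pandas DataFrame with a "widgetid" column holding lists of items
        def recommend(items):
            items = set(items)
            hits = self.rules_pdf[self.rules_pdf["antecedent"].apply(
                lambda a: set(a).issubset(items))]
            return sorted(set(hits["consequent"].explode().dropna()) - items)

        out = model_input.copy()
        out["prediction"] = out["widgetid"].apply(recommend)
        # append the frequent itemsets to every prediction row
        out["freqItemsets"] = [self.freq_items_pdf.to_dict("records")] * len(out)
        return out

# Logging/saving would then go through the pyfunc flavor, e.g.:
# mlflow.pyfunc.log_model("fpgrowth_pyfunc",
#                         python_model=FPGrowthWrapper(rules_pdf, freq_items_pdf))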

Related

Is there a way to embed non-TF functions into a tf.Keras model graph as a SavedModel signature?

I want to add preprocessing functions and methods to the model graph as a SavedModel signature.
example:
# suppose we have a keras model
# ...

# defining the function I want to add to the model graph
@tf.function
def process(model, img_path):
    # do some preprocessing using different libs. and modules...
    outputs = {"preds": model.predict(preprocessed_img)}
    return outputs

# saving the model with a custom signature
tf.saved_model.save(new_model, dst_path,
                    signatures={"process": process})
Or we can use tf.Module here. However, the problem is that I cannot embed custom functions into the saved model graph.
Is there any way to do that?
I think you slightly misunderstand the purpose of tf.saved_model.save in TensorFlow.
As per the documentation, the intent is to have a method which serialises the model's graph so that it can be loaded with load_model afterwards.
The model returned by load_model is a tf.Module with all its methods and attributes. What you want instead is to serialise the prediction pipeline.
To be honest, I'm not aware of a good way to do that. What you can do, however, is serialise your preprocessing parameters with a different method, for example pickle (or whatever your framework provides), and write a class on top of it that does the following:
class MyModel:
    def __init__(self, model_path, preprocessing_path):
        self.model = load_model(model_path)
        self.preprocessing = load_preprocessing(preprocessing_path)

    def predict(self, img_path):
        return self.model.predict(self.preprocessing(img_path))
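A hypothetical way to fill in those placeholders (assuming a Keras SavedModel plus a pickled preprocessing callable; the paths and helper names here are made up):

import pickle
import tensorflow as tf

def load_model(path):
    # assumption: the model was exported with model.save(path)
    return tf.keras.models.load_model(path)

def load_preprocessing(path):
    # assumption: the preprocessing pipeline was pickled as a single callable
    with open(path, "rb") as f:
        return pickle.load(f)

serving_model = MyModel("export/model", "export/preprocessing.pkl")
# preds = serving_model.predict("images/sample.jpg")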

Customizing Spacy's Text Categorizer

I am trying to train a spaCy model with a small dataset in spaCy 2.2. It is overfitting, so I want to customize the architecture of the TextCategorizer. I referred to this post on GitHub:
https://github.com/explosion/spaCy/issues/3320
However, I am unable to get my custom categorizer to work.
from spacy.pipeline import TextCategorizer
from thinc.api import layerize
from spacy.language import Language

class StupidTextCategorizer(TextCategorizer):
    name = 'stupid_textcat'

    @classmethod
    def Model(cls, nr_class, **cfg):
        return create_dummy_model(nr_class, cfg.get('preferred_class', 0))

def create_dummy_model(nr_class, preferred_class):
    """Create a Thinc model that always predicts the same class."""
    def dummy_model(docs, drop=0.):
        scores = model.ops.allocate((len(docs), nr_class))
        scores[:, preferred_class] = 1.0
        return scores
    model = layerize(dummy_model)
    return model
However, when I try to pass it to my training script, it throws this error, which I can't seem to understand:
"[E002] Can't find factory for 'stupid_textcat'. This usually happens when spaCy calls `nlp.create_pipe` with a component name that's not built in - for example, when constructing the pipeline from a model's meta.json. If you're using a custom component, you can write to `Language.factories['stupid_textcat']` or remove it from the model meta and add it via `nlp.add_pipe` instead."
PS: I am still learning spaCy, but I can't find any helpful documentation or tutorials for the above.
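For what it's worth, the error message itself points at one possible fix: registering the component as a factory so spaCy can construct it by name. A minimal sketch for spaCy 2.x (the constructor call is an assumption based on TextCategorizer's signature):

from spacy.language import Language

# Register the custom component under the name spaCy is looking for,
# so that nlp.create_pipe("stupid_textcat") can resolve it.
Language.factories["stupid_textcat"] = (
    lambda nlp, **cfg: StupidTextCategorizer(nlp.vocab, **cfg)
)

# Then add it to the pipeline by name:
# textcat = nlp.create_pipe("stupid_textcat")
# nlp.add_pipe(textcat)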

Python Vetiver model - use alternative prediction method

I'm trying to use Vetiver to deploy an isolation forest model (for anomaly detection) to an API endpoint.
All is going well by adapting the example from the Vetiver documentation.
However, when deployed, the endpoint uses the model.predict() method by default (which returns 1 for normal or -1 for anomaly).
I want the model to return a score between 0 and 1 as given by the model.score_samples() method.
Does anybody know how I can configure the Vetiver endpoint such that it uses .score_samples() rather than .predict() for scoring?
Thanks
vetiver.predict() primarily acts as a router to an API endpoint, so it does not have to have the same behavior as model.predict(). You can overload what function vetiver.predict() uses on your model by defining a custom handler.
In your case, an implementation might look something like below.
from vetiver.handlers.base import VetiverHandler

class ScoreSamplesHandler(VetiverHandler):
    def __init__(self, model, ptype_data):
        super().__init__(model, ptype_data)

    def handler_predict(self, input_data, check_ptype):
        """Define how to make predictions from your model."""
        prediction = self.model.score_samples(input_data)
        return prediction
Initialize your custom handler and pass it into VetiverModel().
custom_model = ScoreSamplesHandler(model, ptype_data)
VetiverModel(custom_model, "model_name")
Then, when you use the vetiver.predict() function, it will return the values for model.score_samples(), rather than model.predict().
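For completeness, serving the wrapped model might look roughly like this (the VetiverAPI arguments and run() options are assumptions here, so check them against the vetiver docs for your version):

from vetiver import VetiverAPI, VetiverModel

v_model = VetiverModel(custom_model, "model_name")
api = VetiverAPI(v_model)   # assumption: defaults are fine for a quick local test
api.run(port=8080)          # POSTing rows to the endpoint should now return score_samples values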

Saving data with DataCatalog

I was looking at the iris project example provided by Kedro. Apart from logging the accuracy, I also wanted to save the predictions and test_y as a CSV.
This is the example node provided by kedro.
def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
I added the following to save the data.
data = pd.DataFrame({"target": target , "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)
This works as intended; however, my question is: is this the Kedro way of doing things? Can I provide the data_set in catalog.yml and later save data to it? If so, how do I access the data_set from catalog.yml inside a node?
Is there a way to save data without creating a dataset inside a node like data_set = CSVDataSet(filepath="data/test.csv")? I want this in catalog.yml, if possible and if it follows Kedro convention.
Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.
Your report_accuracy method does need to be tweaked to return the DataFrame instead of None.
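That tweak might look like this (a sketch based on the snippets above; only the return value and its type hint change):

import logging
import numpy as np
import pandas as pd

def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
    """Node that logs accuracy and returns the predictions for Kedro to save."""
    target = np.argmax(test_y.to_numpy(), axis=1)
    accuracy = np.sum(predictions == target) / target.shape[0]
    logging.getLogger(__name__).info("Model accuracy on test set: %0.2f%%", accuracy * 100)
    # Returning the DataFrame lets Kedro persist it via the dataset mapped
    # to this node's output in catalog.yml (no manual CSVDataSet needed).
    return pd.DataFrame({"target": target, "prediction": predictions})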
Your node needs to be defined as such:
node(
func=report_accuracy,
inputs='dataset_a',
outputs='dataset_b'
)
Kedro then looks at your catalog and will load/save dataset_a and dataset_b as required:
dataset_a:
  type: pandas.CSVDataSet
  filepath: xxxx.csv

dataset_b:
  type: pandas.ParquetDataSet
  filepath: yyyy.pq
As you run the node/pipeline, Kedro will handle the load/save operations for you. You also don't need to save every dataset if it's only used mid-way through a pipeline; see MemoryDataSet in the Kedro documentation for details.

Oversampling with Cross Validation in PySpark Pipeline

I'm working on a PySpark binary classification Pipeline where I want to perform cross-validation with an oversampling stage (my dataset is not balanced). The issue is that the oversampling stage is also executed on the test dataset.
The pipeline:
pipeline = Pipeline(stages=[cast_and_fill_na, smote, vec_assembler, rf])
smote is the stage I want to skip when transforming the test dataset.
I took a look at the Spark documentation and source code; there is no way to skip a stage in a PipelineModel. My solution was to override the _transform method of the original class in order to skip the oversampling stage.
This works fine when fitting the pipeline in my own code. I use this:
pipeline_model.__class__ = CustomPipelineModel
CustomPipelineModel is a class that inherits from pyspark.ml.PipelineModel and overrides the _transform method.
But as the CrossValidator uses the original implementation of the PipelineModel class, I can't use my custom method.
evaluator = BinaryClassificationEvaluator(labelCol=target)
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=10,
                          parallelism=1)
cvModel = crossval.fit(train_set)
What is the best way to skip the oversampling stage when using CrossValidator?
I started to look into the source code of the _fit method of pyspark.ml.tuning.CrossValidator, thinking about overriding it too ... The second option is to perform oversampling on the training dataset beforehand, but this would introduce bias into the models during the cross-validation process.
I came up with a workaround for this problem.
In my SMOTEOversampler class (the smote stage is an instance of it), I added an attribute named skip_transform, which is set to None when instantiating a SMOTEOversampler object. In the _transform method, I set this attribute to True, so the next call to _transform (which happens in the test phase) is skipped. Here is a code snippet.
def __init__(self, ...):
    self.skip_transform = None

def _transform(self, df):
    if self.skip_transform:
        return df
    else:
        # Execute oversampling, then flag the stage so the next
        # _transform call (on the test set) is skipped
        self.skip_transform = True
