Oversampling with Cross Validation in PySpark Pipeline - python

I'm working on a PySpark binary classification Pipeline where I want to perform cross-validation with an oversampling stage (my dataset is not balanced). The issue is that the oversampling stage is also executed on the test dataset.
The pipeline:
pipeline = Pipeline(stages=[cast_and_fill_na, smote, vec_assembler, rf])
smote is the stage I want to skip when transforming the test dataset.
I took a look at the Spark documentation and source code; there's no way to skip a stage in a PipelineModel. My solution was to override the _transform method of the original class in order to skip the oversampling stage.
This works fine when fitting the pipeline in my source code. I use this:
pipeline_model.__class__ = CustomPipelineModel
CustomPipelineModel is a class that inherits from pyspark.ml.PipelineModel and overrides the _transform method.
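For reference, an override along these lines is what I mean (a simplified sketch; the name of the oversampling stage's class is assumed, and PipelineModel._transform normally applies every stage in order):

from pyspark.ml import PipelineModel

class CustomPipelineModel(PipelineModel):
    def _transform(self, dataset):
        # Apply every fitted stage except the oversampling one
        for stage in self.stages:
            if isinstance(stage, SMOTEOversampler):  # assumed class of the smote stage
                continue
            dataset = stage.transform(dataset)
        return dataset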
But as the CrossValidator uses the original implementation of the PipelineModel class, I can't use my custom method.
evaluator = BinaryClassificationEvaluator(labelCol=target)
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=10,
                          parallelism=1)
cvModel = crossval.fit(train_set)
What is the best way to skip the oversampling stage when using CrossValidator?
I started to look into the source code of the _fit method of pyspark.ml.tuning.CrossValidator, thinking about overriding it too ... The second option is to perform oversampling on the whole training dataset up front, but this would introduce bias into the models during the cross-validation process.

I came up with a workaround for this problem.
In my SMOTEOversampler class (the smote stage is an instance of it), I added an attribute named skip_transform, which is set to None when a SMOTEOversampler object is instantiated. In the _transform method, I set this attribute to True, so the next call to _transform (which happens in the test phase) is skipped. Here is a code snippet.
def __init__(self, ...):
    self.skip_transform = None

def _transform(self, df):
    if self.skip_transform:
        # Second call (test data): pass the DataFrame through untouched
        return df
    else:
        # First call (training data): execute oversampling and return the
        # oversampled DataFrame, flagging the stage so the next call is skipped
        self.skip_transform = True
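With this flag in place, each fold behaves roughly like the sketch below (a simplified illustration of what the CrossValidator drives, not its actual code):

model = pipeline.fit(train_fold)           # smote._transform runs here and flips skip_transform
predictions = model.transform(test_fold)   # smote passes the held-out fold through untouched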

Related

Add transform method to sklearn predictor to use it as an intermediate step of a sklearn pipeline

I have a sklearn Pipeline with several transformers and a LinearRegression predictor at the end. I want to add more custom transformers to the end of the pipeline and a final custom predictor, but the LinearRegression predictor doesn't have the transform method so it gives an error when I call the full pipeline's predict method.
I thought about adding the transform method to the LinearRegression class using inheritance and doing something like:
class NewModel(LinearRegression):
    def transform(self, X):
        X["prediction"] = self.predict(X)
        return X
but I want to know if there is a better way to solve the problem so I can use any type of sklearn predictor in the middle of the pipeline. For instance, I would like a new class that takes a sklearn predictor as an argument, adds a transform method that calls the predictor's predict method as in the example above, and appends the new column to the dataframe.
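One possible generic wrapper looks like the sketch below (the class and column names are made up for illustration): a small transformer that holds any sklearn predictor, fits it, and appends its predictions as a new column.

from sklearn.base import BaseEstimator, TransformerMixin

class PredictorAsTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, estimator, column="prediction"):
        self.estimator = estimator
        self.column = column

    def fit(self, X, y=None):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        # Append the wrapped predictor's output as a new column
        X = X.copy()
        X[self.column] = self.estimator.predict(X)
        return X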

Python Vetiver model - use alternative prediction method

I'm trying to use Vetiver to deploy an isolation forest model (for anomaly detection) to an API endpoint.
All is going well by adapting the example here.
However, when deployed, the endpoint uses the model.predict() method by default (which returns 1 for normal or -1 for anomaly).
I want the model to return a score between 0 and 1 as given by the model.score_samples() method.
Does anybody know how I can configure the Vetiver endpoint such that it uses .score_samples() rather than .predict() for scoring?
Thanks
vetiver.predict() primarily acts as a router to an API endpoint, so it does not have to have the same behavior as model.predict(). You can overload what function vetiver.predict() uses on your model by defining a custom handler.
In your case, an implementation might look something like below.
from vetiver.handlers.base import VetiverHandler

class ScoreSamplesHandler(VetiverHandler):
    def __init__(self, model, ptype_data):
        super().__init__(model, ptype_data)

    def handler_predict(self, input_data, check_ptype):
        """
        Define how to make predictions from your model
        """
        prediction = self.model.score_samples(input_data)
        return prediction
Initialize your custom handler and pass it into VetiverModel().
custom_model = ScoreSamplesHandler(model, ptype_data)
VetiverModel(custom_model, "model_name")
Then, when you use the vetiver.predict() function, it will return the values for model.score_samples(), rather than model.predict().

Saving data with DataCatalog

I was looking at iris project example provided by kedro. Apart from logging the accuracy I also wanted to save the predictions and test_y as a csv.
This is the example node provided by kedro.
def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
I added the following to save the data.
data = pd.DataFrame({"target": target, "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)
This works as intended; however, my question is: is this the Kedro way of doing things? Can I declare the data_set in catalog.yml and later save data to it? If so, how do I access the data_set from catalog.yml inside a node?
Is there a way to save data without creating a dataset inside a node like data_set = CSVDataSet(filepath="data/test.csv")? I want this in catalog.yml, if possible and if it follows Kedro conventions.
Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.
Your report_accuracy method does need to be tweaked to return the DataFrame instead of None.
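For example, the node function might be tweaked along these lines (a sketch based on the question's code; the column names are assumed):

import logging
import numpy as np
import pandas as pd

def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
    target = np.argmax(test_y.to_numpy(), axis=1)
    accuracy = np.sum(predictions == target) / target.shape[0]
    logging.getLogger(__name__).info("Model accuracy on test set: %0.2f%%", accuracy * 100)
    # Returning the DataFrame lets Kedro persist it through the catalog entry
    return pd.DataFrame({"target": target, "prediction": predictions})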
Your node needs to be defined as such:
node(
    func=report_accuracy,
    inputs='dataset_a',
    outputs='dataset_b',
)
Kedro then looks at your catalog and will load/save dataset_a and dataset_b as required:
dataset_a:
  type: pandas.CSVDataSet
  filepath: xxxx.csv

dataset_b:
  type: pandas.ParquetDataSet
  filepath: yyyy.pq
As you run the node/pipeline, Kedro will handle the load/save operations for you. You also don't need to save every dataset if it's only used mid-way through a pipeline; you can read about MemoryDataSets here.

Writing a custom predict method using MLFlow and pyspark

I am having trouble writing a custom predict method using MLFlow and pyspark (2.4.0). What I have so far is a custom transformer that changes the data into the format I need.
class CustomGroupBy(Transformer):
    def __init__(self):
        pass

    def _transform(self, dataset):
        df = dataset.select("userid", explode(split("widgetid", ",")).alias("widgetid"))
        return df
Then I built a custom estimator to run one of the pyspark machine learning algorithms
class PipelineFPGrowth(Estimator, HasInputCol, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, inputCol=None, minSupport=0.005, minConfidence=0.01):
        super(PipelineFPGrowth, self).__init__()
        self.minSupport = minSupport
        self.minConfidence = minConfidence

    def setInputCol(self, value):
        return self._set(inputCol=value)

    def _fit(self, dataset):
        c = self.getInputCol()
        fpgrowth = FPGrowth(itemsCol=c, minSupport=self.minSupport, minConfidence=self.minConfidence)
        model = fpgrowth.fit(dataset)
        return model
This runs in the MLFlow pipeline.
pipeline = Pipeline(stages=[CustomGroupBy(), PipelineFPGrowth()]).fit(df)
This all works. If I create a new pyspark dataframe with new data to predict on, I get predictions.
newDF = spark.createDataFrame([(123456, ['123ABC', '789JSF'])], ["userid", "widgetid"])
pipeline.stages[1].transform(newDF).show(3, False)
# How to access frequent itemsets.
pipeline.stages[1].freqItemsets.show(3, False)
Where I run into problems is writing a custom predict method. I need to append the frequent itemsets that FPGrowth generates to the end of the predictions. I have written the logic for that, but I'm having a hard time figuring out how to put it into a custom method. I tried adding it to my custom estimator, but that didn't work. Then I wrote a separate class to take in the returned model and produce the extended predictions, which was also unsuccessful.
Eventually I need to log and save the model so I can Dockerize it, which means I will need a custom flavor and to use the pyfunc function. Does anyone have a hint on how to extend the predict method and then log and save the model?
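For the logging/saving part, one direction that might work is a custom mlflow.pyfunc.PythonModel that loads the saved Spark pipeline as an artifact and applies the extra post-processing in its predict method. This is only a sketch; the artifact key, the saved-pipeline path, and the post-processing step are assumptions, not the asker's code.

import mlflow.pyfunc
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

class FPGrowthPyfunc(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # "spark_pipeline" is an assumed artifact key pointing at a saved PipelineModel
        self.pipeline = PipelineModel.load(context.artifacts["spark_pipeline"])

    def predict(self, context, model_input):
        # pyfunc hands the input over as a pandas DataFrame
        spark = SparkSession.builder.getOrCreate()
        spark_df = spark.createDataFrame(model_input)
        predictions = self.pipeline.transform(spark_df)
        # ... append the frequent itemsets here before returning ...
        return predictions.toPandas()

mlflow.pyfunc.log_model(
    artifact_path="fpgrowth_model",
    python_model=FPGrowthPyfunc(),
    artifacts={"spark_pipeline": "path/to/saved_pipeline"},  # assumed save location
)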

What does calling fit() multiple times on the same model do?

After I instantiate a scikit model (e.g. LinearRegression), if I call its fit() method multiple times (with different X and y data), what happens? Does it fit the model on the data like if I just re-instantiated the model (i.e. from scratch), or does it keep into accounts data already fitted from the previous call to fit()?
Trying with LinearRegression (also looking at its source code), it seems to me that every time I call fit(), it fits from scratch, ignoring the result of any previous call to the same method. I wonder if this is true in general, and whether I can rely on this behavior for all models/pipelines of scikit-learn.
If you execute model.fit(X_train, y_train) a second time, it'll overwrite all previously fitted coefficients, weights, intercept (bias), etc.
If you want to fit just a portion of your data set and then improve your model by fitting new data, you can use estimators that support "incremental learning" (those that implement the partial_fit() method).
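For example, with an estimator that supports incremental learning (SGDRegressor here, chosen only for illustration, on random data):

import numpy as np
from sklearn.linear_model import SGDRegressor

X1, y1 = np.random.rand(50, 3), np.random.rand(50)
X2, y2 = np.random.rand(50, 3), np.random.rand(50)

model = SGDRegressor()
model.partial_fit(X1, y1)  # initial fit
model.partial_fit(X2, y2)  # updates the existing weights instead of starting from scratch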
You can use the terms fit() and train() interchangeably in machine learning. Depending on the classification model you have instantiated, for example clf = GaussianNB() or clf = SVC(), your model uses the specified machine learning technique.
As soon as you call clf.fit(features_train, label_train), your model starts training using the features and labels that you have passed.
You can then use clf.predict(features_test) to predict.
If you call clf.fit(features_train2, label_train2) again, it will start training again using the new data and will discard the previous results. Your model will reset the following:
Weights
Fitted Coefficients
Bias
And other training related stuff...
You can also use the partial_fit() method if you want to keep the previously learned parameters and additionally train on the next batch of data.
Beware that the model is passed kind of "by reference". Here, model1 will be overwritten:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

df1 = pd.DataFrame(np.random.rand(100).reshape(10, 10))
df2 = df1.copy()
df2.iloc[0, 0] = df2.iloc[0, 0] - 2  # change one value

pca = PCA()
model1 = pca.fit(df1)
model2 = pca.fit(df2)

np.unique(model1.explained_variance_ == model2.explained_variance_)
Returns
array([ True])
To avoid this, use:
from copy import deepcopy

model1 = deepcopy(pca.fit(df1))
