Saving data with DataCatalog - python

I was looking at the iris project example provided by Kedro. Apart from logging the accuracy, I also wanted to save the predictions and test_y as a CSV.
This is the example node provided by kedro.
import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> None:
    """Node for reporting the accuracy of the predictions performed by the
    previous node. Notice that this function has no outputs, except logging.
    """
    # Get true class index
    target = np.argmax(test_y.to_numpy(), axis=1)
    # Calculate accuracy of predictions
    accuracy = np.sum(predictions == target) / target.shape[0]
    # Log the accuracy of the model
    log = logging.getLogger(__name__)
    log.info("Model accuracy on test set: %0.2f%%", accuracy * 100)
I added the following to save the data.
data = pd.DataFrame({"target": target, "prediction": predictions})
data_set = CSVDataSet(filepath="data/test.csv")
data_set.save(data)
This works as intended; however, my question is: is this the Kedro way of doing things? Can I declare the data set in catalog.yml and later save data to it? If so, how do I access the data set from catalog.yml inside a node?
Is there a way to save data without creating a data set inside a node like data_set = CSVDataSet(filepath="data/test.csv")? I would like this defined in catalog.yml, if possible and if it follows Kedro conventions.

Kedro actually abstracts this part for you. You don't need to access the datasets via their Python API.
Your report_accuracy method does need to be tweaked to return the DataFrame instead of None.
Your node needs to be defined as such:
node(
    func=report_accuracy,
    inputs='dataset_a',
    outputs='dataset_b'
)
Kedro then looks at your catalog and will load/save dataset_a and dataset_b as required:
dataset_a:
  type: pandas.CSVDataSet
  filepath: xxxx.csv

dataset_b:
  type: pandas.ParquetDataSet
  filepath: yyyy.pq
As you run the node/pipeline, Kedro will handle the load/save operations for you. You also don't need to save every dataset if it's only used mid-way through a pipeline; you can read about MemoryDataSet in the Kedro documentation.
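For illustration, here is a minimal sketch of the tweaked node, assuming the catalog entries above (the dataset names in the node wiring are illustrative): the function simply returns the DataFrame and Kedro persists it to whatever catalog entry is wired to the node's output.

import logging

import numpy as np
import pandas as pd


def report_accuracy(predictions: np.ndarray, test_y: pd.DataFrame) -> pd.DataFrame:
    """Report accuracy and return the predictions so Kedro can save them."""
    target = np.argmax(test_y.to_numpy(), axis=1)
    accuracy = np.sum(predictions == target) / target.shape[0]
    logging.getLogger(__name__).info(
        "Model accuracy on test set: %0.2f%%", accuracy * 100
    )
    # Returning the DataFrame lets the catalog handle persistence.
    return pd.DataFrame({"target": target, "prediction": predictions})

# In the pipeline definition, wire the output to a catalog entry
# (names here are illustrative):
# node(report_accuracy, inputs=["example_predictions", "example_test_y"],
#      outputs="prediction_report")

Kedro then calls the underlying dataset's save method for you whenever the output name matches an entry in catalog.yml.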

Related

Writing a custom predict method using MLFlow and pyspark

I am having trouble writing a custom predict method using MLFlow and pyspark (2.4.0). What I have so far is a custom transformer that changes the data into the format I need.
class CustomGroupBy(Transformer):
    def __init__(self):
        pass

    def _transform(self, dataset):
        df = dataset.select("userid", explode(split("widgetid", ",")).alias("widgetid"))
        return df
Then I built a custom estimator to run one of the pyspark machine learning algorithms
class PipelineFPGrowth(Estimator, HasInputCol, DefaultParamsReadable, DefaultParamsWritable):
    def __init__(self, inputCol=None, minSupport=0.005, minConfidence=0.01):
        super(PipelineFPGrowth, self).__init__()
        self.minSupport = minSupport
        self.minConfidence = minConfidence

    def setInputCol(self, value):
        return self._set(inputCol=value)

    def _fit(self, dataset):
        c = self.getInputCol()
        fpgrowth = FPGrowth(itemsCol=c, minSupport=self.minSupport, minConfidence=self.minConfidence)
        model = fpgrowth.fit(dataset)
        return model
This runs in the pipeline:
pipeline = Pipeline(stages=[CustomGroupBy(), PipelineFPGrowth()]).fit(df)
This all works. If I create a new pyspark dataframe with new data to predict on, I get predictions.
newDF = spark.createDataFrame([(123456,['123ABC', '789JSF'])], ["userid", "widgetid"])
pipeline.stages[1].transform(newDF).show(3, False)
# How to access frequent itemset.
pipeline.stages[1].freqItemsets.show(3, False)
Where I run into problems is writing a custom predict. I need to append the frequent itemset that FPGrowth generates to the end of the predictions. I have written the logic for that, but I am having a hard time figuring out how to put it into a custom method. I have tried adding it to my custom estimator but this didn't work. Then I wrote a separate class to take in the returned model and give the extended predictions. This was also unsuccessful.
Eventually I need to log and save the model so I can Dockerize it, which means I will need a custom flavor and to use mlflow.pyfunc. Does anyone have a hint on how to extend the predict method and then log and save the model?
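For what it's worth, one common pattern (not a confirmed solution to this question) is to wrap the fitted pipeline in an mlflow.pyfunc.PythonModel whose predict method appends the extra information. The sketch below is only an outline under that assumption; the artifact key, paths, and the _append_freq_itemsets helper are placeholders.

import mlflow.pyfunc


class FPGrowthWithItemsets(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper: custom predict that appends frequent itemsets."""

    def load_context(self, context):
        # Placeholder: reload the fitted Spark PipelineModel from a logged artifact.
        # (The artifact key and path are illustrative, not from the original question.)
        from pyspark.ml import PipelineModel
        self._model = PipelineModel.load(context.artifacts["spark_pipeline"])

    def predict(self, context, model_input):
        # NOTE: pyfunc hands predict a pandas DataFrame, so in practice the input
        # would need converting to a Spark DataFrame before calling transform.
        predictions = self._model.transform(model_input)
        return self._append_freq_itemsets(predictions)

    def _append_freq_itemsets(self, predictions):
        # Placeholder for the asker's own logic that appends FPGrowth's
        # frequent itemsets to the prediction output.
        return predictions

The wrapped model could then be logged with mlflow.pyfunc.log_model(artifact_path="model", python_model=FPGrowthWithItemsets(), artifacts={"spark_pipeline": "path/to/saved/pipeline"}) and served from a Docker image via MLflow's model-serving tooling.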

AutoML Pipelines: Label extraction from input data and sampling within Neuraxle or SKLearn Pipelines

I am working on a project that is looking for a lean Python AutoML pipeline implementation. As per project definition, data entering the pipeline is in the format of serialised business objects, e.g. (artificial example):
property.json:
{
    "area": "124",
    "swimming_pool": "False",
    "rooms": [
        ... some information on individual rooms ...
    ]
}
Machine learning targets (e.g. predicting whether a property has a swimming pool based on other attributes) are stored within the business object rather than delivered in a separate label vector and business objects may contain observations which should not be used for training.
What I am looking for
I need a pipeline engine which supports initial (or later) pipeline steps that i) dynamically change the targets in the machine learning problem (e.g. extract from input data, threshold real values) and ii) resample input data (e.g. upsampling, downsampling of classes, filtering observations).
The pipeline ideally should look as follows (pseudocode):
swimming_pool_pipeline = Pipeline([
    ("label_extractor", SwimmingPoolExtractor()),  # skipped in prediction mode
    ("sampler", DataSampler()),  # skipped in prediction mode
    ("featurizer", SomeFeaturization()),
    ("my_model", FitSomeModel())
])

swimming_pool_pipeline.fit(training_data)  # not passing in any labels
preds = swimming_pool_pipeline.predict(test_data)
The pipeline execution engine needs to fulfill/allow for the following:
During model training (.fit()) SwimmingPoolExtractor extracts target labels from the input training data and passes labels on (alongside independent variables);
In training mode, DataSampler() uses the target labels extracted in the previous step to sample observations (e.g. could do minority upsampling or filter observations);
In prediction-mode, the SwimmingPoolExtractor() does nothing and just passes on the input data;
In prediction-mode, the DataSampler() does nothing and just passes on the input data;
Example
For example, assume that the data looks as follows:
property.json:
"properties" = [
{ "id_": "1",
"swimming_pool": "False",
...,
},
{ "id_": "2",
"swimming_pool": "True",
...,
},
{ "id_": "3",
# swimming_pool key missing
...,
}
]
The application of SwimmingPoolExtractor() would extract something like:
"labels": [
{"id_": "1", "label": "0"},
{"id_": "2", "label": "1"},
{"id_": "3", "label": "-1"}
]
from the input data and set these as the machine learning pipeline's "targets" (labels).
The application of DataSampler() could, for example, further include logic that removes any training instance that does not contain a swimming_pool key (label = -1) from the training data.
Subsequent steps should use the modified training data (filtered, not including the observation with id_="3") to fit the model. As stated above, in prediction mode, the DataSampler and SwimmingPoolExtractor would simply pass through the input data.
How To
To my knowledge, neither Neuraxle nor sklearn (for the latter I am certain) offers pipeline steps that meet the required functionality (from what I have gathered so far, Neuraxle must at least have support for slicing data, given that it implements cross-validation meta-estimators).
Am I missing something, or is there a way to implement such functionality in either of these pipeline models? If not, are there alternatives to the listed libraries within the Python ecosystem that are reasonably mature and support such use cases (leaving aside issues that might arise from designing pipelines in such a manner)?
"Am I missing something, or is there a way to implement such functionality"
Yes, all you want to do can be done rather easily with Neuraxle:
You're missing out on the output handlers to transform output data! With this, you can send some x into y within the pipeline (thus effectively not passing in any labels to fit as you want to do).
You're also missing out on the TrainOnlyWrapper to transform data only at train time! This is useful to deactivate any pipeline step at test-time (and also at validation-time). Note that this way, it won't do the data filtering or resampling when evaluating the validation metrics.
You could also use the AutoML object to do the training loop.
This assumes that the input data you pass to "fit" is an iterable of something (e.g. don't pass the whole JSON at once; at least make it something that can be iterated over). At worst, pass a list of IDs and add a step that converts the IDs into something else, for instance using an object that can fetch the JSON by itself and do whatever it needs with the passed IDs.
Here is your updated code:
# Note: the exact import locations of the mixins and wrappers below may vary
# across Neuraxle versions (see neuraxle.base, neuraxle.steps.output_handlers
# and neuraxle.steps.flow).
from neuraxle.base import BaseStep, NonFittableMixin
from neuraxle.pipeline import Pipeline
from neuraxle.steps.flow import TrainOnlyWrapper
from neuraxle.steps.output_handlers import InputAndOutputTransformerMixin


class SwimmingPoolExtractor(NonFittableMixin, InputAndOutputTransformerMixin, BaseStep):
    # Note: you may need to remove NonFittableMixin from the list here if you
    # encounter problems, and define "fit" yourself rather than having it
    # provided by default through the mixin class.

    def transform(self, data_inputs):
        # Here, the InputAndOutputTransformerMixin will pass
        # a tuple of (x, y) rather than just x.
        x, _ = data_inputs

        # Please note that you should pre-split your json into
        # lists before the pipeline so as to have this assert pass:
        assert hasattr(x, "__iter__"), "input data must be iterable at least."
        x, y = self._do_my_extraction(x)  # TODO: implement this as you wish!

        # Note that InputAndOutputTransformerMixin expects you
        # to return an (x, y) tuple, not only x.
        outputs = (x, y)
        return outputs


class DataSampler(NonFittableMixin, BaseStep):
    def transform(self, data_inputs):
        # TODO: implement this as you wish!
        data_inputs = self._do_my_sampling(data_inputs)

        assert hasattr(data_inputs, "__iter__"), "data must stay iterable at least."
        return data_inputs
swimming_pool_pipeline = Pipeline([
    TrainOnlyWrapper(SwimmingPoolExtractor()),  # skipped in `.predict(...)` call
    TrainOnlyWrapper(DataSampler()),  # skipped in `.predict(...)` call
    SomeFeaturization(),
    FitSomeModel()
])

swimming_pool_pipeline.fit(training_data)  # not passing in any labels!
preds = swimming_pool_pipeline.predict(test_data)
Note that you could also do as follows to replace the call to fit:
# AutoML and its helpers come from the neuraxle.metaopt package;
# mean_squared_error comes from sklearn.metrics.
auto_ml = AutoML(
    swimming_pool_pipeline,
    # You can create your own splitter class if needed to replace this one.
    # Dig into the source code of Neuraxle and see how it's done to create
    # your own replacement.
    validation_splitter=ValidationSplitter(0.20),
    refit_trial=True,
    n_trials=10,
    epochs=1,
    cache_folder_when_no_handle=str(tmpdir),
    scoring_callback=ScoringCallback(mean_squared_error, higher_score_is_better=False),  # mean_squared_error from sklearn
    hyperparams_repository=InMemoryHyperparamsRepository(cache_folder=str(tmpdir)),
)

best_swimming_pool_pipeline = auto_ml.fit(training_data).get_best_model()
preds = best_swimming_pool_pipeline.predict(test_data)
Side note if you want to use the advanced data caching features
If you want to use caching, you should not define any transform methods, and instead you should define handle_transform methods (or related methods) so as to keep the order of the data "ID"s when you resample the data. Neuraxle is made to process iterable data, and this is why I've added some asserts above, so as to ensure your JSON is already preprocessed such that it is some kind of list of something.
Other useful code references:
https://github.com/Neuraxio/Neuraxle/blob/7957be352e564dd5dfc325f7ae23f51e9c4690a2/neuraxle/steps/data.py#L33
https://github.com/Neuraxio/Neuraxle/blob/d30abfc5f81b261db7c6717fb939f0e64aca1583/neuraxle/metaopt/auto_ml.py#L586

Oversampling with Cross Validation in PySpark Pipeline

I'm working on a PySpark binary classification Pipeline where I want to perform cross-validation with an oversampling stage (my dataset is not balanced). The issue is that the oversampling stage is also executed on the test dataset.
The pipeline:
pipeline=Pipeline(stages=[cast_and_fill_na, smote, vec_assembler, rf])
smote is the stage I want to skip when transforming the test dataset.
I took a look at the Spark documentation and source code; there's no way to skip a stage in a PipelineModel. My solution was to override the _transform method of the original class in order to skip the oversampling stage.
This works fine when fitting the pipeline in my source code. I use this:
pipeline_model.__class__ = CustomPipelineModel
CustomPipelineModel is a class that inherits from pyspark.ml.PipelineModel and overrides the _transform method.
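The override itself isn't shown in the question; for illustration, here is a minimal sketch of what such a _transform override could look like (the skip_in_transform flag is a hypothetical marker attribute set on the SMOTE stage, not part of PySpark):

from pyspark.ml import PipelineModel


class CustomPipelineModel(PipelineModel):
    """Hypothetical sketch: skip stages flagged as train-only at transform time."""

    def _transform(self, dataset):
        for stage in self.stages:
            # skip_in_transform is an illustrative attribute on the oversampling stage.
            if getattr(stage, "skip_in_transform", False):
                continue
            dataset = stage.transform(dataset)
        return dataset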
But as the CrossValidator uses the original implementation of the PipelineModel class, I can't use my custom method.
evaluator = BinaryClassificationEvaluator(labelCol=target)
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=10,
                          parallelism=1)
cvModel = crossval.fit(train_set)
What is the best way to skip the oversampling stage when using CrossValidator?
I started to look into the source code of the _fit method of pyspark.ml.tuning.CrossValidator, thinking about overriding it too. The second option is to perform the oversampling on the training dataset up front, but this will introduce bias into the models in the cross-validation process.
I came up with a workaround for this problem.
In my SMOTEOversampler class (the smote stage is an instance of it), I added an attribute named skip_transform, which is set to None when a SMOTEOversampler object is instantiated. On the first call to _transform, I set this attribute to True, so the next call to _transform (which happens in the test phase) is skipped. Here is a code snippet:
def __init__(self, ...):
    self.skip_transform = None

def _transform(self, df):
    if self.skip_transform:
        return df
    else:
        # Execute oversampling here, then flag the stage so the next
        # call (the test phase) passes the data through untouched.
        self.skip_transform = True
        return df  # return the oversampled DataFrame

Using the same script for training and serving (Estimator + hub)

I want to use hub at training and serving time, but I am getting a little confused about how to do it in the same graph. Namely, I have something like:
def build_graph(..., mode, ...):
    tags_and_args = ...  # one for training, one for serving

    if mode == 'training':
        hub.create_module_spec(module_fn, tags_and_args=tags_and_args)
        module_output = hub.Module(...)
        hub.register_module_for_export(module_fn, tags_and_args=tags_and_args)
        loss, output = ...
    else:
        module_output = hub.Module(XXX)
Should I reload the module from disk? In that case XXX would be the path where I saved it before. Or is it somehow kept as a graph object in memory?
I will call my code as:
estimator.train(...)
exporter = hub.LatestModuleExporter(...)
exporter.export(...)
estimator.export_savedmodel(...)  # for serving
You can use a hub.Module in the model_fn of an Estimator without ever exporting it. At the start of Estimator.train(), the module's variables will be initialized from their pre-trained values (much like other variables are initialized randomly). After that, the module's variables behave much like the other variables of your model - they are part of the model's checkpoint, and restored from there for evaluation, resumed training, or export to a SavedModel for serving, like any other variable.
Exporting a hub.Module is only needed in case you want to create a new version of the module (with the weights updated from your training) available to yet another, separate Estimator.
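For a concrete picture, here is a minimal TF1-style sketch of using a hub.Module directly inside an Estimator's model_fn; the module handle, feature name, label handling, and model head are all illustrative placeholders, not taken from the question.

import tensorflow as tf
import tensorflow_hub as hub


def model_fn(features, labels, mode):
    # Illustrative text-embedding module; the handle and feature key are placeholders.
    embed = hub.Module("https://tfhub.dev/google/nnlm-en-dim128/1", trainable=True)
    embeddings = embed(features["text"])

    logits = tf.layers.dense(embeddings, units=2)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode, predictions={"logits": logits})

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)


estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="/tmp/model")
# estimator.train(...), estimator.evaluate(...) and estimator.export_savedmodel(...)
# all restore the module's variables from the checkpoint like any other variables.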

Store/Reload CNTK Trainer, Model, Inputs, Outputs

What is the best way to store a trainer and all necessary components?
1. Storing:
Store a checkpoint of the trainer: use its trainer.save_checkpoint(filename, external_state={}) function.
Additionally, store the model separately: use the z.save(filename) method that every CNTK operation has. You can also get it via z = trainer.model.
2. Reloading:
Restore the model: use C.load_model(...). (Don't get confused by the deprecated persist namespace from CNTK 1.)
Get the inputs from the restored model.
Restore the trainer itself: use trainer.restore_from_checkpoint as e.g. shown here. The problem is that this function already needs a trainer object, which probably has to be initialized in the same way as the trainer used to create the checkpoint!?
How do I now restore the label inputs that go into the error function used by the trainer? In the following code, the variables I think I have to restore after storing them are marked in the comments:
z = C.layers.Dense(....)
loss = error = C.squared_error(z, l)  # l needs to be restored
trainer = C.Trainer(z, (loss, error), [mylearner], my_tensorboard_writer)  # trainer and z need to be restored
You can restore your trainer, but I actually prefer to just load my model m. The simple reason is that it is much easier to create a whole new trainer, because then you can change all its other parameters more easily.
Then you can get the input variable from the loaded model (if your network has only one input):
input_var = m.arguments[0]
then you need the output of your model:
output = m(input_var)
and define the loss function using your target output target_output:
C.squared_error(output, target_output)
Using your model and the loss function, you can recreate your trainer from there, setting the learning rate etc. as you like.
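A minimal sketch of that recipe, assuming a single-input regression model previously saved as model.dnn (the file name, shapes, and learner settings are illustrative):

import cntk as C

# Load the previously saved model.
m = C.load_model("model.dnn")

# Recover the model's input variable (assumes a single input).
input_var = m.arguments[0]
output = m(input_var)

# Fresh placeholder for the target values, matching the output shape.
target_output = C.input_variable(output.shape)
loss = error = C.squared_error(output, target_output)

# Recreate a trainer with whatever learner settings you like now.
# On older CNTK versions use C.learning_rate_schedule(0.01, C.UnitType.minibatch) instead.
learner = C.sgd(m.parameters, lr=C.learning_parameter_schedule(0.01))
trainer = C.Trainer(output, (loss, error), [learner])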
