I have a scikit-learn model which has already been trained. It takes 8 features as input.
Now I would like to apply that model to an arbitrarily big dask array (of shape (n_samples, 8)) and write the result to disk.
I am failing to apply my model to a dask array and get a dask array back.
Setup:
import dask.array as da
from dask import delayed
# pre-trained classifier/model already exists
# I will refer to this model as "clf"
# test data
X = da.full((100,8), 0)
# this works fine, but returns a numpy array...
y_predicted = clf.predict(X)
What I have tried
# 1. intention: apply classifier to each row (features)
y_predicted = da.apply_along_axis(clf.pedict, 1, X)
# fails with
# AttributeError: 'RandomForestClassifier' object has no attribute 'pedict'
# (this particular error is just the typo 'pedict' instead of 'predict')
# 2. apply blocks
y_predicted = da.map_blocks(clf.predict, X)
# quickly runs out of memory, process killed
# 3. write a delayed function that returns a dask array
@delayed
def predict_delayed(da_array):
    return da.from_array(clf.predict(da_array))
y_predicted = predict_delayed(X)
y_predicted.compute()
# fails with TypeError: Expected sequence or array-like, got <class 'method'>
I have spent quite some time on this and I am quite clueless on how to proceed.
Any suggestions?
I finally figured it out, here is a minimal reproducible example:
import dask.array as da
from dask import delayed
from sklearn.ensemble import RandomForestClassifier as RF
n_samples = 1000
# random training data
X_train = da.random.random((n_samples,8))
y_train = da.random.randint(0, 2, n_samples)
rf = RF(random_state = 42, n_estimators=50)
rf.fit(X_train, y_train)
# some data to predict on
X_new = da.random.random((n_samples,8))
# Option 1: a delayed function
@delayed
def predict_delayed(da_array: da.Array) -> da.Array:
    return da.from_array(rf.predict(da_array))
y_predicted = predict_delayed(X_new)
# Option 2: map_blocks
# (the key here was to supply the drop_axis and dtype arguments)
y_predicted = da.map_blocks(rf.predict, X_new, dtype="i8", drop_axis=1)
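The question also asked about writing the result to disk. Since y_predicted is still a lazy dask array, it can be streamed to disk block by block instead of being materialized in memory. A minimal sketch, assuming the zarr package is installed (out.zarr is just a placeholder path):
# computes the blocks and writes them to disk as they are produced
y_predicted.to_zarr("out.zarr")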
Related
I have a custom dataset loaded into Python as NumPy arrays: a (20640x8) matrix of inputs and a (20640x1) vector of labels.
I am trying to prepare the data for training a PyTorch machine learning model, which requires a training/test split. In my attempt, the random_split() function reports an error:
TypeError: randperm() received an invalid combination of arguments.
I couldn't figure out how to split the dataset. Here is the code I wrote:
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split
x_numpy = # (20640x8) matrix of floats
y_numpy = # (20640x1) vector of floats
x = torch.from_numpy(x_numpy.astype(np.float32))
y = torch.from_numpy(y_numpy.astype(np.float32))
dataset = TensorDataset(x, y)
trainSet, testSet = random_split(dataset, [0.6*len(dataset), 0.4*len(dataset)])
Thanks in advance for the help!
The split lengths must be integers:
>>> split = int(0.6*len(dataset))
>>> trainSet, testSet = random_split(dataset, [split, len(dataset)-split])
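If you are on a recent PyTorch version, random_split should also accept fractional lengths that sum to 1, so the split can then be written directly as:
>>> trainSet, testSet = random_split(dataset, [0.6, 0.4])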
I essentially want to tag each of the targets so that afterwards I can analyze each sample individually and look at it in much greater detail. I have a dataset and I split it into training, validation, and testing sets.
import numpy as np
from sklearn.model_selection import train_test_split
ins = np.load('ins.npy')
ins.shape
# (100,12,60,60)
type(ins)
# <class 'numpy.ndarray'>
Then I send it through training. I assume here I need to attach another np array that gives identifying information about each sample (a sketch of what I mean follows the split code below).
ins_train, ins_test , outs_train, outs_test = train_test_split(ins, outs, test_size=0.25, random_state=3)
ins_train, ins_val, outs_train, outs_val = train_test_split(ins_train, outs_train, test_size =0.2, random_state=2)
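For example, something like this is what I have in mind: train_test_split splits every array passed to it consistently, so an identifying array could simply ride along with the features and labels (timestamps.npy here is hypothetical, one timestamp per sample, aligned with ins and outs):
timestamps = np.load('timestamps.npy')  # hypothetical identifying info, one entry per sample
ins_train, ins_test, outs_train, outs_test, training_timestamps, testing_timestamps = \
    train_test_split(ins, outs, timestamps, test_size=0.25, random_state=3)
ins_train, ins_val, outs_train, outs_val, training_timestamps, validation_timestamps = \
    train_test_split(ins_train, outs_train, training_timestamps, test_size=0.2, random_state=2)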
However, I do use a generator to send this data through a keras neural network. Should I provide that code as well? I will provide some, let me know if more is needed.
Here I make the generator:
import random  # used below for sampling minibatch indices

def training_set_generator_images(ins, outs, batch_size=10,
                                  input_name='input',
                                  output_name='output'):
    '''
    Generator for producing random minibatches of image training samples.

    @param ins Full set of training set inputs (examples x row x col x chan)
    @param outs Corresponding set of sample outputs (examples x nclasses)
    @param batch_size Number of samples for each minibatch
    @param input_name Name of the model layer that is used for the input of the model
    @param output_name Name of the model layer that is used for the output of the model
    '''
    while True:
        # Randomly select a set of example indices
        example_indices = random.choices(range(ins.shape[0]), k=batch_size)
        # The generator will produce a pair of return values: one for inputs and one for outputs
        yield({input_name: ins[example_indices, :, :, :]},
              {output_name: outs[example_indices, :, :, :]})
I create a keras neural network. I assume any neural network would do, but I'm not sure.
model = create_uNet(ins_train.shape[1:])
# call the generator
generator = training_set_generator_images(ins_train, outs_train, batch_size=50,
                                           input_name='input',
                                           output_name='output')
And then we fit the model.
history = model.fit(x=generator,epochs=1000)
# and we save the results
results = {}
results['predict_training'] = model.predict(ins_train)
results['predict_training_eval'] = model.evaluate(ins_train, outs_train)
results['true_training'] = outs_train
results['predict_validation'] = model.predict(ins_val)
results['predict_validation_eval'] = model.evaluate(ins_val, outs_val)
results['true_validation'] = outs_val
results['true_testing'] = outs_test
results['predict_testing'] = model.predict(ins_test)
results['predict_testing_eval'] = model.evaluate(ins_test, outs_test)
results['history'] = history.history
Within results['true_testing'] are the truths for the test set, and results['predict_testing'] are the corresponding predictions. For this to work, each example results['predict_testing'][i] would need to carry, in addition to the image, the identifying information (in our case, a timestamp). Or there could be three additional entries:
results['training_timestamps'] = training_timestamps
results['validation_timestamps'] = validation_timestamps
results['testing_timestamps'] = testing_timestamps
Please let me know if you need more information. It seems like the solution might be quick and easy, but the generator throws me off.
I want to infer outputs for many inputs from an ONNX model using onnxruntime in Python. One way is to use a for loop, but that seems like a very naive and slow method. Is there a way to do it the same way as in sklearn?
Single prediction on onnxruntime:
import numpy as np
import onnxruntime as ort
sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40]]).astype(np.int64),
    input_name[1].name: np.array([[0]]).astype(np.int64),
    input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/multiple prediction in sklearn (depending on the size of x_test):
test_predictions = model.predict(x_test)
The best way is for the ONNX model to support batches. Based on the input you're providing, it may already do that. Your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. An example input with shape [2,1] (2 batches, 1 element per batch) would look like [[40],[50]].
I'm guessing that if you provide two batches of input you'd get two outputs, so something like this
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40],[40]]).astype(np.int64),
    input_name[1].name: np.array([[0],[0]]).astype(np.int64),
    input_name[2].name: np.array([[0],[0]]).astype(np.int64)
})
which may give an output of
[array([[23],[23]], dtype=float32)]
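One way to check whether the first dimension really is a batch dimension is to print the declared input/output shapes from the session metadata (a quick sketch using the sess from the question):
for inp in sess.get_inputs():
    print(inp.name, inp.shape)  # a None or symbolic first dimension usually indicates batching is supported
out = sess.get_outputs()[0]
print(out.name, out.shape)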
Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd
# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
regr = pipeline.Pipeline(
    [("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)
# export model to onnx
initial_type = list(
    zip(
        dataset.feature_names,
        [FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
    )
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
f.write(onx.SerializeToString())
# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
    f: df_test[f].astype(np.float32).values.reshape(-1, 1)
    for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]
# compare results
regr.predict(X_test)
pred_onx.flatten()
I think the trickiest part is to get the input shape right for inference.
Since we specified FloatTensorType([None, 1]), each single input array must have shape (x, 1), where x is the number of samples in the batch. Thus we need to reshape column values of shape (x,) into (x, 1).
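To make the reshape concrete, here is a tiny standalone example for one feature column:
import numpy as np
col = np.array([0.1, 0.2, 0.3], dtype=np.float32)  # shape (3,)
col_2d = col.reshape(-1, 1)                        # shape (3, 1): one row per sample, one feature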
I currently have the following situation where I want to use DataLoader to batch a numpy array:
import numpy as np
import torch
import torch.utils.data as data_utils
# Create toy data
x = np.linspace(start=1, stop=10, num=10)
x = np.array([np.random.normal(size=len(x)) for i in range(100)])
print(x.shape)
# >> (100,10)
# Create DataLoader
input_as_tensor = torch.from_numpy(x).float()
dataset = data_utils.TensorDataset(input_as_tensor)
dataloader = data_utils.DataLoader(dataset,
                                   batch_size=100,
                                   )
batch = next(iter(dataloader))
print(type(batch))
# >> <class 'list'>
print(len(batch))
# >> 1
print(type(batch[0]))
# >> <class 'torch.Tensor'>
I expect the batch to already be a torch.Tensor. As of now I index the batch like so, batch[0], to get a Tensor, but I feel this is not really pretty and makes the code harder to read.
I found that the DataLoader takes a batch processing function called collate_fn. However, setting data_utils.DataLoader(..., collate_fn=lambda batch: batch[0]) only changes the list to a tuple (tensor([ 0.8454, ..., -0.5863]),) where the only entry is the batch as a Tensor.
You would help me a lot by showing me how to elegantly transform the batch into a tensor (even if that just means telling me that indexing the single entry in batch is okay).
Actually, you don't have to create a Dataset from your tensor; you can pass a torch.Tensor directly, as it implements __getitem__ and __len__, so this is sufficient:
import numpy as np
import torch
import torch.utils.data as data_utils
# Create toy data
x = np.linspace(start=1, stop=10, num=10)
x = np.array([np.random.normal(size=len(x)) for i in range(100)])
# Create DataLoader
dataset = torch.from_numpy(x).float()
dataloader = data_utils.DataLoader(dataset, batch_size=100)
batch = next(iter(dataloader))
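With this setup the default collate function simply stacks the per-sample tensors, so batch should now come back as a plain tensor rather than a one-element list:
print(type(batch))   # <class 'torch.Tensor'>
print(batch.shape)   # torch.Size([100, 10])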
I was implementing a simple k-nearest-neighbor algorithm from sklearn on Google Cloud Platform ML Engine. I used a custom metric to calculate the distance between two input vectors, so that the distance is the weighted sum of the element-wise squared differences between the two vectors. The code is below:
import os.path
from sklearn import neighbors
import numpy as np
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.python.lib.io import file_io
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('input_dir', 'input', 'Input Directory.')
flags.DEFINE_string('input_train_data','train_data','Input Training Data File Name.')
pickle_file = os.path.join(FLAGS.input_dir, FLAGS.input_train_data)
def mydist(x, y):
return np.dot((x - y) ** 2, weight)
with file_io.FileIO(pickle_file, 'r') as f:
    save = pickle.load(f)
train_dataset, train_labels, valid_dataset, valid_labels = (
    save['train_dataset'], save['train_labels'], save['valid_dataset'], save['valid_labels'])
train_data = train_dataset[:1000]
train_label = train_labels[:1000]
test_data = valid_dataset[:100]
weight = [1.0]* len(train_dataset[1])
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=lambda x, y: mydist(x, y))
knn.fit(train_data, train_label)
predict = knn.predict(test_data)
print(predict)
train_dataset is a numpy array of shape (86667, 13) and valid_dataset has shape (8000, 13). train_labels has shape (86667, 1) and valid_labels (8000, 1). For some reason I got a dimension mismatch:
line 15, in mydist
    return np.dot((x - y) ** 2, weight)
ValueError: shapes (10,) and (13,) not aligned: 10 (dim 0) != 13 (dim 0)
Both x and y in the input of the custom metric should have size 13 yet somehow they have size 10. Can anyone explain what is wrong here?
You are taking the distance between the wrong terms. You cannot take the distance between the labels and the training features; they have different dimensions. You need to calculate the distance between two feature points, say x1 and x2, not between a label and its feature point (say y1 and x1). Secondly, when declaring the KNeighborsRegressor object, you have specified the metric parameter incorrectly. The metric parameter expects either a metric name (string) or a DistanceMetric object; sklearn's name for a user-defined distance function is 'pyfunc', and the function itself is handed over via metric_params. So this is a correct way of wiring in your function:
# 'pyfunc' is the identifier sklearn reserves for user-defined distance functions;
# the function itself is passed through metric_params
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20,
                                    metric='pyfunc', metric_params={'func': mydist})
sklearn will itself take care of how the parameters are passed into the distance function. I am assuming that the weight variable is global, which it needs to be for your code to function correctly.
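As a side note (my own suggestion, not strictly needed for the fix): the global weight could be avoided by giving the metric function an extra parameter and passing it through metric_params, which sklearn forwards as additional keyword arguments to the metric:
def mydist(x, y, w):
    return np.dot((x - y) ** 2, w)

knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20,
                                    metric=mydist, metric_params={'w': weight})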