How to run inference on multiple inputs with ONNX (onnxruntime), similar to sklearn - python

I want to infer outputs for many inputs from an ONNX model using onnxruntime in Python. One way is a for loop over the inputs, but that seems naive and slow. Is there a way to do it in a single call, the way sklearn does?
Single prediction on onnxruntime:
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40]]).astype(np.int64),
    input_name[1].name: np.array([[0]]).astype(np.int64),
    input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/multiple prediction in sklearn (depending on the size of x_test):
test_predictions = model.predict(x_test)

The best way is for the ONNX model to support batches. Based on the input you're providing, it may already do that: your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. An example input with shape [2,1] (a batch of 2, 1 element per sample) would look like [[40],[50]].
I'm guessing that if you provide two batches' worth of input you'd get two outputs, so something like this
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40], [40]]).astype(np.int64),
    input_name[1].name: np.array([[0], [0]]).astype(np.int64),
    input_name[2].name: np.array([[0], [0]]).astype(np.int64)
})
That may give an output of
[array([[23],[23]], dtype=float32)]
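Before relying on that guess, you can check whether the first dimension is dynamic by inspecting the session's input metadata; a quick sketch (the model path is the placeholder from the question):
import onnxruntime as ort

sess = ort.InferenceSession("xxxxx.onnx")
for inp in sess.get_inputs():
    # a dynamic batch dimension shows up as None or a symbolic name
    # such as 'batch_size' instead of a fixed integer
    print(inp.name, inp.shape, inp.type)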

Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd
# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
regr = pipeline.Pipeline(
    [("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)
# export model to onnx
initial_type = list(
    zip(
        dataset.feature_names,
        [FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
    )
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())
# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
    f: df_test[f].astype(np.float32).values.reshape(-1, 1)
    for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]
# compare results
regr.predict(X_test)
pred_onx.flatten()
I think the trickiest part is getting the input shape right for inference.
Since we specified FloatTensorType([None, 1]), each single-feature input array must have shape (x, 1), where x is the number of samples in the batch. That is why we reshape each column of shape (x,) into (x, 1).
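As a side note (not part of the original answer): if per-feature inputs aren't required, exporting with a single 2-D tensor input avoids the per-column reshaping entirely. A sketch under that assumption, reusing regr, X_train and X_test from above; the input name "input" is arbitrary:
# export with one input of shape (n_samples, n_features)
initial_type = [("input", FloatTensorType([None, X_train.shape[1]]))]
onx = convert_sklearn(regr, initial_types=initial_type)
sess = onnxruntime.InferenceSession(onx.SerializeToString())
# the whole test matrix goes in as a single float32 array
pred = sess.run(None, {"input": X_test.astype(np.float32)})[0]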

Related

Random split a PyTorch dataset of type TensorDataset

I have a custom dataset loaded into Python as numpy arrays: a (20640x8) matrix of inputs and a (20640x1) vector of labels.
I am trying to prepare the data for training in a PyTorch machine learning model, which requires a training set and test set split. In my attempt, the random_split() function reports an error:
TypeError: randperm() received an invalid combination of arguments.
I couldn't figure out how to split the dataset. Here is the code I wrote:
import numpy as np
import torch
from torch.utils.data import TensorDataset, random_split
x_numpy = # (20640x8) matrix of floats
y_numpy = # (20640x1) vector of floats
x = torch.from_numpy(x_numpy.astype(np.float32))
y = torch.from_numpy(y_numpy.astype(np.float32))
dataset = TensorDataset(x, y)
trainSet, testSet = random_split(dataset, [0.6*len(dataset), 0.4*len(dataset)])
Thanks in advance for the help!
The split lengths must be integers:
>>> split = int(0.6*len(dataset))
>>> trainSet, testSet = random_split(dataset, [split, len(dataset)-split])
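Note that, as far as I know, recent PyTorch versions (1.13 and later) also accept fractional lengths directly, in which case the original call works with ratios:
>>> trainSet, testSet = random_split(dataset, [0.6, 0.4])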

Merge two tensorflow datasets into one dataset with inputs and labels

I have two tensorflow datasets that are generated using timeseries_dataset_from_array (docs). One corresponds to the input of my network and the other one to the output. I guess we can call them the inputs dataset and the targets dataset, which are both the same shape (a timeseries window of a fixed size).
The code I'm using to generate these datasets goes like this:
train_x = timeseries_dataset_from_array(
    df_train['x'],
    None,
    sequence_length,
    sequence_stride=sequence_stride,
    batch_size=batch_size
)
train_y = timeseries_dataset_from_array(
    df_train['y'],
    None,
    sequence_length,
    sequence_stride=sequence_stride,
    batch_size=batch_size
)
The problem is that when calling model.fit, tf.keras expects that if a tf.data.Dataset is given in the x argument, it has to provide both the inputs and targets. That is why I need to combine these two datasets into one, setting one as inputs and the other one as targets.
The simplest way would be to use tf.data.Dataset.zip:
import tensorflow as tf
import numpy as np

X = np.arange(100)
Y = X * 2
sample_length = 20
input_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
    X, None, sequence_length=sample_length, sequence_stride=sample_length)
target_dataset = tf.keras.preprocessing.timeseries_dataset_from_array(
    Y, None, sequence_length=sample_length, sequence_stride=sample_length)
dataset = tf.data.Dataset.zip((input_dataset, target_dataset))
for x, y in dataset:
    print(x.shape, y.shape)
(5, 20) (5, 20)
You can then feed dataset directly to your model.
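For example, with a hypothetical model whose input and output match the (batch, 20) windows above, the zipped dataset plugs straight into fit (casting the integer windows to float first, since X and Y above are integer arrays):
# the toy windows above are integer-typed, so cast them for a float model
dataset = dataset.map(lambda a, b: (tf.cast(a, tf.float32), tf.cast(b, tf.float32)))
model = tf.keras.Sequential([tf.keras.layers.Dense(20)])
model.compile(optimizer="adam", loss="mse")
# each dataset element is an (inputs, targets) pair, which is exactly
# what fit expects from a tf.data.Dataset
model.fit(dataset, epochs=2)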

Right way to apply pre-trained scikit-learn model to dask array?

I have a scikit-learn model, which has already been trained. It takes 8 features as an input.
Now I would like to apply that model to an arbitrarily big dask array (of shape (n_samples, 8)) and write the result to disk.
I fail at applying my model to a dask array and getting back a dask array.
Setup:
import dask.array as da
from dask import delayed
# pre-trained classifier/model already exists
# I will refer to this model as "clf"
# test data
X = da.full((100,8), 0)
# this works fine, but returns a numpy array...
y_predicted = clf.predict(X)
What I have tried
# 1. intention: apply classifier to each row (features)
y_predicted = da.apply_along_axis(clf.pedict, 1, X)
# fails with
# AttributeError: 'RandomForestClassifier' object has no attribute 'pedict'
# 2. apply blocks
y_predicted = da.map_blocks(clf.predict, X)
# quickly runs out of memory, process killed
# 3. write a delayed function that returns a dask array
@delayed
def predict_delayed(da_array):
    return da.from_array(clf.predict(da_array))

y_predicted = predict_delayed(X)
y_predicted.compute()
# fails with TypeError: Expected sequence or array-like, got <class 'method'>
I have spent quite some time on this and I am quite clueless on how to proceed.
Any suggestions?
I finally figured it out, here is a minimal reproducible example:
import dask.array as da
from dask import delayed
from sklearn.ensemble import RandomForestClassifier as RF
n_samples = 1000
# random training data
X_train = da.random.random((n_samples,8))
y_train = da.random.randint(0, 2, n_samples)
rf = RF(random_state = 42, n_estimators=50)
rf.fit(X_train, y_train)
# some data to predict on
X_new = da.random.random((n_samples,8))
# 1. option: delayed function
@delayed
def predict_delayed(da_array: da.Array) -> da.Array:
    return da.from_array(rf.predict(da_array))
y_predicted = predict_delayed(X_new)
# 2. option: map_blocks
# (key was here to supply drop_axis and dtype arguments)
y_predicted = da.map_blocks(rf.predict, X_new, dtype="i8", drop_axis=1)
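Since the original goal was to write the result to disk, the lazy array from the map_blocks option can be streamed out without materialising everything in memory; a sketch, assuming the zarr package is installed:
# evaluates the blocks and writes them to disk as they are computed
da.to_zarr(y_predicted, "y_predicted.zarr")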

Tensorflow model predicting Nans

I am new to the TensorFlow framework and I am trying to apply TensorFlow to predict survivors based on this Titanic dataset: https://www.kaggle.com/c/titanic/data.
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
#%%
titanictrain = pd.read_csv('train.csv')
titanictest = pd.read_csv('test.csv')
df = pd.concat([titanictrain,titanictest],join='outer',keys='PassengerId',sort=False,ignore_index=True).drop(['Name'],1)
#%%
def preprocess(df):
    df['Fare'].fillna(value=df.groupby('Pclass')['Fare'].transform('median'), inplace=True)
    df['Fare'] = df['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
    df['Embarked'].fillna(value=df['Embarked'].mode()[0], inplace=True)
    df['CabinAlphabet'] = df['Cabin'].str[0]
    categories_to_one_hot = ['Pclass', 'Sex', 'Embarked', 'CabinAlphabet']
    df = pd.get_dummies(df, columns=categories_to_one_hot, drop_first=True)
    return df
df = preprocess(df)
df = df.drop(['PassengerId','Ticket','Cabin','Survived'],1)
titanic_trainandval = df.iloc[:len(titanictrain)]
titanic_test = df.iloc[len(titanictrain):] #test after preprocessing
titanic_test.head()
# split train into training and validation set
labels = titanictrain['Survived']
y = labels.values
test = titanic_test.copy() # real test sets
print(len(test), 'test examples')
Here I am trying to apply preprocessing to the data:
1. Drop the Name column and do one-hot encoding on both the train and test sets.
2. Drop ['PassengerId','Ticket','Cabin','Survived'] for simplicity.
3. Split train and test following the original order.
Here is a picture showing what the training set looks like.
"""# model training"""
from tensorflow.keras.layers import Input, Dense, Activation,Dropout
from tensorflow.keras.models import Model
X = titanic_trainandval.copy()
input_layer = Input(shape=(X.shape[1],))
dense_layer_1 = Dense(10, activation='relu')(input_layer)
dense_layer_2 = Dense(5, activation='relu')(dense_layer_1)
output = Dense(1, activation='softmax',name = 'predictions')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
base_learning_rate = 0.0001
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate), metrics=['acc'])
history = model.fit(X, y, batch_size=5, epochs=20, verbose=2, validation_split=0.1,shuffle = False)
submission = pd.DataFrame()
submission['PassengerId'] = titanictest['PassengerId']
Then I put the training set X into the model to get the result. However, the training history shows the following:
No matter how I change the learning rate and batch size, the result does not change: the loss is always nan, and the predictions on the test set are always nan as well.
Could anybody explain where the problem is and give some possible solutions?
At first glance there are two major problems in your code:
Your output layer must be Dense(2, activation='softmax'). This is because yours is a binary classification problem and, if you are using softmax to generate probabilities, the output dim must equal the number of classes. (Alternatively, you can use one output dimension with a sigmoid activation.)
You have to change your loss function. With softmax and a numerically encoded target, use sparse_categorical_crossentropy. (With sigmoid you can use binary_crossentropy, with its default from_logits=False.)
PS: make sure to remove all NaNs from your original data before starting the fit.
Marco Cerliani is right on points 1 and 2.
The real reason you get NaNs is that you feed NaNs into the model. If you look carefully, even in your third screenshot, the 888th example contains a NaN in the Age column.
That is why you get NaNs. Fix that, apply Marco Cerliani's suggestions, and you're good to go.
Apart from the above answers, one more thing I would like to add: whenever you want to use from_logits=True for classification problems, use a linear activation, i.e. activation='linear', which is the default value for the activation of the last layer.
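Putting these suggestions together, one possible corrected head is the sigmoid variant both answers mention; a sketch against the question's code, not the exact fix any answerer posted:
output = Dense(1, activation='sigmoid', name='predictions')(dense_layer_2)
model = Model(inputs=input_layer, outputs=output)
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
              optimizer=tf.keras.optimizers.Adam(base_learning_rate),
              metrics=['acc'])
# impute the NaNs (e.g. the missing Age values) before fitting
X = X.fillna(X.median(numeric_only=True))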

Error: Out Of Memory, tensorflow cnn

I am using the Tensorflow CNN getting-started example and updating the parameters for my own data, but since my model is large (244 * 244 features) I get an OutOfMemory error.
I am running the training on Ubuntu 14.04 with 4 CPUs and 16 GB of RAM.
Is there a way to shrink my data so I don't get this OOM error?
My code looks like this:
# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
    model_fn=cnn_model_fn, model_dir="path/to/model")
# Load the data
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": np.array(training_set.data)},
    y=np.array(training_set.target),
    num_epochs=None,
    batch_size=5,
    shuffle=True)
# Train the model
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
Is there a way to shrink my data so I don't get this OOM error?
You can slice your training_set to obtain just a portion of the dataset. Something like:
x={"x": np.array(training_set.data)[:(len(training_set)/2)]},
y=np.array(training_set.target)[:(len(training_set)/2)],
In this example you are getting the first half of your dataset (you can select up to what point of your dataset you want to load).
Edit: Another way you can do this is to obtain a random subset of your training dataset. This you can achieve by masking elements on your dataset array. For example:
import numpy as np
from random import random as rn

# obtain a boolean mask to filter out some elements
# here you can define your sample %
r = 0.5  # say, drop about half the elements
mask = [rn() >= r for _ in range(len(training_set))]

# finally, mask out those elements;
# the result will have ~(1 - r) times the original number of elements
reduced_ds = training_set[mask]
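If the memory pressure comes from the 244 * 244 inputs themselves rather than the number of examples, another option is to downscale each example before training. A rough numpy sketch, under the assumption that each row of training_set.data is a flattened 244x244 image:
import numpy as np

x = np.array(training_set.data, dtype=np.float32).reshape(-1, 244, 244)
# naive 4x downscale by striding; crude, but keeps everything in numpy
x_small = x[:, ::4, ::4]  # shape (n, 61, 61)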
