Really confused with this numpy shape mismatch error - python

I was implementing a simple k-nearest-neighbor algorithm from sklearn on Google Cloud Platform ML Engine. I used a custom metric to calculate the distance between two input vectors, so that the distance is the weighted sum of the element-wise squared differences between the two vectors. The code is below:
import os.path
from sklearn import neighbors
import numpy as np
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.python.lib.io import file_io
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('input_dir', 'input', 'Input Directory.')
flags.DEFINE_string('input_train_data','train_data','Input Training Data File Name.')
pickle_file = os.path.join(FLAGS.input_dir, FLAGS.input_train_data)
def mydist(x, y):
    return np.dot((x - y) ** 2, weight)
with file_io.FileIO(pickle_file, 'r') as f:
    save = pickle.load(f)
train_dataset, train_labels, valid_dataset, valid_labels = save['train_dataset'], save['train_labels'], save['valid_dataset'], save['valid_labels']
train_data = train_dataset[:1000]
train_label = train_labels[:1000]
test_data = valid_dataset[:100]
weight = [1.0]* len(train_dataset[1])
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=lambda x, y: mydist(x, y))
knn.fit(train_data, train_label)
predict = knn.predict(test_data)
print(predict)
train_dataset is a numpy array of shape (86667, 13) and valid_dataset has shape (8000, 13). train_labels has shape (86667, 1) and valid_labels has shape (8000, 1). For some reason I got a dimension mismatch:
line 15, in mydist
    return np.dot((x - y) ** 2, weight)
ValueError: shapes (10,) and (13,) not aligned: 10 (dim 0) != 13 (dim 0)
Both x and y passed to the custom metric should have size 13, yet somehow they have size 10. Can anyone explain what is wrong here?

You are taking the distance between the wrong terms. You cannot take the distance between the labels and the training features; they have different dimensions. You need to calculate the distance between two feature points, say x1 and x2, not between a feature point and its label (say x1 and y1). Secondly, when declaring the KNeighborsRegressor object you have specified the metric parameter incorrectly: in the metric parameter you specify either a string or a DistanceMetric object. First you have to build a distance metric object and then pass it as metric. So this is the correct way of calling your function:
from sklearn.neighbors import DistanceMetric
my_metric = DistanceMetric.get_metric('pyfunc', func=mydist)
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=my_metric)
Sklearn will itself take care of how the arguments are passed to the distance function. I am assuming that the weight variable is global, for your code to function correctly.

Related

Right way to apply pre-trained scikit-learn model to dask array?

I have a scikit-learn model, which has already been trained. It takes 8 features as an input.
Now I would like to apply that model to an arbitrarily big dask array (of shape (n_samples, 8)) and write the result to disk.
I fail at applying my model to a dask array and getting back a dask array.
Setup:
import dask.array as da
from dask import delayed
# pre-trained classifier/model already exists
# I will refer to this model as "clf"
# test data
X = da.full((100,8), 0)
# this works fine, but returns a numpy array...
y_predicted = clf.predict(X)
What I have tried
# 1. intention: apply classifier to each row (features)
y_predicted = da.apply_along_axis(clf.pedict, 1, X)
# fails with
# AttributeError: 'RandomForestClassifier' object has no attribute 'pedict'
# 2. apply blocks
y_predicted = da.map_blocks(clf.predict, X)
# quickly runs out of memory, process killed
# 3. write a delayed function that returns a dask array
@delayed
def predict_delayed(da_array):
    return da.from_array(clf.predict(da_array))
y_predicted = predict_delayed(X)
y_predicted.compute()
# fails with TypeError: Expected sequence or array-like, got <class 'method'>
I have spent quite some time on this and I am quite clueless on how to proceed.
Any suggestions?
I finally figured it out, here is a minimal reproducible example:
import dask.array as da
from dask import delayed
from sklearn.ensemble import RandomForestClassifier as RF
n_samples = 1000
# random training data
X_train = da.random.random((n_samples,8))
y_train = da.random.randint(0, 2, n_samples)
rf = RF(random_state = 42, n_estimators=50)
rf.fit(X_train, y_train)
# some data to predict on
X_new = da.random.random((n_samples,8))
# 1. option: delayed function
@delayed
def predict_delayed(da_array: da.array) -> da.array:
    return da.from_array(rf.predict(da_array))
y_predicted = predict_delayed(X_new)
# 2. option: map_blocks
# (the key here was to supply the drop_axis and dtype arguments)
y_predicted = da.map_blocks(rf.predict, X_new, dtype="i8", drop_axis=1)
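As a usage note (my own check of the snippet above, not part of the original answer): the map_blocks result is a lazy dask array of shape (n_samples,), and nothing is actually predicted until you call compute().
# the map_blocks result is lazy; prediction runs only on compute()
print(y_predicted.shape)        # (1000,)
y_pred_np = y_predicted.compute()
print(y_pred_np[:10])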

How to do multiple inferencing on onnx(onnxruntime) similar to sklearn

I want to infer outputs against many inputs from an onnx model using onnxruntime in Python. One way is to use a for loop, but that seems like a very naive and slow method. Is there a way to do it the same way as in sklearn?
Single prediction on onnxruntime:
import onnxruntime as ort
sess = ort.InferenceSession("xxxxx.onnx")
input_name = sess.get_inputs()
label_name = sess.get_outputs()[0].name
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40]]).astype(np.int64),
    input_name[1].name: np.array([[0]]).astype(np.int64),
    input_name[2].name: np.array([[0]]).astype(np.int64)
})
pred_onnx
>> Output: [array([[23]], dtype=float32)]
Single/Multiple prediction in sklearn(depending on the size of x_test):
test_predictions = model.predict(x_test)
The best way is for the ONNX model to support batches. Based on the input you're providing, it may already do that. Your 3 inputs appear to have shape [1,1] and your output has shape [1,1], which may mean the first dimension is the batch size. Example input with shape [2,1] (2 batches, 1 element per batch) would look like [[40],[50]].
I'm guessing that if you provide two batches' worth of input you'd get two outputs, so something like this
pred_onnx = sess.run([label_name], {
    input_name[0].name: np.array([[40],[40]]).astype(np.int64),
    input_name[1].name: np.array([[0],[0]]).astype(np.int64),
    input_name[2].name: np.array([[0],[0]]).astype(np.int64)
})
May give output of
[array([[23],[23]], dtype=float32)]
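If you have many values to score, you can build each input as an (N, 1) array in one step instead of looping. A sketch with hypothetical values, assuming the model's first dimension really is the batch size as discussed above:
batch = 5
feat0 = np.arange(40, 45).reshape(-1, 1).astype(np.int64)  # hypothetical values for the first input
feat1 = np.zeros((batch, 1), dtype=np.int64)
feat2 = np.zeros((batch, 1), dtype=np.int64)
pred_onnx = sess.run([label_name], {
    input_name[0].name: feat0,
    input_name[1].name: feat1,
    input_name[2].name: feat2
})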
Here is a small working example using batch inference on a sklearn model exported to ONNX.
from sklearn import datasets, model_selection, linear_model, pipeline, preprocessing
import numpy as np
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime
import pandas as pd
# load toy dataset, define sklearn pipeline and fit model
dataset = datasets.load_diabetes()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
regr = pipeline.Pipeline(
    [("std", preprocessing.StandardScaler()), ("reg", linear_model.LinearRegression())]
)
regr.fit(X_train, y_train)
# export model to onnx
initial_type = list(
    zip(
        dataset.feature_names,
        [FloatTensorType([None, 1]) for _ in range(len(dataset.feature_names))],
    )
)
onx = convert_sklearn(regr, initial_types=initial_type)
with open("model.onnx", "wb") as f:
f.write(onx.SerializeToString())
# load model in onnx runtime and make batch inference
df_test = pd.DataFrame(X_test, columns=dataset.feature_names)
sess = onnxruntime.InferenceSession("model.onnx")
inputs = {
    f: df_test[f].astype(np.float32).values.reshape(-1, 1)
    for f in dataset.feature_names
}
label_name = sess.get_outputs()[0].name
pred_onx = sess.run([label_name], inputs)[0]
# compare results
regr.predict(X_test)
pred_onx.flatten()
I think the trickiest part is getting the input shape right for inference.
Since we specified FloatTensorType([None, 1]), each single-input array must have shape (x, 1), where x is the number of batches. Thus we need to reshape column values of shape (x,) into (x, 1).
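For example (a small illustration using the 'age' column of the diabetes test frame built above):
col = df_test["age"].astype(np.float32).values  # shape (n_test,)
col_2d = col.reshape(-1, 1)                     # shape (n_test, 1), matching FloatTensorType([None, 1])
print(col.shape, col_2d.shape)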

Partial derivatives of Gaussian Process wrt features

Given a Gaussian Process Model with multidimensional features and scalar observations, how do I compute derivatives of the output wrt to each input, in GPyTorch or GPFlow (or scikit-learn)?
If I understand your question correctly, the following should give you what you want in GPflow with TensorFlow:
import numpy as np
import tensorflow as tf
import gpflow
### Set up toy data & model -- change as appropriate:
X = np.linspace(0, 10, 5)[:, None]
Y = np.random.randn(5, 1)
data = (X, Y)
kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR(data, kernel)
Xtest = np.linspace(-1, 11, 7)[:, None] # where you want to predict
### Compute gradient of prediction with respect to input:
# TensorFlow can only compute gradients with respect to tensor objects,
# so let's convert the inputs to a tensor:
Xtest_tensor = tf.convert_to_tensor(Xtest)
with tf.GradientTape(
    persistent=True  # this allows us to compute different gradients below
) as tape:
    # By default, only Variables are watched. For gradients with respect to tensors,
    # we need to explicitly watch them:
    tape.watch(Xtest_tensor)
    mean, var = model.predict_f(Xtest_tensor)  # or any other predict function
grad_mean = tape.gradient(mean, Xtest_tensor)
grad_var = tape.gradient(var, Xtest_tensor)
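A note on shapes (my reading, not part of the original answer): tape.gradient sums over the output, so grad_mean and grad_var have the same shape as Xtest_tensor; because each test prediction depends only on its own input row, row i holds the partial derivatives of mean[i] (respectively var[i]) with respect to the features of test point i.
print(grad_mean.shape)  # (7, 1), same as Xtest
print(grad_var.shape)   # (7, 1)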

PCA().fit() is using the wrong axis for data input

I'm using sklearn.decomposition.PCA to pre-process some training data for a machine learning model. There are 247 data points with 4095 dimensions, imported from a csv file using pandas. I then scale the data
training_data = StandardScaler().fit_transform(training[:,1:4096])
before calling the PCA algorithm to obtain the variance for each dimension,
pca = PCA(n_components)
pca.fit(training_data)
The output is a vector of length 247, but it should have length 4095 so that I can work out the variance of each dimension, not the variance of each data point.
My code looks like:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
test = np.array(pd.read_csv("testing.csv", sep=','))
training = np.array(pd.read_csv("training.csv", sep=','))
# ID Number = [0]
# features = [1:4096]
training_data = StandardScaler().fit_transform(training[:,1:4096])
test_data = StandardScaler().fit_transform(test[:,1:4096])
training_labels = training[:,4609]
pca = PCA()
pca.fit(training_data)
pca_variance = pca.explained_variance_
I have tried taking the transpose of training_data, but this didn't change the output. I have also tried changing n_components in the argument of the PCA function, but it is insistent that there can only be 247 dimensions.
This may be a stupid question, but I'm very new to this sort of data processing. Thank you.
You said:
" but it should have length 4095 so that I can work out the variance of
each dimension, not the variance of each data point."
No. That would only be true if you estimated 4095 components, using pca = PCA(n_components=4095).
On the other hand, you define:
pca = PCA() # this is actually PCA(n_components=None)
so n_components is set to None.
When this happens we have (see the documentation here):
n_components == min(n_samples, n_features)
Thus, in your case, you have min(247, 4095) = 247 components.
So pca.explained_variance_ will be a vector of shape (247,), since you have 247 PC dimensions.
Why do we have n_components == min(n_samples, n_features) ?
This is related to the rank of the covariance/correlation matrix. Having a data matrix X with shape [247,4095], the covariance/correlation matrix would be [4095,4095] with max rank = min(n_samples, n_features). Thus, you have at most min(n_samples, n_features) meaningful PC components/dimensions.
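A quick way to convince yourself (a small sketch with random data of the same shape, not your actual csv files):
import numpy as np
from sklearn.decomposition import PCA
X = np.random.rand(247, 4095)          # same shape as your training_data
pca = PCA()                            # n_components=None
pca.fit(X)
print(pca.n_components_)               # 247 == min(247, 4095)
print(pca.explained_variance_.shape)   # (247,)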

Keras tensor has an additional dimension and causes wrong results for net.evaluate()

I'd like to train a neural network in Python and Keras using a metric learning custom loss function. The loss minimizes the distances of the outputs for similar inputs and maximizes the distances between dissimilar ones. The part considering similar inputs is:
# function to create a pairwise similarity matrix, i.e.
# L[i,j] == 1 for similar samples i, j and 0 otherwise
def build_indicator_matrix(y_, thr=0.1):
    # y_: contains the labels of the samples,
    # samples are similar in case of same label
    # prevent checking equality of floats --> check if absolute
    # differences are below threshold
    lbls_diff = K.expand_dims(y_, axis=0) - K.expand_dims(y_, axis=1)
    lbls_thr = K.less(K.abs(lbls_diff), thr)
    # cast bool tensor back to float32
    L = K.cast(lbls_thr, 'float32')
    # POSSIBLE WORKAROUND
    #L = K.sum(L, axis=2)
    return L

# function to compute the (squared) Euclidean distances between all pairs
# of samples, store in DIST[i,j] the distance between output y_pred[i,:] and y_pred[j,:]
def compute_pairwise_distances(y_pred):
    DIFF = K.expand_dims(y_pred, axis=0) - K.expand_dims(y_pred, axis=1)
    DIST = K.sum(K.square(DIFF), axis=-1)
    return DIST

# function to compute the average distance between all similar samples
def my_loss(y_true, y_pred):
    # y_true: contains true labels of the samples
    # y_pred: contains network outputs
    L = build_indicator_matrix(y_true)
    DIST = compute_pairwise_distances(y_pred)
    return K.mean(DIST * L, axis=1)
For training, I pass a numpy array y of shape (n,) as the target variable to my_loss. However, I found (using the computational graph in TensorBoard) that the tensorflow backend creates a 2D variable out of y (displayed shape ? x ?), and hence L in build_indicator_matrix is not 2- but 3-dimensional (shape ? x ? x ? in TensorBoard). This causes net.evaluate() and net.fit() to compute wrong results.
Why does tensorflow create a 2D rather than a 1D array? And how does this affect net.evaluate() and net.fit()?
As quick workarounds I found that either replacing build_indicator_matrix() with static numpy code for computing L, or collapsing the "fake" dimension with the line L = K.sum(L, axis=2), solves the problem. In the latter case, however, the output of K.eval(build_indicator_matrix(y)) is only of shape (n,) and not (n,n), so I do not understand why this workaround still yields correct results. Why does tensorflow introduce an additional dimension?
My library versions are:
keras: 2.2.4
tensorflow: 1.8.0
numpy: 1.15.0
This is because evaluate and fit work in batches.
The first dimension you see in tensorboard is the batch dimension, unknown in advance and therefore denoted ?.
When using custom metrics, remember the tensors (y_true and y_pred) you get are the ones corresponding to the batch.
For more info, show us how you call both those functions.
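To make the extra axis concrete (my own numpy illustration, not part of the original answer): as the question already observed, the backend hands the loss a 2-D target of shape (batch_size, 1) even when a flat (n,) array is passed to fit, and the two expand_dims calls then broadcast that trailing axis into a third dimension:
import numpy as np
y_flat = np.random.rand(4)     # what you pass to fit(): shape (4,)
y_col = y_flat[:, None]        # what the loss actually receives: shape (4, 1)
L_flat = np.expand_dims(y_flat, 0) - np.expand_dims(y_flat, 1)  # shape (4, 4)
L_col = np.expand_dims(y_col, 0) - np.expand_dims(y_col, 1)     # shape (4, 4, 1) -- the extra dimension
print(L_flat.shape, L_col.shape)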
