vectorized "by-layer" scaling of numpy array - python

I have a numpy array (let's say 100x64x64).
My goal is to scale each 64x64 layer independently and store a scaler for later use.
This is how it can be achieved with a for-loop solution:
scalers_dict = {}
for i in range(X.shape[0]):
    scalers_dict[i] = MinMaxScaler()
    # fit the scaler and transform this layer in place
    X[i, :, :] = scalers_dict[i].fit_transform(X[i, :, :])
# save the dict of scalers
joblib.dump(value=scalers_dict, filename="dict_of_scalers.scaler")
My real array is much bigger, and iterating through it takes quite a while.
Do you have a more vectorized solution in mind, or is a for-loop the only way?

If I understand correctly how MinMaxScaler works, it treats each column of a 2D input as an independent feature and computes its statistics along axis=0.
To make this useful for your case, you'd need to transform X into a (64 * 64, 100) array:
s = X.shape
X = np.moveaxis(X, 0, -1).reshape(-1, s[0])
Alternatively, you can write
X = X.reshape(s[0], -1).T
Now you can do the scaling with
M = MinMaxScaler()
X = M.fit_transform(X)
Since the actual fit is computed along the first dimension, all of the learned statistics will be of size 100, one per layer. They broadcast perfectly now that the last dimension has the matching size.
To get the original shape back, invert the original transformation:
X = X.T.reshape(s)
When you are done, M will be a scaler calibrated for 100 features. There is no need for a dictionary here: a dictionary keyed by a consecutive sequence of integers is usually better expressed as a list or array, which is effectively what happens in this setup.
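Putting it together, a minimal sketch of the full round trip (variable names and the output filename are mine; assumes X is a float array of shape (100, 64, 64)):
import numpy as np
import joblib
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(100, 64, 64)  # stand-in for the real data
s = X.shape

# one column per layer: (64 * 64, 100)
X2 = X.reshape(s[0], -1).T

M = MinMaxScaler()
X2 = M.fit_transform(X2)

# restore the original (100, 64, 64) shape
X = X2.T.reshape(s)

# a single scaler object replaces the dictionary
joblib.dump(value=M, filename="layer_scaler.scaler")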

IIUC, you can manually scale:
mm, MM = inputs.min(axis=(1,2)), inputs.max(axis=(1,2))
# save these for later use
joblib.dump((mm,MM), 'minmax.joblib')
def scale(inputs, mm, MM):
    return (inputs - mm[:, None, None]) / (MM - mm)[:, None, None]
# load pre-saved min & max
mm, MM = joblib.load('minmax.joblib')
# scaled inputs
scale(inputs, mm, MM)
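To undo the scaling later, the inverse is just the algebraic reversal of scale. This helper is my addition, not part of the original answer:
def unscale(scaled, mm, MM):
    # invert scale(): multiply by the per-layer range, then add the minimum back
    return scaled * (MM - mm)[:, None, None] + mm[:, None, None]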


How to do argmax in group in pytorch?

Is there any way to implement max-pooling according to the norm of sub-vectors in a group in PyTorch? Specifically, this is what I want to implement:
Input:
x: a 2-D float tensor, shape #Nodes * dim
cluster: a 1-D long tensor, shape #Nodes
Output:
y: a 2-D float tensor, where
y[i] = x[k] with k = argmax over {k : cluster[k] == i} of torch.norm(x[k], p=2).
I tried torch.scatter with reduce="max", but this only works for dim=1 and x[i]>0.
Can someone help me solve this problem?
I don't think there's any built-in function to do what you want. Basically this would be some form of scatter_reduce on the norm of x, but instead of selecting the max norm you want to select the row corresponding to the max norm.
A straightforward implementation may look something like this:
"""
input
x: float tensor of size [NODES, DIMS]
cluster: long tensor of size [NODES]
output
float tensor of size [cluster.max()+1, DIMS]
"""
num_clusters = cluster.max().item() + 1
y = torch.zeros((num_clusters, DIMS), dtype=x.dtype, device=x.device)
for cluster_id in torch.unique(cluster):
x_cluster = x[cluster == cluster_id]
y[cluster_id] = x_cluster[torch.argmax(torch.norm(x_cluster, dim=1), dim=0)]
This should work just fine if cluster.max() is relatively small. If there are many clusters, though, this approach has to create an unnecessary mask over cluster for every unique cluster id. To avoid this you can make use of argsort. The best I could come up with in pure Python is the following:
num_clusters = cluster.max().item() + 1
x_norm = torch.norm(x, dim=1)
cluster_sortidx = torch.argsort(cluster)
cluster_ids, cluster_counts = torch.unique_consecutive(cluster[cluster_sortidx], return_counts=True)
end_indices = torch.cumsum(cluster_counts, dim=0).cpu().tolist()
start_indices = [0] + end_indices[:-1]
y = torch.zeros((num_clusters, DIMS), dtype=x.dtype, device=x.device)
for cluster_id, a, b in zip(cluster_ids, start_indices, end_indices):
    indices = cluster_sortidx[a:b]
    y[cluster_id] = x[indices[torch.argmax(x_norm[indices], dim=0)]]
For example, in random tests with NODES = 60000, DIMS = 512, and cluster.max() = 6000, the first version takes about 620 ms while the second version takes about 78 ms.
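For reference, here is a self-contained version of the argsort approach, wrapped in a function with a small smoke test (the random shapes are my own, not from the question):
import torch

def group_argmax(x: torch.Tensor, cluster: torch.Tensor) -> torch.Tensor:
    """Per-cluster row selection by max L2 norm; the argsort-based version from above."""
    num_clusters = cluster.max().item() + 1
    x_norm = torch.norm(x, dim=1)
    cluster_sortidx = torch.argsort(cluster)
    cluster_ids, cluster_counts = torch.unique_consecutive(
        cluster[cluster_sortidx], return_counts=True)
    end_indices = torch.cumsum(cluster_counts, dim=0).cpu().tolist()
    start_indices = [0] + end_indices[:-1]
    y = torch.zeros((num_clusters, x.shape[1]), dtype=x.dtype, device=x.device)
    for cluster_id, a, b in zip(cluster_ids, start_indices, end_indices):
        indices = cluster_sortidx[a:b]
        y[cluster_id] = x[indices[torch.argmax(x_norm[indices], dim=0)]]
    return y

# hypothetical smoke test with small random inputs
x = torch.randn(1000, 16)
cluster = torch.randint(0, 50, (1000,))
y = group_argmax(x, cluster)
print(y.shape)  # torch.Size([50, 16])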

Efficient way to fill NumPy array for independent entries?

I'm currently trying to fill a matrix K where each entry in the matrix is just a function applied to two entries of an array x.
At the moment I'm using the most obvious method of running through rows and columns one at a time using a double for-loop:
K = np.zeros((x.shape[0], x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    for j in range(x.shape[0]):
        K[i, j] = f(x[i], x[j])
While this works fine, the resulting matrix is 10,000 by 10,000 and takes very long to calculate. Is there a more efficient way to do this built into NumPy?
EDIT: The function in question here is a gaussian kernel:
def gaussian(a, b, sigma):
    vec = a - b
    return np.exp(-np.dot(vec, vec) / (2 * sigma**2))
where I set sigma in advance before calculating the matrix.
The array x is an array of shape (10000, 8). So the scalar product in the gaussian is between two vectors of dimension 8.
You can use a single for-loop together with broadcasting. This requires changing the implementation of the gaussian function to accept 2D inputs:
def gaussian(a, b, sigma):
    vec = a - b
    return np.exp(-np.sum(vec**2, axis=-1) / (2 * sigma**2))

K = np.zeros((x.shape[0], x.shape[0]), dtype=np.float32)
for i in range(x.shape[0]):
    K[i] = gaussian(x[i:i+1], x)
Theoretically you could accomplish this without any for-loop at all, again by using broadcasting, but this creates an intermediate array of size len(x)**2 * x.shape[1], which may exceed the available memory for your array sizes:
K = gaussian(x[None, :, :], x[:, None, :])
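As a middle ground, the kernel can also be computed without the huge intermediate by expanding ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. A sketch, assuming x and sigma are defined as above (the clipping step is my own precaution):
import numpy as np

sq = np.sum(x**2, axis=1)                        # squared norms, shape (N,)
D2 = sq[:, None] + sq[None, :] - 2 * (x @ x.T)   # pairwise squared distances, (N, N)
np.maximum(D2, 0, out=D2)                        # clip tiny negatives from rounding
K = np.exp(-D2 / (2 * sigma**2))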

Why does this euclidean distance calculation method explode RAM usage?

I'm studying the KNN algorithm to classify images, using some material from a 2017 Stanford course. We're given a dataset of many images, those sets are represented as 2D numpy arrays, and we're supposed to write functions that calculate distances between those images. More specifically, given a 2D array of the test images and a 2D array of the training images, I'm asked to write an L2 distance function that takes those two sets as inputs and returns a distance matrix, in which every row i represents a test image and every column j represents a training image.
The exercise also asked me to do it without any loops and without using the np.abs function. So I gave it a try:
def compute_distances_no_loops(self, X):
    """
    Compute the distance between each test point in X and each training point
    in self.X_train using no explicit loops.

    Input / Output: Same as compute_distances_two_loops
    """
    num_test = X.shape[0]
    num_train = self.X_train.shape[0]
    dists = np.zeros((num_test, num_train))
    all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
    dists = np.sqrt(np.sum(all_test_subs_sq, axis=2))
    return dists
Apparently that makes Google's Colab environment crash in 6 seconds by allocating about 60 GB of RAM. I should clarify that the training set X_train has shape (5000, 3072) and the test set X has shape (500, 3072). I am not sure what happens here that is so RAM-intensive, but then again I'm not the smartest guy at figuring out space complexity.
I googled a bit and found a solution that works without the need for a NASA computer; it uses the expanded sum-of-squares formula:
dists = np.reshape(np.sum(X**2, axis=1), [num_test, 1]) \
        + np.sum(self.X_train**2, axis=1) \
        - 2 * np.matmul(X, self.X_train.T)
dists = np.sqrt(dists)
I'm also not really sure why this solution doesn't explode like mine did. I'd really appreciate any insight here; thank you very much for reading.
In the compute_distances_no_loops() function the intermediate array all_test_subs_sq has the shape (500, 3072, 5000), so it consists of 500 * 3072 * 5000 = 7,680,000,000 elements. Assuming that the dtype of X and X_train is float64, each element weighs 8 bytes, so the total size of the array is 61,440,000,000 bytes, i.e. about 60 GB.
The other solution you included avoids this problem because it never creates such a large intermediate array: it relies on the expansion ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u·v. The shape of np.reshape(np.sum(X**2, axis=1), [num_test, 1]) is (500, 1) and the shape of np.sum(self.X_train**2, axis=1) is (5000,). When you add them you obtain an array of shape (500, 5000). np.matmul(X, self.X_train.T) also produces an array of the same shape.
The problem is in
all_test_subs_sq = (X[:, np.newaxis] - self.X_train)**2
X[:, np.newaxis] is equivalent to X[:, np.newaxis, :] and has shape (500, 1, 3072). After broadcasting, X[:, np.newaxis] - self.X_train yields a dense (500, 5000, 3072) array, which is a humongous 500 × 5000 × 3072 × 8 bytes ≈ 61.44 GB, since the dtype is np.float64.
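A quick back-of-the-envelope check of that figure, using the shapes from the question:
import numpy as np

num_test, num_train, dim = 500, 5000, 3072
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element
print(num_test * num_train * dim * itemsize / 1e9)  # ≈ 61.44 GB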

MinMaxScaler for different shapes in fit and inverse_transform

I am using MinMaxScaler from sklearn to scale the input to values between 0 and 1 and then process the data to obtain another vector. I use inverse_transform on the obtained vector to get back values in the original range. The shapes of the input to the fit_transform and the input to the inverse_transform are different. As an MWE I have provided the following code.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0, 1))
a = np.random.randint(0, 10, (10, 5))
b = sc.fit_transform(a) # b values are in [0, 1] range
c = np.random.rand(10, 30) # as an example, I have generated values between 0 and 1
d = sc.inverse_transform(c)
I run into the error
ValueError: operands could not be broadcast together with shapes (10,30) (5,) (10,30)
I understand it is because of the shape mismatch. But the shapes of inputs in my actual code are fixed and cannot be changed (and are also different from each other). How can I get this to work? Any help is appreciated.
MinMaxScaler learns a different transform for each of the 5 columns you give as input. I guess you actually want a single fixed transform, which you can achieve by treating the whole matrix as one collection of values. I would suggest the below (note that it standardizes with the global mean and standard deviation rather than min-max):
mean = a.mean()
std = a.std()
# Transform
b = (a - mean) / std
# Inverse transform
c = np.random.rand(10, 30)
d = c * std + mean
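A quick sanity check that this round trip recovers the input (my addition):
# b was computed as (a - mean) / std, so inverting must return a up to rounding
np.testing.assert_allclose(b * std + mean, a)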
This can be solved by writing your own inverse_transform function and using the attributes of the fitted scaler.
Let's say I have scaled 5 features in an array of shape (n, 5) and made a prediction for the feature at index 2. I can now use the function below to invert the MinMax scaling of only this feature:
def inverse_predictions(predictions, scaler, prediction_index=2):
    '''Use the fitted scaler to invert predictions; prediction_index
    should be set to the position of the target variable.'''
    max_val = scaler.data_max_[prediction_index]
    min_val = scaler.data_min_[prediction_index]
    original_values = (predictions * (max_val - min_val)) + min_val
    return original_values
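A hypothetical usage example (names and shapes are mine):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data = np.random.rand(100, 5) * 50  # 5 features on an arbitrary scale
scaled = scaler.fit_transform(data)

preds = scaled[:, 2]  # stand-in for model predictions of feature 2
restored = inverse_predictions(preds, scaler, prediction_index=2)
np.testing.assert_allclose(restored, data[:, 2])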

How do I apply FFT on a 3D Array

I have a 3D array with shape (features, timestep, samples). I would like to apply numpy's FFT along the timestep axis for each feature of each sample. I have the following, but I am uncertain whether this is the best way or whether I need a loop to iterate through each sample.
import numpy as np
x_train_fft = np.fft.fft(x_train, axis=0) #selected axis 0 as this is the axis of features
Looks like this was the way to do it:
X_transform_FFT = []
for i in range(x_train.shape[0]):
    f = abs(np.fft.fft(x_train[i, :, :], axis=1))
    X_transform_FFT.append(f)
X_transform_FFT = np.asarray(X_transform_FFT)  # stack the per-feature results
print(X_transform_FFT)
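For what it's worth, np.fft.fft transforms along a single axis and is already vectorized over the remaining ones, so the loop is likely unnecessary. A sketch with hypothetical shapes; the equivalence to the loop above, and the choice of axis, are my reading of the shapes involved:
import numpy as np

x_train = np.random.rand(4, 128, 10)  # hypothetical (features, timestep, samples)

# equivalent to the loop above: axis=1 of each (timestep, samples) slice
# corresponds to axis=2 of the full array
loop_equiv = np.abs(np.fft.fft(x_train, axis=2))

# if the goal is an FFT over the timestep axis, use axis=1 instead
over_timestep = np.abs(np.fft.fft(x_train, axis=1))
print(loop_equiv.shape, over_timestep.shape)  # (4, 128, 10) (4, 128, 10)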
