MinMaxScaler for different shapes in fit and inverse_transform - python

I am using MinMaxScaler from sklearn to scale the input to values between 0 and 1 and then process the data to obtain another vector. I use inverse_transform on the obtained vector to get back values in the original range. The shapes of the input to the fit_transform and the input to the inverse_transform are different. As an MWE I have provided the following code.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0, 1))
a = np.random.randint(0, 10, (10, 5))
b = sc.fit_transform(a) # b values are in [0, 1] range
c = np.random.rand(10, 30) # as an example, I have generated values between 0 and 1
d = sc.inverse_transform(c)
I run into the error
ValueError: operands could not be broadcast together with shapes (10,30) (5,) (10,30)
I understand it is because of the shape mismatch. But the shapes of inputs in my actual code are fixed and cannot be changed (and are also different from each other). How can I get this to work? Any help is appreciated.

What you're using learns a different transform for each of the 5 columns you pass as input, which is why the fitted parameters have shape (5,). You probably want a single, fixed transform learned from all of the values, which you can achieve by treating the matrix as one flat collection of numbers. I would suggest the following:
mean = a.mean()
std = a.std()
# Transform: a single mean and std computed over every element of a
b = (a - mean) / std
# Inverse transform: works for an array of any shape, since mean and std are scalars
c = np.random.rand(10, 30)
d = c * std + mean
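If you specifically want to keep the [0, 1] min-max behaviour from the question rather than switch to standardization, the same idea works with the global minimum and maximum of a (a minimal sketch; the variable names are just illustrative):
a_min, a_max = a.min(), a.max()
# Transform: maps the smallest element of a to 0 and the largest to 1
b = (a - a_min) / (a_max - a_min)
# Inverse transform for an array of any shape with values in [0, 1]
c = np.random.rand(10, 30)
d = c * (a_max - a_min) + a_min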

This can be solved by writing your own inverse_transform function and using the attributes of the fitted scaler.
Let's say I have scaled 5 features in an array of shape (n, 5) and I have made a prediction for the feature at index 2. I can now use the function below to invert the min-max scaling of only this feature:
def inverse_predictions(predictions, scaler, prediction_index=2):
    '''Uses the fitted scaler to invert predictions;
    prediction_index should be set to the position of the target variable.'''
    max_val = scaler.data_max_[prediction_index]
    min_val = scaler.data_min_[prediction_index]
    original_values = (predictions * (max_val - min_val)) + min_val
    return original_values
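A quick usage sketch, assuming a scaler fitted on an (n, 5) array and a vector of scaled predictions for column index 2 (the data here is made up for illustration):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(100, 5) * 50           # hypothetical training data, shape (n, 5)
scaler = MinMaxScaler().fit(X)
scaled_preds = np.random.rand(10)         # hypothetical model output in [0, 1]
preds = inverse_predictions(scaled_preds, scaler, prediction_index=2)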

Related

How to do argmax in group in pytorch?

Is there any way to implement max pooling according to the norm of sub-vectors in a group in PyTorch? Specifically, this is what I want to implement:
Input:
x: a 2-D float tensor, shape #Nodes * dim
cluster: a 1-D long tensor, shape #Nodes
Output:
y, a 2-D float tensor, and:
y[i]=x[k] where k=argmax_{cluster[k]=i}(torch.norm(x[k],p=2)).
I tried torch.scatter with reduce="max", but this only works for dim=1 and x[i]>0.
Can someone help me to solve the problem?
I don't think there's any built-in function to do what you want. Basically this would be some form of scatter_reduce on the norm of x, but instead of selecting the max norm you want to select the row corresponding to the max norm.
A straightforward implementation may look something like this
"""
input
x: float tensor of size [NODES, DIMS]
cluster: long tensor of size [NODES]
output
float tensor of size [cluster.max()+1, DIMS]
"""
num_clusters = cluster.max().item() + 1
y = torch.zeros((num_clusters, DIMS), dtype=x.dtype, device=x.device)
for cluster_id in torch.unique(cluster):
x_cluster = x[cluster == cluster_id]
y[cluster_id] = x_cluster[torch.argmax(torch.norm(x_cluster, dim=1), dim=0)]
This should work just fine if cluster.max() is relatively small. If there are many clusters, though, this approach unnecessarily builds a mask over cluster for every unique cluster id. To avoid that you can make use of argsort. The best I could come up with in pure Python was the following.
num_clusters = cluster.max().item() + 1
x_norm = torch.norm(x, dim=1)
cluster_sortidx = torch.argsort(cluster)
cluster_ids, cluster_counts = torch.unique_consecutive(cluster[cluster_sortidx], return_counts=True)
end_indices = torch.cumsum(cluster_counts, dim=0).cpu().tolist()
start_indices = [0] + end_indices[:-1]
y = torch.zeros((num_clusters, DIMS), dtype=x.dtype, device=x.device)
for cluster_id, a, b in zip(cluster_ids, start_indices, end_indices):
    indices = cluster_sortidx[a:b]
    y[cluster_id] = x[indices[torch.argmax(x_norm[indices], dim=0)]]
For example, in random tests with NODES = 60000, DIMS = 512 and cluster.max() = 6000, the first version takes about 620 ms while the second version takes about 78 ms.
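For completeness, a self-contained sketch that wraps the second approach in a function and runs it on a small random example (the sizes are arbitrary):
import torch

def group_argmax_by_norm(x, cluster):
    # For each cluster id, select the row of x with the largest L2 norm.
    num_clusters = cluster.max().item() + 1
    dims = x.shape[1]
    x_norm = torch.norm(x, dim=1)
    sortidx = torch.argsort(cluster)
    ids, counts = torch.unique_consecutive(cluster[sortidx], return_counts=True)
    ends = torch.cumsum(counts, dim=0).cpu().tolist()
    starts = [0] + ends[:-1]
    y = torch.zeros((num_clusters, dims), dtype=x.dtype, device=x.device)
    for cid, a, b in zip(ids, starts, ends):
        idx = sortidx[a:b]
        y[cid] = x[idx[torch.argmax(x_norm[idx], dim=0)]]
    return y

x = torch.randn(1000, 16)
cluster = torch.randint(0, 50, (1000,))
y = group_argmax_by_norm(x, cluster)
print(y.shape)  # torch.Size([50, 16])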

How can I reverse .reshape() and get back to a 3D array?

I have a dataset of shape (256, 180, 360). I reshaped it to 2D, removed the 0 values, and applied PCA using:
data = data.reshape(data.shape[0], data.shape[1] * data.shape[2]).T
data = data[~np.all(data == 0, axis = 1)]
# Dataset is now of shape (27719, 256)
data = StandardScaler().fit_transform(data)
pca = PCA()
transformed = pca.fit_transform(data)
Now, the next step is to reshape the transformed dataset back to 3D and plot the PCA results. I tried:
transformed.reshape(360, 180, 256)
which gives me the error "cannot reshape array of size 7096064 into shape (360,180,256)". I understand I cannot get back to the original shape because removing the 0 values changed the number of rows, but I have tried other variations of this, including transposing first, and I cannot get it back to 3D (not necessarily the exact dimensions as before). Any recommendations?
You can't.
What you can do in this scenario is to not use fit_transform, and instead split the two steps: fit the scaler and the PCA on the dataset with all the zero rows removed, then call transform on the full flattened dataset so that every pixel gets a value and the result can be reshaped back to 3D.
flat_data = data.reshape(data.shape[0], data.shape[1] * data.shape[2]).T
nonzero_data = flat_data[~np.all(flat_data == 0, axis = 1)]
scaler = StandardScaler()
pca = PCA()
pca.fit(scaler.fit_transform(nonzero_data))
# transpose so the component axis comes first, then reshape back to (256, 180, 360)
transformed = pca.transform(scaler.transform(flat_data)).T.reshape(data.shape)
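Each slice of the reshaped result is then one principal-component map on the original 180 x 360 grid (assuming the first axis of data indexes features and the last two are spatial), so it can be plotted directly, for example:
import matplotlib.pyplot as plt

# transformed[k] is the spatial map of principal component k
plt.imshow(transformed[0], origin="lower")
plt.colorbar(label="PC1 score")
plt.show()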

vectorized "by-layer" scaling of numpy array

I have a numpy array (let's say 100x64x64).
My goal is to scale each 64x64 layer independently and store a scaler for later use.
This is how it can be achieved with a for-loop solution:
scalers_dict = {}
for i in range(X.shape[0]):
    scalers_dict[i] = MinMaxScaler()
    # fitting the scaler
    X[i, :, :] = scalers_dict[i].fit_transform(X[i, :, :])
# saving dict of scalers
joblib.dump(value=scalers_dict, filename="dict_of_scalers.scaler")
My real array is much bigger, and it takes quite a while to iterate through it.
Do you have in mind some more vectorized solution for that, or is a for-loop the only way?
If I understand correctly how MinMaxScaler works, it treats each column as an independent feature, computing its statistics by reducing along axis=0.
To make this useful for your case, you'd need to transform X into a (64 * 64, 100) array:
s = X.shape
X = np.moveaxis(X, 0, -1).reshape(-1, s[0])
Alternatively, you can write
X = X.reshape(s[0], -1).T
Now you can do the scaling with
M = MinMaxScaler()
X = M.fit_transform(X)
Since the fit reduces over the first dimension, the fitted minimum and maximum each have 100 entries, one per layer. Broadcasting works out because the last dimension of the reshaped array is also of size 100.
To get the original shape back, invert the original transformation:
X = X.T.reshape(s)
When you are done, M will be a scaler calibrated for 100 features. There is no need for a dictionary here: a dictionary keyed by consecutive integers is better expressed as a list or array, which is effectively what the fitted scaler stores internally.
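Since the original goal was to save the scaler for later use, here is a sketch of how the single fitted scaler could be stored and reused to undo the scaling (the file name and X_scaled are just illustrative; the reshape convention is the same as above):
import joblib

joblib.dump(M, "layer_scaler.joblib")

# later: undo the scaling on an array of the same (100, 64, 64) shape
M = joblib.load("layer_scaler.joblib")
s = X_scaled.shape
X_orig = M.inverse_transform(X_scaled.reshape(s[0], -1).T).T.reshape(s)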
IIUC, you can manually scale:
mm, MM = inputs.min(axis=(1,2)), inputs.max(axis=(1,2))
# save these for later use
joblib.dump((mm,MM), 'minmax.joblib')
def scale(inputs, mm, MM):
    return (inputs - mm[:, None, None]) / (MM - mm)[:, None, None]
# load pre-saved min & max
mm, MM = joblib.load('minmax.joblib')
# scaled inputs
scale(inputs, mm, MM)
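A matching unscale function to go back to the original range from the saved minima and maxima (a sketch under the same assumptions):
def unscale(scaled, mm, MM):
    # invert the per-layer min-max scaling
    return scaled * (MM - mm)[:, None, None] + mm[:, None, None]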

How to use `apply_along_axis` with ndim > 2 arrays?

I am trying to apply Gaussian filtering to the toy digits dataset images. The dataset stores the images in a (1797, 8, 8) array. Individually, I can make it work, but when I try to apply it to the whole image set with apply_along_axis, something goes wrong.
Here is the core example:
from sklearn.datasets import load_digits
from scipy.ndimage.filters import gaussian_filter
images = load_digits().images
# Filter individually
individual = gaussian_filter(images[0], sigma=1, order=0)
# Use `apply_along_axis`
transformed = np.apply_along_axis(
    func1d=lambda x: gaussian_filter(x, sigma=1, order=0),
    axis=2,
    arr=images
)
# They produce different arrays
(transformed[0] != individual).all()
Out: True
I tried changing the axis, but that did not help. I also checked by simply returning the image or its squared values; in those cases the results are equivalent. Applying a dot product, however, again produces different results.
# Squared values
transformed = np.apply_along_axis(
    func1d=lambda x: x ** 2,
    axis=2,
    arr=images
)
# They produce the same arrays
(transformed[0] == images[0] ** 2).all()
Out: True
# Dot product
transformed = np.apply_along_axis(
    func1d=lambda x: np.dot(x, x),
    axis=2,
    arr=images
)
individual = np.dot(images[0], images[0])
# They produce different arrays
(transformed[0] != individual).all()
Out: True
I am sure I misunderstand the way these functions work. What am I doing wrong?
Update: As #hpaulj pointed out in the comments, the func1d parameter in apply_along_axis only ever receives 1-D slices. So the Gaussian filter (and the dot product) is applied row by row along axis 2 rather than to each 8x8 image as a whole, which is why the results differ; purely element-wise operations like squaring are unaffected. See...
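One way to filter every image at once without apply_along_axis is to pass a per-axis sigma directly to gaussian_filter, so that no smoothing happens across the image index (a sketch relying on gaussian_filter accepting a sequence of sigmas, one per axis):
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.datasets import load_digits

images = load_digits().images                      # shape (1797, 8, 8)
# sigma=0 on axis 0: each 8x8 image is filtered independently
transformed = gaussian_filter(images, sigma=(0, 1, 1), order=0)
individual = gaussian_filter(images[0], sigma=1, order=0)
print(np.allclose(transformed[0], individual))     # True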

how to get original data from normalized array

I have a simple piece of code, given below, which normalizes an array row-wise.
import numpy as np
from sklearn import preprocessing
X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=np.float64)
X_normalized = preprocessing.normalize(X, norm='l2')
Can you please help me convert X_normalized back to X?
You cannot recover X from nothing more than the normalized version. Consider the trivial case of several data sets, each with 2 different elements (using mean/std standardization for illustration):
[3, 4]
[-18, 20]
[0, 0.0001]
Each of these standardizes to the same data set:
[-1, 1]
The same argument applies to the l2 normalization in the question: [3, 4] and [6, 8] both normalize to [0.6, 0.8]. The mapping is many-to-one, not a bijection, so it is not uniquely invertible.
However, you can recover the original set with a couple of simple techniques:
Keep the original data set intact (yes, it's that easy).
Store the normalization parameters: for preprocessing.normalize with norm='l2', the per-row norms; for standardization, the mean and standard deviation (or its square, the variance). These give you the linear equation that transforms each original element into a normalized element, and that equation is trivial to invert, as sketched below.
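A minimal sketch of the second option for the l2 case in the question: keep the per-row norms before normalizing, then multiply them back in afterwards.
import numpy as np
from sklearn import preprocessing

X = np.asarray([[-1, 2, 1],
                [4, 1, 2]], dtype=np.float64)

norms = np.linalg.norm(X, axis=1, keepdims=True)   # the parameters to keep
X_normalized = preprocessing.normalize(X, norm='l2')
X_recovered = X_normalized * norms                 # back to the original values
print(np.allclose(X, X_recovered))                 # True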
All the scalers in https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing have an inverse_transform method designed for exactly that.
For example, to scale and un-scale your DataFrame with MinMaxScaler you could do:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled = scaler.fit_transform(df)
unscaled = scaler.inverse_transform(scaled)
Just bear in mind that transform (and fit_transform as well) returns a numpy.ndarray, not a pandas.DataFrame.
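If you want a DataFrame back, you can wrap the result yourself (a small sketch assuming df is the original DataFrame):
import pandas as pd

unscaled_df = pd.DataFrame(unscaled, columns=df.columns, index=df.index)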
[Reference][1]
[1]: https://stackoverflow.com/questions/43382716/how-can-i-cleanly-normalize-data-and-then-unnormalize-it-later/43383700
