I have a dataset of shape (256, 180, 360). I reshaped it to 2D, removed the 0 values, and applied PCA using:
data = data.reshape(data.shape[0], data.shape[1] * data.shape[2]).T
data = data[~np.all(data == 0, axis = 1)]
# Dataset is now of shape (27719, 256)
data = StandardScaler().fit_transform(data)
pca = PCA()
transformed = pca.fit_transform(data)
Now, the next step is to reshape the transformed dataset back to 3D and plot the PCA results. I tried:
transformed.reshape(360, 180, 256)
which gives me the error "cannot reshape array of size 7096064 into shape (360,180,256)". I understand that I cannot get back to the original shape, because removing the all-zero rows changed the number of rows. Still, I have tried other variations of this, with and without the transpose, and cannot get the data back to 3D (not necessarily with the exact dimensions as before). Any recommendations?
You can't, at least not directly: the 27719 rows that remain cannot fill a 180 x 360 = 64800 pixel grid.
What you can do in this scenario is to not use fit_transform, and instead split fitting from transforming. Use fit to train on the dataset with all the zero entries removed, and then use transform on the original, unfiltered dataset to get your transformed data.
flat_data = data.reshape(data.shape[0], data.shape[1] * data.shape[2]).T
nonzero_data = flat_data[~np.all(flat_data == 0, axis = 1)]
scaler = StandardScaler()
pca = PCA()
pca.fit(scaler.fit_transform(nonzero_data))
# the result is (180 * 360, 256); transpose first so the component axis lines up with data.shape
transformed = pca.transform(scaler.transform(flat_data)).T.reshape(data.shape)
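An alternative sketch, assuming you would rather leave the removed pixels empty than project them: keep the boolean mask, run PCA on the nonzero rows only, and scatter the scores back into a NaN-filled array before reshaping. The small random cube and the names mask, scores, full and maps are illustrative assumptions, not part of the original code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 6, 10))   # small stand-in for the (256, 180, 360) cube
data[:, :2, :] = 0.0                 # pretend the first two rows of the grid hold no data

flat = data.reshape(data.shape[0], -1).T   # (60, 8): one row per pixel
mask = ~np.all(flat == 0, axis=1)          # True where a pixel has data

# Fit and transform only the pixels that carry data.
scores = PCA().fit_transform(StandardScaler().fit_transform(flat[mask]))

# Scatter the scores back to their pixel positions; removed pixels stay NaN.
full = np.full((flat.shape[0], scores.shape[1]), np.nan)
full[mask] = scores

# One 2D map per principal component.
maps = full.T.reshape(scores.shape[1], *data.shape[1:])
```

Each maps[k] can then be plotted as an image, and the NaN pixels typically render as gaps.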
I am trying to implement this paper. I have to try to interpolate the latent code of an autoencoder, as mentioned in the paper. The latent code is the encoded input of an autoencoder. The shape of the latent code (for two samples) is (2, 64, 64, 128).
This is what I have done:
image1 = sel_train_encodings[0]
image2 = sel_train_encodings[1]
x = image1[:,0,0]
x_new = image2[:,0,0]
new_array = interp1d(x, image1, axis=0, fill_value='extrapolate', kind='linear')(x_new)
I basically took the encodings of two images and interpolated over one of the axes, with extrapolation for some points because not all points lie in the same range. But the results I later obtain with these interpolated values are not very good. Am I doing something wrong, and how else could I do it?
According to one of the given answers, I also tried to do 2D interpolation in the following way:
image1 = sel_train_encodings[0]
image2 = sel_train_encodings[1]
new_array = griddata((x,z),y,(x_new, z_new), method='cubic', fill_value='extrapolate')
But this resulted in the error:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Scipy has a couple of 2D interpolation routines, depending on the spacing of the (x, y):
If your data is on a regular grid, try scipy.interpolate.RectBivariateSpline(). This is probably most applicable to images.
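If your data is collected on a grid with uneven rectangular intervals, you want to use scipy.interpolate.interp2d(). Note that interp2d has been deprecated and later removed in recent SciPy releases; scipy.interpolate.RegularGridInterpolator also accepts unevenly spaced rectilinear grids.
If all bets are off and the (x, y) are scattered without any clear grid, try scipy.interpolate.griddata().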
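As a rough illustration of the regular-grid case, here is a minimal sketch with RectBivariateSpline. The grid, the plane-shaped test surface and the variable names are made up for the example and do not come from the question's encodings.

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Coarse regular grid and a simple surface sampled on it.
x = np.linspace(0.0, 1.0, 6)
y = np.linspace(0.0, 2.0, 8)
z = np.add.outer(2.0 * x, 3.0 * y)   # z[i, j] = 2 * x[i] + 3 * y[j], shape (6, 8)

# Interpolating spline (cubic in both directions by default).
spline = RectBivariateSpline(x, y, z)

# Evaluate on a finer grid inside the original domain.
x_new = np.linspace(0.0, 1.0, 11)
y_new = np.linspace(0.0, 2.0, 15)
z_new = spline(x_new, y_new)         # shape (11, 15)
```

Note that z must have shape (len(x), len(y)), and both coordinate arrays must be strictly increasing.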
I am using MinMaxScaler from sklearn to scale the input to values between 0 and 1 and then process the data to obtain another vector. I use inverse_transform on the obtained vector to get back values in the original range. The shapes of the input to the fit_transform and the input to the inverse_transform are different. As an MWE I have provided the following code.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range=(0, 1))
a = np.random.randint(0, 10, (10, 5))
b = sc.fit_transform(a) # b values are in [0, 1] range
c = np.random.rand(10, 30) # as an example, I have generated values between 0 and 1
d = sc.inverse_transform(c)
I run into the error
ValueError: operands could not be broadcast together with shapes (10,30) (5,) (10,30)
I understand it is because of the shape mismatch. But the shapes of inputs in my actual code are fixed and cannot be changed (and are also different from each other). How can I get this to work? Any help is appreciated.
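What you're using learns a different transform for each of the 5 columns you give as the input. I guess you probably want a single fixed transform, which you can get by computing the statistics over the whole matrix at once. Note that the snippet below standardizes with the global mean and standard deviation rather than rescaling to [0, 1]. I would suggest the below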
mean = a.mean()
std = a.std()
# Transform
b = (a - mean) / std
# Inverse transform
c = np.random.rand(10, 30)
d = c * std + mean
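If the goal is to stay with min-max scaling rather than standardization, the same idea can be sketched with a single global minimum and maximum. This is a variation on the answer above, not its original code; lo and hi are names chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=(10, 5)).astype(float)

# One global range for the whole matrix instead of one per column.
lo, hi = a.min(), a.max()

# Transform: every value lands in [0, 1].
b = (a - lo) / (hi - lo)

# Inverse transform works for any shape because lo and hi are scalars.
c = rng.random((10, 30))
d = c * (hi - lo) + lo
```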
This can be solved by writing your own inverse_transform function and using the attributes of the fitted scaler.
Let's say I have scaled 5 features in an array of shape (n, 5) and I have made a prediction for the feature at the second index. I can now use the function below to invert the MinMaxScaler scaling of only this feature:
def inverse_predictions(predictions, scaler, prediction_index=2):
    '''Use the fitted scaler to invert scaled predictions for one feature.
    prediction_index is the column position of the target variable.
    Assumes the scaler was fitted with the default feature_range=(0, 1).'''
    max_val = scaler.data_max_[prediction_index]
    min_val = scaler.data_min_[prediction_index]
    original_values = predictions * (max_val - min_val) + min_val
    return original_values
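A slightly more general sketch of the same idea, which also accounts for a non-default feature_range, could look like this. The helper name inverse_column and the random test data are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def inverse_column(values, scaler, column):
    """Undo min-max scaling for one column using the fitted scaler's attributes."""
    lo, hi = scaler.feature_range
    data_min = scaler.data_min_[column]
    data_range = scaler.data_range_[column]
    # Map from [lo, hi] back to [0, 1], then to the original data range.
    return (np.asarray(values) - lo) / (hi - lo) * data_range + data_min

rng = np.random.default_rng(0)
features = rng.uniform(-5, 5, size=(20, 5))
scaler = MinMaxScaler(feature_range=(0, 1)).fit(features)
scaled = scaler.transform(features)

# Invert only the column at index 2, as if it were a 1D prediction vector.
restored = inverse_column(scaled[:, 2], scaler, column=2)
```

The result should agree with the matching column of scaler.inverse_transform(scaled), without needing an input of the full (n, 5) shape.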
I have a task to create a 30x40 feature matrix with random integers between 1 & 100:
import numpy as np
matrix= np.random.randint(1,100,size=(30,40))
Next I need to rescale the elements in the matrix to be between the range 5-10:
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
scaler.fit (5,10)
matrix1 = scaler.fit_transform(matrix)
Which gives me this error:
ValueError: Expected 2D array, got scalar array instead:
array=5.0.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample
I've tried reshaping the data:
matrix.reshape(-1,1)
but I get the same error.
I think you need to define the feature range when you create an instance of MinMaxScaler like this:
scaler = preprocessing.MinMaxScaler(feature_range=(5, 10))
And then you could fit and transform the data like this:
matrix1 = scaler.fit_transform(matrix)
The last line is a short form for:
scaler.fit(matrix)
matrix1 = scaler.transform(matrix)
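A quick way to check the result is sketched below. Keep in mind that MinMaxScaler scales each column separately, so every column spans 5 to 10 on its own; the reshape variant for one global range is an addition for comparison, not part of the answer above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
matrix = rng.integers(1, 101, size=(30, 40))   # integers from 1 to 100 inclusive

# Per-column scaling: each of the 40 columns is mapped to [5, 10] independently.
scaler = MinMaxScaler(feature_range=(5, 10))
matrix1 = scaler.fit_transform(matrix)
col_min = matrix1.min(axis=0)
col_max = matrix1.max(axis=0)

# Global scaling: treat all 1200 values as a single feature, then restore the shape.
global_scaler = MinMaxScaler(feature_range=(5, 10))
matrix_global = global_scaler.fit_transform(matrix.reshape(-1, 1)).reshape(matrix.shape)
```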
I have a numpy array (let's say 100x64x64).
My goal is to scale each 64x64 layer independently and store a scaler for later use.
This is how it can be achieved with a for-loop solution:
scalers_dict = {}
for i in range(X.shape[0]):
    scalers_dict[i] = MinMaxScaler()
    # fit the scaler on layer i and scale that layer in place
    X[i, :, :] = scalers_dict[i].fit_transform(X[i, :, :])
# save the dict of scalers
joblib.dump(value=scalers_dict, filename="dict_of_scalers.scaler")
My real array is much bigger, and it takes quite a while to iterate through it.
Do you have in mind some more vectorized solution for that, or for-loop is the only way?
If I understand correctly how MinMaxScaler works, it computes the minimum and maximum of each column, i.e. it reduces along axis=0, so every column is scaled independently.
To make this useful for your case, you'd need to transform X into a (64 * 64, 100) array:
s = X.shape
X = np.moveaxis(X, 0, -1).reshape(-1, s[0])
Alternatively, you can write
X = X.reshape(s[0], -1).T
Now you can do the scaling with
M = MinMaxScaler()
X = M.fit_transform(X)
Since the fit reduces over the first dimension, the fitted minima and maxima each have 100 entries, one per layer. They broadcast cleanly against X because its last dimension now has the same size.
To get the original shape back, invert the original transformation:
X = X.T.reshape(s)
When you are done, M will be a scaler calibrated for 100 features. There is no need for a dictionary here. Remember that a dictionary keyed by a sequence of integers can better be expressed as a list or array, which is what happens here.
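Put together, the round trip might look like the following sketch, which uses a smaller random array as a stand-in for the real data. The names flat, scaled and restored are chosen for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8, 8))   # stand-in for the (100, 64, 64) array
s = X.shape

# One column per layer, so the scaler treats each layer as one feature.
flat = X.reshape(s[0], -1).T       # (64, 100)
M = MinMaxScaler()
scaled = M.fit_transform(flat).T.reshape(s)

# The same scaler object undoes the scaling later.
restored = M.inverse_transform(scaled.reshape(s[0], -1).T).T.reshape(s)
```

One difference from the loop version: here each layer gets a single minimum and maximum, whereas calling fit_transform on a 2D layer scales every column of that layer separately.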
IIUC, you can manually scale:
import joblib
# per-layer minimum and maximum, each of shape (n_layers,)
mm, MM = inputs.min(axis=(1,2)), inputs.max(axis=(1,2))
# save these for later use
joblib.dump((mm,MM), 'minmax.joblib')
def scale(inputs, mm, MM):
    # broadcast the per-layer statistics over the last two axes
    return (inputs - mm[:,None,None])/(MM-mm)[:,None,None]
# load pre-saved min & max
mm, MM = joblib.load('minmax.joblib')
# scaled inputs
scale(inputs, mm, MM)
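The stored minima and maxima are also enough to undo the scaling. A minimal sketch follows, where unscale is a hypothetical helper that mirrors scale and the random inputs array is only a placeholder.

```python
import numpy as np

def scale(inputs, mm, MM):
    # Map each layer to [0, 1] using its own minimum and maximum.
    return (inputs - mm[:, None, None]) / (MM - mm)[:, None, None]

def unscale(scaled, mm, MM):
    # Inverse of scale: stretch back to the original per-layer range.
    return scaled * (MM - mm)[:, None, None] + mm[:, None, None]

rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 4, 4))
mm, MM = inputs.min(axis=(1, 2)), inputs.max(axis=(1, 2))

scaled = scale(inputs, mm, MM)
restored = unscale(scaled, mm, MM)
```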
I have a python 4D array with a general structure of:
tdata = [sample, x, y, channel]
with overarching dimensions of [10000, 5, 5, 12]
and I would like to apply either a MinMaxScaler or a StandardScaler to the data. The problem is that both scalers only accept 2D data. If I wanted to scale across each [x,y] 2D channel for every sample, is there an efficient way of doing this, as opposed to trying:
for i in range(0,len(sample)):
    for j in range(0,len(channel)):
        transformed_tdata[i,:,:,j] = scaler.fit(tdata[i,:,:,j])
But then wouldn't each sample be independently scaled for each channel?
You're on the right track. If you want a scaler for each channel, you can reshape each channel of the data to shape (10000, 5*5). Each 5x5 tile becomes a length-25 row vector, which the scaler accepts; note that it then scales each of the 25 positions as its own feature. You'll have to transform your evaluation data in the same way, using the scalers stored in channel_scalers.
import numpy as np
from sklearn.preprocessing import MinMaxScaler
n_channels = 12
tdata = np.random.rand(10000, 5, 5, n_channels)
tdata_transformed = np.zeros_like(tdata)
channel_scalers = []
for i in range(n_channels):
    mmx = MinMaxScaler()
    slc = tdata[:, :, :, i].reshape(10000, 5*5) # make it a bunch of row vectors
    transformed = mmx.fit_transform(slc)
    transformed = transformed.reshape(10000, 5, 5) # reshape it back to tiles
    tdata_transformed[:, :, :, i] = transformed # put it in the transformed array
    channel_scalers.append(mmx) # store the transform
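Applying the stored scalers to evaluation data could be sketched as follows. The helper apply_channel_scalers, the smaller array sizes and the evald array are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_channels = 3
train = rng.random((50, 5, 5, n_channels))
evald = rng.random((10, 5, 5, n_channels))

# Fit one scaler per channel on the training data only.
channel_scalers = []
for i in range(n_channels):
    flat = train[..., i].reshape(len(train), -1)   # (50, 25)
    channel_scalers.append(MinMaxScaler().fit(flat))

def apply_channel_scalers(batch, scalers):
    """Transform each channel of batch with its already fitted scaler."""
    out = np.empty_like(batch, dtype=float)
    for i, scaler in enumerate(scalers):
        flat = batch[..., i].reshape(len(batch), -1)
        out[..., i] = scaler.transform(flat).reshape(batch.shape[:-1])
    return out

train_scaled = apply_channel_scalers(train, channel_scalers)
eval_scaled = apply_channel_scalers(evald, channel_scalers)
```

Because the scalers are fitted on the training set only, evaluation values can fall slightly outside [0, 1].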