Python scaling with 4D data

I have a python 4D array with a general structure of:
tdata = [sample, x, y, channel]
with overarching dimensions of [10000, 5, 5, 12]
and I would like to apply either a MinMaxScaler or a StandardScaler to the data. The problem is that both scalers only accept 2D data. If I wanted to scale each [x, y] 2D slice per channel for every sample, is there an efficient way of doing this as opposed to trying:
for i in range(0, len(sample)):
    for j in range(0, len(channel)):
        transformed_tdata[i, :, :, j] = scaler.fit_transform(tdata[i, :, :, j])
But then wouldn't each sample be independently scaled for each channel?

You're on the right track. If you want a scaler for each channel, you can reshape each channel of the data to shape (10000, 5*5). Each channel (previously 5x5 per sample) becomes a length-25 row vector, which the scalers accept. You'll have to transform your evaluation data in the same way with the scalers stored in channel_scalers (see the sketch after the code).
import numpy as np
from sklearn.preprocessing import MinMaxScaler
n_channels = 12
tdata = np.random.rand(10000, 5, 5, n_channels)
tdata_transformed = np.zeros_like(tdata)
channel_scalers = []
for i in range(n_channels):
    mmx = MinMaxScaler()
    slc = tdata[:, :, :, i].reshape(10000, 5 * 5)  # make it a bunch of row vectors
    transformed = mmx.fit_transform(slc)
    transformed = transformed.reshape(10000, 5, 5)  # reshape it back to tiles
    tdata_transformed[:, :, :, i] = transformed  # put it in the transformed array
    channel_scalers.append(mmx)  # store the fitted scaler
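For completeness, a minimal sketch of transforming held-out data with the stored scalers; edata is a hypothetical evaluation array with the same layout (n_eval, 5, 5, 12):
n_eval = len(edata)
edata_transformed = np.zeros_like(edata)
for i, mmx in enumerate(channel_scalers):
    slc = edata[:, :, :, i].reshape(n_eval, 5 * 5)  # same flattening as above
    # transform only -- do not re-fit on evaluation data
    edata_transformed[:, :, :, i] = mmx.transform(slc).reshape(n_eval, 5, 5)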

Related

How can I reverse .reshape() and get back to a 3D array?

I have a dataset of shape (256, 180, 360). I reshaped it to 2D, removed the 0 values, and applied PCA using:
data = data.reshape(data.shape[0], data.shape[1] * data.shape[2]).T
data = data[~np.all(data == 0, axis = 1)]
# Dataset is now of shape (27719, 256)
data = StandardScaler().fit_transform(data)
pca = PCA()
transformed = pca.fit_transform(data)
Now, the next step is to reshape the transformed dataset back to 3D and plot the PCA results. I tried:
transformed.reshape(360, 180, 256)
which gives me the error "cannot reshape array of size 7096064 into shape (360,180,256)". I understand I cannot get back to the original shape because removing the 0 values changed it, but I have tried other variations of this, including transposing, and I cannot get it back to 3D (not necessarily the exact dimensions as before). Any recommendations?
You can't.
What you can do in this scenario is not use fit_transform, and instead separate fitting from transforming: fit the scaler and the PCA on the dataset with the zero rows removed, then call transform on the full flattened dataset to get your transformed data.
flat_data = data.reshape(data.shape[0], data.shape[1] * data.shape[2]).T
nonzero_data = flat_data[~np.all(flat_data == 0, axis = 1)]
scaler = StandardScaler()
pca = PCA()
pca.fit(scaler.fit_transform(nonzero_data))
transformed = pca.transform(scaler.transform(flat_data)).T.reshape(data.shape)  # transpose back before reshaping, since flat_data was built with .T
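Each row of the reshaped array is now one principal-component map, so plotting is straightforward. A minimal sketch, assuming matplotlib is available:
import matplotlib.pyplot as plt

plt.imshow(transformed[0])  # first principal component as a (180, 360) map
plt.colorbar()
plt.show()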

How to shuffle two numpy arrays, so that record indices stay aligned in both after shuffling?

I have a 4D array and a 1D array:
import numpy as np
data = np.random.randn(10, 1, 5, 5) # num_records, depth, height, width
labels = np.array([1,1,1,1,1,0,0,0,0,0])
I want to shuffle the data and labels along num_records to get the labels in a random order.
I know that one could use the shuffle function: np.random.shuffle(data). But I don't know how to preserve the correspondence between data and labels after shuffling.
This shuffles both arrays together:
import numpy as np
data = np.random.randn(10, 1, 5, 5) # num_records, depth, height, width
labels = np.array([1,1,1,1,1,0,0,0,0,0])
# generate a shuffled permutation of indices
idx = np.random.permutation(len(labels))
# index both arrays with the same permutation so they stay aligned
data, labels = data[idx], labels[idx]
Alternatively, scikit-learn's sklearn.utils.shuffle shuffles both arrays in unison in one call:
import sklearn.utils as sku
data_shuffled, labels_shuffled = sku.shuffle(data, labels)
https://www.kite.com/python/answers/how-to-shuffle-two-numpy-arrays-in-unision-in-python
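If reproducibility matters, a small sketch using NumPy's Generator API (the seed value here is arbitrary):
rng = np.random.default_rng(seed=0)
idx = rng.permutation(len(labels))  # one permutation reused for both arrays
data, labels = data[idx], labels[idx]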

How do I apply FFT on a 3D Array

I have a 3D array that has the shape (features, timestep, samples). I would like to apply the numpy fft function on each feature for the length of timestep for each sample. I have this, but I am uncertain whether this is the best way or whether there needs to be a loop to iterate through each sample.
import numpy as np
x_train_fft = np.fft.fft(x_train, axis=0) #selected axis 0 as this is the axis of features
Looks like this was the way to do it:
X_transform_FFT = []
for i in range(x_train.shape[0]):
    f = abs(np.fft.fft(x_train[i, :, :], axis=1))
    X_transform_FFT.append(f)
X_transform_FFT = np.asarray(X_transform_FFT)  # stack the per-feature results back into a 3D array
print(X_transform_FFT)
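For what it's worth, the loop above should be equivalent to a single vectorized call; under the stated (features, timestep, samples) layout, axis=1 of each per-feature slice corresponds to axis=2 of the full array, so if the FFT is meant to run along timestep, axis=1 is the one to use:
X_transform_FFT = np.abs(np.fft.fft(x_train, axis=2))  # same result as the loop
# or, along the timestep axis instead:
# X_transform_FFT = np.abs(np.fft.fft(x_train, axis=1))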

Is it possible to use vector methods to shift images stored in a numpy ndarray for data augmentation?

Background: This is one of the exercise problems in the textbook Hands-On Machine Learning by Aurélien Géron.
The question is: Write a function that can shift an MNIST image in any direction (left, right, up, down) by one pixel. Then for each image in the training set, create four shifted copies (one per direction) and add them to the training set.
My thought process:
I have a numpy array of size (59500, 784) in X_train (each row is a (28, 28) image). For each row of X_train:
1. Reshape the row to (28, 28)
2. For each direction (up, down, left, right): shift the image, reshape it back to (784,), and write it to an empty array
3. Append the new array to X_train
My code:
import numpy as np
from scipy.ndimage import shift
def shift_and_append(X, n):
    x_arr = np.zeros((1, 784))
    for i in range(n):
        for j in range(-1, 2):
            for k in range(-1, 2):
                if j != k and j != -k:  # keeps exactly the four one-pixel shifts
                    x_arr = np.append(x_arr, shift(X[i, :].reshape(28, 28), [j, k]).reshape(1, 784), axis=0)
    return np.append(X, x_arr[1:, :], axis=0)
X_train_new = shift_and_append(X_train, X_train.shape[0])
y_train_new = np.append(y_train, np.repeat(y_train, 4), axis=0)
It takes a long time to run, and I feel I am brute-forcing it. Is there an efficient vectorized method to achieve this?
Three nested for loops with an if condition, plus reshaping and appending inside them, are clearly not a good idea; numpy.roll does the job beautifully in a vectorized way:
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train.shape
# (60000, 28, 28)
# plot an original image
plt.gray()
plt.matshow(x_train[0])
plt.show()
Let's first demonstrate the operations:
# one pixel down:
x_down = np.roll(x_train[0], 1, axis=0)
plt.gray()
plt.matshow(x_down)
plt.show()
# one pixel up:
x_up = np.roll(x_train[0], -1, axis=0)
plt.gray()
plt.matshow(x_up)
plt.show()
# one pixel left:
x_left = np.roll(x_train[0], -1, axis=1)
plt.gray()
plt.matshow(x_left)
plt.show()
# one pixel right:
x_right = np.roll(x_train[0], 1, axis=1)
plt.gray()
plt.matshow(x_right)
plt.show()
Having established that, we can generate, say, "right" versions of all the training images simply by
x_all_right = [np.roll(x, 1, axis=1) for x in x_train]
and similarly for the other 3 directions.
Let's confirm that the first image in x_all_right is indeed what we want:
plt.gray()
plt.matshow(x_all_right[0])
plt.show()
You can even avoid the last list comprehension in favor of pure NumPy code:
x_all_right = np.roll(x_train, 1, axis=2)
which is more efficient, although slightly less intuitive (just take the respective single-image command versions and increase the axis by 1). One caveat: np.roll wraps pixels around to the opposite edge, whereas scipy's shift fills with a constant; for MNIST digits with empty borders this rarely matters.
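To finish the exercise, a short sketch assembling the four shifted copies and the matching labels with the same np.roll approach (variable names follow the Keras loading above):
# all four one-pixel shifts, computed on the whole (60000, 28, 28) array at once
x_shifted = [np.roll(x_train, s, axis=a) for s, a in ((1, 1), (-1, 1), (1, 2), (-1, 2))]
x_train_aug = np.concatenate([x_train] + x_shifted, axis=0)  # (300000, 28, 28)
y_train_aug = np.concatenate([y_train] * 5, axis=0)  # labels tiled to match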

vectorized "by-layer" scaling of numpy array

I have a numpy array (let's say 100x64x64).
My goal is to scale each 64x64 layer independently and store a scaler for later use.
This is how it can be achieved with a for-loop solution:
import joblib
from sklearn.preprocessing import MinMaxScaler

scalers_dict = {}
for i in range(X.shape[0]):
    scalers_dict[i] = MinMaxScaler()
    # fit the scaler and transform this layer in place
    X[i, :, :] = scalers_dict[i].fit_transform(X[i, :, :])
# save the dict of scalers
joblib.dump(value=scalers_dict, filename="dict_of_scalers.scaler")
My real array is much bigger, and it takes quite a while to iterate through it.
Do you have in mind some more vectorized solution for that, or for-loop is the only way?
If I understand correctly how MinMaxScaler works, it scales each column (feature) independently, computing the statistics along axis=0.
To make this useful for your case, you'd need to transform X into a (64 * 64, 100) array:
s = X.shape
X = np.moveaxis(X, 0, -1).reshape(-1, s[0])
Alternatively, you can write
X = X.reshape(s[0], -1).T
Now you can do the scaling with
M = MinMaxScaler()
X = M.fit_transform(X)
Since the fit statistics are computed along the first dimension, the results (per-column minima and maxima) will each be of size 100. These broadcast perfectly now that the last dimension is of the same size.
To get the original shape back, invert the original transformation:
X = X.T.reshape(s)
When you are done, M will be a scaler calibrated for 100 features. There is no need for a dictionary here: a dictionary keyed by a sequence of consecutive integers is better expressed as a list or array, which is effectively what happens here.
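If the original scale is needed later, the same reshape bookkeeping works in reverse. A minimal sketch, assuming M and s from above and MinMaxScaler's inverse_transform:
# undo the scaling: flatten the layers again, invert, and reshape back
X_restored = M.inverse_transform(X.reshape(s[0], -1).T).T.reshape(s)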
IIUC, you can manually scale:
import joblib

mm, MM = inputs.min(axis=(1, 2)), inputs.max(axis=(1, 2))
# save these for later use
joblib.dump((mm, MM), 'minmax.joblib')

def scale(inputs, mm, MM):
    # broadcast the per-layer min/max over each 64x64 layer
    return (inputs - mm[:, None, None]) / (MM - mm)[:, None, None]

# load pre-saved min & max
mm, MM = joblib.load('minmax.joblib')
# scaled inputs
scale(inputs, mm, MM)
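For symmetry, a hypothetical unscale helper (not in the original answer) that inverts the same mapping with the saved statistics:
def unscale(scaled, mm, MM):
    # invert (x - mm) / (MM - mm)
    return scaled * (MM - mm)[:, None, None] + mm[:, None, None]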
