Adding data into new dimension of numpy array - python

I have a numpy array of size 55 x 10 x 10, which represents 55 10 x 10 grayscale images. I'm trying to make them RGB by duplicating the 10 x 10 images 3 times.
From what I've understood, I first need to add a new dimension to house the duplicated data. I've done this using:
array_4d = np.expand_dims(array_3d, 1),
so I now have a 55 x 1 x 10 x 10 array. How do I now duplicate the 10 x 10 images and add them back into this array?
Quick edit: In the end I want a 55 x 3 x 10 x 10 array

Let us first create a 3d array of size 55x10x10
from matplotlib import pyplot as plt
import numpy as np
original_array = np.random.randint(10,255, (55,10,10))
print(original_array.shape)
>>>(55, 10, 10)
Visual of first image in array:
first_img = original_array[0,:,:]
print(first_img.shape)
plt.imshow(first_img, cmap='gray')
>>>(10, 10)
Now you can get your desired array in a single step.
stacked_img = np.stack(3*(original_array,), axis=1)
print(stacked_img.shape)
>>>(55, 3, 10, 10)
Use axis=-1 if you want channel last
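For example, the channels-last version (a sketch reusing original_array from above):
stacked_img_last = np.stack(3*(original_array,), axis=-1)
print(stacked_img_last.shape)
>>> (55, 10, 10, 3)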
Now let us verify that the values are correct by extracting the first image from this array and taking the average of the 3 channels:
new_img = stacked_img[0,:,:,:]
print(new_img.shape)
>>> (3, 10, 10)
new_img_mean = new_img.mean(axis=0)
print(new_img_mean.shape)
>>> (10, 10)
np.allclose(new_img_mean, first_img) # If this is True then the two arrays are same
>>> True
For visual verification, you'll have to move the channel axis to the last position, because that is what matplotlib expects. This is a 3-channel image, so we are not using cmap='gray' here.
print(np.moveaxis(new_img, 0, -1).shape)
plt.imshow(np.moveaxis(new_img, 0, -1))
>>> (10, 10, 3)
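Alternatively, you can finish the expand_dims approach from the question: np.repeat duplicates the data along the new axis (a sketch using original_array and stacked_img from above):
array_4d = np.expand_dims(original_array, 1)
repeated = np.repeat(array_4d, 3, axis=1)
print(repeated.shape)
>>> (55, 3, 10, 10)
np.array_equal(repeated, stacked_img)
>>> True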


Multiple images numpy array into blocks

I have a numpy array with 1000 RGB images with shape (1000, 90, 90, 3) and I need to work on each image, but sliced in 9 blocks. I've found many solutions for slicing a single image, but how can I obtain a (9000, 30, 30, 3) array and then iteratively send 9 contiguous blocks to a function?
I would do something like the code below. In my example I used parts of images from skimage.data to illustrate the method, and made the shapes and sizes different so that it is easier to see what happens. You can do the same for your data by adjusting those parameters.
from skimage import data
from matplotlib import pyplot as plt
import numpy as np
astronaut = data.astronaut()
coffee = data.coffee()
arr = np.stack([coffee[:400, :400, :], astronaut[:400, :400, :]])
plt.imshow(arr[0])
plt.title('arr[0]')
plt.figure()
plt.imshow(arr[1])
plt.title('arr[1]')
arr_blocks = arr.reshape(arr.shape[0], 4, 100, 4, 100, 3).swapaxes(2, 3)
arr_blocks = arr_blocks.reshape(-1, 100, 100, 3)
for i, block in enumerate(arr_blocks):
    plt.figure(10 + i//16, figsize=(10, 10))
    plt.subplot(4, 4, i % 16 + 1)
    plt.imshow(block)
    plt.title(f'block {i}')
# batch_size = 9
# some_outputs_list = []
# for i in range(arr_blocks.shape[0]//batch_size + ((arr_blocks.shape[0] % batch_size) > 0)):
#     some_outputs_list.append(some_function(arr_blocks[i*batch_size:(i+1)*batch_size]))
Output: figures showing arr[0], arr[1], and the individual 100x100 blocks laid out in 4x4 subplot grids.
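The same reshape/swapaxes trick applied directly to the shapes in the question (a sketch, with random data standing in for the real images):
imgs = np.random.randint(0, 256, (1000, 90, 90, 3), dtype=np.uint8)
blocks = imgs.reshape(1000, 3, 30, 3, 30, 3).swapaxes(2, 3).reshape(-1, 30, 30, 3)
print(blocks.shape)
>>> (9000, 30, 30, 3)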

Numpy split array without copying

I have a very large array of images (multiple GBs) and want to split it using numpy. This is my code:
images = ... # this is the very large array which contains a lot of images.
images.shape => (50000, 256, 256)
indices = ... # array containing ranges, that group the images array like [(0, 300), (301, 580), (581, 860), ...]
train_indices, test_indices = ... # both arrays contain indices like [1, 6, 8, 19] which determine which groups are in the train and which are in the test group
images_train, images_test = np.empty([0, images.shape[1], images.shape[2]]), np.empty([0, images.shape[1], images.shape[2]])
# assign the image groups to either train or test set
for (i, rng) in enumerate(indices):
    group_range = range(rng[0], rng[1]+1)
    if i in train_indices:
        images_train = np.concatenate((images_train, images[group_range]))
    else:
        images_test = np.concatenate((images_test, images[group_range]))
The problem with this code is that images_train and images_test are new arrays, and the individual images are always copied into these new arrays. This leads to double the memory needed to run the program.
Is there a way to split my images array into images_train and images_test without having to copy the images, but rather reuse them?
My intention with the indices is to group the images into roughly 150 groups, where images from one group should be either in the train or test set
Without running code it's difficult to understand the details, but I can try to give some ideas. If you have images_train and images_test, then you will probably use them to train and to test with commands something like
.fit(images_train);
.score(images_test)
An approach might be to not build images_train and images_test at all, but to use parts of images directly:
.fit(images[...]);
.score(images[...])
Now the question is: what should be in the [...] brackets? Or is there a numpy operator that extracts the right images[...]? First we have to think about what we should avoid:
a for loop is always slow
iteratively filling an array, as in A = np.concatenate((A, B[j])), is always slow
"fancy indexing" is always slow and copies the data, as in group_range = range(rng[0], rng[1]+1); images[group_range]
Some ideas:
use slices instead of "fancy indexing" (slices give views into the original array):
images[rng[0] : rng[1]+1] , or
group_range = slice(rng[0] , rng[1]+1); images[group_range]
Is images_train = images[train_indices, :, :] and images_test = images[test_indices, :, :] what you are after?
images.shape => (50000, 256, 256) is 3-dimensional?
try whether numpy.where can give some assistance
Below are the methods I've mentioned:
import numpy as np
A = np.arange(20); print("A =",A)
B = A[5:16:2]; print("B =",B) # view of A only, faster
j = slice(5, 16, 2); C = A[j]; print("C =",C) # view of A only, faster
k = [2, 4, 8, 12]; D = A[k]; print("D =",D) # generates internal copies
A = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
B = [ 5 7 9 11 13 15]
C = [ 5 7 9 11 13 15]
D = [ 2 4 8 12]
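Putting these ideas together, a sketch (using the question's images, indices and train_indices; untested against the real data): keep the per-group slices in plain lists, since each slice is a view and copies nothing; only a later np.concatenate would force a copy.
images_train = [images[rng[0]:rng[1]+1] for i, rng in enumerate(indices) if i in train_indices]
images_test = [images[rng[0]:rng[1]+1] for i, rng in enumerate(indices) if i not in train_indices]
# iterate over these views (e.g. feed them to .fit group by group);
# concatenating them into one array would copy the data again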

Speeding up the generation of data sequences into an array using Python

I have the following code for making sequences from a dataframe that holds csv data of rain ratios.
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing
seq_len = 1100

def load_data(df_, seq_len):
    data_raw = df_.values  # convert to numpy array
    data = []
    data = np.array([data_raw[index: index + seq_len] for index in range(len(data_raw) - (seq_len + 1))])
    print(data.shape)

df = pd.read_csv("data.csv", index_col=0)
temp = df.copy()
temp = normalize_data(temp)  # normalize_data is defined elsewhere
load_data(temp, seq_len)
When I run load_data(temp, seq_len), I have to wait a long time. I do not understand whether the issue is the seq_len.
Here is the attached dataset: data.csv
Please help me make it faster. In the future I may have bigger data, but if this becomes faster I will not have to worry about that.
EDITED: As per @ParitoshSingh's comment, here is part of the dataset. But do not consider this to be the whole data; it is just a part of something bigger:
,rains_ratio_2013,rains_ratio_2014
0,1.12148,1.1216
1,1.12141,1.12162
2,1.12142,1.12163
3,1.12148,1.1216
4,1.12143,1.12165
5,1.12141,1.12161
6,1.1213799999999998,1.12161
7,1.1214,1.12158
8,1.1214,1.12158
9,1.12141,1.12158
10,1.12141,1.12161
11,1.12144,1.1215899999999999
12,1.12141,1.12162
13,1.12141,1.12161
14,1.12143,1.12161
15,1.12143,1.1216899999999999
16,1.12143,1.12173
17,1.12143,1.12178
18,1.1214600000000001,1.12179
19,1.12148,1.12174
20,1.12148,1.1217
21,1.12148,1.12174
22,1.12148,1.1217
23,1.12145,1.1217
24,1.12145,1.1217
25,1.12148,1.1217
26,1.1214899999999999,1.1217
27,1.1214899999999999,1.1216899999999999
28,1.12143,1.1216899999999999
29,1.12143,1.1216899999999999
30,1.12144,1.1216899999999999
This is essentially a sliding window problem.
One approach is to use vectorization to take the sliding windows over the data faster. Note that if you do not have enough memory to hold the final output, this may cause issues as well.
import numpy as np
import pandas as pd
Creating some dummy dataframe for ease of use. You should test on your original dataframe.
seq_len = 5
df = pd.DataFrame(np.arange(300).reshape(-1, 3))
print(df.head())
#Output:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
Now, we can create an array for all indexes that we need to use, and use indexing to access all our values in the desired format.
def load_data(df_, seq_len):
    data_raw = df_.values  # convert to numpy array
    # find the total number of windows
    nrows = len(data_raw) - seq_len + 1  # your code had -(seq_len + 1); I am assuming that was just a mistake. If not, correct this accordingly.
    # now build an index matrix from the window positions and index into data_raw
    data = data_raw[np.arange(nrows)[:, None] + np.arange(seq_len)]
    print("shape is", data.shape)
    return data

out = load_data(df, seq_len)
#Output: shape is (96, 5, 3)
EDIT: If you run into memory errors, you can always modify the function to use a generator instead. This way you take a middle ground between the two extremes of iterating one window at a time and consuming too much memory.
def load_data_gen(df_, seq_len, chunksize=10):
    data_raw = df_.values  # convert to numpy array
    nrows = len(data_raw) - seq_len + 1
    for i in range(0, nrows, chunksize):
        data = data_raw[np.arange(i, min(i + chunksize, nrows))[:, None] + np.arange(seq_len)]
        print("shape is", data.shape)
        yield data

out = load_data_gen(df, seq_len, 15)
test = list(out)
#Output:
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (6, 5, 3)
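For completeness: if your numpy is 1.20 or newer (an assumption about your environment), sliding_window_view builds the same windows as a zero-copy view:
from numpy.lib.stride_tricks import sliding_window_view
windows = sliding_window_view(df.values, window_shape=seq_len, axis=0)  # shape (nrows, ncols, seq_len)
data = windows.transpose(0, 2, 1)  # reorder to (nrows, seq_len, ncols)
print("shape is", data.shape)
#Output: shape is (96, 5, 3)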

Moving average on 3D array in Dask

I have a 3D array and I would like to use Dask to chunk up my 3D array into blocks of traces of a certain window size around each trace. A trace is just one vector of size (1, 1, z). I can do this using the numpy as_strided tricks as follows:
import numpy as np
from numpy.lib.stride_tricks import as_strided
input_volume = np.linspace(1, 1000, 1000, dtype=int).reshape((10, 10, 10))
window_size = 5
x, y, z = input_volume.shape
# Create a view on the volume of sub-cubes window_size traces wide overlapping by 1 trace in each direction
half_w = (window_size - 1) // 2
padded = np.pad(input_volume[...], [(half_w, half_w), (half_w, half_w), (0, 0)], 'edge')
x_str, y_str, z_str = padded.strides
blocks = as_strided(padded, (x, y, window_size, window_size, z), (x_str, y_str, x_str, y_str, z_str))
averaged_volume = np.mean(blocks, (2, 3))
First I pad my 3D cube in the x and y dimensions by the half window. I get the average trace from each block, so in this case a block of (5, 5, z) gets reduced to a single trace. I then end up with a volume the same size as the original that has been averaged over the window size. This effectively gives me a "view" of my 3D array with a shape of (10, 10, 5, 5, 10).
This works but if the volume is large it will load the whole volume into memory.
I have been trying to achieve the same thing with a chunked array in dask but I'm having trouble getting the depth and boundaries correct to give me the same answer. How can I achieve the same thing in dask so it only loads each block of traces into memory at a time and writes back out to the average cube?
EDIT:
This is the dask code I have been trying so far, but when it runs I get an IndexError: tuple index out of range when it tries to do the average calculation:
import dask.array as da

def average(block):
    return np.mean(block, axis=(0, 1))

dask_volume = da.from_array(da.pad(input_volume, [(half_w, half_w), (half_w, half_w), (0, 0)], 'edge'), chunks=(window_size, window_size, -1))
dask_overlapping = da.overlap.overlap(dask_volume, depth={0: window_size - 1, 1: window_size - 1}, boundary={0: 'none', 1: 'none'})
dask_average = dask_overlapping.map_blocks(average, chunks=(1, 1, z)).compute()
Thanks,
Mike
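One possible sketch (an assumption, not a tested answer from the thread): the edge-padded window mean in the numpy code is what scipy.ndimage.uniform_filter computes with mode='nearest', and map_overlap can apply it chunk by chunk with a halo of half_w in x and y:
import dask.array as da
from scipy.ndimage import uniform_filter

def window_mean(block):
    # mode='nearest' replicates edge values, matching np.pad(..., 'edge')
    return uniform_filter(block.astype(float), size=(window_size, window_size, 1), mode='nearest')

dask_volume = da.from_array(input_volume, chunks=(5, 5, -1))
averaged = dask_volume.map_overlap(
    window_mean,
    depth={0: half_w, 1: half_w, 2: 0},  # halo wide enough for the window
    boundary='nearest',                  # replicate edges at the volume boundary
).compute()
# under these assumptions this should match np.mean(blocks, (2, 3)) from the question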

How do I median bin a 2D image in python?

I have a 2D numpy array of size WIDTH x HEIGHT. I would like to bin the array by finding the median of each bin so that the resultant array is WIDTH/binsize x HEIGHT/binsize. Assume that both WIDTH and HEIGHT are divisible by binsize.
Edit: An example is given in the attached image.
I have found solutions where the binned array values are the sum or average of the individual elements in each bin:
How to bin a 2D array in numpy?
However, if I want to do a median combine of elements in each bin, I haven't been able to figure out a solution. Your help would be much appreciated!
Edit: image added
An example of the initial array and desired resultant median binned array
So you are looking for median over strided reshape:
import numpy as np
a = np.arange(24).reshape(4,6)
def median_binner(a, bin_x, bin_y):
    m, n = np.shape(a)
    strided_reshape = np.lib.stride_tricks.as_strided(
        a,
        shape=(bin_x, bin_y, m // bin_x, n // bin_y),
        strides=a.itemsize * np.array([(m // bin_x) * n, n // bin_y, n, 1]))
    return np.array([np.median(col) for row in strided_reshape for col in row]).reshape(bin_x, bin_y)

print("Original Matrix:")
print(a)
print("\n")
bin_tester1 = median_binner(a, 2, 3)
print("2x3 median bin :")
print(bin_tester1)
print("\n")
bin_tester2 = median_binner(a, 2, 2)
print("2x2 median bin :")
print(bin_tester2)
result:
Original Matrix:
[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]

2x3 median bin :
[[ 3.5  5.5  7.5]
 [15.5 17.5 19.5]]

2x2 median bin :
[[ 4.  7.]
 [16. 19.]]
Read up on numpy's stride tricks (the as_strided documentation) in order to completely understand the following line in the code:
strided_reshape = np.lib.stride_tricks.as_strided(a, shape=(bin_x, bin_y, m // bin_x, n // bin_y), strides=a.itemsize * np.array([(m // bin_x) * n, n // bin_y, n, 1]))
I was dealing with the same issue. I found the answer by Kennet Celeste very useful, but there are some caveats: the strided reshape is fast, but the loop over it is slow. The trick is to get all the data you compute the median from into the same region of memory and then use a vectorized numpy operation.
If you don't want to fiddle with the strided reshape, you can use the np.swapaxes function. Say I have an array X of size xdim x ydim and want to bin it with a window of bin_x x bin_y:
import numpy as np
# Some sample values
xdim = 5039
ydim = 6637
bin_x = 5
bin_y = 7
X = np.random.rand(ydim, xdim)
# now compute reduced dimensions, so that bin_x divides xdim_red evenly
xdim_red = xdim - xdim % bin_x
ydim_red = ydim - ydim % bin_y
# and the dimensions after binning
xdim_bin = xdim_red // bin_x
ydim_bin = ydim_red // bin_y
# crop X so the bin sizes divide the dimensions evenly
X = X[0:ydim_red, 0:xdim_red]
# here is the alternative to the strided reshape
X.shape = (ydim_bin, bin_y, xdim_bin, bin_x)
X_reshaped = X.swapaxes(1, 2)
# the following can be done on the strided-reshape array as well; it finally joins
# the chunks of memory we need to get together
X_reshaped = X_reshaped.reshape((ydim_bin, xdim_bin, bin_x * bin_y))
# there could be a faster implementation, but this at least uses a vectorized numpy call
g = np.median(X_reshaped, axis=-1)
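Since np.median accepts a tuple of axes, the swapaxes/reshape detour can also be collapsed into a single call on the 4D view (a sketch, equivalent to the code above):
g2 = np.median(X, axis=(1, 3))  # X is the (ydim_bin, bin_y, xdim_bin, bin_x) view
print(np.allclose(g, g2))
>>> True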
