I have a very large array of images (multiple GBs) and want to split it using numpy. This is my code:
images = ... # this is the very large array which contains a lot of images.
images.shape => (50000, 256, 256)
indices = ... # array containing ranges, that group the images array like [(0, 300), (301, 580), (581, 860), ...]
train_indices, test_indices = ... # both arrays contain indices like [1, 6, 8, 19] which determine which groups are in the train and which are in the test group
images_train, images_test = np.empty([0, images.shape[1], images.shape[2]]), np.empty([0, images.shape[1], images.shape[2]])
# assign the image groups to either train or test set
for (i, rng) in enumerate(indices):
    group_range = range(rng[0], rng[1]+1)
    if i in train_indices:
        images_train = np.concatenate((images_train, images[group_range]))
    else:
        images_test = np.concatenate((images_test, images[group_range]))
The problem with this code is that images_train and images_test are new arrays, so every image is copied into them. This doubles the memory needed to run the program.
Is there a way to split my images array into images_train and images_test without copying the images, reusing the existing data instead?
My intention with the indices is to group the images into roughly 150 groups, where all images from one group should end up in either the train or the test set.
Without running code it is difficult to understand the details, but I can offer some ideas. If you have images_train and images_test, you will probably use them to train and test with calls along the lines of
.fit(images_train);
.score(images_test)
One approach is to skip building images_train and images_test and use parts of images directly:
.fit(images[...]);
.score(images[...])
Now the question is what should go inside the [...] brackets, or whether there is a NumPy operation that extracts the right images[...]. First, consider what to avoid:
a Python for loop over many items is usually slow
iteratively growing an array like A = np.concatenate((A, B[j])) is slow, because every call allocates a new array and copies all the data again
NumPy's "fancy indexing" (indexing with a list, a range or an integer array) always returns a copy, as in group_range = range(rng[0], rng[1]+1); images[group_range]
Some ideas:
use slices instead of "fancy indexing"; a basic slice returns a view of the array rather than a copy:
images[rng[0] : rng[1]+1] , or
group_range = slice(rng[0] , rng[1]+1); images[group_range]
Is images_train = images[train_indices, :, :] and images_test = images[test_indices, :, :] ?
images.shape => (50000, 256, 256) is 3-dimensional ?
check whether numpy.where can be of some assistance
The methods mentioned above are illustrated below:
...
import numpy as np
A = np.arange(20); print("A =",A)
B = A[5:16:2]; print("B =",B) # view of A only, faster
j = slice(5, 16, 2); C = A[j]; print("C =",C) # view of A only, faster
k = [2, 4, 8, 12]; D = A[k]; print("D =",D) # generates internal copies
A = [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
B = [ 5 7 9 11 13 15]
C = [ 5 7 9 11 13 15]
D = [ 2 4 8 12]
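Applied to the question, the slice idea could look like the sketch below. It is only an illustration under assumptions: the array is shrunk to a toy size, `indices` and `train_indices` follow the question's description, and the names `train_views` and `test_views` are made up here. Each group stays a view into `images`, so no pixel data is copied. The price is that you get a list of per-group views rather than one contiguous array, which suits code that consumes the data group by group or batch by batch.

```python
import numpy as np

# Toy stand-in for the large array (the real one is (50000, 256, 256)).
images = np.arange(10 * 2 * 2, dtype=np.float32).reshape(10, 2, 2)

# Inclusive (start, stop) ranges per group, as described in the question.
indices = [(0, 2), (3, 5), (6, 9)]
train_indices = [0, 2]   # group numbers that belong to the training set

train_views, test_views = [], []
for i, (start, stop) in enumerate(indices):
    group = images[start:stop + 1]      # basic slice -> view, no copy
    (train_views if i in train_indices else test_views).append(group)

# Every view shares memory with the original array.
print(all(np.shares_memory(v, images) for v in train_views + test_views))
print(sum(len(v) for v in train_views), sum(len(v) for v in test_views))
```

If a single contiguous training array is required later, `np.concatenate(train_views)` still works, but that step is where the copy happens.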
Related
Suppose I have the 2 arrays below:
a = tf.constant([1,2,3])
b = tf.constant([10,20,30])
How can we concatenate them using Tensorflow's methods, such that the new array is created by doing intervals of taking 1 number from each array one at a time? (Is there already a function that can do this?)
For example, the desired result for the 2 arrays is:
[1,10,2,20,3,30]
Methods based on tf.concat just put array b after array a.
import tensorflow as tf
a = tf.constant([1,2,3])
b = tf.constant([10,20,30])
c = tf.stack([a,b]) #combine a,b as a matrix of shape (2, 3)
d = tf.transpose(c) #transpose matrix to get the right order, shape (3, 2)
e = tf.reshape(d, [-1]) #reshape to 1-d tensor: [1, 10, 2, 20, 3, 30]
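The same stack/transpose/reshape idea can be checked with plain NumPy, which is used here only so that the sketch runs without TensorFlow installed. Stacking along axis 1 places each pair side by side, so no separate transpose is needed; `tf.stack` accepts an `axis` argument in the same way.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

pairs = np.stack([a, b], axis=1)   # shape (3, 2): [[1, 10], [2, 20], [3, 30]]
interleaved = pairs.reshape(-1)    # row-major flatten interleaves a and b
print(interleaved)
```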
You could also try using tf.tensor_scatter_nd_update:
import tensorflow as tf
a = tf.constant([1,2,3])
b = tf.constant([10,20,30])
shape = tf.shape(a)[0] + tf.shape(b)[0]
c = tf.tensor_scatter_nd_update(
    tf.zeros(shape, dtype=tf.int32),
    tf.expand_dims(tf.concat([tf.range(start=0, limit=shape, delta=2),
                              tf.range(start=1, limit=shape, delta=2)], axis=0), axis=-1),
    tf.concat([a, b], axis=0))
# tf.Tensor([ 1 10 2 20 3 30], shape=(6,), dtype=int32)
I have a numpy array of size 55 x 10 x 10, which represents 55 10 x 10 grayscale images. I'm trying to make them RGB by duplicating the 10 x 10 images 3 times.
From what I've understood, I first need to add a new dimension to house the duplicated data. I've done this using:
array_4d = np.expand_dims(array_3d, 1),
so I now have a 55 x 1 x 10 x 10 array. How do I now duplicate the 10 x 10 images and add them back into this array?
Quick edit: In the end I want a 55 x 3 x 10 x 10 array
Let us first create a 3d array of size 55x10x10
from matplotlib import pyplot as plt
import numpy as np
original_array = np.random.randint(10,255, (55,10,10))
print(original_array.shape)
>>>(55, 10, 10)
Visual of first image in array:
first_img = original_array[0,:,:]
print(first_img.shape)
plt.imshow(first_img, cmap='gray')
>>>(10, 10)
Now you can get your desired array in just one single step.
stacked_img = np.stack(3*(original_array,), axis=1)
print(stacked_img.shape)
>>>(55, 3, 10, 10)
Use axis=-1 if you want channel last
Now let us verify that the values are correct by extracting the first image from this array and taking the average of the 3 channels:
new_img = stacked_img[0,:,:,:]
print(new_img.shape)
>>> (3, 10, 10)
new_img_mean = new_img.mean(axis=0)
print(new_img_mean.shape)
>>> (10, 10)
np.allclose(new_img_mean, first_img) # If this is True then the two arrays are the same
>>> True
For visual verification, you'll have to move the channel to last because that is what matplotlib needs. This is a 3 channel image, so we are not using cmap='gray' here
print(np.moveaxis(new_img, 0, -1).shape)
plt.imshow(np.moveaxis(new_img, 0, -1))
>>> (10, 10, 3)
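If the three channels only need to be read, and not modified independently, a broadcast view avoids tripling the memory. The sketch below is an illustration rather than part of the answer above; `rgb_view` and `rgb_copy` are names chosen here. `np.broadcast_to` returns a read-only view, while `np.repeat` produces an ordinary writable copy with the same values.

```python
import numpy as np

gray = np.random.randint(10, 255, (55, 10, 10))

# Read-only view: the channel axis has stride 0, so no data is duplicated.
rgb_view = np.broadcast_to(gray[:, None, :, :], (55, 3, 10, 10))

# Writable copy with identical content.
rgb_copy = np.repeat(gray[:, None, :, :], 3, axis=1)

print(rgb_view.shape, rgb_copy.shape)
print(np.array_equal(rgb_view, rgb_copy))
```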
I am working with a 4-D array input to a CNN network. The input array has the following shape
print('X_train shape: ', X_train.shape)
X_train shape: (47204, 1, 100, 4)
Data description:
The input data consists of 47204 instances (fixed-length segments, as the CNN requires). Each instance has shape (1, 100, 4): 1 segment contains 100 GPS points, and for each point 4 kinematic features (max_speed, avg_speed, max_acc, avg_acc) are stored. Labels are stored in a separate y_train array of shape (47204,) for 5 classes [0..4].
print(y_train)
[3 3 0 ... 2 3 4]
To get a better sense of my X_train array, I show the first 3 elements below:
print(X_train[:3])
[
[[[ 3.82280987e+00 2.16802350e-01 7.49917451e-02 3.44416369e-04]
[ 3.38707371e+00 2.02210055e-01 1.61751110e-03 1.93745950e-03]
[ 2.49202215e+00 1.60605262e-01 8.43561351e-03 2.40057917e-03]
...
[ 2.00022316e+00 2.70020923e-01 5.40441673e-02 3.57212151e-03]
[ 3.25199744e-01 9.06990382e-02 1.46808316e-02 1.65841315e-03]
[2.96587589e-01 0.00000000e+00 6.13293351e-04 4.16518187e-03]]]
[[[ 1.07209176e+00 7.27038312e-02 6.62777026e-03 2.04611951e-04]
[ 1.06194285e+00 5.05005456e-02 4.05676569e-03 3.72293433e-04]
[ 1.02849748e+00 2.12558178e-02 2.95477005e-03 5.56584054e-04]
...
[ 4.51962909e-03 5.63125736e-04 5.98474074e-04 1.63036715e-05]
[ 2.83026181e-03 2.35855075e-03 1.25789358e-03 2.15331510e-06]
[8.49078543e-03 2.16840434e-19 9.43423077e-04 1.29198906e-05]]]
[[[ 7.51127665e+00 3.14033478e-01 6.85170617e-02 7.73415075e-04]
[ 7.42307262e+00 1.33868251e-01 4.10564823e-02 1.16131460e-03]
[ 7.35818066e+00 1.23886976e-02 3.02312582e-02 1.28312101e-03]
...
[ 7.40826167e+00 1.19388656e-01 4.00874715e-02 2.04909489e-04]
[ 7.23779176e+00 1.33269965e-01 1.20430502e-02 1.58195900e-04]
[ 7.11697001e+00 4.68002105e-02 5.42478400e-02 3.58101318e-05]]]
]
Task:
I am required to create a machine learning model (e.g. random forest) using the 4 kinematics (max_speed, avg_speed, max_acc, avg_acc) as features. This requires navigating each instance and getting these features for the 100-points in the instance.
Clearly, the number of samples will then be 4720400 (i.e. 47204 x 100), so I would also need to match each row to the label of its instance, i.e. y_train will then have shape (4720400,).
The expected input would then be like:
   max_speed       avg_speed       max_acc         avg_acc         class
0  3.82280987e+00  2.16802350e-01  7.49917451e-02  3.44416369e-04  3
1  3.38707371e+00  2.02210055e-01  1.61751110e-03  1.93745950e-03  3
2  2.49202215e+00  1.60605262e-01  8.43561351e-03  2.40057917e-03  3
...
I have been thinking about how to do this all week, but every idea has evaporated. How do I do this, please?
You can reshape your X_train array from (47204, 1, 100, 4) to (4720400, 4) simply with:
X_train_reshaped = X_train.reshape(4720400, 4)
This preserves the data order, and the total number of elements stays the same.
Similarly, you can expand the y_train array using np.repeat:
y_train_reshaped = np.repeat(y_train, 100)
Note the 100 passed to np.repeat. Since there is one label per 100 data points, each label is repeated 100 times. This also preserves the order, so every point keeps the label of its original instance.
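A small check of the order-preserving claim, using made-up shapes (3 instances of 4 points each instead of 47204 x 100) so the result can be inspected by eye. The names `X_small`, `y_small`, `X_flat` and `y_flat` are placeholders chosen for this sketch.

```python
import numpy as np

n_instances, n_points, n_features = 3, 4, 4
X_small = np.arange(n_instances * n_points * n_features, dtype=float)
X_small = X_small.reshape(n_instances, 1, n_points, n_features)
y_small = np.array([3, 0, 2])

X_flat = X_small.reshape(-1, n_features)   # (12, 4): one row per point
y_flat = np.repeat(y_small, n_points)      # each label repeated once per point

print(X_flat.shape, y_flat.shape)
print(y_flat)
```

Row k of `X_flat` belongs to instance `k // n_points`, which is exactly the instance whose label sits at `y_flat[k]`.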
I have the following code for building sequences from a dataframe that holds CSV data of rain ratios.
import pandas as pd
import numpy as np
import sklearn
import sklearn.preprocessing
seq_len = 1100
def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    data = []
    data = np.array([data_raw[index: index + seq_len] for index in range(len(data_raw) - (seq_len+1))])
    print(data.shape)
df = pd.read_csv("data.csv",index_col = 0)
temp = df.copy()
temp = normalize_data(temp) # normalize_data is defined elsewhere (not shown)
load_data(temp, seq_len)
When I run load_data(temp, seq_len), I have to wait a long time. I don't understand whether seq_len is the issue.
Here is the attached dataset: data.csv
Please help me make it faster. I may have bigger data in the future, but if this case becomes fast I won't need to worry about that.
EDIT: As per @ParitoshSingh's comment, here is part of the dataset. Please do not treat this as the full data; it is only a part of a bigger dataset:
,rains_ratio_2013,rains_ratio_2014
0,1.12148,1.1216
1,1.12141,1.12162
2,1.12142,1.12163
3,1.12148,1.1216
4,1.12143,1.12165
5,1.12141,1.12161
6,1.1213799999999998,1.12161
7,1.1214,1.12158
8,1.1214,1.12158
9,1.12141,1.12158
10,1.12141,1.12161
11,1.12144,1.1215899999999999
12,1.12141,1.12162
13,1.12141,1.12161
14,1.12143,1.12161
15,1.12143,1.1216899999999999
16,1.12143,1.12173
17,1.12143,1.12178
18,1.1214600000000001,1.12179
19,1.12148,1.12174
20,1.12148,1.1217
21,1.12148,1.12174
22,1.12148,1.1217
23,1.12145,1.1217
24,1.12145,1.1217
25,1.12148,1.1217
26,1.1214899999999999,1.1217
27,1.1214899999999999,1.1216899999999999
28,1.12143,1.1216899999999999
29,1.12143,1.1216899999999999
30,1.12144,1.1216899999999999
This is essentially a sliding window problem.
One approach is to use vectorization to take the sliding windows over the data faster. Note that if you do not have enough memory to hold the final output, this approach may cause problems as well.
import numpy as np
import pandas as pd
Creating some dummy dataframe for ease of use. You should test on your original dataframe.
seq_len = 5
df = pd.DataFrame(np.arange(300).reshape(-1, 3))
print(df.head())
#Output:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14
Now, we can create an array for all indexes that we need to use, and use indexing to access all our values in the desired format.
def load_data(df_, seq_len):
    data_raw = df_.values # convert to numpy array
    #find total number of rows
    nrows = len(data_raw) - seq_len + 1 #Your code had -(seq_len + 1) for some reason. I am assuming that was just a mistake. If not, correct this accordingly.
    #Now, create an index matrix from the total number of rows.
    data = data_raw[np.arange(nrows)[:,None] + np.arange(seq_len)]
    print("shape is", data.shape)
    return data
out = load_data(df, seq_len)
#Output: shape is (96, 5, 3)
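To see what the index matrix looks like, here is a tiny version with 6 rows and a window of 3 (both values are picked only for illustration). Broadcasting a column of start positions against a row of offsets gives one row of indices per window:

```python
import numpy as np

n, window = 6, 3
starts = np.arange(n - window + 1)[:, None]   # shape (4, 1)
offsets = np.arange(window)                   # shape (3,)
index_matrix = starts + offsets               # shape (4, 3) by broadcasting
print(index_matrix)

data_raw = np.arange(n * 2).reshape(n, 2)     # 6 rows, 2 columns
windows = data_raw[index_matrix]              # shape (4, 3, 2)
print(windows.shape)
```

Because this is fancy indexing, `windows` is a full copy: every input row appears in up to `window` different windows, which is where the memory cost comes from.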
EDIT: If you run into memory errors, you can always modify the function to use a generator instead. This way, you take a middle ground between the two scenarios of iterating one by one or consuming too much memory.
def load_data_gen(df_, seq_len, chunksize=10):
    data_raw = df_.values # convert to numpy array
    nrows = len(data_raw) - seq_len + 1
    for i in range(0, nrows, chunksize):
        data = data_raw[np.arange(i, min(i+chunksize, nrows))[:,None] + np.arange(seq_len)]
        print("shape is", data.shape)
        yield data
out = load_data_gen(df, seq_len, 15)
test = list(out)
#Output:
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (15, 5, 3)
shape is (6, 5, 3)
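On NumPy 1.20 or newer, `numpy.lib.stride_tricks.sliding_window_view` offers another way to get the same windows as a read-only view, so no window data is copied until you materialise it. This is a sketch rather than part of the answer above: the helper name `load_data_view` is made up here, and a plain array stands in for `df.values`. The window axis is appended last, hence the transpose.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def load_data_view(data_raw, seq_len):
    # sliding_window_view returns shape (nrows, ncols, seq_len);
    # move the window axis to the middle to match (nrows, seq_len, ncols).
    windows = sliding_window_view(data_raw, seq_len, axis=0)
    return windows.transpose(0, 2, 1)   # still a view, nothing copied

data_raw = np.arange(300).reshape(-1, 3)   # stands in for df.values
out = load_data_view(data_raw, 5)
print(out.shape)
print(np.shares_memory(out, data_raw))
```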
I'm looking for a good approach for efficiently dividing an image into small regions, processing each region separately, and then re-assembling the results from each process into a single processed image. Matlab had a tool for this called blkproc (replaced by blockproc in newer versions of Matlab).
In an ideal world, the function or class would support overlap between the divisions in the input matrix too. In the Matlab help, blkproc is defined as:
B = blkproc(A,[m n],[mborder nborder],fun,...)
A is your input matrix,
[m n] is the block size
[mborder, nborder] is the size of your border region (optional)
fun is a function to apply to each block
I have kluged together an approach, but it strikes me as clumsy and I bet there's a much better way. At the risk of my own embarrassment, here's my code:
import numpy as np
def segmented_process(M, blk_size=(16,16), overlap=(0,0), fun=None):
    rows = []
    for i in range(0, M.shape[0], blk_size[0]):
        cols = []
        for j in range(0, M.shape[1], blk_size[1]):
            cols.append(fun(M[i:i+blk_size[0], j:j+blk_size[1]]))
        rows.append(np.concatenate(cols, axis=1))
    return np.concatenate(rows, axis=0)
R = np.random.rand(128,128)
passthrough = lambda x: x
Rprime = segmented_process(R, blk_size=(16,16),
                           overlap=(0,0),
                           fun=passthrough)
np.all(R==Rprime)
Here are some examples of a different (loop free) way to work with blocks:
import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
# int32 is requested explicitly: the hard-coded byte strides below assume 4-byte elements
A= np.arange(36, dtype=np.int32).reshape(6, 6)
print(A)
#[[ 0  1  2  3  4  5]
# [ 6  7  8  9 10 11]
# ...
# [30 31 32 33 34 35]]
# 2x2 block view
B= ast(A, shape= (3, 3, 2, 2), strides= (48, 8, 24, 4))
print(B[1, 1])
#[[14 15]
# [20 21]]
# for preserving original shape
B[:, :]= np.dot(B[:, :], np.array([[0, 1], [1, 0]]))
print(A)
#[[ 1  0  3  2  5  4]
# [ 7  6  9  8 11 10]
# ...
# [31 30 33 32 35 34]]
print(B[1, 1])
#[[15 14]
# [21 20]]
# for reducing shape, processing in 3D is enough
C= B.reshape(3, 3, -1)
print(C.sum(-1))
#[[ 14  22  30]
# [ 62  70  78]
# [110 118 126]]
So simply copying the Matlab functionality to numpy is not always the best way to proceed. Sometimes some 'off the hat' thinking is needed.
Caveat:
In general, implementations based on stride tricks may (but do not necessarily) suffer some performance penalties, so be prepared to always measure your performance. In any case, it is wise to first check whether the needed functionality (or something similar enough to adapt easily) has already been implemented in numpy or scipy.
Update:
Please note that there is no real magic involved here with the strides, so I'll provide a simple function to get a block_view of any suitable 2D numpy-array. So here we go:
from numpy.lib.stride_tricks import as_strided as ast
def block_view(A, block= (3, 3)):
    """Provide a 2D block view to 2D array. No error checking made.
    Therefore meaningful (as implemented) only for blocks strictly
    compatible with the shape of A."""
    # simple shape and strides computations may seem at first strange
    # unless one is able to recognize the 'tuple additions' involved ;-)
    # integer division (//) keeps the shape entries integers under Python 3
    shape= (A.shape[0]// block[0], A.shape[1]// block[1])+ block
    strides= (block[0]* A.strides[0], block[1]* A.strides[1])+ A.strides
    return ast(A, shape= shape, strides= strides)
if __name__ == '__main__':
    from numpy import arange
    A= arange(144).reshape(12, 12)
    print(block_view(A)[0, 0])
    #[[ 0  1  2]
    # [12 13 14]
    # [24 25 26]]
    print(block_view(A, (2, 6))[0, 0])
    #[[ 0  1  2  3  4  5]
    # [12 13 14 15 16 17]]
    print(block_view(A, (3, 12))[0, 0])
    #[[ 0  1  2  3  4  5  6  7  8  9 10 11]
    # [12 13 14 15 16 17 18 19 20 21 22 23]
    # [24 25 26 27 28 29 30 31 32 33 34 35]]
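For non-overlapping blocks that divide the array evenly, the same block view can also be obtained without stride tricks, by a reshape followed by an axis swap. A minimal sketch, assuming a C-contiguous input; the function name `block_view_reshape` is made up here:

```python
import numpy as np

def block_view_reshape(A, block=(3, 3)):
    """Non-overlapping block view built from reshape + swapaxes."""
    rows, cols = A.shape
    br, bc = block
    if rows % br or cols % bc:
        raise ValueError("block shape must divide the array shape evenly")
    # (rows//br, br, cols//bc, bc) -> (rows//br, cols//bc, br, bc)
    return A.reshape(rows // br, br, cols // bc, bc).swapaxes(1, 2)

A = np.arange(144).reshape(12, 12)
blocks = block_view_reshape(A, (3, 3))
print(blocks.shape)
print(blocks[0, 0])
```

Unlike raw `as_strided`, this variant cannot address memory outside the array, because NumPy checks the shapes for you.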
Process by slices/views. Concatenation is very expensive.
for x in range(0, 160, 16):
    for y in range(0, 160, 16):
        view = A[x:x+16, y:y+16]
        view[:,:] = fun(view)
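A self-contained version of this loop, as a sketch: the function name `process_blocks_inplace`, the 160 x 160 array and the mean-removing `fun` are placeholders chosen here, and the code assumes `fun` returns a block of the same shape as its input.

```python
import numpy as np

def process_blocks_inplace(A, fun, blk=16):
    """Apply fun to each blk x blk tile of A, writing the result back in place."""
    for x in range(0, A.shape[0], blk):
        for y in range(0, A.shape[1], blk):
            view = A[x:x + blk, y:y + blk]   # view into A, no copy
            view[:, :] = fun(view)           # overwrite the tile in place
    return A

A = np.random.rand(160, 160)
process_blocks_inplace(A, lambda b: b - b.mean())   # remove each tile's mean
print(abs(A[:16, :16].mean()) < 1e-12)
```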
I took both suggestions, as well as my original approach, and compared the results. As @eat correctly points out, the results depend on the nature of your input data. Surprisingly, concatenate beats view processing in a few instances. Each method has a sweet spot. Here is my benchmark code:
import numpy as np
from itertools import product
def segment_and_concatenate(M, fun=None, blk_size=(16,16), overlap=(0,0)):
    # truncate M to a multiple of blk_size
    M = M[:M.shape[0]-M.shape[0]%blk_size[0],
          :M.shape[1]-M.shape[1]%blk_size[1]]
    rows = []
    for i in range(0, M.shape[0], blk_size[0]):
        cols = []
        for j in range(0, M.shape[1], blk_size[1]):
            max_ndx = (min(i+blk_size[0], M.shape[0]),
                       min(j+blk_size[1], M.shape[1]))
            cols.append(fun(M[i:max_ndx[0], j:max_ndx[1]]))
        rows.append(np.concatenate(cols, axis=1))
    return np.concatenate(rows, axis=0)
from numpy.lib.stride_tricks import as_strided
def block_view(A, block= (3, 3)):
    """Provide a 2D block view to 2D array. No error checking made.
    Therefore meaningful (as implemented) only for blocks strictly
    compatible with the shape of A."""
    # simple shape and strides computations may seem at first strange
    # unless one is able to recognize the 'tuple additions' involved ;-)
    # integer division (//) keeps the shape entries integers under Python 3
    shape= (A.shape[0]// block[0], A.shape[1]// block[1])+ block
    strides= (block[0]* A.strides[0], block[1]* A.strides[1])+ A.strides
    return as_strided(A, shape= shape, strides= strides)
def segmented_stride(M, fun, blk_size=(3,3), overlap=(0,0)):
    # This is some complex function of blk_size and M.shape
    stride = blk_size
    output = np.zeros(M.shape)
    B = block_view(M, block=blk_size)
    O = block_view(output, block=blk_size)
    for b,o in zip(B, O):
        o[:,:] = fun(b)
    return output
def view_process(M, fun=None, blk_size=(16,16), overlap=None):
    # truncate M to a multiple of blk_size
    from itertools import product
    output = np.zeros(M.shape)
    dz = np.asarray(blk_size)
    shape = M.shape - (np.mod(np.asarray(M.shape),
                              blk_size))
    for indices in product(*[range(0, stop, step)
                             for stop,step in zip(shape, blk_size)]):
        # Don't overrun the end of the array.
        #max_ndx = np.min((np.asarray(indices) + dz, M.shape), axis=0)
        #slices = [slice(s, s + f, None) for s,f in zip(indices, dz)]
        output[indices[0]:indices[0]+dz[0],
               indices[1]:indices[1]+dz[1]][:,:] = fun(M[indices[0]:indices[0]+dz[0],
                                                         indices[1]:indices[1]+dz[1]])
    return output
if __name__ == "__main__":
    R = np.random.rand(128,128)
    squareit = lambda x: x**2
    from timeit import timeit
    t ={}
    kn = np.array(list(product((8,16,64,128),
                               (128, 512, 2048, 4096)) ) )
    methods = ("segment_and_concatenate",
               "view_process",
               "segmented_stride")
    t = np.zeros((kn.shape[0], len(methods)))
    for i, (k, N) in enumerate(kn):
        for j, method in enumerate(methods):
            # the setup string must stay unindented: timeit re-indents it itself
            t[i,j] = timeit("""Rprime = %s(R, blk_size=(%d,%d),
                overlap = (0,0),
                fun = squareit)""" % (method, k, k),
                setup="""
from segmented_processing import %s
import numpy as np
R = np.random.rand(%d,%d)
squareit = lambda x: x**2""" % (method, N, N),
                number=5
                )
        print("k =", k, "N =", N) #, "time:", t[i]
        print(" Speed up (view vs. concat, stride vs. concat): %0.4f, %0.4f" % (
            t[i][0]/t[i][1],
            t[i][0]/t[i][2]))
And here are the results:
Note that the segmented stride method wins by 3-4x for small block sizes. Only at large block sizes (128 x 128) and very large matrices (2048 x 2048 and larger) does the view processing approach win, and then only by a small percentage. Based on the bake-off, it looks like @eat gets the check mark! Thanks to both of you for good examples!
Bit late to the game, but this would do overlapping blocks. I haven't done it here, but you could easily adapt for step sizes for shifting the window, I think:
from numpy.lib.stride_tricks import as_strided
def rolling_block(A, block=(3, 3)):
    shape = (A.shape[0] - block[0] + 1, A.shape[1] - block[1] + 1) + block
    strides = (A.strides[0], A.strides[1]) + A.strides
    return as_strided(A, shape=shape, strides=strides)
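The step-size adaptation mentioned above could look like the following sketch. The `step` parameter and the name `rolling_block_step` are assumptions added here; with `step=(1, 1)` it reduces to the function above, and with `step` equal to `block` it yields non-overlapping blocks. Writing through overlapping views is error-prone, so this version creates the view read-only.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_block_step(A, block=(3, 3), step=(1, 1)):
    """Overlapping block view of a 2D array, moving the window by `step`."""
    n_rows = (A.shape[0] - block[0]) // step[0] + 1
    n_cols = (A.shape[1] - block[1]) // step[1] + 1
    shape = (n_rows, n_cols) + tuple(block)
    # Outer strides jump `step` rows/columns; inner strides walk inside a block.
    strides = (step[0] * A.strides[0], step[1] * A.strides[1]) + A.strides
    return as_strided(A, shape=shape, strides=strides, writeable=False)

A = np.arange(36).reshape(6, 6)
W = rolling_block_step(A, block=(3, 3), step=(2, 1))
print(W.shape)
print(W[1, 2])
```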
Even later in the game.
There is a Swiss Image processing package called Bob available at:
https://www.idiap.ch/software/bob/
It has some python commands for blocks, e.g. bob.ip.base.block
which appears to do everything the Matlab command 'blockproc' does.
I have not tested it.
There are also interesting commands bob.ip.base.DCTFeatures which incorporates the 'block' capabilities to extract or modify DCT coefficients of an image.
I found this tutorial - The final source code provides exactly the desired functionality!
It even should work for any dimensionality (I did not test it)
http://www.johnvinyard.com/blog/?p=268
Though the "flatten" option at the very end of the source code seems to be a little buggy. Nevertheless, very nice piece of software!