I have some images stored in chunks of 80 in a numpy array.
trainImages.shape
# (715, 80, 96, 96, 3)
For example, trainImages has 715 chunks, each containing 80 images of size (96, 96, 3).
The array is dtype=float32 so it takes up quite a lot of space in RAM, approximately 6GB.
I shuffle the chunks with this code
shuffler = np.random.permutation(trainImages.shape[0])
trainImages = trainImages[shuffler]
I notice that the RAM usage drops to almost 0. The shape is still the same and I can display the images. All in all, the array looks fine, but it hardly takes up any RAM after the shuffle. How can that be?
I'm using Google Colab Pro with 25GB of RAM and I monitor the RAM usage from the indicator at the top.
You can easily reproduce this behavior by pasting this code in a Colab notebook
import numpy as np
a = np.random.rand(715, 80, 96, 96, 3).astype(np.float32)
shuffler = np.random.permutation(a.shape[0])
a = a[shuffler]
I've also tried shuffling the same array reshaped to (57200, 96, 96, 3), so I was shuffling every image individually. In that case I didn't notice any change in RAM usage, as expected.
It looks like you are running out of memory. Generally, when slicing in numpy the result is a view, which does not take up extra memory. However, when indexing with a boolean mask or an arbitrary array of integers there is no regular stride pattern, so numpy cannot return a view and has to make a copy. In the line:
a = a[shuffler]
numpy will first allocate a new 6 GB array, then copy the data into it according to the indices in shuffler, and finally rebind the name to the new array, releasing the memory of the old one. At the peak, however, about 12 GB have to be allocated at once! I suspect that Colab, with only ~13 GB available, kills the Python kernel that tries to allocate more memory than allowed. As a result, you see the RAM usage drop to 0.
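A quick way to see the view-versus-copy difference for yourself (a small sketch, not from the original post) is to ask numpy whether the result shares memory with the original:
import numpy as np

a = np.random.rand(10, 80, 96, 96, 3).astype(np.float32)   # small toy version of the array
view = a[2:5]                                               # basic slice: a view, no new data
print(np.shares_memory(a, view))                            # True
shuffler = np.random.permutation(a.shape[0])
print(np.shares_memory(a, a[shuffler]))                     # False: fancy indexing made a full copy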
I am not quite sure why the reshaped array worked; perhaps you had more free memory when you tested it, so the copy just barely fit.
However, a more memory-efficient way to shuffle the samples is to use
np.random.shuffle(a)
This method shuffles the first axis of your data in place, which should prevent the memory overflow issue.
If you need to shuffle the first axis of two different arrays in a consistent order (for example the input features x and the output labels y), you can set the same seed before each shuffle, ensuring that the two shuffles use the same permutation:
np.random.seed(42)
np.random.shuffle(x)
np.random.seed(42)
np.random.shuffle(y)
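With the newer Generator API the same trick works by re-creating the generator with the same seed before each in-place shuffle (a sketch, assuming x and y have the same length along axis 0):
rng = np.random.default_rng(42)
rng.shuffle(x)                    # shuffles x along axis 0, in place
rng = np.random.default_rng(42)   # same seed -> same permutation
rng.shuffle(y)                    # so labels stay aligned with features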
I am trying to understand the memory allocation in the following operation:
x_batch, images_path, ImageValidStatus = tf_resize_images(path_list, img_type=col_mode, im_size=IMAGE_SIZE)
x_batch = x_batch / 255
x_batch = 1.0 - x_batch
x_batch = x_batch.reshape(x_batch.shape[0], IMAGE_SIZE[0]*IMAGE_SIZE[1]*IMAGE_SIZE[2])
What I am interested in is x_batch, a multi-dimensional numpy array of shape 100x64x64x3,
where 100 is the number of images and 64x64x3 are the dimensions of each image.
What is the maximum number of copies of the images held in memory at any one point in time?
In other words, how exactly do the operations x_batch/255, 1-x_batch and x_batch.reshape behave from a memory perspective?
My main concern is that in some cases I am trying to process 500K images at once; if I make multiple copies of these images in memory, it will be very difficult to fit everything.
I see "tf" in your code, so I am unsure if you are asking about tensors or arrays. Lets assume you are asking about arrays. In general, arrays are written to memory once and then manipulated. For example,
import numpy as np
data = np.empty((1000,30,30,5)) # Allocates 1000*30*30*5*dtype_size bytes (plus a small overhead).
data.reshape((1000,30,150))     # Returns a view; only the access metadata changes, no data is copied (the result is discarded here).
data += 1                       # Adds one to every entry in place; no extra memory.
data = 1-data                   # Allocates a new array for 1-data, then the old one is freed; memory use is briefly ~2x.
x = data + 1                    # Allocates and fills a second full-size array that stays alive alongside data.
As long as you don't change the array size (forcing a re-allocation), numpy operates on the data very quickly and efficiently. Not as fast as tensorflow, but very, very fast. In-place additions, functions, and operations are all done without using more memory. Things like appending to the array, however, can force numpy to re-allocate and copy the whole array in memory.
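Applied to the code in the question, the divide and subtract steps can be done without temporaries by using the out= argument of the ufuncs, and the reshape is just a view. A sketch, assuming x_batch already holds floats (if tf_resize_images returns uint8, one astype conversion is unavoidable):
np.divide(x_batch, 255.0, out=x_batch)            # in-place scaling, no temporary array
np.subtract(1.0, x_batch, out=x_batch)            # in-place 1 - x_batch, no temporary array
x_batch = x_batch.reshape(x_batch.shape[0], -1)   # returns a view, no data is copied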
I am trying to display a sequence of frames using Shady, but I'm running into difficulties. I'm looking at 25 frames covering a 1080x1080 pixel area. The stimulus is grayscale, and I am doing luminance linearization off-line, so I only need to save a uint8 value for each pixel. The full sequence is thus about 29 MB. I define the stimulus as a 3-D numpy array [1080x1080x25], save it to disk using np.save(), and then load it using np.load().
try:
    yy = np.load(fname)
except:
    print(fname + ' does not exist')
    return
This step takes about 20ms. It is my understanding that Shady does not deal with uint8 luminance values, but rather with floats between 0 and 1. I thus convert it into a float array and divide by 255.
yy = yy.astype(float) / 255.0
This second step takes approx 260ms, which is already not great (ideally I need to get the stimulus loaded and ready to be presented in 400ms).
I now create a list of 25 numpy arrays to use as my pages parameter in the Stimulus class:
pages = []
for j in range(yy.shape[2]):
    pages.append(np.squeeze(yy[:, :, j]))
This is virtually instantaneous. But at my next step I run into serious timing problems.
if self.sequence is None:
    self.sequence = self.wind.Stimulus(pages, 'sequence', multipage=True, anchor=Shady.LOCATION.UPPER_LEFT, position=[deltax, deltay], visible=False)
else:
    self.sequence.LoadPages(pages, visible=False)
Here I either create a Stimulus object, or update its pages attribute if this is not the first sequence I load. Either way, this step takes about 10s, which is about 100 times what I can tolerate in my application.
Is there a way to significantly speed this up? What am I doing wrong? I have a pretty mediocre graphics card on this machine (Radeon Pro WX 4100), and if that is the problem I could upgrade it, but I do not want to go through the hassle if that is not going to fix it.
Based on jez's comments, his tests, and my tests, I guess that on some configurations (in my case Linux Mint 19 with Cinnamon and a mediocre AMD video card) loading floats can be much slower than loading uint8. With uint8 the behavior appears to be consistent across configurations, so go with uint8 if you can. Since this will (I assume) disable much of what Shady can do in terms of gamma correction and dynamic range enhancement, it may be limiting for some.
Shady can accept uint8 pixel values as-is so you can cut out your code for scaling and type-conversion. Of course, you lose out on Shady's ability to do dynamic range enhancement that way, but it seems like you have your own offline solutions for that kind of thing. If you're going to use uint8 stimuli exclusively, you can save a bit of GPU processing effort by turning off dithering (set the .ditheringDenominator of both the World and the Stimulus to 0 or a negative value).
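For reference, turning dithering off is just a matter of setting that attribute on both objects (a sketch, where w and stim stand in for your World and Stimulus instances):
w.ditheringDenominator = 0       # disable dithering for the whole World
stim.ditheringDenominator = 0    # and for this particular Stimulus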
It seems like the ridiculous 10-to-15-second delays come from inside the compiled binary "accelerator" component, when transferring the raw texture data from RAM to the graphics card. The problem is apparently (a) specific to transferring floating-point texture data rather than integer data, and (b) specific to the graphics card you have (since you reported the problem went away on the same system when you swapped in an NVidia card). Possibly it's also OS- or driver-specific with regard to the old graphics card.
Note that you can also reduce your LoadPages() time from 300–400ms down to about 40ms by cutting down the amount of numpy operations Shady has to do. Save your arrays as [pages x rows x columns] instead of [rows x columns x pages]. Relative to your existing workflow, this means you do yy = yy.transpose([2, 0, 1]) before saving. Then, when you load, don't transpose back: just split on axis=0, and then squeeze the leftmost dimension out of each resulting page:
pages = [ page.squeeze(0) for page in numpy.split(yy, yy.shape[0], axis=0) ]
That way you'll end up with 25 views into the original array, each of which is a contiguous memory block. By contrast, if you do it the original [rows x columns x pages] way, then regardless of whether you do split-and-squeeze or your original slice-and-squeeze loop, you get 25 non-contiguous views into the original memory, and that fact will catch up with you sooner or later—if not when you or Shady convert between numeric formats, then at latest when Shady uses numpy's .tostring method to serialize the data for transfer.
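Putting the two halves of that advice together, the save and load sides might look like this (a sketch, assuming yy starts out as [rows x columns x pages] and fname is your .npy path):
# save side: make pages the leading axis so each page is a contiguous block
np.save(fname, np.ascontiguousarray(yy.transpose([2, 0, 1])))

# load side: no transpose back, just split along axis 0 and squeeze each page
yy = np.load(fname)
pages = [page.squeeze(0) for page in np.split(yy, yy.shape[0], axis=0)]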
I have a fairly large matrix of shape (80000, 4); it contains a lot of data, most of which I don't need.
Part of my process involves calculating a pairwise distance matrix, which is O(n^2) in the number of rows, so I can reduce processing time a lot by downsampling.
I have tried using the numpy.linspace method like so:
downsample_factor = np.linspace(0, large_matrix.shape[0]-1, 20000, dtype=int)
large_matrix = large_matrix[downsample_factor, :]
In this example large_matrix has shape (80000, 4) and I'm downsampling it to around (20000, 4).
This code successfully reduces the size of large_matrix but doesn't actually free up memory.
Is there another way to free up RAM by downsampling a large array?
EDIT:
The program gets "Killed" when it runs out of RAM. I noticed that runs with large arrays were still being killed even after downsampling, whereas runs whose arrays were small to begin with were not.
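For what it's worth, a quick way to check what the downsampling code above actually does memory-wise (a small sketch using numpy's own introspection helpers):
small = large_matrix[downsample_factor, :]       # fancy indexing always returns a copy
print(np.shares_memory(small, large_matrix))     # False: the copy is independent of the original
print(small.nbytes, large_matrix.nbytes)         # the copy is ~4x smaller
large_matrix = small                             # the 80000-row block can only be freed once nothing else references it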
I'd like to read 2048 randomly chosen rows of a stored numpy matrix with 200 columns within 100 ms. So far I've tried h5py. In my case, contiguous mode works faster than chunked mode, and for various other reasons I'm sticking with the former. Writing (in a certain, more orderly way) is very fast (~3 ms); unfortunately, reading 2048 randomly chosen rows takes about 250 ms. The reading part I'm trying is as follows:
a = f['/test']
x = []
for i in range(2048):
    r = random.randint(1, 2048)
    x.append(a[[r], ...])
x = np.concatenate(x, 0)
Clearly, the speed bottleneck is accessing 'a' 2048 times, because I don't know of a one-shot way of accessing random rows; np.concatenate consumes a negligible amount of time. Since the matrix eventually reaches a size of (2048*100k, 200), I probably can't use anything other than contiguous h5py. I've tried a smaller maximum matrix size, but it didn't affect the computation time at all. For reference, the following is the entire task I'm trying to achieve as part of a deep reinforcement learning algorithm:
1. Generate a numpy array of size (2048, 200)
2. Write it onto the next available 2048 rows of an extendable (resizable) dataset of shape (None, 200); a sketch of this step follows the list
3. Randomly pick 2048 rows from the filled rows of that dataset (irrespective of the chunk generated in step 1)
4. Read the picked rows
Repeat steps 1-4 100k times (so the total dataset size becomes (2048*100k, 200))
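For concreteness, the write side of steps 1-2 might look like the sketch below (the file name is made up for illustration; note that HDF5 only allows resizing on chunked datasets, which is in tension with the contiguous layout I found faster):
import h5py
import numpy as np

with h5py.File('replay.h5', 'w') as f:
    dset = f.create_dataset('test', shape=(0, 200), maxshape=(None, 200),
                            dtype='float32', chunks=(2048, 200))
    batch = np.random.rand(2048, 200).astype('float32')   # step 1
    n = dset.shape[0]
    dset.resize(n + 2048, axis=0)                          # grow by one batch
    dset[n:n + 2048] = batch                               # step 2: write onto the new rows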
If rows can be selected more than once, I would try with:
random.choices(a, k=2048)
Otherwise, using:
random.sample(a, 2048)
Both methods will return a list of numpy arrays if a is a numpy ndarray.
Furthermore, if a is already a numpy array, why not take advantage of numpy's indexing capabilities and shorten your code to:
x = a[np.random.randint(1, 2048, 2048)]
That way a is still accessed multiple times, but it is all done in optimized C code, which should be faster.
Hope this points you in the right direction.
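Since a here is an h5py dataset rather than an in-memory array, there is also a one-shot way to do the read: h5py accepts fancy indexing with a list of row indices as long as they are unique and in increasing order. A sketch, where n is assumed to be the number of rows filled so far:
idx = np.sort(np.random.choice(n, size=2048, replace=False))  # unique, increasing row indices
x = a[idx, :]   # a single HDF5 read instead of 2048 separate ones
# if the row order within the batch matters, shuffle x in memory afterwards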
I'm testing NumPy's memmap through IPython Notebook, with the following code
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(int(5e6), int(4e4)))
As you can see, Ymap's shape is pretty large. I'm trying to fill up Ymap like a sparse matrix. I'm not using scipy.sparse matrices because I will eventually need to dot-product it with another dense matrix, which will definitely not fit into memory.
Anyways, I'm performing a very long series of indexing operations:
Ymap = np.memmap('Y.dat', dtype='float32', mode='w+', shape=(int(5e6), int(4e4)))
with open("somefile.txt", 'rb') as somefile:
    for i in xrange(int(5e6)):
        # Read a line
        line = somefile.readline()
        # For each token in the line, look up its j value
        # and assign the value 1.0 to Ymap[i, j]
        for token in line.split():
            j = some_dictionary[token]
            Ymap[i, j] = 1.0
These operations somehow quickly eat up my RAM. I thought mem-mapping was basically an out-of-core numpy.ndarray. Am I mistaken? Why is my memory usage sky-rocketing like crazy?
A (non-anonymous) mmap is a link between a file and RAM that, roughly, guarantees that when the RAM holding the mmap fills up, data will be paged out to the given file instead of to the swap disk/file, and that when you msync or munmap it, the whole region of RAM gets written out to the file. Operating systems typically follow a lazy strategy with respect to disk accesses (or an eager one with respect to RAM): data will remain in memory as long as it fits. This means a process with large mmaps will eat up as much RAM as it can/needs before spilling the rest over to disk.
So you're right that an np.memmap array is an out-of-core array, but it is one that will grab as much RAM cache as it can.
As the docs say:
Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.
There's no true magic in computers ;-) If you access very little of a giant array, a memmap gimmick will require very little RAM; if you access very much of a giant array, a memmap gimmick will require very much RAM.
One workaround that may or may not be helpful in your specific code: create new mmap objects periodically (and get rid of old ones), at logical points in your workflow. Then the amount of RAM needed should be roughly proportional to the number of array items you touch between such steps. Against that, it takes time to create and destroy new mmap objects. So it's a balancing act.
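For example, with the Y.dat memmap from the question, that periodic refresh might look like this (a sketch: flush writes out dirty pages, deleting the object lets the OS reclaim its page cache, and mode='r+' re-opens the existing file):
Ymap.flush()                                   # write any dirty pages out to Y.dat
del Ymap                                       # drop the mmap so its cached pages can be reclaimed
Ymap = np.memmap('Y.dat', dtype='float32', mode='r+',
                 shape=(int(5e6), int(4e4)))   # re-open the existing file and continue filling it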