PDF to image conversion takes an enormous amount of space - Python

I have a quick and dirty Python script that takes a PDF as input and saves the pages as an array of images (using pdf2image).
What I don't understand: 72 images take up 920MB of memory. However, if I save the images to file and then reload them, I get to barely 30-40MB (the combined size of the image files is 29MB). Does that make sense?
I also tried to dump the array using pickle, and it gets to about 3GB before it crashes with a MemoryError. I'm at a complete loss as to what is eating up so much memory...

The huge memory usage is most likely due to excessive metadata, uncompressed image data (raw color data), or a lossless image codec inside the library/tool itself.
It also depends on the image size, the number of images, etc.
On the last remark, regarding pickle: pickle is essentially a memory-dump format used by Python to preserve certain variable states. Dumping memory to a saved state on disk is quite a heavy task. Not only does Python need to convert everything into a format that enables the saved state, it must also copy all the data into a known state when saving it. It can therefore use up quite a lot of RAM and disk in the process. (The usual way around this is to chunk up the data.)
Following up on the comments: one solution would be to pass the parameter fmt="jpeg", which keeps the images in a compressed state and lowers the resource usage a bit.
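As a minimal sketch of that idea (assuming a reasonably recent pdf2image, whose convert_from_path accepts fmt, output_folder and paths_only; the PDF path is a placeholder):

import tempfile
from pdf2image import convert_from_path

# Render pages as JPEG files on disk instead of holding raw bitmaps in RAM.
with tempfile.TemporaryDirectory() as tmp:
    pages = convert_from_path("input.pdf", fmt="jpeg",
                              output_folder=tmp, paths_only=True)
    for path in pages:
        # Open each page on demand, one at a time, instead of keeping all
        # 72 uncompressed images in a list.
        ...

With output_folder set, the pages never need to live in memory all at once; paths_only just returns the file paths so you can load them lazily.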

Related

Creating large .npy files with memory efficiency

I'm trying to create very large .npy files and I'm having a bit of difficulty. For instance, I need to create a (500, 1586, 2048, 3) matrix and save it to an .npy file, and preferably I need to get it into an npz_compressed file. I also need this to be memory efficient so it can be run on low-memory systems. I've tried a few methods, but none have seemed to work so far. I've written and re-written things so many times that I don't have code snippets for everything, but I'll describe the methods as best I can, with code snippets where I can. Also, apologies for bad formatting.
Method 1: Create an ndarray with all my data in it, then use savez_compressed to export it.
This gets all my data into the array, but it's terrible for memory efficiency. I filled all 8GB of RAM, plus 5GB of swap space. I got it to save my file, but it doesn't scale, as my matrix could get significantly larger.
Method 2: Use np.memmap('file_name_to_create', mode='w+', shape=(500,1586,2048,3)) to create the large, initial .npy file, then add my data.
This method worked for getting my data in, and it's pretty memory efficient. However, I can no longer use np.load to open the file (I get errors related to pickle, regardless of whether allow_pickle is True or False), which means I can't get it into a compressed format. I'd be happy with this format if I could get it compressed, but I just can't figure it out. I'm trying to avoid using gzip if possible.
Method 3: Create a (1,1,1,1) zeros array and save it with np.save. Then try opening it with np.memmap with the same shape as before.
This runs into the same issue as method 2: I can no longer use np.load to read it back in afterwards.
Method 4: Create five [100,...] .npy files with method 1, saving them with np.save. Then read two in using np.load(mmap_mode='r+') and merge them into one large .npy file.
Creating the individual .npy files wasn't bad on memory, maybe 1GB to 1.5GB. However, I couldn't figure out how to merge the .npy files without loading an entire file into RAM. I read in other Stack Overflow answers that .npy files aren't really designed for this at all; they mentioned it would be better to use an .h5 file for this kind of 'appending'.
Those are the main methods I've used. I'm looking for feedback on whether any of these methods would work, which one would be 'best' for memory efficiency, and maybe some guidance on getting that method to work. I also wouldn't be opposed to moving to .h5 if that would be the best method; I just haven't tried it yet.
Try it on Google Colab, which can use a GPU to run it.
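For what it's worth, a rough sketch of the .h5 route mentioned in method 4, assuming h5py is available (the file name, dtype, chunk layout and the hypothetical load_frame() helper are placeholders, not part of the question):

import numpy as np
import h5py  # assumed available

shape = (500, 1586, 2048, 3)  # target shape from the question

with h5py.File("big_matrix.h5", "w") as f:
    # One chunk per frame, so each write only touches a small part of the file.
    dset = f.create_dataset("data", shape=shape, dtype=np.uint8,
                            chunks=(1,) + shape[1:], compression="lzf")
    for i in range(shape[0]):
        frame = load_frame(i)  # hypothetical loader returning one (1586, 2048, 3) array
        dset[i] = frame        # only one frame is held in RAM at a time

Unlike .npy/.npz, an HDF5 dataset can be written and re-opened incrementally, so only one frame ever needs to be in memory.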

How to save large Python numpy datasets?

I'm attempting to create an autonomous RC car and my Python program is supposed to query the live stream on a given interval and add it to a training dataset. The data I want to collect is the array of the current image from OpenCV and the current speed and angle of the car. I would then like it to be loaded into Keras for processing.
I found out that numpy.save() just saves one array to a file. What is the best/most efficient way of saving data for my needs?
As with anything regarding performance or efficiency, test it yourself. The problem with recommendations for the "best" of anything is that they might change from year to year.
First, you should determine whether this is even an issue you should be tackling. If you're not experiencing performance or storage issues, don't bother optimizing until it becomes a problem. Whatever you do, don't waste your time on premature optimization.
Next, assuming it actually is an issue, try out every method for saving to see which one yields the smallest result in the shortest amount of time. Maybe compression is the answer, but it might slow things down? Maybe pickling objects would be faster? Who knows until you've tried.
Finally, weigh the trade-offs and decide which method you can compromise on; you'll almost never have one silver-bullet solution. While you're at it, determine whether just throwing more CPU, RAM, or disk space at the problem would solve it. Cloud computing affords you a lot of headroom in those areas.
The simplest way is np.savez_compressed(). This saves any number of arrays using the same format as np.save(), but encapsulated in a standard Zip file.
If you need to be able to add more arrays to an existing file, you can do that easily, because after all the NumPy .npz format is just a Zip file. So open or create a Zip file using zipfile, and then write arrays into it using np.save(). The APIs aren't perfectly matched for this, so you can first construct an in-memory buffer (io.BytesIO, since np.save writes binary data), write into it with np.save(), then use writestr() in zipfile.
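A short sketch of both pieces (the file and array names are just examples; io.BytesIO is used because np.save writes bytes):

import io
import zipfile
import numpy as np

# Save several arrays at once into one compressed .npz file.
images = np.zeros((10, 64, 64), dtype=np.uint8)
labels = np.arange(10)
np.savez_compressed("training_data.npz", images=images, labels=labels)

# Later, append another array: an .npz is just a Zip archive, so write a new
# member with zipfile, using an in-memory buffer as the intermediate "file".
extra = np.ones(5)
buf = io.BytesIO()
np.save(buf, extra)
with zipfile.ZipFile("training_data.npz", mode="a",
                     compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("extra.npy", buf.getvalue())

data = np.load("training_data.npz")
print(sorted(data.files))  # ['extra', 'images', 'labels']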

Manipulating Large Amounts of Image Data in Python

I have a large number of images of different categories, e.g. "Cat", "Dog", "Bird". The images have some hierarchical structure, like a dict. So for example the key is the animal name and the value is a list of animal images, e.g. animalPictures[animal][index].
I want to manipulate each image (e.g. compute histogram) and then save the manipulated data in an identical corresponding structure, e.g. animalPictures['bird'][0] has its histogram stored in animalHistograms['bird'][0].
The only issue is I do not have enough memory to load all the images, perform all the manipulations, and create an additional structure of the transformed images.
Is it possible to load an image from disk, manipulate the image, and stream the data to a dict on disk? This way I could work on a per-image basis and not worry about loading everything into memory at once.
Thanks!
I would suggest using the shelve module, which provides dictionary-like persistent objects backed by a file. For the most part you can use them like ordinary dictionaries, so converting any existing code that uses dictionaries is relatively easy. There are a couple of additional methods for controlling the write-back of cached entries and for closing the object.
Basically, it lets you transparently store your Python dictionary data in an underlying file-based database. All accessed entries are also cached in memory. However, you can prevent the cache from consuming too much memory by manually calling the sync() method, which writes any modified entries back to disk and empties the cache.
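A minimal sketch of that approach, adapted to the structure in the question (the shelf name and the animal_picture_paths / load_image() / compute_histogram() helpers are hypothetical):

import shelve

with shelve.open("animal_histograms", writeback=True) as histograms:
    for animal, paths in animal_picture_paths.items():  # hypothetical {name: [paths]} dict
        histograms[animal] = []
        for path in paths:
            img = load_image(path)                       # one image in memory at a time
            histograms[animal].append(compute_histogram(img))
        histograms.sync()  # write modified entries back to disk and empty the cache

Each image is loaded, processed and discarded individually, while the resulting histogram structure lives on disk rather than in RAM.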
It is possible if you use NumPy, and especially numpy.memmap, to store the image data. That way the image data looks as if it were in memory but lives on disk via the mmap mechanism. The nice thing is that numpy.memmap arrays are no more difficult to handle than ordinary arrays.
There is some performance overhead as all memmap arrays are saved to disk. The arrays could be described as "disk-backed arrays", i.e. the data is also kept in RAM as long as possible. This means that if you access some data array very often, it is most likely in memory, and there is no disk read overhead.
So, keep your metadata in a dict in memory, but memmap your bigger arrays.
This is probably the easiest way. However, make sure you are using a 64-bit Python, as the 32-bit one runs out of address space at 2 GiB.
Of course, there are a lot of ways to compress image data. If your data compresses well, you might consider using compression to save space.
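A small sketch of that combination, keeping a plain dict of per-category memmap arrays (shapes, dtype, file names and the helper functions are assumptions):

import numpy as np

histograms = {}
for animal, paths in animal_picture_paths.items():       # hypothetical {name: [paths]} dict
    # One disk-backed array per category: one 256-bin histogram row per image.
    histograms[animal] = np.memmap(f"{animal}_hist.dat", dtype=np.float32,
                                   mode="w+", shape=(len(paths), 256))
    for i, path in enumerate(paths):
        img = load_image(path)                            # one image in memory at a time
        histograms[animal][i] = compute_histogram(img)
    histograms[animal].flush()                            # push pending writes to disk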

Python: Memory usage increments while reading TIFFArray

I have a list of "TIFFFiles", where each "TIFFFile" contains a "TIFFArray" with 60 TIFF images, each with a size of 2776x2080 pixels. The images are read as numpy.memmap objects.
I want to access all intensities of the images (shape of imgs: (60,2776,2080)). I use the following code:
for i in xrange(18):
    # get an instance of type TIFFArray from tiff_list
    tiffs = get_tiff_arrays(smp_ppx, type_subfile, tiff_list[i])
    # access all intensities from tiffs
    imgs = tiffs[:, :, :]
Even though I overwrite "tiffs" and "imgs" in each iteration, memory usage grows by 2.6 GB per step. How can I avoid the data being copied in each iteration? Is there any way the 2.6 GB of memory can be reused?
I know this is probably not an answer, but it might help anyway, and it was too long for a comment.
Some time ago I had a memory problem while reading large (>1GB) ASCII files with numpy: basically, reading a file with numpy.loadtxt used all the memory (8GB) plus some swap.
From what I've understood, if you know the size of the array to fill in advance, you can allocate it beforehand and pass it to, e.g., loadtxt. This should prevent numpy from allocating temporary objects, and it might be better memory-wise.
mmap, or similar approaches, can help improve memory usage, but I've never used them.
Edit
The problem of memory usage and release is something I wondered about while trying to solve my large-file problem. Basically I had:
import numpy as np

def read_f(fname):
    arr = np.loadtxt(fname)  # this uses a lot of memory
    # do operations on arr
    return something          # placeholder for the actual result

for f in ["verylargefile", "smallerfile", "evensmallerfile"]:
    result = read_f(f)
From the memory profiling I did, there was no memory release when loadtxt returned, nor when read_f returned and was called again with a smaller file.
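As a rough illustration of the "allocate once and reuse" idea applied to the loop in the question (get_tiff_arrays, smp_ppx, type_subfile and tiff_list come from the question; the dtype is a guess):

import numpy as np

# Allocate the destination buffer once, outside the loop.
buf = np.empty((60, 2776, 2080), dtype=np.uint16)   # dtype is an assumption

for i in range(18):
    tiffs = get_tiff_arrays(smp_ppx, type_subfile, tiff_list[i])
    buf[...] = tiffs   # element-wise copy into the existing buffer, reusing its memory
    # ... work on buf instead of creating a fresh 2.6 GB array each iteration ...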

Python: quickly loading 7 GB of text files into unicode strings

I have a large directory of text files, approximately 7 GB in total. I need to load them quickly into Python unicode strings in IPython. I have 15 GB of memory total. (I'm using EC2, so I can buy more memory if absolutely necessary.)
Simply reading the files is too slow for my purposes. I have tried copying the files to a ramdisk and then loading them from there into IPython. That speeds things up, but IPython crashes (not enough memory left over?). Here is the ramdisk setup:
mount -t tmpfs none /var/ramdisk -o size=7g
Anyone have any ideas? Basically, I'm looking for persistent in-memory Python objects. The IPython requirement precludes using IncPy: http://www.stanford.edu/~pgbovine/incpy.html .
Thanks!
There is much that is confusing here, which makes it more difficult to answer this question:
The ipython requirement. Why do you need to process such large data files from within ipython instead of a stand-alone script?
The tmpfs RAM disk. I read your question as implying that you read all of your input data into memory at once in Python. If that is the case, then python allocates its own buffers to hold all the data anyway, and the tmpfs filesystem only buys you a performance gain if you reload the data from the RAM disk many, many times.
Mentioning IncPy. If your performance issues are something you could solve with memoization, why can't you just manually implement memoization for the functions where it would help most?
So. If you actually need all the data in memory at once -- if your algorithm reprocesses the entire dataset multiple times, for example -- I would suggest looking at the mmap module. That will provide the data in raw bytes instead of unicode objects, which might entail a little more work in your algorithm (operating on the encoded data, for example), but will use a reasonable amount of memory. Reading the data into Python unicode objects all at once will require either 2x or 4x as much RAM as it occupies on disk (assuming the data is UTF-8).
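A minimal sketch of the mmap idea, assuming a single large UTF-8 file (the path and search pattern are placeholders):

import mmap

with open("/var/ramdisk/corpus.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The file is addressable as raw bytes without copying it into a Python
    # string; decode only the slices you actually need.
    idx = mm.find(b"some pattern")   # search directly on the encoded bytes
    if idx != -1:
        snippet = mm[idx:idx + 200].decode("utf-8", errors="replace")
    mm.close()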
If your algorithm simply does a single linear pass over the data (as does the Aho-Corasick algorithm you mention), then you'd be far better off just reading in a reasonably sized chunk at a time:
import codecs

with codecs.open(inpath, encoding='utf-8') as f:
    data = f.read(8192)
    while data:
        process(data)
        data = f.read(8192)
I hope this at least gets you closer.
I saw the mention of IncPy and IPython in your question, so let me plug a project of mine that goes a bit in the direction of IncPy, but works with IPython and is well-suited to large data: http://packages.python.org/joblib/
If you are storing your data in numpy arrays (strings can be stored in numpy arrays), joblib can use memmap for intermediate results and be efficient for IO.
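For instance, a small sketch with joblib's Memory cache (the cache directory, file path and loader are placeholders):

import numpy as np
from joblib import Memory

memory = Memory("./joblib_cache", mmap_mode="r")   # results cached on disk, loaded via memmap

@memory.cache
def load_corpus(path):
    # Hypothetical loader: return the file contents as a numpy byte array.
    with open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8)

# The first call does the work and caches the result; later calls (even in a
# new session) memory-map the cached array instead of recomputing it.
data = load_corpus("/var/ramdisk/big_file.txt")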
