I recently asked a question regarding how to save large Python objects to file. I had previously run into problems converting massive Python dictionaries to strings and writing them to file via write(). Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience with files this large. I wanted to know if it would be faster, or even possible, to zip this pickle file prior to storing it to memory.
You can compress the data with bzip2:
from __future__ import with_statement  # Only for Python 2.5
import bz2, json, contextlib

hugeData = {'key': {'x': 1, 'y': 2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
    json.dump(hugeData, f)
Load it like this:
from __future__ import with_statement  # Only for Python 2.5
import bz2, json, contextlib

with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
    hugeData = json.load(f)
You can also compress the data with zlib or gzip, which have pretty much the same interface. However, the compression ratios of both zlib and gzip will be lower than what bzip2 (or lzma) achieves.
Pure Python code would be extremely slow at data serialization; if you tried to create an equivalent of pickle in pure Python, you would see that it is super slow. Fortunately, the built-in modules that perform this are quite good. Apart from cPickle, there is the marshal module, which is a lot faster. But it needs a real file handle (not a file-like object). You can import marshal as pickle and see the difference. I don't think you can make a custom serializer that is much faster than this...
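As a rough illustration of that substitution (a sketch only: marshal handles just built-in types, its on-disk format is not guaranteed to be stable across Python versions, and the file name here is a placeholder):

import marshal as pickle  # swap back to cPickle/pickle to compare timings

hugeData = {'key': {'x': 1, 'y': 2}}

with open('data.marshal', 'wb') as f:
    pickle.dump(hugeData, f)

with open('data.marshal', 'rb') as f:
    restored = pickle.load(f)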
Here's an actual (not so old) serious benchmark of Python serializers
faster, or even possible, to zip this pickle file prior to [writing]
Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)
See http://docs.python.org/library/gzip.html . Basically, you create a special kind of stream with
gzip.GzipFile("output file name", "wb")
and then use it exactly like an ordinary file created with open(...) (or file(...) for that matter).
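For instance, a minimal sketch of that idea (the file name is just illustrative), pickling straight into the gzip stream so the data is compressed as it is written:

import gzip
import pickle  # cPickle on Python 2

hugeData = {'key': {'x': 1, 'y': 2}}

with gzip.GzipFile('data.pkl.gz', 'wb') as f:
    pickle.dump(hugeData, f, -1)  # highest protocol, compressed on the fly

with gzip.GzipFile('data.pkl.gz', 'rb') as f:
    restored = pickle.load(f)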
I'd just expand on phihag's answer.
When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since serialization requires an additional 1-2 times the size of the object in memory. That's true even when streaming it to a BZ2File. In my case I was even running out of swap space.
But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.
For tuples composed of strings (requires strings to contain no commas):
import ujson as json
from bz2 import BZ2File
bigdata = { ('a','b','c') : 25, ('d','e') : 13 }
bigdata = dict([(','.join(k), v) for k, v in bigdata.viewitems()])
f = BZ2File('filename.json.bz2',mode='wb')
json.dump(bigdata,f)
f.close()
To re-compose the tuples:
bigdata = dict([(tuple(k.split(',')),v) for k,v in bigdata.viewitems()])
Alternatively if e.g. your keys are 2-tuples of integers:
bigdata2 = { (1,2): 1.2, (2,3): 3.4}
bigdata2 = dict([('%d,%d' % k, v) for k, v in bigdata2.viewitems()])
# ... save, load ...
bigdata2 = dict([(tuple(map(int,k.split(','))),v) for k,v in bigdata2.viewitems()])
Another advantage of this approach over pickle is that JSON appears to compress significantly better than pickle when using bzip2 compression.
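If you want to sanity-check that claim on your own data, a quick, rough comparison of the two compressed encodings could look like this (the sample dict below is just a stand-in):

import bz2, json, pickle

sample = {'%d,%d' % (i, i + 1): float(i) for i in range(10000)}  # stand-in data

json_size = len(bz2.compress(json.dumps(sample).encode('utf8')))
pickle_size = len(bz2.compress(pickle.dumps(sample, -1)))
print(json_size, pickle_size)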
Look at Google's Protocol Buffers. While they are not designed out of the box for large files such as audio or video streams, they do well with object serialization of the kind you have, because that is what they were designed for. Practice shows that some day you may need to update the structure of your files, and Protocol Buffers will handle that. They are also highly optimized for compact encoding and speed. And you're not tied to Python: Java and C++ are well supported, among other languages.
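As a rough sketch of what that could look like (the schema, the generated module name records_pb2, and the file names are all hypothetical; the .proto file must first be compiled with protoc):

# records.proto (hypothetical), compiled with: protoc --python_out=. records.proto
#
#   syntax = "proto3";
#   message Record {
#     map<string, double> values = 1;
#   }

import records_pb2  # generated module; name is an assumption

record = records_pb2.Record()
record.values['x'] = 1.0
record.values['y'] = 2.0

with open('record.pb', 'wb') as f:
    f.write(record.SerializeToString())   # compact binary encoding

with open('record.pb', 'rb') as f:
    restored = records_pb2.Record.FromString(f.read())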
I have some data types I need to write:
a. A list of numpy arrays, e.g. [ndarray, ndarray, ndarray] of different sizes.
b. Any arbitrary numpy array, e.g. np.zeros((5,6)), np.random.randn(76,2) and so on.
c. Any other future datatype that hasn't occurred to me yet.
Requirements:
I need a single function that can save all of those data types, with no type-specific handling, and with future compatibility for type c above.
I also need the output file dumped in a human-readable format.
So far, I have only been able to achieve the first requirement, with either YAML or pickle, both of which produce output that is effectively binary, i.e. not human-readable.
@staticmethod
def _read_with_yaml(path):
    with open(path, 'r') as stream:
        return yaml.load(stream)

@staticmethod
def _write_with_yaml(path, obj):
    with io.open(path, 'w+', encoding='utf8') as outfile:
        yaml.dump(obj, outfile, default_flow_style=False, allow_unicode=True)
This example code outputs non-human-readable files, but works for the data types I have.
Is there a way to achieve both requirements?
No, your requirements cannot be satisfied.
You already have one function, yaml.dump(), that saves all those data types. As you noticed, it doesn't do so in a very readable way for numpy data structures. This is because numpy does not provide dumping routines for its special data structures, so YAML falls back to the not-so-readable default !!python/... tagged dump. Now you (or the YAML or numpy package maintainers) could provide special routines for those objects that dump in a more readable format, so that case could be covered: you can make the representer in your YAML library more intelligent and get more readable output for numpy data structures without touching the numpy classes.
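For example, a minimal sketch of that representer idea with PyYAML (note it trades the exact dtype/shape round-trip for readability by dumping arrays as plain nested lists):

import numpy as np
import yaml

def ndarray_representer(dumper, array):
    # Represent an ndarray as a plain nested YAML list of Python scalars.
    return dumper.represent_list(array.tolist())

yaml.add_representer(np.ndarray, ndarray_representer)

print(yaml.dump({'weights': np.zeros((2, 3))}, default_flow_style=False))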
But you want this for all future datatypes, and IMO a variation of Gödel's theorem applies: even if the YAML library is extended so that it covers all known cases and dumps them in a readable way, there will always be new datastructures, especially in C based extensions (like Numpy), that cannot be represented in a readable way without extra work.
So because of your
Any other future unknown datatype that hasn't occurred to me yet.
premise, this is not just a lot of difficult work, but impossible.
I have a very large python shelve object (6GB on disk). I want to be able to move it to another machine, and since shelves are not portable, I wanted to cPickle it. To do that, I first have to convert it to a dict.
For some reason, when I do dict(myShelf) the ipython process spikes up to 32GB of memory (all my machine has) and then seems to hang (or maybe just take a really long time).
Can someone explain this? And perhaps offer a potential workaround?
edit: using Python 2.7
From my experience, I'd expect pickling to be even more of a memory hog than what you've done so far. However, creating a dict loads every key and value in the shelf into memory at once, and you shouldn't assume that because your shelf is 6GB on disk, it will only be 6GB in memory. For example:
>>> import sys, pickle
>>> sys.getsizeof(1)
24
>>> len(pickle.dumps(1))
4
>>> len(pickle.dumps(1, -1))
5
So a very small integer occupies 5-6 times more memory as a Python int object (on my machine) than it does once pickled.
As for the workaround: you can write more than one pickled object to a file. So don't convert the shelf to a dict, just write a long sequence of keys and values to your file, then read an equally long sequence of keys and values on the other side to put into your new shelf. That way you only need one key/value pair in memory at a time. Something like this:
Write:
with open('myshelf.pkl', 'wb') as outfile:
    pickle.dump(len(myShelf), outfile)
    for p in myShelf.iteritems():
        pickle.dump(p, outfile)
Read:
with open('myshelf.pkl', 'rb') as infile:
    for _ in xrange(pickle.load(infile)):
        k, v = pickle.load(infile)
        myShelf[k] = v
I think you don't actually need to store the length, you could just keep reading until pickle.load throws an exception indicating it's run out of file.
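A sketch of that variant, relying on pickle.load raising EOFError at the end of the file:

with open('myshelf.pkl', 'rb') as infile:
    while True:
        try:
            k, v = pickle.load(infile)
        except EOFError:
            break
        myShelf[k] = v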
I have a question. It may be an easy one, but I could not come up with a good approach. I have two Python programs. The first produces two outputs: one is a huge list (containing thousands of other lists) and the other is a simple CSV file for Weka. I need to store this list (the first output) somehow so it can be used as input to the other program later. I cannot just pass it straight to the second program, because once the first program is done, Weka also has to produce a new output for the second program. So the second program has to wait for the outputs of both the first program and Weka.
The problem is that the output list consists of lots of lists containing numerical values. A simple example could be:
list1 = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
If I write this to a txt file, then when I try to read it back, all the values come out as strings. Is there any easy and fast way (since the data is big) to read all the values back as integers? Or maybe, in the first place, a way to write them as integers?
I think this is a good case for the pickle module.
To save data:
import pickle
lst = [[1,5,7],[14,3,27], [19,12,0], [23,8,17], [12,7]]
pickle.dump(lst, open('data.pkl', 'wb'))
To read data from saved file:
import pickle
lst = pickle.load(open('data.pkl', 'rb'))
From documentation:
The pickle module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream is converted back into an object hierarchy. Pickling (and unpickling) is alternatively known as “serialization”, “marshalling,” [1] or “flattening”, however, to avoid confusion, the terms used here are “pickling” and “unpickling”.
There's also the faster cPickle module:
To save data:
from cPickle import Pickler
p = Pickler(open('data2.pkl', 'wb'))
p.dump(lst)
To read data from saved file:
from cPickle import Unpickler
up = Unpickler(open('data2.pkl', 'rb'))
lst = up.load()
How about pickling the list output rather than outputting it as a plaintext representation? Have a look at the documentation for your version: it's basically a way to write Python objects to file, which you can then read from Python at any point to get identical objects.
Once you have opened the file you want to output to, the difference in the output code is quite minor, e.g.
import pickle
my_list = [[1, 2], [134, 76], [798, 5, 2]]
with open('outputfile.pkl', 'wb') as output:
    pickle.dump(my_list, output, -1)
And then just use the following way to read it in from your second program:
import pickle
my_list = pickle.load(open('outputfile.pkl', 'rb'))
I have a Python script (script 1) which generates a large Python dictionary. This dictionary has to be read by another script (script 2).
Could anyone suggest the best way to write the Python dictionary generated by script 1 so that it can be read by script 2?
In the past I have used cPickle to write and read such large dictionaries.
Is there a better way to do this?
shelve will give you access to each item separately, instead of requiring you to serialize and deserialize the entire dictionary each time.
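For illustration, a minimal sketch (the file name and dictionary are placeholders; note that shelve keys must be strings):

import shelve

# Script 1: store each item under its own key.
db = shelve.open('big_dict.shelf')
for key, value in my_dict.items():
    db[key] = value
db.close()

# Script 2: open the shelf and load only the entries you need, on demand.
db = shelve.open('big_dict.shelf')
value = db['some_key']
db.close()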
If you want your dictionary to be readable by different types of scripts (i.e. not just Python), JSON is a good option as well.
It's not as fast as shelve, but it's easy to use and quite readable to the human eye.
import json

with open("/tmp/test.json", "w") as out_handle:
    json.dump(my_dict, out_handle)  # save dictionary

with open("/tmp/test.json", "r") as in_handle:
    my_dict = json.load(in_handle)  # load dictionary
I have a program that outputs some lists that I want to store to work with later. For example, suppose it outputs a list of student names and another list of their midterm scores. I can store this output in the following two ways:
Standard File Output way:
newFile = open('trialWrite1.py','w')
newFile.write(str(firstNames))
newFile.write(str(midterm1Scores))
newFile.close()
The pickle way:
newFile = open('trialWrite2.txt','w')
cPickle.dump(firstNames, newFile)
cPickle.dump(midterm1Scores, newFile)
newFile.close()
Which technique is better or preferred? Is there an advantage of using one over the other?
Thanks
I think the csv module might be a good fit here, since CSV is a standard format that can be both read and written by Python (and many other languages), and it's also human-readable. Usage could be as simple as
import csv

with open('trialWrite1.py', 'wb') as fileobj:
    newFile = csv.writer(fileobj)
    newFile.writerow(firstNames)
    newFile.writerow(midterm1Scores)
However, it'd probably make more sense to write one student per row, including their name and score. That can be done like this:
import csv
from itertools import izip

with open('trialWrite1.py', 'wb') as fileobj:
    newFile = csv.writer(fileobj)
    for row in izip(firstNames, midterm1Scores):
        newFile.writerow(row)
pickle is more generic -- it allows you to dump many different kinds of objects to a file for later use. The downside is that the interim storage is not very human-readable, and not in a standard format.
Writing strings to a file, on the other hand, is a much better interface to other activities or code. But it comes at the cost of having to parse the text back into your Python object again.
Both are fine for this simple (list?) data; I would use write( firstNames ) simply because there's no need to use pickle. In general, how to persist your data to the filesystem depends on the data!
For instance, pickle will happily pickle functions, which you can't do by simply writing the string representations.
>>> data = range
>>> data
<class 'range'>
>>> foo = open( ..., "wb" )   # some file opened for binary writing
>>> pickle.dump( data, foo )
>>> foo.close()
>>> pickle.load( open( ..., "rb" ) )
<class 'range'>
For a completely different approach, consider that Python ships with SQLite. You could store your data in a SQL database without adding any third-party dependencies.
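A rough sketch of that alternative (the table and column names are just illustrative), using the standard-library sqlite3 module:

import sqlite3

conn = sqlite3.connect('grades.db')
conn.execute('CREATE TABLE IF NOT EXISTS scores (name TEXT, midterm1 REAL)')
conn.executemany('INSERT INTO scores VALUES (?, ?)',
                 zip(firstNames, midterm1Scores))
conn.commit()

# Later, possibly from a different program: read the rows back.
rows = conn.execute('SELECT name, midterm1 FROM scores').fetchall()
conn.close()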