Best way to store a large Python dictionary in a file

I have a Python script (script 1) which generates a large Python dictionary. This dictionary has to be read by another script (script 2).
Could anyone suggest the best way to write the dictionary generated by script 1 so that it can be read by script 2?
In the past I have used cPickle to write and read such large dictionaries.
Is there a better way to do this?

shelve will give you access to each item separately, instead of requiring you to serialize and deserialize the entire dictionary each time.
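For illustration, a minimal sketch of how that could look, assuming my_dict is the dictionary from script 1 (the filename "my_dict.db" and the key names are placeholders; shelve requires string keys):

import shelve

# script 1: write each entry separately
with shelve.open("my_dict.db") as db:
    for key, value in my_dict.items():
        db[key] = value        # each value is pickled on its own

# script 2: read back only the entries you need
with shelve.open("my_dict.db") as db:
    value = db["some_key"]     # loads just this one entry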

If you want your dictionary to be readable by different types of scripts (i.e. not just Python), JSON is a good option as well.
It's not as fast as shelve, but it's easy to use and quite readable to the human eye.
import json

with open("/tmp/test.json", "w") as out_handle:
    json.dump(my_dict, out_handle)  # save dictionary

with open("/tmp/test.json", "r") as in_handle:
    my_dict = json.load(in_handle)  # load dictionary

Related

Best way to store dictionary in a file and load it partially?

What is the best way to store a dictionary of strings in a file (since it is big) and load it partially in Python? "Dictionary of strings" here means the keys are strings and each value is a list of strings.
The dictionary is stored in appended form so I can check keys: if a key is already present, don't update it; otherwise update it. The keys are then used for post-processing.
Usually a dictionary is stored in JSON.
I'll leave here a link:
Convert Python dictionary to JSON array
You could simply write the dictionary to a text file, and then create a new dictionary that only pulls certain keys and values from that text file.
But you're probably best off exploring the json module.
Here's a straightforward way to write a dict called "sample" to a file with the json module:
import json

with open('result.json', 'w') as fp:
    json.dump(sample, fp)
On the loading side, we'd need to know more about how you want to choose which keys to load from the JSON file.
The above answers are great, but I hate using JSON, and I have had issues with pickle corrupting my data before, so what I do is use NumPy's save and load.
To save: np.save(filename, big_dict)
To load: big_dict = np.load(filename, allow_pickle=True).item() (newer NumPy versions require allow_pickle=True when loading object arrays).
Really simple and it works well. As far as loading partially goes, you could always split the dictionary into multiple smaller dictionaries and save them as individual files. Maybe not a very concrete solution, but it could work.
To split the dictionary you could do something like this:

import numpy as np

temp_dict = {}
for i, k in enumerate(big_dict.keys()):
    if i % 1000 == 0 and i > 0:
        # every 1000 keys, write the accumulated chunk out and start a new one
        np.save("records-" + str(i - 1000) + "-" + str(i) + ".npy", temp_dict)
        temp_dict = {}
    temp_dict[k] = big_dict[k]
np.save("records-final.npy", temp_dict)  # don't forget the last partial chunk
Then for loading, just do something like this:

import glob
import numpy as np

big_dict = {}
all_files = glob.glob("records-*.npy")
for f in all_files:
    chunk = np.load(f, allow_pickle=True).item()
    big_dict.update(chunk)
If this is for some sort of database-type use, then save yourself the headache and use TinyDB. It uses the JSON format when saving to disk and will provide the "partial" loading that you're looking for.
I only recommend TinyDB because it seems to be the closest to what you're looking to achieve; if it isn't to your taste, try googling for other databases, there are tons of them out there!
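For example, a rough sketch of basic TinyDB usage (the file name and field names here are made up):

from tinydb import TinyDB, Query

db = TinyDB("my_data.json")                    # stored on disk as JSON
db.insert({"keyword": "foo", "values": ["a", "b"]})

Q = Query()
matches = db.search(Q.keyword == "foo")        # fetch only the matching records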

How can I reliably access a single key-value pair from a JSON file that's too large to load into memory?

I am trying to retrieve the names of the people from my file. The file size is 201GB
import json

with open("D:/dns.json", "r") as fh:
    for l in fh:
        d = json.loads(l)
        print(d["name"])
Whenever I try to run this program on Windows, I encounter a MemoryError saying there is insufficient memory.
Is there a reliable way to parse a single key-value pair without loading the whole file? I have considered reading the file in chunks, but I don't know how to start.
Here is a sample: test.json
Each record is on its own line, separated by a newline. Hope this helps.
You may want to give ijson a try : https://pypi.python.org/pypi/ijson
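For example, if the file is one big JSON document rather than one object per line, a rough sketch with ijson could look like this (assuming the top level is an array of objects with a "name" field, as in the question):

import ijson

with open("D:/dns.json", "rb") as fh:
    # "item" iterates over the elements of a top-level JSON array one at a time
    for record in ijson.items(fh, "item"):
        print(record["name"])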
Unfortunately there is no guarantee that each line of a JSON file will make any sense to the parser on its own. I'm afraid JSON was never intended for multi-gigabyte data exchange, precisely because each JSON file contains an integral data structure. In the XML world people have written incremental event-driven (SAX-based) parsers. I'm not aware of such a library for JSON.

Fastest way to store large files in Python

I recently asked a question regarding how to save large Python objects to file. I had previously run into problems converting massive Python dictionaries into strings and writing them to file via write(). Now I am using pickle. Although it works, the files are incredibly large (> 5 GB). I have little experience with files this large. I wanted to know if it would be faster, or even possible, to zip this pickle file prior to writing it to disk.
You can compress the data with bzip2:
from __future__ import with_statement  # Only for Python 2.5
import bz2, json, contextlib

hugeData = {'key': {'x': 1, 'y': 2}}
with contextlib.closing(bz2.BZ2File('data.json.bz2', 'wb')) as f:
    json.dump(hugeData, f)
Load it like this:
from __future__ import with_statement  # Only for Python 2.5
import bz2, json, contextlib

with contextlib.closing(bz2.BZ2File('data.json.bz2', 'rb')) as f:
    hugeData = json.load(f)
You can also compress the data using zlib or gzip with pretty much the same interface. However, both zlib and gzip's compression rates will be lower than the one achieved with bzip2 (or lzma).
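On Python 3, the same idea is a bit shorter, since bz2.open (and gzip.open) accept text modes that json can write to directly; a minimal sketch:

import bz2
import json

hugeData = {'key': {'x': 1, 'y': 2}}

with bz2.open('data.json.bz2', 'wt', encoding='utf-8') as f:
    json.dump(hugeData, f)

with bz2.open('data.json.bz2', 'rt', encoding='utf-8') as f:
    hugeData = json.load(f)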
Pure Python code would be extremely slow when it comes to implementing data serialization.
If you tried to create an equivalent of pickle in pure Python, you'd see that it is super slow.
Fortunately the built-in modules which perform that are quite good.
Apart from cPickle, there is the marshal module, which is a lot faster.
But it needs a real file handle (not a file-like object).
You can import marshal as Pickle and see the difference.
I don't think you can make a custom serializer that is a lot faster than this...
Here is an actual (not so old) serious benchmark of Python serializers.
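As a minimal sketch of the marshal round trip (keep in mind that marshal only handles built-in types, and its format is not guaranteed to be stable across Python versions):

import marshal

data = {"names": ["Alice", "Bob"], "scores": [91, 78]}

with open("data.marshal", "wb") as f:   # marshal needs a binary file
    marshal.dump(data, f)

with open("data.marshal", "rb") as f:
    restored = marshal.load(f)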
faster, or even possible, to zip this pickle file prior to [writing]
Of course it's possible, but there's no reason to try to make an explicit zipped copy in memory (it might not fit!) before writing it, when you can automatically cause it to be zipped as it is written, with built-in standard library functionality ;)
See http://docs.python.org/library/gzip.html . Basically, you create a special kind of stream with
gzip.GzipFile("output file name", "wb")
and then use it exactly like an ordinary file created with open(...) (or file(...) for that matter).
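Putting the two together might look like this (a sketch; the file name is arbitrary):

import gzip
import pickle

hugeData = {'key': {'x': 1, 'y': 2}}

# pickle writes bytes, which GzipFile compresses transparently on the way to disk
with gzip.GzipFile("data.pkl.gz", "wb") as f:
    pickle.dump(hugeData, f)

with gzip.GzipFile("data.pkl.gz", "rb") as f:
    hugeData = pickle.load(f)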
I'd just expand on phihag's answer.
When trying to serialize an object approaching the size of RAM, pickle/cPickle should be avoided, since it requires additional memory of 1-2 times the size of the object in order to serialize. That's true even when streaming it to BZ2File. In my case I was even running out of swap space.
But the problem with JSON (and similarly with HDF files as mentioned in the linked article) is that it cannot serialize tuples, which in my data are used as keys to dicts. There is no great solution for this; the best I could find was to convert tuples to strings, which requires some memory of its own, but much less than pickle. Nowadays, you can also use the ujson library, which is much faster than the json library.
For tuples composed of strings (requires strings to contain no commas):
import ujson as json
from bz2 import BZ2File
bigdata = { ('a','b','c') : 25, ('d','e') : 13 }
bigdata = dict([(','.join(k), v) for k, v in bigdata.viewitems()])
f = BZ2File('filename.json.bz2',mode='wb')
json.dump(bigdata,f)
f.close()
To re-compose the tuples:
bigdata = dict([(tuple(k.split(',')),v) for k,v in bigdata.viewitems()])
Alternatively if e.g. your keys are 2-tuples of integers:
bigdata2 = { (1,2): 1.2, (2,3): 3.4}
bigdata2 = dict([('%d,%d' % k, v) for k, v in bigdata2.viewitems()])
# ... save, load ...
bigdata2 = dict([(tuple(map(int,k.split(','))),v) for k,v in bigdata2.viewitems()])
Another advantage of this approach over pickle is that JSON appears to compress significantly better than pickle when using bzip2 compression.
Look at Google's Protocol Buffers. Although they are not designed for large files out of the box, such as audio or video files, they do well with object serialization as in your case, because that is what they were designed for. Practice shows that some day you may need to update the structure of your files, and Protocol Buffers will handle that. They are also highly optimized for compression and speed. And you're not tied to Python; Java and C++ are well supported.

Editing Pickled Data

I need to save a complex piece of data:
list = ["Animals", {"Cats":4, "Dogs":5}, {"x":[], "y":[]}]
I was planning on saving several of these lists within the same file, and I was also planning on using the pickle module to save this data. I also want to be able to access the pickled data and add items to the lists in the 2nd dictionary. So after I unpickle the data and edit, the list might look like this:
list = ["Animals", {"Cats":4, "Dogs":5}, {"x"=[1, 2, 3], "y":[]}]
Preferably, I want to be able to save this list (using pickle) in the same file I took that piece of data from. However, if I simply re-pickle the data to the same file (let's say I originally saved it to "File"), I'll end up with two copies of the same list in that file:
a = open("File", "ab")
pickle.dump(list, a)
a.close()
Is there a way to replace the edited list in the file using pickle rather than adding a second (updated) copy? Or, is there another method I should consider for saving this data?
I think you want the shelve module. It creates a file (uses pickle under the hood) that contains the contents of a variable accessible by key (think persistent dictionary).
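A rough sketch of that idea, assuming you store each list under a key of your choosing (the key "animals" and the file name are made up):

import shelve

record = ["Animals", {"Cats": 4, "Dogs": 5}, {"x": [], "y": []}]

# store it once
with shelve.open("mydata") as db:
    db["animals"] = record

# later: pull it out, edit it, and write it back under the same key
with shelve.open("mydata") as db:
    item = db["animals"]
    item[2]["x"] += [1, 2, 3]
    db["animals"] = item   # replaces the old copy instead of appending a second one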
You could open the file for writing instead of appending; then the changes would overwrite the previous data. This is, however, a problem if there is more data stored in that file. If what you really want is to selectively replace data in a pickled file, I'm afraid this won't work with pickle. If this is a common operation, check whether something like an sqlite database helps you to this end.

Pickle vs output to a file in python

I have a program that outputs some lists that I want to store to work with later. For example, suppose it outputs a list of student names and another list of their midterm scores. I can store this output in the following two ways:
Standard File Output way:
newFile = open('trialWrite1.py','w')
newFile.write(str(firstNames))
newFile.write(str(midterm1Scores))
newFile.close()
The pickle way:
newFile = open('trialWrite2.txt','w')
cPickle.dump(firstNames, newFile)
cPickle.dump(midterm1Scores, newFile)
newFile.close()
Which technique is better or preferred? Is there an advantage of using one over the other?
Thanks
I think the csv module might be a good fit here, since CSV is a standard format that can be both read and written by Python (and many other languages), and it's also human-readable. Usage could be as simple as
import csv

with open('trialWrite1.py', 'wb') as fileobj:
    newFile = csv.writer(fileobj)
    newFile.writerow(firstNames)
    newFile.writerow(midterm1Scores)
However, it'd probably make more sense to write one student per row, including their name and score. That can be done like this:
from itertools import izip
with open('trialWrite1.py', 'wb') as fileobj:
    newFile = csv.writer(fileobj)
    for row in izip(firstNames, midterm1Scores):
        newFile.writerow(row)
pickle is more generic -- it allows you to dump many different kinds of objects to a file for later use. The downside is that the interim storage is not very human-readable, and not in a standard format.
Writing strings to a file, on the other hand, is a much better interface to other activities or code. But it comes at the cost of having to parse the text back into your Python object again.
Both are fine for this simple (list?) data; I would use write( firstNames ) simply because there's no need to use pickle. In general, how to persist your data to the filesystem depends on the data!
For instance, pickle will happily pickle functions, which you can't do by simply writing the string representations.
>>> data = range
>>> data
<class 'range'>
>>> pickle.dump( data, foo )
# stuff
>>> pickle.load( open( ..., "rb" ) )
<class 'range'>
For a completely different approach, consider that Python ships with SQLite. You could store your data in a SQL database without adding any third-party dependencies.
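As a rough illustration (the table and column names are made up, and firstNames/midterm1Scores are the lists from the question), the standard sqlite3 module could hold the same data:

import sqlite3

conn = sqlite3.connect("students.db")
conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, midterm1 REAL)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 zip(firstNames, midterm1Scores))
conn.commit()

# later: read back selected rows without loading everything
for name, score in conn.execute("SELECT name, midterm1 FROM scores WHERE midterm1 >= 90"):
    print(name, score)
conn.close()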
