Saving dictionaries to file (numpy and Python 2/3 friendly) - python

I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure, that may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. Not only that, I want it to store numpy arrays space-optimized and play nice between Python 2 and 3.
Below are methods I know are out there. My question is what is missing from this list and is there an alternative that dodges all my deal-breakers?
Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot)
Numpy's save/savez/load (deal-breaker: Incompatible format across Python 2/3)
PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
Using PyTables manually (deal-breaker: I want this for constantly changing research code, so it's really convenient to be able to dump dictionaries to files by calling a single function)
The PyTables replacement of numpy.savez is promising, since I like the idea of using hdf5 and it compresses the numpy arrays really efficiently, which is a big plus. However, it does not take any type of dictionary structure.
Lately, what I've been doing is to use something similar to the PyTables replacement, but enhancing it to be able to store any type of entries. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, which is a bit awkward (and ambiguous to actual length-1 arrays), even though I set chunksize to 1 so it doesn't take up that much space.
Is there something like that already out there?

After asking this two years ago, I started coding my own HDF5-based replacement for pickle. It has since matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for.

I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.
They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.
import warnings
import tables
import cPickle

def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            # Remove the existing group, then create it afresh
            pathstr = parent._v_pathname + '/' + groupname
            f.remove_node(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            # None cannot be stored as an array, so use a sentinel string
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g

def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries, otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB)
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print "Skipping item \"%s\"..." % child._v_pathname
                return False
        else:
            return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    # Map the sentinel string back to None
                    if isinstance(item, str):
                        if item == '_None':
                            item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
    return outdict
It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.

I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:
d = {'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]}
First I tried to save it directly to a memmap:
f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening since it loses the memory pointer
f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = d
# CRASHES when reopening for the same reason
What worked was converting the dictionary to a string:
f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(d)
This works, and afterwards you can eval(f[0]) to get the value back. Only do that with files you wrote yourself, since eval executes whatever the string contains.
I do not know what advantage this approach has over the others, but it deserves a closer look.

I absolutely recommend a python object database like ZODB. It seems pretty well suited for your situation, considering you store objects (literally whatever you like) to a dictionary - this means you can store dictionaries inside dictionaries. I've used it in a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With this, they'll be able to read it in, and perform any queries they wish, and modify their own local copies. If you wish to have multiple programs simultaneously accessing the same database, I'd make sure to look at ZEO.
Just a silly example of how to get started:
from ZODB import DB
from ZODB.FileStorage import FileStorage
from ZODB.PersistentMapping import PersistentMapping
import transaction
from persistent import Persistent
from persistent.dict import PersistentDict
from persistent.list import PersistentList
# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
# Define and populate the structure.
root['Vehicle'] = PersistentDict() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentDict() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here
root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentDict() # Object 2 - also a dictionary
# more attributes here
# add as many objects with as many characteristics as you like.
# committing changes; up until this point things can be rolled back
transaction.commit()
Once the database is created it's very easy to use. Since it's an object database (a dictionary), you can access objects very easily:
# after it's opened (lines from the very beginning, up to and including root = connection.root())
>>> root['Vehicle']['Tesla Model S']['range']
'208 miles'
You can also display all of the keys (and do all other standard dictionary things you might want to do):
>>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']
The last thing I want to mention is that keys can be changed (see "Changing the key value in python dictionary"). Values can also be changed, so if your research results change because you changed your method, you don't have to rebuild the entire database from scratch (especially if everything else is still okay). Be careful with both operations. I put safety measures in my database code to make sure I'm aware of any attempt to overwrite keys or values.
** ADDED **
# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()
# insert into definition/population section
np.save(outfile, np.linspace(-1, 1, 10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile
# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>
outfile.seek(0)  # simulate closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])
>>> print A
array([-1. , -0.99979998, -0.99959996, ..., 0.99959996,
0.99979998, 1. ])
You could also use numpy.savez() to save multiple numpy arrays in this exact same way, or numpy.savez_compressed() if you want them compressed.

This is not a direct answer, but you may also be interested in JSON. Have a look at section 13.10, "Serializing Datatypes Unsupported by JSON". It shows how to extend the format for unsupported types.
The whole chapter from "Dive into Python 3" by Mark Pilgrim is definitely a good read, at least for background.
Update: Possibly an unrelated idea, but I have read somewhere that one of the reasons XML was finally adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a less space-efficient format and compress it with zip or another well-known algorithm. Using a well-known algorithm helps when you need to debug: you can unzip the file and read the text by eye.
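To make the JSON suggestion concrete, here is a minimal Python 3 sketch that extends JSON for two types it does not support natively and writes the result through gzip. The "__type__" tag and the function names are my own assumptions, not part of any standard, and numpy arrays would need a handler of their own.

```python
import gzip
import json

def encode_extra(obj):
    # json calls this only for objects it cannot serialize natively.
    if isinstance(obj, bytes):
        return {"__type__": "bytes", "hex": obj.hex()}
    if isinstance(obj, complex):
        return {"__type__": "complex", "real": obj.real, "imag": obj.imag}
    raise TypeError("Cannot serialize %r" % type(obj))

def decode_extra(dct):
    # json calls this for every decoded object; rebuild the tagged ones.
    kind = dct.get("__type__")
    if kind == "bytes":
        return bytes.fromhex(dct["hex"])
    if kind == "complex":
        return complex(dct["real"], dct["imag"])
    return dct

def save(path, data):
    # Text-mode gzip keeps the payload readable after a plain gunzip.
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(data, fh, default=encode_extra)

def load(path):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh, object_hook=decode_extra)
```

Nested dictionaries need no special handling, because both hooks are applied at every level of the structure.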


Key 'boot_num' is not recognized when being interpreted from a .JSON file

Currently, I am working on a Boot Sequence in Python for a larger project. For this specific part of the sequence, I need to access a .JSON file (specs.json) and establish it as a dictionary in the main program. I then need to take a value from the .JSON file and add 1 to it, using its key to find the value. Once that's done, I need to push the changes to the .JSON file. Yet, every time I run the code below, I get the error:
bootNum = spcInfDat['boot_num']
KeyError: 'boot_num'
Here's the code I currently have:
(Note: I'm using the Python json library, and have imported dumps, dump, and load.)
# Opening of the JSON files
spcInf = open('mki/data/json/specs.json',) # .JSON file that contains the current system's specifications. Not quite needed, but it may make a nice reference?
spcInfDat = load(spcInf)
This code is later followed by this, where I attempt to assign the value to a variable by using its dictionary key (the for statement was a debug statement, so I could see the key):
for i in spcInfDat['spec']:
print(CBL + str(i) + CEN)
# Locating and increasing the value of bootNum.
bootNum = spcInfDat['boot_num']
bootNum = bootNum + 1
(Another Note: CBL and CEN are just variables I use to colour text I send to the terminal.)
This is the interior of specs.json:
"spec": [
I'm relatively new to .JSON files and to the Python json library; my only experience with them comes from some GeeksforGeeks tutorials I found. There is a rather good chance that I just don't know how .JSON files work in conjunction with the library, but I figured it would still be worth a shot to ask here. The GeeksforGeeks tutorial had no documentation about this, and I know very little about how this works, so I'm lost. I've tried searching here and have found nothing.
Issue Number 2
Now, the prior part works. But, when I attempt to run the code on the following lines:
# Changing the values of specDict.
print(CBL + "Changing values of specDict... 50%" + CEN)
specDict ={
# Writing the product of makeSpec to `specs.json`.
print(CBL + "Writing makeSpec() result to `specs.json`... 75%" + CEN)
jsonobj = dumps(specDict, indent = 4)
with open('mki/data/json/specs.json', "w") as outfile:
dump(jsonobj, outfile)
I get the error:
TypeError: Object of type builtin_function_or_method is not JSON serializable.
Is there a chance that I set up my dictionary incorrectly, or am I using the dump function incorrectly?
You can show the data using:
print(spcInfDat)
{'spec': [{'os': 'name', 'os_type': 'getwindowsversion', 'lang': 'en', 'cpu_amt': 'cpu_count', 'storage_amt': 'unk', 'boot_num': 1}]}
This shows it to be a dictionary whose single entry 'spec' holds a list, whose zeroth element is a sub-dictionary, whose 'boot_num' entry is an integer.
So what you are looking for is
boot_num = spcInfDat['spec'][0]['boot_num']
and note that the value obtained this way is already an integer. str() is not necessary.
It's also good practice to guard against file format errors so the program handles them gracefully.
try:
    boot_num = spcInfDat['spec'][0]['boot_num']
except (KeyError, IndexError):
    print('Database is corrupt')
Issue Number 2
"Not serializable" means there is something somewhere in your data structure that is not an accepted type and can't be converted to a JSON string.
json.dump() only processes certain types such as strings, dictionaries, and integers. That includes all of the objects that are nested within sub-dictionaries, sub-arrays, etc. See documentation for json.JSONEncoder for a complete list of allowable types.
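As an illustration of how that error can arise, here is a small sketch. It assumes (I cannot see your specDict) that a function such as os.cpu_count was stored without being called; bad_spec, good_spec and write_spec are made-up names for the example.

```python
import json
import os

# Storing the function itself (no parentheses) is what triggers the TypeError.
bad_spec = {"spec": [{"cpu_amt": os.cpu_count, "boot_num": 1}]}
try:
    json.dumps(bad_spec)
except TypeError as exc:
    error_message = str(exc)

# Calling the function stores its return value, which is serializable.
good_spec = {"spec": [{"cpu_amt": os.cpu_count(), "boot_num": 1}]}
good_spec["spec"][0]["boot_num"] += 1

def write_spec(path, spec):
    # json.dump serializes straight to the file. Passing it the output of
    # json.dumps would instead write one JSON string containing JSON text.
    with open(path, "w") as outfile:
        json.dump(spec, outfile, indent=4)
```

If this matches your case, the fix is to call each function when building the dictionary, and to pass the dictionary itself to dump rather than the string returned by dumps.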

How to achieve big growing Python lists backed by a file and not loaded into memory at any point

I have a Python script that needs to maintain some values as a list and append to the list every n seconds for an indefinite period, until the user quits the script. I want the list to be file-backed, so that I can append to it without loading the list contents into memory; the only part in memory should be the value that is being appended.
I tried shelve, but after running some tests I found that at some point I have to load the existing persisted list into memory in order to append to it. With writeback=True, it keeps the whole file in memory and only writes to the file at the end.
The script goes like this:
d = shelve.open(reportDir+curDatName+ ( processName.rsplit('.', 1)[0] if processName.endswith('.exe') else processName ))
global counter
global startTime
d['CPU|'+pid] = []
d['RAM|'+pid] = []
d['THREADS|'+pid] = []
startTime =
while(not quit):
print('Took a snapshot')
counter = counter + 1
p = psutil.Process(int(pid))
list = d['CPU|'+pid]
d['CPU|'+pid] = list
print('CPU '+ str(p.cpu_percent(0.1)/psutil.cpu_count(0.1)))
list = d['RAM|'+pid]
d['RAM|'+pid] = list
print('RAM '+ str(p.memory_percent()))
list = d['THREADS|'+pid]
d['THREADS|'+pid] = list
print('Threads- '+ str(p.num_threads()))
while(not quit and i < interval/chunkWait): #wait in chunks
i = i+1
Is there any other package that can achieve the desired functionality for me?
You want to use a list, which is an in-memory data structure, to store data that is too big to fit conveniently in memory. You are asking for some package that will allow you to pretend that you are using a list when in fact you are dealing with a list-like interface that in actuality has a disk-based backing store. That would be convenient, and there is at least one first-class proprietary API that I know that does it, for the vendor's own database (so not generic). And no doubt someone somewhere has come up with a generic list-like interface to a database table, but that would simply be syntactic sugar. I suspect you are going to have to abandon your list approach and learn a database API. This is not a forum for such recommendations, but I think I won't get flamed or downvoted for saying that mysql is free and has a very good reputation.
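For what it's worth, here is a minimal sketch of the database approach using sqlite3 from the standard library instead of MySQL (my substitution, chosen only because it needs no server). The table layout and the function names are assumptions for illustration.

```python
import sqlite3
import time

def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "series TEXT NOT NULL, ts REAL NOT NULL, value REAL NOT NULL)"
    )
    return conn

def append_sample(conn, series, value):
    # Only the new row is held in memory; existing rows stay on disk.
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO samples (series, ts, value) VALUES (?, ?, ?)",
            (series, time.time(), value),
        )

def iter_series(conn, series):
    # Stream values back in insertion order without building a list.
    cursor = conn.execute(
        "SELECT value FROM samples WHERE series = ? ORDER BY id", (series,)
    )
    for (value,) in cursor:
        yield value
```

In the script above, a call such as append_sample(conn, 'CPU|' + pid, cpu_value) would take the place of reading the list from the shelf, appending to it and writing it back.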

Fastest way to compare and replace key value pairs in Python

I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# Open file for substitution
replaceFile = open('file', 'r+')
# read in all the lines
lines = replaceFile.readlines()
# seek to the start of the file and truncate
# (this is because I want to do an "inline" replace)
replaceFile.seek(0)
replaceFile.truncate()
# Loop through each line from file
for line in lines:
    # Loop through each Key in the mappings dict
    for i in mappings.keys():
        # if the key appears in the line
        if i in line:
            # do replacement
            line = line.replace(i, mappings[i])
    # Write the line to the file and move to next line
    replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also cases where multiple keys need to be replaced on the same line, hence I can't just find the first match and then move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.
If this performance is slow, you'll have to find something fancy. It's just about all running at C-level:
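for filename in filenames:
    with open(filename, 'r+') as f:
        # Read the whole file, then rewind and truncate for in-place editing
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)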
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
Apparently str.replace is faster than regex.sub.
So I got to thinking about this a bit more: suppose you have a really huge mappings dict, so much so that the likelihood of any one key in mappings being detected in your files is very low. In this scenario, all the time will be spent doing the searching (as pointed out by @abarnert).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into pieces
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top-level, or
    # it can't be pickled (multiprocessing uses pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    queue = mp.Queue()
    # Split the MAPPINGS KEYS up into multiple LISTS,
    # same number as CPUs
    key_batches = split_seq(list(mappings), mp.cpu_count())
    # Start the key detections
    processes = []
    for i, keys in enumerate(key_batches):
        # data is passed as a plain string, so each process gets its own
        # copy (a manager.list of the string would hold single characters)
        p = mp.Process(target=detect_active_keys, args=(keys, data, queue))
        # This is non-blocking
        p.start()
        processes.append(p)
    # Consume the output from the queues
    active_keys = []
    for p in processes:
        # We expect one result per process exactly
        # (this is blocking)
        active_keys.extend(queue.get())
    # Wait for the processes to finish
    for p in processes:
        # Note that you MUST only call join() after
        # calling queue.get()
        p.join()
    # Same as original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])
    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
Reportedly, a regex is the fastest way to go in Python (building a trie data structure would be fastest in C++):
import re

class Regex:
    # Regex implementation of find/replace for a massive word list.
    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        # Return the replacement if the token is a key, else leave it as is
        return self._mappings.get(key, key)

    def replace_all(self, filename):
        with open(filename, 'r') as fp:
            text = fp.read()
        # The token pattern must cover every character used in the keys;
        # "[a-zA-Z]+" alone would never match a key such as 'original-1'
        text = re.sub(r"[\w-]+", self.replace_func, text)
        with open(filename, 'w') as fp:
            fp.write(text)

# mapping dictionary of (find, replace) pairs defined
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# initialize regex class with mapping dictionary
r = Regex(mappings)
# replace file
r.replace_all('file')
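Since the question asks how to craft a regex that does multiple in-line replacements from a dict, here is another possible sketch: build one alternation out of the escaped keys and look up each match in the dict. The name build_replacer is made up for the example, and I have not measured how this behaves with 60728 keys, so treat it as something to benchmark rather than a guaranteed speedup.

```python
import re

def build_replacer(mappings):
    # Longest keys first, so 'original-10' wins over its prefix 'original-1'.
    keys = sorted(mappings, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '-' in keys literal.
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def replace(text):
        # One pass over the text; each match is looked up in the dict.
        return pattern.sub(lambda m: mappings[m.group(0)], text)

    return replace
```

The compiled replacer can be built once and reused for all 50 files. Unlike chained str.replace calls, a replacement value is never re-scanned, so one substitution cannot trigger another. It assumes mappings is non-empty.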
The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched M times at each position, instead of M times over the whole string, might offer some cache/paging benefits, but it'll be a lot more complicated for probably only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so, e.g, "original-1" and "original-2" are both branches off the same subtree "original-", so they don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and packet filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, which you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
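import ahocorasick

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()
# Replace from the end so earlier match offsets stay valid
for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]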
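To show the prefix-tree idea without any third-party module, here is a pure-Python sketch of a plain trie (not Aho-Corasick: there are no fallback links, so it restarts from the trie root at every text position). The function names are made up, and as noted above, pure Python may well be too slow at M~60000. It is meant to illustrate the data structure.

```python
def build_trie(keys):
    # Nested dicts keyed by character; "" marks the end of a complete key
    # (it can never collide with a real character, which has length 1).
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[""] = key
    return root

def replace_with_trie(text, mappings):
    trie = build_trie(mappings)
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j, match = trie, i, None
        # Walk as far as the text allows, remembering the longest key seen.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if "" in node:
                match = node[""]
        if match is None:
            out.append(text[i])
            i += 1
        else:
            out.append(mappings[match])
            i += len(match)
    return "".join(out)
```

Keys sharing a prefix, such as "original-1" and "original-2", share one path through the trie until their last character, which is the saving described above.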
This uses a with block to prevent leaking file descriptors. The string replace function will ensure all instances of each key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}
# Open file for substitution
with open('file', 'r+') as fd:
    # read in all the data
    text = fd.read()
    # seek to the start of the file and truncate so file will be edited inline
    fd.seek(0)
    fd.truncate()
    for key in mappings.keys():
        text = text.replace(key, mappings[key])
    fd.write(text)

Pack array of namedtuples in PYTHON

I need to send an array of namedtuples by a socket.
To create the array of namedtuples I use the following:
for i in range(200):
    ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
Now that it is filled, I need to pack "listaPeers[200]".
How can I do it?
Something like this?
packedData = struct.pack('XXXX',listaPeers)
First of all you are using namedtuple incorrectly. It should look something like this:
# ipPuerto is a type
ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
# theTuple is a tuple object
theTuple = ipPuerto("121.231.334.22", "8988")
As for packing, it depends on what you want to use on the other end. If the data will be read by Python, you can just use the pickle module.
import cPickle as Pickle
pickledTuple = Pickle.dumps(theTuple)
You can pickle a whole array of them at once.
It is not that simple. Yes, for integers and simple numbers, it's possible to pack straight from named tuples to binary data with the struct package.
However, you are holding your data as strings, not as numbers. Converting the port to int is simple, since it is a plain integer, but the IP requires some juggling.
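import struct

def ipv4_from_str(ip_str):
    # Fold the four dotted octets into one 32-bit integer
    parts = ip_str.split(".")
    result = 0
    for part in parts:
        result <<= 8
        result += int(part)
    return result

def ip_puerto_gen(list_of_ips):
    # Yield ip and port alternately, both as integers
    for ip_puerto in list_of_ips:
        yield ipv4_from_str(ip_puerto.ip)
        yield int(ip_puerto.puerto)

def pack(list_of_ips):
    return struct.pack(">" + "II" * len(list_of_ips),
                       *ip_puerto_gen(list_of_ips))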
And you then use the "pack" function from here to pack your structure as you seem to want.
But first, attend to the fact that you are creating your "listaPeers" incorrectly (your example code will simply fail with an IndexError): use an empty list, and the append method on it to insert new named tuples with ip/port pairs as each element:
listaPeers = []
ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
for x in range(200):
    new_element = ipPuerto("", "8192")
    listaPeers.append(new_element)
data = pack(listaPeers)
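For completeness, here is a round-trip sketch that also covers the receiving side. It is a variation, not the code above: it uses socket.inet_aton for the address and a 2-byte port ("!4sH") instead of the ">II" layout, and the names IpPuerto, pack_peers and unpack_peers are made up for the example.

```python
import collections
import socket
import struct

IpPuerto = collections.namedtuple("IpPuerto", "ip puerto")
# Network byte order: 4-byte IPv4 address followed by a 2-byte port.
RECORD = struct.Struct("!4sH")

def pack_peers(peers):
    # inet_aton turns the dotted string into 4 raw bytes.
    return b"".join(
        RECORD.pack(socket.inet_aton(p.ip), int(p.puerto)) for p in peers
    )

def unpack_peers(payload):
    # iter_unpack walks the buffer one fixed-size record at a time.
    return [
        IpPuerto(socket.inet_ntoa(raw_ip), str(port))
        for raw_ip, port in RECORD.iter_unpack(payload)
    ]
```

Each record takes 6 bytes, so the receiver can derive the number of peers from the payload length.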
I seem to recall that pickle is considered insecure in server processes if the server process is receiving pickled data from untrusted clients.
You might want to come up with some sort of separator character(s) for the records and fields (perhaps \0 and \001 or \376 and \377). Then putting together a message is kind of like a text file broken up into records and fields separated by spaces and newlines. Or for that matter, you could use spaces and newlines, if your normal data doesn't include these.
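A minimal sketch of the separator idea, assuming \0 between records and \001 between fields and that neither character occurs in the data (the function names are made up):

```python
RECORD_SEP = "\0"
FIELD_SEP = "\x01"  # the same character as \001 in octal notation

def encode_records(records):
    # Each record is a sequence of strings containing neither separator.
    text = RECORD_SEP.join(FIELD_SEP.join(fields) for fields in records)
    return text.encode("utf-8")

def decode_records(payload):
    text = payload.decode("utf-8")
    return [tuple(rec.split(FIELD_SEP)) for rec in text.split(RECORD_SEP)]
```

This only defines the message body. The receiver still needs a way to know where one message ends, for example a length prefix or a terminator that differs from both separators.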
I find this module very valuable for framing data in socket-based protocols:
It lets you do things like "read up until the next null byte" or "read the next 10 characters" - without needing to worry about the complexities of IP aggregating or splitting packets.

Using cPickle to serialize a large dictionary causes MemoryError

I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs->positions of occurrence.
The data model looks something like:
{word : { doc_name : [location_list] } }
Building the index in memory works fine, but when I try to serialize to disk, I hit a MemoryError. Here's my code:
# Write the index out to disk
serializedIndex = open(sys.argv[3], 'wb')
cPickle.dump(index, serializedIndex, cPickle.HIGHEST_PROTOCOL)
Right before serialization, my program is using about 50% memory (1.6 Gb). As soon as I make the call to cPickle, my memory usage skyrockets to 80% before crashing.
Why is cPickle using so much memory for serialization? Is there a better way to be approaching this problem?
cPickle needs to use a bunch of extra memory because it does cycle detection. You could try using the marshal module if you are sure your data has no cycles.
There's the other pickle library you could try. Also there might be some cPickle settings you could change.
Other options: Break your dictionary into smaller pieces and cPickle each piece. Then put them back together when you load everything in.
Sorry this is vague, I'm just writing off the top of my head. I figured it might still be helpful since no one else has answered.
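Here is one way the "smaller pieces" idea could look, written for Python 3's pickle rather than cPickle. The function names and the chunk_size default are arbitrary choices of mine, and I can't promise it avoids the MemoryError for your data. It only keeps each individual pickle small.

```python
import itertools
import pickle

def dump_in_chunks(index, fileobj, chunk_size=10000):
    # Write several small pickles back to back instead of one huge one.
    items = iter(index.items())
    while True:
        chunk = dict(itertools.islice(items, chunk_size))
        if not chunk:
            break
        pickle.dump(chunk, fileobj, pickle.HIGHEST_PROTOCOL)

def load_chunks(fileobj):
    # Read pickles until the file is exhausted and merge them.
    index = {}
    while True:
        try:
            index.update(pickle.load(fileobj))
        except EOFError:
            return index
```

The file must be opened in binary mode ('wb' for dumping, 'rb' for loading), as in the question.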
You may well be using the wrong tool for this job. If you want to persist a huge amount of indexed data, I'd strongly suggest using an SQLite on-disk database (or, of course, just a normal database) with an ORM like SQLObject or SQLAlchemy.
These will take care of the mundane things like compatibility, optimising format for purpose, and not holding all the data in memory simultaneously so that you run out of memory...
Added: Because I was working on a near identical thing anyway, but mainly because I'm such a nice person, here's a demo that appears to do what you need (it'll create an SQLite file in your current dir, and delete it if a file with that name already exists, so put it somewhere empty first):
import sqlobject
from sqlobject import SQLObject, UnicodeCol, ForeignKey, IntCol, SQLMultipleJoin
import os

DB_NAME = "mydb"
ENCODING = "utf8"

class Document(SQLObject):
    dbName = UnicodeCol(dbEncoding=ENCODING)

class Location(SQLObject):
    """ Location of each individual occurrence of a word within a document.
    """
    dbWord = UnicodeCol(dbEncoding=ENCODING)
    dbDocument = ForeignKey('Document')
    dbLocation = IntCol()

TEST_DATA = {
    'one' : {
        'doc1' : [1,2,10],
        'doc3' : [6],
    },
    'two' : {
        'doc1' : [2, 13],
        'doc2' : [5,6,7],
    },
    'three' : {
        'doc3' : [1],
    },
}

if __name__ == "__main__":
    db_filename = os.path.abspath(DB_NAME)
    # Start from a clean file on every run
    if os.path.exists(db_filename):
        os.unlink(db_filename)
    connection = sqlobject.connectionForURI("sqlite:%s" % (db_filename))
    sqlobject.sqlhub.processConnection = connection
    # Create the tables
    Document.createTable()
    Location.createTable()
    # Import the dict data:
    for word, locs in TEST_DATA.items():
        for doc, indices in locs.items():
            sql_doc = Document(dbName=doc)
            for index in indices:
                Location(dbWord=word, dbDocument=sql_doc, dbLocation=index)
    # Let's check out the data... where can we find 'two'?
    locs_for_two = Location.selectBy(dbWord = 'two')
    # Or...
    # locs_for_two = Location.select(Location.q.dbWord == 'two')
    print "Word 'two' found at..."
    for loc in locs_for_two:
        print "Found: %s, p%s" % (loc.dbDocument.dbName, loc.dbLocation)
    # What documents have 'one' in them?
    docs_with_one = Location.selectBy(dbWord = 'one').throughTo.dbDocument
    print "Word 'one' found in documents..."
    for doc in docs_with_one:
        print "Found: %s" % doc.dbName
This is certainly not the only way (or necessarily the best way) to do this. Whether the Document or Word tables should be separate tables from the Location table depends on your data and typical usage. In your case, the "Word" table could probably be a separate table with some added settings for indexing and uniqueness.
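As a rough illustration of the variant with a separate, unique "Word" table, here is a sketch using the standard library's sqlite3 directly rather than SQLObject. The table and column names are invented for the example, and get_id interpolates only identifiers that are fixed in the code.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS word (id INTEGER PRIMARY KEY, term TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS document (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE IF NOT EXISTS location (
    word_id INTEGER NOT NULL REFERENCES word(id),
    document_id INTEGER NOT NULL REFERENCES document(id),
    position INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS location_by_word ON location (word_id);
"""

def get_id(conn, table, column, value):
    # table and column are fixed identifiers from this module, never user input.
    conn.execute("INSERT OR IGNORE INTO %s (%s) VALUES (?)" % (table, column), (value,))
    row = conn.execute("SELECT id FROM %s WHERE %s = ?" % (table, column), (value,)).fetchone()
    return row[0]

def store_index(conn, index):
    conn.executescript(SCHEMA)
    with conn:  # one transaction for the whole import
        for word, docs in index.items():
            word_id = get_id(conn, "word", "term", word)
            for doc, positions in docs.items():
                doc_id = get_id(conn, "document", "name", doc)
                conn.executemany(
                    "INSERT INTO location VALUES (?, ?, ?)",
                    [(word_id, doc_id, pos) for pos in positions],
                )

def lookup(conn, word):
    rows = conn.execute(
        "SELECT document.name, location.position FROM location "
        "JOIN word ON word.id = location.word_id "
        "JOIN document ON document.id = location.document_id "
        "WHERE word.term = ? ORDER BY document.name, location.position",
        (word,),
    )
    return rows.fetchall()
```

The UNIQUE constraints mean each word and each document name is stored once, and the index on word_id keeps lookups by word from scanning the whole location table.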
