I have a number of files where I want to replace all instances of a specific string with another one.
I currently have this code:
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
replaceFile = open('file', 'r+')

# Read in all the lines
lines = replaceFile.readlines()

# Seek to the start of the file and truncate
# (because I want to do an "in-place" replace)
replaceFile.seek(0)
replaceFile.truncate()

# Loop through each line from the file
for line in lines:
    # Loop through each key in the mappings dict
    for i in mappings.keys():
        # If the key appears in the line, do the replacement
        if i in line:
            line = line.replace(i, mappings[i])
    # Write the line back and move to the next one
    replaceFile.write(line)
This works ok, but it is very slow for the size of the mappings and the size of the files I am dealing with.
For instance, in the "mappings" dict there are 60728 key value pairs.
I need to process up to 50 files and replace all instances of "key" with the corresponding value, and each of the 50 files is approximately 250000 lines.
There are also cases where multiple keys need to be replaced on the same line, so I can't just find the first match and move on.
So my question is:
Is there a faster way to do the above?
I have thought about using a regex, but I am not sure how to craft one that will do multiple in-line replaces using key/value pairs from a dict.
If you need more info, let me know.
If this is still slow, you'll have to find something fancier, because it's already running just about entirely at C level:
for filename in filenames:
    with open(filename, 'r+') as f:
        data = f.read()
        f.seek(0)
        f.truncate()
        for k, v in mappings.items():
            data = data.replace(k, v)
        f.write(data)
Note that you can run multiple processes where each process tackles a portion of the total list of files. That should make the whole job a lot faster. Nothing fancy, just run multiple instances off the shell, each with a different file list.
Apparently str.replace is faster than regex.sub.
So I got to thinking about this a bit more: suppose you have a really huge mappings dict, so large that the likelihood of any one key being found in your files is very low. In that scenario, all the time will be spent doing the searching (as abarnert points out).
Before resorting to exotic algorithms, it seems plausible that multiprocessing could at least be used to do the searching in parallel, and thereafter do the replacements in one process (you can't do replacements in multiple processes for obvious reasons: how would you combine the result?).
So I decided to finally get a basic understanding of multiprocessing, and the code below looks like it could plausibly work:
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into roughly equal pieces
    start = 0
    for i in range(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top level, or
    # it can't be pickled (multiprocessing uses pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    queue = mp.Queue()
    # Split the mapping keys into one batch per CPU
    key_batches = split_seq(list(mappings.keys()), mp.cpu_count())
    # Start the key detections
    processes = []
    for keys in key_batches:
        p = mp.Process(target=detect_active_keys, args=(keys, data, queue))
        # start() is non-blocking
        p.start()
        processes.append(p)
    # Consume the output from the queue: exactly one result
    # per process (queue.get() is blocking)
    active_keys = []
    for p in processes:
        active_keys.extend(queue.get())
    # Wait for the processes to finish. Note that you MUST
    # only call join() after draining the queue.
    for p in processes:
        p.join()
    # Same as the original submission, now with (hopefully) far fewer keys
    for key in active_keys:
        data = data.replace(key, mappings[key])
    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from here,
    # due to how multiprocessing spawns workers
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)
Some notes:
I have not executed this code yet! I hope to test it out sometime but it takes time to create the test files and so on. Please consider it as somewhere between pseudocode and valid python. It should not be difficult to get it to run.
Conceivably, it should be pretty easy to use multiple physical machines, i.e. a cluster with the same code. The docs for multiprocessing show how to work with machines on a network.
This code is still pretty simple. I would love to know whether it improves your speed at all.
There seem to be a lot of hackish caveats with using multiprocessing, which I tried to point out in the comments. Since I haven't been able to test the code yet, it may be the case that I haven't used multiprocessing correctly anyway.
According to http://pravin.paratey.com/posts/super-quick-find-replace, regex is the fastest way to go in Python (building a trie data structure would be fastest in C++):
import re

class Regex:
    # Regex implementation of find/replace for a massive word list.
    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, match_obj):
        # Return the replacement if the matched word is a key,
        # otherwise return the word unchanged
        key = match_obj.group(0)
        return self._mappings.get(key, key)

    def replace_all(self, filename):
        with open(filename, 'r') as fp:
            text = fp.read()
        # The pattern should cover the characters your keys use
        text = re.sub(r"[a-zA-Z0-9-]+", self.replace_func, text)
        with open(filename, 'w') as fp:
            fp.write(text)

# mapping dictionary of find -> replace pairs
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# initialize the Regex class with the mapping dictionary
r = Regex(mappings)

# replace in file
r.replace_all('file')
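An alternative (a sketch of my own, not from the linked post) is to build one alternation pattern directly from the dictionary keys, so only actual keys ever match and the callback lookup can never fail; sorting by length keeps longer keys from being shadowed by shorter prefixes:

```python
import re

mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# One pattern that matches any key; longest keys first, so that a key
# like 'original-12' would win over its prefix 'original-1' if both existed
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(mappings, key=len, reverse=True))
)

def replace_all(text):
    # Every match is guaranteed to be a key, so the dict lookup cannot fail
    return pattern.sub(lambda m: mappings[m.group(0)], text)

print(replace_all("original-1 then original-2"))  # replace-1 then replace-2
```

This makes a single pass over the text per file, instead of one pass per key.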
The slow part of this is the searching, not the replacing. (Even if I'm wrong, you can easily speed up the replacing part by first searching for all the indices, then splitting and replacing from the end; it's only the searching part that needs to be clever.)
Any naive mass string search algorithm is obviously going to be O(NM) for an N-length string and M substrings (and maybe even worse, if the substrings are long enough to matter). An algorithm that searched for all M substrings at each position, instead of making M passes over the whole string, might offer some cache/paging benefits, but it'll be a lot more complicated for probably only a small benefit.
So, you're not going to do much better than cjrh's implementation if you stick with a naive algorithm. (You could try compiling it as Cython or running it in PyPy to see if it helps, but I doubt it'll help much—as he explains, all the inner loops are already in C.)
The way to speed it up is to somehow look for many substrings at a time. The standard way to do that is to build a prefix tree (or suffix tree), so that, e.g., "original-1" and "original-2" are both branches off the same subtree "original-", and they don't need to be handled separately until the very last character.
The standard implementation of a prefix tree is a trie. However, as Efficient String Matching: An Aid to Bibliographic Search and the Wikipedia article Aho-Corasick string matching algorithm explain, you can optimize further for this use case by using a custom data structure with extra links for fallbacks. (IIRC, this improves the average case by logM.)
Aho and Corasick further optimize things by compiling a finite state machine out of the fallback trie, which isn't appropriate to every problem, but sounds like it would be for yours. (You're reusing the same mappings dict 50 times.)
There are a number of variant algorithms with additional benefits, so it might be worth a bit of further research. (Common use cases are things like virus scanners and package filters, which might help your search.) But I think Aho-Corasick, or even just a plain trie, is probably good enough.
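To make the prefix-tree idea concrete, here's a minimal dict-of-dicts trie (a sketch, not a tuned implementation) that finds every occurrence of every key in one left-to-right scan:

```python
def build_trie(words):
    # Nested dicts; the special '$' key marks the end of a stored word
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w
    return root

def find_matches(trie, text):
    # Naive scan: at each position, walk the trie as far as it matches.
    # Aho-Corasick improves on this by never re-scanning after a failure.
    matches = []
    for i in range(len(text)):
        node = trie
        for j in range(i, len(text)):
            node = node.get(text[j])
            if node is None:
                break
            if '$' in node:
                matches.append((i, j + 1, node['$']))
    return matches
```

For example, `find_matches(build_trie(['original-1', 'original-2']), 'x original-1 y')` reports one match spanning positions 2 to 12, without ever comparing against 'original-2' past the shared prefix.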
Building any of these structures in pure Python might add so much overhead that, at M~60000, the extra cost will defeat the M/logM algorithmic improvement. But fortunately, you don't have to. There are many C-optimized trie implementations, and at least one Aho-Corasick implementation, on PyPI. It also might be worth looking at something like SuffixTree instead of using a generic trie library upside-down if you think suffix matching will work better with your data.
Unfortunately, without your data set, it's hard for anyone else to do a useful performance test. If you want, I can write test code that uses a few different modules, which you can then run against your data. But here's a simple example using ahocorasick for the search and a dumb replace-from-the-end implementation for the replace:
import ahocorasick

tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()

# Replace from the end, so earlier indices stay valid
for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]
This uses a with block to prevent leaking file descriptors, and str.replace will ensure all instances of each key get replaced within the text.
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# Open file for substitution
with open('file', 'r+') as fd:
    # Read in all the data
    text = fd.read()
    # Seek to the start of the file and truncate, so the file is edited in place
    fd.seek(0)
    fd.truncate()
    for key in mappings.keys():
        text = text.replace(key, mappings[key])
    fd.write(text)
Related
I am trying to extract some information from a set of files sent to me by a collaborator. Each file contains some python code which names a sequence of lists. They look something like this:
#PHASE = 0
x = np.array(1,2,...)
y = np.array(3,4,...)
z = np.array(5,6,...)
#PHASE = 30
x = np.array(1,4,...)
y = np.array(2,5,...)
z = np.array(3,6,...)
#PHASE = 40
...
And so on. There are 12 files in total, each with 7 phase sets. My goal is to convert each phase into its own file, which can then be read by ascii.read() as a Table object for manipulation in a different section of code.
My current method is extremely inefficient, both in terms of resources and time/energy required to assemble. It goes something like this: Start with a function
def makeTable(a,b,c):
output = Table()
output['x'] = a
output['y'] = b
output['z'] = c
return output
Then, for each phase, I manually copy-paste the relevant part of the text file into a cell and append a line of code
fileName_phase = makeTable(a,b,c)
Repeat ad nauseam. It would take 84 iterations of this to process all the data, and naturally each would need some minor adjustments to match the specific fileName and phase.
Finally, at the end of my code, I have a few lines of code set up to ascii.write each of the tables into .dat files for later manipulation.
This entire method is extremely exhausting to set up. If it's the only way to handle the data, I'll do it. I'm hoping I can find a quicker way to set it up, however. Is there one you can suggest?
If the goal is efficiency and code reuse instead of copy-paste, I think classes might provide a good way. My thought: create a class called FileWithArrays, and use a parser to read the lines and populate a FileWithArrays instance. Once that's done, you can add a method that transforms the object into a table.
P.S. A good approach for the parser is to store all the lines in a list and process them one by one, using list.pop() to shrink the list as you go. Try to rewrite/reformat the question if I misunderstood anything; it's not very easy to read.
I will suggest a way that will be scorned by many, but it will get your work done, so apologies to everyone in advance.
The prerequisite for this method is that you absolutely trust the correctness of the input files, which I guess you do (after all, they come from your collaborator).
The key point here is that the text in the files is code, which means it can be executed.
So you can do something like this
import re
import numpy as np  # this is for the actual code in the files; you may have to install numpy for this to work

with open("xyz.txt") as f:
    content = f.read()
Now that you have all the content, you have to separate it by phase.
For this we will use the re.split function.
phase_data = re.split("#PHASE = .*\n", content)
Now we have the content of each phase in an array.
Now comes for the part of executing it.
for phase in phase_data:
    if len(phase.strip()) == 0:
        continue
    exec(phase)
    table = makeTable(x, y, z)  # x, y and z are defined by the exec
    # do whatever you want with the table
I will reiterate that you have to absolutely trust the contents of the file. Since you are executing it as code.
But your work seems like a scripting one and I believe this will get your work done.
PS : The other "safer" alternative to exec is to have a sandboxing library which takes the string and executes it without affecting the parent scope.
To avoid the safety issue of using exec as suggested by Ajay Brahmakshatriya, while keeping his first processing step, you can create your own minimal 'phase parser', something like:
VARS = 'xyz'

def makeTable(phase):
    # 'phase' is one chunk produced by re.split; break it into its lines
    lines = phase.strip().split('\n')
    assert len(lines) >= 3
    output = Table()
    for i in range(3):
        line = [s.strip() for s in lines[i].split('=')]
        assert len(line) == 2
        var, arr = line
        assert var == VARS[i]
        assert arr[:10] == 'np.array([' and arr[-2:] == '])'
        output[var] = np.fromstring(arr[10:-2], sep=',')
    return output
and then call
table = makeTable(phase)
instead of
exec(phase)
table = makeTable(x, y, z)
You could also skip all these assert statements without compromising safety; if the file is corrupted or not formatted as expected, the error that is thrown will just be harder to understand.
For the program I am trying to design, I am checking that certain conditions exist in configuration files. For example, that the line ThisExists is in the file, or that ThisIsFirst appears in the file followed by ThisAlsoExists somewhere later down in the file.
I looked for an efficient approach which might be used in this situation but couldn't find any.
My current idea is basically to just iterate over the file(s) multiple times each time I want to check a condition. So I would have functions:
def checkA(file)
def checkB(file)
.
.
.
To me this seems inefficient as I have to iterate for every condition I want to check.
Initially I thought I could just iterate once, checking each line for every condition I want to verify. But I don't think I can do that as conditions which can be multi line require information about more than one line at a time.
Is the way I outlined the only way to do this, or is there a more efficient approach?
I am trying to provide an example below.
def main():
    file = open(filename)
    result1 = checkA(file)
    result2 = checkB(file)

def checkA(file):
    """This is a single-line check function"""
    conditionExists = False
    for line in file:
        if line == "SomeCondition":
            conditionExists = True
    return conditionExists

def checkB(file):
    """This is a multi-line check function"""
    conditionExists = False
    conditionStarted = False
    for line in file:
        if line == "Start":
            conditionStarted = True
        elif line == "End" and conditionStarted:
            conditionExists = True
    return conditionExists
From a software engineering perspective, your current approach has some nice advantages. The logic of each function is fully decoupled from the others, so they can be separately debugged and tested. The complexity of each individual function stays low. And this approach allows you to easily incorporate checks that do not have a parallel structure.
If available libraries (configparser etc.) aren't enough I would probably use regular expressions.
import re

check_a = re.compile('^SomeCondition$', flags=re.MULTILINE)
check_b = re.compile('^Start(?:.|\n)*?End$', flags=re.MULTILINE)

def main(file_name):
    with open(file_name, 'r') as file_object:
        file_content = file_object.read()
    result_1 = bool(check_a.search(file_content))
    result_2 = bool(check_b.search(file_content))
It's not the most user friendly approach – especially if the matching conditions are complex – but I think the pay-off for learning regex is great.
xkcd tells us that regex both can be a super power and a problem.
So I am working in a project in which I have to read a large database (for me it is large) of 10 million records. I cannot really filter them, because I have to treat them all and individually. For each record I must apply a formula and then write this result into multiple files depending on certain conditions of the record.
I have implemented a few algorithms, and finishing the whole processing takes around 2-3 days. This is a problem, because I am trying to optimise a process that already takes this long; 1 day would be acceptable.
So far I have tried indexes on the database and threading (of the processing of each record, not of the I/O operations). I cannot get a shorter time.
I am using Django, and I fail to measure how long it really takes to start treating the data due to its lazy behaviour. I would also like to know if I can start treating the data as soon as I receive it, instead of waiting for all the data to be loaded into memory before I can actually process it. It could also be my understanding of write operations in Python. Lastly, it could be that I need a better machine (I doubt it: I have 4 cores and 4GB of RAM, which should be able to give better speeds).
Any ideas? I really appreciate the feedback. :)
Edit: Code
Explanation:
The records I talked about are ids of customers (passports), and the conditions are whether there are agreements between the different terminals of the company (countries). The process is a hashing.
The first strategy tries to treat the whole database... We start with some preparation for the condition part of the algorithm (agreements between countries), then do a large verification by membership in a set.
Since I've been trying to improve it on my own, I tried to cut the problem in parts for the second strategy, treating the query by parts (obtaining the records that belong to a country, and writing to the files of those countries that have an agreement with it).
The threaded strategy is not depicted, for it was designed for a single country, and I got awful results compared with the non-threaded version. I honestly have the intuition it has to be a matter of memory and SQL.
def create_all_files(strategy=0):
    if strategy == 0:
        set_countries_agreements = set()
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        for each_country in set_countries:
            set_agreements = frozenset(get_agreements(each_country))
            set_countries_agreements.add(set_agreements)
        print("All agreements obtained")
        set_passports = Passport.objects.all()
        print("All passports obtained")
        for each_passport in set_passports:
            for each_agreement in set_countries_agreements:
                for each_country in each_agreement:
                    if each_passport.nationality == each_country:
                        with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % iter(each_agreement).next()), "a") as f:
                            f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                        print(".")
                print("_")
            print("-")
        print("~")
    if strategy == 1:
        file_countries = open(os.path.join(PROJECT_ROOT, 'list_countries'))
        set_countries_temp = set(line.strip() for line in file_countries)
        file_countries.close()
        set_countries = sorted_nicely(set_countries_temp)
        while len(set_countries) != 0:
            country = set_countries.pop()
            list_countries = get_agreements(country)
            list_passports = Passport.objects.filter(nationality=country)
            for each_passport in list_passports:
                for each_country in list_countries:
                    with open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % each_country), "a") as f:
                        f.write(generate_hash(each_passport.nationality + "<" + each_passport.id_passport, each_country) + "\n")
                    print("r")
                print("c")
            print("p")
        print("P")
In your question, you are describing an ETL process. I suggest you use an ETL tool.
To mention a Python ETL tool, I can point to Pygrametl, written by Christian Thomsen; in my opinion it runs nicely and its performance is impressive. Test it and come back with results.
I can't post this answer without mentioning MapReduce. This programming model can match your requirements if you are planning to distribute tasks across nodes.
It looks like you have a file for each country that you append hashes to. Instead of opening and closing handles to these files 10 million+ times, you should open each one once and close them all at the end.
countries = {}  # country -> file handle
with open(os.path.join(PROJECT_ROOT, 'list_countries')) as country_file:
    for line in country_file:
        country = line.strip()
        countries[country] = open(os.path.join(PROJECT_ROOT, 'generated_indexes/%s' % country), "a")

for country in countries:
    agreements = get_agreements(country)
    for passport in Passport.objects.filter(nationality=country):
        for agreement in agreements:
            countries[agreement].write(generate_hash(passport.nationality + "<" + passport.id_passport, agreement) + "\n")

for country, file in countries.items():
    file.close()
I don't know how big a list Passport.objects.filter(nationality=country) will return; if it is massive and memory is an issue, you will have to start thinking about chunking/paginating the query using limits.
You are using sets for your list of countries and their agreements. If that is because the file containing the list of countries is not guaranteed to be unique, the dictionary solution may error when you attempt to open another handle to the same file. This can be avoided by adding a simple check to see if the country is already a member of countries.
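A chunking helper could look like the sketch below. It only assumes the queryset supports slicing (Django querysets do, and slicing translates to LIMIT/OFFSET in SQL), so it works on any sliceable sequence:

```python
def chunked(queryset, size=1000):
    # Yield successive slices of the queryset until it is exhausted,
    # so only `size` records are materialized in memory at a time
    start = 0
    while True:
        batch = list(queryset[start:start + size])
        if not batch:
            break
        yield batch
        start += size

# Usage with a plain range standing in for a queryset:
for batch in chunked(range(10), size=4):
    print(batch)  # [0, 1, 2, 3] then [4, 5, 6, 7] then [8, 9]
```

You would then iterate `for batch in chunked(Passport.objects.filter(nationality=country))` and process each batch, instead of holding the whole result in memory.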
I have to parse a large log file (2GB) using reg ex in python. In the log file regular expression matches line which I am interested in. Log file can also have unwanted data.
Here is a sample from the file:
"#DEBUG:: BFM [L4] 5.4401e+08ps MSG DIR:TX SCB_CB TYPE:DATA_REQ CPortID:'h8 SIZE:'d20 NumSeg:'h0001 Msg_Id:'h00000000"
My regular expression is ".DEBUG.*MSG."
First I split it on whitespace, then the "field:value" patterns are inserted into an sqlite3 database; but for large files it takes around 10 to 15 minutes to parse the file.
Please suggest the best way to do the above task in minimal time.
As others have said, profile your code to see why it is slow. The cProfile module, in conjunction with the gprof2dot tool, can produce nice readable information.
Without seeing your slow code, I can guess a few things that might help:
First, you can probably get away with using the built-in string methods instead of a regex; this might be marginally quicker. If you need to use regexes, it's worthwhile precompiling them outside the main loop using re.compile.
Second, don't do one insert query per line; instead, do the insertions in batches, e.g. add the parsed info to a list, and when it reaches a certain size, perform one INSERT query with the executemany method.
Some incomplete code, as an example of the above:
import fileinput

parsed_info = []
for linenum, line in enumerate(fileinput.input()):
    if not line.startswith("#DEBUG"):
        continue  # Skip line
    msg = line.partition("MSG")[2]  # Get everything after "MSG"
    words = msg.split()  # Split into words
    info = {}
    for w in words:
        k, _, v = w.partition(":")  # Split each word on the first :
        info[k] = v
    parsed_info.append(info)
    if linenum % 10000 == 0:  # Or maybe: if len(parsed_info) > 500:
        # Insert everything in parsed_info into the database
        ...
        parsed_info = []  # Clear
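The batched INSERT step that the `...` stands for could be sketched like this with sqlite3 (the table and column names here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real database
conn.execute("CREATE TABLE log (dir TEXT, type TEXT, size TEXT)")

# A batch of parsed rows, like the dicts built in the parsing loop
parsed_info = [
    {"DIR": "TX", "TYPE": "DATA_REQ", "SIZE": "'d20"},
    {"DIR": "RX", "TYPE": "DATA_ACK", "SIZE": "'d04"},
]

# One executemany call instead of one INSERT per line
conn.executemany(
    "INSERT INTO log (dir, type, size) VALUES (?, ?, ?)",
    [(d["DIR"], d["TYPE"], d["SIZE"]) for d in parsed_info],
)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM log").fetchone()[0])  # 2
```

Committing once per batch, rather than once per row, is usually where the big sqlite speedup comes from.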
Paul's answer makes sense: you need to understand where you "lose" time first.
The easiest way, if you don't have a profiler, is to print a timestamp in milliseconds before and after each "step" of your algorithm (opening the file, reading it line by line (and, within that, the time taken by the split / regexp to recognise the debug lines), inserting into the DB, etc.).
Without further knowledge of your code, there are possible "traps" that would be very time-consuming:
- opening the log file several times
- opening the DB every time you need to insert data, instead of opening one connection and then writing as you go
"The best way to do the above task in minimal time" is to first figure out where the time is going. Look into how to profile your Python script to find what parts are slow. You may have an inefficient regex. Writing to sqlite may be the problem. But there are no magic bullets - in general, processing 2GB of text line by line, with a regex, in Python, is probably going to run in minutes, not seconds.
Here is a test script that will show how long it takes to read a file, line by line, and do nothing else:
from datetime import datetime

start = datetime.now()
for line in open("big_honkin_file.dat"):
    pass
end = datetime.now()
print(end - start)
I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure, that may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. Not only that, I want it to store numpy arrays space-optimized and play nice between Python 2 and 3.
Below are methods I know are out there. My question is what is missing from this list and is there an alternative that dodges all my deal-breakers?
Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot)
Numpy's save/savez/load (deal-breaker: Incompatible format across Python 2/3)
PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
Using PyTables manually (deal-breaker: I want this for constantly changing research code, so it's really convenient to be able to dump dictionaries to files by calling a single function)
The PyTables replacement of numpy.savez is promising, since I like the idea of using hdf5 and it compresses the numpy arrays really efficiently, which is a big plus. However, it does not take any type of dictionary structure.
Lately, what I've been doing is to use something similar to the PyTables replacement, but enhancing it to be able to store any type of entry. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, which is a bit awkward (and ambiguous with respect to actual length-1 arrays), even though I set the chunk size to 1 so it doesn't take up that much space.
Is there something like that already out there?
Thanks!
After asking this two years ago, I started coding my own HDF5-based replacement of pickle/np.save. Since then, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for:
https://github.com/uchicago-cs/deepdish
I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.
They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.
import tables
import warnings
import cPickle

def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.removeNode(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g

def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries; otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB).
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print "Skipping item \"%s\"..." % g._v_pathname
        else:
            return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str):
                        if item == '_None':
                            item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
    return outdict
It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.
I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:
a = np.array([str({'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]})])
first I tried to directly save it to a memmap:
f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = a
# CRASHES when reopening, since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = a
# CRASHES when reopening for the same reason
The way that worked was converting the dictionary to a string:
f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(a)
This works, and afterwards you can eval(f[0]) to get the value back.
I do not know the advantage of this approach over the others, but it deserves a closer look.
I absolutely recommend a python object database like ZODB. It seems pretty well suited for your situation, considering you store objects (literally whatever you like) to a dictionary - this means you can store dictionaries inside dictionaries. I've used it in a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With this, they'll be able to read it in, and perform any queries they wish, and modify their own local copies. If you wish to have multiple programs simultaneously accessing the same database, I'd make sure to look at ZEO.
Just a silly example of how to get started:
from ZODB import DB
from ZODB.FileStorage import FileStorage
from ZODB.PersistentMapping import PersistentMapping
import transaction
from persistent import Persistent
from persistent.dict import PersistentDict
from persistent.list import PersistentList
# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
# Define and populate the structure.
root['Vehicle'] = PersistentDict() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentDict() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here
root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentDict() # Object 2 - also a dictionary
# more attributes here
# add as many objects with as many characteristics as you like.
# Committing changes; up until this point things can be rolled back
transaction.get().commit()
# (to discard uncommitted changes instead, you would call transaction.get().abort())
connection.close()
db.close()
storage.close()
Once the database is created it's very easy use. Since it's an object database (a dictionary), you can access objects very easily:
# after it's opened (lines from the very beginning, up to and including root = connection.root())
>>> root['Vehicle']['Tesla Model S']['range']
'208 miles'
You can also display all of the keys (and do all other standard dictionary things you might want to do):
>>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']
The last thing I want to mention is that keys can be changed (see Changing the key value in python dictionary). Values can also be changed, so if your research results change because you change your method or something, you don't have to start the entire database from scratch (especially if everything else is still okay). Be careful when doing either of these; I put safety measures in my database code to make sure I'm aware of any attempt to overwrite keys or values.
** ADDED **
# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()
# insert into definition/population section
np.save(outfile, np.linspace(-1, 1, 10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile
# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>
outfile.seek(0)  # simulate closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])
>>> print A
array([-1. , -0.99979998, -0.99959996, ..., 0.99959996,
0.99979998, 1. ])
You could also use numpy.savez() for compressed saving of multiple numpy arrays in this exact same way.
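For example, the compressed multi-array variant could look like this (a standalone sketch, independent of ZODB; note that np.savez_compressed is the variant that actually compresses, while np.savez only bundles):

```python
import numpy as np
from tempfile import TemporaryFile

outfile = TemporaryFile()
x = np.linspace(-1, 1, 10000)
y = np.arange(10)

# Store several named arrays in one zipped .npz archive
np.savez_compressed(outfile, x=x, y=y)

outfile.seek(0)  # simulate closing and re-opening
data = np.load(outfile)
print(data['y'])  # [0 1 2 3 4 5 6 7 8 9]
```

The arrays come back under the keyword names they were saved with, so the archive behaves like a flat dictionary of arrays.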
This is not a direct answer; anyway, you may also be interested in JSON. Have a look at 13.10. Serializing Datatypes Unsupported by JSON, which shows how to extend the format for unsupported types.
The whole chapter from "Dive into Python 3" by Mark Pilgrim is definitely a good read, at least to know what's out there.
Update: Possibly an unrelated idea, but... I have read somewhere that one of the reasons XML was finally adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a possibly less space-efficient solution and compress it via zip or another well-known algorithm. Using a known algorithm helps when you need to debug (unzip and then look at the text file by eye).
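In the spirit of that chapter, extending JSON for an unsupported type takes only a few lines; here is a sketch for complex numbers, using a made-up '__complex__' marker key:

```python
import json

def encode_custom(obj):
    # Called by json.dumps for objects it can't serialize natively
    if isinstance(obj, complex):
        return {'__complex__': True, 'real': obj.real, 'imag': obj.imag}
    raise TypeError('Cannot serialize %r' % obj)

def decode_custom(d):
    # Called by json.loads for every decoded JSON object
    if d.get('__complex__'):
        return complex(d['real'], d['imag'])
    return d

data = {'value': 3 + 4j, 'note': 'plain entries pass through'}
text = json.dumps(data, default=encode_custom)
restored = json.loads(text, object_hook=decode_custom)
print(restored['value'])  # (3+4j)
```

The same default/object_hook pair pattern extends to other types (sets, dates, and so on); the trade-off versus pickle is that the output stays human-readable and diff-friendly.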