Merge Two TinyDB Databases - python

In Python, I'm trying to merge multiple JSON files produced by TinyDB.
I could not find a way to directly merge two TinyDB JSON files so that the autogenerated keys continue in sequence instead of restarting when the next file is opened.
In code terms, I want to merge a large amount of data like this:
hello1={"1":"bye",2:"good"....,"20000":"goodbye"}
hello2={"1":"dog",2:"cat"....,"15000":"monkey"}
As:
Hello3= {"1":"bye",2:"good"....,"20000":"goodbye","20001":"dog",20002:"cat"....,"35000":"monkey"}
Because I could not find the correct way to do this with TinyDB, I opened the files as plain JSON, loading each one and then doing:
Data = Data['_default']
The problem is that the code currently works, but it has serious memory problems. After a few seconds the merged DB contains about 28 MB of data, but then (probably) the cache saturates, and the remaining data is added very slowly.
So I either need to empty the cache after a certain amount of data, or I need to change the way I do this.
This is the code that I use:
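import json
from tinydb import TinyDB
Try1 = TinyDB('FullDB.json')
Try1.purge()  # empty the target DB first (newer TinyDB releases call this truncate())
with open('FirstDataBase.json') as Part1:
    Datapart1 = json.load(Part1)
Datapart1 = Datapart1['_default']
# keys run from "1" to str(len(Datapart1)), so the range has to include the last one
for dets in range(1, len(Datapart1) + 1):
    Try1.insert(Datapart1[str(dets)])
with open('SecondDatabase.json') as Part2:
    Datapart2 = json.load(Part2)
Datapart2 = Datapart2['_default']
for dets in range(1, len(Datapart2) + 1):
    Try1.insert(Datapart2[str(dets)])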

Question: Merge Two TinyDB Databases ... probably i need to change the way to do this!
From TinyDB Documentation
Why Not Use TinyDB?
...
You are really concerned about performance and need a high speed database.
Single-row insertions into a DB are always slow; try db.insert_multiple(...) instead.
The second form below, using a generator, lets you keep the memory footprint down.
# From list
Try1.insert_multiple([{"1":"bye",2:"good"....,"20000":"goodbye"}])
or
# From generator function
Try1.insert_multiple(generator())
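A minimal sketch of what such a generator could look like, assuming each source file has TinyDB's default layout (a top-level "_default" table keyed by "1", "2", ...). The name iter_documents and the sample documents are made up for illustration, and the TinyDB call is left as a comment so the snippet runs without the library installed.

```python
def iter_documents(tables):
    """Yield every document from each table, in numeric key order."""
    for table in tables:
        # TinyDB stores document IDs as strings, so sort them numerically.
        for doc_id in sorted(table, key=int):
            yield table[doc_id]


# Stand-ins for json.load(f)['_default'] of each source file (hypothetical data).
part1 = {"1": {"word": "bye"}, "2": {"word": "good"}}
part2 = {"1": {"word": "dog"}, "2": {"word": "cat"}}

merged = list(iter_documents([part1, part2]))

# With TinyDB installed, the generator can be handed straight to the target DB,
# which then assigns fresh consecutive IDs:
#     TinyDB('FullDB.json').insert_multiple(iter_documents([part1, part2]))
```

Note that json.load still reads each source file completely, so the main saving here is that the target database is written once per batch rather than once per document.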

Related

How do I write unknown keys to a CSV in large data sets?

I'm currently working on a script that will query data from a REST API, and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but it returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.
What I have so far (this is a simplified structure for the purposes of this question):
import csv
resp = client.get_list()
while resp.token:
    my_data = resp.data
    process_data(my_data)
    resp = client.get_list(resp.token)
def process_data(my_data):
    # This section should write my_data to a CSV file
    # I know I can use csv.DictWriter as part of the solution
    # Notice that in this example, "fieldnames" is undefined
    # Defining it is part of the question
    with open('output.csv', 'a') as output_file:
        writer = csv.DictWriter(output_file, fieldnames = fieldnames)
        for element in my_data:
            writer.writerow(element)
The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is, for example, entry 364 introducing an entirely new key.
Options that I've considered:
1. Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
2. Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to 2 total iterations over the data, and feels unnecessarily complex.
3. Hard code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?
Options 1 and 2 both seem reasonable.
Does the CSV need to be valid and readable while you're creating it? If not, you could append the missing columns in one pass after you've finished reading from the API (which would be like a combination of the two approaches). If you do this, you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your column definitions will grow while you're writing.
One thing to bear in mind: if the overall file is expected to be large (e.g. it won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.).
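As a rough sketch of the two-pass idea (the asker's second option), the records can be spooled as one JSON object per line, which keeps the intermediate file easy to stream. The function names and sample batches below are invented for illustration, and in-memory buffers stand in for real files.

```python
import csv
import io
import json


def spool_batches(batches, spool):
    """First pass: write each record as one JSON line and collect all keys."""
    fieldnames = []
    seen = set()
    for batch in batches:
        for record in batch:
            spool.write(json.dumps(record) + "\n")
            for key in record:
                if key not in seen:  # keep first-seen order for the header
                    seen.add(key)
                    fieldnames.append(key)
    return fieldnames


def spool_to_csv(spool, out, fieldnames):
    """Second pass: stream the JSON lines back out as CSV rows."""
    # restval fills in the columns a record does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    for line in spool:
        writer.writerow(json.loads(line))


# Hypothetical API pages; the second one introduces a new key.
batches = [[{"id": 1, "name": "a"}], [{"id": 2, "email": "b@example.com"}]]

spool = io.StringIO()  # a real script would use a temporary file on disk
out = io.StringIO()    # ... and open('output.csv', 'w', newline='')
fieldnames = spool_batches(batches, spool)
spool.seek(0)
spool_to_csv(spool, out, fieldnames)
csv_text = out.getvalue()
```

Only one record is held in memory at a time in either pass; the price is the second read over the spooled data.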

Combining JSON values based on key

I'm working on Python 2.6.6 and I'm struggling with one issue.
I have a large JSON file with the following structure:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}
{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_D","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
And I need to merge the id's so they'll contain all the GROUPS as so:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]},{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]},{"n":"GROUP_D","v":["true"]},{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
I tried using the 'json' library, but I couldn't append the values correctly. Also I tried to convert it all to a dictionary and append the values (GROUPS) to the key (id) as lists, but I got stuck on printing it all as I need to the output file.
I can do it using bash but it takes too long to parse all the information and rearrange it in the needed format.
Any help is appreciated!
Thanks.
First, let's get the JSON stuff out of the way.
Your file is not a JSON structure, it's a bunch of separate JSON objects. From your sample, it looks like it's one object per line. So, let's read this into a list:
import json
with open('spam.json') as f:
    things = [json.loads(line) for line in f]
Then we'll process this, and write it out:
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')
Now, you don't have a JSON structure that you want to append things to; you have a list of dicts, and you want to create a new list of dicts, merging together the ones with the same key.
Here's one way to do it:
new_things = {}
for thing in things:
    thing_id = thing['id']
    try:
        old_thing = new_things[thing_id]
    except KeyError:
        new_things[thing_id] = thing
    else:
        old_thing['ua'].extend(thing['ua'])
new_things = new_things.values()
There are a few different ways you could simplify this; I just wrote it this way because it uses no tricks that should be beyond a novice. For example, you could do it by sorting and grouping:
import itertools
import operator
def merge(things):
    return {'id': things[0]['id'],
            'ua': list(itertools.chain.from_iterable(t['ua'] for t in things))}
sorted_things = sorted(things, key=operator.itemgetter('id'))
grouped_things = itertools.groupby(sorted_things, key=operator.itemgetter('id'))
new_things = [merge(list(group)) for key, group in grouped_things]
I didn't realize from your original question that you had tens of millions of rows. All of the above steps require loading the entire original data set into memory, processing with some temporary storage, then writing it back out. But if your data set is too large, you need to find a way to process one row at a time, and keep as little in memory simultaneously as possible.
First, to process one row at a time, you just need to change the initial list comprehension to a generator expression, and move the rest of the code inside the with statement, like this:
with open('spam.json') as f:
    things = (json.loads(line) for line in f)
    for thing in things:
        # blah blah
… at which point it might be just as easy to rewrite it like this:
with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        # blah blah
Next, sorting obviously builds the whole sorted list in memory, so that's not acceptable here. But if you don't sort and group, the entire new_things result object has to be alive at the same time (because the very last input row could have to be merged into the very first output row).
Your sample data seems to already have the rows sorted by id. If you can count on that in real life—or just count on the rows always being grouped by id—just skip the sorting step, which isn't doing anything but wasting time and memory, and use a grouping solution.
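A minimal sketch of such a grouping-only solution, assuming the rows really do arrive grouped by id. The helper name merge_grouped is made up here, and an in-memory buffer stands in for the input file.

```python
import io
import itertools
import json


def merge_grouped(lines):
    """Yield one merged dict per run of consecutive lines sharing an id."""
    things = (json.loads(line) for line in lines)
    for thing_id, group in itertools.groupby(things, key=lambda t: t["id"]):
        merged_ua = []
        for thing in group:
            merged_ua.extend(thing["ua"])
        yield {"id": thing_id, "ua": merged_ua}


# Stand-in for open('spam.json'); any iterable of lines works.
source = io.StringIO(
    '{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}\n'
    '{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}\n'
    '{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}\n'
)
merged = list(merge_grouped(source))
```

Because groupby only looks at consecutive rows, just one group is held in memory at a time; if the same id shows up again later in the file, it would produce a second output row rather than being merged.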
On the other hand, if you can't count on the rows being grouped by id, there are only really two ways to reduce memory further: compress the data in some way, or back the storage to disk.
For the first, Foo Bar User's solution built a simpler and smaller data structure (a dict mapping each id to its list of uas, instead of a list of dicts, each with an id and a ua), which should take less memory, and which we could convert to the final format one row at a time. Like this:
import json
from collections import defaultdict
with open('spam.json') as f:
    new_dict = defaultdict(list)
    for row in f:
        thing = json.loads(row)
        new_dict[thing["id"]].extend(thing["ua"])
with open('eggs.json', 'w') as f:
    for id, ua in new_dict.items(): # use iteritems in Python 2.x
        thing = {'id': id, 'ua': ua}
        f.write(json.dumps(thing) + '\n')
For the second, Python comes with a nice way to use a dbm database as if it were a dictionary. If your values are just strings, you can use the anydbm/dbm module (or one of the specific implementations). Since your values are lists, you'll need to use shelve instead.
Anyway, while this will reduce your memory usage, it could slow things down. On a machine with 4GB of RAM, the savings in pagefile swapping will probably blow away the extra cost of going through the database… but on a machine with 16GB of RAM, you may just be adding overhead for very little gain. You may want to experiment with smaller files first, to see how much slower your code is with shelve vs. dict when memory isn't an issue.
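For reference, a small sketch of the shelve variant, assuming the ids are (or can be converted to) strings. The function name merge_into_shelf and the temporary path are illustrative only.

```python
import json
import os
import shelve
import tempfile


def merge_into_shelf(lines, shelf):
    """Accumulate each id's 'ua' entries in a disk-backed shelf."""
    for line in lines:
        thing = json.loads(line)
        key = str(thing["id"])    # shelf keys must be strings
        uas = shelf.get(key, [])  # this is a copy of the stored list
        uas.extend(thing["ua"])
        shelf[key] = uas          # reassign so the change is persisted


lines = [
    '{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}',
    '{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}',
    '{"id":"54321","ua":[{"n":"GROUP_D","v":["true"]}]}',
]

with tempfile.TemporaryDirectory() as tmp:
    shelf = shelve.open(os.path.join(tmp, "merged"))
    try:
        merge_into_shelf(lines, shelf)
        result = {key: [ua["n"] for ua in shelf[key]] for key in shelf}
    finally:
        shelf.close()
```

Every update unpickles and re-pickles the whole list for that id, which is part of why this can be noticeably slower than a plain dict.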
Alternatively, if things get way beyond the limits of your memory, you can always use a more powerful database that actually can sort things on disk. For example (untested):
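import sqlite3
db = sqlite3.connect('temp.sqlite')
c = db.cursor()
c.execute('CREATE TABLE Things (tid, ua)')
for thing in things:
    for ua in thing['ua']:
        # each ua is a dict, so store it as JSON text; the parameters go in one tuple
        c.execute('INSERT INTO Things (tid, ua) VALUES (?, ?)',
                  (thing['id'], json.dumps(ua)))
db.commit()  # commit() belongs to the connection, not the cursor
c.execute('SELECT tid, ua FROM Things ORDER BY tid')
rows = iter(c.fetchone, None)
grouped_things = itertools.groupby(rows, key=operator.itemgetter(0))
# rows are (tid, ua_json) tuples, so rebuild each dict here instead of calling merge()
new_things = ({'id': key, 'ua': [json.loads(ua) for tid, ua in group]}
              for key, group in grouped_things)
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')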

write table cell real-time python

I would like to loop through a database, find the appropriate values and insert them in the appropriate cell in a separate file. It may be a CSV, or any other human-readable format.
In pseudo-code:
for item in huge_db:
    for list_of_objects_to_match:
        if itemmatch():
            if there_arent_three_matches_yet_in_list():
                matches++
                result=performoperationonitem()
                write_in_file(result, row=object_to_match_id, col=matches)
            if matches is 3:
                remove_this_object_from_object_to_match_list()
Can you think of any way other than going through the whole output file line by line every time?
I don't even know what to search for...
Even better, are there better ways to find three matching objects in a DB and have the results in real time? (The operation will take a while, but I'd like to see the results popping out in real time.)
Assuming itemmatch() is a reasonably simple function, this will do what I think you want better than your pseudocode:
for match_obj in list_of_objects_to_match:
    db_objects = query_db_for_matches(match_obj)
    if len(db_objects) >= 3:
        result=performoperationonitem()
        write_in_file(result, row=match_obj.id, col=matches)
    else:
        write_blank_line(row=match_obj.id) # if you want
Then the trick becomes writing the query_db_for_matches() function. Without detail, I'll assume you're looking for objects that match in one particular field, call it type. In pymongo such a query would look like:
def query_db_for_matches(match_obj):
    return pymongo_collection.find({"type":match_obj.type})
To get this to run efficiently, make sure your database has an index on the field(s) you're querying on by first calling:
pymongo_collection.create_index("type")
(Older PyMongo releases spelled this ensure_index; that method was deprecated in PyMongo 3 and removed in PyMongo 4.) The first time you call create_index it could take a long time for a huge collection. But each time after that it will be fast, since the index already exists -- fast enough that you could even put it into query_db_for_matches before your find and it would be fine.

Merging two datasets in Python efficiently

What would anyone consider the most efficient way to merge two datasets using Python?
A little background - this code will take 100K+ records in the following format:
{user: aUser, transaction: UsersTransactionNumber}, ...
and using the following data
{transaction: aTransactionNumber, activationNumber: associatedActivationNumber}, ...
to create
{user: aUser, activationNumber: associatedActivationNumber}, ...
N.B These are not Python dictionaries, just the closest thing to portraying record format cleanly.
So in theory, all I am trying to do is create a view of two lists (or tables) joining on a common key - at first this points me towards sets (unions etc), but before I start learning these in depth, are they the way to go? So far I felt this could be implemented as:
Create a list of dictionaries and iterate over the list comparing the key each time, however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
Manipulate the data as an in-memory SQLite table? Preferably not: although there is no strict requirement for Python 2.4, staying compatible with it would make life easier.
Some kind of Set based magic?
Clarification
The whole purpose of this script is to summarise, the actual data sets are coming from two different sources. The user and transaction numbers are coming in the form of a CSV as an output from a performance test that is testing email activation code throughput. The second dataset comes from parsing the test mailboxes, which contain the transaction id and activation code. The output of this test is then a CSV that will get pumped back into stage 2 of the performance test, activating user accounts using the activation codes that were paired up.
Apologies if my notation for the records was misleading, I have updated them accordingly.
Thanks for the replies, I am going to give two ideas a try:
Sorting the lists first (I don't know how expensive this is)
Creating a dictionary with the transactionCodes as the key, then storing the user and activation code in a list as the value
Performance isn't overly paramount for me, I just want to try and get into good habits with my Python Programming.
Here's a radical approach.
Don't.
You have two CSV files; one (users) is clearly the driver. Leave this alone.
The other -- transaction codes for a user -- can be turned into a simple dictionary.
Don't "combine" or "join" anything except when absolutely necessary. Certainly don't "merge" or "pre-join".
Write your application to simply do lookups in the other collection.
Create a list of dictionaries and iterate over the list comparing the key each time,
Close. It looks like this. Note: No Sort.
import csv
# Python 2 file modes; on Python 3 use open(name, newline='') and open(name, 'w', newline='')
with open('activations.csv','rb') as act_data:
    rdr= csv.DictReader( act_data)
    activations = dict( (row['user'],row) for row in rdr )
with open('users.csv','rb') as user_data:
    rdr= csv.DictReader( user_data )
    with open( 'users_2.csv','wb') as updated_data:
        wtr= csv.DictWriter( updated_data, ['some','list','of','columns'])
        for user in rdr:
            user['some_field']= activations[user['user_id_column']]['some_field']
            wtr.writerow( user )
This is fast and simple. Save the dictionaries (use shelve or pickle).
however, worst case scenario this could run up to len(inputDict)*len(outputDict) <- Not sure?
False.
One list is the "driving" list. The other is the lookup list. You'll drive by iterating through users and lookup appropriate values for transaction. This is O( n ) on the list of users. The lookup is O( 1 ) because dictionaries are hashes.
Sort the two data sets by transaction number. That way, you always only need to keep one row of each in memory.
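A sketch of that sort-then-merge idea, assuming both inputs are already sorted by transaction number and that each transaction number appears at most once per input. The function name merge_join and the sample records are invented for illustration.

```python
def merge_join(users, activations):
    """Join two iterables that are both sorted by 'transaction'."""
    users = iter(users)
    activations = iter(activations)
    user = next(users, None)
    act = next(activations, None)
    while user is not None and act is not None:
        if user["transaction"] < act["transaction"]:
            user = next(users, None)       # no activation for this user
        elif user["transaction"] > act["transaction"]:
            act = next(activations, None)  # activation without a user
        else:
            yield {"user": user["user"],
                   "activationNumber": act["activationNumber"]}
            user = next(users, None)
            act = next(activations, None)


users = [{"user": "alice", "transaction": 1},
         {"user": "bob", "transaction": 2},
         {"user": "carol", "transaction": 4}]
activations = [{"transaction": 2, "activationNumber": "B-22"},
               {"transaction": 3, "activationNumber": "C-33"},
               {"transaction": 4, "activationNumber": "D-44"}]

joined = list(merge_join(users, activations))
```

Only the current row of each input is held at any moment, so this also works when the inputs are file readers rather than lists.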
This looks like a typical use for dictionaries with transaction number as key. But you don't have to create the common structure, just build the lookup dictionaries and use them as needed.
I'd create a map myTransactionNumber -> {transaction: myTransactionNumber, activationNumber: myActivationNumber} and then iterate over the {user: myUser, transaction: myTransactionNumber} entries, searching the map for each myTransactionNumber. With a tree-based map, a search is O(log N), where N is the number of entries in the map, so the overall complexity would be O(M*log N), where M is the number of user entries. With a Python dict (a hash table) each search is O(1) on average, which brings the total down to O(M).
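The map-then-iterate approach might look roughly like this, using a plain Python dict as the map. The function name join_by_transaction and the sample records are assumptions made for the example.

```python
def join_by_transaction(users, activations):
    """Yield user/activation pairs using a dict keyed by transaction number."""
    # Build the lookup once: one pass over the activation records.
    by_transaction = {a["transaction"]: a["activationNumber"] for a in activations}
    # Then one dictionary lookup per user record.
    for u in users:
        code = by_transaction.get(u["transaction"])
        if code is not None:  # skip users with no matching activation
            yield {"user": u["user"], "activationNumber": code}


users = [{"user": "alice", "transaction": 7},
         {"user": "bob", "transaction": 3},
         {"user": "carol", "transaction": 9}]
activations = [{"transaction": 3, "activationNumber": "X-3"},
               {"transaction": 7, "activationNumber": "X-7"}]

pairs = list(join_by_transaction(users, activations))
```

Neither input has to be sorted, and the output follows the order of the user records.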

Storing huge hash table in a file in Python

Hey. I have a function I want to memoize, however, it has too many possible values. Is there any convenient way to store the values in a text file and make it read from them? For example, something like storing a pre-computed list of primes up to 10^9 in a text file? I know it's slow to read from a text file but there's no other option if the amount of data is really huge. Thanks!
For a list of primes up to 10**9, why do you need a hash? What would the KEYS be?! Sounds like a perfect opportunity for a simple, straightforward binary file! By the Prime Number Theorem, there are about 10**9/ln(10**9) such primes -- i.e. roughly 50 million. At 4 bytes per prime, that's only 200 MB or so -- perfect for an array.array("L") (or "I" on platforms where "L" is 8 bytes wide; check itemsize) and its methods such as fromfile, etc (see the docs). In many cases you could actually suck all of the 200 MB into memory, but, worst case, you can get a slice of those (e.g. via mmap and the frombytes method of array.array, which replaced fromstring in Python 3), do binary searches there (e.g. via bisect), etc, etc.
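A small sketch of the binary-file idea, scaled down to the primes below 100 so it runs instantly. The sieve helper, the is_prime function and the file name are only there for the demonstration; the "I" typecode is assumed to be 4 bytes wide, which holds on common platforms.

```python
import array
import bisect
import os
import tempfile


def small_primes(limit):
    """Simple sieve, only here to produce demo data."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            sieve[n * n::n] = bytearray(len(range(n * n, limit + 1, n)))
    return [n for n, flag in enumerate(sieve) if flag]


def is_prime(primes, n):
    """Binary search in the sorted array of primes."""
    i = bisect.bisect_left(primes, n)
    return i < len(primes) and primes[i] == n


path = os.path.join(tempfile.mkdtemp(), "primes.bin")

# Write the table once ...
table = array.array("I", small_primes(100))
with open(path, "wb") as f:
    table.tofile(f)

# ... and load it back later without parsing any text.
loaded = array.array("I")
with open(path, "rb") as f:
    loaded.fromfile(f, os.path.getsize(path) // loaded.itemsize)
```

The file written this way uses the machine's native byte order and item size, so it is meant to be read back on the same kind of platform that wrote it.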
When you DO need a huge key-values store -- gigabytes, not a paltry 200 MB!-) -- I used to recommend shelve but after unpleasant real-life experience with huge shelves (performance, reliability, etc), I currently recommend a database engine instead -- sqlite is good and comes with Python, PostgreSQL is even better, non-relational ones such as CouchDB can be better still, and so forth.
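To illustrate the database route, here is a minimal sketch of a memo table in SQLite. The class name SqliteMemo, the table layout and the squared-number stand-in for the expensive function are all assumptions made for the example.

```python
import sqlite3


class SqliteMemo:
    """A tiny persistent int -> int cache backed by SQLite."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memo (key INTEGER PRIMARY KEY, value INTEGER)"
        )

    def get(self, key):
        row = self.db.execute(
            "SELECT value FROM memo WHERE key = ?", (key,)
        ).fetchone()
        return None if row is None else row[0]

    def put(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO memo (key, value) VALUES (?, ?)", (key, value)
        )
        self.db.commit()


def memoized_square(memo, n):
    """Hypothetical expensive function, cached through the memo table."""
    cached = memo.get(n)
    if cached is None:
        cached = n * n
        memo.put(n, cached)
    return cached


memo = SqliteMemo(":memory:")       # pass a filename to keep results between runs
first = memoized_square(memo, 12)
second = memoized_square(memo, 12)  # served from the table this time
```

Committing after every put keeps the sketch simple; when filling the table in bulk, committing once per batch is usually much faster.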
You can use the shelve module to store a dictionary like structure in a file. From the Python documentation:
import shelve
d = shelve.open(filename) # open -- file may get suffix added by low-level
# library
d[key] = data # store data at key (overwrites old data if
# using an existing key)
data = d[key] # retrieve a COPY of data at key (raise KeyError
# if no such key)
del d[key] # delete data stored at key (raises KeyError
# if no such key)
flag = key in d # true if the key exists
klist = list(d.keys()) # a list of all existing keys (slow!)
# as d was opened WITHOUT writeback=True, beware:
d['xx'] = [0, 1, 2] # this works as expected, but...
d['xx'].append(3) # *this doesn't!* -- d['xx'] is STILL [0, 1, 2]!
# having opened d without writeback=True, you need to code carefully:
temp = d['xx'] # extracts the copy
temp.append(5) # mutates the copy
d['xx'] = temp # stores the copy right back, to persist it
# or, d=shelve.open(filename,writeback=True) would let you just code
# d['xx'].append(5) and have it work as expected, BUT it would also
# consume more memory and make the d.close() operation slower.
d.close() # close it
You could also just go with the ultimate brute force, and create a Python file with just a single statement in it:
seedprimes = [3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,
79,83,89,97,101,103,107,109,113,127,131,137,139,149,151,157,163,167,173, ...
and then just import it. (Here is a file with the primes up to 1e5: http://python.pastebin.com/f177ec30).
from primes_up_to_1e9 import seedprimes
For Project Euler, I stored a precomputed list of primes up to 10**8 in a text file just by writing them in comma separated format. It worked well for that size, but it doesn't scale well to going much larger.
If your "huge" is not really that huge, I would use something simple like my approach; otherwise I would go with shelve as the others have said.
Just naively sticking a hash table onto disk will result in about 5 orders of magnitude performance loss compared to an in-memory implementation (or at least 3 if you have an SSD). When dealing with hard disks you'll want to extract every bit of data locality and caching you can get.
The correct choice will depend on details of your use case. How much performance do you need? What kind of operations do you need on data-structure? Do you need to only check if the table contains a key or do you need to fetch a value based on the key? Can you precompute the table or do you need to be able to modify it on the fly? What kind of hit rate are you expecting? Can you filter out a significant amount of the operations using a bloom filter? Are the requests uniformly distributed or do you expect some kind of temporal locality? Can you predict the locality clusters ahead of time?
If you don't need ultimate performance or can parallelize and throw hardware at the problem check out some distributed key-value stores.
You can also go one step down the ladder and use pickle. Shelve is built on top of pickle, so if you don't need the added functionality of shelve, this may spare you some clock cycles (although they really don't matter to you, as you have chosen Python for storing a large amount of numbers).
Let's see where the bottleneck is. When you're going to read a file, the hard drive has to turn enough to be able to read from it; then it reads a big block and caches the results.
So you want some method that will work out exactly which position in the file you're going to read from and then read it exactly once. I'm pretty sure the standard DB modules will work for you, but you can do it yourself -- just open the file in binary mode for reading/writing and store your values as, say, 30-digit (= 100-bit = 13-byte) numbers.
Then use the standard file methods (seek, read, write).
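A rough sketch of that do-it-yourself layout, assuming values are non-negative integers of at most 30 decimal digits stored in fixed 13-byte slots. The helper names are made up, and an in-memory buffer stands in for a real file opened in binary read/write mode.

```python
import io

RECORD_SIZE = 13  # bytes per value: enough for any 30-digit decimal number


def write_value(f, index, value):
    """Store value in slot number index of the binary file f."""
    f.seek(index * RECORD_SIZE)
    f.write(value.to_bytes(RECORD_SIZE, "big"))


def read_value(f, index):
    """Read slot number index back with a single seek and a single read."""
    f.seek(index * RECORD_SIZE)
    data = f.read(RECORD_SIZE)
    if len(data) < RECORD_SIZE:
        raise IndexError(index)
    return int.from_bytes(data, "big")


# io.BytesIO stands in for open('table.bin', 'r+b'), a hypothetical file name.
f = io.BytesIO()
write_value(f, 0, 2)
write_value(f, 1, 10**30 - 1)  # largest 30-digit number
write_value(f, 2, 65537)
```

Because every slot has the same width, the position of entry i is simply i * RECORD_SIZE, so a lookup never has to scan the file.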
