Combining JSON values based on key - Python

I'm working on Python 2.6.6 and I'm struggling with one issue.
I have a large JSON file with the following structure:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]}]}
{"id":"12345","ua":[{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_D","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
And I need to merge them by id so that each id contains all of its GROUPS, like so:
{"id":"12345","ua":[{"n":"GROUP_A","v":["true"]},{"n":"GROUP_B","v":["true"]}]}
{"id":"54321","ua":[{"n":"GROUP_C","v":["true"]},{"n":"GROUP_D","v":["true"]},{"n":"GROUP_E","v":["true"]}]}
{"id":"98765","ua":[{"n":"GROUP_F","v":["true"]}]}
I tried using the json library, but I couldn't append the values correctly. I also tried converting it all to a dictionary and appending the values (GROUPS) to the key (id) as lists, but I got stuck on writing it all out to the output file in the format I need.
I can do it using bash, but it takes too long to parse all the information and rearrange it into the needed format.
Any help is appreciated!
Thanks.

First, let's get the JSON stuff out of the way.
Your file is not a JSON structure; it's a bunch of separate JSON objects. From your sample, it looks like it's one object per line. So, let's read this into a list:
import json

with open('spam.json') as f:
    things = [json.loads(line) for line in f]
Then we'll process this, and write it out:
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')
Now, you don't have a JSON structure that you want to append things to; you have a list of dicts, and you want to create a new list of dicts, merging together the ones with the same key.
Here's one way to do it:
new_things = {}
for thing in things:
    thing_id = thing['id']
    try:
        old_thing = new_things[thing_id]
    except KeyError:
        # first time we've seen this id: keep the whole dict
        new_things[thing_id] = thing
    else:
        # already seen this id: just append this row's ua entries
        old_thing['ua'].extend(thing['ua'])
new_things = new_things.values()
There are a few different ways you could simplify this; I just wrote it this way because it uses no tricks that should be beyond a novice. For example, you could do it by sorting and grouping:
import itertools
import operator

def merge(things):
    return {'id': things[0]['id'],
            'ua': list(itertools.chain.from_iterable(t['ua'] for t in things))}

sorted_things = sorted(things, key=operator.itemgetter('id'))
grouped_things = itertools.groupby(sorted_things, key=operator.itemgetter('id'))
new_things = [merge(list(group)) for key, group in grouped_things]
I didn't realize from your original question that you had tens of millions of rows. All of the above steps require loading the entire original data set into memory, processing it with some temporary storage, and then writing it back out. But if your data set is too large, you need to find a way to process one row at a time, keeping as little in memory simultaneously as possible.
First, to process one row at a time, you just need to change the initial list comprehension to a generator expression, and move the rest of the code inside the with statement, like this:
with open('spam.json') as f:
    things = (json.loads(line) for line in f)
    for thing in things:
        # blah blah
… at which point it might be just as easy to rewrite it like this:
with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        # blah blah
Next, sorting obviously builds the whole sorted list in memory, so that's not acceptable here. But if you don't sort and group, the entire new_things result object has to be alive at the same time (because the very last input row could have to be merged into the very first output row).
Your sample data seems to already have the rows sorted by id. If you can count on that in real life—or just count on the rows always being grouped by id—just skip the sorting step, which isn't doing anything but wasting time and memory, and use a grouping solution.
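For example, a streaming sketch along those lines (untested, assuming the same spam.json and eggs.json filenames as above, and that the rows really are grouped by id):

import itertools
import json
import operator

with open('spam.json') as fin:
    with open('eggs.json', 'w') as fout:
        things = (json.loads(line) for line in fin)
        for key, group in itertools.groupby(things, key=operator.itemgetter('id')):
            # only one id's worth of rows is in memory at any time
            merged = {'id': key,
                      'ua': list(itertools.chain.from_iterable(t['ua'] for t in group))}
            fout.write(json.dumps(merged) + '\n')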
On the other hand, if you can't count on the rows being grouped by id, there are only really two ways to reduce memory further: compress the data in some way, or back the storage to disk.
For the first, Foo Bar User's solution built a simpler and smaller data structure (a dict mapping each id to its list of uas, instead of a list of dicts, each with an id and a ua), which should take less memory, and which we could convert to the final format one row at a time. Like this:
from collections import defaultdict

with open('spam.json') as f:
    new_dict = defaultdict(list)
    for row in f:
        thing = json.loads(row)
        new_dict[thing["id"]].extend(thing["ua"])

with open('eggs.json', 'w') as f:
    for id, ua in new_dict.items():  # use iteritems in Python 2.x
        thing = {'id': id, 'ua': ua}
        f.write(json.dumps(thing) + '\n')
For the second, Python comes with a nice way to use a dbm database as if it were a dictionary. If your values are just strings, you can use the anydbm/dbm module (or one of the specific implementations). Since your values are lists, you'll need to use shelve instead.
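For example, a rough (untested) sketch of the shelve variant, using the same filenames as above:

import json
import shelve

db = shelve.open('temp.shelf', 'c')
with open('spam.json') as f:
    for line in f:
        thing = json.loads(line)
        # shelve only notices changes on key assignment, so read-modify-write
        ua = db.get(thing['id'], [])
        ua.extend(thing['ua'])
        db[thing['id']] = ua
with open('eggs.json', 'w') as f:
    for key in db.keys():
        f.write(json.dumps({'id': key, 'ua': db[key]}) + '\n')
db.close()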
Anyway, while this will reduce your memory usage, it could slow things down. On a machine with 4GB of RAM, the savings in pagefile swapping will probably blow away the extra cost of going through the database… but on a machine with 16GB of RAM, you may just be adding overhead for very little gain. You may want to experiment with smaller files first, to see how much slower your code is with shelve vs. dict when memory isn't an issue.
Alternatively, if things get way beyond the limits of your memory, you can always use a more powerful database that actually can sort things on disk. For example (untested):
import sqlite3

db = sqlite3.connect('temp.sqlite')
c = db.cursor()
c.execute('CREATE TABLE Things (tid, ua)')
for thing in things:
    for ua in thing['ua']:
        # store each ua entry as a JSON string so sqlite can hold it
        c.execute('INSERT INTO Things (tid, ua) VALUES (?, ?)',
                  (thing['id'], json.dumps(ua)))
db.commit()
c.execute('SELECT tid, ua FROM Things ORDER BY tid')
rows = iter(c.fetchone, None)
grouped_things = itertools.groupby(rows, key=operator.itemgetter(0))
new_things = ({'id': key, 'ua': [json.loads(ua) for tid, ua in group]}
              for key, group in grouped_things)
with open('eggs.json', 'w') as f:
    for thing in new_things:
        f.write(json.dumps(thing) + '\n')

Related

Merge Two TinyDB Databases

In Python, I'm trying to merge multiple JSON files obtained from TinyDB.
I was not able to find a way to directly merge two TinyDB JSON files whose keys are autogenerated in sequence, so that the numbering does not restart when the next file is opened.
In code terms, I want to merge a large amount of data like this:
hello1 = {"1":"bye", "2":"good", ..., "20000":"goodbye"}
hello2 = {"1":"dog", "2":"cat", ..., "15000":"monkey"}
As:
hello3 = {"1":"bye", "2":"good", ..., "20000":"goodbye", "20001":"dog", "20002":"cat", ..., "35000":"monkey"}
Because of the problem of finding the correct way to do this with TinyDB, I opened the files and transformed them into classic JSON files, loading each file and then doing:
Data = Data['_default']
The problem I have is that the code works at the moment, but it has serious memory problems. After a few seconds, the merged DB contains about 28 MB of data, but (probably) the cache saturates, and it starts adding all the remaining data in a really slow way.
So, I need to empty the cache after a certain amount of data, or probably I need to change the way I do this!
That's the code that i use:
Try1.purge()
Try1 = TinyDB('FullDB.json')

with open('FirstDataBase.json') as Part1:
    Datapart1 = json.load(Part1)
    Datapart1 = Datapart1['_default']
    for dets in range(1, len(Datapart1)):
        Try1.insert(Datapart1[str(dets)])

with open('SecondDatabase.json') as Part2:
    Datapart2 = json.load(Part2)
    Datapart2 = Datapart2['_default']
    for dets in range(1, len(Datapart2)):
        Try1.insert(Datapart2[str(dets)])
Question: Merge Two TinyDB Databases ... probably I need to change the way I do this!
From TinyDB Documentation
Why Not Use TinyDB?
...
You are really concerned about performance and need a high speed database.
Single-row insertions into a DB are always slow; try db.insert_multiple(...) instead.
The second form, with a generator, gives you the option to hold down the memory footprint.
# From a list
Try1.insert_multiple([{"1":"bye", "2":"good", ..., "20000":"goodbye"}])
or
# From generator function
Try1.insert_multiple(generator())
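For example, a rough sketch of the generator variant, assuming Datapart1 and Datapart2 are the '_default' dicts you already load in your code:

def generator():
    # yield every record from both source dicts, in key order,
    # so insert_multiple can batch them without a huge intermediate list
    for part in (Datapart1, Datapart2):
        for key in sorted(part, key=int):
            yield part[key]

Try1.insert_multiple(generator())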

How do I write unknown keys to a CSV in large data sets?

I'm currently working on a script that will query data from a REST API, and write the resulting values to a CSV. The data set could potentially contain hundreds of thousands of records, but it returns the data in sets of 100 entries. My goal is to include every key from each entry in the CSV.
What I have so far (this is a simplified structure for the purposes of this question):
import csv

resp = client.get_list()
while resp.token:
    my_data = resp.data
    process_data(my_data)
    resp = client.get_list(resp.token)

def process_data(my_data):
    # This section should write my_data to a CSV file
    # I know I can use csv.DictWriter as part of the solution
    # Notice that in this example, "fieldnames" is undefined
    # Defining it is part of the question
    with open('output.csv', 'a') as output_file:
        writer = csv.DictWriter(output_file, fieldnames=fieldnames)
        for element in my_data:
            writer.writerow(element)
The problem: Each entry doesn't necessarily have the same keys. A later entry missing a key isn't that big of a deal. My problem is, for example, entry 364 introducing an entirely new key.
Options that I've considered:
Whenever I encounter a new key, read in the output CSV, append the new key to the header, and append a comma to each previous line. This leads to a TON of file I/O, which I'm hoping to avoid.
Rather than writing to a CSV, write the raw JSON to a file. Meanwhile, build up a list of all known keys as I iterate over the data. Once I've finished querying the API, iterate over the JSON files that I wrote, and write the CSV using the list that I built. This leads to 2 total iterations over the data, and feels unnecessarily complex.
Hard code the list of potential keys beforehand. This approach is impossible, for a number of reasons.
None of these solutions feel particularly elegant to me, which leads me to my question. Is there a better way for me to approach this problem? Am I overlooking something obvious?
Options 1 and 2 both seem reasonable.
Does the CSV need to be valid and readable while you're creating it? If not, you could do the append of missing columns in one pass after you've finished reading from the API (which would be like a combination of the two approaches). If you do this, you'll probably have to use the regular csv.writer in the first pass rather than csv.DictWriter, since your column definition will grow while you're writing.
One thing to bear in mind: if the overall file is expected to be large (e.g. won't fit into memory), then your solution probably needs to use a streaming approach, which is easy with CSV but fiddly with JSON. You might also want to look into alternative formats to JSON for the intermediate data (e.g. XML, BSON, etc.).
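A rough sketch of the "write the raw data first, build the CSV once all keys are known" approach (the raw_data.jsonl intermediate filename is made up; client, resp.token and resp.data are as in your example):

import csv
import json

fieldnames = set()

# First pass: stream the API results to a JSON Lines file while collecting keys.
with open('raw_data.jsonl', 'w') as raw:
    resp = client.get_list()
    while resp.token:
        for element in resp.data:
            fieldnames.update(element.keys())
            raw.write(json.dumps(element) + '\n')
        resp = client.get_list(resp.token)

# Second pass: the full header is now known, so DictWriter can do the rest
# (missing keys default to empty fields).
with open('output.csv', 'w') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=sorted(fieldnames))
    writer.writeheader()
    with open('raw_data.jsonl') as raw:
        for line in raw:
            writer.writerow(json.loads(line))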

Python: How to restart a FOR loop, which iterates over a csv

I am using Python 3.5 and I want to load data from a CSV into several lists, but it only works exactly once with a for loop; after that, it loads nothing into them.
Here is the code:
import csv

f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=';')

list_f1_vorname = []
for row_f1 in csv_f1:
    list_f1_vorname.append(row_f1[2])

list_f1_name = []
for row_f1 in csv_f1:  # <-- HERE IS THE ERROR, IT DOESN'T WORK A SECOND TIME!
    list_f1_name.append(row_f1[3])
Does anybody know how to restart this thing?
Many thanks and best regards,
Saitam
csv_f1 is not a list, it is an iterator.
Either you cache csv_f1 into a list using list(), or you just recreate the object.
I would recommend recreating the object in case your CSV data gets very big. This way, the data is not all pulled into RAM at once.
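For example, to reuse the already-open file you can rewind it and build a fresh reader:

f1.seek(0)                              # rewind the underlying file object
csv_f1 = csv.reader(f1, delimiter=';')  # recreate the reader

list_f1_name = []
for row_f1 in csv_f1:
    list_f1_name.append(row_f1[3])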
The simple answer is to iterate over the csv once and store it into a list.
something like
my_list = []
for row in csv_f1:
    my_list.append(row)
or what abukaj wrote with
csv_f1 = list(csv.reader(f1, delimiter=';'))
and then move on and iterate over that list as many times as you want.
However, if you are only trying to get certain columns, then you can simply do that in the same for loop:
list_f1_vorname = []
list_f1_name = []
for row in csv_f1:
    list_f1_vorname.append(row[2])
    list_f1_name.append(row[3])
The reason it doesn't work multiple times is that it is an iterator: it will iterate over the values once, but it does not restart at the beginning after it has already gone through the data.
Try:
csv_f1 = list(csv.reader(f1, delimiter=';'))
It is not exactly restarting the reader, but rather caching the file contents in a list, which may be iterated many times.
One thing nobody has noticed so far is that you're trying to store first names and last names in two separate lists. This is not going to be very convenient to use later on. Therefore, although the other answers show correct possible solutions for reading names and last names from the CSV into two separate lists, I'm going to propose that you use a single list of dicts instead:
f1 = open("csvfile.csv", encoding="latin-1")
csv_f1 = csv.reader(f1, delimiter=";")

list_of_names = []
for row_f1 in csv_f1:
    list_of_names.append({
        "vorname": row_f1[2],
        "name": row_f1[3]
    })
Then you can iterate over this list and take the value you want. For example to simply print the values:
for row in list_of_names:
    print(row["vorname"])
    print(row["name"])
Last but not least, you could also build this list using a list comprehension (kinda more Pythonic):
list_of_names = [
    {
        "vorname": row_f1[2],
        "name": row_f1[3]
    }
    for row_f1 in csv_f1
]
As I said, I appreciate the other answers. They solve the issue of the csv reader being an iterator and not a list-like object.
Nevertheless, I see a little bit of an XY problem in your question. I've seen so many attempts to store entity properties (name and last name are obviously related properties and together form a simple entity) in multiple lists. It always ends up with code that is hard to read and maintain.

Reading/Writing/Appending using json module

I was recommended some code from someone here but I still need some help. Here's the code that was recommended:
import json

def write_data(data, filename):
    with open(filename, 'w') as outfh:
        json.dump(data, outfh)

def read_data(filename):
    with open(filename, 'r') as infh:
        json.load(infh)
The writing code works fine, but the reading code doesn't return any strings after inputting the data:
read_data('player.txt')
Another thing that I'd like to be able to do is specify a line to be printed. Something that is also pretty important for this project / exercise is to be able to append strings to my file. Thanks to anyone who can help me.
Edit: I need to store strings in the file that I can convert to values, i.e.:
name="Karatepig"
is something I would store so I can recall the data if I ever need to load a previous set of data or something of that sort.
I'm very much a noob at python so I don't know what would be best, whether a list or dictionary; I haven't really used a dictionary before, and also I have no idea yet as to how I'm going to convert strings into values.
Suppose you have a dictionary like this:
data = {'name': 'Karatepig', 'score': 10}
Your write_data function is fine, but your read_data function needs to actually return the data after reading it from the file:
def read_data(filename):
    with open(filename, 'r') as infh:
        data = json.load(infh)
    return data

data = read_data('player.json')
Now suppose you want to print the name and update the score:
print data['name']
data['score'] += 5 # add 5 points to the score
You can then write the data back to disk using the write_data function. You can not use the json module to append data to a file. You need to actually read the file, modify the data, and write the file again. This is not so bad for small amounts of data, but consider using a database for larger projects.
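Putting it together, a minimal round trip with the two helpers above might look like this (the filename is just an example):

data = {'name': 'Karatepig', 'score': 10}
write_data(data, 'player.json')      # serialize the dictionary to disk

loaded = read_data('player.json')    # read it back in
loaded['score'] += 5                 # modify ("append to") the data in memory
write_data(loaded, 'player.json')    # write the whole structure out again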
Whether you use a list or dictionary depends on what you're trying to do. You need to explain what you're trying to accomplish before anyone can recommend an approach. Dictionaries store values assigned to keys (see above). Lists are collections of items without keys and are accessed using an index.
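For example, to illustrate the difference:

scores = [10, 15, 20]                          # list: items accessed by position
print scores[1]                                # prints 15

player = {'name': 'Karatepig', 'score': 10}    # dict: values accessed by key
print player['name']                           # prints Karatepig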
It isn't clear what you mean by "convert strings into values." Notice that the above code shows the string "Karatepig" using the dictionary key "name".

Python - creating dictionary for each key given file

I'm working on a text-parsing algorithm (open-source side project). I'd be very appreciative for any advice.
I have a tab-delimited txt file which is sorted by the first column (sample dataset below). Duplicate entries exist within this column.
Ultimately, I would like to use a hash to point to all values which have the same key (first column value). Should a new key come along, the contents of the hash are to be serialized, saved, etc., and then cleared for the new key to populate it. As a result, my goal is to have only one key present at a time. Therefore, if I have N unique keys, I wish to make N hashes, each pointing to its respective entries. The datasets, though, are GBs in size and in-memory heaps won't be much help, hence my reasoning to create a hash per key and process each individually.
SAMPLE DATASET
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123
...
So in the above dataset snippet, I wish to have a hash for 'A' (pointing to its 3x respective items). When 'B' is read, serialize the 'A' hash and clear the hash-contents. Repeat for 'B' until end of dataset.
My pseudocode is as follows:
declare hash
for item in the dataset:
    key, value = item[0], item[1:]
    if key not in hash:
        if hash.size is 0:      // pertains to the very first item
            hash.put(key, value)
        else:
            clear hash          // if a new key is read but a diff. key is present
    else:
        hash.put(key, value)    // key already there so append it
If any suggestions exist as to how to efficiently implement the above algorithm, I'd be very appreciative. Also, if my hash-reasoning/approach is not efficient or if improvements could be brought-up, I'd be very thankful. My goal is to ultimately create in-memory hashes until a new key comes along.
Thank you,
p.
Use itertools.groupby, passing it the file as an iterator:
from itertools import groupby
from cStringIO import StringIO

sourcedata = StringIO("""\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123""")

# or sourcedata = open("zillion_gigabyte_file.dat")

for key, lines in groupby(sourcedata, key=lambda s: s.split()[0]):
    accum = [float(s.split()[2]) for s in lines]
    print key, accum
groupby is very smart and efficient, keeping very little data in memory at a time, keeping things purely in the form of iterators until the last possible moment. What you describe with hashes, keeping only one in memory at a time, and all that, is already done for you by groupby.
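If you also need to persist each group as it is produced (as your pseudocode suggests), you can serialize inside the same loop; a rough sketch, reusing sourcedata and groupby from above and writing one made-up out_<key>.json file per key:

import json

for key, lines in groupby(sourcedata, key=lambda s: s.split()[0]):
    accum = [float(s.split()[2]) for s in lines]
    with open('out_%s.json' % key, 'w') as f:
        json.dump(accum, f)   # serialize this key's values, then move on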
You could open an anydbm (2.x) or dbm (3.x) database for each key in your first column, named by the value of the column. This is pretty trivial - I'm not sure what the question is.
You could also use something like my cachedb module, to let it figure out whether something is "new" or not: http://stromberg.dnsalias.org/~strombrg/cachedb.html I've used it in two projects, both with good results.
Either way, you could probably make your values just lists of ASCII floats separated by newlines or nulls or something.
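For example, a rough (untested, Python 2) sketch of the dbm idea, using the zillion_gigabyte_file.dat name from the other answer:

import anydbm

db = anydbm.open('grouped.db', 'c')
for line in open('zillion_gigabyte_file.dat'):
    key, _, value = line.split()
    # dbm values must be strings, so keep newline-separated ASCII floats per key
    try:
        db[key] = db[key] + '\n' + value
    except KeyError:
        db[key] = value
db.close()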
You don't make it explicit whether the sorted data you give is typical or whether the same key can be interspersed with other keys throughout the file, and that does make a fundamental difference to the algorithm. I deduce from your sample code that they will appear in arbitrary order.
Neither do you say to what use you intend to put the data so extracted. This can matter a lot: there are many different ways to store data, and the application can be a crucial feature in determining access times. So you may want to consider the use of various different storage types. Without knowing how you propose to use the resulting structure, the following suggestion may be inappropriate.
Since the data are floating-point numbers, you may want to consider using the shelve module to maintain simple lists of the floating-point numbers keyed against the alphanumeric keys. This has the advantage that all pickling and unpickling to/from external storage is handled automatically. If you need an increase in speed, consider using a more efficient pickling protocol (one of the unused arguments to shelve.open()).
# Transform the data:
# note it's actually more efficient to process line-by-line
# as you read it from a file - obviously it's best to try
# to avoid reading the whole data set into memory at once.
data = """\
A ... 23.4421
A ... -23.442
A ... 76.2224
B ... 32.1232
B ... -23.001
C ... 652.123"""

data = [(k, float(v))
        for (k, _, v) in
        [_.split() for _ in data.splitlines()]]

# create a shelve
import shelve
shelf = shelve.open("myshelf", "c")

# process data
for (k, v) in data:
    if k in shelf:
        # see note below for rationale
        tmp = shelf[k]
        tmp.append(v)
        shelf[k] = tmp
    else:
        shelf[k] = [v]

# verify results
for k in shelf.keys():
    print k, shelf[k]
You may be wondering why I didn't just use shelf[k].append(v) in the case where a key has already been seen. This is because it's only the operation of key assignment that triggers detection of the value change. You can read the shelve module docs for more detail, and to learn how to use the binary pickle format.
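Alternatively (an untested sketch), shelve can do that bookkeeping for you if you open the shelf with writeback=True, at the cost of caching every accessed entry in memory:

shelf = shelve.open("myshelf", "c", writeback=True)
for (k, v) in data:
    if k in shelf:
        shelf[k].append(v)   # in-place mutation is now tracked in the cache
    else:
        shelf[k] = [v]
shelf.close()                # close() writes the cached entries back out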
Note also that this program re-creates the shelf each time it is run due to the "c" argument to shelve.open().
