How to prevent ast.literal_eval from overwriting same key values? [duplicate] - python

I need to parse a json file which unfortunately for me, does not follow the prototype. I have two issues with the data, but i've already found a workaround for it so i'll just mention it at the end, maybe someone can help there as well.
So i need to parse entries like this:
"Test":{
"entry":{
"Type":"Something"
},
"entry":{
"Type":"Something_Else"
}
}, ...
The json default parser updates the dictionary and therfore uses only the last entry. I HAVE to somehow store the other one as well, and i have no idea how to do this. I also HAVE to store the keys in the several dictionaries in the same order they appear in the file, thats why i am using an OrderedDict to do so. it works fine, so if there is any way to expand this with the duplicate entries i'd be grateful.
My second issue is that this very same json file contains entries like that:
"Test":{
{
"Type":"Something"
}
}
Json.load() function raises an exception when it reaches that line in the json file. The only way i worked around this was to manually remove the inner brackets myself.
Thanks in advance

You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder
def parse_object_pairs(pairs):
return pairs
data = """
{"foo": {"baz": 42}, "foo": 7}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder
def make_unique(key, dct):
counter = 0
unique_key = key
while unique_key in dct:
counter += 1
unique_key = '{}_{}'.format(key, counter)
return unique_key
def parse_object_pairs(pairs):
dct = OrderedDict()
for key, value in pairs:
if key in dct:
key = make_unique(key, dct)
dct[key] = value
return dct
data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict, I included that as well.

Thanks a lot #Lukas Graf, i got it working as well by implementing my own version of the hook function
def dict_raise_on_duplicates(ordered_pairs):
count=0
d=collections.OrderedDict()
for k,v in ordered_pairs:
if k in d:
d[k+'_dupl_'+str(count)]=v
count+=1
else:
d[k]=v
return d
Only thing remaining is to automatically get rid of the double brackets and i am done :D Thanks again

If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:
def dict_raise_on_duplicates(ordered_pairs):
"""Convert duplicate keys to JSON array."""
d = {}
for k, v in ordered_pairs:
if k in d:
if type(d[k]) is list:
d[k].append(v)
else:
d[k] = [d[k],v]
else:
d[k] = v
return d
And then you just use:
dict = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)

Related

How do I filter this python defaultdict on a value within a dict?

I have a defaultdict, python 3.4, object similar to:
mydefaultdict = defaultdict(dict)
mydefaultdict['name1']['filename'] = 'name1'
mydefaultdict['name1']['records_available'] = True
mydefaultdict['name2']['filename'] = 'name2'
mydefaultdict['name2']['records_available'] = False
I want to create a filtered subset of just those with records_available = True. So far I have a dict comprehension that does:
defaultdict(dict, {k: v['records_available'] for k, v
in mydefaultdict.items() })
which is really close. But I'm not sure how to implement a filter. If this were a basic dict something like the following would work.
defaultdict(dict, {k: v['records_available'] for k, v
in mydefaultdict.items() if v == True })
I've managed to accomplish this using a nested for, if loop. But I got so close with the dict comprehension I wonder if someone might have the solution. Parsing python collections is a daily activity for me.
My current solution:
for i in audio_files.values():
if i['records_available']:
filterd_files[i['filename']] = i
Edit:
user3100115 is correct. Mock problem data will no return one record with user31000115's response.
It looks odd that you want to get v['records_available'] as the value (for all that this value is true). But what you probably want to look out for is the case where the values does not have the element 'records_available' or is not a dictionary at all. The former can be solved by using using the .get method, but if you want to protect you from the latter you have to define your own getitem (similar to getattr) that will return default when the object either doesn't support subscription at all or doesn't contain the requested item:
def getitem(obj, key, default=None):
try:
return obj[key]
except (AttributeError, KeyError):
return default
Then you just use this as a filter (I use func(v) to transform the value to what value you want since I think going for the "records_available" item is pointless - it would only be True):
{k, func(v) for k, v in mydefaultdict.items() if getitem(v, "records_available") == True)}
Note that getitem will return None whenever v doesn't contain "records_available" or doesn't support subscription at all - and then consequently that key-value pair will not make it into the dictionary.

Python json parser allow duplicate keys

I need to parse a json file which unfortunately for me, does not follow the prototype. I have two issues with the data, but i've already found a workaround for it so i'll just mention it at the end, maybe someone can help there as well.
So i need to parse entries like this:
"Test":{
"entry":{
"Type":"Something"
},
"entry":{
"Type":"Something_Else"
}
}, ...
The json default parser updates the dictionary and therfore uses only the last entry. I HAVE to somehow store the other one as well, and i have no idea how to do this. I also HAVE to store the keys in the several dictionaries in the same order they appear in the file, thats why i am using an OrderedDict to do so. it works fine, so if there is any way to expand this with the duplicate entries i'd be grateful.
My second issue is that this very same json file contains entries like that:
"Test":{
{
"Type":"Something"
}
}
Json.load() function raises an exception when it reaches that line in the json file. The only way i worked around this was to manually remove the inner brackets myself.
Thanks in advance
You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder
def parse_object_pairs(pairs):
return pairs
data = """
{"foo": {"baz": 42}, "foo": 7}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder
def make_unique(key, dct):
counter = 0
unique_key = key
while unique_key in dct:
counter += 1
unique_key = '{}_{}'.format(key, counter)
return unique_key
def parse_object_pairs(pairs):
dct = OrderedDict()
for key, value in pairs:
if key in dct:
key = make_unique(key, dct)
dct[key] = value
return dct
data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict, I included that as well.
Thanks a lot #Lukas Graf, i got it working as well by implementing my own version of the hook function
def dict_raise_on_duplicates(ordered_pairs):
count=0
d=collections.OrderedDict()
for k,v in ordered_pairs:
if k in d:
d[k+'_dupl_'+str(count)]=v
count+=1
else:
d[k]=v
return d
Only thing remaining is to automatically get rid of the double brackets and i am done :D Thanks again
If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:
def dict_raise_on_duplicates(ordered_pairs):
"""Convert duplicate keys to JSON array."""
d = {}
for k, v in ordered_pairs:
if k in d:
if type(d[k]) is list:
d[k].append(v)
else:
d[k] = [d[k],v]
else:
d[k] = v
return d
And then you just use:
dict = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)

Challenge with codes stemming from an issue regarding the "Hashable" property

We are attempting to refactor and modify a Python program such that it is able to take a user-defined JSON file, parse that file, and then execute a workflow based on the options that they user wants and had defined in the JSON. So basically, the user will have to specify a dictionary in JSON, and when this JSON file is parsed by the Python program, we obtain a python dictionary which we then pass in as an argument into a class that we instantiate in a top level module. To sum this up, the JSON dictionary defined by the user will eventually be added into the instance namespace when the python program is running.
Implementing the context managers to parse the JSON inputs was not a problem for us. However, we have a requirement that we be able to use the JSON dictionary (which gets subsequently added into the instance namespace) and generate multiple lines from a Jinja2 template file using looping within a template. We attempted to use this line for one of the key-value pairs in the JSON:
"extra_scripts" : [["Altera/AlteraCommon.lua",
"Altera/StratixIV/EP4SGX70HF35C2.lua"]]
and this is sitting in a large dictionary object, let's call it option_space_dict and for simplicity in this example, it has only 4 key-value pairs (assume that "extra_scripts" is 'key4' here), although for our program, it is much larger:
option_space_dict = {
'key1' : ['value1'],
'key2' : ['value2'],
'key3' : ['value3A', 'value3B', 'value3C'],
'key4' : [['value4A', 'value4B']]
}
which is the parsed by this line:
import itertools
option_space = [ dict(itertools.izip(option_space_dict, opt)) for opt in itertools.product(*option_space_dict.itervalues()) ]
to get the option_space which essentially differs from option_space_dict in that it is something like:
[
{ 'key1' : 'value1',
'key2' : 'value2',
'key3' : 'value3A'
'key4' : ['value4A', 'value4B'] },
{ 'key1' : 'value1',
'key2' : 'value2',
'key3' : 'value3B'
'key4' : ['value4A', 'value4B'] },
{ 'key1' : 'value1',
'key2' : 'value2',
'key3' : 'value3C'
'key4' : ['value4A', 'value4B'] }
]
So the option_space we generate serves us well for what we want to do with the jinja2 templating. However, in order to get this, the key4 key that we added to option_space_dict caused an issue somewhere else in the program which did:
# ignore self.option as it is not relevant to the issue here
def getOptionCompack(self) :
return [ (k, v) for k, v in self.option.iteritems() if set([v]) != set(self.option_space_dict[k])]
I get the error TypeError: unhashable type: 'list' stemming from the fact that the value of key4 contains a nested list structure, which is 'unhashable'.
So we kind of hit a barrier. Does anyone have a suggestion on how we could overcome this; being able to specify our JSON files in that way to do what we'd want with Jinja2 while still being able to parse the data structures out in the same format?
Thanks a million!
You can normalize your key data structures to use hashable types after they have parsed from JSON.
Since key4 is a list, you have two options:
Convert it to a tuple where order is significant. E.g.,
key = tuple(key)
Convert it to a frozenset where order is insignificant. E.g.,
key = frozenset(key)
If a key can contain a dictionary, then you'll have two additional options:
Convert it to either a sorted tuple or frozenset of its item tuples. E.g.,
key = tuple(sorted(key.iteritems())) # Use key.items() for Python 3.
# OR
key = frozenset(key.iteritems()) # Use key.items() for Python 3.
Convert it to a third-party frozendict (Python 3 compatible version here). E.g.,
import frozendict
key = frozendict.frozendict(key)
Depending on how simple or complex your keys are, you may have to apply the transformation recursively.
Since your keys come directly from JSON, you can check for the native types directly:
if isinstance(key, list):
# Freeze list.
elif isinstance(key, dict):
# Freeze dict.
If you want to support the generic types, you can do something similar to:
import collections
if isinstance(key, collections.Sequence) and not isinstance(key, basestring): # Use str for Python 2.
# NOTE: Make sure to exclude basestring because it meets the requirements for a Sequence (of characters).
# Freeze list.
elif isinstance(key, collections.Mapping):
# Freeze dict.
Here is a full example:
def getOptionCompack(self):
results = []
for k, v in self.option.iteritems():
k = self.freeze_key(k)
if set([v]) != set(self.option_space_dict[k]):
results.append((k, v))
return results
def freeze_key(self, key):
if isinstance(key, list):
return frozenset(self.freeze_key(subv) for subv in key)
# If dictionaries need to be supported, uncomment this.
#elif isinstance(key, dict):
# return frozendict((subk, self.freeze_key(subv)) for subk, subv in key.iteritems())
return key
Where self.option_space_dict already had its keys converted using self.freeze_key().
We have managed to figure out the solution to this problem. The main gist of our solution lies in that we implemented a Helper Function that assists us to actually convert a list into a tuple. Basically, going back to my question, remember we had this list: [["Altera/AlteraCommon.lua", "Altera/StratixIV/EP4SGX70HF35C2.lua"]]?
With our original getOptionCompack(self) method, and the way we were invoking it, what happened was that we directly tried to convert the list to a set with the statement
return [ (k, v) for k, v in self.option.iteritems() if set([v]) != set(self.option_space_dict[k])]
where set(self.option_space_dict[k]) and iterating over k would mean we will hit the dictionary key-value pair that would give us one instance of doing set([["Altera/AlteraCommon.lua", "Altera/StratixIV/EP4SGX70HF35C2.lua"]])
which was the cause of the error. This is because a list object is not hashable and set() would actually hash over each element within the outer list that is fed to it, and the element in this case is an inner list. Try doing set([[2]]) and you will see what I mean.
So we figured that the workaround would be to define a Helper function that would accept a list object, or any iterable object for that matter, and test whether each element in it is a list or not. If the element was not a list, it would not do any change to its object type, if it was (and that which would be a nested list), then the Helper function would convert that nested list to a tuple object instead, and in doing that iteratively, it actually constructs a set object that it returns to itself. The definition of the function is:
# Helper function to build a set
def Set(iterable) :
return { tuple(v) if isinstance(v, list) else v for v in iterable }
and so a call that invoked Set() would be in our example:
Set([["Altera/AlteraCommon.lua", "Altera/StratixIV/EP4SGX70HF35C2.lua"]])
and the object that it returns to itself would be:
{("Altera/AlteraCommon.lua", "Altera/StratixIV/EP4SGX70HF35C2.lua")}
The inner nested list gets converted to a tuple, which is an object type that fits within a set object, as denoted by the {} that encloses the tuple. That's why it can work now, that the set can be formed.
We proceeded to redefine out original method to use our own Set() function:
def getOptionCompack(self) :
return [ (k, v) for k, v in self.option.iteritems() if Set([v]) != Set(self.option_space_dict[k]) ]
and now we no longer have the TypeError, and solved the problem. Seems like a lot of trouble just to do this, but the reason why we went through all this was so as to have an objective means of comparing two objects by sort of "normalizing" them to be the same object type, a set, in order to perform some other action later on as part of our source code.

Json in Python: Receive/Check duplicate key error

The json module of python acts a little of the specification when having duplicate keys in a map:
import json
>>> json.loads('{"a": "First", "a": "Second"}')
{u'a': u'Second'}
I know that this behaviour is specified in the documentation:
The RFC specifies that the names within a JSON object should be
unique, but does not specify how repeated names in JSON objects should
be handled. By default, this module does not raise an exception;
instead, it ignores all but the last name-value pair for a given name:
For my current project, I absolutely need to make sure that no duplicate keys are present in the file and receive an error/exception if this is the case? How can this be accomplished?
I'm still stuck on Python 2.7, so a solution which also works with older versions would help me most.
Well, you could try using the JSONDecoder class and specifying a custom object_pairs_hook, which will receive the duplicates before they would get deduped.
import json
def dupe_checking_hook(pairs):
result = dict()
for key,val in pairs:
if key in result:
raise KeyError("Duplicate key specified: %s" % key)
result[key] = val
return result
decoder = json.JSONDecoder(object_pairs_hook=dupe_checking_hook)
# Raises a KeyError
some_json = decoder.decode('''{"a":"hi","a":"bye"}''')
# works
some_json = decoder.decode('''{"a":"hi","b":"bye"}''')
The below code will take any json that has repeated keys and put them into an array. for example takes this json string '''{"a":"hi","a":"bye"}''' and gives {"a":['hi','bye']} as output
import json
def dupe_checking_hook(pairs):
result = dict()
for key,val in pairs:
if key in result:
if(type(result[key]) == dict):
temp = []
temp.append(result[key])
temp.append(val)
result[key] = temp
else:
result[key].append(val)
else:
result[key] = val
return result
decoder = json.JSONDecoder(object_pairs_hook=dupe_checking_hook)
#it will not raise error
some_json = decoder.decode('''{"a":"hi","a":"bye"}''')

Map list of tuples into a dictionary

I've got a list of tuples extracted from a table in a DB which looks like (key , foreignkey , value). There is a many to one relationship between the key and foreignkeys and I'd like to convert it into a dict indexed by the foreignkey containing the sum of all values with that foreignkey, i.e. { foreignkey , sumof( value ) }. I wrote something that's rather verbose:
myDict = {}
for item in myTupleList:
if item[1] in myDict:
myDict [ item[1] ] += item[2]
else:
myDict [ item[1] ] = item[2]
but after seeing this question's answer or these two there's got to be a more concise way of expressing what I'd like to do. And if this is a repeat, I missed it and will remove the question if you can provide the link.
Assuming all your values are ints, you could use a defaultdict to make this easier:
from collections import defaultdict
myDict = defaultdict(int)
for item in myTupleList:
myDict[item[1]] += item[2]
defaultdict is like a dictionary, except if you try to get a key that isn't there it fills in the value returned by the callable - in this case, int, which returns 0 when called with no arguments.
UPDATE: Thanks to #gnibbler for reminding me, but tuples can be unpacked in a for loop:
from collections import defaultdict
myDict = defaultdict(int)
for _, key, val in myTupleList:
myDict[key] += val
Here, the 3-item tuple gets unpacked into the variables _, key, and val. _ is a common placeholder name in Python, used to indicate that the value isn't really important. Using this, we can avoid the hairy item[1] and item[2] indexing. We can't rely on this if the tuples in myTupleList aren't all the same size, but I bet they are.
(We also avoid the situation of someone looking at the code and thinking it's broken because the writer thought arrays were 1-indexed, which is what I thought when I first read the code. I wasn't alleviated of this until I read the question. In the above loop, however, it's obvious that myTupleList is a tuple of three elements, and we just don't need the first one.)
from collections import defaultdict
myDict = defaultdict(int)
for _, key, value in myTupleList:
myDict[key] += value
Here's my (tongue in cheek) answer:
myDict = reduce(lambda d, t: (d.__setitem__(t[1], d.get(t[1], 0) + t[2]), d)[1], myTupleList, {})
It is ugly and bad, but here is how it works.
The first argument to reduce (because it isn't clear there) is lambda d, t: (d.__setitem__(t[1], d.get(t[1], 0) + t[2]), d)[1]. I will talk about this later, but for now, I'll just call it joe (no offense to any people named Joe intended). The reduce function basically works like this:
joe(joe(joe({}, myTupleList[0]), myTupleList[1]), myTupleList[2])
And that's for a three element list. As you can see, it basically uses its first argument to sort of accumulate each result into the final answer. In this case, the final answer is the dictionary you wanted.
Now for joe itself. Here is joe as a def:
def joe(myDict, tupleItem):
myDict[tupleItem[1]] = myDict.get(tupleItem[1], 0) + tupleItem[2]
return myDict
Unfortunately, no form of = or return is allowed in a Python lambda so that has to be gotten around. I get around the lack of = by calling the dicts __setitem__ function directly. I get around the lack of return in by creating a tuple with the return value of __setitem__ and the dictionary and then return the tuple element containing the dictionary. I will slowly alter joe so you can see how I accomplished this.
First, remove the =:
def joe(myDict, tupleItem):
# Using __setitem__ to avoid using '='
myDict.__setitem__(tupleItem[1], myDict.get(tupleItem[1], 0) + tupleItem[2])
return myDict
Next, make the entire expression evaluate to the value we want to return:
def joe(myDict, tupleItem):
return (myDict.__setitem__(tupleItem[1], myDict.get(tupleItem[1], 0) + tupleItem[2]),
myDict)[1]
I have run across this use-case for reduce and dict many times in my Python programming. In my opinion, dict could use a member function reduceto(keyfunc, reduce_func, iterable, default_val=None). keyfunc would take the current value from the iterable and return the key. reduce_func would take the existing value in the dictionary and the value from the iterable and return the new value for the dictionary. default_val would be what was passed into reduce_func if the dictionary was missing a key. The return value should be the dictionary itself so you could do things like:
myDict = dict().reduceto(lambda t: t[1], lambda o, t: o + t, myTupleList, 0)
Maybe not exactly readable but it should work:
fks = dict([ (v[1], True) for v in myTupleList ]).keys()
myDict = dict([ (fk, sum([ v[2] for v in myTupleList if v[1] == fk ])) for fk in fks ])
The first line finds all unique foreign keys. The second line builds your dictionary by first constructing a list of (fk, sum(all values for this fk))-pairs and turning that into a dictionary.
Look at SQLAlchemy and see if that does all the mapping you need and perhaps more

Categories