Parsing a stringified OrderedDictionary in Python - python

I have a need to change one of my dictionaries from a normal dict to a collections.OrderedDict. The problem is that I store my dictionary as a string in an object that I need to later extract. Call my dict object foo. My previous method of doing this was:
storedDict = str(foo)
Then in another file:
import ast
parsedDict = ast.literal_eval(storedDict)
This works all fine and dandy, but now that I have changed foo to be an OrderedDict, it can no longer be parsed by the ast.literal_eval function. Is there a function or good method I can use to parse a stringified OrderedDict? I am trying to avoid serialization here to stay consistent with the rest of the program.

Seems like the easiest way is to write your own serializer for it; especially because it's an ordered dict, so you want to keep the order, I assume.
It could be as simple as this (assuming your keys and values are simple:
buf = ""
for k, v in myOrderedDict.iteritems():
buf += k + "\t" + v + "\n"
print(buf)
If the data can be more complicated, you can escape or quote as needed.
Then when you want it back:
d = ordereddict()
while(True):
r = readline()
k, v = r.split('\t')
d[k] = v

Related

How to prevent ast.literal_eval from overwriting same key values? [duplicate]

I need to parse a json file which unfortunately for me, does not follow the prototype. I have two issues with the data, but i've already found a workaround for it so i'll just mention it at the end, maybe someone can help there as well.
So i need to parse entries like this:
"Test":{
"entry":{
"Type":"Something"
},
"entry":{
"Type":"Something_Else"
}
}, ...
The json default parser updates the dictionary and therfore uses only the last entry. I HAVE to somehow store the other one as well, and i have no idea how to do this. I also HAVE to store the keys in the several dictionaries in the same order they appear in the file, thats why i am using an OrderedDict to do so. it works fine, so if there is any way to expand this with the duplicate entries i'd be grateful.
My second issue is that this very same json file contains entries like that:
"Test":{
{
"Type":"Something"
}
}
Json.load() function raises an exception when it reaches that line in the json file. The only way i worked around this was to manually remove the inner brackets myself.
Thanks in advance
You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder
def parse_object_pairs(pairs):
return pairs
data = """
{"foo": {"baz": 42}, "foo": 7}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder
def make_unique(key, dct):
counter = 0
unique_key = key
while unique_key in dct:
counter += 1
unique_key = '{}_{}'.format(key, counter)
return unique_key
def parse_object_pairs(pairs):
dct = OrderedDict()
for key, value in pairs:
if key in dct:
key = make_unique(key, dct)
dct[key] = value
return dct
data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict, I included that as well.
Thanks a lot #Lukas Graf, i got it working as well by implementing my own version of the hook function
def dict_raise_on_duplicates(ordered_pairs):
count=0
d=collections.OrderedDict()
for k,v in ordered_pairs:
if k in d:
d[k+'_dupl_'+str(count)]=v
count+=1
else:
d[k]=v
return d
Only thing remaining is to automatically get rid of the double brackets and i am done :D Thanks again
If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:
def dict_raise_on_duplicates(ordered_pairs):
"""Convert duplicate keys to JSON array."""
d = {}
for k, v in ordered_pairs:
if k in d:
if type(d[k]) is list:
d[k].append(v)
else:
d[k] = [d[k],v]
else:
d[k] = v
return d
And then you just use:
dict = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)

Python 2.7 parsing data

I have data that look like this:
data = 'somekey:value4thekey&second-key:valu3-can.be?anything&third_k3y:it%can have spaces;too'
In a nice human-readable way it would look like this:
somekey : value4thekey
second-key : valu3-can.be?anything
third_k3y : it%can have spaces;too
How should I parse the data so when I do data['somekey'] I would get >>> value4thekey?
Note: The & is connecting all of the different items
How am I currently tackling with it
Currently, I use this ugly solution:
all = data.split('&')
for i in all:
if i.startswith('somekey'):
print i
This solution is very bad due to multiple obvious limitations. It would be much better if I can somehow parse it into a python tree object.
I'd split the string by & to get a list of key-value strings, and then split each such string by : to get key-value pairs. Using dict and list comprehensions actually makes this quite elegant:
result = {k:v for k, v in (part.split(':') for part in data.split('&'))}
You can parse your data directly to a dictionary - split on the item separator & then split again on the key,value separator ::
table = {
key: value for key, value in
(item.split(':') for item in data.split('&'))
}
This allows you direct access to elements, e.g. as table['somekey'].
If you don't have objects within a value, you can parse it to a dictionary
structure = {}
for ele in data.split('&'):
ele_split = ele.split(':')
structure[ele_split[0]] = ele_split[1]
You can now use structure to get the values:
print structure["somekey"]
#returns "value4thekey"
Since the keys have a common format of being in the form of "key":"value".
You can use it as a parameter to split on.
for i in x.split("&"):
print(i.split(":"))
This would generate an array of even items where every even index is the key and odd index being the value. Iterate through the array and load it into a dictionary. You should be good!
I'd format data to YAML and parse the YAML
import re
import yaml
data = 'somekey:value4thekey&second-key:valu3-can.be?anything&third_k3y:it%can have spaces;too'
yaml_data = re.sub('[:]', ': ', re.sub('[&]', '\n', data ))
y = yaml.load(yaml_data)
for k in y:
print "%s : %s" % (k,y[k])
Here's the output:
third_k3y : it%can have spaces;too
somekey : value4thekey
second-key : valu3-can.be?anything

how to create a dictionary from a set of properly formatted tuples in python

Is there a simple way to create a dictionary from a list of formatted tuples. e.g. if I do something like:
d={"responseStatus":"SUCCESS","sessionId":"01234","userId":2000004904}
This creates a dictionary called d. However, if I want to create a dictionary from a string which contains the same string, I can't do that
res=<some command that returns {"responseStatus":"SUCCESS","sessionId":"01234","userId":2000004904}>
print res
# returns {"responseStatus":"SUCCESS","sessionId":"01234","userId":2000004904}
d=dict(res)
This throws an error that says:
ValueError: dictionary update sequence element #0 has length 1; 2 is required
I strongly strongly suspect that you have json on your hands.
import json
d = json.loads('{"responseStatus":"SUCCESS","sessionId":"01234","userId":2000004904}')
would give you what you want.
Use dict(zip(tuples))
>>> u = ("foo", "bar")
>>> v = ("blah", "zoop")
>>> d = dict(zip(u, v))
>>> d
{'foo': 'blah', 'bar': 'zoop'}
Note, if you have an odd number of tuples this will not work.
Based on what you gave is, res is
# returns {"responseStatus":"SUCCESS","sessionId":"01234","userId":2000004904}
So the plan is to grab the string starting at the curly brace to the end and use json to decode it:
import json
# Discard the text before the curly brace
res = res[res.index('{'):]
# Turn that text into a dictionary
d = json.loads(res)
All you need to do in your particular case is
d = eval(res)
And please keep security in mind when using eval, especially if you're mixing it with ajax/json.
UPDATE
Since others pointed out you might be getting this data over the web and it isn't just a "how to make this work" question, use this:
import json
json.loads(res)

Python json parser allow duplicate keys

I need to parse a json file which unfortunately for me, does not follow the prototype. I have two issues with the data, but i've already found a workaround for it so i'll just mention it at the end, maybe someone can help there as well.
So i need to parse entries like this:
"Test":{
"entry":{
"Type":"Something"
},
"entry":{
"Type":"Something_Else"
}
}, ...
The json default parser updates the dictionary and therfore uses only the last entry. I HAVE to somehow store the other one as well, and i have no idea how to do this. I also HAVE to store the keys in the several dictionaries in the same order they appear in the file, thats why i am using an OrderedDict to do so. it works fine, so if there is any way to expand this with the duplicate entries i'd be grateful.
My second issue is that this very same json file contains entries like that:
"Test":{
{
"Type":"Something"
}
}
Json.load() function raises an exception when it reaches that line in the json file. The only way i worked around this was to manually remove the inner brackets myself.
Thanks in advance
You can use JSONDecoder.object_pairs_hook to customize how JSONDecoder decodes objects. This hook function will be passed a list of (key, value) pairs that you usually do some processing on, and then turn into a dict.
However, since Python dictionaries don't allow for duplicate keys (and you simply can't change that), you can return the pairs unchanged in the hook and get a nested list of (key, value) pairs when you decode your JSON:
from json import JSONDecoder
def parse_object_pairs(pairs):
return pairs
data = """
{"foo": {"baz": 42}, "foo": 7}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
[(u'foo', [(u'baz', 42)]), (u'foo', 7)]
How you use this data structure is up to you. As stated above, Python dictionaries won't allow for duplicate keys, and there's no way around that. How would you even do a lookup based on a key? dct[key] would be ambiguous.
So you can either implement your own logic to handle a lookup the way you expect it to work, or implement some sort of collision avoidance to make keys unique if they're not, and then create a dictionary from your nested list.
Edit: Since you said you would like to modify the duplicate key to make it unique, here's how you'd do that:
from collections import OrderedDict
from json import JSONDecoder
def make_unique(key, dct):
counter = 0
unique_key = key
while unique_key in dct:
counter += 1
unique_key = '{}_{}'.format(key, counter)
return unique_key
def parse_object_pairs(pairs):
dct = OrderedDict()
for key, value in pairs:
if key in dct:
key = make_unique(key, dct)
dct[key] = value
return dct
data = """
{"foo": {"baz": 42, "baz": 77}, "foo": 7, "foo": 23}
"""
decoder = JSONDecoder(object_pairs_hook=parse_object_pairs)
obj = decoder.decode(data)
print obj
Output:
OrderedDict([(u'foo', OrderedDict([(u'baz', 42), ('baz_1', 77)])), ('foo_1', 7), ('foo_2', 23)])
The make_unique function is responsible for returning a collision-free key. In this example it just suffixes the key with _n where n is an incremental counter - just adapt it to your needs.
Because the object_pairs_hook receives the pairs exactly in the order they appear in the JSON document, it's also possible to preserve that order by using an OrderedDict, I included that as well.
Thanks a lot #Lukas Graf, i got it working as well by implementing my own version of the hook function
def dict_raise_on_duplicates(ordered_pairs):
count=0
d=collections.OrderedDict()
for k,v in ordered_pairs:
if k in d:
d[k+'_dupl_'+str(count)]=v
count+=1
else:
d[k]=v
return d
Only thing remaining is to automatically get rid of the double brackets and i am done :D Thanks again
If you would prefer to convert those duplicated keys into an array, instead of having separate copies, this could do the work:
def dict_raise_on_duplicates(ordered_pairs):
"""Convert duplicate keys to JSON array."""
d = {}
for k, v in ordered_pairs:
if k in d:
if type(d[k]) is list:
d[k].append(v)
else:
d[k] = [d[k],v]
else:
d[k] = v
return d
And then you just use:
dict = json.loads(yourString, object_pairs_hook=dict_raise_on_duplicates)

Building Nested dictionary in Python reading in line by line from file

The way I go about nested dictionary is this:
dicty = dict()
tmp = dict()
tmp["a"] = 1
tmp["b"] = 2
dicty["A"] = tmp
dicty == {"A" : {"a" : 1, "b" : 1}}
The problem starts when I try to implement this on a big file, reading in line by line.
This is printing the content per line in a list:
['proA', 'macbook', '0.666667']
['proA', 'smart', '0.666667']
['proA', 'ssd', '0.666667']
['FrontPage', 'frontpage', '0.710145']
['FrontPage', 'troubleshooting', '0.971014']
I would like to end up with a nested dictionary (ignore decimals):
{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}
As I am reading in line by line, I have to check whether or not the first word is still found in the file (they are all grouped), before I add it as a complete dict to the higher dict.
This is my implementation:
def doubleDict(filename):
dicty = dict()
with open(filename, "r") as f:
row = 0
tmp = dict()
oldword = ""
for line in f:
values = line.rstrip().split(" ")
print(values)
if oldword == values[0]:
tmp[values[1]] = values[2]
else:
if oldword is not "":
dicty[oldword] = tmp
tmp.clear()
oldword = values[0]
tmp[values[1]] = values[2]
row += 1
if row % 25 == 0:
print(dicty)
break #print(row)
return(dicty)
I would actually like to have this in pandas, but for now I would be happy if this would work as a dict. For some reason after reading in just the first 5 lines, I end up with:
{'proA': {'frontpage': '0.710145', 'troubleshooting': '0.971014'}},
which is clearly incorrect. What is wrong?
Use a collections.defaultdict() object to auto-instantiate nested dictionaries:
from collections import defaultdict
def doubleDict(filename):
dicty = defaultdict(dict)
with open(filename, "r") as f:
for i, line in enumerate(f):
outer, inner, value = line.split()
dicty[outer][inner] = value
if i % 25 == 0:
print(dicty)
break #print(row)
return(dicty)
I used enumerate() to generate the line count here; much simpler than keeping a separate counter going.
Even without a defaultdict, you can let the outer dictionary keep the reference to the nested dictionary, and retrieve it again by using values[0]; there is no need to keep the temp reference around:
>>> dicty = {}
>>> dicty['A'] = {}
>>> dicty['A']['a'] = 1
>>> dicty['A']['b'] = 2
>>> dicty
{'A': {'a': 1, 'b': 1}}
All the defaultdict then does is keep us from having to test if we already created that nested dictionary. Instead of:
if outer not in dicty:
dicty[outer] = {}
dicty[outer][inner] = value
we simply omit the if test as defaultdict will create a new dictionary for us if the key was not yet present.
While this isn't the ideal way to do things, you're pretty close to making it work.
Your main problem is that you're reusing the same tmp dictionary. After you insert it into dicty under the first key, you then clear it and start filling it with the new values. Replace tmp.clear() with tmp = {} to fix that, so you have a different dictionary for each key, instead of the same one for all keys.
Your second problem is that you're never storing the last tmp value in the dictionary when you reach the end, so add another dicty[oldword] = tmp after the for loop.
Your third problem is that you're checking if oldword is not "":. That may be true even if it's an empty string, because you're comparing identity, not equality. Just change that to if oldword:. (This one, you'll usually get away with, because small strings are usually interned and will usually share identity… but you shouldn't count on that.)
If you fix both of those, you get this:
{'FrontPage': {'frontpage': '0.710145', 'troubleshooting': '0.971014'},
'proA': {'macbook': '0.666667', 'smart': '0.666667', 'ssd': '0.666667'}}
I'm not sure how to turn this into the format you claim to want, because that format isn't even a valid dictionary. But hopefully this gets you close.
There are two simpler ways to do it:
Group the values with, e.g., itertools.groupby, then transform each group into a dict and insert it all in one step. This, like your existing code, requires that the input already be batched by values[0].
Use the dictionary as a dictionary. You can look up each key as it comes in and add to the value if found, create a new one if not. A defaultdict or the setdefault method will make this concise, but even if you don't know about those, it's pretty simple to write it out explicitly, and it'll still be less verbose than what you have now.
The second version is already explained very nicely in Martijn Pieters's answer.
The first can be written like this:
def doubleDict(s):
with open(filename, "r") as f:
rows = (line.rstrip().split(" ") for line in f)
return {k: {values[1]: values[2] for values in g}
for k, g in itertools.groupby(rows, key=operator.itemgetter(0))}
Of course that doesn't print out the dict so far after every 25 rows, but that's easy to add by turning the comprehension into an explicit loop (and ideally using enumerate instead of keeping an explicit row counter).

Categories