How to decode Opaque data obtained by pysnmp?

I'm reading data from an SNMP device by OID using the pysnmp library. However, I'm having trouble with a value of type Opaque:
from pysnmp import hlapi
def construct_object_types(list_of_oids):
    object_types = []
    for oid in list_of_oids:
        object_types.append(hlapi.ObjectType(hlapi.ObjectIdentity(oid)))
    return object_types
def get(target, oids, credentials, port=161, engine=hlapi.SnmpEngine(),
        context=hlapi.ContextData()):
    handler = hlapi.getCmd(
        engine,
        credentials,
        hlapi.UdpTransportTarget((target, port)),
        context,
        *construct_object_types(oids)
    )
    return fetch(handler, 1)[0]
def cast(value):
    try:
        return int(value)
    except (ValueError, TypeError):
        try:
            return float(value)
        except (ValueError, TypeError):
            try:
                return str(value)
            except (ValueError, TypeError) as exc:
                print(exc)
    return value
def fetch(handler, count):
    result = []
    for i in range(count):
        (error_indication, error_status,
         error_index, var_binds) = next(handler)
        if not error_indication and not error_status:
            items = {}
            print(var_binds)
            for var_bind in var_binds:
                items[str(var_bind[0])] = cast(var_bind[1])
            result.append(items)
        else:
            raise RuntimeError(f'SNMP error: {error_indication}')
    return result
print(get("192.168.100.112", [".1.3.6.1.4.1.9839.1.2.532.0",
                              '.1.3.6.1.4.1.9839.1.2.513.0'],
          hlapi.CommunityData('public')))
Out:
[ObjectType(ObjectIdentity(<ObjectName value object, tagSet <TagSet object, tags 0:0:6>, payload [1.3.6.1.4.1.9839.1.2.532.0]>), <Opaque value object, tagSet <TagSet object, tags 64:0:4>, encoding iso-8859-1, payload [0x9f780441ccb646]>), ObjectType(ObjectIdentity(<ObjectName value object, tagSet <TagSet object, tags 0:0:6>, payload [1.3.6.1.4.1.9839.1.2.513.0]>), <Integer value object, tagSet <TagSet object, tags 0:0:2>, subtypeSpec <ConstraintsIntersection object, consts <ValueRangeConstraint object, consts -2147483648, 2147483647>>, payload [10]>)]
{'1.3.6.1.4.1.9839.1.2.532.0': '\x9fx\x04AÌ¶F', '1.3.6.1.4.1.9839.1.2.513.0': 10}
The first OID (.1.3.6.1.4.1.9839.1.2.532.0) returns an Opaque value (\x9fx\x04AÌ¶F), and I don't know how to convert it to a float. I should add that it represents a temperature of 25.5°C.
In other words, how are the following values related to one another?
25.5
encoding iso-8859-1, payload [0x9f780441ccb646]
'\x9fx\x04AÌ¶F'

Your value 0x9f780441ccb646 can be read in three ways:
it can be split into two floats, one of which is 25.589001 while the other part is something else;
in the middle of your output it appears as the representation of the pysnmp object (__repr__), without a known interpretation (probably because the MIB is missing);
it can be converted to a byte representation (with iso-8859-1 encoding), which is your string '\x9fx\x04AÌ¶F'.
So the data is there, it just needs to be extracted from the SNMP packet. The proper way would be to give the corresponding MIB entry to pysnmp.
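The three representations in the question are the same seven bytes viewed in different ways. A minimal sketch of the conversions, using only built-in Python 3 features (the variable names are illustrative and not part of pysnmp):

```python
# The payload that pysnmp printed, written as an integer literal.
payload_int = 0x9F780441CCB646

# Seven raw bytes, most significant byte first (the order shown in the repr).
raw = payload_int.to_bytes(7, "big")

# iso-8859-1 (latin1) maps every byte 0-255 to the code point with the same
# number, so decoding gives the string that str(value) produced.
as_text = raw.decode("latin1")

# The mapping is lossless, so encoding returns the original bytes.
round_trip = as_text.encode("latin1")

print(raw.hex())
print(round_trip == raw)
```

Because latin1 is a one-to-one mapping, no information is lost by going through the string form; only the float interpretation of the last four bytes remains to be done.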
Alternatively (answering your second question), the bytes can be decoded manually with Python's struct module.
import struct
data = 0x9f780441ccb646 # this is what you got from pysnmp
thebytes = struct.pack("q", data) # "q" is always 8 bytes; "l" is only 4 bytes on some platforms (e.g. Windows)
print(thebytes.decode('latin1'))
print(thebytes)
print(struct.unpack("ff", thebytes))
gives
F¶ÌAx
b'F\xb6\xccA\x04x\x9f\x00'
(25.589000701904297, 1.4644897383138518e-38)
Instead of unpacking to two floats, let the MIB tell you how the remaining data should be interpreted. Instead of unpack("ff",… you might want something else; check the available format specifiers. For example, "fhh" would give (25.589000701904297, 30724, 159).
EDIT:
TL;DR:
data = '\0\x9fx\x04AÌ¶F'
print("temperature: %f°C" % struct.unpack('>ff', data.encode('latin1'))[1])
temperature: 25.589001°C
To elaborate on the string representation: the bytes you see, 'AÌ¶F', are in the reverse order of the ones in my print statement, 'F¶ÌA', because of the different endianness. The byte order is already corrected in the int-converted data 0x9f780441ccb646 that you give in your output and that I used in the conversion example. If you want to start from the encoded string, you first need to convert it back to the correct memory representation:
data = '\0\x9fx\x04AÌ¶F' # (initial '\0' is for filling the 8-bytes in correct alignment)
thebytes = data.encode('latin1')
But that's only half of the trick, because the endianness is still wrong. Fortunately, struct has flags to correct for that. You can unpack in both byte orders and choose the right one:
print("unpacked little-endian: ", struct.unpack("<ff", thebytes))
print("unpacked big-endian: ", struct.unpack(">ff", thebytes))
unknown, temperature = struct.unpack(">ff", thebytes)
print("temperature: %f°C" % temperature)
giving
unpacked little-endian: (2.9225269119838333e-36, 23398.126953125)
unpacked big-endian: (1.4644897383138518e-38, 25.589000701904297)
temperature: 25.589001°C
The correct endianness of the Opaque packet is either part of the SNMP standard (in which case network byte order, '!', is probably the correct one) or should be given in the MIB, together with the field types that you need to pass as format specifiers. If your packets are always 7 bytes long, you might try a combination that adds up to 7 bytes instead of 8 (ff = 4+4); then you can also omit the \0 padding byte.
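As a sketch of the 7-byte idea: the format `>3sf` reads a 3-byte prefix followed by one big-endian 32-bit float, so no padding byte is needed. That the prefix is always three bytes long is an assumption drawn from this single sample, not from a MIB.

```python
import struct

raw = bytes.fromhex("9f780441ccb646")  # the 7-byte payload from the question

# With an explicit byte order, struct uses standard sizes and no alignment
# padding, so the format covers 3 + 4 = 7 bytes.
assert struct.calcsize(">3sf") == len(raw)

prefix, temperature = struct.unpack(">3sf", raw)
print(prefix.hex())           # the three leading bytes
print(round(temperature, 3))  # the float carried in the last four bytes
```

Keeping the prefix as a bytes value makes it easy to compare against a known constant before trusting the float.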

Based on TerhorstD's answer, with some changes, and knowing that the Opaque frame consists of 7 bytes of which the first 3 are constant (\x9fx\x04, or 159, 120, 4 in decimal), I wrote the following code snippet to deal with the problem:
...
handler = get("192.168.100.112", [".1.3.6.1.4.1.9839.1.2.532.0",
                                  '.1.3.6.1.4.1.9839.1.2.513.0'],
              hlapi.CommunityData('public'))
for key, value in handler.items():
    try:
        if len(value) == 7 and value[0].encode('latin1')[0] == 159\
                and value[1].encode('latin1')[0] == 120\
                and value[2].encode('latin1')[0] == 4:
            data = value[3:]
            print(struct.unpack('>f', data.encode('latin1'))[0])
        else:
            print(value)
    except (AttributeError, TypeError):
        # Non-string values (e.g. int) have no len() or encode()
        print(value)
Out:
25.589001
10
[NOTE]:
The float inside the Opaque value is stored in big-endian byte order (> in struct).
[UPDATE]:
A tidier version:
for key, value in handler.items():
    try:
        unpacked = struct.unpack('>BBBf', value.encode('latin1'))
        # Check whether the data is an Opaque float or not.
        if unpacked[:3] == (159, 120, 4):
            print(unpacked[-1])
        else:
            print(value)
    except (AttributeError, struct.error):
        # Not a string, or a string that is not exactly 7 bytes long
        print(value)
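The check-then-unpack logic can also be wrapped in a small helper. This is a sketch, not a pysnmp API: `decode_opaque_float` is a hypothetical name, and reading the `0x9f 0x78` prefix as "a float follows" and `0x9f 0x79` as "a double follows" is an assumption based on the convention used by Net-SNMP, so confirm it against the MIB of your device. The helper takes raw bytes; pyasn1's OctetString types (from which Opaque derives) offer `asOctets()`, which should give you those bytes without the latin1 round trip, although that path is not exercised here.

```python
import struct

# Assumed layout: tag byte 0x9f, type byte, length byte, then the IEEE 754 value.
_FORMATS = {
    0x78: (4, ">f"),  # assumed: single-precision float
    0x79: (8, ">d"),  # assumed: double-precision float
}


def decode_opaque_float(raw):
    """Return the float wrapped in an Opaque payload, or None if it does not match."""
    if len(raw) < 3 or raw[0] != 0x9F or raw[1] not in _FORMATS:
        return None
    size, fmt = _FORMATS[raw[1]]
    # The length byte must agree with both the type and the actual payload size.
    if raw[2] != size or len(raw) != 3 + size:
        return None
    return struct.unpack(fmt, raw[3:])[0]


print(decode_opaque_float("\x9fx\x04AÌ¶F".encode("latin1")))
print(decode_opaque_float(b"not opaque"))
```

Returning None instead of raising keeps the calling loop simple: any value that is not recognised can be printed or stored unchanged.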

Related

How to put downloaded JSON data into variables in Python

import requests
import json
import csv
# These are our demo API keys, you can use them!
#location = ""
api_key = 'simplyrets'
api_secret = 'simplyrets'
#api_url = 'https://api.simplyrets.com/properties?q=%s&limit=1' % (location)
api_url = 'https://api.simplyrets.com/properties'
response = requests.get(api_url, auth=(api_key, api_secret))
response.raise_for_status()
houseData = json.loads(response.text)
#different parameters we need to know
p = houseData['property']
roof = p["roof"]
cooling = p["cooling"]
style = p["style"]
area = p["area"]
bathsFull = p["bathsFull"]
bathsHalf = p["bathsHalf"]
This is a snippet of the code that I am working with to try and take the information from the JSON provided by the API and put them into variables that I can actually use.
I thought that when you loaded it with json.loads() it would become a dictionary.
Yet it is telling me that I cannot do p = houseData['property'] because "list indices must be integers, not str".
Am I wrong that houseData should be a dictionary?

There are hundreds of properties returned, all of which are in a list.
You'll need to specify which property you want, so for the first one:
p = houseData[0]['property']

From https://docs.python.org/2/library/json.html :
json.loads(s[, encoding[, cls[, object_hook[, parse_float[, parse_int[, parse_constant[, object_pairs_hook[, **kw]]]]]]]])
Deserialize s (a str or unicode instance containing a JSON document) to a Python object using this conversion table.
If s is a str instance and is encoded with an ASCII based encoding other than UTF-8 (e.g. latin-1), then an appropriate encoding name must be specified. Encodings that are not ASCII based (such as UCS-2) are not allowed and should be decoded to unicode first.
The other arguments have the same meaning as in load().
If your JSON starts as an array at the outermost layer, json.loads() will return a list. If your JSON's outermost layer is an associative array, then please post your JSON and we can look into it a little further.

The problem is that json.loads() doesn't necessarily return a dictionary. If the outside container of the JSON is a list, then json.loads() will return a list, where the elements could be lists or dictionaries. Try iterating through the list returned by json.loads(). It's possible the dictionary you're looking for is simply json.loads()[0] or some other element.
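A short sketch of that difference. The two JSON strings are made-up sample data, not the actual response of the API in the question:

```python
import json

# Outermost container is an array -> json.loads returns a list.
as_list = json.loads('[{"property": {"roof": "Tile"}}, {"property": {"roof": "Slate"}}]')

# Outermost container is an object -> json.loads returns a dict.
as_dict = json.loads('{"property": {"roof": "Tile"}}')

print(type(as_list).__name__)  # index it or loop over it
print(type(as_dict).__name__)  # look up keys directly

# Each element of the list is itself a dict that can be indexed by key.
roofs = [item["property"]["roof"] for item in as_list]
print(roofs)
```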

There are 2 different types of JSON containers: objects (which json.loads() turns into a dict) and arrays (which become a list).
An object looks like this:
node = {
    "foo": 7,
    "bar": "Hello World!"
}
An array looks like this:
array = [ "one", "two", 3, 4, "5ive" ]
Your JSON element is probably an array. You can verify whether it's an array, dict, or other by using:
isinstance(json_element, dict)
isinstance(json_element, list)
Hope this helps!

There are some minor changes you should make:
Your API response is returning a list, so you have to iterate over it.
The requests library already supports converting to JSON so you don't have to worry about it.
import requests
# These are our demo API keys, you can use them!
#location = ""
api_key = 'simplyrets'
api_secret = 'simplyrets'
#api_url = 'https://api.simplyrets.com/properties?q=%s&limit=1' % (location)
api_url = 'https://api.simplyrets.com/properties'
response = requests.get(api_url, auth=(api_key, api_secret))
response.raise_for_status()
houseData = response.json()
# different parameters we need to know
for data in houseData:
    p = data['property']
    roof = p["roof"]
    cooling = p["cooling"]
    style = p["style"]
    area = p["area"]
    bathsFull = p["bathsFull"]
    bathsHalf = p["bathsHalf"]
If you want to make sure you have only one result, check for it with an if statement:
if len(houseData) != 1:
    raise ValueError("Expecting only 1 houseData.")
data = houseData[0]
...

Removing 'u' character from the output of json.loads(jsonstring) [duplicate]

I'm using Python 2 to parse JSON from ASCII encoded text files.
When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.
Is it possible to get string objects instead of Unicode ones?
Example
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
(One easy and clean solution for 2017 is to use a recent version of Python — i.e. Python 3 and forward.)

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of the unicode type. Because JSON is a subset of YAML, it works nicely:
>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']
Notes
Some things to note though:
I get string objects because all my entries are ASCII encoded. If I used Unicode encoded entries, I would get them back as unicode objects; there is no conversion!
You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.
Conversion
As stated, there isn't any conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:
I have used the one from Mark Amery a couple of times now; it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might give you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

There's no built-in option to make the json module functions return byte strings instead of Unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using Unicode strings to UTF-8-encoded byte strings:
def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
Just call this on the output you get from a json.load or json.loads call.
A couple of notes:
To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.
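For readers on Python 3, where `unicode` and `dict.iteritems()` no longer exist, the same recursion can be sketched as follows. This is an adaptation rather than the answer's original code: here `str` values become `bytes`, which is rarely what Python 3 libraries want, so treat it as an illustration of the traversal only.

```python
import json


def byteify(value):
    """Recursively encode every str inside dicts and lists as UTF-8 bytes."""
    if isinstance(value, dict):
        return {byteify(k): byteify(v) for k, v in value.items()}
    if isinstance(value, list):
        return [byteify(item) for item in value]
    if isinstance(value, str):
        return value.encode("utf-8")
    return value  # numbers, booleans and None pass through unchanged


converted = byteify(json.loads('{"a": ["b", 1, {"c": null}]}'))
print(converted)
```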

A solution with object_hook
It works for both Python 2.7 and 3.x.
import json
def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )
def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )
def _byteify(data, ignore_dicts = False):
    if isinstance(data, str):
        return data
    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # If this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items() # changed to .items() for Python 2.7/3
        }
    # Python 3 compatible duck-typing
    # If this is a Unicode string, return its string representation
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')
    # If it's anything else, return it in its original form
    return data
Example usage:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
How does this work and why would I use it?
Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?
Purely for performance. Mark's answer decodes the JSON text fully first with Unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:
A copy of the entire decoded structure gets created in memory
If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth
This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the documentation:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.
Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
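The order in which object_hook receives dictionaries can be observed directly. A minimal Python 3 sketch (`recording_hook` and `seen` are illustrative names): the hook is called once per JSON object, and nested objects arrive before the objects that contain them.

```python
import json

seen = []


def recording_hook(decoded):
    # Called once for every JSON object, innermost objects first.
    seen.append(decoded)
    return decoded


json.loads('{"outer": {"inner": {"leaf": 1}}}', object_hook=recording_hook)
for decoded in seen:
    print(decoded)
```

By the time the hook sees an outer dictionary, every dictionary nested inside it has already been through the hook, which is why the hook itself does not need to recurse into dictionaries.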

You can use the object_hook parameter for json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will always pass the object_hook dicts only, and it will recursively pass in nested dicts, so you don't have to recurse into nested dicts yourself. I don't think I would convert Unicode strings to numbers like Wells shows. If it's a Unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).
Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external library expects.
So, for example:
def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv
def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv
obj = json.loads(s, object_hook=_decode_dict)

That's because JSON makes no distinction between string objects and Unicode objects. They're all strings in JavaScript.
I think JSON is right to return Unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact unicode objects (i.e., JSON (JavaScript) strings can store any kind of Unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit since the library would have to guess the encoding you want.
It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with Unicode objects.
But if you really want bytestrings, just encode the results to the encoding of your choice:
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

There exists an easy work-around.
TL;DR - Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.
While not a 'perfect' answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7
import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))
gives:
JSON Fail: {u'field': u'value'}
AST Win: {'field': 'value'}
This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
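One case where this work-around breaks is JSON that contains true, false or null, because those are not Python literals. A small Python 3 sketch (the dictionary is made-up sample data):

```python
import ast
import json

text = json.dumps({"field": "value", "active": True, "parent": None})
print(text)

# json.loads understands the JSON keywords true and null.
print(json.loads(text))

literal_eval_failed = False
try:
    ast.literal_eval(text)
except ValueError as exc:
    # literal_eval only accepts Python literals, so it rejects true and null.
    literal_eval_failed = True
    print("literal_eval failed:", type(exc).__name__)
```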

Mike Brennan's answer is close, but there isn't any reason to retraverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:
object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.
With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:
def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)
In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'
In [53]: json.load(open('test.json'))
Out[53]:
{u'1': u'hello',
u'abc': [1, 2, 3],
u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
u'def': {u'hi': u'mom'}}
In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Notice that I never have to call the hook recursively since every object will get handed to the hook when you use the object_pairs_hook. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don't have to recurse to make it happen.
A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:
for key, value in pairs:
to
for key, value in pairs.iteritems():
Then use object_hook instead of object_pairs_hook:
In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which might be worthwhile if you are parsing a huge document.
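What object_pairs_hook actually receives can be seen with a short Python 3 sketch (`show_pairs` and `calls` are illustrative names): each JSON object arrives as a list of (key, value) tuples in document order, with nested objects already converted by an earlier call.

```python
import json

calls = []


def show_pairs(pairs):
    # pairs is a list of (key, value) tuples, in the order they appear in the text.
    calls.append(pairs)
    return dict(pairs)


result = json.loads('{"b": 1, "a": {"c": 2}}', object_pairs_hook=show_pairs)

for pairs in calls:
    print(pairs)
print(result)
```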

I'm afraid there isn't any way to achieve this automatically within the simplejson library.
The scanner and decoder in simplejson are designed to produce Unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkey patch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.
The reason that simplejson outputs Unicode, however, is that the JSON specification specifically mentions that "A string is a collection of zero or more Unicode characters"... support for Unicode is assumed as part of the format itself. simplejson's scanstring implementation goes so far as to scan and interpret Unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as Unicode.
If you have an aged library that needs a str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.

As Mark (Amery) correctly notes: Using PyYAML's deserializer on a JSON dump works only if you have ASCII only. At least out of the box.
Two quick comments on the PyYAML approach:
Never use yaml.load() on data from the field. It’s a feature(!) of YAML to execute arbitrary code hidden within the structure.
You can make it work also for non ASCII via this:
def to_utf8(loader, node):
    return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
But performance-wise, it’s of no comparison to Mark Amery's answer:
Throwing some deeply-nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):
dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
dt[byteify recursion(Mark Amery)] =~ 5 * dt[j]
So deserialization, including fully walking the tree and encoding, is well within the order of magnitude of JSON's C-based implementation. I find this remarkably fast, and it's also more robust than the YAML load on deeply nested structures. It is also less prone to security errors, considering yaml.load.
=> While I would appreciate a pointer to a C-only based converter, the byteify function should be the default answer.
This holds especially true if your JSON structure is from the field, containing user input, because then you probably need to walk over your structure anyway, independent of your desired internal data structures ('unicode sandwich' or byte strings only).
Why?
Unicode normalisation. For the unaware: Take a painkiller and read this.
So using the byteify recursion you kill two birds with one stone:
get your bytestrings from nested JSON dumps
get user input values normalised, so that you find the stuff in your storage.
In my tests it turned out that replacing the input.encode('utf-8') with a unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC - but that’s heavily dependent on the sample data I guess.
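Why normalisation matters for lookups can be shown with a short Python 3 sketch (the sample name is made up): two strings that render identically compare as different until both are normalised.

```python
import unicodedata

decomposed = "Jose\u0301"  # 'e' followed by a combining acute accent
composed = "Jos\u00e9"     # the single precomposed character

# Visually identical, but different code point sequences.
print(decomposed == composed)

normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)

# The UTF-8 byte strings differ in length as well.
print(len(decomposed.encode("utf-8")), len(normalized.encode("utf-8")))
```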

The gotcha is that simplejson and json are two different modules, at least in the manner they deal with Unicode. You have json in Python 2.6+, and this gives you Unicode values, whereas simplejson returns string objects.
Just try easy_install-ing simplejson in your environment and see if that works. It did for me.

Just use pickle instead of json for dump and load, like so:
import json
import pickle
d = { 'field1': 'value1', 'field2': 2, }
json.dump(d,open("testjson.txt","w"))
print json.load(open("testjson.txt","r"))
pickle.dump(d,open("testpickle.txt","w"))
print pickle.load(open("testpickle.txt","r"))
The output it produces is (strings and integers are handled correctly):
{u'field2': 2, u'field1': u'value1'}
{'field2': 2, 'field1': 'value1'}
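On Python 3 the same comparison looks different: pickle data is bytes (so files need binary mode), and json.loads already returns str, so the difference shown above disappears. A sketch using in-memory round trips instead of files; keep in mind that unpickling data from an untrusted source is unsafe.

```python
import json
import pickle

d = {"field1": "value1", "field2": 2}

from_json = json.loads(json.dumps(d))
from_pickle = pickle.loads(pickle.dumps(d))  # pickle works with bytes, not text

print(from_json == d, from_pickle == d)
print(type(pickle.dumps(d)).__name__)
```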

I had a JSON dict as a string. The keys and values were Unicode objects like in the following example:
myStringDict = "{u'key':u'value'}"
I could use the byteify function suggested above by converting the string to a dict object using ast.literal_eval(myStringDict).

So, I've run into the same problem.
Because I need to pass all data to PyGTK, Unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for type-safe JSON conversion - json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict indexes though.
# removes any objects, turns Unicode back into str
def filter_data(obj):
    if type(obj) in (int, float, str, bool):
        return obj
    elif type(obj) == unicode:
        return str(obj)
    elif type(obj) in (list, tuple, set):
        obj = list(obj)
        for i,v in enumerate(obj):
            obj[i] = filter_data(v)
    elif type(obj) == dict:
        for i,v in obj.iteritems():
            obj[i] = filter_data(v)
    else:
        print "invalid object in data, converting to string"
        obj = str(obj)
    return obj

Support for Python 2 and 3 using a hook (from Mirec Miskuf's answer):
import requests
import six
from six import iteritems
requests.packages.urllib3.disable_warnings() # #UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)
def _byteify(data):
    # If this is a Unicode string, return its string representation
    if isinstance(data, six.string_types):
        return str(data.encode('utf-8').decode())
    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item) for item in data ]
    # If this is a dictionary, return dictionary of byteified keys and values,
    # but only if we haven't already byteified it
    if isinstance(data, dict):
        return {
            _byteify(key): _byteify(value) for key, value in iteritems(data)
        }
    # If it's anything else, return it in its original form
    return data
w = r.json(object_hook=_byteify)
print(w)
Returns:
{'three': '', 'key': 'value', 'one': 'two'}

I built this recursive caster. It works for my needs and I think it's relatively complete.
def _parseJSON(self, obj):
    newobj = {}
    for key, value in obj.iteritems():
        key = str(key)
        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val
    return newobj
Just pass it a JSON object like so:
obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)
I have it as a private member of a class, but you can repurpose the method as you see fit.

I rewrote Wells's _parseJSON() to handle cases where the JSON object itself is an array (my use case).
def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj

Here is a recursive encoder written in C:
https://github.com/axiros/nested_encode
The performance overhead for "average" structures is around 10% compared to json.loads().
python speed.py
json loads [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
time overhead in percent: 9%
using this test structure:
import json, nested_encode, time
s = """
{
"firstName": "Jos\\u0301",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "\\u00d6sterreich",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
}
],
"children": [],
"spouse": null,
"a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""
t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1
t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1
print "json loads [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])
print "time overhead in percent: %i%%" % (100 * (dt_json_enc - dt_json)/dt_json)

With Python 3.6, sometimes I still run into this problem. For example, when getting a response from a REST API and loading the response text to JSON, I still get the Unicode strings.
Found a simple solution using json.dumps().
response_message = json.loads(json.dumps(response.text))
print(response_message)
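It helps to see what this round trip does: json.dumps wraps the text in a JSON string literal and json.loads unwraps it again, so the result is the original text rather than parsed data. A short Python 3 sketch, where `response_text` is a made-up stand-in for response.text:

```python
import json

response_text = '{"status": "ok"}'  # stand-in for response.text

# dumps then loads on a str returns an equal str.
round_trip = json.loads(json.dumps(response_text))

# Parsing the text directly gives the decoded structure.
parsed = json.loads(response_text)

print(type(round_trip).__name__, round_trip == response_text)
print(type(parsed).__name__, parsed)
```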

I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the Unicode keys to strings. (simplejson on GAE does not return string keys.)
obj is the object decoded from JSON:
if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()
kwargs is what I pass to the constructor of the GAE application (which does not like Unicode keys in **kwargs).
It is not as robust as the solution from Wells, but much smaller.

I've adapted the code from Mark Amery's answer, mainly to get rid of isinstance in favor of duck typing.
The encoding is done manually and ensure_ascii is disabled. The Python documentation for json.dump says that:
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences
Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ASCII, incorrectly I think, as it depends on the code page setting); Windows-1250, used e.g. in Windows (sometimes referred to as ANSI, depending on the locale settings); and ISO 8859-1, sometimes used on HTTP servers.
The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.
# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json
def encode_items(input, encoding='utf-8'):
u"""original from: https://stackoverflow.com/a/13101776/611007
adapted by SO/u/611007 (20150623)
>>>
>>> ## run this with `python -m doctest <this file>.py` from command line
>>>
>>> txt = u"Tüskéshátú kígyóbűvölő"
>>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
>>> txt3 = u"uúuutifu"
>>> txt4 = b'u\\xfauutifu'
>>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
>>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
>>> txt4u = txt4.decode('cp1250')
>>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
>>> txt5 = b"u\\xc3\\xbauutifu"
>>> txt5u = txt5.decode('utf-8')
>>> txt6 = u"u\\u251c\\u2551uutifu"
>>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
>>> assert txt == there_and_back_again(txt)
>>> assert txt == there_and_back_again(txt2)
>>> assert txt3 == there_and_back_again(txt3)
>>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
>>> assert txt3 == txt4u,(txt3,txt4u)
>>> assert txt3 == there_and_back_again(txt5)
>>> assert txt3 == there_and_back_again(txt5u)
>>> assert txt3 == there_and_back_again(txt4u)
>>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
>>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
>>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
>>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
>>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
>>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
>>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
>>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
"""
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input
def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
I'd also like to highlight the answer of Jarret Hardie which references the JSON specification, quoting:
A string is a collection of zero or more Unicode characters
In my use case, I had files with JSON content. They are UTF-8 encoded files. ensure_ascii results in properly escaped, but not very readable JSON files, and that is why I've adapted Mark Amery's answer to fit my needs.
The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful for someone.
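To illustrate the ensure_ascii behaviour quoted above, here is a minimal sketch. It assumes Python 3 (the answer itself targets Python 2), and the sample dictionary is made up for illustration.

```python
import json

sample = {"a": "Tüskés"}  # hypothetical sample data, not from the answer

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(sample)

# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(sample, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same object with json.loads; only the readability of the serialized text differs.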

Check out this answer to a similar question, which states:
The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.
For example, try this:
print mail_accounts[0]["i"]
You won't see a u.

How to use struct.pack when the data and the size to pack is undefined in advance

I need to dynamically generate a binary file from CSV file.
Example:
CSV file:
#size, #data
1 , 0xAB
2 , 1234 (0x04D2)
5 , "ascii" (0x6173636969)
1 , "\x23" (0x23)
Expected binary file:
'\xAB\x04\xD2\x61\x73\x63\x69\x69\x23'
The data can be a string, an unsigned integer, or a hexadecimal value.
In my program I proceed as follows:
I read the size/data pairs from the CSV file
I use the eval function to get the data value
I use the struct.pack function to generate the output data
The problem is how to use struct.pack to process either a string or a numeric value.
I tried this:
check isinstance(value, basestring) to handle strings
but I don't know how to handle an unsigned value defined in hexadecimal, nor how to specify the format type for a special size (e.g. 5 bytes).
I am thinking about converting every value into a hexadecimal string ...
What is the simplest way to convert a string or unsigned value into binary output of a defined size?

If you encounter a string, you just need to use encode to get a byte string from it. If you encounter a value, just try to convert it to an int in base 10 or 16 and then use struct.pack:
import struct

formats = {
    1: "B",
    2: "H",
    4: "I",
    8: "Q"
}
def handle_value(size, value):
    try:
        value = int(value)
    except ValueError:
        try:
            value = int(value, 16)
        except ValueError:
            pass
    if isinstance(value, str):
        # Take `size` characters following the opening quote
        start = value.find('"') + 1
        value = value[start:start + size]
        value = value.encode("ascii") # or whatever encoding you want
    else:
        # Big-endian unsigned integer of the requested size
        value = struct.pack(">" + formats[size], value)
    return value
Then to read the whole file:
output = bytes()
for line in files:
    size, value = line.split(",")
    size = int(size.strip())
    value = value.strip()
    output += handle_value(size, value)
Edit: I didn't notice you get the size from the CSV file, so you can infer the format you want from this size if the value is an int.
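The format table above only covers 1, 2, 4 and 8 bytes, while the question also asks about odd sizes such as 5 bytes. As a hedged alternative, assuming Python 3, int.to_bytes can produce a big-endian integer of any width. The helper name pack_field and the already-parsed rows list below are hypothetical and only serve the illustration.

```python
def pack_field(size, value):
    """Pack one field into exactly `size` bytes (hypothetical helper)."""
    if isinstance(value, str):
        # latin-1 maps code points 0..255 to single bytes, so "\x23" -> b"#"
        data = value.encode("latin-1")
        if len(data) != size:
            raise ValueError("string %r is not %d bytes long" % (value, size))
        return data
    # Unsigned big-endian integer of arbitrary width (raises OverflowError
    # if the value does not fit into `size` bytes).
    return value.to_bytes(size, "big")

# Rows as they might look after parsing the CSV from the question.
rows = [(1, 0xAB), (2, 1234), (5, "ascii"), (1, "\x23")]
output = b"".join(pack_field(size, value) for size, value in rows)
print(output)
```

A 5-byte integer works the same way, for example pack_field(5, 1) yields four zero bytes followed by 0x01.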

How to read in binary data after ascii header in Python

I have some imaging data that's stored in a file that contains an ascii text header, ending with a null character, followed by the binary data. The ascii headers vary in length, and I'm wondering what's the best way to open the file, read the header and find the null character, and then load the binary data (in Python).
Thanks for the help,
James

Probably ought to start with something like this.
with open('some file','rb') as input:
    aByte= input.read(1)
    while aByte and ord(aByte) != 0: aByte= input.read(1)
    # At this point, what's left is the binary data.
Python version numbers matter a lot for this kind of thing. The issue is the result of the read function. In Python 3, a file opened in binary mode returns bytes objects, and indexing one yields integers. In Python 2, it returns strings, which is why ord(aByte) is needed to get the numeric value. ord(aByte) accepts a one-byte value in either case.

Does something like this work:
with open('some_file','rb') as f:
    binary_data = f.read().split('\0',1)[1]
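For Python 3, where binary reads return bytes, the same idea could be sketched as follows. The function name split_header and the sample data are assumptions made for illustration; an in-memory io.BytesIO stands in for the real file.

```python
import io

def split_header(stream):
    """Return (header_text, binary_payload) from a binary stream."""
    raw = stream.read()
    # partition splits at the FIRST null byte only, so nulls inside
    # the binary payload are left untouched.
    header, separator, payload = raw.partition(b"\x00")
    if not separator:
        raise ValueError("no null terminator found after the header")
    return header.decode("ascii"), payload

# Hypothetical file contents: ASCII header, null byte, then binary data.
sample = io.BytesIO(b"width=2 height=1\x00\x01\x02\x00\x03")
header, payload = split_header(sample)
print(header)
print(payload)
```

With a real file, open it with open(path, 'rb') and pass the file object instead of the BytesIO. For very large files, reading the header byte by byte as in the previous answer avoids loading everything at once.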

Other people have already answered your direct question, but I thought I'd add this.
When working with binary data, I often find it useful to subclass file and add various convenience methods for reading/writing packed binary data.
It's overkill for simple things, but if you find yourself parsing lots of binary file formats, it's worth the extra effort to avoid repeating yourself.
If nothing else, hopefully it serves as a useful example of how to use struct. On a side note, this is pulled from older code, and is very much python 2.x. Python 3.x handles this (particularly strings vs. bytes) significantly differently.
import struct
import array
class BinaryFile(file):
    """
    Automatically packs or unpacks binary data according to a format
    when reading or writing.
    """
    def __init__(self, *args, **kwargs):
        """
        Initialization is the same as a normal file object
        %s""" % file.__doc__
        # super() already binds self, so it must not be passed again
        super(BinaryFile, self).__init__(*args, **kwargs)
    def read_binary(self,fmt):
        """
        Read and unpack a binary value from the file based
        on string fmt (see the struct module for details).
        This will strip any trailing null characters if a string format is
        specified.
        """
        size = struct.calcsize(fmt)
        data = self.read(size)
        # Reading beyond the end of the file just returns ''
        if len(data) != size:
            raise EOFError('End of file reached')
        data = struct.unpack(fmt, data)
        # Strip trailing null characters in strings (tuples are immutable,
        # so build a new tuple instead of reassigning the loop variable)
        data = tuple(item.rstrip('\x00') if isinstance(item, str) else item
                     for item in data)
        # Unpack the tuple if it only has one value
        if len(data) == 1:
            data = data[0]
        return data
    def write_binary(self, fmt, dat):
        """Pack and write data to the file according to string fmt."""
        # Try expanding input arguments (struct.pack won't take a tuple)
        try:
            dat = struct.pack(fmt, *dat)
        except (TypeError, struct.error):
            # If it's not a sequence (TypeError), or if it's a
            # string (struct.error), don't expand.
            dat = struct.pack(fmt, dat)
        self.write(dat)
    def read_header(self, header):
        """
        Reads a defined structure "header" consisting of a sequence of (name,
        format) strings from the file. Returns a dict with keys of the given
        names and values unpacked according to the given format for each item in
        "header".
        """
        header_values = {}
        for key, format in header:
            header_values[key] = self.read_binary(format)
        return header_values
    def read_nullstring(self):
        """
        Reads a null-terminated string from the file. This is not implemented
        in an efficient manner for long strings!
        """
        output_string = ''
        char = self.read(1)
        while char != '\x00':
            output_string += char
            char = self.read(1)
            if len(char) == 0:
                break
        return output_string
    def read_array(self, type, number):
        """
        Read data from the file and return an array.array of the given
        "type" with "number" elements
        """
        size = struct.calcsize(type)
        data = self.read(size * number)
        return array.array(type, data)
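Since the class above depends on the Python 2 file type, here is a hedged Python 3 sketch of the same read-and-unpack idea as a plain function. The name read_struct, the format string and the sample values are assumptions for illustration, and io.BytesIO stands in for a real file opened in 'rb' mode.

```python
import io
import struct

def read_struct(stream, fmt):
    """Read struct.calcsize(fmt) bytes from stream and unpack them."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    # A short read means the stream ended before the record was complete.
    if len(data) != size:
        raise EOFError("End of file reached")
    values = struct.unpack(fmt, data)
    # Return a bare value when the format describes a single field.
    return values[0] if len(values) == 1 else values

# Hypothetical record: little-endian uint16, uint32 and a 4-byte tag.
stream = io.BytesIO(struct.pack("<HI4s", 7, 1024, b"RAW\x00"))
version, length, tag = read_struct(stream, "<HI4s")
print(version, length, tag.rstrip(b"\x00"))
```

In Python 3 the "s" format yields bytes, so null padding is stripped with a bytes argument as shown in the last line.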

How to get string objects instead of Unicode from JSON

I'm using Python 2 to parse JSON from ASCII encoded text files.
When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some libraries that only accept string objects. I can't change the libraries nor update them.
Is it possible to get string objects instead of Unicode ones?
Example
>>> import json
>>> original_list = ['a', 'b']
>>> json_list = json.dumps(original_list)
>>> json_list
'["a", "b"]'
>>> new_list = json.loads(json_list)
>>> new_list
[u'a', u'b'] # I want these to be of type `str`, not `unicode`
(One easy and clean solution for 2017 is to use a recent version of Python — i.e. Python 3 and forward.)

While there are some good answers here, I ended up using PyYAML to parse my JSON files, since it gives the keys and values as str type strings instead of the unicode type. Because JSON is a subset of YAML, it works nicely:
>>> import json
>>> import yaml
>>> list_org = ['a', 'b']
>>> list_dump = json.dumps(list_org)
>>> list_dump
'["a", "b"]'
>>> json.loads(list_dump)
[u'a', u'b']
>>> yaml.safe_load(list_dump)
['a', 'b']
Notes
Some things to note though:
I get string objects because all my entries are ASCII encoded. If I would use Unicode encoded entries, I would get them back as unicode objects — there is no conversion!
You should (probably always) use PyYAML's safe_load function; if you use it to load JSON files, you don't need the "additional power" of the load function anyway.
If you want a YAML parser that has more support for the 1.2 version of the spec (and correctly parses very low numbers) try Ruamel YAML: pip install ruamel.yaml and import ruamel.yaml as yaml was all I needed in my tests.
Conversion
As stated, there isn't any conversion! If you can't be sure to only deal with ASCII values (and you can't be sure most of the time), better use a conversion function:
I used the one from Mark Amery a couple of times now, it works great and is very easy to use. You can also use a similar function as an object_hook instead, as it might gain you a performance boost on big files. See the slightly more involved answer from Mirec Miskuf for that.

There's no built-in option to make the json module functions return byte strings instead of Unicode strings. However, this short and simple recursive function will convert any decoded JSON object from using Unicode strings to UTF-8-encoded byte strings:
def byteify(input):
    if isinstance(input, dict):
        return {byteify(key): byteify(value)
                for key, value in input.iteritems()}
    elif isinstance(input, list):
        return [byteify(element) for element in input]
    elif isinstance(input, unicode):
        return input.encode('utf-8')
    else:
        return input
Just call this on the output you get from a json.load or json.loads call.
A couple of notes:
To support Python 2.6 or earlier, replace return {byteify(key): byteify(value) for key, value in input.iteritems()} with return dict([(byteify(key), byteify(value)) for key, value in input.iteritems()]), since dictionary comprehensions weren't supported until Python 2.7.
Since this answer recurses through the entire decoded object, it has a couple of undesirable performance characteristics that can be avoided with very careful use of the object_hook or object_pairs_hook parameters. Mirec Miskuf's answer is so far the only one that manages to pull this off correctly, although as a consequence, it's significantly more complicated than my approach.

A solution with object_hook
It works for both Python 2.7 and 3.x.
import json
def json_load_byteified(file_handle):
    return _byteify(
        json.load(file_handle, object_hook=_byteify),
        ignore_dicts=True
    )
def json_loads_byteified(json_text):
    return _byteify(
        json.loads(json_text, object_hook=_byteify),
        ignore_dicts=True
    )
def _byteify(data, ignore_dicts = False):
    if isinstance(data, str):
        return data
    # If this is a list of values, return list of byteified values
    if isinstance(data, list):
        return [ _byteify(item, ignore_dicts=True) for item in data ]
    # If this is a dictionary, return dictionary of byteified keys and values
    # but only if we haven't already byteified it
    if isinstance(data, dict) and not ignore_dicts:
        return {
            _byteify(key, ignore_dicts=True): _byteify(value, ignore_dicts=True)
            for key, value in data.items() # changed to .items() for Python 2.7/3
        }
    # Python 3 compatible duck-typing
    # If this is a Unicode string, return its string representation
    if str(type(data)) == "<type 'unicode'>":
        return data.encode('utf-8')
    # If it's anything else, return it in its original form
    return data
Example usage:
>>> json_loads_byteified('{"Hello": "World"}')
{'Hello': 'World'}
>>> json_loads_byteified('"I am a top-level string"')
'I am a top-level string'
>>> json_loads_byteified('7')
7
>>> json_loads_byteified('["I am inside a list"]')
['I am inside a list']
>>> json_loads_byteified('[[[[[[[["I am inside a big nest of lists"]]]]]]]]')
[[[[[[[['I am inside a big nest of lists']]]]]]]]
>>> json_loads_byteified('{"foo": "bar", "things": [7, {"qux": "baz", "moo": {"cow": ["milk"]}}]}')
{'things': [7, {'qux': 'baz', 'moo': {'cow': ['milk']}}], 'foo': 'bar'}
>>> json_load_byteified(open('somefile.json'))
{'more json': 'from a file'}
How does this work and why would I use it?
Mark Amery's function is shorter and clearer than these ones, so what's the point of them? Why would you want to use them?
Purely for performance. Mark's answer decodes the JSON text fully first with Unicode strings, then recurses through the entire decoded value to convert all strings to byte strings. This has a couple of undesirable effects:
A copy of the entire decoded structure gets created in memory
If your JSON object is really deeply nested (500 levels or more) then you'll hit Python's maximum recursion depth
This answer mitigates both of those performance issues by using the object_hook parameter of json.load and json.loads. From the documentation:
object_hook is an optional function that will be called with the result of any object literal decoded (a dict). The return value of object_hook will be used instead of the dict. This feature can be used to implement custom decoders
Since dictionaries nested many levels deep in other dictionaries get passed to object_hook as they're decoded, we can byteify any strings or lists inside them at that point and avoid the need for deep recursion later.
Mark's answer isn't suitable for use as an object_hook as it stands, because it recurses into nested dictionaries. We prevent that recursion in this answer with the ignore_dicts parameter to _byteify, which gets passed to it at all times except when object_hook passes it a new dict to byteify. The ignore_dicts flag tells _byteify to ignore dicts since they have already been byteified.
Finally, our implementations of json_load_byteified and json_loads_byteified call _byteify (with ignore_dicts=True) on the result returned from json.load or json.loads to handle the case where the JSON text being decoded doesn't have a dict at the top level.
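The claim that nested dictionaries are handed to object_hook as they are decoded can be checked with a small sketch. It assumes Python 3; the hook name record and the sample JSON text are made up for illustration.

```python
import json

seen = []

def record(decoded_dict):
    # Remember the keys of every dict the decoder hands to the hook.
    seen.append(sorted(decoded_dict))
    return decoded_dict

json.loads('{"outer": {"inner": {"leaf": 1}}}', object_hook=record)
print(seen)  # innermost dict first, outermost dict last
```

Because the innermost objects reach the hook first, each dict has already been processed by the time its parent is passed in, which is why the hook does not need to recurse into nested dicts.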

You can use the object_hook parameter for json.loads to pass in a converter. You don't have to do the conversion after the fact. The json module will always pass the object_hook dicts only, and it will recursively pass in nested dicts, so you don't have to recurse into nested dicts yourself. I don't think I would convert Unicode strings to numbers like Wells shows. If it's a Unicode string, it was quoted as a string in the JSON file, so it is supposed to be a string (or the file is bad).
Also, I'd try to avoid doing something like str(val) on a unicode object. You should use value.encode(encoding) with a valid encoding, depending on what your external library expects.
So, for example:
def _decode_list(data):
    rv = []
    for item in data:
        if isinstance(item, unicode):
            item = item.encode('utf-8')
        elif isinstance(item, list):
            item = _decode_list(item)
        elif isinstance(item, dict):
            item = _decode_dict(item)
        rv.append(item)
    return rv
def _decode_dict(data):
    rv = {}
    for key, value in data.iteritems():
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        elif isinstance(value, list):
            value = _decode_list(value)
        elif isinstance(value, dict):
            value = _decode_dict(value)
        rv[key] = value
    return rv
obj = json.loads(s, object_hook=_decode_dict)

That's because JSON makes no distinction between string objects and Unicode objects. They're all strings in JavaScript.
I think JSON is right to return Unicode objects. In fact, I wouldn't accept anything less, since JavaScript strings are in fact unicode objects (i.e., JSON (JavaScript) strings can store any kind of Unicode character), so it makes sense to create unicode objects when translating strings from JSON. Plain strings just wouldn't fit since the library would have to guess the encoding you want.
It's better to use unicode string objects everywhere. So your best option is to update your libraries so they can deal with Unicode objects.
But if you really want bytestrings, just encode the results to the encoding of your choice:
>>> nl = json.loads(js)
>>> nl
[u'a', u'b']
>>> nl = [s.encode('utf-8') for s in nl]
>>> nl
['a', 'b']

There exists an easy work-around.
TL;DR - Use ast.literal_eval() instead of json.loads(). Both ast and json are in the standard library.
While not a 'perfect' answer, it gets one pretty far if your plan is to ignore Unicode altogether. In Python 2.7
import json, ast
d = { 'field' : 'value' }
print "JSON Fail: ", json.loads(json.dumps(d))
print "AST Win:", ast.literal_eval(json.dumps(d))
gives:
JSON Fail: {u'field': u'value'}
AST Win: {'field': 'value'}
This gets more hairy when some objects are really Unicode strings. The full answer gets hairy quickly.
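One concrete way this gets hairy: JSON literals such as true and null are not Python literals, so ast.literal_eval rejects them. A minimal sketch, assuming Python 3 and a made-up JSON string:

```python
import ast
import json

text = '{"isAlive": true, "spouse": null}'  # hypothetical JSON input

# json.loads understands the JSON literals true and null.
print(json.loads(text))

# ast.literal_eval only accepts Python literals, so it refuses this input.
try:
    ast.literal_eval(text)
    literal_eval_ok = True
except ValueError:
    literal_eval_ok = False

print(literal_eval_ok)
```

So the work-around only applies to JSON text that happens to also be a valid Python literal.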

Mike Brennan's answer is close, but there isn't any reason to retraverse the entire structure. If you use the object_pairs_hook (Python 2.7+) parameter:
object_pairs_hook is an optional function that will be called with the result of any object literal decoded with an ordered list of pairs. The return value of object_pairs_hook will be used instead of the dict. This feature can be used to implement custom decoders that rely on the order that the key and value pairs are decoded (for example, collections.OrderedDict will remember the order of insertion). If object_hook is also defined, the object_pairs_hook takes priority.
With it, you get each JSON object handed to you, so you can do the decoding with no need for recursion:
def deunicodify_hook(pairs):
    new_pairs = []
    for key, value in pairs:
        if isinstance(value, unicode):
            value = value.encode('utf-8')
        if isinstance(key, unicode):
            key = key.encode('utf-8')
        new_pairs.append((key, value))
    return dict(new_pairs)
In [52]: open('test.json').read()
Out[52]: '{"1": "hello", "abc": [1, 2, 3], "def": {"hi": "mom"}, "boo": [1, "hi", "moo", {"5": "some"}]}'
In [53]: json.load(open('test.json'))
Out[53]:
{u'1': u'hello',
u'abc': [1, 2, 3],
u'boo': [1, u'hi', u'moo', {u'5': u'some'}],
u'def': {u'hi': u'mom'}}
In [54]: json.load(open('test.json'), object_pairs_hook=deunicodify_hook)
Out[54]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Notice that I never have to call the hook recursively since every object will get handed to the hook when you use the object_pairs_hook. You do have to care about lists, but as you can see, an object within a list will be properly converted, and you don't have to recurse to make it happen.
A coworker pointed out that Python 2.6 doesn't have object_pairs_hook. You can still use this with Python 2.6 by making a very small change. In the hook above, change:
for key, value in pairs:
to
for key, value in pairs.iteritems():
Then use object_hook instead of object_pairs_hook:
In [66]: json.load(open('test.json'), object_hook=deunicodify_hook)
Out[66]:
{'1': 'hello',
'abc': [1, 2, 3],
'boo': [1, 'hi', 'moo', {'5': 'some'}],
'def': {'hi': 'mom'}}
Using object_pairs_hook results in one less dictionary being instantiated for each object in the JSON document, which might be worthwhile if you are parsing a huge document.
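A hedged Python 3 analogue of the hook, where str plays the role of Python 2's unicode and bytes the role of Python 2's str. Unlike the hook above, this sketch also converts strings that sit directly inside lists. The names _encode and bytes_pairs_hook and the sample document are assumptions for illustration.

```python
import json

def _encode(value, encoding="utf-8"):
    # Strings become bytes; lists are walked because the hook never sees them.
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, list):
        return [_encode(item, encoding) for item in value]
    # Dicts were already converted when the decoder passed them to the hook.
    return value

def bytes_pairs_hook(pairs):
    # Called once per JSON object with its (key, value) pairs in order.
    return {_encode(key): _encode(value) for key, value in pairs}

doc = '{"1": "hello", "boo": [1, "hi", {"5": "some"}]}'
result = json.loads(doc, object_pairs_hook=bytes_pairs_hook)
print(result)
```

The only recursion left is over lists; nested objects arrive at the hook already converted.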

I'm afraid there isn't any way to achieve this automatically within the simplejson library.
The scanner and decoder in simplejson are designed to produce Unicode text. To do this, the library uses a function called c_scanstring (if it's available, for speed), or py_scanstring if the C version is not available. The scanstring function is called several times by nearly every routine that simplejson has for decoding a structure that might contain text. You'd have to either monkey patch the scanstring value in simplejson.decoder, or subclass JSONDecoder and provide pretty much your own entire implementation of anything that might contain text.
The reason that simplejson outputs Unicode, however, is that the JSON specification specifically mentions that "A string is a collection of zero or more Unicode characters"... support for Unicode is assumed as part of the format itself. simplejson's scanstring implementation goes so far as to scan and interpret Unicode escapes (even error-checking for malformed multi-byte charset representations), so the only way it can reliably return the value to you is as Unicode.
If you have an aged library that needs an str, I recommend you either laboriously search the nested data structure after parsing (which I acknowledge is what you explicitly said you wanted to avoid... sorry), or perhaps wrap your libraries in some sort of facade where you can massage the input parameters at a more granular level. The second approach might be more manageable than the first if your data structures are indeed deeply nested.

As Mark (Amery) correctly notes: Using PyYAML's deserializer on a JSON dump works only if you have ASCII only. At least out of the box.
Two quick comments on the PyYAML approach:
Never use yaml.load() on data from the field. It’s a feature(!) of YAML to execute arbitrary code hidden within the structure.
You can make it work also for non ASCII via this:
def to_utf8(loader, node):
    return loader.construct_scalar(node).encode('utf-8')
yaml.add_constructor(u'tag:yaml.org,2002:str', to_utf8)
But performance-wise, it’s of no comparison to Mark Amery's answer:
Throwing some deeply-nested sample dicts onto the two methods, I get this (with dt[j] = time delta of json.loads(json.dumps(m))):
dt[yaml.safe_load(json.dumps(m))] =~ 100 * dt[j]
dt[byteify recursion(Mark Amery)] =~ 5 * dt[j]
So deserialization, including fully walking the tree and encoding, is well within the order of magnitude of JSON's C-based implementation. I find this remarkably fast, and it's also more robust than the YAML load for deeply nested structures. It is also less prone to security errors, looking at yaml.load.
=> While I would appreciate a pointer to a C-only based converter, the byteify function should be the default answer.
This holds especially true if your JSON structure is from the field, containing user input, because then you probably need to walk over your structure anyway, independent of your desired internal data structures ('unicode sandwich' or byte strings only).
Why?
Unicode normalisation. For the unaware: Take a painkiller and read this.
So using the byteify recursion you kill two birds with one stone:
get your bytestrings from nested JSON dumps
get user input values normalised, so that you find the stuff in your storage.
In my tests it turned out that replacing the input.encode('utf-8') with a unicodedata.normalize('NFC', input).encode('utf-8') was even faster than without NFC - but that’s heavily dependent on the sample data I guess.
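To show what the NFC step changes, here is a minimal Python 3 sketch using the standard unicodedata module. The sample strings are made up for illustration.

```python
import unicodedata

decomposed = "Jose\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "Jos\u00e9"     # precomposed U+00E9

# The two strings look alike when displayed but compare as different.
print(decomposed == composed)

# NFC normalisation composes the pair into the single code point.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)

# The encoded bytes differ too, which matters for lookups in storage.
print(len(decomposed.encode("utf-8")), len(normalized.encode("utf-8")))
```

Normalising before encoding means that both spellings of the same user input end up as the same byte string.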

The gotcha is that simplejson and json are two different modules, at least in the manner they deal with Unicode. You have json in Python 2.6+, and this gives you Unicode values, whereas simplejson returns string objects.
Just try easy_install-ing simplejson in your environment and see if that works. It did for me.

Just use pickle instead of json for dump and load, like so:
import json
import pickle
d = { 'field1': 'value1', 'field2': 2, }
json.dump(d,open("testjson.txt","w"))
print json.load(open("testjson.txt","r"))
pickle.dump(d,open("testpickle.txt","w"))
print pickle.load(open("testpickle.txt","r"))
The output it produces is (strings and integers are handled correctly):
{u'field2': 2, u'field1': u'value1'}
{'field2': 2, 'field1': 'value1'}

I had a JSON dict as a string. The keys and values were Unicode objects like in the following example:
myStringDict = "{u'key':u'value'}"
I could use the byteify function suggested above by converting the string to a dict object using ast.literal_eval(myStringDict).

So, I've run into the same problem.
Because I need to pass all data to PyGTK, Unicode strings aren't very useful to me either. So I have another recursive conversion method. It's actually also needed for type-safe JSON conversion - json.dump() would bail on any non-literals, like Python objects. It doesn't convert dict indexes though.
# removes any objects, turns Unicode back into str
def filter_data(obj):
    if type(obj) in (int, float, str, bool):
        return obj
    elif type(obj) == unicode:
        return str(obj)
    elif type(obj) in (list, tuple, set):
        obj = list(obj)
        for i,v in enumerate(obj):
            obj[i] = filter_data(v)
    elif type(obj) == dict:
        for i,v in obj.iteritems():
            obj[i] = filter_data(v)
    else:
        print "invalid object in data, converting to string"
        obj = str(obj)
    return obj

Support for Python 2 and 3 using a hook (from Mirec Miskuf's answer):
import requests
import six
from six import iteritems
requests.packages.urllib3.disable_warnings() # #UndefinedVariable
r = requests.get("http://echo.jsontest.com/key/value/one/two/three", verify=False)
def _byteify(data):
# If this is a Unicode string, return its string representation
if isinstance(data, six.string_types):
return str(data.encode('utf-8').decode())
# If this is a list of values, return list of byteified values
if isinstance(data, list):
return [ _byteify(item) for item in data ]
# If this is a dictionary, return dictionary of byteified keys and values,
# but only if we haven't already byteified it
if isinstance(data, dict):
return {
_byteify(key): _byteify(value) for key, value in iteritems(data)
}
# If it's anything else, return it in its original form
return data
w = r.json(object_hook=_byteify)
print(w)
Returns:
{'three': '', 'key': 'value', 'one': 'two'}

I built this recursive caster. It works for my needs and I think it's relatively complete.
def _parseJSON(self, obj):
    newobj = {}
    for key, value in obj.iteritems():
        key = str(key)
        if isinstance(value, dict):
            newobj[key] = self._parseJSON(value)
        elif isinstance(value, list):
            if key not in newobj:
                newobj[key] = []
                for i in value:
                    newobj[key].append(self._parseJSON(i))
        elif isinstance(value, unicode):
            val = str(value)
            if val.isdigit():
                val = int(val)
            else:
                try:
                    val = float(val)
                except ValueError:
                    val = str(val)
            newobj[key] = val
    return newobj
Just pass it a JSON object like so:
obj = json.loads(content, parse_float=float, parse_int=int)
obj = _parseJSON(obj)
I have it as a private member of a class, but you can repurpose the method as you see fit.

I rewrote Wells's _parse_json() to handle cases where the json object itself is an array (my use case).
def _parseJSON(self, obj):
    if isinstance(obj, dict):
        newobj = {}
        for key, value in obj.iteritems():
            key = str(key)
            newobj[key] = self._parseJSON(value)
    elif isinstance(obj, list):
        newobj = []
        for value in obj:
            newobj.append(self._parseJSON(value))
    elif isinstance(obj, unicode):
        newobj = str(obj)
    else:
        newobj = obj
    return newobj

Here is a recursive encoder written in C:
https://github.com/axiros/nested_encode
The performance overhead for "average" structures is around 10% compared to json.loads().
python speed.py
json loads [0.16sec]: {u'a': [{u'b': [[1, 2, [u'\xd6ster..
json loads + encoding [0.18sec]: {'a': [{'b': [[1, 2, ['\xc3\x96ster.
time overhead in percent: 9%
using this teststructure:
import json, nested_encode, time
s = """
{
"firstName": "Jos\\u0301",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "\\u00d6sterreich",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
}
],
"children": [],
"spouse": null,
"a": [{"b": [[1, 2, ["\\u00d6sterreich"]]]}]
}
"""
t1 = time.time()
for i in xrange(10000):
    u = json.loads(s)
dt_json = time.time() - t1
t1 = time.time()
for i in xrange(10000):
    b = nested_encode.encode_nested(json.loads(s))
dt_json_enc = time.time() - t1
print "json loads [%.2fsec]: %s..." % (dt_json, str(u)[:20])
print "json loads + encoding [%.2fsec]: %s..." % (dt_json_enc, str(b)[:20])
print "time overhead in percent: %i%%" % (100 * (dt_json_enc - dt_json)/dt_json)

With Python 3.6, sometimes I still run into this problem. For example, when getting a response from a REST API and loading the response text to JSON, I still get the Unicode strings.
Found a simple solution using json.dumps().
response_message = json.loads(json.dumps(response.text))
print(response_message)

I ran into this problem too, and having to deal with JSON, I came up with a small loop that converts the Unicode keys to strings. (simplejson on GAE does not return string keys.)
obj is the object decoded from JSON:
if NAME_CLASS_MAP.has_key(cls):
    kwargs = {}
    for i in obj.keys():
        kwargs[str(i)] = obj[i]
    o = NAME_CLASS_MAP[cls](**kwargs)
    o.save()
kwargs is what I pass to the constructor of the GAE application (which does not like Unicode keys in **kwargs).
It is not as robust as the solution from Wells, but much smaller.

I've adapted the code from Mark Amery's answer, mainly to get rid of isinstance in favor of duck typing.
The encoding is done manually and ensure_ascii is disabled. The Python documentation for json.dump says that:
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences
Disclaimer: in the doctest I used the Hungarian language. Some notable Hungarian-related character encodings are: cp852, the IBM/OEM encoding used e.g. in DOS (sometimes referred to as ASCII, incorrectly I think, as it depends on the code page setting); Windows-1250, used e.g. in Windows (sometimes referred to as ANSI, depending on the locale settings); and ISO 8859-1, sometimes used on HTTP servers.
The test text Tüskéshátú kígyóbűvölő is attributed to Koltai László (native personal name form) and is from Wikipedia.
# coding: utf-8
"""
This file should be encoded correctly with utf-8.
"""
import json

def encode_items(input, encoding='utf-8'):
    u"""original from: https://stackoverflow.com/a/13101776/611007
    adapted by SO/u/611007 (20150623)
    >>>
    >>> ## run this with `python -m doctest <this file>.py` from command line
    >>>
    >>> txt = u"Tüskéshátú kígyóbűvölő"
    >>> txt2 = u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"
    >>> txt3 = u"uúuutifu"
    >>> txt4 = b'u\\xfauutifu'
    >>> # txt4 shouldn't be 'u\\xc3\\xbauutifu', string content needs double backslash for doctest:
    >>> assert u'\\u0102' not in b'u\\xfauutifu'.decode('cp1250')
    >>> txt4u = txt4.decode('cp1250')
    >>> assert txt4u == u'u\\xfauutifu', repr(txt4u)
    >>> txt5 = b"u\\xc3\\xbauutifu"
    >>> txt5u = txt5.decode('utf-8')
    >>> txt6 = u"u\\u251c\\u2551uutifu"
    >>> there_and_back_again = lambda t: encode_items(t, encoding='utf-8').decode('utf-8')
    >>> assert txt == there_and_back_again(txt)
    >>> assert txt == there_and_back_again(txt2)
    >>> assert txt3 == there_and_back_again(txt3)
    >>> assert txt3.encode('cp852') == there_and_back_again(txt4u).encode('cp852')
    >>> assert txt3 == txt4u,(txt3,txt4u)
    >>> assert txt3 == there_and_back_again(txt5)
    >>> assert txt3 == there_and_back_again(txt5u)
    >>> assert txt3 == there_and_back_again(txt4u)
    >>> assert txt3.encode('cp1250') == encode_items(txt4, encoding='utf-8')
    >>> assert txt3.encode('utf-8') == encode_items(txt5, encoding='utf-8')
    >>> assert txt2.encode('utf-8') == encode_items(txt, encoding='utf-8')
    >>> assert {'a':txt2.encode('utf-8')} == encode_items({'a':txt}, encoding='utf-8')
    >>> assert [txt2.encode('utf-8')] == encode_items([txt], encoding='utf-8')
    >>> assert [[txt2.encode('utf-8')]] == encode_items([[txt]], encoding='utf-8')
    >>> assert [{'a':txt2.encode('utf-8')}] == encode_items([{'a':txt}], encoding='utf-8')
    >>> assert {'b':{'a':txt2.encode('utf-8')}} == encode_items({'b':{'a':txt}}, encoding='utf-8')
    """
    try:
        input.iteritems
        return {encode_items(k): encode_items(v) for (k,v) in input.iteritems()}
    except AttributeError:
        if isinstance(input, unicode):
            return input.encode(encoding)
        elif isinstance(input, str):
            return input
        try:
            iter(input)
            return [encode_items(e) for e in input]
        except TypeError:
            return input

def alt_dumps(obj, **kwargs):
    """
    >>> alt_dumps({'a': u"T\\u00fcsk\\u00e9sh\\u00e1t\\u00fa k\\u00edgy\\u00f3b\\u0171v\\u00f6l\\u0151"})
    '{"a": "T\\xc3\\xbcsk\\xc3\\xa9sh\\xc3\\xa1t\\xc3\\xba k\\xc3\\xadgy\\xc3\\xb3b\\xc5\\xb1v\\xc3\\xb6l\\xc5\\x91"}'
    """
    if 'ensure_ascii' in kwargs:
        del kwargs['ensure_ascii']
    return json.dumps(encode_items(obj), ensure_ascii=False, **kwargs)
I'd also like to highlight the answer of Jarret Hardie which references the JSON specification, quoting:
A string is a collection of zero or more Unicode characters
In my use case, I had files with JSON content. They are UTF-8 encoded files. ensure_ascii results in properly escaped, but not very readable JSON files, and that is why I've adapted Mark Amery's answer to fit my needs.
The doctest is not particularly thoughtful, but I share the code in the hope that it will be useful for someone.
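To see the readability difference that ensure_ascii makes, here is a minimal sketch for Python 3 (where str is already Unicode, so no manual encoding pass is needed). It reuses the test text from the doctest; the variable names are only illustrative.

```python
import json

data = {"a": "Tüskéshátú kígyóbűvölő"}

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False keeps the characters as they are.
readable = json.dumps(data, ensure_ascii=False)

print(escaped)
print(readable)

# Both forms decode back to the same object.
same_after_decoding = json.loads(escaped) == json.loads(readable) == data

# When the readable form is written to a file, encode it explicitly as UTF-8.
utf8_bytes = readable.encode("utf-8")
```

Either form is valid JSON; the choice only affects how the file looks to a human reader and which encoding the file must be saved in.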

Check out this answer to a similar question, which states:
The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.
For example, try this:
print mail_accounts[0]["i"]
You won't see a u.
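The difference between the repr of a string and its printed form can be checked with a small sketch. The mail_accounts data below is made up to match the shape used in the quote.

```python
import json

# Made-up sample data shaped like the mail_accounts list in the quote.
mail_accounts = json.loads('[{"i": "imap.example.com"}]')

value = mail_accounts[0]["i"]

# repr() is what the interpreter shows when echoing a container:
# Python 2 prints u'imap.example.com', Python 3 prints 'imap.example.com'.
print(repr(value))

# Printing the string itself shows only its content, with no prefix or quotes.
print(value)
```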
