I'll be receiving a JSON encoded string from Objective-C, and I am decoding a dummy string (for now) like the code below. My output comes out with character 'u' prefixing each item:
[{u'i': u'imap.gmail.com', u'p': u'aaaa'}, {u'i': u'333imap.com', u'p': u'bbbb'}...
How is JSON adding this Unicode character? What's the best way to remove it?
import json
import sys

mail_accounts = []
da = {}
try:
    s = '[{"i":"imap.gmail.com","p":"aaaa"},{"i":"imap.aol.com","p":"bbbb"},{"i":"333imap.com","p":"ccccc"},{"i":"444ap.gmail.com","p":"ddddd"},{"i":"555imap.gmail.com","p":"eee"}]'
    jdata = json.loads(s)
    for d in jdata:
        for key, value in d.iteritems():
            if key not in da:
                da[key] = value
            else:
                da = {}
                da[key] = value
        mail_accounts.append(da)
except Exception, err:
    sys.stderr.write('Exception Error: %s' % str(err))
print mail_accounts
The u- prefix just means that you have a Unicode string. When you really use the string, it won't appear in your data. Don't be thrown by the printed output.
For example, try this:
print mail_accounts[0]["i"]
You won't see a u.
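The difference is between a string's repr (what you see when a whole list or dict is printed) and the string itself. A minimal sketch of that distinction follows; the sample payload is made up for illustration, and the u prefix only shows up when this runs on Python 2.

```python
import json

# Hypothetical sample payload, shaped like the one in the question.
accounts = json.loads('[{"i": "imap.gmail.com", "p": "aaaa"}]')

host = accounts[0]["i"]

# Printing a container shows the repr of each element: the quotes are
# included, and on Python 2 so is the u prefix.
print(repr(host))

# Printing the string itself shows only its characters.
print(host)
```

The value stored in host is the same in both cases; only the way it is displayed changes.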
Everything is cool, man. The 'u' is a good thing: it indicates that the string is of type unicode in Python 2.x.
http://docs.python.org/2/howto/unicode.html#the-unicode-type
The d3 print below is the one you are looking for (which is the combination of dumps and loads) :)
Having:
import json
d = """{"Aa": 1, "BB": "blabla", "cc": "False"}"""
d1 = json.loads(d) # Produces a dictionary out of the given string
d2 = json.dumps(d) # Produces a string out of a given dict or string
d3 = json.dumps(json.loads(d)) # 'dumps' gets the dict from 'loads' this time
print "d1: " + str(d1)
print "d2: " + d2
print "d3: " + d3
Prints:
d1: {u'Aa': 1, u'cc': u'False', u'BB': u'blabla'}
d2: "{\"Aa\": 1, \"BB\": \"blabla\", \"cc\": \"False\"}"
d3: {"Aa": 1, "cc": "False", "BB": "blabla"}
Those 'u' prefixes signify that the strings in the object are of type unicode.
If you want to remove the 'u' prefixes from your object, you can do this:
import json, ast
jdata = ast.literal_eval(json.dumps(jdata)) # Removing uni-code chars
Let's check it in the Python shell:
>>> import json, ast
>>> jdata = [{u'i': u'imap.gmail.com', u'p': u'aaaa'}, {u'i': u'333imap.com', u'p': u'bbbb'}]
>>> jdata = ast.literal_eval(json.dumps(jdata))
>>> jdata
[{'i': 'imap.gmail.com', 'p': 'aaaa'}, {'i': '333imap.com', 'p': 'bbbb'}]
Unicode is an appropriate type here. The JSONDecoder documentation describes the conversion table and states that JSON string objects are decoded into unicode objects.
From 18.2.2. Encoders and Decoders:
JSON              Python
==================================
object            dict
array             list
string            unicode
number (int)      int, long
number (real)     float
true              True
false             False
null              None
"encoding determines the encoding used to interpret any str objects decoded by this instance (UTF-8 by default)."
The u prefix means that those strings are unicode rather than 8-bit strings. The best way to not show the u prefix is to switch to Python 3, where strings are unicode by default. If that's not an option, the str constructor will convert from unicode to 8-bit, so simply loop recursively over the result and convert unicode to str. However, it is probably best just to leave the strings as unicode.
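The recursive conversion mentioned above could look roughly like this. It is a sketch under assumptions: the helper name encode_strings is hypothetical, and it only walks dicts and lists, which are the containers json.loads produces. On Python 2 the encoded values are plain 8-bit str objects; on Python 3 they are bytes, which is rarely what you want there.

```python
def encode_strings(obj, encoding="utf-8"):
    """Return a copy of obj with every text string encoded (hypothetical helper)."""
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    if isinstance(obj, text_type):
        return obj.encode(encoding)
    if isinstance(obj, dict):
        return {encode_strings(k, encoding): encode_strings(v, encoding)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [encode_strings(item, encoding) for item in obj]
    return obj  # numbers, booleans and None pass through unchanged


data = [{u"i": u"imap.gmail.com", u"p": u"aaaa"}]
converted = encode_strings(data)
```

Because this copies the whole structure, it is only worth doing when some downstream code really requires 8-bit strings.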
I kept running into this problem when trying to capture JSON data in the log with the Python logging library, for debugging and troubleshooting purposes. Getting the u character is a real nuisance when you want to copy the text and paste it into your code somewhere.
As everyone will tell you, this is because it is a Unicode representation, and it could come from the fact that you’ve used json.loads() to load in the data from a string in the first place.
If you want the JSON representation in the log, without the u prefix, the trick is to use json.dumps() before logging it out. For example:
import json
import logging
# Prepare the data
json_data = json.loads('{"key": "value"}')
# Log normally and get the Unicode indicator
logging.warning('data: {}'.format(json_data))
>>> WARNING:root:data: {u'key': u'value'}
# Dump to a string before logging and get clean output!
logging.warning('data: {}'.format(json.dumps(json_data)))
>>> WARNING:root:data: {"key": "value"}
Try this:
mail_accounts[0]["i"].encode("ascii")
Just replace the u' with a single quote...
print(str(mail_accounts).replace("u'", "'"))
I'm trying to parse a JSON string with
json.loads(json_string)
but it returns a string instead of a dict. I can get the expected result by parsing it again
json.loads(json.loads(json_string))
but I don't understand why.
I receive a bytes object from a webhook:
bytes_object = b'"{\\"action\\":\\"connection_test\\",\\"data\\":{}}"'
The bytes object is then utf-8 decoded:
decoded_bytes = bytes_object.decode('utf-8')
"{\"action\":\"connection_test\",\"data\":{}}"
Then, the utf-8 decoded object is parsed using json.loads:
parsed_once = json.loads(decoded_bytes)
But this doesn't return a dict, but a string object looking like this:
{"action":"connection_test","data":{}}
of type <class 'str'>.
But if I parse it again I get the dict expected from the first try:
parsed_twice = json.loads(parsed_once)
{'action': 'connection_test', 'data': {}}
of type <class 'dict'>.
I suspect it's something about how Python 3.9 handles JSON escaping, but I'm not sure. Any help?
The JSON is double encoded, so it needs to be double-decoded. It went something like this:
>>> import json
>>> data = {'action': 'connection_test', 'data': {}}
>>> a = json.dumps(data)
>>> print(a)
{"action": "connection_test", "data": {}}
>>> b = json.dumps(a)
>>> print(b)
"{\"action\": \"connection_test\", \"data\": {}}"
That's a mistake that needs to be rectified on the producer side. As long as the producer gives you this double encoded JSON, you need to double decode it.
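Until the producer is fixed, the double decoding can be wrapped in a small helper. This is a sketch, not a standard API: the name loads_until_structured and the max_depth guard are assumptions made for illustration. Passing bytes straight to json.loads requires Python 3.6 or later.

```python
import json


def loads_until_structured(payload, max_depth=5):
    """Decode JSON repeatedly while the result is still a string (hypothetical helper)."""
    result = payload
    for _ in range(max_depth):
        if not isinstance(result, (str, bytes)):
            break
        result = json.loads(result)
    return result


# The webhook body from the question: a JSON string that contains JSON.
raw = b'"{\\"action\\":\\"connection_test\\",\\"data\\":{}}"'
parsed = loads_until_structured(raw)
print(type(parsed), parsed)
```

One caveat: a payload that is legitimately a plain JSON string, such as "hello", makes the second json.loads call raise, so this only suits producers that are known to double encode.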
I have a string which includes encoded bytes inside it:
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I want to decode it, but I can't since it has become a string. Therefore I want to ask whether there is any way I can convert it into
str2 = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
Here str2 is a bytes object which I can decode easily using
str2.decode('utf-8')
to get the final result:
'Output file 문항분석.xlsx Created'
You could use ast.literal_eval:
>>> print(str1)
b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
>>> type(str1)
<class 'str'>
>>> from ast import literal_eval
>>> literal_eval(str1).decode('utf-8')
'Output file 문항분석.xlsx Created'
Based on the SyntaxError mentioned in your comments, you may have a testing issue when printing: stdout is probably set to ASCII in your console, and the console may not support some of the characters you are trying to print. You can try the following to set sys.stdout to UTF-8 and see what your console will print. It uses a string slice and encode to get the bytes, rather than the ast.literal_eval approach that has already been suggested:
import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
s = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
# Latin-1 maps each character back to the byte with the same value
b = s[2:-1].encode('latin-1').decode('utf-8')
print(b)
A simple way is to assume that all the characters of the initial strings are in the [0,256) range and map to the same Unicode value, which means that it is a Latin1 encoded string.
The conversion is then trivial:
str1[2:-1].encode('Latin1').decode('utf8')
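Which approach works depends on what the string really contains: literal backslash escapes (as the printed output in the literal_eval answer suggests) or characters in the U+0080..U+00FF range (as the string literal in the question produces). A hedged sketch that handles both cases follows; the helper name decode_bytes_repr is an assumption, not an existing function.

```python
from ast import literal_eval


def decode_bytes_repr(text, encoding="utf-8"):
    """Turn the text of a bytes repr back into the string those bytes encode."""
    try:
        # Works when the text holds literal backslash escapes.
        return literal_eval(text).decode(encoding)
    except SyntaxError:
        # The escapes were already turned into characters below U+0100,
        # so map each character back to the byte with the same value.
        return text[2:-1].encode("latin-1").decode(encoding)


# Raw string: the backslashes are kept as literal characters.
with_escapes = r"b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
# Normal string: Python converts each \xNN into a single character.
with_chars = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"

print(decode_bytes_repr(with_escapes))
print(decode_bytes_repr(with_chars))
```

The fallback relies on literal_eval rejecting non-ASCII characters inside a bytes literal with a SyntaxError, which is the error mentioned in the comments.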
Finally, I found an answer that uses a function to cast a string to bytes without encoding it. Given the string
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I take only the encoded text inside it
str1[2:-1]
and pass it to a function that converts the string to bytes without encoding its values:
import struct

def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num <= 0xFF:
            # fits in a single byte
            outlist.append(struct.pack('B', num))
        elif num <= 0xFFFF:
            # fits in two bytes, big-endian
            outlist.append(struct.pack('>H', num))
        else:
            # three bytes: high byte, then the low 16 bits
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>bH', b, H))
    return b''.join(outlist)
So, calling the function would convert it to bytes which then is decoded
rawbytes(str1[2:-1]).decode('utf-8')
will give the correct output
'Output file 문항분석.xlsx Created'
I want to parse a bytes string in JSON format to convert it into python objects. This is the source I have:
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
And this is the desired outcome I want to have:
[{
    "Date": "2016-05-21T21:35:40Z",
    "CreationDate": "2012-05-05",
    "LogoType": "png",
    "Ref": 164611595,
    "Classes": [
        "Email addresses",
        "Passwords"
    ],
    "Link": "http://some_link.com"}]
First, I converted the bytes to string:
my_new_string_value = my_bytes_value.decode("utf-8")
but when I try to invoke loads to parse it as JSON:
my_json = json.loads(my_new_string_value)
I get this error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 174 (char 173)
Your bytes object is almost JSON, but it uses single quotes instead of double quotes, and it needs to be a string. So one way to fix it is to decode the bytes to str and replace the quotes; this only works as long as the data itself contains no quote characters. Another option is to use ast.literal_eval; see below for details. If you want to print the result or save it to a file as valid JSON, you can load the JSON to a Python list and then dump it out. E.g.,
import json
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
# Decode UTF-8 bytes to Unicode, and convert single quotes
# to double quotes to make it valid JSON
my_json = my_bytes_value.decode('utf8').replace("'", '"')
print(my_json)
print('- ' * 20)
# Load the JSON to a Python list & dump it back out as formatted JSON
data = json.loads(my_json)
s = json.dumps(data, indent=4, sort_keys=True)
print(s)
output
[{"Date": "2016-05-21T21:35:40Z", "CreationDate": "2012-05-05", "LogoType": "png", "Ref": 164611595, "Classe": ["Email addresses", "Passwords"],"Link":"http://some_link.com"}]
- - - - - - - - - - - - - - - - - - - -
[
    {
        "Classe": [
            "Email addresses",
            "Passwords"
        ],
        "CreationDate": "2012-05-05",
        "Date": "2016-05-21T21:35:40Z",
        "Link": "http://some_link.com",
        "LogoType": "png",
        "Ref": 164611595
    }
]
As Antti Haapala mentions in the comments, we can use ast.literal_eval to convert my_bytes_value to a Python list, once we've decoded it to a string.
from ast import literal_eval
import json
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
data = literal_eval(my_bytes_value.decode('utf8'))
print(data)
print('- ' * 20)
s = json.dumps(data, indent=4, sort_keys=True)
print(s)
Generally, this problem arises because someone has saved data by printing its Python repr instead of using the json module to create proper JSON data. If it's possible, it's better to fix that problem so that proper JSON data is created in the first place.
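To make the trade-offs concrete, here is a small sketch with made-up data (the record below is an assumption, chosen because it contains an apostrophe). It shows quote replacement failing, literal_eval recovering the data, and proper JSON round-tripping once the producer is fixed.

```python
import json
from ast import literal_eval

# Hypothetical record containing an apostrophe.
record = [{"name": "O'Brien", "ref": 164611595}]

# What a producer emits when it writes repr() instead of JSON.
as_repr = repr(record).encode("utf-8")

# Quote replacement corrupts the apostrophe inside the value.
try:
    json.loads(as_repr.decode("utf-8").replace("'", '"'))
    replace_worked = True
except json.JSONDecodeError:
    replace_worked = False

# literal_eval parses the Python literal as written.
recovered = literal_eval(as_repr.decode("utf-8"))

# Fixing the producer: emit real JSON, which round-trips without tricks.
as_json = json.dumps(record).encode("utf-8")
round_tripped = json.loads(as_json.decode("utf-8"))

print(replace_worked, recovered == record, round_tripped == record)
```

literal_eval only evaluates Python literals, so it is safe to use on untrusted text in a way that eval is not, but it still accepts things that are not JSON, such as tuples and sets.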
If the bytes already contain valid JSON, you can simply pass them to json.loads (it accepts bytes as of Python 3.6):
import json
json.loads(my_bytes_value)
Python 3.5 + Use io module
import json
import io
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
fix_bytes_value = my_bytes_value.replace(b"'", b'"')
my_json = json.load(io.BytesIO(fix_bytes_value))
d = json.dumps(byte_str.decode('utf-8'))
To convert this bytes object to JSON, first convert it to a string with decode() (UTF-8 is the default), then change the quotation marks. The last step is to strip the surrounding " from the dumped string, so that the result reads as a list rather than a string.
dumps(s.decode()).replace("'", '"')[1:-1]
Better solution is:
import json
byte_array_example = b'{"text": "\u0627\u06CC\u0646 \u06CC\u06A9 \u0645\u062A\u0646 \u062A\u0633\u062A\u06CC \u0641\u0627\u0631\u0633\u06CC \u0627\u0633\u062A."}'
res = json.loads(byte_array_example.decode('unicode_escape'))
print(res)
result:
{'text': 'این یک متن تستی فارسی است.'}
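Decoding as UTF-8 leaves the \uXXXX escape sequences in the text as they are; decoding with unicode_escape converts them to characters before parsing. Note that json.loads also resolves \uXXXX escapes itself, so decoding as UTF-8 and then calling json.loads gives the same result here.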
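It may be worth checking how the two decodings behave on other inputs. The sketch below uses shortened, made-up payloads: the first shows json.loads resolving \uXXXX escapes after a plain UTF-8 decode, and the second shows unicode_escape breaking JSON that contains an escaped quote.

```python
import json

# Raw bytes literals, so the backslashes reach the parser unchanged.
payload = rb'{"text": "\u0627\u06CC\u0646"}'

# json.loads resolves \uXXXX escapes on its own after a UTF-8 decode.
res = json.loads(payload.decode("utf-8"))

# unicode_escape also unescapes \" and so breaks otherwise valid JSON.
quoted = rb'{"q": "say \"hi\""}'
try:
    json.loads(quoted.decode("unicode_escape"))
    escape_ok = True
except json.JSONDecodeError:
    escape_ok = False

# The same payload parses fine when decoded as UTF-8.
plain_ok = json.loads(quoted.decode("utf-8")) == {"q": 'say "hi"'}

print(res, escape_ok, plain_ok)
```

For payloads like these, decoding as UTF-8 and letting json.loads handle the escapes looks like the safer default.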
It is OK
If you have a bytes object and want to store it in a JSON file, you should first decode it, because JSON has only a few data types and raw byte data isn't one of them. It has arrays, numbers, strings, objects, booleans, and null.
To decode a byte object you first have to know its encoding. For this, you can use
import chardet
encoding = chardet.detect(your_byte_object)['encoding']
then you can save this object to your json file like this
data = {"data": your_byte_object.decode(encoding)}
with open('request.txt', 'w') as file:
    json.dump(data, file)
The most simple solution is to use the json function that comes with http request.
For example:
I am converting a CSV to dict, all the values are loaded correctly but with one issue.
CSV :
Testing testing\nwe are into testing mode
My\nServer This is my server.
When I convert the CSV to dict and if I try to use dict.get() method it is returning None.
When I debug, I get the following output:
{'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
The My\nServer key is having an extra backslash.
If I do .get("My\nServer"), I am getting the output as None.
Can anyone help me?
#!/usr/bin/env python
import os
import codecs
import json
from csv import reader

def get_dict(path):
    with codecs.open(path, 'r', 'utf-8') as msgfile:
        data = msgfile.read()
        data = reader([r.encode('utf-8') for r in data.splitlines()])
        newdata = []
        for row in data:
            newrow = []
            for val in row:
                newrow.append(unicode(val, 'utf-8'))
            newdata.append(newrow)
    return dict(newdata)
thanks
You either need to escape the newline properly, using \\n:
>>> d = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> d.get('My\\nServer')
'This is my server.'
or you can use a raw string literal which doesn't need extra escaping:
>>> d.get(r'My\nServer')
'This is my server.'
Note that raw string will treat all the backslash escape sequences this way, not just the newline \n.
In case you are getting the values dynamically, you can use str.encode with string_escape or unicode_escape encoding:
>>> k = 'My\nServer' # API call result
>>> k.encode('string_escape')
'My\\nServer'
>>> d.get(k.encode('string_escape'))
'This is my server.'
"\n" is newline.
If you want to represent text like "---\n---" in Python without a newline in it, you have to escape the backslash.
The way you write a string in code and the way it is printed differ: in code you have to write "\\" (unless you use a raw string); when printed, the extra backslash is not shown.
So in your code, you shall ask:
>>> dct = {'Testing': 'testing\\nwe are into testing mode', 'My\\nServer': 'This is my server.'}
>>> dct.get("My\\nServer")
'This is my server.'
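The answers above use Python 2; the string_escape codec no longer exists in Python 3. A rough Python 3 sketch of the same idea follows, using a made-up dictionary shaped like the one in the question. It shows that the two spellings are different strings, and one possible alternative: normalizing the keys once after reading the CSV.

```python
# Key as read from the CSV: a backslash followed by "n" (two characters).
d = {"My\\nServer": "This is my server."}

newline_key = "My\nServer"    # contains one real newline character
literal_key = r"My\nServer"   # contains backslash + n, like the CSV text

missing = d.get(newline_key)  # no such key, so this is None
found = d.get(literal_key)

# Alternative: normalize the keys once, so later lookups can use a real newline.
normalized = {k.replace("\\n", "\n"): v for k, v in d.items()}
found_after_normalizing = normalized.get(newline_key)

print(len(newline_key), len(literal_key), missing, found)
```

Whether to normalize or to look up with the escaped form depends on whether the rest of the program expects real newlines in the data.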
I have two python dictionaries containing information about japanese words and characters:
vocabDic : contains vocabulary, key: word, value: dictionary with information about it
kanjiDic : contains kanji ( single japanese character ), key: kanji, value: dictionary with information about it
Now I would like to iterate through each character of each word in the vocabDic and look up this character in the kanji dictionary. My goal is to create a csv file which I can then import into a database as join table for vocabulary and kanji.
My Python version is 2.6
My code is as following:
import csv

kanjiVocabJoinWriter = csv.writer(open('kanjiVocabJoin.csv', 'wb'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
kanjiVocabJoinCount = 1
# loop through dictionary
for key, val in vocabDic.iteritems():
    if val['lang'] == 'jpn':  # only check japanese words (compare with ==, not "is")
        vocab = val['text']
        print vocab
        # loop through vocab string
        for v in vocab:
            test = kanjiDic.get(v)
            print v
            print test
            if test is not None:
                print str(kanjiVocabJoinCount)+','+str(test['id'])+','+str(val['id'])
                # csv writer objects are not callable; use writerow()
                kanjiVocabJoinWriter.writerow([str(kanjiVocabJoinCount),str(test['id']),str(val['id'])])
                kanjiVocabJoinCount = kanjiVocabJoinCount+1
If I print the variables to the command line, I get:
vocab : works, prints in japanese
v ( one character of the vocab in the for loop ) : �
test ( character looked up in the kanjiDic ) : None
To me it seems like the for loop messes the encoding up.
I tried various functions ( decode, encode.. ) but no luck so far.
Any ideas on how I could get this working?
Help would be very much appreciated.
From your description of the problem, it sounds like vocab is an encoded str object, not a unicode object.
For concreteness, suppose vocab equals u'債務の天井' encoded in utf-8:
In [42]: v=u'債務の天井'
In [43]: vocab=v.encode('utf-8') # val['text']
Out[43]: '\xe5\x82\xb5\xe5\x8b\x99\xe3\x81\xae\xe5\xa4\xa9\xe4\xba\x95'
If you loop over the encoded str object, you get one byte at a time: \xe5, then \x82, then \xb5, etc.
However if you loop over the unicode object, you'd get one unicode character at a time:
In [45]: for v in u'債務の天井':
   ....:     print(v)
債
務
の
天
井
Note that the first unicode character, encoded in utf-8, is 3 bytes:
In [49]: u'債'.encode('utf-8')
Out[49]: '\xe5\x82\xb5'
That's why looping over the bytes, printing one byte at a time, (e.g. print \xe5) fails to print a recognizable character.
So it looks like you need to decode your str objects and work with unicode objects. You didn't mention what encoding you are using for your str objects. If it is utf-8, then you'd decode it like this:
vocab=val['text'].decode('utf-8')
If you are not sure what encoding val['text'] is in, post the output of
print(repr(vocab))
and maybe we can guess the encoding.
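The byte-versus-character distinction can be checked with a few lines. This is a sketch assuming the text is UTF-8; kanji_dic is a hypothetical stand-in for the kanjiDic in the question, with made-up ids.

```python
text = u"債務の天井"
encoded = text.encode("utf-8")

# Five characters, but fifteen bytes: each of these takes three bytes in UTF-8.
char_count = len(text)
byte_count = len(encoded)

# Hypothetical stand-in for kanjiDic, keyed by single characters.
kanji_dic = {u"債": {"id": 1}, u"天": {"id": 2}}

# Decoding first means the loop sees whole characters, so lookups can match.
matches = [kanji_dic[ch]["id"] for ch in encoded.decode("utf-8") if ch in kanji_dic]

print(char_count, byte_count, matches)
```

Looping over encoded instead would yield single bytes, none of which is a key in the dictionary, which matches the None results described in the question.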