json dumps escape characters - python

I'm having some trouble with escape characters and json.dumps.
It seems like extra escape characters are being added whenever json.dumps is called. Example:
not_encoded = {'data': '''!"#$%'()*+,-/:;=?#[\]^_`{|}~0000&<>'''}
print(not_encoded)
{'data': '!"#$%\'()*+,-/:;=?#[\\]^_`{|}~0000&<>'}
This is fine, but when I do a json.dumps it adds a lot of extra escape characters.
json.dumps(not_encoded)
'{"data": "!\\"#$%\'()*+,-/:;=?#[\\\\]^_`{|}~0000&<>"}'
The dump shouldn't look like this. It's double escaping the \ and the ". Anyone know why this is and how to fix it? I would want the json.dumps to output
'{"data": "!\"#$%'()*+,-/:;=?#[\\]^_`{|}~0000&<>"}'
edit
Loading back in the dump:
the_dump = json.dumps(not_encoded)
json.loads(the_dump)
{u'data': u'!"#$%\'()*+,-/:;=?#[\\]^_`{|}~0000&<>'}
The problem is that I'm hitting an API endpoint which needs these special characters, but the payload goes over the character limit when json.dumps adds the additional escape characters (\\\\ and \\").

It is worth reading up on the difference between print, str and repr in Python. You are comparing the printed original string with the repr of the JSON encoding; the latter has double escapes, one from the JSON encoding and one from Python's string representation.
But otherwise there is no issue: if you compare len(not_encoded['data']) with len(json.loads(json.dumps(not_encoded))['data']) you will find they are the same. There are no extra characters, only different ways of displaying them.
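A minimal round-trip check along these lines, using the dictionary from the question:

import json

not_encoded = {'data': '''!"#$%'()*+,-/:;=?#[\]^_`{|}~0000&<>'''}
round_tripped = json.loads(json.dumps(not_encoded))['data']
print(not_encoded['data'] == round_tripped)  # True: nothing was gained or lost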

json.dumps is required to escape " and \ according to the JSON standard. If the API uses JSON, you cannot avoid your data growing in length when it contains these characters.
From json.org: a string is a sequence of zero or more Unicode characters, wrapped in double quotes, using backslash escapes.
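For instance, a small illustration (with made-up sample text, not from the original post) of how this mandatory escaping grows the payload:

import json

s = 'say "hi" to c:\\temp'         # a str containing both " and \
print(json.dumps(s))               # "say \"hi\" to c:\\temp"
print(len(s), len(json.dumps(s)))  # 19 24: two surrounding quotes plus one
                                   # extra backslash per escaped character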

Related

UTF-8 decoding doesn't decode special characters in python

Hi, I have the following data (abstracted) that comes from an API.
"Product" : "T\u00e1bua 21X40"
I'm using the following code to decode the data bytes:
var = json.loads(cleanhtml(str(json.dumps(response.content.decode('utf-8')))))
The cleanhtml is a regex function that I've created to remove HTML tags from the returned data (it works correctly). However, decode('utf-8') is not removing characters like \u00e1. My expected output is:
"Product" : "Tábua 21X40"
I've tried to use replace("\\u00e1", "á") but with no success. How can I replace this type of character and what type of character is this?
\u00e1 is another way of representing the á character when displaying the contents of a Python string.
If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you'll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn't exist in the string as those individual characters.
The \u escape sequence indicates that the following numbers specify a Unicode character.
Attempting to replace \u00e1 with á won't achieve anything, because that's what it already is. Additionally, replace("\\u00e1", "á") attempts to replace the individual characters of a backslash, a u, and so on, and, as mentioned, they don't actually exist in the string in that way.
If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.
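A quick interactive check (Python 3) makes this concrete:

s = "T\u00e1bua 21X40"
print(s)                   # Tábua 21X40
print(s == "Tábua 21X40")  # True: the escape is just another spelling of á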
what type of character is this
Here
"Product" : "T\u00e1bua 21X40"
you might observe the \u escape sequence, which is followed by 4 hex digits: 00e1. Note that this is a different representation of the same character, so
print("\u00e1" == "á")
output
True
Such sequences are often loosely called character entities; there are different escape formats, and this one is a JSON Unicode escape. For demonstration, paste your string into an online JSON unescape tool and click unescape.
If you are using Python, you can decode such a string with the json module:
import json

# Use a raw string so the \u escape reaches the JSON decoder as literal text
string = json.loads(r'"T\u00e1bua 21X40"')
print(string)  # Tábua 21X40

Decoding a byte with latin-1 characters to string with decimal representation

I am working on a migration project to upgrade a web-server layer from Python 2.7.8 to Python 3.6.3, and I have hit a roadblock for some special cases.
When a request is received from a client, the payload is transmitted locally using pyzmq, which in Python 3 deals in bytes instead of str (as it did in Python 2).
Now, the payload which I am receiving is encoded using iso-8859-1 (latin-1) scheme and I can easily convert it into string as payload.decode('latin-1') and pass it to next service (svc-save-entity) which expects string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented in ASCII Character Reference (such as &#233; for é) rather than in Hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any Python expert guide me here? Essentially I need the definition of a function, say decode_tostring():
payload = b'Banco Santander (M\xe9xico)' #payload is in bytes
payload_str = decode_tostring(payload) #function to convert into string
payload_str == 'Banco Santander (M&#233;xico)' #payload_str is a string in ASCII Character Reference
Definition of decode_tostring() please. :)
The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
    # Decode the latin-1 bytes to str, re-encode to ASCII replacing every
    # non-ASCII character with its XML numeric character reference,
    # then decode the pure-ASCII bytes back to str.
    return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
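Applied to the payload from the question:

payload = b'Banco Santander (M\xe9xico)'
print(decode_tostring(payload))  # Banco Santander (M&#233;xico)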
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like &#233; are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.
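To see what such replacements look like, here is a small sketch comparing a few of the standard error handlers (Python 3.5+; outputs shown as comments):

s = 'México'
print(s.encode('ascii', errors='xmlcharrefreplace'))  # b'M&#233;xico'
print(s.encode('ascii', errors='backslashreplace'))   # b'M\\xe9xico'
print(s.encode('ascii', errors='namereplace'))        # b'M\\N{LATIN SMALL LETTER E WITH ACUTE}xico'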

Decoding escaped unicode in Python 3 from a non-ascii string

I have been searching for hours now to find a way to fully reverse the result of a str.encode call like this:
"testäch基er".encode("cp1252", "backslashreplace")
The result is
b'test\xe4ch\\u57faer'
Now I want to convert it back with
b'test\xe4ch\\u57faer'.decode("cp1252")
and I get
'testäch\\u57faer'
So how do I get my 基 back? I get nearly there by using decode("unicode-escape") instead (it would work for this example), but that assumes bytes encoded with iso8859-1, not cp1252, so any bytes between 0x80 and 0x9F would come out wrong.
Well...
>>> b'test\xe4ch\\u57faer'.decode('unicode-escape')
'testäch基er'
But backslashreplace->unicode-escape is not a consistent round trip. If you have backslashes in the original string, they won't get encoded by backslashreplace but they will get decoded by unicode-escape, and replaced with unexpected characters.
>>> '☃ \\u2603'.encode('cp1252', 'backslashreplace').decode('unicode-escape')
'☃ ☃'
There is no way to reliably reverse the encoding of a string that has been encoded with an errors fallback such as backslashreplace. That's why it's a fallback: if you could consistently encode and decode with it, it would have been a real encoding.
I was still very new to Python when I asked this question. Now I understand that these fallback mechanisms are just meant for handling unexpected errors, not something to save and restore data. If you really need a simple and reliable way to encode single unicode characters in ASCII, have a look at the quote and unquote functions from the urllib.parse module.
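A minimal sketch of that approach (the percent-encoded output assumes quote's default UTF-8 encoding):

from urllib.parse import quote, unquote

s = 'testäch基er'
q = quote(s)            # percent-encodes the UTF-8 bytes of non-safe characters
print(q)                # test%C3%A4ch%E5%9F%BAer
print(unquote(q) == s)  # True: this round trip is reliable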

When Python loads JSON, how can I convert str to unicode so I can print Chinese characters?

I got a JSON file like this:
{
    "errNum": 0,
    "retData": {
        "city": "武汉"
    }
}
import json
content = json.loads(result)  # result holds the JSON text from the file
cityname = content['retData']['city']
print cityname
After that, I got the output: \u6b66\u6c49
I know it's the Unicode-escaped form of the Chinese characters 武汉, but its type is str:
isinstance(cityname, str) is True.
So how can I convert this str to unicode so that the output will be 武汉?
I have also tried these:
>>> u'\u6b66\u6c49'
u'\u6b66\u6c49'
>>> print u'\u6b66\u6c49'
武汉
>>> print '\u6b66\u6c49'.decode()
\u6b66\u6c49
>>> print '\u6b66\u6c49'
\u6b66\u6c49
I have searched for material about ASCII, Unicode and UTF-8, and about encode and decode, but I still cannot make sense of it. It is driving me crazy!
I need some help, thanks!
Perhaps this answer comes five years too late, but since I had a similar issue that I was trying to solve while building a preprocessor for the Japanese language, here is the answer I found.
Note that ensure_ascii is an option of json.dumps, not json.loads. When you dump your data back to a JSON string, add the following flag:
json.dumps(content, ensure_ascii=False)
This fixed my issue.
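For example, in a Python 2 session (assuming a UTF-8 terminal):

>>> import json
>>> print json.dumps({'city': u'武汉'})
{"city": "\u6b66\u6c49"}
>>> print json.dumps({'city': u'武汉'}, ensure_ascii=False)
{"city": "武汉"}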
Your json contains escaped unicode characters. You can decode them into actual unicode characters using the unicode_escape codec:
print cityname.decode('unicode_escape')
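For example, in a Python 2 session (assuming, as in the question, that the str holds the literal backslash sequences):

>>> cityname = '\u6b66\u6c49'    # a py2 str: \u is not an escape here, the backslashes are literal
>>> print cityname.decode('unicode_escape')
武汉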
Note that, while this will usually work, depending on the source of the unicode escaping you could have problems with characters outside the Basic Multilingual Plane (U+0000 to U+FFFF). A convenient quote from user bobince, taken from a comment:
Note that ... there are a number of different formats that use \u escapes - Python unicode literals (which unicode-escape handles), Java properties, JavaScript string literals, JSON, and so on. It is important to know which one you are dealing with because they all have slightly different rules about what other escapes are valid. unicode-escape may or may not be a valid way of parsing that data depending on where it comes from.
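One concrete difference, as a Python 2 illustration: unicode-escape accepts \x escapes, while JSON does not:

>>> '\\x41'.decode('unicode_escape')
u'A'
>>> import json
>>> json.loads('"\\x41"')    # raises ValueError: Invalid \escape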

Python unicode file writing

I'm using the twitter Python library to fetch some tweets from a public stream. The library fetches tweets in JSON format and converts them to Python structures. What I'm trying to do is to get the JSON string directly and write it to a file. Inside the twitter library it first reads a network socket and applies .decode('utf8') to the buffer. Then it wraps the info in a Python structure and returns it. I can use jsonEncoder to encode it back to the JSON string and save it to a file. But there is a problem with character encoding, I guess. When I try to print the JSON string it prints fine in the console. But when I try to write it into a file, escape sequences such as \u0627\u0644\u0644\u06be\u064f appear instead.
I tried to open the saved file using different encodings and nothing changed. It is supposed to be UTF-8 encoded, and when I display it those escape sequences should be replaced with the actual characters they represent. Am I missing something here? How can I achieve this?
more info:
I'm using python 2.7
I open the file like this:
json_file = open('test.json', 'w')
I also tried this:
json_file = codecs.open( 'test.json', 'w', 'utf-8' )
Nothing changed. I blindly tried .encode('utf8') and .decode('utf8') on the JSON string and the result is the same. I tried different text editors to view the written text and used the cat command to see the text in the console; the sequences which start with \u still appear.
Update:
I solved the problem. jsonEncoder has an option ensure_ascii
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the results are str instances consisting of ASCII characters only.
I made it False and the problem has gone away.
Well, since you won't post your solution as an answer, I will. This question should not be left showing no answer.
jsonEncoder has an option ensure_ascii.
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the results are str instances consisting of ASCII characters only.
Make it False and the problem will go away.
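Putting it together, a minimal sketch of the fix (Python 2, matching the question; the tweet dict is sample data standing in for what the twitter library returns):

import codecs
import json

# Sample data using the characters from the question; in practice this is
# the structure the twitter library hands back.
tweet = {u'text': u'\u0627\u0644\u0644\u06be\u064f'}

json_file = codecs.open('test.json', 'w', 'utf-8')
# With ensure_ascii=False, dumps emits the actual characters instead of
# \uXXXX escapes, and the codecs file encodes them as UTF-8 on write.
json_file.write(json.dumps(tweet, ensure_ascii=False))
json_file.close()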
