Python3 saving JSON with unicode single quote [duplicate] - python

This question already has answers here:
Saving UTF-8 texts with json.dumps as UTF-8, not as a \u escape sequence
(12 answers)
Closed 1 year ago.
I know this has been asked before on Stackoverflow and on other sites but I cannot seem to be able to save a JSON file using escaped Unicode characters (Python3). I have read a lot of tutorials.
What am I missing? I have tried a lot of things but nothing works. I have also tried encoding/decoding in UTF-8 but I am obviously missing something.
Just to be clear, I have managed to get it working for other characters like й (0439) but I am having trouble with a single quote being encoded..
If I have the following dict:
import json
data = {"key": "Test \u0027TEXT\u0027 around"}
I want to save it exactly as it is in a new JSON file, but no matter what I do it always ends up as a single character, which is what is encoded in Unicode.
The following 2 blocks print the exact same thing: {"key": "Test 'TEXT' around"}.
print(json.dumps(data))
print(json.dumps(data, ensure_ascii=False))
Is there any way to keep the Unicode string literal? I want to have that very string as a value: "Test \u0027TEXT\u0027 around"

The behavior you are describing has nothing to do with JSON. This is simply how Python 3 handles strings. Open the shell and write:
>>> "Test \u0027TEXT\u0027 around"
"Test 'TEXT' around"
If you do not want Python to interpret the special characters, you should use raw strings (or maybe even byte sequences):
>>> r"Test \u0027TEXT\u0027 around"
'Test \\u0027TEXT\\u0027 around'
Reference:
https://docs.python.org/2.0/ref/strings.html
https://docs.python.org/3/library/stdtypes.html#binaryseq

Related

UTF-8 decoding doesn't decode special characters in python

Hi I have the following data (abstracted) that comes from an API.
"Product" : "T\u00e1bua 21X40"
I'm using the following code to decode the data byte:
var = json.loads(cleanhtml(str(json.dumps(response.content.decode('utf-8')))))
The cleanhtml is a regex function that I've created to remove html tags from the returned data (It's working correctly). Although, decode(utf-8) is not removing characters like \u00e1. My expected output is:
"Product" : "Tábua 21X40"
I've tried to use replace("\\u00e1", "á") but with no success. How can I replace this type of character and what type of character is this?
\u00e1 is another way of representing the á character when displaying the contents of a Python string.
If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you'll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn't exist in the string as those individual characters.
The \u escape sequence indicates that the following numbers specify a Unicode character.
Attempting to replace \u00e1 with á won't achieve anything because that's what it already is. Additionally, replace("\\u00e1", "á") is attempting to replace the individual characters of a slash, a u, etc and, as mentioned, they don't actually exist in the string in that way.
If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.
what type of character is this
Here
"Product" : "T\u00e1bua 21X40"
you might observe \u escape sequence, it is followed by 4 hex digits: 00e1, note that this is different represenation of same character, so
print("\u00e1" == "á")
output
True
These type of characters are called character entities. There are different types of entities and this is JSON entity. For demonstration, enter your string here and click unescape.
For your question, if you are using python then you can solve the issue by importing json module. Then you have to decode it as follows.
import json
string = json.loads('"T\u00e1bua 21X40"')
print(string)

json dumps escape characters

I'm having some trouble with escape characters and json.dumps.
It seems like extra escape characters are being added whenever json.dumps is called. Example:
not_encoded = {'data': '''!"#$%'()*+,-/:;=?#[\]^_`{|}~0000&<>'''}
print(not_encoded)
{'data': '!"#$%\'()*+,-/:;=?#[\\]^_`{|}~0000&<>'}
This is fine, but when I do a json dumps it adds a lot of extra values.
json.dumps(not_encoded)
'{"data": "!\\"#$%\'()*+,-/:;=?#[\\\\]^_`{|}~0000&<>"}'
The dump shouldn't look like this. It's double escaping the \ and the ". Anyone know why this is and how to fix it? I would want the json.dumps to output
'{"data": "!\"#$%'()*+,-/:;=?#[\\]^_`{|}~0000&<>"}'
edit
Loading back in the dump:
the_dump = json.dumps(not_encoded)
json.loads(the_dump)
{u'data': u'!"#$%\'()*+,-/:;=?#[\\]^_`{|}~0000&<>'}
The problem is I'm hitting an API endpoint which needs these special characters, but it goes over character limit when the json.dumps adds additional escape characters (\\\\ and \\").
It is worth reading up on the difference between print, str and repr in python (see here for example). You are comparing the printed original string with a repr of the json encoding, the latter will have double escapes - one from the json encoding and one from python's string representation.
But otherwise there is no issue, if you compare len(not_encoded['data']) with len(json.loads(json.dumps(not_encoded))['data']) you will find they are the same. There are no extra characters, but there are different methods of displaying them.
json.dumps is required to escape " and \ according to the JSON standard. If the API uses JSON you cannot avoid your data to grow in length when using these characters.
From json.org:

when python loads json,how to convert str to unicode ,so I can print Chinese characters?

I got a json file like this:
{
'errNum': 0,
'retData': {
'city': "武汉"
}
}
import json
content = json.loads(result) # supposing json file named result
cityname = content['retData']['city']
print cityname
After that, I got a output : \u6b66\u6c49
I know it's unicode of Chinese character of 武汉 ,but the type of it is str
isinstance(cityname,str) is True.
so how can I convert this str to unicode and output will be 武汉
I also have tried these solutions:
>>> u'\u6b66\u6c49'
u'\u6b66\u6c49'
>>> print u'\u6b66\u6c49'
武汉
>>> print '\u6b66\u6c49'.decode()
\u6b66\u6c49
>>> print '\u6b66\u6c49'
\u6b66\u6c49
Searched something about ascii,unicode and utf-8 ,encode and decode ,but also cannot understand,it is crazy!
I need some help ,Thanks !
Perhaps this answer comes five years too late, but since I had a similar issue that I was trying to solve, while building a preprocessor for the Japanese language, here is the answer I found.
when you loads the result to content add the following flag:
content = json.loads(result, ensure_ascii=False)
This fixed my issue.
Your json contains escaped unicode characters. You can decode them into actual unicode characters using the unicode_escape codec:
print cityname.decode('unicode_escape')
Note that, while this will usually work, depending on the source of the unicode escaping you could have problems with characters outside the Basic Multilingual Plane (U+0 to U+FFFF). A convenient quote from user #bobince that I took from a comment:
Note that ... there are a number of different formats that use \u
escapes - Python unicode literals (which unicode-escape handles), Java
properties, JavaScript string literals, JSON, and so on. It is
important to know which one you are dealing with because they all have
slightly different rules about what other escapes are valid.
unicode-escape may or may not be a valid way of parsing that data
depending on where it comes from.

how to convert u'\uf04a' to unicode in python [duplicate]

This question already has answers here:
Python unicode codepoint to unicode character
(4 answers)
Closed 1 year ago.
I am trying to decode u'\uf04a' in python thus I can print it without error warnings. In other words, I need to convert stupid microsoft Windows 1252 characters to actual unicode
The source of html containing the unusual errors comes from here http://members.lovingfromadistance.com/showthread.php?12338-HAVING-SECOND-THOUGHTS
Read about u'\uf04a' and u'\uf04c' by clicking here http://www.fileformat.info/info/unicode/char/f04a/index.htm
one example looks like this:
"Oh god please some advice ":
Out[408]: u'Oh god please some advice \uf04c'
Given a thread like this as one example for test:
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread.decode('utf8')
print u'\uf04a'
print u'\uf04a'.decode('utf8') # error!!!
'charmap' codec can't encode character u'\uf04a' in position 1526: character maps to undefined
With the help of two Python scripts, I successfully convert the u'\x92', but I am still stuck with u'\uf04a'. Any suggestions?
References
https://github.com/AnthonyBRoberts/NNS/blob/master/tools/killgremlins.py
Handling non-standard American English Characters and Symbols in a CSV, using Python
Solution:
According to the comments below: I replace these character set with the question mark('?')
thread = u'who are you \uf04a Why you are so harsh to her \uf04c'
thread = thread.replace(u'\uf04a', '?')
thread = thread.replace(u'\uf04c', '?')
Hope this helpful to the other beginners.
The notation u'\uf04a' denotes the Unicode codepoint U+F04A, which is by definition a private use codepoint. This means that the Unicode standard does not assign any character to it, and never will; instead, it can be used by private agreements.
It is thus meaningless to talk about printing it. If there is a private agreement on using it in some context, then you print it using a font that has a glyph allocated to that codepoint. Different agreements and different fonts may allocate completely different characters and glyphs to the same codepoint.
It is possible that U+F04A is a result of erroneous processing (e.g., wrong conversions) of character data at some earlier phase.
u'\uf04a'
already is a Unicode object, which means there's nothing to decode. The only thing you can do with it is encode it, if you're targeting a specific file encoding like UTF-8 (which is not the same as Unicode, but is confused with it all the time).
u'\uf04a'.encode("utf-8")
gives you a string (Python 2) or bytes object (Python 3) which you can then write to a file or a UTF-8 terminal etc.
You won't be able to encode it as a plain Windows string because cp1252 doesn't have that character.
What you can do is convert it to an encoding that doesn't have those offending characters by telling the encoder to replace missing characters by ?:
>>> u'who\uf04a why\uf04c'.encode("ascii", errors="replace")
'who? why?'

Python unicode file writing

I'm using twitter python library to fetch some tweets from a public stream. The library fetches tweets in json format and converts them to python structures. What I'm trying to do is to directly get the json string and write it to a file. Inside the twitter library it first reads a network socket and applies .decode('utf8') to the buffer. Then, it wraps the info in a python structure and returns it. I can use jsonEncoder to encode it back to the json string and save it to a file. But there is a problem with character encoding I guess. When I try to print the json string it prints fine in the console. But when I try to write it into a file, some characters appear such as \u0627\u0644\u0644\u06be\u064f
I tried to open the saved file using different encodings and nothing has changed. It suppose to be in utf8 encoding and when I try to display it, those special characters should be replaced with actual characters they represent. Am I missing something here? How can I achieve this?
more info:
I'm using python 2.7
I open the file like this:
json_file = open('test.json', 'w')
I also tried this:
json_file = codecs.open( 'test.json', 'w', 'utf-8' )
nothing has changed. I blindly tried, .encode('utf8'), .decode('utf8') on the json string and the result is the same. I tried different text editors to view the written text, I used cat command to see the text in the console and those characters which start with \u still appear.
Update:
I solved the problem. jsonEncoder has an option ensure_ascii
If ensure_ascii is True (the default), all non-ASCII characters in the
output are escaped with \uXXXX sequences, and the results are str
instances consisting of ASCII characters only.
I made it False and the problem has gone away.
jsonEncoder has an option ensure_ascii
If ensure_ascii is True (the default), all non-ASCII characters in the
output are escaped with \uXXXX sequences, and the results are str
instances consisting of ASCII characters only.
Make it False and the problem will go away.
Well, since you won't post your solution as an answer, I will. This question should not be left showing no answer.
jsonEncoder has an option ensure_ascii.
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the results are str instances consisting of ASCII characters only.
Make it False and the problem will go away.

Categories