Python unicode file writing - python

I'm using twitter python library to fetch some tweets from a public stream. The library fetches tweets in json format and converts them to python structures. What I'm trying to do is to directly get the json string and write it to a file. Inside the twitter library it first reads a network socket and applies .decode('utf8') to the buffer. Then, it wraps the info in a python structure and returns it. I can use jsonEncoder to encode it back to the json string and save it to a file. But there is a problem with character encoding I guess. When I try to print the json string it prints fine in the console. But when I try to write it into a file, some characters appear such as \u0627\u0644\u0644\u06be\u064f
I tried to open the saved file using different encodings and nothing has changed. It suppose to be in utf8 encoding and when I try to display it, those special characters should be replaced with actual characters they represent. Am I missing something here? How can I achieve this?
more info:
I'm using python 2.7
I open the file like this:
json_file = open('test.json', 'w')
I also tried this:
json_file = codecs.open( 'test.json', 'w', 'utf-8' )
nothing has changed. I blindly tried, .encode('utf8'), .decode('utf8') on the json string and the result is the same. I tried different text editors to view the written text, I used cat command to see the text in the console and those characters which start with \u still appear.
Update:
I solved the problem. jsonEncoder has an option ensure_ascii
If ensure_ascii is True (the default), all non-ASCII characters in the
output are escaped with \uXXXX sequences, and the results are str
instances consisting of ASCII characters only.
I made it False and the problem has gone away.

jsonEncoder has an option ensure_ascii
If ensure_ascii is True (the default), all non-ASCII characters in the
output are escaped with \uXXXX sequences, and the results are str
instances consisting of ASCII characters only.
Make it False and the problem will go away.

Well, since you won't post your solution as an answer, I will. This question should not be left showing no answer.
jsonEncoder has an option ensure_ascii.
If ensure_ascii is True (the default), all non-ASCII characters in the output are escaped with \uXXXX sequences, and the results are str instances consisting of ASCII characters only.
Make it False and the problem will go away.

Related

Python 2.7 convert special characters into utf-8 byes

I have strings that I need to replace into an URL for accessing different JSON files. My problem is that some strings have special characters and I need only these as UTF-8 bytes, so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?
You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
try:
yourstring.decode('utf-8')
except UnicodeDecodeError:
yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c

The correct way to load and read JSON file contains special characters in Python

I'm working with a JSON file contains some unknown-encoded strings as the example below:
"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
I have loaded this text by using json.load() function in Python 3.7 environment and tried to encode/decode it with some methods I found around the Internet but I still cannot get the proper string as I expected. (In this case, it has to be Lê Nguyễn Phú).
My question is, which is the encoding method they used and how to parse this text in a proper way in Python?
Because the JSON file comes from an external source that I didn't handle so that I cannot know or make any changes in the process of encoding the text.
[Updated] More details:
The JSON file looks like this:
{
"content":"L\u00c3\u00aa Nguy\u00e1\u00bb\u0085n Ph\u00c3\u00ba"
}
Firstly, I loaded the JSON file:
with open(json_path, 'r') as f:
data = json.load(f)
But when I extract the content, it's not what I expected:
string = data.get('content', '')
print(string)
'Lê Nguyá»\x85n Phú'
Someone took "Lê Nguyễn Phú", encoded that as UTF-8, and then took the resulting series of bytes and lied to a JSON encoder by telling it that those bytes were the characters of a string. The JSON encoder then cooperatively produced garbage by encoding those characters. But it is reversible garbage. You can reverse this process using something like
json.loads(in_string).encode("latin_1").decode("utf_8")
Which decodes the string from the JSON, extracts the bytes from it (the 256 symbols in Latin-1 are in a 1-to-1 correspondence with the first 256 Unicode codepoints), and then re-decodes those bytes as UTF-8.
The big problem with this technique is that it only works if you are sure that all of your input is garbled in this fashion... there's no completely reliable way to look at an input and decide whether it should have this broken decoding applied to it. If you try to apply it to a validly-encoded string containing codepoints above U+00FF, it will crash. But if you try to apply it to a validly-encoding string containing only codepoints up to U+00FF, it will turn your perfectly good string into a different kind of garbage.

json dumps escape characters

I'm having some trouble with escape characters and json.dumps.
It seems like extra escape characters are being added whenever json.dumps is called. Example:
not_encoded = {'data': '''!"#$%'()*+,-/:;=?#[\]^_`{|}~0000&<>'''}
print(not_encoded)
{'data': '!"#$%\'()*+,-/:;=?#[\\]^_`{|}~0000&<>'}
This is fine, but when I do a json dumps it adds a lot of extra values.
json.dumps(not_encoded)
'{"data": "!\\"#$%\'()*+,-/:;=?#[\\\\]^_`{|}~0000&<>"}'
The dump shouldn't look like this. It's double escaping the \ and the ". Anyone know why this is and how to fix it? I would want the json.dumps to output
'{"data": "!\"#$%'()*+,-/:;=?#[\\]^_`{|}~0000&<>"}'
edit
Loading back in the dump:
the_dump = json.dumps(not_encoded)
json.loads(the_dump)
{u'data': u'!"#$%\'()*+,-/:;=?#[\\]^_`{|}~0000&<>'}
The problem is I'm hitting an API endpoint which needs these special characters, but it goes over character limit when the json.dumps adds additional escape characters (\\\\ and \\").
It is worth reading up on the difference between print, str and repr in python (see here for example). You are comparing the printed original string with a repr of the json encoding, the latter will have double escapes - one from the json encoding and one from python's string representation.
But otherwise there is no issue, if you compare len(not_encoded['data']) with len(json.loads(json.dumps(not_encoded))['data']) you will find they are the same. There are no extra characters, but there are different methods of displaying them.
json.dumps is required to escape " and \ according to the JSON standard. If the API uses JSON you cannot avoid your data to grow in length when using these characters.
From json.org:

Python: Correct Way to refer to index of unicode string

Not sure if this is exactly the problem, but I'm trying to insert a tag on the first letter of a unicode string and it seems that this is not working. Could these be because unicode indices work differently than those of regular strings?
Right now my code is this:
for index, paragraph in enumerate(intro[2:-2]):
intro[index] = bold_letters(paragraph, 1)
def bold_letters(string, index):
return "<b>"+string[0]+"</b>"+string[index:]
And I'm getting output like this:
<b>?</b>?רך האחד וישתבח הבורא בחכמתו ורצונו כל צבא השמים ארץ וימים אלה ואלונים.
It seems the unicode gets messed up when I try to insert the HTML tag. I tried messing with the insert position but didn't make any progress.
Example desired output (hebrew goes right to left):
>>>first_letter_bold("הקדמה")
"הקדמ<\b>ה<b>"
BTW, this is for Python 2
You are right, indices work over each byte when you are dealing with raw bytes i.e String in Python(2.x).
To work seamlessly with Unicode data, you need to first let Python(2.x) know that you are dealing with Unicode, then do the string manipulation. You can finally convert it back to raw bytes to keep the behavior abstracted i.e you get String and you return String.
Ideally you should convert all the data from UTF8 raw encoding to Unicode object (I am assuming your source encoding is Unicode UTF8 because that is the standard used by most applications these days) at the very beginning of your code and convert back to raw bytes at the fag end of code like saving to DB, responding to client etc. Some frameworks might handle that for you so that you don't have to worry.
def bold_letters(string, index):
string = string.decode('utf8')
string "<b>"+string[0]+"</b>"+string[index:]
return string.encode('utf8')
This will also work for ASCII because UTF8 is a super-set of ASCII. You can understand how Unicode works and in Python specifically better by reading http://nedbatchelder.com/text/unipain.html
Python 3.x String is a Unicode object so you don't have to explicitly do anything.
You should use Unicode strings. Byte strings in UTF-8 use a variable number of bytes per character. Unicode use one (at least those in the BMP on Python 2...the first 65536 characters):
#coding:utf8
s = u"הקדמה"
t = u'<b>'+s[0]+u'</b>'+s[1:]
print(t)
with open('out.htm','w',encoding='utf-8-sig') as f:
f.write(t)
Output:
<b>ה</b>קדמה
But my Chrome browser displays out.htm as:

Python 2: Comparing a unicode and a str

This topic is already on StackOverflow but I didn't find any satisfying solution:
I have some strings in Unicode coming from a server and I have some hardcoded strings in the code which I'd like to match against. And I do understand why I can't just make a == but I do not succeed in converting them properly (I don't care if I've to do str -> unicode or unicode -> str).
I tried encode and decode but it didn't gave any result.
Here is what I receive...
fromServer = {unicode} u'Führerschein nötig'
fromCode = {str} 'Führerschein nötig'
(as you can see, it is german!)
How can have them equals in Python 2 ?
First make sure you declare the encoding of your Python source file at the top of the file. Eg. if your file is encoded as latin-1:
# -*- coding: latin-1 -*-
And second, always store text as Unicode strings:
fromCode = u'Führerschein nötig'
If you get bytes from somewhere, convert them to Unicode with str.decode before working with the text. For text files, specify the encoding when opening the file, eg:
# use codecs.open to open a text file
f = codecs.open('unicode.rst', encoding='utf-8')
Code which compares byte strings with Unicode strings will often fail at random, depending on system settings, or whatever encoding happens to be used for a text file. Don't rely on it, always make sure you compare either two unicode strings or two byte strings.
Python 3 changed this behaviour, it will not try to convert any strings. 'a' and b'a' are considered objects of a different type and comparing them will always return False.
tested on 2.7
for German umlauts latin-1 is used.
if 'Führerschein nötig'.decode('latin-1') == u'Führerschein nötig':
print('yes....')
yes....

Categories