Unicode to ASCII string in Python 2


Trying to make a wall display of current MET data for a selected airport.
This is my first use of a Raspberry Pi 3 and of Python.
The plan is to read from a net data provider and show selected data on an LCD display.
The LCD library seems to work only in Python 2. JSON data seems to be easier to handle in Python 3.
The question "python json loads unicode" probably addresses my problem, but I do not understand what to do.
So, what shall I do to my code?
Minimal example demonstrating my problem:
#!/usr/bin/python
import I2C_LCD_driver
import urllib2
import urllib
import json
mylcd = I2C_LCD_driver.lcd()
mylcd.lcd_clear()
url = 'https://avwx.rest/api/metar/ESSB'
request = urllib2.Request(url)
response = urllib2.urlopen(request).read()
data = json.loads(response)
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string(str1, 1, 0)
The error is as follows:
$ python Minimal.py
Traceback (most recent call last):
  File "Minimal.py", line 18, in <module>
    mylcd.lcd_display_string(str1, 1, 0)
  File "/home/pi/I2C_LCD_driver.py", line 161, in lcd_display_string
    self.lcd_write(ord(char), Rs)
TypeError: ord() expected a character, but string of length 4 found

It's a little bit hard to tell without seeing the inside of mylcd.lcd_display_string(), but I think the problem is here:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
I suspect that you want str1 to contain something of type string, with a value like "132 metres". Try adding a print statement just after, so that you can see what str1 contains.
str1 = data["Altimeter"], data["Units"]["Altimeter"]
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
I think you will see a result like:
str1 is: ('foo', 'bar'), of type <type 'tuple'>.
The mention of "type tuple", the parentheses, and the comma indicate that str1 is not a string.
The problem is that the comma in the assignment does not do concatenation, which is perhaps what you were expecting. It packs the two values into a tuple. To build a single string, a flexible and sufficient method is str.format():
str1 = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
Then I expect you will see a result like:
str1 is: 132 metres, of type <type 'str'>.
That value of type "str" should work fine with mylcd.lcd_display_string().
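To see why the traceback complains about a "string of length 4", here is a minimal sketch that reproduces the situation without the LCD hardware. The dictionary is a made-up stand-in for the parsed API response (the values "1013" and "hPa" are assumptions), and the loop only imitates the per-character ord() call that the traceback shows inside the driver. It is written so that it also runs under Python 3.

```python
# Made-up sample standing in for json.loads(response); values are assumptions.
data = {"Altimeter": "1013", "Units": {"Altimeter": "hPa"}}

# The comma builds a tuple of two strings, not one string.
as_tuple = data["Altimeter"], data["Units"]["Altimeter"]

# Iterating over the tuple yields whole strings, so ord() fails on each item.
errors = []
for item in as_tuple:
    try:
        ord(item)
    except TypeError as exc:
        errors.append(str(exc))

# str.format() produces the single string the display call expects.
as_text = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])

print(type(as_tuple).__name__, errors)
print(type(as_text).__name__, as_text)
```

Iterating over as_text instead yields one character at a time, which is what ord() needs.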

You are passing in a tuple, not a single string:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string() expects a single string instead. Perhaps you meant to concatenate the two strings:
str1 = data["Altimeter"] + data["Units"]["Altimeter"]
Python 2 will implicitly encode Unicode strings to ASCII byte strings where a byte string is expected; if your data is not ASCII encodable you'd have to encode your data yourself, with a suitable codec, one that maps your Unicode codepoints to the right bytes for your LCD display (I believe you can have different character sets stored in ROM). This likely would involve a manual translation table as non-ASCII LCD ROM character sets seldom correspond with standard codecs.
If you just want to force the string to contain ASCII characters only, encode with the error handler set to 'ignore':
str1 = (data["Altimeter"] + data["Units"]["Altimeter"]).encode('ASCII', 'ignore')
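As a rough illustration of the 'ignore' error handler, the sketch below joins a value and a unit and drops every character that ASCII cannot represent. The helper name to_lcd_text and the sample values are hypothetical, and the code uses Python 3 syntax; in Python 2 the same .encode('ascii', 'ignore') call on a unicode object returns a byte string directly, so the final .decode() would not be needed there.

```python
def to_lcd_text(*parts):
    """Join the parts with spaces and drop any non-ASCII character.

    Hypothetical helper for illustration; not part of any LCD library.
    """
    text = " ".join(str(p) for p in parts)
    # 'ignore' silently discards characters that have no ASCII encoding.
    return text.encode("ascii", "ignore").decode("ascii")


print(to_lcd_text("1013", "hPa"))
print(to_lcd_text("12", "\u00b0C"))  # the degree sign is dropped
```

Dropping characters loses information, so a translation table matched to the display's character ROM remains the better option when non-ASCII symbols matter.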

Related

Python - Issues with Unicode String from API Call

I'm using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.
When I call the endpoint, the name prints out with the unicode attached to it:
>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))
>>> print(last_name)
"Mitrovi\u0107"
>>> print(type(last_name))
<class 'str'>
If I copy and paste that output and put it in a variable on its own, like so:
>>> print("Mitrovi\u0107")
Mitrović
>>> print(type("Mitrovi\u0107"))
<class 'str'>
Then it prints just fine?
What is wrong with the API endpoint call and the string that comes from it?

Well, you serialise the string with json.dumps() before printing it, that's why you get a different output.
Compare the following:
>>> print("Mitrović")
Mitrović
and
>>> print(json.dumps("Mitrović"))
"Mitrovi\u0107"
The second command adds double quotes to the output and escapes non-ASCII chars, because that's how strings are encoded in JSON. So it's possible that response["response"][2]["player"]["lastname"] contains exactly what you want, but maybe you fooled yourself by wrapping it in json.dumps() before printing.
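A small round trip makes the distinction concrete. This sketch (Python 3, with the player name hard-coded as an assumption instead of coming from the API) serialises the string, then parses it back, and also shows the ensure_ascii parameter of json.dumps(), which keeps non-ASCII characters unescaped.

```python
import json

name = "Mitrovi\u0107"  # the Python string itself, 8 characters

escaped = json.dumps(name)                       # JSON text, ASCII only
readable = json.dumps(name, ensure_ascii=False)  # JSON text, keeps the letter

print(escaped)
print(readable)

# Parsing either JSON text gives back the original Python string.
print(json.loads(escaped) == name)
print(json.loads(readable) == name)
```

In other words, json.dumps() output is meant for transport or storage, and json.loads() is the step that turns it back into text for display.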
Note: don't confuse Python string literals and JSON serialisation of strings. They share some common features, but they aren't the same (e.g. JSON strings can't be single-quoted), and they serve different purposes (the first are for writing strings in source code, the second are for encoding data to send across the network).
Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:
>>> print(json.dumps("Mitrović", ensure_ascii=False))
"Mitrović"

Count the number of characters in your string & I'll bet you'll notice that the result of json is 13 characters:
"M-i-t-r-o-v-i-\-u-0-1-0-7", or "Mitrovi\\u0107"
When you copy "Mitrovi\u0107" you're copying 8 characters, because '\u0107' is a single Unicode character.
That would suggest the endpoint is not sending properly json-escaped unicode, or somewhere in your doc you're reading it as ascii first. Carefully look at exactly what you're receiving.
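The character counts mentioned above can be checked directly. In this Python 3 sketch, a raw string stands in for text that contains a literal backslash and the letter u; the variable names are only illustrative.

```python
import json

decoded = "Mitrovi\u0107"   # escape interpreted by Python: one character
literal = r"Mitrovi\u0107"  # raw string: backslash, 'u' and four hex digits

print(len(decoded), len(literal))

# Wrapping the literal text in double quotes makes it a JSON string again,
# which json.loads() can turn into the decoded form.
print(json.loads('"' + literal + '"') == decoded)
```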

Python: print unicode string stored as a variable [closed]

In Python (3.5.0), I'd like to print a string containing Unicode symbols (more precisely, IPA symbols retrieved from Wiktionary in JSON format) to the screen or a file, e.g.
print("\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n")
correctly prints
ˈwɔːtəˌmɛlən
- however, whenever I use the string in a variable, e.g.
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
print(ipa)
it just prints out the string as-is, i.e.
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
which isn't of much help.
I have tried out several ways to avoid this (like going via decode/encode) but none of that helped.
I cannot work with
u'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
either since I am already retrieving the string as a variable (as the result of a regex-match) and at no point in my code enter the actual literals.
It might as well be that I made a mistake during the conversion from the JSON result; by now I have converted the byte stream into a string using str(f.read()), extracted the IPA part via regex (and done a replace on the double backslashes) and stored it in a string variable.
Edit:
This is the code I had so far:
def getIPAen(word):
    url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
    jsoncont = str((urllib.request.urlopen(url)).read())
    jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
    #print("jsomatch: " + jsonmatch)
    ipa = jsonmatch.replace("\\\\", "\\")
    #print("ipa: " + ipa)
    print(ipa)
After modification with json.loads:
def getIPAen(word):
    url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
    jsoncont = str((urllib.request.urlopen(url)).read())
    jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
    #print("jsonmatch: " + jsonmatch)
    jsonstr = "\"" + jsonmatch + "\""
    #print("jsonstr: " + jsonstr)
    jsonloads = json.loads(jsonstr)
    #print("jsonloads: " + jsonloads)
    print(jsonloads)
For both versions, when calling it with
getIPAen("watermelon")
what I get is:
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
Is there any way to have the string printed/written as already decoded, even when passed as a variable?

You don't have this value:
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
because that value prints just fine:
>>> ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
>>> print(ipa)
ˈwɔːtəˌmɛlən
You at the very least have literal \ and u characters:
ipa = '\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n'
Those \\ sequences are one backslash each, but escaped. Since this is JSON, the string is probably also surrounded by double quotes:
ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
Because that string has literal backslashes, that is exactly what is being printed:
>>> ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ipa)
"\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n"
>>> ipa[1]
'\\'
>>> print(ipa[1])
\
>>> ipa[2]
'u'
Note how the value echoed shows a string literal you can copy and paste back into Python, so the \ character is escaped again for you.
That value is valid JSON, which also uses \uhhhh escape sequences. Decode it as JSON:
import json
print(json.loads(ipa))
Now you have a proper Python value:
>>> import json
>>> json.loads(ipa)
'ˈwɔːtəˌmɛlən'
>>> print(json.loads(ipa))
ˈwɔːtəˌmɛlən
Note that in Python 3, almost all codepoints are printed directly even when repr() creates a literal for you. The json.loads() result directly shows all text in the value, even though the majority is non-ASCII.
This value does not contain literal backslashes or u characters:
>>> result = json.loads(ipa)
>>> result[0]
'ˈ'
>>> result[1]
'w'
As a side note, when debugging issues like this, you really want to use the repr() and ascii() functions so you get representations that let you properly reproduce the value of a string:
>>> print(repr(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ascii(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(repr(result))
'ˈwɔːtəˌmɛlən'
>>> print(ascii(result))
'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
Note that only ascii() on a string with actual Unicode codepoints beyond the Latin-1 range produces actual \uhhhh escape sequences. (For repr() output, Python can still fall back to \uhhhh escapes if your terminal or console can't handle specific characters.)
As for your update, just parse the whole response as JSON, and load the right data from that. Your code instead converts the bytes response body to a repr() (the str() call on bytes does not decode the data; instead you doubly escape escapes this way). Decode the bytes from the network as UTF-8, then feed that data to json.loads():
import json
import re
import urllib.request
from urllib.parse import quote_plus

baseurl = "https://en.wiktionary.org/w/api.php?action=query&titles={}&prop=revisions&rvprop=content&format=json"

def getIPAen(word):
    url = baseurl.format(quote_plus(word))
    jsondata = urllib.request.urlopen(url).read().decode('utf8')
    data = json.loads(jsondata)
    for page in data['query']['pages'].values():
        for revision in page['revisions']:
            if 'IPA' in revision['*']:
                ipa = re.search(r"{IPA\|/(.*?)/\|", revision['*']).group(1)
                print(ipa)
Note that I also make sure to quote the word value into the URL query string.
The above prints out any IPA it finds:
>>> getIPAen('watermelon')
ˈwɔːtəˌmɛlən
>>> getIPAen('chocolate')
ˈtʃɒk(ə)lɪt

How to use Python convert a unicode string to the real string [duplicate]

I have used Python to get some info through urllib2, but the info is unicode string.
I've tried something like below:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法，删除存储在

You need to convert the string to a unicode string.
First of all, the backslashes in a are auto-escaped:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # Prints '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So playing with the encoding / decoding of this escaped string makes no difference.
You can either use unicode literal or convert the string into a unicode string.
To use unicode literal, just add a u in the front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert existing string into a unicode string, you can call unicode, with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法，删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, the unicode_escape encoding is a Python-specific encoding which is used to "produce a string that is suitable as Unicode literal in Python source code".
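For readers on Python 3, where the unicode() builtin no longer exists, a comparable sketch is shown below. It assumes the escaped text is pure ASCII (as it is here); text that already contains non-ASCII characters would need different handling, and parsing the original JSON with json.loads() is usually the more robust route.

```python
# Text containing literal backslash-u sequences, written as a raw string.
a = r"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"

# Encode to ASCII bytes first, then let the unicode_escape codec
# interpret the \uXXXX sequences.
decoded = a.encode("ascii").decode("unicode_escape")

print(len(a), len(decoded))
print(decoded)
```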

Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法，删除存储在
This involves recreating a valid JSON-encoded string by wrapping the string stored in a in double quotes, then decoding that JSON string into a Python unicode string.

Unicode strings to byte strings without the addition of backslashes

I'm learning python by doing the python challenge using python3.3 and I'm on question eight. There's a comment in the markup providing you with two bz2-compressed unicode strings outputting byte strings, one for username and one for password. There's also a link where you need the decompressed credentials to enter.
One way to easily solve this is just to manually copy the strings and assign it to two variables as byte strings and then just use the bz2 library to decompress it:
>>>un=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(bz2.decompress(un).decode('utf-8'))
huge
But that's not for me since I want the answer by just running my python file.
My code looks like this:
>>>import bz2, re, requests
>>>url = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
>>>un = re.findall(r'un: \'(.*)\'',url.text)[0]
>>>correct=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(un,un is correct,sep='\n')
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
False
The problem is that when it converts from a unicode string to a byte string, an escaping backslash gets added, so the result cannot be read by the bz2 module. I have tried everything I know and whatever came up when I searched.
How do I get it from unicode to byte so that it doesn't get changed?

Here is a solution (Python 2):
import urllib
import bz2
import re

def decode(line):
    # Grab the quoted, backslash-escaped text and evaluate it as a bytes literal.
    # Note: eval() runs arbitrary code, so only use it on input you trust.
    out = re.search(r"\'(.*?)\'", ''.join(line)).group()
    out = eval("b%s" % out)
    return bz2.decompress(out)

# Read the lines that contain the encoded message
page = urllib.urlopen('http://www.pythonchallenge.com/pc/def/integrity.html').readlines()[20:22]
print "Click on the bee and insert: "
User_Name = decode(page[0])
print "User Name is: " + User_Name
Password = decode(page[1])
print "Password is: " + Password

The backslashes are present in the HTML source, so it's not surprising that the requests module preserves them. I don't have requests installed on my Python 3 environment, so I haven't been able to exactly replicate your situation, but it looks to me like if you start capturing the surrounding ' characters, you can use ast.literal_eval to parse the character sequence into a bytes array:
>>> test
"'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'"
>>> import ast
>>> res = ast.literal_eval("b%s" % test)
>>> import bz2
>>> len(bz2.decompress(res))
4
There are probably other ways, but why not use Python's built in knowledge that the byte sequence b'\\xaf' can be parsed into a bytes array?
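The ast.literal_eval approach can be tried offline with a self-made sample. In this Python 3 sketch the compressed data is produced locally, and its repr() (minus the leading b) imitates the quoted, backslash-escaped text found in the page source; none of the values are taken from the actual challenge page.

```python
import ast
import bz2

original = bz2.compress(b"huge")

# repr() gives text such as b'BZh9...' in which escapes are spelled out as
# characters; dropping the leading "b" imitates the quoted text in the HTML.
page_text = repr(original)[1:]

# Re-attach the b prefix and let literal_eval parse it as a bytes literal.
restored = ast.literal_eval("b" + page_text)

print(restored == original)
print(bz2.decompress(restored).decode("utf-8"))
```

Unlike eval(), ast.literal_eval() only accepts Python literals, so it does not execute arbitrary code from the downloaded page.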

Converting Unicode objects with non-ASCII symbols in them into strings objects (in Python)

I want to send Chinese characters to be translated by an online service, and have the resulting English string returned. I'm using simple JSON and urllib for this.
And yes, I am declaring.
# -*- coding: utf-8 -*-
on top of my code.
Now everything works fine if I feed urllib a string type object, even if that object contains what would be Unicode information. My function is called translate.
For example:
stringtest1 = '無與倫比的美麗'
print translate(stringtest1)
results in the proper translation and doing
type(stringtest1)
confirms this to be a string object.
But if I do
stringtest1 = u'無與倫比的美麗'
and try to use my translation function I get this error:
File "C:\Python27\lib\urllib.py", line 1275, in urlencode
v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-8: ordinal not in range(128)
After researching a bit, it seems this is a common problem:
Problem: neither urllib2.quote nor urllib.quote encode the unicode strings arguments
urllib.quote throws exception on Unicode URL
Now, if I type in a script
stringtest1 = '無與倫比的美麗'
stringtest2 = u'無與倫比的美麗'
print 'stringtest1',stringtest1
print 'stringtest2',stringtest2
execution of it returns:
stringtest1 ç„¡èˆ‡å€«æ¯”çš„ç¾Žéº—
stringtest2 無與倫比的美麗
But just typing the variables in the console:
>>> stringtest1
'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97'
>>> stringtest2
u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'
gets me that.
My problem is that I don't control how the information to be translated comes to my function. And it seems I have to bring it in the Unicode form, which is not accepted by the function.
So, how do I convert one thing into the other?
I've read Stack Overflow question Convert Unicode to a string in Python (containing extra symbols).
But this is not what I'm after. Urllib accepts the string object but not the Unicode object, both containing the same information
Well, at least in the eyes of the web application I'm sending the unchanged information to; I'm not sure if they are still equivalent things in Python.

When you get a unicode object and want to return a UTF-8 encoded byte string from it, use theobject.encode('utf8').
It seems strange that you don't know whether the incoming object is a str or unicode -- surely you do control the call sites to that function, too?! But if that is indeed the case, for whatever weird reason, you may need something like:
def ensureutf8(s):
    if isinstance(s, unicode):
        s = s.encode('utf8')
    return s
which only encodes conditionally, that is, if it receives a unicode object, not if the object it receives is already a byte string. It returns a byte string in either case.
BTW, part of your confusion seems to be due to the fact that you don't know that just entering an expression at the interpreter prompt will show you its repr, which is not the same effect you get with print;-).
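For comparison, here is how the same conditional encoding might look in Python 3, where the text type is str and the byte type is bytes. The name ensure_utf8 is only illustrative, and urllib.parse.quote_plus stands in for whatever URL-encoding step the translate function performs, which is an assumption about code not shown in the question.

```python
from urllib.parse import quote_plus


def ensure_utf8(s):
    """Return UTF-8 bytes whether s is text (str) or already bytes."""
    if isinstance(s, str):
        s = s.encode("utf8")
    return s


text = "\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97"

# Both inputs end up as the same bytes, so they quote identically.
quoted_from_text = quote_plus(ensure_utf8(text))
quoted_from_bytes = quote_plus(ensure_utf8(text.encode("utf8")))

print(quoted_from_text)
print(quoted_from_text == quoted_from_bytes)
```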
