Python: print unicode string stored as a variable

In Python (3.5.0), I'd like to print a string containing Unicode symbols (more precisely, IPA symbols retrieved from Wiktionary in JSON format) to the screen or a file, e.g.
print("\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n")
correctly prints
ˈwɔːtəˌmɛlən
- however, whenever I use the string in a variable, e.g.
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
print(ipa)
it just prints out the string as-is, i.e.
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
which isn't of much help.
I have tried several ways to avoid this (like going via decode/encode) but none of that helped.
I cannot work with
u'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
either since I am already retrieving the string as a variable (as the result of a regex-match) and at no point in my code enter the actual literals.
It may well be that I made a mistake during the conversion from the JSON result; so far I have converted the byte stream into a string using str(f.read()), extracted the IPA part via regex (and done a replace on the double backslashes) and stored it in a string variable.
Edit:
This is the code I had so far:
def getIPAen(word):
    url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
    jsoncont = str((urllib.request.urlopen(url)).read())
    jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
    #print("jsomatch: " + jsonmatch)
    ipa = jsonmatch.replace("\\\\", "\\")
    #print("ipa: " + ipa)
    print(ipa)
After modification with json.loads:
def getIPAen(word):
    url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
    jsoncont = str((urllib.request.urlopen(url)).read())
    jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
    #print("jsonmatch: " + jsonmatch)
    jsonstr = "\"" + jsonmatch + "\""
    #print("jsonstr: " + jsonstr)
    jsonloads = json.loads(jsonstr)
    #print("jsonloads: " + jsonloads)
    print(jsonloads)
For both versions, when calling it with
getIPAen("watermelon")
what I get is:
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
Is there any way to have the string printed/written as already decoded, even when passed as a variable?

You don't have this value:
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
because that value prints just fine:
>>> ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
>>> print(ipa)
ˈwɔːtəˌmɛlən
You at the very least have literal \ and u characters:
ipa = '\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n'
Those \\ sequences are one backslash each, but escaped. Since this is JSON, the string is probably also surrounded by double quotes:
ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
Because that string has literal backslashes, that is exactly what is being printed:
>>> ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ipa)
"\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n"
>>> ipa[1]
'\\'
>>> print(ipa[1])
\
>>> ipa[2]
'u'
Note how the value echoed shows a string literal you can copy and paste back into Python, so the \ character is escaped again for you.
That value is valid JSON, which also uses \uhhhh escape sequences. Decode it as JSON:
import json
print(json.loads(ipa))
Now you have a proper Python value:
>>> import json
>>> json.loads(ipa)
'ˈwɔːtəˌmɛlən'
>>> print(json.loads(ipa))
ˈwɔːtəˌmɛlən
Note that in Python 3, almost all codepoints are printed directly even when repr() creates a literal for you. The json.loads() result directly shows all text in the value, even though the majority is non-ASCII.
This value does not contain literal backslashes or u characters:
>>> result = json.loads(ipa)
>>> result[0]
'ˈ'
>>> result[1]
'w'
As a side note, when debugging issues like this, you really want to use the repr() and ascii() functions so you get representations that let you properly reproduce the value of a string:
>>> print(repr(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ascii(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(repr(result))
'ˈwɔːtəˌmɛlən'
>>> print(ascii(result))
'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
Note that only ascii() on a string with actual Unicode codepoints beyond the Latin-1 range produces actual \uhhhh escape sequences. (For repr() output Python can still fall back to \uhhhh escapes if your terminal or console can't handle specific characters.)
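To make that distinction concrete, here is a small sketch (the sample characters are my own choice, purely for illustration) comparing how ascii() escapes a Latin-1 character with how it escapes a codepoint beyond that range, while repr() leaves printable non-ASCII text alone:

```python
latin1_char = "\u00e9"   # e with acute accent, inside the Latin-1 range
ipa_char = "\u0254"      # open o, beyond Latin-1 (it appears in the IPA string above)

latin1_ascii = ascii(latin1_char)   # uses a \xhh escape
ipa_ascii = ascii(ipa_char)         # uses a \uhhhh escape
ipa_repr = repr(ipa_char)           # keeps the printable character itself

# Printing only the ASCII-safe forms so this runs on any console.
print(latin1_ascii)
print(ipa_ascii)
print(len(ipa_repr))  # quote + one character + quote
```

So a \uhhhh sequence in ascii() output means a real non-Latin-1 character, whereas a doubled backslash in repr() output means a literal backslash in the data.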
As for your update, just parse the whole response as JSON, and load the right data from that. Your code instead converts the bytes response body to a repr() (the str() call on bytes does not decode the data; instead you doubly escape escapes this way). Decode the bytes from the network as UTF-8, then feed that data to json.loads():
import json
import re
import urllib.request
from urllib.parse import quote_plus
baseurl = "https://en.wiktionary.org/w/api.php?action=query&titles={}&prop=revisions&rvprop=content&format=json"
def getIPAen(word):
    url = baseurl.format(quote_plus(word))
    jsondata = urllib.request.urlopen(url).read().decode('utf8')
    data = json.loads(jsondata)
    for page in data['query']['pages'].values():
        for revision in page['revisions']:
            if 'IPA' in revision['*']:
                ipa = re.search(r"{IPA\|/(.*?)/\|", revision['*']).group(1)
                print(ipa)
Note that I also make sure to quote the word value into the URL query string.
The above prints out any IPA it finds:
>>> getIPAen('watermelon')
ˈwɔːtəˌmɛlən
>>> getIPAen('chocolate')
ˈtʃɒk(ə)lɪt
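The difference between calling str() on the response bytes and decoding them can also be seen without any network access. The following is only a sketch: the raw value is a made-up stand-in for a response body, not real Wiktionary output.

```python
import json

# Hypothetical response body: JSON bytes containing an escaped IPA fragment.
raw = b'{"ipa": "\\u02c8w\\u0254\\u02d0t"}'

as_repr = str(raw)              # text of the bytes literal, starting with b'
as_text = raw.decode("utf-8")   # the actual JSON document

ipa = json.loads(as_text)["ipa"]

# ascii() keeps the output console-safe while still showing the difference.
print(ascii(as_repr))
print(ascii(as_text))
print(ascii(ipa))
```

In as_repr every backslash from the JSON has been doubled and the text is wrapped in b'...', which is why a regex over it extracts \\u sequences that no later replace() turns into characters. Decoding first leaves valid JSON, and json.loads() does the rest.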

Related

convert a string to its codepoint in python

there are characters like '‌' that are not visible so I cant copy paste it. I want to convert any character to its codepoint like '\u200D'
another example is: 'abc' => '\u0061\u0062\u0063'
Allow me to rephrase your question. The header convert a string to its codepoint in python clearly did not get through to everyone, mostly, I think, because we can't imagine what you want it for.
What you want is a string containing a representation of Unicode escapes.
You can do that this way:
print(''.join("\\u{:04x}".format(b) for b in b'abc'))
\u0061\u0062\u0063
If you display that printed value as a string literal you will see doubled backslashes, because backslashes have to be escaped in a Python string. So it will look like this:
'\\u0061\\u0062\\u0063'
The reason for that is that if you simply put unescaped backslashes in your string literal, like this:
a = "\u0061\u0062\u0063"
when you display a at the prompt you will get:
>>> a
'abc'
'\u0061\u0062\u0063'.encode('utf-8') encodes the text to UTF-8 bytes (b'abc'); it does not produce escape sequences.
Edit:
Since Python converts the escape sequences in a string literal to the actual characters, you can't see the escaped form directly, but you can write a function that generates it.
def get_string_unicode(string_to_convert):
    res = ''
    for letter in string_to_convert:
        res += '\\u' + (hex(ord(letter))[2:]).zfill(4)
    return res
Result:
>>> get_string_unicode('abc')
'\\u0061\\u0062\\u0063'
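One thing neither version above handles is a character outside the Basic Multilingual Plane: its codepoint needs more than four hex digits, and Python spells such escapes as \UXXXXXXXX. Here is a possible variant, with to_escapes being a name I made up for this sketch:

```python
def to_escapes(text):
    """Return text with every character written as a \\u or \\U escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append("\\u{:04x}".format(cp))   # four hex digits
        else:
            parts.append("\\U{:08x}".format(cp))   # eight hex digits
    return "".join(parts)


print(to_escapes("abc"))
print(to_escapes("\u200d"))       # zero-width joiner, invisible when printed raw
print(to_escapes("\U0001f600"))   # a codepoint beyond U+FFFF
```

The output is plain ASCII, so it can be copied and pasted even when the original characters are invisible.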

Unicode to ASCII string in Python 2

Trying to make a wall display of current MET data for a selected airport.
This is my first use of a Raspberry Pi 3 and of Python.
The plan is to read from a net data provider and show selected data on a LCD display.
The LCD library seems to work only in Python2. Json data seems to be easier to handle in Python3.
This question python json loads unicode probably addresses my problem, but I do not understand what to do.
So, what shall I do to my code?
Minimal example demonstrating my problem:
#!/usr/bin/python
import I2C_LCD_driver
import urllib2
import urllib
import json
mylcd = I2C_LCD_driver.lcd()
mylcd.lcd_clear()
url = 'https://avwx.rest/api/metar/ESSB'
request = urllib2.Request(url)
response = urllib2.urlopen(request).read()
data = json.loads(response)
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string(str1, 1, 0)
The error is as follows:
$python Minimal.py
Traceback (most recent call last):
File "Minimal.py", line 18, in <module>
mylcd.lcd_display_string(str1, 1, 0)
File "/home/pi/I2C_LCD_driver.py", line 161, in lcd_display_string
self.lcd_write(ord(char), Rs)
TypeError: ord() expected a character, but string of length 4 found
It's a little bit hard to tell without seeing the inside of mylcd.lcd_display_string(), but I think the problem is here:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
I suspect that you want str1 to contain something of type string, with a value like "132 metres". Try adding a print statement just after, so that you can see what str1 contains.
str1 = data["Altimeter"], data["Units"]["Altimeter"]
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
I think you will see a result like:
str1 is: ('foo', 'bar'), of type <type 'tuple'>.
The mention of "type tuple", the parentheses, and the comma indicate that str1 is not a string.
The problem is that the comma in the assignment does not do concatenation, which is perhaps what you are expecting. It joins the two values into a tuple. To concatenate, a flexible and sufficient method is to use the str.format() method:
str1 = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
Then I expect you will see a result like:
str1 is: 132 metres, of type <type 'str'>.
That value of type "str" should work fine with mylcd.lcd_display_string().
You are passing in a tuple, not a single string:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string() expects a single string instead. Perhaps you meant to concatenate the two strings:
str1 = data["Altimeter"] + data["Units"]["Altimeter"]
Python 2 will implicitly encode Unicode strings to ASCII byte strings where a byte string is expected; if your data is not ASCII encodable you'd have to encode your data yourself, with a suitable codec, one that maps your Unicode codepoints to the right bytes for your LCD display (I believe you can have different character sets stored in ROM). This likely would involve a manual translation table as non-ASCII LCD ROM character sets seldom correspond with standard codecs.
If you just want to force the string to contain ASCII characters only, encode with the error handler set to 'ignore':
str1 = (data["Altimeter"] + data["Units"]["Altimeter"]).encode('ASCII', 'ignore')
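The tuple-versus-string point can be checked without the LCD hardware or the network. This sketch is written for Python 3 (the thread itself uses Python 2), and the data dictionary and the temperature string are invented sample values, not real API output:

```python
# Hypothetical sample shaped like the API response used in the question.
data = {"Altimeter": "1013", "Units": {"Altimeter": "hPa"}}

# A bare comma builds a tuple, which is what triggered the ord() error.
as_tuple = data["Altimeter"], data["Units"]["Altimeter"]

# Formatting (or +) builds the single string the display function expects.
as_text = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])

# The 'ignore' error handler silently drops anything that is not ASCII.
ascii_only = "21 \u00b0C".encode("ascii", "ignore")

print(type(as_tuple).__name__, type(as_text).__name__)
print(as_text)
print(ascii_only)
```

Note that 'ignore' removes the degree sign entirely rather than substituting something, so a translation table is still the better route if such characters matter on the display.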

How to use Python convert a unicode string to the real string [duplicate]

I have used Python to get some info through urllib2, but the info is unicode string.
I've tried something like below:
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print unicode(a).encode("gb2312")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.encode("utf-8").decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print u""+a
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).decode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print str(a).encode("utf-8")
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a.decode("utf-8").encode("gb2312")
but all results are the same:
\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
And I want to get the following Chinese text:
方法,删除存储在
You need to convert the string to a unicode string.
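First of all, the backslashes in a are literal characters: in a Python 2 byte string literal, \u is not an escape sequence, so each backslash is kept as-is (and shown doubled in the repr):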
a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
print a # Prints \u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728
a # Prints '\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
So playing with the encoding / decoding of this escaped string makes no difference.
You can either use unicode literal or convert the string into a unicode string.
To use unicode literal, just add a u in the front of the string:
a = u"\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
To convert existing string into a unicode string, you can call unicode, with unicode_escape as the encoding parameter:
print unicode(a, encoding='unicode_escape') # Prints 方法,删除存储在
I bet you are getting the string from a JSON response, so the second method is likely to be what you need.
BTW, the unicode_escape encoding is a Python specific encoding which is used to "Produce a string that is suitable as Unicode literal in Python source code".
Where are you getting this data from? Perhaps you could share the method by which you are downloading and extracting it.
Anyway, it kind of looks like a remnant of some JSON encoded string? Based on that assumption, here is a very hacky (and not entirely serious) way to do it:
>>> a = "\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728"
>>> a
'\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728'
>>> s = '"{}"'.format(a)
>>> s
'"\\u65b9\\u6cd5\\uff0c\\u5220\\u9664\\u5b58\\u50a8\\u5728"'
>>> import json
>>> json.loads(s)
u'\u65b9\u6cd5\uff0c\u5220\u9664\u5b58\u50a8\u5728'
>>> print json.loads(s)
方法,删除存储在
This involves recreating a valid JSON encoded string by wrapping the string stored in a in double quotes, then decoding the JSON string into a Python unicode string.
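For readers on Python 3, where unicode() no longer exists, both approaches from this thread can be sketched roughly as follows. The escaped value is a shortened copy of the string from the question, with the backslashes written explicitly because a Python 3 str literal would otherwise interpret them:

```python
import json

# Literal backslash + u sequences, as they would arrive from the server.
escaped = "\\u65b9\\u6cd5\\uff0c\\u5220\\u9664"

# Route 1: the unicode_escape codec (only safe when the input is pure ASCII).
via_codec = escaped.encode("ascii").decode("unicode_escape")

# Route 2: wrap in double quotes and let the JSON parser handle the escapes.
via_json = json.loads('"{}"'.format(escaped))

print(ascii(via_codec))
print(via_codec == via_json)
```

The JSON route assumes the fragment contains no unescaped double quotes or control characters. If the whole response is JSON, parsing the complete document is still the cleaner fix.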

Unicode strings to byte strings without the addition of backslashes

I'm learning python by doing the python challenge using python3.3 and I'm on question eight. There's a comment in the markup providing you with two bz2-compressed unicode strings outputting byte strings, one for username and one for password. There's also a link where you need the decompressed credentials to enter.
One way to easily solve this is just to manually copy the strings and assign it to two variables as byte strings and then just use the bz2 library to decompress it:
>>>un=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(bz2.decompress(un).decode('utf-8'))
huge
But that's not for me since I want the answer by just running my python file.
My code like this:
>>>import bz2, re, requests
>>>url = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
>>>un = re.findall(r'un: \'(.*)\'',url.text)[0]
>>>correct=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(un,un is correct,sep='\n')
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
False
The problem is that when it converts from unicode string to byte string, the escaping backslash gets added, so the result cannot be read by the bz2 module. I have tried everything I know and everything that came up when I searched.
How do I get it from unicode to byte so that it doesn't get changed?
Here is a solution (Python 2):
import urllib
import bz2
import re
def decode(line):
    out = re.search(r"\'(.*?)\'",''.join(line)).group()
    out = eval("b%s" % out)
    return bz2.decompress(out)
#read lines that contain the encoded message
page = urllib.urlopen('http://www.pythonchallenge.com/pc/def/integrity.html').readlines()[20:22]
print "Click on the bee and insert: "
User_Name = decode(page[0])
print "User Name is: " + User_Name
Password = decode(page[1])
print "Password is: " + Password
The backslashes are present in the HTML source, so it's not surprising that the requests module preserves them. I don't have requests installed on my Python 3 environment, so I haven't been able to exactly replicate your situation, but it looks to me like if you start capturing the surrounding ' characters, you can use ast.literal_eval to parse the character sequence into a bytes array:
>>> test
"'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'"
>>> import ast
>>> res = ast.literal_eval("b%s" % test)
>>> import bz2
>>> len(bz2.decompress(res))
4
There are probably other ways, but why not use Python's built in knowledge that the byte sequence b'\\xaf' can be parsed into a bytes array?
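Putting the pieces together for Python 3, a sketch of the ast.literal_eval approach might look like this. The html value is a hand-written stand-in for the relevant line of the page (using the un string quoted in the question), so no network request is made:

```python
import ast
import bz2
import re

# Stand-in for the page source: a raw string, so the backslashes stay literal,
# exactly as they appear in the HTML.
html = r"un: 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'"

# Capture the text including its surrounding single quotes.
quoted = re.search(r"un: ('.*')", html).group(1)

# Prefix with b so the text parses as a bytes literal, escapes and all.
un_bytes = ast.literal_eval("b" + quoted)

username = bz2.decompress(un_bytes).decode("utf-8")
print(username)
```

Unlike eval(), ast.literal_eval only accepts Python literals, so text taken from a web page cannot run arbitrary code.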

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail@gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail@gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used, for example, for query strings in HTTP URLs, where the space character ( ) is traditionally encoded as a plus character (+), and a literal + is percent-encoded to %2B.
In addition to these there is the unquote_to_bytes that converts the given encoded string to bytes, which can be used when the encoding is not known or the encoded data is binary data. However there is no unquote_plus_to_bytes, if you need it, you can do:
from urllib.parse import unquote_to_bytes

def unquote_plus_to_bytes(s):
    if isinstance(s, bytes):
        s = s.replace(b'+', b' ')
    else:
        s = s.replace('+', ' ')
    return unquote_to_bytes(s)
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
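As a quick Python 3 illustration of how the three functions differ, here is a short sketch. Apart from the e-mail address from the question, the input strings are examples I made up:

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

email = unquote("myemail%40gmail.com")    # %40 is the @ sign

plain = unquote("a+b%2Bc")                # + is left alone
form = unquote_plus("a+b%2Bc")            # + becomes a space, %2B becomes +

raw = unquote_to_bytes("caf%C3%A9")       # bytes, with no charset assumed

print(email)
print(plain)
print(form)
print(raw)
```

For a header value like the one in the question, plain unquote() is probably the safer default, since it will not turn a legitimate + in an address into a space.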
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)
