Unicode strings to byte strings without the addition of backslashes - python

I'm learning python by doing the python challenge using python3.3 and I'm on question eight. There's a comment in the markup providing you with two bz2-compressed unicode strings outputting byte strings, one for username and one for password. There's also a link where you need the decompressed credentials to enter.
One way to easily solve this is just to manually copy the strings and assign it to two variables as byte strings and then just use the bz2 library to decompress it:
>>>un=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(bz2.decompress(un).decode('utf-8'))
huge
But that's not for me since I want the answer by just running my python file.
My code like this:
>>>import bz2, re, requests
>>>url = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
>>>un = re.findall(r'un: \'(.*)\'',url.text)[0]
>>>correct=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(un,un is correct,sep='\n')
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
False
The problem is that when it converts from unicode string to byte string the escaping backslash gets added so that it cannot be read by bz2 module. I have tried everything I know and what got up when I searched.
How do I get it from unicode to byte so that it doesn't get changed?

Here it is a solution:
import urllib
import bz2
import re
def decode(line):
out = re.search(r"\'(.*?)\'",''.join(line)).group()
out = eval("b%s" % out)
return bz2.decompress(out)
#read lines that contain the encoded message
page = urllib.urlopen('http://www.pythonchallenge.com/pc/def/integrity.html').readlines()[20:22]
print "Click on the bee and insert: "
User_Name = decode(page[0])
print "User Name is: " + User_Name
Password = decode(page[1])
print "Password is: " + Password

The backslashes are present in the HTML source, so it's not surprising that the requests module preserves them. I don't have requests installed on my Python 3 environment, so I haven't been able to exactly replicate your situation, but it looks to me like if you start capturing the surrounding ' characters, you can use ast.literal_eval to parse the character sequence into a bytes array:
>>> test
"'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'"
>>> import ast
>>> res = ast.literal_eval("b%s" % test)
>>> import bz2
>>> len(bz2.decompress(res))
4
There are probably other ways, but why not use Python's built in knowledge that the byte sequence b'\\xaf' can be parsed into a bytes array?

Related

Python decode_header splits the original string

Using Python 3, I'm trying to parse e-mails from an mbox file.
for message in mailbox.mbox('file'):
sender = message['From']
c = decode_header(sender)
The raw e-mail has this unique From: header
From: "=?UTF-8?Q?Mark_from_Site?=" <info#site.com>
Anyway, c is
[(b'"', None), (b'Mark from Site', 'utf-8'), (b'" <info#site.com>', None)]
In this case, the line is unexpectedly split following the quotation marks " in multiple elements.
Handling this may be cumbersome, because there may be an undefined number of elements (not always 3 like above) in the list, according to the number of ", and there may also be other causes for splitting.
When there is no string encoding (that is: when the header is pure ascii), there is no split and c is "Mark from Site" <info#site.com>.
Is there a way to avoid this splitting also for non-ascii encodings?
Or, otherwise, how to correctly parse this kind of headers?
What about doing the simplest thing, ie. converting all parts to Unicode and then glueing them together:
from = ''.join(t[0].decode(t[1] if t[1] else 'UTF-8') for t in decode_header(sender))
You can have the email.header module handle encoding for you by creating an instance of email.header.Header with your string and the charset it should be encoded in.
from email.header import Header
for message in mailbox.mbox('file'):
sender = Header(message['From'], "utf-8")
c = decode_header(sender)
str(email.header.make_header(email.header.decode_header(encoded_string)))
Not too obvious, but this should decode and correctly rebuild the header and convert it to a string. I also found this somewhere here on StackOverflow.
Not sure if it's the most elegant way, but seems to work for me.
See https://docs.python.org/3/library/email.header.html for the documentation of these functions.

Python: print unicode string stored as a variable [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
Improve this question
In Python (3.5.0), I'd like to print a string containig unicode symbols (more precisely, IPA symbols retrieved from Wiktionary in JSON format) to the screen or a file, e.g.
print("\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n")
correctly prints
ˈwɔːtəˌmɛlən
- however, whenever I use the string in a variable, e.g.
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
print(ipa)
it just prints out the string as-is, i.e.
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
which isn't of much help.
I have tried out several ways to avoid this (like going via deocde/encode) but non of that helped.
I cannot work with
u'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
either since I am already retrieving the string as a variable (as the result of a regex-match) and at no point in my code enter the actual literals.
It might as well be that I made a mistake during the conversion from the JSON result; by now I have converted the byte stream into a string using str(f.read()), extracted the IPA part via regex (and done a replace on the double backslashes) and stored it in a string variable.
Edit:
This is the code I had so far:
def getIPAen(word):
url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
jsoncont = str((urllib.request.urlopen(url)).read())
jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
#print("jsomatch: " + jsonmatch)
ipa = jsonmatch.replace("\\\\", "\\")
#print("ipa: " + ipa)
print(ipa)
After modification with json.loads:
def getIPAen(word):
url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
jsoncont = str((urllib.request.urlopen(url)).read())
jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
#print("jsonmatch: " + jsonmatch)
jsonstr = "\"" + jsonmatch + "\""
#print("jsonstr: " + jsonstr)
jsonloads = json.loads(jsonstr)
#print("jsonloads: " + jsonloads)
print(jsonloads)
For both versions, when calling it with
getIPAen("watermelon")
what I get is:
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
Is there any way to have the string printed/written as already decoded, even when passed as a variable?
You don't have this value:
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
because that value prints just fine:
>>> ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
>>> print(ipa)
ˈwɔːtəˌmɛlən
You at the very least have literal \ and u characters:
ipa = '\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n'
Those \\ sequences are one backslash each, but escaped. Since this is JSON, the string is probably also surrounded by double quotes:
ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
Because that string has literal backslashes, that is exactly what is being printed:
>>> ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ipa)
"\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n"
>>> ipa[1]
'\\'
>>> print(ipa[1])
\
>>> ipa[2]
'u'
Note how the value echoed shows a string literal you can copy and paste back into Python, so the \ character is escaped again for you.
That value is valid JSON, which also uses \uhhhh escape sequences. Decode it as JSON:
import json
print(json.loads(ipa))
Now you have a proper Python value:
>>> import json
>>> json.loads(ipa)
'ˈwɔːtəˌmɛlən'
>>> print(json.loads(ipa))
ˈwɔːtəˌmɛlən
Note that in Python 3, almost all codepoints are printed directly even when repl() creates a literal for you. The json.loads() result directly shows all text in the value, even though the majority is non-ASCII.
This value does not contain literal backslashes or u characters:
>>> result = json.loads(ipa)
>>> result[0]
'ˈ'
>>> result[1]
'w'
As a side note, when debugging issues like this, you really want to use the repr() and ascii() functions so you get representations that let you properly reproduce the value of a string:
>>> print(repr(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ascii(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(repr(result))
'ˈwɔːtəˌmɛlən'
>>> print(ascii(result))
'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
Note that only ascii() on a string with actual Unicode codepoints beyond the Latin-1 range produces actual \uhhhh escape sequences. (For repl() output Python can still fall back to \uhhhh escapes if you terminal or console can't handle specific characters).
As for your update, just parse the whole response as JSON, and load the right data from that. Your code instead converts the bytes response body to a repr() (the str() call on bytes does not decode the data; instead you doubly escape escapes this way). Decode the bytes from the network as UTF-8, then feed that data to json.loads():
import json
import re
import urllib.request
from urllib.parse import quote_plus
baseurl = "https://en.wiktionary.org/w/api.php?action=query&titles={}&prop=revisions&rvprop=content&format=json"
def getIPAen(word):
url = baseurl.format(quote_plus(word))
jsondata = urllib.request.urlopen(url).read().decode('utf8')
data = json.loads(jsondata)
for page in data['query']['pages'].values():
for revision in page['revisions']:
if 'IPA' in revision['*']:
ipa = re.search(r"{IPA\|/(.*?)/\|", revision['*']).group(1)
print(ipa)
Note that I also make sure to quote the word value into the URL query string.
The above prints out any IPA it finds:
>>> getIPAen('watermelon')
ˈwɔːtəˌmɛlən
>>> getIPAen('chocolate')
ˈtʃɒk(ə)lɪt

Python - writing unicode strings to a file & beautiful soup

I'm using BeautifulSoup to parse some XML files. One of the fields in this file frequently uses Unicode characters. I've tried unsuccessfully to write the unicode to a file using encode.
The process so far is basically:
Get the name
gamename = items.find('name').string.strip()
Then incorporate the name into a list which is later converted into a string:
stringtoprint = userid, gamename.encode('utf-8') #
newstring = "INSERT INTO collections VALUES " + str(stringtoprint) + ";" +"\n"
Then write that string to a file.
listofgamesowned.write(newstring.encode("UTF-8"))
It seems that I won't have to .encode quite so often. I had tried encoding directly upon parsing out the name e.g. gamename = items.find('name').string.strip().encode('utf-8') - however, that did not seem to work.
Currently - 'Uudet L\xc3\xb6yt\xc3\xb6retket'
is being printed and saved rather than Uudet Löytöretket.
It seems if this were a string I was generating then I'd use something.write(u'Uudet L\xc3\xb6yt\xc3\xb6retket'); however, it's one element embedded in a string.
Unicode is an in-memory representation of a string. When you write out or read in you need to encode and decode.
Uudet L\xc3\xb6yt\xc3\xb6retket is the utf-8 encoded version of Uudet Löytöretket, so it is what you want to write out. When you want to read a string back from a file you need to decode it.
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'
Uudet Löytöretket
>>> print 'Uudet L\xc3\xb6yt\xc3\xb6retket'.decode('utf-8')
Uudet Löytöretket
Just remember to encode immediately before you output and decode immediately after you read it back.

EOF while scanning triple-quoted string literal

i've looked on the web and here but i didn't find an answer :
here is my code
zlib.decompress("""
xワᆳヤ=ラᄇHナs~Ʀᄑç\ムîà
Z#ÑÁÔQÇlxÇÆïPP~ýVãì゙M6ÛÐ|ê֭ᄁᄂヤ=)}éÓUe﬿ö3ᄎᄌú"}ʿïÿ÷1þ8ñ́U÷ᄏñíLÒVi:`ᄈᄎL!Ê҆p6-%Fë^ヘ÷à,Q.K!ユô`ÄA!ÑêweÌ ÊÚAロYøøÂjôóᅠÂcñ䊧fᆴùテúN :nüzAÝ7%ᄌcdUタᄌ3ôPۂタlンyHᆲᄑ$/yzᄒíàヌ'ÕÓ&`|S!<'ᄂ÷Zļᄐ2ホモ;ニ(ÅÛfb!úü$ナテᄒ,9ßhàPᄎᄄێフÑbØὛホQᄍ-Ü}(n;ᄄホLヤ\^ï9ᆭᄍラDdВéÞ|åPOGᄂÐÙ%â&AÔë)ÎTÐC ᄐïc枢í%Èï!フᄋëiq*ᄌVKÐNᄡ[ᄁfOq{OᆭÆÊ,0GᄂリmtツᄈOᄌΥ$#îヘqbYᄆメUニᄉÞáP`
ヨ×ᆵÃPwaレǩâ×)ハFcêÚ=!Åöᄊ
)AFñᄈ/cMᄃ!NóNΈór?pàÜòXw
Bvæ0ïçIÉoマ>5pᆭ-ØWÚNᄆùFᄆØPçÃþdᅠ;ル1[Oᄈホ~6ツᄈᆬŕìᄄޠ=øð#ネV﾿ᄅ)÷%ユÜib{HᄆKŅVlDCテîfÑWì÷ìáár.ワîv﾿<dᄎn~ú*ÁÕ7ýá}EsYᆵWᄂÈ:R×ãQңメ?Ø1vヘäツ~èR1ᄉÜ*ᄡónAᆬjmNoツユᄈÌښᆬf[8ᆭÛ>゙OWラ|ÌbDᄁÖ녡M=Ð÷èâミム'ÂÝÐ ;ë mᄎQÂäԤۢ:モᄆdᄎᄑLȂ1ᄈ_÷YZᆲNòÛ â\ロxÐlݵᆵムᆱøm5Ëá=ïoÍlMᆪ[×#Ypᅠトx[ÉÊyæツoモナz)ᆭᄀÝÏìò
""")
so it was a string that i got by zlib.compress an other string.
How can i decompress this string ?
Regards
Bussiere
The zlib.decompress should work if you pass it the output of zlib.compress.
Since the compressed string is really not text it is a binary string. It will not play friendly with displaying to the terminal as you have found.
You can use base64 encoding to give you something safe to drop into unittests, paste into code etc.
>>> import zlib
>>> a = zlib.compress('fooo')
>>> b = a.encode('base64')
>>> b
'eJxLy8/PBwAENgG0\n'
>>> c = 'eJxLy8/PBwAENgG0\n'.decode('base64')
>>> zlib.decompress(c)
'fooo'
>>> zlib.decompress(a)
'fooo'
a as an output is ok for binary transmission or saving to a file.
b is friendly to use with the clipboard, send in email, etc.
I would not have it in that representation. Use repr() in the other code to generate an ASCII-clean representation, and use that instead. Then just look for triple quotes in the result and break them up.

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail#gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail#gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used for example for query strings in the HTTP URLs, where the space characters () are traditionally encoded as plus character (+), and the + is percent-encoded to %2B.
In addition to these there is the unquote_to_bytes that converts the given encoded string to bytes, which can be used when the encoding is not known or the encoded data is binary data. However there is no unquote_plus_to_bytes, if you need it, you can do:
def unquote_plus_to_bytes(s):
if isinstance(s, bytes):
s = s.replace(b'+', b' ')
else:
s = s.replace('+', ' ')
return unquote_to_bytes(s)
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)

Categories