Unescape Python Strings From HTTP - python

I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail#gmail.com
Would urllib.unquote() be the way to go?

I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail#gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.

In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used for example for query strings in the HTTP URLs, where the space characters () are traditionally encoded as plus character (+), and the + is percent-encoded to %2B.
In addition to these there is the unquote_to_bytes that converts the given encoded string to bytes, which can be used when the encoding is not known or the encoded data is binary data. However there is no unquote_plus_to_bytes, if you need it, you can do:
def unquote_plus_to_bytes(s):
if isinstance(s, bytes):
s = s.replace(b'+', b' ')
else:
s = s.replace('+', ' ')
return unquote_to_bytes(s)
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.

Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)

Related

Python: print unicode string stored as a variable [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 6 years ago.
Improve this question
In Python (3.5.0), I'd like to print a string containig unicode symbols (more precisely, IPA symbols retrieved from Wiktionary in JSON format) to the screen or a file, e.g.
print("\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n")
correctly prints
ˈwɔːtəˌmɛlən
- however, whenever I use the string in a variable, e.g.
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
print(ipa)
it just prints out the string as-is, i.e.
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
which isn't of much help.
I have tried out several ways to avoid this (like going via deocde/encode) but non of that helped.
I cannot work with
u'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
either since I am already retrieving the string as a variable (as the result of a regex-match) and at no point in my code enter the actual literals.
It might as well be that I made a mistake during the conversion from the JSON result; by now I have converted the byte stream into a string using str(f.read()), extracted the IPA part via regex (and done a replace on the double backslashes) and stored it in a string variable.
Edit:
This is the code I had so far:
def getIPAen(word):
url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
jsoncont = str((urllib.request.urlopen(url)).read())
jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
#print("jsomatch: " + jsonmatch)
ipa = jsonmatch.replace("\\\\", "\\")
#print("ipa: " + ipa)
print(ipa)
After modification with json.loads:
def getIPAen(word):
url = "https://en.wiktionary.org/w/api.php?action=query&titles=" + word + "&prop=revisions&rvprop=content&format=json"
jsoncont = str((urllib.request.urlopen(url)).read())
jsonmatch = re.search("\{IPA\|/(.*?)/\|", jsoncont).group(1)
#print("jsonmatch: " + jsonmatch)
jsonstr = "\"" + jsonmatch + "\""
#print("jsonstr: " + jsonstr)
jsonloads = json.loads(jsonstr)
#print("jsonloads: " + jsonloads)
print(jsonloads)
For both versions, when calling it with
getIPAen("watermelon")
what I get is:
\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n
Is there any way to have the string printed/written as already decoded, even when passed as a variable?
You don't have this value:
ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
because that value prints just fine:
>>> ipa = '\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
>>> print(ipa)
ˈwɔːtəˌmɛlən
You at the very least have literal \ and u characters:
ipa = '\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n'
Those \\ sequences are one backslash each, but escaped. Since this is JSON, the string is probably also surrounded by double quotes:
ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
Because that string has literal backslashes, that is exactly what is being printed:
>>> ipa = '"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ipa)
"\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n"
>>> ipa[1]
'\\'
>>> print(ipa[1])
\
>>> ipa[2]
'u'
Note how the value echoed shows a string literal you can copy and paste back into Python, so the \ character is escaped again for you.
That value is valid JSON, which also uses \uhhhh escape sequences. Decode it as JSON:
import json
print(json.loads(ipa))
Now you have a proper Python value:
>>> import json
>>> json.loads(ipa)
'ˈwɔːtəˌmɛlən'
>>> print(json.loads(ipa))
ˈwɔːtəˌmɛlən
Note that in Python 3, almost all codepoints are printed directly even when repl() creates a literal for you. The json.loads() result directly shows all text in the value, even though the majority is non-ASCII.
This value does not contain literal backslashes or u characters:
>>> result = json.loads(ipa)
>>> result[0]
'ˈ'
>>> result[1]
'w'
As a side note, when debugging issues like this, you really want to use the repr() and ascii() functions so you get representations that let you properly reproduce the value of a string:
>>> print(repr(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(ascii(ipa))
'"\\u02c8w\\u0254\\u02d0t\\u0259\\u02ccm\\u025bl\\u0259n"'
>>> print(repr(result))
'ˈwɔːtəˌmɛlən'
>>> print(ascii(result))
'\u02c8w\u0254\u02d0t\u0259\u02ccm\u025bl\u0259n'
Note that only ascii() on a string with actual Unicode codepoints beyond the Latin-1 range produces actual \uhhhh escape sequences. (For repl() output Python can still fall back to \uhhhh escapes if you terminal or console can't handle specific characters).
As for your update, just parse the whole response as JSON, and load the right data from that. Your code instead converts the bytes response body to a repr() (the str() call on bytes does not decode the data; instead you doubly escape escapes this way). Decode the bytes from the network as UTF-8, then feed that data to json.loads():
import json
import re
import urllib.request
from urllib.parse import quote_plus
baseurl = "https://en.wiktionary.org/w/api.php?action=query&titles={}&prop=revisions&rvprop=content&format=json"
def getIPAen(word):
url = baseurl.format(quote_plus(word))
jsondata = urllib.request.urlopen(url).read().decode('utf8')
data = json.loads(jsondata)
for page in data['query']['pages'].values():
for revision in page['revisions']:
if 'IPA' in revision['*']:
ipa = re.search(r"{IPA\|/(.*?)/\|", revision['*']).group(1)
print(ipa)
Note that I also make sure to quote the word value into the URL query string.
The above prints out any IPA it finds:
>>> getIPAen('watermelon')
ˈwɔːtəˌmɛlən
>>> getIPAen('chocolate')
ˈtʃɒk(ə)lɪt

Why doesn't Url Decode convert + to space?

Why are the + not converted to spaces:
>>> import urllib
>>> url = 'Q=Who+am+I%3F'
>>> urllib.unquote(url)
'Q=Who+am+I?'
>>>
There are two variants; urllib.unquote() and urllib.unquote_plus(). Use the latter:
>>> import urllib
>>> url = 'Q=Who+am+I%3F'
>>> urllib.unquote_plus(url)
'Q=Who am I?'
That's because there are two variants of URL quoting; one for URL path segments, and one for URL query parameters; the latter uses a different specification. See Wikipedia:
When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on a very early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with "+" instead of "%20".
So forms using the application/x-www-form-urlencoded mime type in a GET or POST request use slightly different rules, one where spaces are encoded to +, but when encoding characters in a URL, %20 is used. When decoding you need to pick the right variant. You have form data (from the query part of the URL) so you need to use unquote_plus().
Now, if you are parsing a query string, you may want to use the urlparse.parse_qs() or urlparse.parse_qsl() functions; these not only will use the right unquote*() function, but parse out the parameters into a dictionary or list of key-value pairs as well:
>>> import urlparse
>>> urlparse.parse_qs(url)
{'Q': ['Who am I?']}
>>> urlparse.parse_qsl(url)
[('Q', 'Who am I?')]

Unicode strings to byte strings without the addition of backslashes

I'm learning python by doing the python challenge using python3.3 and I'm on question eight. There's a comment in the markup providing you with two bz2-compressed unicode strings outputting byte strings, one for username and one for password. There's also a link where you need the decompressed credentials to enter.
One way to easily solve this is just to manually copy the strings and assign it to two variables as byte strings and then just use the bz2 library to decompress it:
>>>un=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(bz2.decompress(un).decode('utf-8'))
huge
But that's not for me since I want the answer by just running my python file.
My code like this:
>>>import bz2, re, requests
>>>url = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
>>>un = re.findall(r'un: \'(.*)\'',url.text)[0]
>>>correct=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(un,un is correct,sep='\n')
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
False
The problem is that when it converts from unicode string to byte string the escaping backslash gets added so that it cannot be read by bz2 module. I have tried everything I know and what got up when I searched.
How do I get it from unicode to byte so that it doesn't get changed?
Here it is a solution:
import urllib
import bz2
import re
def decode(line):
out = re.search(r"\'(.*?)\'",''.join(line)).group()
out = eval("b%s" % out)
return bz2.decompress(out)
#read lines that contain the encoded message
page = urllib.urlopen('http://www.pythonchallenge.com/pc/def/integrity.html').readlines()[20:22]
print "Click on the bee and insert: "
User_Name = decode(page[0])
print "User Name is: " + User_Name
Password = decode(page[1])
print "Password is: " + Password
The backslashes are present in the HTML source, so it's not surprising that the requests module preserves them. I don't have requests installed on my Python 3 environment, so I haven't been able to exactly replicate your situation, but it looks to me like if you start capturing the surrounding ' characters, you can use ast.literal_eval to parse the character sequence into a bytes array:
>>> test
"'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'"
>>> import ast
>>> res = ast.literal_eval("b%s" % test)
>>> import bz2
>>> len(bz2.decompress(res))
4
There are probably other ways, but why not use Python's built in knowledge that the byte sequence b'\\xaf' can be parsed into a bytes array?

Sending a List through an URL

I have a list that I need to send through a URL to a third party vendor. I don't know what language they are using.
The list prints out like this:
[u'1', u'6', u'5']
I know that the u encodes the string in utf-8 right? So a couple of questions.
Can I send a list through a URL?
Will the u's show up on the other end when going through the URL?
If so, how do I remove them?
I am not sure what keywords to search to help me out, so any resources would be helpful too.
Can I send a list through a URL?
No. A URL is just text. If you want a way to package structured information in it, you'll have to agree that with the provider you're talking to.
One standard encoding for structure in URLs, that might or might not be what you need, is the use of multiple parameters with the same name in a query string. This format comes from HTML form submissions:
http://www.example.com/script?par=1&par=6&par=5
might be considered to represent a parameter par with a three-item list as its value. Or maybe not, it's up to the receiver to decide. For example in a PHP application you would have had to name the parameter par[] to get it to accept the array value.
I know that the u encodes the string in utf-8 right?
No. a u'...' string is a native Unicode string, where each index represents a whole character and not a byte in any particular encoding. If you want UTF-8 bytes, write u'...'.encode('utf-8') before URL-encoding. UTF-8 is a good default choice, but again: what encoding the receiving side wants is up to that application.
Will the u's show up on the other end when going through the URL?
u is part of the literal representation of the string, just the same as the ' quotes themselves. They are not part of the string value and would not be echoed by print or when joined into other strings, unless you deliberately asked for the literal representation by calling repr.
u'' is not utf-8, its python unicode strings for python 2.x
To send it through url, you need to encode them with utf8 like .encode('utf-8'), and also need to urlencode, and list cannot send it through URL, you need to make it as string.
Basically, you need to do it in following steps
python list -> unicode string -> utf8 string -> url encode -> send it through proper urllib api
Incorrect. unicode literals use Python's internal encoding, decided when it was compiled.
You can't send anything "through" URLs. Pick a protocol instead. And encode before sending, probably to UTF-8.

how to open a URL with non utf-8 arguments

Using Python I need to transfer non utf-8 encoded data (specifically shift-jis) to a URL via the query string.
How should I transfer the data? Quote it? Encode in utf-8?
Thanks
Query string parameters are byte-based. Whilst IRI-to-URI and typed non-ASCII characters will typically use UTF-8, there is nothing forcing you to send or receive your own parameters in that encoding.
So for Shift-JIS (actually typically cp932, the Windows extension of that encoding):
foo= u'\u65E5\u672C\u8A9E' # 日本語
url= 'http://www.example.jp/something?foo='+urllib.quote(foo.encode('cp932'))
In Python 3 you do it in the quote function itself:
foo= '\u65E5\u672C\u8A9E'
url= 'http://www.example.jp/something?foo='+urllib.parse.quote(foo, encoding= 'cp932')
I don't know what unicode has to do with this, since the query string is a string of bytes. You can use the quoting functions in urllib to quote plain strings so that they can be passed within query strings.
By the »query string« you mean HTTP GET like in http:/{URL}?data=XYZ?
You have encoding what ever data you have via base64.b64encode using -_ as alternative character to be URL safe as an option. See here.

Categories