Percent-encoding a URL with Python

When I enter a URL into maps.google.com such as https://dl.dropbox.com/u/94943007/file.kml , it will encode this URL into:
https:%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml
What is this encoding called, and is there a way to encode a URL like this using Python?
I tried this:
The process is called URL encoding:
>>> urllib.quote('https://dl.dropbox.com/u/94943007/file.kml', '')
'https%3A%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml'
but did not get the expected result:
'https%3A//dl.dropbox.com/u/94943007/file.kml'
What I need is this:
https:%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml
How do I encode this URL properly?
The documentation here:
https://developers.google.com/maps/documentation/webservices/
states:
All characters to be URL-encoded are encoded using a '%' character and
a two-character hex value corresponding to their UTF-8 character. For
example, 上海+中國 in UTF-8 would be URL-encoded as
%E4%B8%8A%E6%B5%B7%2B%E4%B8%AD%E5%9C%8B. The string ? and the
Mysterians would be URL-encoded as %3F+and+the+Mysterians.

Use
urllib.quote_plus(url, safe=':')
Since you don't want the colon encoded you need to specify that when calling urllib.quote():
>>> expected = 'https:%2F%2Fdl.dropbox.com%2Fu%2F94943007%2Ffile.kml'
>>> url = 'https://dl.dropbox.com/u/94943007/file.kml'
>>> urllib.quote(url, safe=':') == expected
True
urllib.quote() takes a keyword argument safe that defaults to / and indicates which characters are considered safe and therefore don't need to be encoded. In your first example you used '' which resulted in the slashes being encoded. The unexpected output you pasted below where the slashes weren't encoded probably was from a previous attempt where you didn't set the keyword argument safe at all.
Overriding the default of '/' and instead excluding the colon with ':' is what finally yields the desired result.
Edit: Additionally, the API calls for spaces to be encoded as plus signs. Therefore urllib.quote_plus() should be used (whose keyword argument safe doesn't default to '/').
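In Python 3 the same functions live in urllib.parse. A minimal sketch of the call above, assuming Python 3 (the variable names and the second sample string are only illustrative):

```python
from urllib.parse import quote, quote_plus

url = 'https://dl.dropbox.com/u/94943007/file.kml'

# quote() treats '/' as safe by default; passing safe=':' replaces that
# default, so slashes are encoded and the colon is kept.
encoded = quote(url, safe=':')
print(encoded)

# quote_plus() additionally turns spaces into '+'.
with_space = quote_plus('my file.kml', safe=':')
print(with_space)
```

For a URL without spaces, quote() and quote_plus() give the same result when called with the same safe value.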

Related

Converting dictionary to url encoded data for passing application/x-www-form-urlencoded in python

I have a dictionary like this
data = {
    "name": "test name",
    "file_urls": [
        "/Users/tarequzzamankhan/Desktop/instagram.webp",
        "/Users/tarequzzamankhan/Desktop/instagram.webp"
    ],
    "file_type": "pdf",
    "user_name": "Tareq",
    "mobile": "018xxxxxxxxx",
    "address": "Dhaka",
    "email": "example@mail.com"
}
I want to convert this dict to a query string like:
payload='name=test%20name&file_urls=%5B%22%2FUsers%2Ftarequzzamankhan%2FDesktop%2Finstagram.webp%22%2C%20%22%2FUsers%2Ftarequzzamankhan%2FDesktop%2Finstagram.webp%22%5D&file_type=pdf&user_name=Tareq&mobile=018xxxxxxxxx&address=Dhaka&email=example%40mail.com'
I need it for calling an API.
I used urllib.parse.urlencode(data, doseq=True), but it does not generate the string above. Instead it generates:
payload='name=test+name&file_urls=%2FUsers%2Ftarequzzamankhan%2FDesktop%2Finstagram.webp&file_urls=%2FUsers%2Ftarequzzamankhan%2FDesktop%2Finstagram.webp&file_type=pdf&user_name=Tareq&mobile=018xxxxxxxxx&address=Dhaka&email=example%40mail.com'
The normal call to urllib.parse.urlencode should give you a string that's correct for sending as x-www-form-urlencoded data. However, to answer your question about the resulting string formatting...
urllib.parse.urlencode has a keyword argument called quote_via, which allows you to specify a method for how some characters like spaces or slashes are encoded, and by default they are done with urllib.parse.quote_plus where spaces are substituted with +. You can use urllib.parse.quote instead, which will get you a string like what you want.
The resulting string is a series of key=value pairs separated by '&'
characters, where both key and value are quoted using the quote_via
function. By default, quote_plus() is used to quote the values, which
means spaces are quoted as a '+' character and ‘/’ characters are
encoded as %2F, which follows the standard for GET requests
(application/x-www-form-urlencoded). An alternate function that can be
passed as quote_via is quote(), which will encode spaces as %20 and
not encode ‘/’ characters. For maximum control of what is quoted, use
quote and specify a value for safe.
import urllib.parse

data = {
    "name": "test name",
    "file_urls": [
        "/Users/tarequzzamankhan/Desktop/instagram.webp",
        "/Users/tarequzzamankhan/Desktop/instagram.webp"
    ],
    "file_type": "pdf",
    "user_name": "Tareq",
    "mobile": "018xxxxxxxxx",
    "address": "Dhaka",
    "email": "example@mail.com"
}

# I didn't use doseq=True because you seem to want all the `file_urls` as a single argument
qs = urllib.parse.urlencode(data, quote_via=urllib.parse.quote)
print(qs)
Output:
In [24]: print(qs)
name=test%20name&file_urls=%5B%27%2FUsers%2Ftarequzzamankhan%2FDesktop%2Finstagram.webp%27%2C%20%27%2FUsers%2Ftarequzzamankhan%2FDesktop%2Finstagram.webp%27%5D&file_type=pdf&user_name=Tareq&mobile=018xxxxxxxxx&address=Dhaka&email=example%40mail.com
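Note that the expected payload in the question has %22 (double quotes) around each path, while the output above has %27 (single quotes), because urlencode calls str() on the list. One possible way to get double quotes, assuming the API expects the list as a JSON string, is to serialize list values with json.dumps first. A sketch with shortened, hypothetical paths:

```python
import json
from urllib.parse import quote, urlencode

data = {
    "name": "test name",
    "file_urls": ["/tmp/a.webp", "/tmp/b.webp"],  # hypothetical paths
    "email": "example@mail.com",
}

# Serialize list values as JSON text so they are quoted with " rather than '.
prepared = {
    key: json.dumps(value) if isinstance(value, list) else value
    for key, value in data.items()
}

payload = urlencode(prepared, quote_via=quote)
print(payload)
```

Whether the receiving API actually parses the value as JSON is something to check against its documentation.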

UTF-8 decoding doesn't decode special characters in python

Hi I have the following data (abstracted) that comes from an API.
"Product" : "T\u00e1bua 21X40"
I'm using the following code to decode the data byte:
var = json.loads(cleanhtml(str(json.dumps(response.content.decode('utf-8')))))
cleanhtml is a regex-based function that I created to remove HTML tags from the returned data (it works correctly). However, decode('utf-8') is not removing sequences like \u00e1. My expected output is:
"Product" : "Tábua 21X40"
I've tried to use replace("\\u00e1", "á") but with no success. How can I replace this type of character and what type of character is this?
\u00e1 is another way of representing the á character when displaying the contents of a Python string.
If you open a Python interactive session and run print({"Product" : "T\u00e1bua 21X40"}) you'll see output of {'Product': 'Tábua 21X40'}. The \u00e1 doesn't exist in the string as those individual characters.
The \u escape sequence indicates that the following numbers specify a Unicode character.
Attempting to replace \u00e1 with á won't achieve anything because that's what it already is. Additionally, replace("\\u00e1", "á") is attempting to replace the individual characters of a slash, a u, etc and, as mentioned, they don't actually exist in the string in that way.
If you explain the problem you're encountering further then we may be able to help more, but currently it sounds like the string has the correct content but is just being displayed differently than you expect.
what type of character is this
Here
"Product" : "T\u00e1bua 21X40"
you can see the \u escape sequence followed by 4 hex digits: 00e1. Note that this is just a different representation of the same character, so
print("\u00e1" == "á")
output
True
Sequences like this are called escape sequences (sometimes loosely called character entities). There are different kinds, and this one is a JSON Unicode escape. For demonstration, enter your string here and click unescape.
For your question, if you are using Python, you can solve the issue by importing the json module and decoding the string as follows.
import json
string = json.loads(r'"T\u00e1bua 21X40"')  # raw string, so the JSON parser (not Python) handles \u00e1
print(string)
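To see the difference between an escape inside JSON text and the decoded character, here is a small sketch. It assumes the API returns JSON bytes shaped like the abstracted sample in the question; the variable names are illustrative. It also shows that json.dumps re-escapes non-ASCII characters by default, which is one possible reason \u00e1 keeps reappearing in the question's code:

```python
import json

# Raw bytes as an API might send them: a backslash, 'u', and four hex digits.
body = rb'{"Product": "T\u00e1bua 21X40"}'

text = body.decode('utf-8')   # still contains the six characters \u00e1
data = json.loads(text)       # the JSON parser turns the escape into the real character
print(data["Product"])

# json.dumps escapes non-ASCII characters unless ensure_ascii=False is passed.
escaped = json.dumps(data)
unescaped = json.dumps(data, ensure_ascii=False)
print(escaped)
print(unescaped)
```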

Display Japanese characters in Visual Studio Code using Python

According to this older answer, Python 3 strings are UTF-8 compliant by default. But in my web scraper using BeautifulSoup, when I try to print or display a URL, the Japanese characters show up as '%E3%81%82' or '%E3%81%B3' instead of the actual characters.
This Japanese website is the one I'm collecting information from, more specifically the URLs that correspond with the links in the clickable letter buttons. When you hover over for example あa, your browser will show you that the link you're about to click on is https://kokugo.jitenon.jp/cat/gojuon.php?word=あ. However, extracting the ["href"] property of the link using BeautifulSoup, I get https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82.
Both versions link to the same web page, but for the sake of debugging, I'm wondering if it's possible to make sure the displayed string contains the actual Japanese character. If not, how can I convert the string to accommodate this purpose?
It's called Percent-encoding:
Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI)
using only the limited US-ASCII characters legal within a URI.
Apply the unquote method from urllib.parse module:
urllib.parse.unquote(string, encoding='utf-8', errors='replace')
Replace %xx escapes by their single-character equivalent. The
optional encoding and errors parameters specify how to decode
percent-encoded sequences into Unicode characters, as accepted by the
bytes.decode() method.
string must be a str. Changed in version 3.9: string parameter
supports bytes and str objects (previously only str).
encoding defaults to 'utf-8'. errors defaults to 'replace',
meaning invalid sequences are replaced by a placeholder character.
Example:
from urllib.parse import unquote
encodedUrl = 'JapaneseChars%E3%81%82or%E3%81%B3'
decodedUrl = unquote( encodedUrl )
print( decodedUrl )
JapaneseCharsあorび
One can apply the unquote method to almost any string, even if already decoded:
print( unquote(decodedUrl) )
JapaneseCharsあorび
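Applied to the URL from the question, a sketch might look like this (the href value is copied from the question; in the scraper it would come from the link's ["href"] property):

```python
from urllib.parse import quote, unquote

href = 'https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82'

# Decode the percent-escapes for display or debugging.
readable = unquote(href)
print(readable)

# quote() reverses the step for a single value, here the query word.
word = readable.split('word=')[1]
print(quote(word))
```

The decoded form is convenient for logging; the percent-encoded form is the one to keep for making requests.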

Parameter for GET request changed by Python requests

import requests

params = {'token': 'JVFQ%2FFb5Ri2aKNtzTjOoErWvAaHRHsWHc8x%2FKGS%2FKAuoS4IRJI161l1rz2ab7rovBzGB86bGsh8pmDVaW8jj6AiJ2jT2rLIyt%2Bbpm80MCOE%3D'}
rsp = requests.get("http://xxxx/access", params=params)
print(rsp.url)
print(params)
When I print rsp.url, I get
http://xxxx/access?token=JVFQ%252FFb5Ri2aKNtzTjOoErWvAaHRHsWHc8x%252FKGS%252FKAuoS4IRJI161l1rz2ab7rovBzGB86bGsh8pmDVaW8jj6AiJ2jT2rLIyt%252Bbpm80MCOE%253D
JVFQ%2FF
JVFQ%252FF
The value of the ?token= in the url is different from params['token'].
Why does it change?
You passed in a URL encoded value, but requests encodes the value for you. As a result, the value is encoded twice; the % character is encoded to %25.
Don't pass in a URL-encoded value. Decode it manually if you must:
from urllib import unquote  # Python 2; in Python 3 use: from urllib.parse import unquote
params['token'] = unquote(params['token'])
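The double encoding can be reproduced without a network call. The sketch below uses Python 3's urllib.parse.urlencode as a stand-in for the encoding that requests applies to params (an assumption made to keep the example self-contained), with a shortened, made-up token:

```python
from urllib.parse import unquote, urlencode

token = 'JVFQ%2FFb5%2Bbpm%3D'  # already percent-encoded (shortened, illustrative)

# Encoding an already encoded value turns every '%' into '%25'.
double_encoded = urlencode({'token': token})
print(double_encoded)

# Decoding first means the value ends up encoded exactly once.
single_encoded = urlencode({'token': unquote(token)})
print(single_encoded)
```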
URLs use a special syntax. The % character is reserved in URLs: it is used as an escape character that lets you write other characters (such as space, #, and % itself).
Requests automatically encodes URL parameters where necessary, so each % had to be encoded as "%25". In other words, the parameter value itself never changed; it was only encoded into proper URL syntax. Every "%" you passed in became "%25".
You can check out URL Syntax info here if you want:
http://en.wikipedia.org/wiki/Uniform_resource_locator#Syntax
And you can encode/decode URLs here. Try encoding "%" or try decoding "%25" to see what you get:
http://www.url-encode-decode.com/

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped.. what function can I use to unescape it?
myemail%40gmail.com -> myemail@gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail@gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used, for example, for query strings in HTTP URLs, where the space character (' ') is traditionally encoded as a plus sign (+) and a literal + is percent-encoded as %2B.
In addition to these there is the unquote_to_bytes that converts the given encoded string to bytes, which can be used when the encoding is not known or the encoded data is binary data. However there is no unquote_plus_to_bytes, if you need it, you can do:
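from urllib.parse import unquote_to_bytes

def unquote_plus_to_bytes(s):
    # Replace '+' with a space first, then decode the %xx escapes to bytes.
    if isinstance(s, bytes):
        s = s.replace(b'+', b' ')
    else:
        s = s.replace('+', ' ')
    return unquote_to_bytes(s)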
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
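A short comparison of these functions, assuming Python 3 (the sample strings are illustrative):

```python
from urllib.parse import unquote, unquote_plus, unquote_to_bytes

form_value = 'a+b%2Bc'

plain = unquote(form_value)          # '+' is left alone, %2B becomes '+'
as_form = unquote_plus(form_value)   # '+' becomes a space, %2B becomes '+'
raw = unquote_to_bytes('%E3%81%82')  # bytes, with no text decoding applied

print(plain)
print(as_form)
print(raw)
```

For a value such as an email address taken from a header, where '+' can be a legitimate character, unquote() is usually the safer of the two.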
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)
