Double URL encoding for ASCII characters - Python

Which library in Python can be used for this, or how in general can double URL encoding of characters be done?
For example:
character 'a' with URL encoding -> %61
character 'a' with double URL encoding -> %2561
How can I get %2561 from 'a'?

It depends on your use case. If you don't need a robust solution, you can just use urllib to undo the double encoding:
import urllib.parse as parser
encoded_string = "%2561"
decoded_string = parser.unquote(encoded_string) # this would be '%61'
double_decoded_string = parser.unquote(decoded_string) # this would be 'a'
You can also drop the intermediate decoded_string variable and nest the calls:
double_encoded_string = parser.unquote(parser.unquote(encoded_string))
From there you could loop the decode step for as long as the string still contains percent-encoded characters.
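For the encoding direction asked about above, note that urllib.parse.quote() never escapes unreserved ASCII characters such as 'a', so the first layer has to be built by hand. A minimal sketch, with illustrative helper names (double_quote_all, full_unquote) that are not part of urllib:
import urllib.parse

def double_quote_all(text):
    # First layer: percent-encode every UTF-8 byte by hand, because
    # urllib.parse.quote() leaves unreserved ASCII such as 'a' untouched.
    single = ''.join('%{:02X}'.format(b) for b in text.encode('utf-8'))
    # Second layer: quote() escapes the '%' signs, turning %61 into %2561.
    return urllib.parse.quote(single, safe='')

def full_unquote(text):
    # Keep unquoting until the string stops changing (handles any depth).
    previous = None
    while text != previous:
        previous, text = text, urllib.parse.unquote(text)
    return text

print(double_quote_all('a'))  # %2561
print(full_unquote('%2561'))  # a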

Related

Converting %25C3%25BC in string to Umlaut python [duplicate]

In Python 2.7, given a URL like example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0, how can I decode it to the expected result, example.com?title=правовая+защита?
I tried url=urllib.unquote(url.encode("utf8")), but it seems to give a wrong result.
The data is UTF-8 encoded bytes escaped with URL quoting, so you want to decode it with urllib.parse.unquote(), which transparently handles decoding from percent-encoded data to UTF-8 bytes and then to text:
from urllib.parse import unquote
url = unquote(url)
Demo:
>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
The Python 2 equivalent is urllib.unquote(), but this returns a bytestring, so you'd have to decode manually:
from urllib import unquote
url = unquote(url).decode('utf8')
If you are using Python 3, you can use urllib.parse.unquote:
url = """example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0"""
import urllib.parse
urllib.parse.unquote(url)
gives:
'example.com?title=правовая+защита'
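Note that unquote() leaves the + in the query value alone, which matches the expected result above; if you also want + decoded to a space, urllib.parse.unquote_plus() or parse_qs() will do that. A small sketch:
from urllib.parse import unquote_plus, urlsplit, parse_qs

url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
print(unquote_plus(url))
# example.com?title=правовая защита

# Or decode just the query string into a dict:
print(parse_qs(urlsplit(url).query))
# {'title': ['правовая защита']}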
You can achieve the expected result with the requests library as well:
import requests
url = "http://www.mywebsite.org/Data%20Set.zip"
print(f"Before: {url}")
print(f"After: {requests.utils.unquote(url)}")
Output:
$ python3 test_url_unquote.py
Before: http://www.mywebsite.org/Data%20Set.zip
After: http://www.mywebsite.org/Data Set.zip
This might be handy if you are already using requests and don't want to import another library just for this job.
In HTML, URLs can contain HTML entities. The following replaces those, too:
# from urllib import unquote  # Python 2
from urllib.parse import unquote  # in Python 3, unquote lives in urllib.parse
from html import unescape
unescape(unquote('https://v.w.xy/p1/p22?userId=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx&confirmationToken=7uAf%2fxJoxRTFAZdxslCn2uwVR9vV7cYrlHs%2fl9sU%2frix9f9CnVx8uUT%2bu8y1%2fWCs99INKDnfA2ayhGP1ZD0z%2bodXjK9xL5I4gjKR2xp7p8Sckvb04mddf%2fiG75QYiRevgqdMnvd9N5VZp2ksBc83lDg7%2fgxqIwktteSI9RA3Ux9VIiNxx%2fZLe9dZSHxRq9AA'))
I know this is an old question, but I stumbled upon this via Google search and found that no one has proposed a solution with only built-in features.
So I quickly wrote my own.
Basically, a URL string can only contain these characters: A-Z, a-z, 0-9, -, ., _, ~, :, /, ?, #, [, ], @, !, $, &, ', (, ), *, +, ,, ;, %, and =; everything else is URL-encoded.
URL encoding is pretty straightforward: each byte of an illegal character's encoding is written as a percent sign followed by its two hexadecimal digits.
So a simple while loop over the characters works: if the current character is not a percent sign, append its byte as-is and advance the index by one; otherwise parse the two hex digits that follow the percent sign, append that byte, and advance the index by three. Accumulating the bytes and decoding them at the end should work perfectly.
Here is the code:
def url_parse(url):
    l = len(url)
    data = bytearray()
    i = 0
    while i < l:
        if url[i] != '%':
            # Unencoded character: take its code point as the byte value
            # (assumes the unencoded characters are plain ASCII).
            d = ord(url[i])
            i += 1
        else:
            # Percent-encoded byte: parse the two hex digits after '%'.
            d = int(url[i+1:i+3], 16)
            i += 3
        data.append(d)
    return data.decode('utf8')
I have tested it and it works perfectly.
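As a quick usage check (not part of the original answer), feeding it a percent-encoded Cyrillic fragment:
print(url_parse('%D0%BF%D1%80%D0%B0%D0%B2%D0%BE'))  # право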

urllib2 read to Unicode

I need to store the content of a site that can be in any language. And I need to be able to search the content for a Unicode string.
I have tried something like:
import urllib2
req = urllib2.urlopen('http://lenta.ru')
content = req.read()
The content is a byte stream, so I can't search it for a Unicode string.
I need some way, when I do urlopen and then read, to use the charset from the headers to decode the content and encode it into UTF-8.
After the operations you performed, you'll see:
>>> req.headers['content-type']
'text/html; charset=windows-1251'
and so:
>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)
ucontent is now a Unicode string (of 140655 characters) -- so for example to display a part of it, if your terminal is UTF-8:
>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>
and you can search, etc, etc.
Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker) but I'm going to bypass the difficult problem of inputting Unicode strings to an interactive Python interpreter (completely unrelated to the original question) to show how, once a Unicode string IS correctly input (I'm doing it by codepoints -- goofy but not tricky;-), search is absolutely a no-brainer (and thus hopefully the original question has been thoroughly answered). Again assuming a UTF-8 terminal:
>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).
To parse the Content-Type HTTP header, you could use the cgi.parse_header function:
import cgi
import urllib2
r = urllib2.urlopen('http://lenta.ru')
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset', 'utf-8')
unicode_text = r.read().decode(encoding)
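On newer Python 3 versions (the cgi module is deprecated since 3.11 and removed in 3.13), the same header can be parsed with email.message.Message instead; a minimal sketch:
from email.message import Message

# Feed the raw header value into an email Message and let it parse the parameters.
msg = Message()
msg['Content-Type'] = 'text/html; charset=windows-1251'
print(msg.get_content_type())     # text/html
print(msg.get_content_charset())  # windows-1251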
Another way to get the charset:
>>> import urllib2
>>> r = urllib2.urlopen('http://lenta.ru')
>>> r.headers.getparam('charset')
'utf-8'
Or in Python 3:
>>> import urllib.request
>>> r = urllib.request.urlopen('http://lenta.ru')
>>> r.headers.get_content_charset()
'utf-8'
Character encoding can also be specified inside the HTML document, e.g., <meta charset="utf-8">.
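A minimal sketch combining the two, assuming Python 3: prefer the charset from the headers and fall back to a crude regex scan for the meta tag (the helper name read_unicode and the regex are illustrative only; a real HTML parser would be more robust):
import re
import urllib.request

def read_unicode(url, default='utf-8'):
    with urllib.request.urlopen(url) as resp:
        raw = resp.read()
        # Charset from the Content-Type header, if the server sent one.
        encoding = resp.headers.get_content_charset()
    if encoding is None:
        # Fall back to a <meta charset=...> or http-equiv declaration in the page.
        match = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', raw, re.IGNORECASE)
        encoding = match.group(1).decode('ascii') if match else default
    return raw.decode(encoding, errors='replace')

text = read_unicode('http://lenta.ru')
print('Главное' in text)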

How to unquote URL quoted UTF-8 strings in Python

thestring = urllib.quote(thestring.encode('utf-8'))
This will encode it. How to decode it?
What about
backtonormal = urllib.unquote(thestring)
If you mean to decode a string from UTF-8, you can first transform the string to unicode and then to any other encoding you would like (or leave it as unicode), like this:
unicodethestring = unicode(thestring, 'utf-8')
latin1thestring = unicodethestring.encode('latin-1','ignore')
'ignore' means that if you encounter a character that is not in the Latin-1 character set, you simply ignore it.
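In Python 3 the same round trip goes through urllib.parse, which handles the UTF-8 step for you; a quick sketch:
from urllib.parse import quote, unquote

quoted = quote('über')   # '%C3%BCber'
print(unquote(quoted))   # über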

Unescape Python Strings From HTTP

I've got a string from an HTTP header, but it's been escaped. What function can I use to unescape it?
myemail%40gmail.com -> myemail@gmail.com
Would urllib.unquote() be the way to go?
I am pretty sure that urllib's unquote is the common way of doing this.
>>> import urllib
>>> urllib.unquote("myemail%40gmail.com")
'myemail@gmail.com'
There's also unquote_plus:
Like unquote(), but also replaces plus signs by spaces, as required for unquoting HTML form values.
In Python 3, these functions are urllib.parse.unquote and urllib.parse.unquote_plus.
The latter is used, for example, for query strings in HTTP URLs, where space characters are traditionally encoded as the plus character (+), and a literal + is percent-encoded as %2B.
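A quick illustration of the difference between the two (Python 3; the sample string is made up):
from urllib.parse import unquote, unquote_plus

print(unquote('a+b%20c%2Bd'))       # a+b c+d   ('+' is left alone)
print(unquote_plus('a+b%20c%2Bd'))  # a b c+d   ('+' becomes a space)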
In addition to these there is unquote_to_bytes(), which converts the given encoded string to bytes; this can be used when the encoding is not known or the encoded data is binary. However, there is no unquote_plus_to_bytes; if you need it, you can do:
from urllib.parse import unquote_to_bytes

def unquote_plus_to_bytes(s):
    # Turn '+' into spaces first, then percent-decode to bytes.
    if isinstance(s, bytes):
        s = s.replace(b'+', b' ')
    else:
        s = s.replace('+', ' ')
    return unquote_to_bytes(s)
More information on whether to use unquote or unquote_plus is available at URL encoding the space character: + or %20.
Yes, it appears that urllib.unquote() accomplishes that task. (I tested it against your example on codepad.)
