decode URL only non-ascii character

decode URL only non-ascii character - python

Now I'm working on Wikipedia. In many articles, I noticed some URLs, for example, https://www.google.com/search?q=%26%E0%B8%89%E0%B8%B1%E0%B8%99, are very long. The example URL can be replaced with "https://www.google.com/search?q=%26ฉัน" (ฉัน is a Thai word) which is shorter and cleaner. However, when I use urllib.unquote function to decode URL, it decodes even %26 and get "https://www.google.com/search?q=&ฉัน" as the result. As you might have noticed, this URL is useless; it doesn't make a valid link.
Therefore, I want to know how to get decode link while it is valid. I think that decoding only non-ascii character would get the valid URL. Is it correct? and how to do that?
Thanks :)

Easiest way, you can replace all URL encode sequence below %80 (%00-%7F) with some placeholder, do a URL decode, and replace the original URL encode sequence back into the placeholder.
Another way is look for UTF-8 sequences. Your URL appears to be encoded in UTF-8, and Wikipedia uses UTF-8. You can see the Wikipedia entry for UTF-8 for how UTF-8 characters are encoded.
So, when encoded in URLs, each valid non-ascii UTF-8 character would follow one of these patterns:
(%C0-%DF)(%80-%BF)
(%E0-%EF)(%80-%BF)(%80-%BF)
(%F0-%F7)(%80-%BF)(%80-%BF)(%80-%BF)
(%F8-%FB)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
(%FC-%FD)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)(%80-%BF)
So you can match these patterns in the URL and unquote each character separately.
However, remember that not all URLs are encoded in UTF-8.
In some old websites, they still use other character sets, such as Windows-874 for Thai language.
In such cases, "ฉัน" for that particular website is encoded as "%A9%D1%B9" instead of "%E0%B8%89%E0%B8%B1%E0%B8%99". If you decode it using urllib.unquote you will get some garbled text like "?ѹ" instead of "ฉัน" and that could break the link.
So you have to be careful and check if the URL decoding break the link or not. Make sure that the URL you're decoding is in UTF-8.

Related

Finding second encoding of base64 XML string

I have some base64 encoded text fields in some XML data.
To get all the characters showing correctly, I think I need to find an additional encoding used on this text, which is not UTF-8 by the look of it. ?And maybe some other encoding aspect too, not sure..
I am not sure what order I should be encoding and decoding here - following https://www.geeksforgeeks.org/encoding-and-decoding-base64-strings-in-python/ I tried to first:
Encode the whole string with every possible Python2.7 encoding, then
decode with base64
(same result each time, no standard representation of problem characters)
Then I tried:
encode string with utf8
decode with base64
decode the bytes string with every possible Python2.7 encoding
However, none of these answer strings seem to get any standard representation of the problem characters, which should display as 'é' and 'ü'.
I enclose this example string, where I am sure what the final correct text should be.
Original base64 string: b64_encoded_bytes = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
Text string with correct 'é' and 'ü' characters at beginning, deduced from European language knowledge:
'Gründer Frédéric Jousset

selection committee for artist
recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich
Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
Note the '
' is HTML encoding of apparently new line character used in Windows, and '?' might also resolve to another correct character with correct encoding, or possibly '?' is actual display in original data.

It seems to be encoded with mac_roman:
>>> b64 = 'R3KfbmRlciBGco5kjnJpYyBKb3Vzc2V0JiMxMzsmIzEzO3NlbGVjdGlvbiBjb21taXR0ZWUgZm9yIGFydGlzdCByZWNpZGVuY3k6IFZpbmNpYW5jZSBEZXNwcmV0LCBLb3lvIEtvdW9oLCBDaHJpc3RpbmUgbWFjZWwsIEhhbnMtVWxyaWNoIE9icmlzdCwgTmF0YT9hIFBldHJlP2luLUJhY2hlbGV6LCBQaGlsaXBwZSBWZXJnbmU='
>>> bs = base64.b64decode(b64)
>>> bs
b'Gr\x9fnder Fr\x8ed\x8eric Jousset

selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne'
>>> print(bs.decode('mac_roman'))
Gründer Frédéric Jousset

selection committee for artist recidency: Vinciance Despret, Koyo Kouoh, Christine macel, Hans-Ulrich Obrist, Nata?a Petre?in-Bachelez, Philippe Vergne
The question marks in "Nata?a Petre?in-Bachelez" are present in the original data, presumably the result of a previous encoding/decoding problem.

lxml.html parsing and utf-8 with requests

i used requests to retrieve a url which contains some unicode characters, and want to do some processing with it , then write it out.
r=requests.get(url)
f=open('unicode_test_1.html','w');f.write(r.content);f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
in unicode_test_1.html, all chars looks fine, but in unicode_test_2.html, some chars changed to gibberish, why is that ?
i then tried
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
it seems it's working now. but i don't know why is this happening, always use latin1 ?
what's the difference between r.text and r.content, and why can't i write html out using encoding='utf-8' ?

You've not specified if you're using python 2 or 3. Encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put Requests will attempt to figure out the character encoding for you and return Unicode after decoding it. This which is accessible via r.text. To get just the bytes use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always use one encoding over another. Make a Unicode sandwich by doing any I/O in bytes and work with Unicode in your application. If you start with bytes (isinstance(mytext, str)) you need to know the encoding to decode to Unicode, if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8 as it will handle all the worlds characters.
Make sure your editor, files, server and database are configured to UTF-8 also otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.

how to url-safe encode a string with python? and urllib.quote is wrong

Hello i was wondering if you know any other way to encode a string to a url-safe, because urllib.quote is doing it wrong, the output is different than expected:
If i try
urllib.quote('á')
i get
'%C3%A1'
But thats not the correct output, it should be
%E1
As demostrated by the tool provided here this site
And this is not me being difficult, the incorrect output of quote is preventing the browser to found resources, if i try
urllib.quote('\images\á\some file.jpg')
And then i try with the javascript tool i mentioned i get this strings respectively
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how is almost the same but the url provided by quote doesn't work and the other one it does.
I tried messing with encode('utf-8) on the string provided to quote but it does not make a difference.
I tried with other spanish words with accents and the ñ they all are differently represented.
Is this a python bug?
Do you know some module that get this right?

According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.

Ok, got it, i have to encode to iso-8859-1 like this
word = u'á'
word = word.encode('iso-8859-1')
print word

Python is interpreted in ASCII by default, so even though your file may be encoded differently, your UTF-8 char is interpereted as two ASCII chars.
Try putting a comment as the first of second line of your code like this to match the file encoding, and you might need to use u'á' also.
# coding: utf-8

What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1

In this question it seems some guy wrote a pretty large function to convert to ascii urls, thats what i need. But i was hoping there was some encoding tool in the std lib for the job.

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode, however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate some various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
dmessage,
])
Does anyone have any advice in removing unicode characters to I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe

You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like â€™ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage= contact.message.encode('cp1252', 'ignore')
alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage= contact.message.encode('ascii', 'ignore')

Encoding is a pain, but if you're working in django have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in python encode() and decode() for strings, but you have to specify the encoding for those and honestly, it's a pain.

[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
0x2019 : u"'",
# etc
#
smashed = input_string.translate(smashcii)

Why does it print funny characters? unicode problem?

The user entered the word
éclair
into the search box.
Showing results 1 - 10 of about 140 for �air.
Why does it show the weird question mark?
I'm using Django to display it:
Showing results 1 - 10 of about 140 for {{query|safe}}

It's an encoding problem. Most likely your form or the output page is not UTF-8 encoded.
This article is very good reading on the issue: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
You need to check the encoding of
the HTML page where the user input the word
the HTML page you are using to output the word
the multi-byte ability of the functions you use to work with the string (though that probably isn't a problem in Python)
If the search is going to apply to a data base, you will need to check the encoding of the database connection, as well as the encoding of your tables and columns.

This is the result when you interpret data that is not encoded in UTF-8 as UTF-8 encoded.
The interpreter expects from the code point of your first character of the word éclair a multibyte encoded character with a length of three characters, consumes the next two characters but can’t decode it (probably invalid byte sequence). For this case the REPLACEMENT CHARACTER � (U+FFFD) is shown.
So in your case you just need to really encode your data with UTF-8.

You are serving the page with the wrong character encoding (charset). Check that you are using the same encoding throughout all your application (for example UTF-8). This includes:
HTTP headers from web server (Content-Type: text/html;charset=utf-8)
Communication with database (i.e SET NAMES 'utf-8')

It would also be good to check your browser encoding setting.

I second the responses above. Some other things from the top of my head:
If you're using e.g. MySQL database, then it could be good to create your database using:
CREATE DATABASE x CHARACTER SET UTF8
You can also check this: http://docs.djangoproject.com/en/dev/ref/settings/#default-charset

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.