Decode/Unescape Unicode Entities in python

The response I receive contains a string with special characters, and whatever I try, I cannot get them to display as the real characters:
"XMMdpyi92N2o%2fENOpJIS3fYRa1k%2bYHFccNSYo1IIkpk%2fMbVY3tlk2gCjgq1lU6KB"
I do get the expected text when I decode the string with this site: https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
The response contains escape sequences such as %2f and %2b; the characters they represent are listed here: https://www.w3schools.com/tags/ref_urlencode.ASP
All I want is to decode these sequences automatically when the response arrives.
I am still learning Python, so I would appreciate help from anyone with experience.

You probably are looking for urllib.parse.unquote:
>>> import urllib.parse
>>> urllib.parse.unquote("XMMdpyi92N2o%2fENOpJIS3fYRa1k%2bYHFccNSYo1IIkpk%2fMbVY3tlk2gCjgq1lU6KB")
'XMMdpyi92N2o/ENOpJIS3fYRa1k+YHFccNSYo1IIkpk/MbVY3tlk2gCjgq1lU6KB'
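The same call works for any percent-encoded text. As a minimal sketch (Python 3; the short sample strings are made up for illustration), the snippet below also shows urllib.parse.unquote_plus, which additionally turns a literal + into a space. For Base64-looking values like the one in the question, where + is probably data, plain unquote is likely the safer choice:

```python
from urllib.parse import quote, unquote, unquote_plus

# Hypothetical sample value; %2f stands for "/" and %2b for "+".
encoded = "abc%2fdef%2bghi"

# unquote only resolves %XX sequences.
decoded = unquote(encoded)

# unquote_plus also maps a literal "+" to a space (form-style encoding).
form_decoded = unquote_plus("a+b%2bc")

# quote with safe="" re-encodes "/" and "+" (using uppercase hex digits).
round_trip = quote(decoded, safe="")

print(decoded, form_decoded, round_trip)
```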

Related

python request library giving wrong value single quotes

I am facing an issue when calling an API with the requests library. The problem is as follows.
The code:
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I read r.text, the apostrophe in the string comes back
like this: Bachelor\u2019s Degree. The response should actually be Bachelor's Degree.
I also tried json.loads, but the single-quote problem remains.
How do I get the string value correctly?

What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the Unicode code point for "RIGHT SINGLE QUOTATION MARK". This is perfectly correct and there is nothing wrong here: if you print() this string, you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
EDIT:
When I save it in the DB and then display it in HTML, it will cause an issue,
right?
Have you tried?
Your database connector is supposed to encode it to the proper encoding (according to your field, table, and client encoding settings).
As for "displaying it in HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x and on how you build your HTML, but if you're using a decent framework with a decent template engine (if not, you should reconsider your stack), chances are it will work out of the box.
As I already mentioned, learning about Unicode and encodings will save you a lot of time.
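If the six characters \u2019 are literally present in r.text (a backslash followed by u2019), one likely explanation is that the body is JSON and the escape has not been parsed yet. A minimal sketch in Python 3, assuming a made-up response body with a hypothetical "degree" key:

```python
import json

# Hypothetical response body: \u2019 here is a JSON escape sequence,
# kept literal by the raw string, as it would appear in the wire text.
body = r'{"degree": "Bachelor\u2019s Degree"}'

# Parsing the JSON resolves the escape into the real character U+2019.
data = json.loads(body)
degree = data["degree"]

print(degree)
```

With requests, r.json() performs the same parsing on the response body.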

The \u2019 is just an escape for a Unicode character; it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can decode and encode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree

From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you can use .content instead of .text to get the raw bytes of the response and decode them yourself, or set r.encoding before reading .text.
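Why a wrong guess matters can be shown without any network access. The sketch below (Python 3) uses a made-up payload and assumes the server sent UTF-8 while the client decoded it as Latin-1:

```python
# Hypothetical payload: the UTF-8 bytes a server might send for "oublié".
raw = "oublié".encode("utf-8")

# Decoding with the wrong codec does not fail; it silently garbles the text.
wrong = raw.decode("latin-1")

# Decoding with the codec the bytes were written in recovers the text.
right = raw.decode("utf-8")

print(repr(wrong), repr(right))
```

With requests, the analogous step would be r.content.decode("utf-8"), or assigning r.encoding = "utf-8" before reading r.text.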

Wrong encoding when displaying an HTML Request in Python

I do not understand why, when I make an HTTP request with the Requests library and display r.text, special characters (such as accented letters) come back encoded (é = &eacute;, for example).
Yet when I check r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I get the correct characters, but other times, not at all.
Try the following:
r = requests.get("https://gks.gs/login")
print r.text
Encoded characters are displayed; for example, we can see Mot de passe oubli&eacute; ?.
I do not understand why. Do you think it may be because of HTTPS? How can I fix this, please?

These are HTML character entity references. The easiest way to decode them is:
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
u'oubli\xe9'
In Python 3.4 and later (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
>>> import html
>>> html.unescape('oubli&eacute;')
'oublié'
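Entity references come in named, decimal, and hexadecimal forms. As a small Python 3 sketch (the sample strings are illustrative), html.unescape from the standard library handles all three, and html.escape goes the other way for the reserved characters only:

```python
import html

# Named, decimal, and hexadecimal references decode to the same character.
named = html.unescape("Mot de passe oubli&eacute; ?")
decimal = html.unescape("oubli&#233;")
hexadecimal = html.unescape("oubli&#xE9;")

# html.escape only replaces &, <, > and quotes; "é" is left untouched.
escaped = html.escape("<b>oublié</b>")

print(named, decimal, hexadecimal, escaped)
```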

These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.

Encoding special characters as &something; is "legal" in any HTML document, and although it looks a bit strange, such sequences are to be considered valid.
The text is supposed to be rendered by an HTML browser, which will display the correct result whether a character is written with this construct or directly.
For instructions on how to convert these encoded characters, see HTML Entity Codes to Text

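Those are HTML escape codes, often referred to as HTML entities. As you can see, HTML uses its own codes to replace reserved symbols.
You can use the HTMLParser library (Python 2; in Python 3.4 and later, html.unescape(r.text) does the same job):
import HTMLParser
parser = HTMLParser.HTMLParser()  # instantiate the parser; unescape is an instance method
parsed = parser.unescape(r.text)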

Some problems of Python crawler

I am struggling with a few problems in a Python crawler.
First, the websites use two different hexadecimal representations of Chinese characters. I can convert one of them (E4BDA0E5A5BD), but I have no way to convert the other (C4E3BAC3), or maybe I am missing some method. Both hexadecimal values stand for '你好' in Chinese.
Second, I found a website that can convert the hexadecimal, and to my surprise its answer is exactly the one I cannot convert myself.
The url is http://www.uol123.com/hantohex.html
That raised a question: how do I get the result that appears in the text box (I don't know what it is called exactly)? I used Firefox + HttpFox to observe the POST data, and I found that the result converted by the website is in the Content, here is the pic:
When I print the POST, it has POST data and some headers, but no info about Content.
Third, I googled how to use AJAX, and I found some code that shows how.
Here is the url http://outofmemory.cn/code-snippet/1885/python-moni-ajax-request-get-ajax-request-response
But when I run it, I get an error that says "ValueError: No JSON object could be decoded."
Pardon me, I am a newbie, so I cannot post images!
I am sincerely looking forward to your help.
Any help will be appreciated.

You're talking about different encodings for these Chinese characters. There are at least three widely used encodings: Guobiao (in mainland China), Big5 (in Taiwan), and Unicode (everywhere else).
Here's how to convert your characters into the different encodings (Python 2):
>>> a = u'你好'  # your original characters
>>> a
u'\u4f60\u597d'  # in Unicode
>>> a.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'  # in UTF-8
>>> a.encode('big5')
'\xa7A\xa6n'  # in Taiwanese Big5
>>> a.encode('gb2312-80')
'\xc4\xe3\xba\xc3'  # in Guobiao
>>>
You may check other available encodings here.
Ah, almost forgot. To convert from Unicode into an encoding, use the encode() method. To convert the encoded contents of the website back, use the decode() method. Just don't forget to specify the correct encoding.
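Applied to the two hex strings from the question, the decode() direction might look like the sketch below. It is written for Python 3, where bytes.fromhex and bytes.hex are available; the variable names are illustrative:

```python
# The two hex strings from the question.
utf8_hex = "E4BDA0E5A5BD"
gb_hex = "C4E3BAC3"

# Hex text -> raw bytes -> str, using the codec each byte sequence was written in.
from_utf8 = bytes.fromhex(utf8_hex).decode("utf-8")
from_gb = bytes.fromhex(gb_hex).decode("gb2312")

# The reverse direction: str -> bytes -> uppercase hex text.
back_to_gb_hex = "你好".encode("gb2312").hex().upper()

print(from_utf8, from_gb, back_to_gb_hex)
```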

how to url-safe encode a string with python? and urllib.quote is wrong

Hello, I was wondering if you know any other way to encode a string so it is URL-safe, because urllib.quote is doing it wrong; the output is different than expected:
If I try
urllib.quote('á')
I get
'%C3%A1'
But that's not the correct output; it should be
%E1
as demonstrated by the tool provided on this site.
And this is not me being difficult: the incorrect output of quote prevents the browser from finding resources. If I try
urllib.quote('\images\á\some file.jpg')
and then try the same string with the JavaScript tool I mentioned, I get these strings, respectively:
%5Cimages%5C%C3%A1%5Csome%20file.jpg
%5Cimages%5C%E1%5Csome%20file.jpg
Note how they are almost the same, but the URL produced by quote doesn't work and the other one does.
I tried calling encode('utf-8') on the string passed to quote, but it makes no difference.
I tried other Spanish words with accents and the ñ; they are all represented differently.
Is this a Python bug?
Do you know of a module that gets this right?

According to RFC 3986, %C3%A1 is correct. Characters are supposed to be converted to an octet stream using UTF-8 before the octet stream is percent-encoded. The site you link is out of date.
See Why does the encoding's of a URL and the query string part differ? for more detail on the history of handling non-ASCII characters in URLs.
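Both forms can be produced from the standard library. A minimal Python 3 sketch: urllib.parse.quote defaults to UTF-8, and its encoding parameter yields the legacy single-byte form when a server really expects ISO-8859-1 (whether yours does is an assumption to verify):

```python
from urllib.parse import quote, unquote

# UTF-8 is the default, matching RFC 3986.
utf8_form = quote("á")

# Legacy form for servers that expect ISO-8859-1 (Latin-1) bytes.
latin1_form = quote("á", encoding="latin-1")

# Decoding needs the matching codec to get the character back.
decoded_utf8 = unquote(utf8_form)
decoded_latin1 = unquote(latin1_form, encoding="latin-1")

print(utf8_form, latin1_form, decoded_utf8, decoded_latin1)
```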

OK, got it: I have to encode to ISO-8859-1 first, like this:
word = u'á'
word = word.encode('iso-8859-1')
print word

Python 2 reads source files as ASCII by default, so even though your file may be saved in a different encoding, your UTF-8 character is interpreted as two separate bytes.
Try putting a comment like this on the first or second line of your code to match the file encoding; you might also need to use u'á'.
# coding: utf-8

What about using unicode strings and the numeric representation (ord) of the char?
>>> print '%{0:X}'.format(ord(u'á'))
%E1

In this question, it seems someone wrote a fairly large function to convert to ASCII URLs, and that's what I need. But I was hoping there was an encoding tool in the standard library for the job.

Sending a List through an URL

I have a list that I need to send through a URL to a third party vendor. I don't know what language they are using.
The list prints out like this:
[u'1', u'6', u'5']
I know that the u encodes the string in utf-8 right? So a couple of questions.
Can I send a list through a URL?
Will the u's show up on the other end when going through the URL?
If so, how do I remove them?
I am not sure what keywords to search to help me out, so any resources would be helpful too.

Can I send a list through a URL?
No. A URL is just text. If you want a way to package structured information in it, you'll have to agree that with the provider you're talking to.
One standard encoding for structure in URLs, that might or might not be what you need, is the use of multiple parameters with the same name in a query string. This format comes from HTML form submissions:
http://www.example.com/script?par=1&par=6&par=5
might be considered to represent a parameter par with a three-item list as its value. Or maybe not, it's up to the receiver to decide. For example in a PHP application you would have had to name the parameter par[] to get it to accept the array value.
I know that the u encodes the string in utf-8 right?
No. A u'...' string is a native Unicode string, where each index represents a whole character and not a byte in any particular encoding. If you want UTF-8 bytes, write u'...'.encode('utf-8') before URL-encoding. UTF-8 is a good default choice, but again: what encoding the receiving side wants is up to that application.
Will the u's show up on the other end when going through the URL?
u is part of the literal representation of the string, just the same as the ' quotes themselves. They are not part of the string value and would not be echoed by print or when joined into other strings, unless you deliberately asked for the literal representation by calling repr.
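The repeated-parameter convention described above can be sketched with the standard library. This is Python 3; the parameter name par and the example.com URL come from the answer, and whether the vendor accepts this format is an assumption to confirm with them:

```python
from urllib.parse import urlencode, parse_qs

values = ["1", "6", "5"]

# doseq=True emits one name=value pair per list item.
query = urlencode({"par": values}, doseq=True)
url = "http://www.example.com/script?" + query

# A receiver that follows the same convention can rebuild the list.
received = parse_qs(query)["par"]

print(url, received)
```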

u'' is not UTF-8; it is Python's Unicode string type in Python 2.x.
To send it through a URL, you need to encode it as UTF-8 with .encode('utf-8') and then URL-encode it. A list cannot be sent through a URL as is; you need to turn it into a string first.
Basically, you need to follow these steps:
python list -> unicode string -> utf8 string -> url encode -> send it through proper urllib api
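The steps above can be sketched in Python 3, where urllib.parse.quote performs the UTF-8 encoding and the percent-encoding in one call. The comma separator is an assumption that both sides would have to agree on:

```python
from urllib.parse import quote, unquote

items = ["1", "6", "5"]

# list -> str: the comma separator is an assumed convention, not a standard.
joined = ",".join(items)

# str -> UTF-8 bytes -> percent-encoding; quote() does both steps in Python 3.
encoded = quote(joined, safe="")

# What a cooperating receiver would do to get the list back.
decoded_items = unquote(encoded).split(",")

print(encoded, decoded_items)
```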

Incorrect. Unicode literals use Python's internal encoding, decided when it was compiled.
You can't send anything "through" URLs. Pick a protocol instead. And encode before sending, probably to UTF-8.
