I do not understand why, when I make an HTTP request using the Requests library and then display the response with .text, special characters (such as accents) appear as HTML entities (é = &eacute;, for example).
Yet when I try r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.
For example:
import requests

r = requests.get("https://gks.gs/login")
print r.text
Encoded characters are displayed; for example we see Mot de passe oubli&eacute; ? where the page means Mot de passe oublié ?.
I do not understand why. Do you think it may be because of HTTPS? How can I fix this?
These are HTML character entity references; the easiest way to decode them is:
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
u'oubli\xe9'
In Python 3.x:
>>> import html.parser
>>> html.parser.HTMLParser().unescape('oubli&eacute;')
'oublié'
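Since Python 3.4 the same behaviour is exposed as the standalone html.unescape() function (the unescape() method on HTMLParser was deprecated and later removed in Python 3.9), so a minimal sketch of applying it to the original request, assuming Python 3, could look like this:
import html
import requests

r = requests.get("https://gks.gs/login")
decoded = html.unescape(r.text)  # turns &eacute; and other entities back into real characters
print(decoded)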
These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.
Encoding special characters using &something; is "legal" in any HTML and, despite looking a bit strange, such references are to be considered valid.
The text is meant to be rendered by an HTML browser, and it will display correctly regardless of whether the characters appear encoded with this construct or written directly.
For instructions on how to convert these encoded characters, see HTML Entity Codes to Text.
Those are HTML escape codes, often referred to as HTML entities. As you can see, HTML uses its own codes to represent reserved symbols.
You can use the HTMLParser library:
import HTMLParser

parser = HTMLParser.HTMLParser()
parsed = parser.unescape(r.text)
Related
There are special characters in a string that comes with the response; whatever I did, I could not get them to display as the real characters.
"XMMdpyi92N2o%2fENOpJIS3fYRa1k%2bYHFccNSYo1IIkpk%2fMbVY3tlk2gCjgq1lU6KB"
I can get the real text when I decode this string with this site: https://www.online-toolz.com/tools/text-unicode-entities-convertor.php
Some special characters in the response, such as %2f and %2b, are percent-encoded; they are listed here: https://www.w3schools.com/tags/ref_urlencode.ASP
All I want to do is automatically decode these characters that come with the response.
I am still learning Python, so I need help from anyone with more experience.
You probably are looking for urllib.parse.unquote:
>>> import urllib.parse
>>> urllib.parse.unquote("XMMdpyi92N2o%2fENOpJIS3fYRa1k%2bYHFccNSYo1IIkpk%2fMbVY3tlk2gCjgq1lU6KB")
'XMMdpyi92N2o/ENOpJIS3fYRa1k+YHFccNSYo1IIkpk/MbVY3tlk2gCjgq1lU6KB'
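To decode the whole response automatically as it arrives, you can run the response text straight through unquote. A minimal sketch, assuming the token comes back as the response body (the URL below is a made-up placeholder; substitute your real endpoint):
import urllib.parse
import requests

r = requests.get("https://example.com/api/token")  # hypothetical endpoint
decoded = urllib.parse.unquote(r.text)             # %2f -> /, %2b -> +, and so on
print(decoded)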
I am facing an issue when calling an API using the Requests library. The problem is described as follows.
The code:
r = requests.post(url, data=json.dumps(json_data), headers=headers)
When I look at r.text, the apostrophe in the string comes back like this: Bachelor\u2019s Degree. It should actually give me the response as Bachelor's Degree.
I also tried json.loads, but the single-quote problem remains the same.
How do I get the string value correctly?
What you see here ("Bachelor\u2019s Degree") is the string's inner representation, where "\u2019" is the Unicode code point for RIGHT SINGLE QUOTATION MARK. This is perfectly correct and there's nothing wrong here; if you print() this string you'll get what you expect:
>>> s = 'Bachelor\u2019s Degree'
>>> print(s)
Bachelor’s Degree
Learning about unicode and encodings might save you quite some time FWIW.
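For instance, when the \u2019 arrives inside a JSON response, json.loads already turns the escape into the real character; you only see \u2019 again because the interactive prompt shows the string's repr. A small sketch (the JSON body below is made up for illustration):
import json

payload = '{"degree": "Bachelor\\u2019s Degree"}'  # hypothetical API response body
data = json.loads(payload)
print(data["degree"])  # prints: Bachelor’s Degree (with the curly right single quote)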
EDIT:
When I save it in the db and then display it in HTML, it will cause an issue, right?
Have you tried?
Your database connector is supposed to encode it to the proper encoding (according to your fields, tables and client encoding settings).
As for "displaying it on HTML", it mostly depends on whether you're using Python 2.7.x or Python 3.x AND on how you build your HTML, but if you're using some decent framework with a decent template engine (if not, you should reconsider your stack), chances are it will work out of the box.
As I already mentioned, learning about Unicode and encodings will save you a lot of time.
It's just the string's escaped representation of that character; it is not "wrong".
string = 'Bachelor\u2019s Degree'
print(string)
Bachelor’s Degree
You can encode and then decode it again, but I can't see any reason why you would want to do that (this might not work in Python 2):
string = 'Bachelor\u2019s Degree'.encode().decode('utf-8')
print(string)
Bachelor’s Degree
From requests docs:
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text
On the response object, you may use .content instead of .text to get the raw bytes of the response, which you can then decode yourself with whatever encoding you know is correct.
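A minimal sketch of the difference between the three accessors (httpbin.org is just a convenient public test endpoint; any JSON API behaves the same way):
import requests

r = requests.get("https://httpbin.org/get")

raw = r.content   # bytes, exactly as received on the wire
text = r.text     # str, decoded with the encoding Requests guessed (r.encoding)
data = r.json()   # parsed JSON; any \uXXXX escapes become real characters

print(type(raw))  # <class 'bytes'> on Python 3
print(type(text)) # <class 'str'>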
I'm struggling with a question about a Python crawler.
First, the websites give two different hexadecimal representations of the same Chinese characters. I can convert one of them (E4BDA0E5A5BD), but the other one, C4E3BAC3, I have no way to convert; maybe I am missing some method. Both hexadecimal values represent '你好' in Chinese.
Second, I found a website which can convert the hexadecimal, and to my surprise its answer is exactly the one I cannot convert by myself.
The url is http://www.uol123.com/hantohex.html
Then my question became: how do I get the result that appears in the text box (I don't know exactly what it is called)? I used Firefox + HttpFox to observe the POST data, and I found that the result converted by the website is in the Content.
When I print the POST request, it shows the POST data and some headers, but no information about the Content.
Third, I then googled how to use AJAX, and I found some code showing how to make an AJAX request.
Here is the url: http://outofmemory.cn/code-snippet/1885/python-moni-ajax-request-get-ajax-request-response
But when I run it, I get an error which says "ValueError: No JSON object could be decoded."
And pardon me, I am a newbie, so I cannot post images!
I am sincerely looking forward to your help; any help will be appreciated.
You're talking about different encodings of these Chinese characters. There are at least three widely used encodings: Guobiao (mainland China), Big5 (Taiwan), and Unicode (everywhere else).
Here's how to convert your characters into the different encodings:
>>> a = u'你好'                # your original characters (Python 2)
>>> a
u'\u4f60\u597d'                # the Unicode code points
>>> a.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'     # in UTF-8
>>> a.encode('big5')
'\xa7A\xa6n'                   # in Taiwanese Big5
>>> a.encode('gb2312-80')
'\xc4\xe3\xba\xc3'             # in Guobiao
>>>
You may check the other available encodings in the Python documentation's list of standard codecs.
Ah, almost forgot: to convert from Unicode into a given encoding you use the encode() method; to convert back from the encoded contents of the web site you use the decode() method. Just don't forget to specify the correct encoding.
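Applied to the two hex strings from the question, a short sketch (Python 3 syntax; on Python 2 you would use 'c4e3bac3'.decode('hex') instead of bytes.fromhex):
utf8_bytes = bytes.fromhex('E4BDA0E5A5BD')  # the hex you could already convert
gb_bytes = bytes.fromhex('C4E3BAC3')        # the hex you could not

print(utf8_bytes.decode('utf-8'))   # 你好
print(gb_bytes.decode('gb2312'))    # 你好, the same text in Guobiao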
I used Requests to retrieve a URL which contains some Unicode characters, and I want to do some processing with it, then write it out.
import requests
import lxml.html

r = requests.get(url)
f = open('unicode_test_1.html','w'); f.write(r.content); f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f = open('unicode_test_2.html','w'); f.write(htmlOut); f.close()
In unicode_test_1.html all the characters look fine, but in unicode_test_2.html some characters changed to gibberish. Why is that?
I then tried:
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
It seems to be working now, but I don't know why this is happening. Should I always use latin1?
What's the difference between r.text and r.content, and why can't I write the HTML out using encoding='utf-8'?
You haven't specified whether you're using Python 2 or 3, and encoding is handled quite differently depending on which version you're using, but the following advice is more or less universal anyway.
The difference between r.text and r.content is covered in the Requests docs. Simply put, Requests will attempt to figure out the character encoding for you and return Unicode after decoding it, which is what you get via r.text. To get just the bytes, use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always using one encoding over another. Make a "Unicode sandwich": do any I/O in bytes and work with Unicode inside your application. If you start with bytes (isinstance(mytext, str)) you need to know the encoding to decode to Unicode; if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8 when writing out, as it can handle all the world's characters.
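Applied to your lxml example, a sketch of that sandwich might look like the following (Python 2 syntax to match your code; io.open gives an encoding-aware file object, encoding='unicode' tells lxml to return text rather than bytes, and the URL is a placeholder for whatever you are fetching):
import io
import requests
import lxml.html

url = "http://example.com/page.html"  # hypothetical; use your real URL

r = requests.get(url)
tree = lxml.html.fromstring(r.text)                      # Unicode inside the application
htmlOut = lxml.html.tostring(tree, encoding='unicode')   # still Unicode

with io.open('unicode_test_2.html', 'w', encoding='utf-8') as f:
    f.write(htmlOut)  # encode to UTF-8 only at the I/O boundary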
Make sure your editor, files, server and database are configured for UTF-8 as well, otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.
I am trying to scrape text from the web using BeautifulSoup 4 to parse it out. I am running into an issue when printing bs4-processed text to the console. Whenever I hit a character that was originally an HTML entity, like &rsquo;, I get garbage characters on the console. I believe bs4 is converting these entities to Unicode correctly, because if I try using another encoding to print out the text, it complains about the lack of a Unicode mapping for the character (like u'\u2019). I'm not sure why the print function gets confused over these characters. I've tried changing fonts, which changes the garbage characters, and I am on a Windows 7 machine with a US-English locale. Here is my code for reference; any help is appreciated. Thanks in advance!
#!/usr/bin/python
import json
import urllib2
import cookielib
import bs4
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?q=Tiguan\
&page=0&api-key=blah"
response = opener.open(url)
articles = response.read()
decoded = json.loads(articles)
totalpages = decoded['response']['meta']['hits']/10
for page in range(totalpages + 1):
    if page>0:
        url = "http://api.nytimes.com/svc/search/v2/articlesearch.json?\
q=Tiguan&page=" + str(page) + "&api-key=blah"
        response = opener.open(url)
        articles = response.read()
        decoded = json.loads(articles)
    for url in decoded['response']['docs']:
        print url['web_url']
        urlstring = url['web_url']
        art = opener.open(urlstring)
        soup = bs4.BeautifulSoup(art.read())
        goodstuff = soup.findAll('nyt_text')
        for tag in goodstuff:
            print tag.prettify().encode("UTF")
The problem has nothing to do with bs4, or HTML entities, or anything else. You could reproduce the exact same behavior, on most Windows systems, with a one-liner program to print out the same characters that are appearing as garbage when you try to print them, like this:
print u'\u2019'.encode('UTF-8')
The problem here is that, like the vast majority of Windows systems (and nothing else anyone uses in 2013), your default character set is not UTF-8, but something like CP1252.
So, when you encode your Unicode strings to UTF-8 and print those bytes to the console, the console interprets them as CP1252. Which, in this case, means you get â€™ instead of ’.
Changing fonts won't help. The UTF-8 encoding of \u2019 is the three bytes \xe2, \x80, and \x99, and the CP1252 meaning of those three bytes is â, €, and ™.
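A quick round trip in the interpreter shows the same thing (Python 3 syntax for brevity):
>>> b = '\u2019'.encode('utf-8')  # the three UTF-8 bytes of the right single quote
>>> b
b'\xe2\x80\x99'
>>> b.decode('cp1252')            # what a CP1252 console makes of those bytes
'â€™'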
If you want to encode manually for the console, you need to encode to the right character set, the one your console actually uses. You may be able to get that as sys.stdout.encoding.
Of course you may get an exception trying to encode things for the right character set, because 8-bit character sets like CP1252 can only handle about 240 of the 110K characters in Unicode. The only way to handle that is to use the errors argument to encode to either ignore them or replace them with replacement characters.
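A minimal sketch of that manual encoding, in Python 2 to match the question (the sample string is made up; the errors argument is passed positionally):
import sys

s = u'Tiguan\u2019s review'                        # hypothetical scraped text
console_encoding = sys.stdout.encoding or 'ascii'  # encoding may be None when output is piped
print s.encode(console_encoding, 'replace')        # unmappable characters become '?'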
Meanwhile, if you haven't read the Unicode HOWTO, you really need to. Especially if you plan to stick with Python 2.x and Windows.
If you're wondering why a few command-line programs seem to be able to get around these problems: Microsoft's solution to the character set problem is to create a whole parallel set of APIs that use 16-bit characters instead of 8-bit, and those APIs always use UTF-16. Unfortunately, many things, like the portable stdio wrappers that Microsoft provides for talking to the console and that Python 2.x relies on, only have the 8-bit API. Which means the problem isn't solved at all. Python 3.x no longer uses those wrappers, and there have been recurring discussions on making some future version talk UTF-16 to the console. But even if that happens in 3.4 (which seems very unlikely), that won't help you as long as you're using 2.x.
@abarnert's answer contains a good explanation of the issue.
In your particular case, you could just pass an encoding parameter to prettify() instead of the default utf-8.
If you are printing to the console, you could try to print Unicode directly:
print soup.prettify(encoding=None, formatter='html') # print Unicode
It may fail. If you pass 'ascii', then BeautifulSoup may use numerical character references instead of non-ASCII characters:
print soup.prettify('ascii', formatter='html')
This assumes that the current Windows codepage is an ASCII-based encoding (most of them are). It should also work if the output is redirected to a file or to another program via a pipe.
For portability, you could always print Unicode (encoding=None above) and use PYTHONIOENCODING to specify an appropriate character encoding, e.g. utf-8 for files and pipes, or ascii:xmlcharrefreplace to avoid garbage in a console.