Python correct encoding of Website (Beautiful Soup) - python

I am trying to load an HTML page and output its text. Even though I am getting the webpage correctly, BeautifulSoup somehow destroys the encoding.
Source:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup
url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
encodedText = r.text.encode("utf-8")
soup = BeautifulSoup(encodedText)
text = str(soup.findAll(text=True))
print text.decode("utf-8")
Excerpt Output:
...Odenw\xc3\xa4lderisch...
this should be Odenwälderisch

You are making two mistakes: you are mishandling encoding, and you are treating a result list as something that can safely be converted to a string without loss of information.
First of all, don't use response.text! It is not BeautifulSoup that's at fault here; you are re-encoding a Mojibake. The requests library defaults to Latin-1 encoding for text/* content types when the server doesn't explicitly specify an encoding, because the HTTP standard states that that is the default.
See the Encoding section of the Advanced documentation:
The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.
Bold emphasis mine.
Pass in the response.content raw data instead:
soup = BeautifulSoup(r.content)
I see that you are using BeautifulSoup 3. You really want to upgrade to BeautifulSoup 4 instead; version 3 was discontinued in 2012 and contains several bugs. Install the beautifulsoup4 project, and use from bs4 import BeautifulSoup.
BeautifulSoup 4 usually does a great job of figuring out the right encoding to use when parsing, either from an HTML <meta> tag or by statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it into BeautifulSoup from the response, but do test first whether requests used a default:
encoding = r.encoding if 'charset' in r.headers.get('content-type', '').lower() else None
parser = 'html.parser' # or lxml or html5lib
soup = BeautifulSoup(r.content, parser, from_encoding=encoding)
Last but not least, with BeautifulSoup 4, you can extract all text from a page using soup.get_text():
text = soup.get_text()
print text
You are instead converting a result list (the return value of soup.findAll()) to a string. This can never work as you expect, because containers in Python use repr() on each element in the list to produce a debugging string, and for strings that means you get escape sequences for anything that is not a printable ASCII character.
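A quick sketch of that last point (Python 2, with the word from the question wrapped in a one-element list purely for illustration):
items = [u'Odenw\xe4lderisch']   # the word from the question as a proper unicode string
print items       # [u'Odenw\xe4lderisch']  <- repr() escaping of each element, not readable text
print items[0]    # Odenwälderisch          <- printing the string itself is fine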

It's not BeautifulSoup's fault. You can see this by printing out encodedText, before you ever use BeautifulSoup: the non-ASCII characters are already gibberish.
The problem here is that you are mixing up bytes and characters. For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.
A look at the requests documentation shows that r.text is made of characters, not bytes. You shouldn't be encoding it. If you try to do so, you will make a byte string, and when you try to treat that as characters, bad things will happen.
There are two ways to get around this:
Use the raw undecoded bytes, which are stored in r.content, as Martijn suggested. Then you can decode them yourself to turn them into characters.
Let requests do the decoding, but just make sure it uses the right codec. Since you know that's UTF-8 in this case, you can set r.encoding = 'utf-8'. If you do this before you access r.text, then when you do access r.text, it will have been properly decoded, and you get a character string. You don't need to mess with character encodings at all.
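A minimal sketch of both options, reusing the URL from the question (Python 2 style, to match the original code):
import requests

url = "http://www.columbia.edu/~fdc/utf8/"

# Option 1: take the raw bytes and decode them yourself
r = requests.get(url)
text = r.content.decode('utf-8')   # we happen to know this page is UTF-8

# Option 2: tell requests which codec to use, then let it decode
r = requests.get(url)
r.encoding = 'utf-8'               # set this before touching r.text
text = r.text                      # already a character (unicode) string

print text[:200]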
Incidentally, Python 3 makes it somewhat easier to maintain the difference between character strings and byte strings, because it requires you to use different types of objects to represent them.
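For example (a Python 3 session; the values shown are what the interpreter echoes back):
>>> b'Odenw\xc3\xa4lderisch'.decode('utf-8')   # bytes -> str
'Odenwälderisch'
>>> 'Odenwälderisch'.encode('utf-8')           # str -> bytes
b'Odenw\xc3\xa4lderisch'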

There are a couple of errors in your code:
First of all, your attempt at re-encoding the text is not needed.
Requests can give you the native encoding of the page and BeautifulSoup can take this info and do the decoding itself:
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup  # BeautifulSoup 4; the html5lib parser below requires bs4
url = "http://www.columbia.edu/~fdc/utf8/"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html5lib")
Second of all, you have an encoding issue. You are probably trying to visualize the results on the terminal, and what you will get is the escaped Unicode representation of every character in the text that is not in the ASCII set. You can check the results like this:
res = [item.encode("ascii","ignore") for item in soup.find_all(text=True)]

Related

Fetching URL and converting to UTF-8 Python

I would like to do my first project in Python but I have a problem with encoding. When I fetch the data it shows encoded bytes instead of my native letters, for example '\xc4\x87' instead of 'ć'. The code is below:
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test)
print(sys.stdin.encoding)
z = "ł"
print(z)
print(z.encode("utf-8"))
I know the code here is poor, but I have tried many options to change the encoding. I wrote z = "ł" to check whether it can print a 'special' letter, and it does. I tried to encode it and that also works as it should. sys.stdin.encoding shows cp852.
The data you read from a urlopen() response is encoded data. You'd need to first decode that data using the right encoding.
You appear to have downloaded UTF-8 data; you'd have to decode that data first before you have text:
test = page.read().decode('utf8')
However, it is up to the server to tell you what data was received. Check for a characterset in the headers:
encoding = page.info().get_param('charset')
This can still be None; many data formats include the encoding as part of the format. XML for example is UTF-8 by default but the XML declaration at the start can contain information about what codec was used for that document. An XML parser would extract that information to ensure you get properly decoded Unicode text when parsing.
You may not be able to print that data; the 852 codepage can only handle 256 different codepoints, while the Unicode standard is far larger.
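Putting that together, a minimal sketch (the UTF-8 fallback is an assumption for servers that don't declare a charset):
import urllib.request

with urllib.request.urlopen("http://olx.pl/") as page:
    raw = page.read()                                    # bytes, still encoded
    # charset the server declared, if any; fall back to UTF-8 (an assumption)
    encoding = page.info().get_param('charset') or 'utf-8'

text = raw.decode(encoding)                              # now a str of characters
print(text[:200])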
urlopen() is returning a bytes object to you. That means it's a raw, encoded stream of bytes. Python 3 prints that in repr format, which uses escape codes for non-ASCII characters. To get the canonical Unicode you would have to decode it. The right way to do that would be to inspect the header and look for the encoding declaration, but for this we can assume UTF-8, and you can simply decode it as such, not encode it.
import urllib.request
import sys
page = urllib.request.urlopen("http://olx.pl/")
test = page.read()
print(test.decode("utf-8")) # <- note change
Now, Python 3 defaults to UTF-8 source encoding, so you can embed non-ASCII literals like this, provided your editor supports Unicode and saves the file as UTF-8.
z = "ł"
print(z)
Printing it will only work if your terminal supports UTF-8 encoding. On Linux and OS X it does, so this is not a problem there.
The others are correct, but I'd like to offer a simpler solution. Use requests. It's 3rd party, so you'll need to install it via pip:
pip install requests
But it's a lot simpler to use than the urllib libraries. For your particular case, it handles the decoding for you out of the box:
import requests
r = requests.get("http://olx.pl/")
print(r.encoding)
# UTF-8
print(type(r.text))
# <class 'str'>
print(r.text)
# The HTML
Breakdown:
get sends an HTTP GET request to the server and returns the response.
We print the encoding requests thinks the text is in. It chooses this based on the response header Martijn mentions.
We show that r.text is already a decoded text type (unicode in Python 2 and str in Python 3).
Then we actually print the response.
Note that we don't have to print the encoding or type; I've just done so for diagnostic purposes to show what requests is doing. requests is designed to simplify a lot of other details of working with HTTP requests, and it does a good job of it.

BeautifulSoup4 cannot get the printing right. Python3

I'm currently learning Python 3. I am scraping a site for some data, which works fine, but when it comes to printing out the p tags I just can't get it to work as I expect.
import urllib
import lxml
from urllib import request
from bs4 import BeautifulSoup
data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px'})
dialog = stat.findChildren('p')
childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # Have tried with str(childtext.encode('utf-8', 'ignore'))
print(childlist)
That all works, but what gets printed is bytes:
b'This is a ptag.string'
b'\xc2\xa0 (probably &nbsp'
b'this is anotherone'
Real sample text that is ASCII-encoded:
b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
Note that Announcement is a p tag and the rest is a strong tag under a p tag.
The same sample with utf-8 encoding:
b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "
I WISH to get:
"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
As you can see, the non-ASCII chars are stripped in the "ascii" version, but that also destroys some line breaks, and I have yet to figure out how to print it correctly; also, the b's are still there!
I really can't figure out how to remove the b's and encode or decode this properly. I have tried every "solution" I can google up.
HTML Content = utf-8
I would most rather not change the full data before processing because it will mess up my other work and I don't think it is needed.
Prettify does not work.
Any suggestions?
First, you're getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!
As a guess, I assume you're looking to print strings from HTML nicely, pretty much as they would be seen in a browser. For that, you need to unescape the HTML entities, as described in this SO answer, which for Python 3.5 means:
import html
html.unescape(childtext)
Among other things, this will convert any &nbsp; entities in the HTML string into '\xa0' characters, which are printed as spaces. However, if you want to break lines on these characters despite them literally meaning "non-breaking space", you'll have to replace them with actual spaces before printing, e.g. using x.replace('\xa0', ' ').
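Putting it together, a minimal sketch based on the question's own loop (dialog and childlist as in the question; the replace of non-breaking spaces is optional):
import html

childlist = []
for child in dialog:                        # dialog from stat.findChildren('p')
    text = html.unescape(child.get_text())  # keep it as a str, don't .encode()
    childlist.append(text.replace('\xa0', ' '))

for line in childlist:
    print(line)                             # readable text, no b'...' prefixes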

lxml.html parsing and utf-8 with requests

I used requests to retrieve a URL which contains some Unicode characters, and I want to do some processing with it, then write it out.
import requests
import lxml.html

r = requests.get(url)
f=open('unicode_test_1.html','w');f.write(r.content);f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
In unicode_test_1.html all the chars look fine, but in unicode_test_2.html some chars have changed to gibberish. Why is that?
I then tried:
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
It seems to be working now, but I don't know why this is happening. Should I always use latin1?
What's the difference between r.text and r.content, and why can't I write the HTML out using encoding='utf-8'?
You've not specified whether you're using Python 2 or 3. Encoding is handled quite differently depending on which version you're using, but the following advice is more or less universal anyway.
The difference between r.text and r.content is covered in the Requests docs. Simply put, Requests will attempt to figure out the character encoding for you and return Unicode after decoding it; this is accessible via r.text. To get just the bytes, use r.content.
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always using one encoding over another. Make a "Unicode sandwich": do any I/O in bytes and work with Unicode inside your application. If you start with bytes (isinstance(mytext, str) in Python 2) you need to know the encoding to decode to Unicode; if you start with Unicode (isinstance(mytext, unicode) in Python 2) you should encode to UTF-8, as it will handle all the world's characters.
Make sure your editor, files, server and database are configured for UTF-8 too, otherwise you'll get more 'gibberish'.
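To make the sandwich concrete for this case, a sketch (assuming Python 2 to match the question; io.open gives an encoding-aware file object there):
import io
import requests
import lxml.html

r = requests.get(url)                               # url as in the question
tree = lxml.html.fromstring(r.content)              # bytes in: let lxml sniff the encoding
out = lxml.html.tostring(tree, encoding='unicode')  # work with Unicode inside the program

with io.open('unicode_test_2.html', 'w', encoding='utf-8') as f:
    f.write(out)                                    # encode to UTF-8 only at the I/O boundary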
If you want further help post the source files and output of your script.

python3: different charset support

I am using Python 3.3 on Windows 7.
if "iso-8859-1" in str(source):
source = source.decode('iso-8859-1')
if "utf-8" in str(source):
source = source.decode('utf-8')
So, currently my application is valid for the above two charsets only ... but I want to cover every possible charset.
Actually, I'm finding these charsets manually in the source of each website, and I have found that not all the websites in the world use just these two. Sometimes websites do not declare their charset in their HTML source at all! So my application fails to move ahead there.
What should I do to detect a charset automatically and decode according to it?
Please try to explain in depth and with examples if possible. You can suggest important links too.
BeautifulSoup provides a class, UnicodeDammit, that goes through a number of steps (listed below) to determine the encoding of any string you give it, and converts it to Unicode. It's pretty straightforward to use:
from bs4 import UnicodeDammit
dammit = UnicodeDammit(encoded_string)
unicode_string = dammit.unicode_markup
If you use BeautifulSoup to process your HTML, it will automatically use UnicodeDammit to convert it to Unicode for you.
According to the documentation for BeautifulSoup 3, these are the actions UnicodeDammit takes:
Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:
An encoding you pass in as the fromEncoding argument to the soup constructor.
An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.
UTF-8
Windows-1252
That explanation doesn't seem to be present in the BeautifulSoup 4 documentation, but presumably BS4's UnicodeDammit works in much the same way (though I haven't checked the source to be sure).
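For the original code, a minimal sketch (assuming source holds the raw bytes of the downloaded page, as in the question) that replaces the manual if-checks:
from bs4 import UnicodeDammit

dammit = UnicodeDammit(source)          # source: the raw bytes of the page
print(dammit.original_encoding)         # whatever was detected, e.g. 'iso-8859-1' or 'utf-8'
source = dammit.unicode_markup          # the document decoded to a str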

urllib2 read to Unicode

I need to store the content of a site that can be in any language. And I need to be able to search the content for a Unicode string.
I have tried something like:
import urllib2
req = urllib2.urlopen('http://lenta.ru')
content = req.read()
The content is a byte stream, so I can't search it for a Unicode string.
I need some way, when I do urlopen and then read, to use the charset from the headers to decode the content and encode it into UTF-8.
After the operations you performed, you'll see:
>>> req.headers['content-type']
'text/html; charset=windows-1251'
and so:
>>> encoding=req.headers['content-type'].split('charset=')[-1]
>>> ucontent = unicode(content, encoding)
ucontent is now a Unicode string (of 140655 characters) -- so for example to display a part of it, if your terminal is UTF-8:
>>> print ucontent[76:110].encode('utf-8')
<title>Lenta.ru: Главное: </title>
and you can search, etc, etc.
Edit: Unicode I/O is usually tricky (this may be what's holding up the original asker) but I'm going to bypass the difficult problem of inputting Unicode strings to an interactive Python interpreter (completely unrelated to the original question) to show how, once a Unicode string IS correctly input (I'm doing it by codepoints -- goofy but not tricky;-), search is absolutely a no-brainer (and thus hopefully the original question has been thoroughly answered). Again assuming a UTF-8 terminal:
>>> x=u'\u0413\u043b\u0430\u0432\u043d\u043e\u0435'
>>> print x.encode('utf-8')
Главное
>>> x in ucontent
True
>>> ucontent.find(x)
93
Note: Keep in mind that this method may not work for all sites, since some sites only specify character encoding inside the served documents (using http-equiv meta tags, for example).
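If you need to handle such sites as well, a hedged fallback (sniff_meta_charset is an illustrative helper, not part of any library) is to sniff the charset from the raw bytes before decoding:
import re

def sniff_meta_charset(raw_bytes):
    # look for charset=... in the first couple of KB of the document
    # (covers <meta charset="..."> and http-equiv Content-Type declarations)
    m = re.search(br'charset=["\']?([A-Za-z0-9_\-]+)', raw_bytes[:2048])
    return m.group(1).decode('ascii') if m else None

encoding = sniff_meta_charset(content) or 'utf-8'   # content from req.read(); UTF-8 is a fallback assumption
ucontent = content.decode(encoding)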
To parse the Content-Type HTTP header, you could use the cgi.parse_header function:
import cgi
import urllib2
r = urllib2.urlopen('http://lenta.ru')
_, params = cgi.parse_header(r.headers.get('Content-Type', ''))
encoding = params.get('charset', 'utf-8')
unicode_text = r.read().decode(encoding)
Another way to get the charset:
>>> import urllib2
>>> r = urllib2.urlopen('http://lenta.ru')
>>> r.headers.getparam('charset')
'utf-8'
Or in Python 3:
>>> import urllib.request
>>> r = urllib.request.urlopen('http://lenta.ru')
>>> r.headers.get_content_charset()
'utf-8'
Character encoding can also be specified inside html document e.g., <meta charset="utf-8">.
