Python Beautiful Soup, saved text not displaying properly in original encoding - python

I'm having trouble with a saved file not displaying properly when viewed in its original encoding.
I'm downloading a web page, searching it for content I want and then writing that content to a file.
The encoding on the site is 'iso-8859-1', or so Chrome and Beautiful Soup tell me, and the page appears perfectly when viewed using that encoding on the original site.
When I download the page and try to view it, however, I end up with strange characters (HTML entities?) like these:
â€œ , â€™
If I manually set Chrome's encoding to 'UTF-8' when viewing the saved page, it appears normally, as does the original page if I set that to 'UTF-8'.
I'm not sure what to do with this. I would change the encoding before writing the text to a file, but I get ASCII errors when I try that.
Here is a sample page (possible adult content):
http://original.adultfanfiction.net/story.php?no=600106516
And the code I am using to get the text from the page:
import requests
from bs4 import BeautifulSoup

site = requests.post(url, allow_redirects=False)
html = site.text                                  # unicode text, decoded by requests
soup = BeautifulSoup(html)
rawStory = soup.findAll("td", {"colspan": "3"})   # a ResultSet (a list of Tags)
story = str(rawStory)                             # stringify the whole ResultSet
return story
I turn the ResultSet into a string so that I can write it to a file; I don't know if that could be part of the problem. If I print the html to the console after requesting it, but before doing anything else to it, it displays improperly in the console as well.

I'm 90% sure that your problem is just that you're asking BeautifulSoup for a UTF-8 fragment and then trying to use it as ISO-8859-1, which obviously isn't going to work. The documentation explains all of this pretty nicely.
You're calling str. As Non-pretty printing explains:
If you just want a string, with no fancy formatting, you can call unicode() or str() on a BeautifulSoup object, or a Tag within it… The str() function returns a string encoded in UTF-8.
As Output encoding explains:
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with.
It then follows up with an example of almost exactly what you're doing—parsing a Latin-1 HTML document and writing it back as UTF-8—and then immediately explaining how to fix it:
If you don’t want UTF-8, you can pass an encoding into prettify()… You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string…
So, that's all you have to do.
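A minimal sketch of that fix (assuming you do want Latin-1 on disk; swap in any encoding name you like):
# Both encode() and prettify() accept an explicit encoding
latin1_bytes = soup.encode("iso-8859-1")      # bytes in ISO-8859-1
pretty_latin1 = soup.prettify("iso-8859-1")   # same content, pretty-printed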
However, you've got another problem before you get there. When you call findAll, you don't get back a tag, you get back a ResultSet, which is basically a list of tags. Just as calling str on a list of strings gives you brackets, commas, and the repr of each string (with extraneous quotes and backslash escapes for non-printable-ASCII characters) instead of the strings themselves, calling str on a ResultSet gives you something similar. And you obviously can't call Tag methods on a ResultSet.
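So you'll want to handle the tags individually; a sketch under that assumption:
tags = soup.findAll("td", {"colspan": "3"})
story = b"\n".join(tag.encode("iso-8859-1") for tag in tags)   # encode each Tag, then join the bytes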
Finally, I'm not sure what problem you're actually trying to solve. You're creating an HTML fragment. Ignoring the fact that a fragment isn't a valid document, and browsers strictly speaking shouldn't display it in the first place, it doesn't specify an encoding, meaning the browser can only get that information from some out-of-band place, like you selecting a menu item. Changing it to Latin-1 won't actually "fix" things; it'll just mean that you now get the right display when you pick Latin-1 in the menu and the wrong one when you pick UTF-8, instead of vice-versa. Why not create a full HTML document with a meta http-equiv that actually means what you want it to mean, instead of trying to figure out how to trick Chrome into guessing what you want it to guess?
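A sketch of that last suggestion (using the shorter meta charset form; the wrapper markup and filename here are made up for illustration):
tags = soup.findAll("td", {"colspan": "3"})
body = u"".join(unicode(tag) for tag in tags)   # str(tag) on Python 3
page = u'<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>%s</body></html>' % body
with open("story.html", "wb") as f:
    f.write(page.encode("utf-8"))   # now the bytes match the declared charset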

Related

BeautifulSoup4 cannot get the printing right. Python3

I'm currently learning Python 3. I am scraping a site for some data, which works fine, but when it comes to printing out the p tags I just can't get it to work as I expect.
import urllib.request
from bs4 import BeautifulSoup

data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px'})
dialog = stat.findChildren('p')
childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # Have tried with str(childtext.encode('utf-8', 'ignore'))
print(childlist)
That all works, but the printing is "bytes"
b'This is a ptag.string'
b'\xc2\xa0 (probably &nbsp'
b'this is anotherone'
Real sample text that is ASCII-encoded:
b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
Note that Announcement is a p and the rest is a strong inside a p tag.
The same sample with UTF-8 encoding:
b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "
I WISH to get:
"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
As you can see, the incorrect chars are stripped in "ascii", but stripping them destroys some line breaks, and I have yet to figure out how to print those correctly; also, the b's are still there then!
I really can't figure out how to remove the b's and encode or decode properly. I have tried every "solution" that I can google up.
HTML Content = utf-8
I would really rather not change the full data before processing, because it will mess up my other work, and I don't think it is needed.
Prettify does not work.
Any suggestions?
First, you're getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!
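In other words, a minimal sketch against your loop:
childlist = []
for child in dialog:
    childlist.append(child.get_text())   # keep str objects; no .encode()
print(childlist)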
As a guess, I assume you're looking to print strings from HTML nicely, pretty much as they would be seen in a browser. For that, you need to unescape the HTML character references, as described in this SO answer, which for Python 3.5 means:
import html
html.unescape(childtext)
Among other things, this will convert any &nbsp; sequences in the HTML string into '\xa0' characters, which are printed as spaces. However, if you want to break lines on these characters, despite them literally meaning "non-breaking space", you'll have to replace them with actual spaces before printing, e.g. using x.replace('\xa0', ' ').
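Putting the two steps together (a sketch, reusing the childtext variable from your loop):
import html
text = html.unescape(childtext).replace('\xa0', ' ')
print(text)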

Wrong encoding when displaying an HTML Request in Python

I do not understand why, when I make an HTTP request using the Requests library and then display the response with .text, special characters (such as accents) come out encoded (é = &eacute;, for example).
Yet when I try r.encoding, I get utf-8.
In addition, the problem occurs only on some websites. Sometimes I have the correct characters, but other times, not at all.
Try as follows:
r = requests.get("https://gks.gs/login")
print r.text
There are encoded characters displayed; we can see Mot de passe oubli&eacute; ?.
I do not understand why. Do you think it may be because of https? How to fix this please?
These are HTML character entity references, the easiest way to decode them is:
In Python 2.x:
>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('oubli&eacute;')
u'oublié'
In Python 3.x:
>>> import html.parser
>>> html.parser.HTMLParser().unescape('oubli&eacute;')
'oublié'
These are HTML escape codes, defined in the HTML Coded Character Set. Even though a certain document may be encoded in UTF-8, HTML (and its grandparent, SGML) were defined back in the good old days of ASCII. A system accessing an HTML page on the WWW may or may not natively support extended characters, and the developers needed a way to define "advanced" characters for some users, while failing gracefully for other users whose systems could not support them. Since UTF-8 standardization was only a gleam in its founders' eyes at that point, an encoding system was developed to describe characters that weren't part of ASCII. It was up to the browser developers to implement a way of displaying those extended characters, either through glyphs or through extended fonts.
Encoding special characters using &something; is "legal" in any HTML, and despite looking a bit strange, they are to be considered valid.
The text is supposed to be rendered by an HTML browser, and it will render correctly regardless of whether you find these characters encoded using the given construct or written directly.
For instructions on how to convert these encoded characters, see HTML Entity Codes to Text.
Those are HTML escape codes, often referred to as HTML entities. As you see, HTML uses its own code to replace reserved symbols.
You can use the library HTMLParser:
import HTMLParser
parser = HTMLParser.HTMLParser()
parsed = parser.unescape(r.text)
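On Python 3.4+ the module-level function is simpler (and the unescape() method is deprecated there, removed in 3.9):
import html
parsed = html.unescape(r.text)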

lxml.html parsing and utf-8 with requests

I used requests to retrieve a URL which contains some unicode characters, and I want to do some processing with it, then write it out.
import requests
import lxml.html

r = requests.get(url)
f = open('unicode_test_1.html', 'w'); f.write(r.content); f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f = open('unicode_test_2.html', 'w'); f.write(htmlOut); f.close()
In unicode_test_1.html all the chars look fine, but in unicode_test_2.html some chars have changed to gibberish. Why is that?
I then tried:
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html, encoding='latin1')
f = open('unicode_test_2.html', 'w'); f.write(htmlOut); f.close()
It seems to be working now, but I don't know why this is happening. Should I always use latin1?
What's the difference between r.text and r.content, and why can't I write the html out using encoding='utf-8'?
You've not specified whether you're using Python 2 or 3, and encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put, Requests will attempt to figure out the character encoding for you and return Unicode after decoding it, which is accessible via r.text. To get just the bytes, use r.content.
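A minimal sketch of the difference:
r = requests.get(url)
raw = r.content   # bytes, exactly as received over the wire
text = r.text     # unicode, decoded using requests' best guess (see r.encoding)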
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always using one encoding over another. Make a Unicode sandwich by doing any I/O in bytes and working with Unicode inside your application. If you start with bytes (isinstance(mytext, str)), you need to know the encoding to decode to Unicode; if you start with Unicode (isinstance(mytext, unicode)), you should encode to UTF-8, as it handles all the world's characters.
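Here's a sketch of that sandwich for your script (io.open behaves the same on Python 2 and 3; the filename is just illustrative):
import io
import requests
import lxml.html

r = requests.get(url)
doc = lxml.html.fromstring(r.text)                     # unicode in
htmlOut = lxml.html.tostring(doc, encoding='unicode')  # stay in unicode inside
with io.open('unicode_test_2.html', 'w', encoding='utf-8') as f:
    f.write(htmlOut)                                   # encode once, at the I/O boundary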
Make sure your editor, files, server and database are configured to UTF-8 also otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.

Python: Youtube HTML full of BOMs

I'm trying to parse YouTube comments using BeautifulSoup 4 in Python 2.7. When I try it for any YouTube video, I get text full of BOMs, and not just at the file start:
<p>ï»¿thank you kind sir :)</p>
One appears in almost every comment. This is not the case for other websites (guardian.co.uk). The code I'm using:
import urllib2
from bs4 import BeautifulSoup

# Source (should be taken from file to allow updating, but not during wip):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'
# Get html from source:
response = urllib2.urlopen(source_url)
html = response.read()
# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.decode("utf-8-sig")
soup = BeautifulSoup(html)
strings = soup.findAll("div", {"class": "comment-body"})
print strings
As you can see, I've tried decoding, but as soon as I soup it, it brings back the BOM character. Any ideas?
This seems to be invalid on YouTube's part, but you can't just tell them to fix it; you need a workaround.
So, here's a simple workaround:
# html comes with BOM everywhere, which is real ***, get rid of it!
html = html.replace(b'\xEF\xBB\xBF', b'')
html = html.decode("utf-8")
(The b prefixes are unnecessary but harmless in Python 2.7, and they'll make your code work in Python 3… on the other hand, they'll break it in Python 2.5, so if that's more important to you, get rid of them.)
Alternatively, you can first decode and then replace(u'\uFEFF', u''). This should have the exact same effect (decoding extra BOMs should work harmlessly). But I think it makes more sense to fix the UTF-8 then decode it, rather than trying to decode and then fixing the result.
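That alternative looks like this (same result, opposite order):
html = html.decode("utf-8")
html = html.replace(u'\uFEFF', u'')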

Python 2.6 and unicode

So I am working on a web-browser type of application for my client, and I just implemented bookmarking functionality, but it doesn't work as expected. When the user clicks "Bookmark page", a little form pops up, which takes the title of the webpage and puts it in a line edit. The thing is, if the website has some foreign or unusual symbols in its title, then Python throws an error about how it can't encode the string. How could I get Python to handle all possible strings, no matter if they contain hieroglyphs or some other weird symbols?
Library used for the GUI and embedded browser: PyQt
If you're using QWebView.title to get the title of the current web-page, then it will either return a QString or a python unicode string. Which one you get depends on the PyQt API version in use. For version 1 (which is the default for Python2), it will be a QString; for version 2 (which is the default for Python3), it will be a python unicode string. Whichever it is, in order to display it correctly in the line-edit, just set it directly:
lineEdit.setText(webview.title())
Since you appear to be using Python2, I'll assume that webview.title() is returning a QString. If you want to convert this to a python unicode string (e.g. in order to use it with sqlite), then you can do the following:
title = unicode(webview.title())
Note that you should not pass an encoding (such as "utf-8") as the second argument to unicode, as this is used for decoding byte strings to unicode strings.
If you do need to get a "utf-8" encoded byte string from a QString, then you can either do:
data = unicode(webview.title()).encode('utf-8')
or:
data = webview.title().toUtf8().data()
What are you using to parse the websites? I would recommend Beautiful Soup. It will try to determine the encoding of the web page and give you back unicode. See Beautiful Soup's Parsing HTML section. Edit: also take a look at the "Beautiful Soup Gives You Unicode, Dammit" section.
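If you only want the encoding-detection part, Beautiful Soup exposes it separately; in Beautiful Soup 4 that's UnicodeDammit (a minimal sketch; raw_bytes stands in for whatever your browser widget fetched):
from bs4 import UnicodeDammit

dammit = UnicodeDammit(raw_bytes)
print dammit.original_encoding   # Beautiful Soup's guess, e.g. 'utf-8'
title = dammit.unicode_markup    # the markup decoded to unicode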
