Python: YouTube HTML full of BOMs

I'm trying to parse YouTube comments using BeautifulSoup 4 in Python 2.7. For any YouTube video I try, I get text full of BOMs, and not just at the start of the file:
<p> thank you kind sir :)</p>
One appears in almost every comment. This is not the case for other websites (e.g. guardian.co.uk). The code I'm using:
import urllib2
from bs4 import BeautifulSoup

# Source (should be taken from a file to allow updating, but not during WIP):
source_url = 'https://www.youtube.com/watch?v=aiYzrCjS02k&feature=related'
# Get the HTML from the source:
response = urllib2.urlopen(source_url)
html = response.read()
# html comes with BOMs everywhere, which is real ***, get rid of them!
html = html.decode("utf-8-sig")
soup = BeautifulSoup(html)
strings = soup.findAll("div", {"class": "comment-body"})
print strings
As you can see, I've tried decoding, but as soon as I soup it, the BOM characters come back. Any ideas?

This seems to be invalid on YouTube's part, but since you can't just tell them to fix it, you need a workaround.
So, here's a simple workaround:
# html comes with BOMs everywhere, which is real ***, get rid of them!
html = html.replace(b'\xEF\xBB\xBF', b'')  # \xEF\xBB\xBF is the UTF-8 encoding of U+FEFF
html = html.decode("utf-8")
(The b prefixes are unnecessary but harmless in Python 2.7, and they'll make your code work in Python 3… on the other hand, they'll break it on Python 2.5, so if that matters more to you, get rid of them.)
Alternatively, you can first decode and then replace(u'\uFEFF', u''). This should have exactly the same effect (decoding the extra BOMs is harmless). But I think it makes more sense to fix the UTF-8 and then decode it, rather than decoding and then fixing the result.
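In code, that decode-first alternative looks like this:
html = html.decode("utf-8")
# Strip the decoded BOM character (U+FEFF) from the text:
html = html.replace(u'\uFEFF', u'')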

Related

BeautifulSoup4 cannot get the printing right. Python3

I'm currently learning Python 3. I'm scraping a site for some data, which works fine, but when it comes to printing out the p tags I just can't get it to work as I expect.
import urllib.request
from bs4 import BeautifulSoup

data = urllib.request.urlopen('http://www.site.com').read()
soup = BeautifulSoup(data, 'lxml')
stat = soup.find('div', {'style': 'padding-left: 10px'})
dialog = stat.findChildren('p')
childlist = []
for child in dialog:
    childtext = child.get_text()
    # have tried child.string as well (exactly the same result)
    childlist.append(childtext.encode('utf-8', 'ignore'))
    # have also tried str(childtext.encode('utf-8', 'ignore'))
print(childlist)
That all works, but what gets printed is "bytes":
b'This is a ptag.string'
b'\xc2\xa0' (probably &nbsp;)
b'this is anotherone'
Real sample text, encoded as 'ascii':
b"Announcementb'Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
Note that Announcement is p and the rest is 'strong' under a p tag.
The same sample with utf-8 encoding:
b"Announcement\xc2\xa0\xe2\x80\x93\xc2\xa0b'Firefox users may encounter browser warnings encountering SSL SHA-1 "
I WISH to get:
"Announcement"
(newline / new item in list)
"Firefox users may encounter browser warnings encountering SSL SHA-1 certificates"
As you can see, the incorrect chars are stripped with 'ascii', but some of them are what produced the line breaks, so stripping destroys those, and I have yet to figure out how to print this correctly. Also, the b's are still there!
I really can't figure out how to remove the b's and encode or decode properly. I have tried every "solution" I can google up.
HTML Content = utf-8
I would much rather not change the full data before processing, because that would mess up my other work, and I don't think it's needed.
Prettify does not work.
Any suggestions?
First, you're getting output of the form b'stuff' because you are calling .encode(), which returns a bytes object. If you want to print strings for reading, keep them as strings!
As a guess, I assume you're looking to print strings from HTML nicely, pretty much as they would appear in a browser. For that, you need to unescape the HTML entities in the string, as described in this SO answer, which for Python 3.5 means:
import html
html.unescape(childtext)
Among other things, this will convert any &nbsp; sequences in the HTML string into '\xa0' characters, which print as spaces. However, if you want to break lines on these characters, despite them literally meaning "non-breaking space", you'll have to replace them with actual spaces before printing, e.g. using x.replace('\xa0', ' ').
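Putting that together, a minimal sketch (Python 3.5+, reusing dialog from the question; everything stays a str, nothing is ever .encode()d for printing):
import html

childlist = []
for child in dialog:
    text = html.unescape(child.get_text())       # decode HTML entities such as &nbsp;
    childlist.append(text.replace('\xa0', ' '))  # make non-breaking spaces plain spaces
for line in childlist:
    print(line)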

lxml.html parsing and utf-8 with requests

I used requests to retrieve a URL which contains some Unicode characters, and want to do some processing on it, then write it out.
import requests
import lxml.html

r = requests.get(url)
f = open('unicode_test_1.html', 'w'); f.write(r.content); f.close()
html = lxml.html.fromstring(r.content)
htmlOut = lxml.html.tostring(html)
f = open('unicode_test_2.html', 'w'); f.write(htmlOut); f.close()
In unicode_test_1.html, all chars look fine, but in unicode_test_2.html, some chars have changed to gibberish. Why is that?
I then tried
html = lxml.html.fromstring(r.text)
htmlOut = lxml.html.tostring(html,encoding='latin1')
f=open('unicode_test_2.html','w');f.write(htmlOut);f.close()
It seems to be working now, but I don't know why this is happening. Should I always use latin1?
What's the difference between r.text and r.content, and why can't I write the html out using encoding='utf-8'?
You've not specified whether you're using Python 2 or 3, and encoding is handled quite differently depending on which version you're using. The following advice is more or less universal anyway.
The difference between r.text and r.content is in the Requests docs. Simply put, Requests will attempt to figure out the character encoding for you and return Unicode after decoding it, which is accessible via r.text. To get just the bytes, use r.content.
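As a quick illustration (example.com is a stand-in URL):
import requests

r = requests.get('http://example.com/')
print(type(r.content))  # bytes (str on Python 2): the raw, undecoded body
print(type(r.text))     # str (unicode on Python 2): decoded text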
You really need to get to grips with encodings. Read http://www.joelonsoftware.com/articles/Unicode.html and watch https://www.youtube.com/watch?v=sgHbC6udIqc to get started. Also, do a search for "Overcoming frustration: Correctly using unicode in python2" for additional help.
Just to clarify, it's not as simple as always using one encoding over another. Make a Unicode sandwich: do any I/O in bytes and work with Unicode inside your application. If you start with bytes (isinstance(mytext, str)) you need to know the encoding to decode to Unicode; if you start with Unicode (isinstance(mytext, unicode)) you should encode to UTF-8, as it handles all the world's characters.
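A minimal sketch of that sandwich, assuming the response r from your question (io.open behaves the same on Python 2 and 3):
import io

raw = r.content                           # bytes in ...
text = raw.decode(r.encoding or 'utf-8')  # ... decode once, at the edge
# work with `text` as Unicode here
with io.open('unicode_test_2.html', 'w', encoding='utf-8') as f:
    f.write(text)                         # encoded back to UTF-8 on the way out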
Make sure your editor, files, server and database are configured for UTF-8 too, otherwise you'll get more 'gibberish'.
If you want further help post the source files and output of your script.

Python Beautiful Soup, saved text not displaying properly in original encoding

I'm having trouble with a saved file not displaying properly when using its original encoding.
I'm downloading a web page, searching it for content I want and then writing that content to a file.
The encoding of the site is 'iso-8859-1', or so Chrome and Beautiful Soup tell me, and it displays perfectly when viewed with that encoding on the original site.
When I download the page and try to view it, however, I end up with strange characters (HTML entities?) like these:
“ , ’
If I manually set Chrome's encoding to 'UTF-8' when viewing the saved page, it appears normally, as does the original page if I set that to 'UTF-8'.
I'm not sure what to do with this. I would change the encoding before writing the text to a file, but I get ascii errors when I try that.
Here is a sample page (possible adult content):
http://original.adultfanfiction.net/story.php?no=600106516
And the code I am using to get the text from the page:
import requests
from bs4 import BeautifulSoup

def get_story(url):
    site = requests.post(url, allow_redirects=False)
    html = site.text
    soup = BeautifulSoup(html)
    rawStory = soup.findAll("td", {"colspan": '3'})
    story = str(rawStory)
    return story
I turn the ResultSet into a string so that I can write it to a file; I don't know if that could be part of the problem. If I print the html to the console after requesting it, but before doing anything else to it, it displays improperly there as well.
I'm 90% sure that your problem is just that you're asking BeautifulSoup for a UTF-8 fragment and then trying to use it as ISO-8859-1, which obviously isn't going to work. The documentation explains all of this pretty nicely.
You're calling str. As "Non-pretty printing" explains:
If you just want a string, with no fancy formatting, you can call unicode() or str() on a BeautifulSoup object, or a Tag within it… The str() function returns a string encoded in UTF-8.
As Output encoding explains:
When you write out a document from Beautiful Soup, you get a UTF-8 document, even if the document wasn’t in UTF-8 to begin with.
It then follows up with an example of almost exactly what you're doing (parsing a Latin-1 HTML document and writing it back as UTF-8) and then immediately explains how to fix it:
If you don’t want UTF-8, you can pass an encoding into prettify()… You can also call encode() on the BeautifulSoup object, or any element in the soup, just as if it were a Python string…
So, that's all you have to do.
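For example (a sketch, assuming you do want Latin-1 output back):
# Ask Beautiful Soup for Latin-1 instead of its UTF-8 default:
story = soup.prettify('iso-8859-1')
# or, without the pretty-printing:
story = soup.encode('iso-8859-1')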
However, you've got another problem before you get there. When you call findAll, you don't get back a tag, you get back a ResultSet, which is basically a list of tags. Just as calling str on a list of strings gives you brackets, commas, and the repr of each string (with extraneous quotes and backslash escapes for non-printable-ASCII characters) instead of the strings themselves, calling str on a ResultSet gives you something similar. And you obviously can't call Tag methods on a ResultSet.
Finally, I'm not sure what problem you're actually trying to solve. You're creating an HTML fragment. Ignoring the fact that a fragment isn't a valid document, and that browsers strictly speaking shouldn't display it in the first place, it doesn't specify an encoding, so the browser can only get that information from some out-of-band place, like you selecting a menu item. Changing it to Latin-1 won't actually "fix" things; it just means you'll get the right display when you pick Latin-1 in the menu and the wrong one when you pick UTF-8, instead of vice versa. Why not create a full HTML document with a meta http-equiv that actually means what you want it to mean, instead of trying to figure out how to trick Chrome into guessing what you want it to guess?
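A sketch of that approach, building on the question's code (it also iterates the ResultSet tag by tag instead of calling str on the whole thing):
body = u''.join(unicode(tag) for tag in rawStory)
page = (u'<html><head>'
        u'<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>'
        u'</head><body>' + body + u'</body></html>')
with open('story.html', 'wb') as f:
    f.write(page.encode('utf-8'))  # the bytes now match the declared charset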

How to return plain text from Beautiful Soup instead of unicode

I am using BeautifulSoup4 to scrape this web page, however I'm getting the weird unicode text that BeautifulSoup returns.
Here is my code:
import gzip
import StringIO
import urllib2
from bs4 import BeautifulSoup

site = "http://en.wikipedia.org/wiki/" + a + "_" + str(b)
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site, headers=hdr)
req.add_header('Accept-Encoding', 'gzip')  # header to check for gzip
page = urllib2.urlopen(req)
if page.info().get('Content-Encoding') == 'gzip':  # IF checks gzip
    data = page.read()
    data = StringIO.StringIO(data)
    gzipper = gzip.GzipFile(fileobj=data)
    html = gzipper.read()
    soup = BeautifulSoup(html, fromEncoding='gbk')
else:
    soup = BeautifulSoup(page)

section = soup.find('span', id='Events').parent
events = section.find_next('ul').find_all('li')
print soup.originalEncoding
for x in events:
    print x
Basically, I want x to be in plain English. Instead, I get things that look like this:
<li>153 BC – Roman consuls begin their year in office.</li>
There's only one example in this particular string, but you get the idea.
Related: I go on to cut up this string with some regex and other string-cutting methods. Should I switch it to plain text before or after I cut it up? I'm assuming it doesn't matter, but seeing as I'm deferring to SO anyway, I thought I'd ask.
If anyone knows how to fix this, I'd appreciate it. Thanks
EDIT: Thanks J.F. for the tip. I now use this for my loop:
for x in events:
    x = x.encode('ascii')
    x = str(x)
    # Find content
    regex2 = re.compile(">[^>]*<")
    textList = re.findall(regex2, x)
    text = "".join(textList)
    text = text.replace(">", "")
    text = text.replace("<", "")
    contents.append(text)
However, I still get things like this:
2013 – At least 60 people are killed and 200 injured in a stampede after celebrations at Félix Houphouët-Boigny Stadium in Abidjan, Ivory Coast.
EDIT:
Here is how I make my Excel spreadsheet (csv) and feed in my list:
rows = zip(days, contents)
with open("events.csv", "wb") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
So the csv file is created during the program and everything is imported after the lists are generated. I just need it to be readable text at that point.
fromEncoding (which has been renamed to from_encoding for compliance with PEP8) tells the parser how to interpret the data in the input. What you (your browser or urllib) receive from the server is just a stream of bytes. In order to make sense of it, i.e. in order to build a sequence of abstract characters from this stream of bytes (this process is called decoding), one has to know how the information was encoded. This piece of information is required and you have to provide it in order to make sure that your code behaves properly. Wikipedia tells you how they encode the data, it's stated right at the top of the source of each of their web pages, e.g.
<meta charset="UTF-8" />
Hence, the bytestream received from Wikipedia's web servers should be interpreted with the UTF-8 codec. You should invoke
soup = BeautifulSoup(html, from_encoding='utf-8')
instead of BeautifulSoup(html, fromEncoding='gbk'), which tries to decode the bytestream with some Chinese character codec (I guess you blindly copied that piece of code from here).
You really need to make sure that you understand the basic concept of text encodings. Actually, you want unicode in the output, which is an abstract representation of a sequence of characters/symbols. In this context, there is no such thing as "plain English".
There is no such thing as plain text. What you see are bytes interpreted as text using an incorrect character encoding, i.e., the encoding of the strings differs from the one used by your terminal, unless the error was introduced earlier by using an incorrect character encoding for the web page.
print x calls str(x), which returns a UTF-8 encoded string for BeautifulSoup objects.
Try:
print unicode(x)
Or:
print x.encode('ascii')
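As an aside, if what you ultimately want is the text inside each li without the markup, Tag.get_text() already does what the regex in your edit is attempting (a sketch):
for x in events:
    text = x.get_text()                    # unicode text, tags already stripped
    contents.append(text.encode('utf-8'))  # encode only when storing for the csv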

How to solve UnicodeEncodeError while working with Cyrillic (Russian) letters?

I'm trying to read an RSS feed using feedparser.
import feedparser

url = 'http://example.com/news.xml'
d = feedparser.parse(url)
f = open('rss.dat', 'w')
for e in d.entries:
    title = e.title
    print >>f, title
f.close()
It works fine with English RSS feeds, but I get a UnicodeEncodeError if a title is written in Cyrillic letters. It happens when I:
Try to write a title into a file.
Try to display a title into the screen.
Try to use it in URL to access a web page.
My question is how to solve this problem easily. I would love to have a solution as simple as this:
new_title = some_function(title)
Maybe there is a way to replace every Cyrillic symbol with its HTML code?
FeedParser itself handles encodings fine, except when the encoding is declared incorrectly. See http://code.google.com/p/feedparser/issues/detail?id=114 for a possible explanation. It seems Python 2.5 uses ascii as the default encoding, which causes problems.
Can you paste the actual feed URL, so we can see how the encoding is declared there? If it turns out that the declared encoding is wrong, you'll have to find a way to instruct FeedParser to override the default value.
EDIT: Okay, it seems the error is in the print statement.
Use
f.write(title.encode('utf-8'))
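Putting the fix into the original script (a sketch; io.open encodes for you, and the 'xmlcharrefreplace' error handler does exactly what the question asks, replacing each non-ASCII character with its HTML character reference):
import io
import feedparser

url = 'http://example.com/news.xml'
d = feedparser.parse(url)
with io.open('rss.dat', 'w', encoding='utf-8') as f:
    for e in d.entries:
        f.write(e.title + u'\n')  # unicode in, UTF-8 bytes out

# Or, to substitute HTML character references for the Cyrillic letters:
# title.encode('ascii', 'xmlcharrefreplace')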
