Issues with BeautifulSoup parsing - python

I am trying to parse an html page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the html or that page at all. When I run the code below, the method prettify() returns me only the script block of the page (see below). Does anybody has an idea why it happens?
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()
The is the output produced by BeautifulSoup.
-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
<!--
function highlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
}
function unhighlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
}
//-->
</script>
Thanks!
UPDATE: I am using the following version, which appears to be the latest.
__author__ = "Leonard Richardson (leonardr#segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"

Try with version 3.0.7a as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0 so they had to change the parser from SGMLParser to HTMLParser which seems more vulnerable to bad HTML.
From the changelog for BeautifulSoup 3.1:
"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"

Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Blicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.

BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.
In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:
SCRIPT type=""javascript""
(Notice the double quoting.)
The BeautifulSoup docs contains a section what you can do if BeautifulSoup can't parse you markup. You'll need to investigate those alternatives.

Samj: If I get things like
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>"
I just remove the culprit from markup before I serve it to BeautifulSoup and all is dandy:
html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>","")
soup = BeautifulSoup(html)

I had problems parsing the following code too:
<script>
function show_ads() {
document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
}
</script>
HTMLParseError: bad end tag: u'', at line 26, column 127
Sam

I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1' but give it a try.

import urllib
from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
In my case by executing the above statements, it returns the entire HTML page.

Related

how to detect the language of webpage content by using python

I have to test a bunch of URLs whether those webpages have respective translation content or not. Is there any way to return the language of content in a webpage by using the Python language? Like if the page is in Chinese, then it should return `"Chinese"``.
I checked it with langdetect module, but not able to get the results I desire. These URls are in web xml format. The content is showing under <releasehigh>
Here is a simple example demonstrating use of BeautifulSoup to extract HTML body text and langdetect for the language detection:
from bs4 import BeautifulSoup
from langdetect import detect
with open("foo.html", "rb") as f:
soup = BeautifulSoup(f, "lxml")
[s.decompose() for s in soup("script")] # remove <script> elements
body_text = soup.body.get_text()
print(detect(body_text))
You can extract a chunk of content then use some python language detection like langdetect or guess-language.
Maybe you have a header like this one :
<HTML xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr" lang="fr">
If it's the case you can see with lang="fr" that this is a french web page. If it's not the case, guessing the language of a text is not trivial.
You can use BeautifulSoup to extract the language from HTML source code.
<html class="no-js" lang="cs">
Extract the lang field from source code:
from bs4 import BeautifulSoup
import requests
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.html["lang"])

Python web scraping returns error

I am currently learning Python and have tried to pick up web scraping. I have been using example code that I got from some tutorials, but I have encountered a problem with one of the sites I was looking at. The following code was supposed to return the title of the website:
import urllib
import re
urls = ["http://www.libyaherald.com"]
i=0
regex='<title>(.+?)</title>'
pattern = re.compile(regex)
while i< len(urls):
htmlfile = urllib.urlopen(urls[i])
htmltext = htmlfile.read()
titles=re.findall(pattern,htmltext)
print titles
i+=1
The title for Libya Herald website returned back an error. I checked the source code for Libya Herald and the DOC TYPE is <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">.
Does the doc type have something to do with me not being able to scrape from it?
As #Puciek said, with regex is gonna be very difficult to scrape html. I would recommend you start using some package, a very easo to use and install is BeautifulSoup.
Once you install it you can try this simple example:
from bs4 import BeautifulSoup
import requests
html = requests.get('http://www.libyaherald.com').text
bs = BeautifulSoup(html)
title = bs.find('title').text
print title
For serious python web scraping I strongly suggest Scrapy.
And as far as I know, when it comes to html parsing, regex is not a recommended way. Try BeautifulSoup (BS4) like Pizza guy said :)

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.
If I have a document of the form:
<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.
I'd prefer a built-in module but that might be asking a little too much.
I went through a lot of questions on Stack Overflow and a few blogs on the internet and most of them suggest BeautifulSoup or lxml or HTMLParser but few of these detail the functionality and simply end as a debate over which one is faster/more efficent.
So that I can ask it to get me the content/text in the div tag with class='container' contained within the body tag, Or something similar.
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
html = #the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)
You don't need performance descriptions I guess - just read how BeautifulSoup works. Look at its official documentation.
I guess what you're looking for is pyquery:
pyquery: a jquery-like library for python.
An example of what you want may be like:
from pyquery import PyQuery
html = # Your HTML CODE
pq = PyQuery(html)
tag = pq('div#id') # or tag = pq('div.class')
print tag.text()
And it uses the same selectors as Firefox's or Chrome's inspect element. For example:
The inspected element selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector:
pq('div#mw-head.noprint')
Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.
Python HTML parser performance
I'd recommend BeautifulSoup even though it isn't built in. Just because it's so easy to work with for those kinds of tasks. Eg:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.google.com/')
soup = BeautifulSoup(page)
x = soup.body.find('div', attrs={'class' : 'container'}).text
Compared to the other parser libraries lxml is extremely fast:
http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html
And with cssselect it’s quite easy to use for scraping HTML pages too:
from lxml.html import parse
doc = parse('http://www.google.com').getroot()
for div in doc.cssselect('a'):
print '%s: %s' % (div.text_content(), div.get('href'))
lxml.html Documentation
I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).
In my experience Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser, rather a very good string analyzer.
I recommend using justext library:
https://github.com/miso-belica/jusText
Usage:
Python2:
import requests
import justext
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
print paragraph.text
Python3:
import requests
import justext
response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
print (paragraph.text)
I would use EHP
https://github.com/iogf/ehp
Here it is:
from ehp import *
doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
'''
html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
print ind.text()
Output:
Something here
Something else

Read HEAD contents from HTML

i need small script in python. Need to read custom block in a web file.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
req = urllib2.Request('http://target.com')
response = urllib2.urlopen(req)
the_page = response.read()
print the_page # Here is all page source with html tags, but
# i need read only section from <head> to </head>
# example the http://target.com source is:
# <html>
# <body>
# <head>
# ... need to read this section ...
# </head>
# ... page source ...
# </body>
# </html>
How read the custom section?
To parse HTML, we use a parser, such as BeautifulSoup.
Of course you can parse it using a regular expression, but that is something you should never do. Just because it works for some cases doesn't mean it is the standard way of doing it or is the proper way of doing it. If you are interested in knowing why, read this excellent answer here on SO.
Start with the BeautifulSoup tutorial and see how to parse the required information. It is pretty easy to do it. We are not going to do it for you, that is for you to read and learn!
Just to give you a heads up, you have the_page which contains the HTML data.
>> from BeautifulSoup import BeautifulSoup
>> soup = BeautifulSoup(the_page)
Now follow the tutorial and see how to get everything within the head tag.
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen('http://www.example.com')
soup = BeautifulSoup(page.read())
print soup.find('head')
outputs
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Example Web Page</title>
</head>
One solution would be to use the awesome python library Beautiful Soup. It allows you do parse the html/xml pretty easily, and will try to help out when the documents are broken or invalid.

How can i grab CData out of BeautifulSoup

I have a website that I'm scraping that has a similar structure the following. I'd like to be able to grab the info out of the CData block.
I'm using BeautifulSoup to pull other info off the page, so if the solution can work with that, it would help keep my learning curve down as I'm a python novice.
Specifically, I want to get at the two different types of data hidden in the CData statement. the first which is just text I'm pretty sure I can throw a regex at it and get what I need. For the second type, if i could drop the data that has html elements into it's own beautifulsoup, I can parse that.
I'm just learning python and beautifulsoup, so I'm struggling to find the magical incantation that will give me just the CData by itself.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
Cows and Sheep
</title>
</head>
<body>
<div id="main">
<div id="main-precontents">
<div id="main-contents" class="main-contents">
<script type="text/javascript">
//<![CDATA[var _ = g_cow;_[7654]={cowname_enus:'cows rule!',leather_quality:99,icon:'cow_level_23'};_[37357]={sheepname_enus:'baa breath',wool_quality:75,icon:'sheep_level_23'};_[39654].cowmeat_enus = '<table><tr><td><b class="q4">cows rule!</b><br></br>
<!--ts-->
get it now<table width="100%"><tr><td>NOW</td><th>NOW</th></tr></table><span>244 Cows</span><br></br>67 leather<br></br>68 Brains
<!--yy-->
<span class="q0">Cow Bonus: +9 Cow Power</span><br></br>Sheep Power 60 / 60<br></br>Sheep 88<br></br>Cow Level 555</td></tr></table>
<!--?5695:5:40:45-->
';
//]]>
</script>
</div>
</div>
</div>
</body>
</html>
One thing you need to be careful of BeautifulSoup grabbing CData is not to use a lxml parser.
By default, the lxml parser will strip CDATA sections from the tree and replace them by their plain text content, Learn more here
#Trying it with html.parser
>>> from bs4 import BeautifulSoup
>>> import bs4
>>> s='''<?xml version="1.0" ?>
<foo>
<bar><![CDATA[
aaaaaaaaaaaaa
]]></bar>
</foo>'''
>>> soup = BeautifulSoup(s, "html.parser")
>>> soup.find(text=lambda tag: isinstance(tag, bs4.CData)).string.strip()
'aaaaaaaaaaaaa'
>>>
BeautifulSoup sees CData as a special case (subclass) of "navigable strings". So for example:
import BeautifulSoup
txt = '''<foobar>We have
<![CDATA[some data here]]>
and more.
</foobar>'''
soup = BeautifulSoup.BeautifulSoup(txt)
for cd in soup.findAll(text=True):
if isinstance(cd, BeautifulSoup.CData):
print 'CData contents: %r' % cd
In your case of course you could look in the subtree starting at the div with the 'main-contents' ID, rather than all over the document tree.
You could try this:
from BeautifulSoup import BeautifulSoup
// source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
s = soup.findAll('script')
cdata = s[0].contents[0]
That should give you the contents of cdata.
Update
This may be a little cleaner:
from BeautifulSoup import BeautifulSoup
import re
// source.html contains your html above
f = open('source.html')
soup = BeautifulSoup(''.join(f.readlines()))
cdata = soup.find(text=re.compile("CDATA"))
Just personal preference, but I like the bottom one a little better.
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(content)
for x in soup.find_all('item'):
print re.sub('[\[CDATA\]]', '', x.string)
For anyone using BeautifulSoup4, Alex Martelli's solution works but do this:
from bs4 import BeautifulSoup, CData
soup = BeautifulSoup(txt)
for cd in soup.findAll(text=True):
if isinstance(cd, Cdata):
print 'CData contents: %r' % cd

Categories