BeautifulSoup not reading ill-formed html - python

I was learning BeautifulSoup. It wasn't reading some sites properly, and I found the reason was that some HTML attributes were ill-formed. For example:
from bs4 import BeautifulSoup
html = """
<html>
<head><title>Test</title></head>
<body>
<p id="paraone"align="center">some content <b>para1</b>.<!--there is no space before 'align' attribute -->
<p id="paratwo" align="blah">some content <b>para2</b>
</html>
"""
soup = BeautifulSoup(html)
print "soup:", soup
I think BeautifulSoup is designed not to read ill-formed HTML. If so, is there any other module that can read the HTML given above? Can't we parse ill-formed web sites?
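Actually, BeautifulSoup is built precisely to cope with ill-formed markup; how gracefully it recovers depends on the tree builder you pick. A minimal sketch, assuming beautifulsoup4 and html5lib are installed, that parses the snippet above with the browser-grade html5lib parser:
from bs4 import BeautifulSoup  # beautifulsoup4

html = """
<html>
<head><title>Test</title></head>
<body>
<p id="paraone"align="center">some content <b>para1</b>.
<p id="paratwo" align="blah">some content <b>para2</b>
</html>
"""

# html5lib recovers like a browser would: it splits the run-together
# attributes and closes the unclosed <p>, <body> and <html> tags.
soup = BeautifulSoup(html, "html5lib")
for p in soup.find_all("p"):
    print(p.get("id"), p.get("align"))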

Related

BS4 breaks HTML trying to repair it

BS4 corrects faulty html. Usually this is not a problem. I tried parsing, altering and saving the html of this page: ulisses-regelwiki.de/index.php/sonderfertigkeiten.html
In this case the repairing changes the representation: after the repair, many lines of the page are no longer centered but left-aligned instead.
Since I have to work with the broken html of said page, I cannot simply repair the html code.
How can I prevent bs4 from repairing the html or fix the "correction" somehow?
(this minimal example just shows bs4 repairing broken html-code; I couldn't create a minimal example where bs4 does this in a wrong way like with the page mentioned above)
#!/usr/bin/env python3
from bs4 import BeautifulSoup, NavigableString
html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
def is_string_only(t):
    return type(t) is NavigableString

soup = BeautifulSoup(html, 'lxml')  # or 'html.parser'
print(str(soup))
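Note that bs4 cannot be told to leave broken markup untouched: every supported tree builder produces a repaired tree, they just repair differently. The practical fix is to pick the parser whose recovery is closest to what the browser renders, which is usually html5lib. A small comparison sketch, assuming lxml and html5lib are installed:
from bs4 import BeautifulSoup

broken = '<!DOCTYPE html>\n<center>\nSome Test content\n<center>\n'

# Each tree builder applies its own error recovery; html5lib follows the
# HTML5 parsing algorithm, so its repairs match what browsers display.
for parser in ('html.parser', 'lxml', 'html5lib'):
    print(parser)
    print(BeautifulSoup(broken, parser).prettify())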
Try this lib.
from simplified_scrapy import SimplifiedDoc
html = '''
<!DOCTYPE html>
<center>
Some Test content
<!-- A comment -->
<center>
'''
doc = SimplifiedDoc(html)
print(doc.html)
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples

Beautifulsoup how to select all the 'a' tags

I am a newbie to BeautifulSoup and Python. Here is my HTML:
<html>
<head></head>
<body>
<a href="https://www.google.com">Google</a>
<a href="https://www.yahoo.com">Yahoo</a>
</body>
</html>
Now my code:
from bs4 import BeautifulSoup
# html comes from a Requests call in the real code; not shown here
soup = BeautifulSoup(html,'html.parser')
print(soup.find('a'))
This is giving just one link, but I want to get all.
Thanks in advance
You are using .find(), which returns only the first match. Use .find_all() instead to get a list of all the a tags.
print(soup.find_all('a'))
To get the hrefs with a for loop:
for link in soup.find_all('a'):
    print(link.get('href'))
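The same list can also be obtained with a CSS selector, which bs4 supports as well (a sketch; select() returns the matching tags just like find_all('a')):
# select() takes a CSS selector string and returns all matching tags.
for link in soup.select('a'):
    print(link.get('href'), link.get_text())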

Tags are converted to HTML entities?

I'm trying to use BeautifulSoup to parse some dirty HTML. One such HTML is http://f10.5post.com/forums/showthread.php?t=1142017
What happens is that, firstly, the tree misses a large chunk of the page. Secondly, tostring(tree) converts tags like <div> on half of the page to HTML entities like &lt;/div&gt;. For instance
Original:
<div class="smallfont" align="centre">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>`
toString(tree) gives
<div class="smallfont" align="center">All times are GMT -4. The time now is <span class="time">02:12 PM</span>.</div>
Here's my code:
from BeautifulSoup import BeautifulSoup
import urllib2
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page)
print soup
Thanks
Use beautifulsoup4 and an extremely lenient html5lib parser:
import urllib2
from bs4 import BeautifulSoup # NOTE: importing beautifulsoup4 here
page = urllib2.urlopen("http://f10.5post.com/forums/showthread.php?t=1142017")
soup = BeautifulSoup(page, "html5lib")
print soup
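The same approach on Python 3, where urllib2 no longer exists (a sketch, assuming the requests package; urllib.request would work too):
import requests
from bs4 import BeautifulSoup

page = requests.get("http://f10.5post.com/forums/showthread.php?t=1142017")
# html5lib is the most lenient parser bs4 supports; it follows the
# HTML5 parsing algorithm instead of choking on dirty markup.
soup = BeautifulSoup(page.content, "html5lib")
print(soup.prettify())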

Retrieve contents from broken <a> tags using Beautiful Soup

I am trying to parse a website and retrieve the texts that contain Hyper link.
For example:
<a href="http:\\www.google.com">This is an Example</a>
I need to retrieve "This is an Example", which I am able to do for pages that don't have broken tags. I am unable to retrieve it in the following case:
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
Example
</body>
</html>
In such cases the code is unable to retrieve "Google" because of the broken tag around it, and only gives me "Example". Is there a way to also retrieve "Google"?
My code is here:
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
f = open("sol.html","r")
soup = BeautifulSoup(f,parse_only=SoupStrainer('a'))
for link in soup.findAll('a',text=True):
print link.renderContents();
Please note sol.html contains the above given html code itself.
Thanks
- AJ
Remove text=True from your code and it should work just fine:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('''
... <html>
... <body>
... <a href = "http:\\www.google.com">Google<br>
... Example
... </body>
... </html>
... ''')
>>> [a.get_text().strip() for a in soup.find_all('a')]
[u'Google', u'Example']
>>> [a.get_text().strip() for a in soup.find_all('a', text=True)]
[u'Example']
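The reason: text=True keeps only tags whose .string is set, and .string is None for any tag with more than one child. After the parser repairs the markup, the first <a> holds both 'Google' and the <br>, so it gets filtered out. A quick check continuing the session above (the exact tree shape depends on the parser):
>>> [a.string is not None for a in soup.find_all('a')]
[False, True]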
Try this code:
from BeautifulSoup import BeautifulSoup
text = '''
<html>
<body>
<a href = "http:\\www.google.com">Google<br>
Example
</body>
</html>
'''
soup = BeautifulSoup(text)
for link in soup.findAll('a'):
    if link.string is not None:
        print link.string
Here's the output when I ran the code:
Example
Just replace text with text = open('sol.html').read(), or whatever it is you need there.

Parsing HTML using Python

I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.
If I have a document of the form:
<html>
<head>Heading</head>
<body attr1='val1'>
    <div class='container'>
        <div id='class'>Something here</div>
        <div>Something else</div>
    </div>
</body>
</html>
then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.
I'd prefer a built-in module but that might be asking a little too much.
I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end as a debate over which one is faster/more efficient.
try:
from BeautifulSoup import BeautifulSoup
except ImportError:
from bs4 import BeautifulSoup
html = "..."  # the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class':'container'}).text)
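With bs4, the same lookup can also be written as a CSS selector (a sketch; select_one() returns the first match or None):
# CSS selector equivalent of body.find('div', attrs={'class': 'container'})
print(parsed_html.select_one('body div.container').get_text())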
You don't need performance comparisons, I guess; just read how BeautifulSoup works. Look at its official documentation.
I guess what you're looking for is pyquery:
pyquery: a jquery-like library for python.
An example of what you want may be like:
from pyquery import PyQuery
html = "..."  # your HTML code
pq = PyQuery(html)
tag = pq('div#id') # or tag = pq('div.class')
print tag.text()
And it uses the same selectors as Firefox's or Chrome's "inspect element". For example, if the inspected element's selector is 'div#mw-head.noprint', then in pyquery you just pass that selector:
pq('div#mw-head.noprint')
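Applied to the HTML from the question, the lookup becomes (a sketch):
# Text of the <div class='container'> under <body>, as asked above.
print(pq('body div.container').text())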
Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated it still gives you a good overview.
Python HTML parser performance
I'd recommend BeautifulSoup even though it isn't built in, just because it's so easy to work with for these kinds of tasks. E.g.:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.google.com/')
soup = BeautifulSoup(page)
x = soup.body.find('div', attrs={'class' : 'container'}).text
Compared to the other parser libraries lxml is extremely fast:
http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html
And with cssselect it’s quite easy to use for scraping HTML pages too:
from lxml.html import parse
doc = parse('http://www.google.com').getroot()
for div in doc.cssselect('a'):
    print '%s: %s' % (div.text_content(), div.get('href'))
lxml.html Documentation
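lxml also speaks XPath, which can express the question's lookup directly; a self-contained sketch using the snippet from the question:
from lxml.html import fromstring

html = """
<html>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
"""

doc = fromstring(html)
# Select every <div> directly inside the div with class 'container'.
for div in doc.xpath("//div[@class='container']/div"):
    print(div.text_content())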
I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).
In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser but rather a very good string analyzer.
I recommend using justext library:
https://github.com/miso-belica/jusText
Usage:
Python 2:
import requests
import justext
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text
Python 3:
import requests
import justext
response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)
I would use EHP
https://github.com/iogf/ehp
Here it is:
from ehp import *
doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
'''
html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print ind.text()
Output:
Something here
Something else
