I'm looking for an HTML Parser module for Python that can help me get the tags in the form of Python lists/dictionaries/objects.
If I have a document of the form:
<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
then it should give me a way to access the nested tags via the name or id of the HTML tag so that I can basically ask it to get me the content/text in the div tag with class='container' contained within the body tag, or something similar.
If you've used Firefox's "Inspect element" feature (view HTML) you would know that it gives you all the tags in a nice nested manner like a tree.
I'd prefer a built-in module but that might be asking a little too much.
I went through a lot of questions on Stack Overflow and a few blogs on the internet, and most of them suggest BeautifulSoup, lxml, or HTMLParser, but few of them detail the functionality; they simply end as a debate over which one is faster/more efficient.
try:
    from bs4 import BeautifulSoup  # the current package
except ImportError:
    from BeautifulSoup import BeautifulSoup  # legacy BeautifulSoup 3

html = "..."  # the HTML code you've written above
parsed_html = BeautifulSoup(html)
print(parsed_html.body.find('div', attrs={'class': 'container'}).text)
You don't need performance comparisons, I guess; just read how BeautifulSoup works and look at its official documentation.
I guess what you're looking for is pyquery:
pyquery: a jQuery-like library for Python.
An example of what you want may be like:
from pyquery import PyQuery

html = "..."  # your HTML code
pq = PyQuery(html)
tag = pq('div#id')  # or tag = pq('div.class')
print(tag.text())
And it uses the same selectors as Firefox's or Chrome's inspect element. For example:
The inspected element selector is 'div#mw-head.noprint'. So in pyquery, you just need to pass this selector:
pq('div#mw-head.noprint')
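For instance, a quick sketch of the end-to-end flow (the URL and whether that selector still exists on Wikipedia are assumptions on my part):
from pyquery import PyQuery

# PyQuery can fetch a page directly when given a url keyword
pq = PyQuery(url='https://en.wikipedia.org/')

# Pass the selector exactly as copied from the browser's inspector
tag = pq('div#mw-head.noprint')
print(tag.text())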
Here you can read more about different HTML parsers in Python and their performance. Even though the article is a bit dated, it still gives you a good overview.
Python HTML parser performance
I'd recommend BeautifulSoup even though it isn't built in, just because it's so easy to work with for these kinds of tasks. For example:
from urllib.request import urlopen  # urllib2.urlopen on Python 2
from bs4 import BeautifulSoup

page = urlopen('http://www.google.com/')
soup = BeautifulSoup(page)
x = soup.body.find('div', attrs={'class': 'container'}).text
Compared to the other parser libraries, lxml is extremely fast:
http://blog.dispatched.ch/2010/08/16/beautifulsoup-vs-lxml-performance/
http://www.ianbicking.org/blog/2008/03/python-html-parser-performance.html
And with cssselect it’s quite easy to use for scraping HTML pages too:
from lxml.html import parse

doc = parse('http://www.google.com').getroot()
for link in doc.cssselect('a'):  # requires the cssselect package
    print('%s: %s' % (link.text_content(), link.get('href')))
lxml.html Documentation
I recommend lxml for parsing HTML. See "Parsing HTML" (on the lxml site).
In my experience, Beautiful Soup messes up on some complex HTML. I believe that is because Beautiful Soup is not a parser but rather a very good string analyzer.
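As a minimal sketch of the "Parsing HTML" approach on the question's own markup (hedged: the XPath expression is mine, not from the lxml docs):
from lxml import html

doc = html.fromstring("""
<body attr1='val1'>
  <div class='container'>
    <div id='class'>Something here</div>
    <div>Something else</div>
  </div>
</body>
""")

# XPath addresses nested tags by name and attribute
container = doc.xpath("//div[@class='container']")[0]
print(container.text_content())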
I recommend using the jusText library:
https://github.com/miso-belica/jusText
Usage:
Python 2:
import requests
import justext

response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print paragraph.text
Python 3:
import requests
import justext

response = requests.get("http://bbc.com/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    print(paragraph.text)
I would use EHP
https://github.com/iogf/ehp
Here it is:
from ehp import *
doc = '''<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
</div>
</body>
</html>
'''
html = Html()
dom = html.feed(doc)
for ind in dom.find('div', ('class', 'container')):
    print(ind.text())
Output:
Something here
Something else
Related
<div class="ELEMENT1">
<div class="ELEMENT2">
<div class="ELEMENT3">valeur1</div>
<div class="ELEMENT4">
<svg class="ELEMENT5 ">
<a href="ELEMENT6» target="ELEMENT7" class="ELEMENT8">
<div>TEXT</div
Hello to all,
My request is the following: from the piece of code above, I want to create a loop that extracts TEXT if and only if the div class is ELEMENT4 and the svg class is ELEMENT5 (because there are other, different ones).
Thank you for your help,
eddy
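A minimal BeautifulSoup sketch for that kind of nested-class filter (the markup is reconstructed from the snippet above, so treat it as an assumption):
from bs4 import BeautifulSoup

html = """
<div class="ELEMENT4">
  <svg class="ELEMENT5">
    <a href="ELEMENT6" target="ELEMENT7" class="ELEMENT8">
      <div>TEXT</div>
    </a>
  </svg>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
# Keep the text only when both conditions hold:
# the div has class ELEMENT4 and it contains an svg with class ELEMENT5
for div in soup.find_all('div', class_='ELEMENT4'):
    svg = div.find('svg', class_='ELEMENT5')
    if svg is not None:
        print(svg.get_text(strip=True))  # -> TEXT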
You'll need to import urllib2 or some other library that lets you fetch a URL's HTML, and import Beautiful Soup as well. Scrape the URL and store the result in a variable, then reformat the output in any way that serves your needs.
For example:
import urllib2
from bs4 import BeautifulSoup

page = urllib2.urlopen("the_url")
content = BeautifulSoup(page.read().decode("utf-8"))  # decode the data (utf-8)
divs = content.find_all("div")  # finds all div elements in the document
Then you could use regexp to find the actual text inside the element.
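That said, bs4's own get_text() is usually simpler than regular expressions here; a sketch continuing from the snippet above:
for div in divs:  # 'divs' is the find_all result from the example above
    print(div.get_text(strip=True))  # the text content, markup stripped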
Good luck on your assignment!
I am trying to change the content of a <p> tag in an HTML DOM. I have used bs4 to successfully select the particular tag and update it, but how do I save it back to the HTML file?
Python code
import bs4

exampleFile = open('stefan/index.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read())
test = "testing testing"
elems = exampleSoup.select('#page_main_text')
print(elems[0].getText())
tag = exampleSoup.p
tag.string = test
print(tag)
HTML
<p class="grey" id="page_main_text">It was an awesome experience to grow up with you. I can recall all the memories that we shared together. We laughed together and cried together. All these things are to remind you – happy birthday.</p>
Hi, I'm quite new to Python and my boss has asked me to scrape this data; it is not my strong point, so I was wondering how I would go about this.
The text that I'm after sits in the quote marks and changes every few minutes, so I'm also not sure how to locate it.
I am using Beautiful Soup at the moment, and lxml; however, if there are better alternatives I'm happy to try them.
This is the inspected element of the webpage:
<div class="sometext">
  <h3> somemoretext </h3>
  <p>
    <span class="title" title="text i want">text i want</span>
    <br>
  </p>
</div>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[#class="title"]/text()')
print r
Thank you in advance; any help would be appreciated!
First do this to get what you are looking at in the soup:
soup = BeautifulSoup(page.text)
print(soup)
That way you can double-check that you are actually dealing with what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class": "title"})
for span in r:
    print(span.text)
This will get all the span tags with class="title", and span.text will print out the text in between the tags.
Edited to Add
Note that esecules' answer will get you the title within the tag (<span class = "title" title="text i want">) whereas mine will get the title from the text (<span class = "title" >text i want</span>)
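To make the difference concrete (a sketch, where span is one tag from the loop above):
# <span class="title" title="text i want">text i want</span>
print(span['title'])  # the title attribute (esecules' answer)
print(span.text)      # the enclosed text (the loop above)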
Perhaps find is the method you really need, since you're only ever looking for one element (docs):
r = soup.find('div', 'sometext').find('span', 'title')['title']
If you're familiar with XPath and you don't need features specific to BeautifulSoup, then using lxml alone is enough (and maybe even better, since lxml is known to be faster):
from lxml import html
import requests

page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print(r)
I'm making a small Python script for auto-logon to a website, but I'm stuck.
I'm looking to print into the terminal a small part of the HTML, located within this tag in the HTML file on the site:
<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>
But how do I extract and print just the name, John Appleseed?
I'm using Python's Mechanize on a Mac, by the way.
Mechanize is only good for fetching the HTML. Once you want to extract information from it, you could use, for example, BeautifulSoup. (See also my answer to a similar question: Web mining or scraping or crawling? What tool/library should I use?)
Depending on where the <td> is located in the HTML (it's unclear from your question), you could use the following code:
html = ...  # this is the html you've fetched

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html)

# use this (gets all <td> elements)
cols = soup.findAll('td')
# or this (gets only <td> elements with class='h3')
cols = soup.findAll('td', attrs={"class": 'h3'})

print(cols[0].renderContents())  # print content of first <td> element
As you have not provided the full HTML of the page, the only option right now is either using string.find() or regular expressions.
But the standard way of finding this is using XPath. See this question: How to use XPath in Python?
You can obtain the XPath for an element using the "Inspect Element" feature of Firefox.
For example, if you want to find the XPath for the username on the Stack Overflow site:
open Firefox, log in to the website, right-click on the username (shadyabhi in my case) and select Inspect Element.
Then hover over the tag, or right-click it and choose "Copy XPath".
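Once copied, that XPath can be fed straight to lxml; a sketch (the URL and the expression are placeholders, not the real Stack Overflow ones):
from urllib.request import urlopen
from lxml import html

page_source = urlopen('the_url').read()
tree = html.fromstring(page_source)

# Paste the XPath copied from the inspector in place of this placeholder
print(tree.xpath('//div[@class="user-info"]//a/text()'))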
You can use a parser to extract any information in a document. I suggest you use the lxml module.
Here you have an example:
from lxml import etree
from StringIO import StringIO  # from io import StringIO on Python 3

parser = etree.HTMLParser()
tree = etree.parse(StringIO("""<td class=h3 align='right'> John Appleseed</td><td> <img border=0 src="../tbs_v7_0/images/myaccount.gif" alt="My Account"></td>"""), parser)

print(tree.xpath("string()").strip())  # -> John Appleseed
More information about lxml here
I am trying to parse an HTML page with BeautifulSoup, but it appears that BeautifulSoup doesn't like the HTML or that page at all. When I run the code below, the method prettify() returns me only the script block of the page (see below). Does anybody have an idea why this happens?
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1"
html = "".join(urllib2.urlopen(url).readlines())
print "-- HTML ------------------------------------------"
print html
print "-- BeautifulSoup ---------------------------------"
print BeautifulSoup(html).prettify()
This is the output produced by BeautifulSoup.
-- BeautifulSoup ---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<script language="JavaScript">
<!--
function highlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_on.gif";
}
function unhighlight(img) {
document[img].src = "/marketing/sony/images/en/" + img + "_off.gif";
}
//-->
</script>
Thanks!
UPDATE: I am using the following version, which appears to be the latest.
__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "3.1.0.1"
__copyright__ = "Copyright (c) 2004-2009 Leonard Richardson"
__license__ = "New-style BSD"
Try with version 3.0.7a, as Łukasz suggested. BeautifulSoup 3.1 was designed to be compatible with Python 3.0, so they had to change the parser from SGMLParser to HTMLParser, which seems more vulnerable to bad HTML.
From the changelog for BeautifulSoup 3.1:
"Beautiful Soup is now based on HTMLParser rather than SGMLParser, which is gone in Python 3. There's some bad HTML that SGMLParser handled but HTMLParser doesn't"
Try lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup, so it might work better for you. It has a compatibility API for BeautifulSoup too if you don't want to learn the lxml API.
Ian Bicking agrees.
There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
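For what it's worth, a small sketch of lxml recovering from the kind of doubled quoting described in the next answer (the snippet itself is contrived):
from lxml import html

# libxml2's recovering parser accepts the doubled quotes instead of giving up
broken = '<html><body><script type=""javascript""></script><div>content</div>'
doc = html.fromstring(broken)
print(html.tostring(doc, pretty_print=True).decode())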
BeautifulSoup isn't magic: if the incoming HTML is too horrible then it isn't going to work.
In this case, the incoming HTML is exactly that: too broken for BeautifulSoup to figure out what to do. For instance it contains markup like:
SCRIPT type=""javascript""
(Notice the double quoting.)
The BeautifulSoup docs contain a section on what you can do if BeautifulSoup can't parse your markup. You'll need to investigate those alternatives.
Samj: If I get things like
HTMLParser.HTMLParseError: bad end tag: u"</scr' + 'ipt>"
I just remove the culprit from the markup before I serve it to BeautifulSoup, and all is dandy:
html = urllib2.urlopen(url).read()
html = html.replace("</scr' + 'ipt>", "")
soup = BeautifulSoup(html)
I had problems parsing the following code too:
<script>
function show_ads() {
document.write("<div><sc"+"ript type='text/javascript'src='http://pagead2.googlesyndication.com/pagead/show_ads.js'></scr"+"ipt></div>");
}
</script>
HTMLParseError: bad end tag: u'', at line 26, column 127
Sam
I tested this script on BeautifulSoup version '3.0.7a' and it returns what appears to be correct output. I don't know what changed between '3.0.7a' and '3.1.0.1', but give it a try.
>>> import urllib
>>> from BeautifulSoup import BeautifulSoup
>>> page = urllib.urlopen('http://www.futureshop.ca/catalog/subclass.asp?catid=10607&mfr=&logon=&langid=FR&sort=0&page=1')
>>> soup = BeautifulSoup(page)
>>> soup.prettify()
In my case, executing the above statements returns the entire HTML page.