HTML Parser In Python without fixing HTML - python

I need to parse HTML, but I do not want the Python parsing library to attempt to "fix" it. Any suggestions on a tool or method to use (in Python)? In my situation, if the HTML is malformed, my script needs to stop processing. I tried BeautifulSoup, but it fixed things that I did not want it to fix. I'm creating a tool that parses template files and outputs them in another template style.

The book Foundations of Python Network Programming has a detailed comparison of what it looks like to scrape the same web page with Beautiful Soup and with the lxml library; but, in general, you will find that lxml is faster, more effective, and has an API which adheres closely to a Python standard (the ElementTree API, which comes with the Python Standard Library). See this blog post by the inimitable Ian Bicking for an idea of why you should be looking at lxml instead of the old-fashioned Beautiful Soup library for parsing HTML:
http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-scraping-library/

I believe BeautifulStoneSoup can do this if you pass in a list of self-closing tags:
The most common shortcoming of BeautifulStoneSoup is that it doesn't
know about self-closing tags. HTML has a fixed set of self-closing
tags, but with XML it depends on what the DTD says. You can tell
BeautifulStoneSoup that certain tags are self-closing by passing in
their names as the selfClosingTags argument to the constructor:
from BeautifulSoup import BeautifulStoneSoup
xml = "<tag>Text 1<selfclosing>Text 2"
print BeautifulStoneSoup(xml).prettify()
# <tag>
# Text 1
# <selfclosing>
# Text 2
# </selfclosing>
# </tag>
print BeautifulStoneSoup(xml, selfClosingTags=['selfclosing']).prettify()
# <tag>
# Text 1
# <selfclosing />
# Text 2
# </tag>
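
For the original requirement (stop as soon as the markup is malformed), one possible approach is to check tag nesting yourself instead of relying on a repairing parser. The following is a minimal sketch using only the standard library's html.parser. The names StrictTagChecker, MalformedHTML and check_html are hypothetical, and the checker is deliberately stricter than HTML itself: it rejects optional end tags such as an unclosed <p>, which may or may not suit your templates.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MalformedHTML(Exception):
    """Hypothetical exception raised when tags do not nest properly."""


class StrictTagChecker(HTMLParser):
    """Hypothetical helper: tracks open tags and never repairs anything."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # also reached for self-closing forms such as <br/>
        if not self.open_tags or self.open_tags[-1] != tag:
            raise MalformedHTML("unexpected closing tag </%s>" % tag)
        self.open_tags.pop()


def check_html(markup):
    """Raise MalformedHTML if the tags in markup are not properly nested."""
    checker = StrictTagChecker()
    checker.feed(markup)
    checker.close()
    if checker.open_tags:
        raise MalformedHTML("unclosed tags: %s" % ", ".join(checker.open_tags))


for sample in ("<div><p>ok</p><br></div>", "<div><p>oops</div>"):
    try:
        check_html(sample)
        print("well-formed:", sample)
    except MalformedHTML as exc:
        print("rejected:", sample, "->", exc)
```

Because the exception propagates out of feed(), the calling script can simply abort the conversion when it is raised.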


Correctly parse empty html tags using beautiful soup

HTML has a concept of empty elements, as listed on MDN. However, beautiful soup doesn't seem to handle them properly:
import bs4
soup = bs4.BeautifulSoup(
'<div><input name=the-input><label for=the-input>My label</label></div>',
'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get beautiful soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in their documentation, html5lib parses the document the way a web browser does (like lxml in this case). It will try to fix your document tree by adding or closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = bs4.BeautifulSoup(
'<div><input name=the-input><label for=the-input>My label</label></div>',
'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added html & body tags because they weren't present in the source, that is why I've printed the body contents.
I would say Beautiful Soup is doing what it can to fix this HTML structure, which is actually helpful on some occasions.
Anyway, for your case I would suggest using lxml, which parses the HTML structure the way you want, or perhaps giving parsel a try.

Python BeautifulSoup giving different results

I am trying to parse an XML file using BeautifulSoup. Consider a sample input XML file as follows:
<DOC>
<DOCNO>1</DOCNO>
....
</DOC>
<DOC>
<DOCNO>2</DOCNO>
....
</DOC>
...
This file consists of 130 <DOC> tags. However, when I tried to parse it using BeautifulSoup's findAll function, it retrieved a seemingly random number of tags (usually between 15 and 25) but never 130. The code I used was as follows:
from bs4 import BeautifulSoup
z = open("filename").read()
soup = BeautifulSoup(z, "lxml")
print(len(soup.findAll('doc')))
# more code involving manipulation of results
Can anybody tell me what I am doing wrong? Thanks in advance!
You are telling BeautifulSoup to use the HTML parser provided by lxml. If you have an XML document, you should stick to the XML parser option:
soup = BeautifulSoup(z, 'xml')
otherwise the parser will attempt to 'repair' the XML to fit HTML rules. XML parsing in BeautifulSoup is also handled by the lxml library.
Note that XML is case sensitive so you'll need to search for the DOC element now.
For XML documents it may be that the ElementTree API offered by lxml is more productive; it supports XPath queries for example, while BeautifulSoup does not.
However, from your sample it looks like there is no one top level element; it is as if your document consists of a whole series of XML documents instead. This makes your input invalid, and a parser may just stick to only parsing the first element as the top-level document instead.
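
One way to handle input that has no single top-level element is to wrap it in a synthetic root before parsing. The sketch below uses the standard library's xml.etree.ElementTree rather than BeautifulSoup. The helper name parse_fragments and the wrapper tag ROOT are hypothetical, and the approach assumes the input carries no XML declaration or DOCTYPE of its own.

```python
import xml.etree.ElementTree as ET

# Two sibling <DOC> elements with no enclosing root, as in the question.
sample = "<DOC>\n<DOCNO>1</DOCNO>\n</DOC>\n<DOC>\n<DOCNO>2</DOCNO>\n</DOC>\n"


def parse_fragments(text, root_tag="ROOT"):
    """Wrap sibling elements in one root so a strict XML parser accepts them."""
    wrapped = "<%s>%s</%s>" % (root_tag, text, root_tag)
    # ET.fromstring raises xml.etree.ElementTree.ParseError on malformed input.
    return ET.fromstring(wrapped)


root = parse_fragments(sample)
docs = root.findall("DOC")  # tag names are case sensitive in XML
print(len(docs))  # 2
print([doc.findtext("DOCNO") for doc in docs])  # ['1', '2']
```

A strict XML parser also refuses to guess: malformed input raises an error instead of being silently repaired.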

Python: Separating an HTML snippets to paragraphs

I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:
'''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''
Should become:
['<p class="my_class">Hello!</p>',
 "<p>What's up?</p>",
 '<p style="whatever: whatever;">Goodbye!</p>']
What would be a good way to approach this?
If your string only contains paragraphs, you may be able to get away with a nicely crafted regex and re.split(). However, if your string is more complex HTML, or not always valid HTML, you might want to look at the BeautifulSoup package.
Usage goes like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(some_html)
paragraphs = list(unicode(x) for x in soup.findAll('p'))
Use lxml.html to parse the HTML into the form you want. This is essentially the same advice as the people who are recommending BeautifulSoup, except lxml is still being actively developed and BeautifulSoup development has slowed.
Use BeautifulSoup to parse the HTML and iterate over the paragraphs.
The xml.etree (std lib) or lxml.etree (enhanced) make this easy to do, but I'm not going to get the answer cred for this because I don't remember the exact syntax. I keep mixing it up with similar packages and have to look it up afresh every time.
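
For reference, the xml.etree route mentioned above could look roughly like the sketch below. It only works when the snippet is well-formed XML (every tag closed, no HTML-only entities such as &nbsp;). The wrapper element root and the function name split_paragraphs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

snippet = '''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''


def split_paragraphs(html_snippet):
    """Return each <p> element of a well-formed snippet as its own string."""
    # A single enclosing element is required for the text to parse as XML.
    root = ET.fromstring("<root>%s</root>" % html_snippet)
    # tostring() also emits the whitespace that follows each element,
    # so strip it from both ends.
    return [ET.tostring(p, encoding="unicode").strip() for p in root.iter("p")]


paragraphs = split_paragraphs(snippet)
for paragraph in paragraphs:
    print(paragraph)
```

If the snippet is not well-formed, ET.fromstring raises ParseError, in which case a more forgiving HTML parser is the better tool.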

How to get the content of a Html page in Python

I have downloaded the web page into an html file. I am wondering what's the simplest way to get the content of that page. By content, I mean I need the strings that a browser would display.
To be clear:
Input:
<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>
Output:
Page title This is paragraph one. This is paragraph two.
Putting it together:
from BeautifulSoup import BeautifulSoup
import re
def removeHtmlTags(page):
    # Regex approach: strip anything that looks like a tag.
    p = re.compile(r'''<(?:"[^"]*"['"]*|'[^']*'['"]*|[^'">])+>''')
    return p.sub('', page)

def removeHtmlTags2(page):
    # Parser approach: keep only the text nodes.
    soup = BeautifulSoup(page)
    return ''.join(soup.findAll(text=True))
Related
Python HTML removal
Extracting text from HTML file using Python
What is a light python library that can eliminate HTML tags? (and only text)
Remove HTML tags in AppEngine Python Env (equivalent to Ruby’s Sanitize)
RegEx match open tags except XHTML self-contained tags (famous don't use regex to parse html rant)
Parse the HTML with Beautiful Soup.
To get all the text, without the tags, try:
''.join(soup.findAll(text=True))
Personally, I use lxml because it's a swiss-army knife...
from lxml import html
print(html.parse('http://someurl.at.domain').xpath('//body')[0].text_content())
This tells lxml to retrieve the page, locate the <body> tag then extract and print all the text.
I do a lot of page parsing and a regex is the wrong solution most of the time, unless it's a one-time-only need. If the author of the page changes their HTML you run a good risk of your regex breaking. A parser is a lot more likely to continue working.
The big problem with a parser is learning how to access the sections of the document you are after, but there are a lot of XPATH tools you can use inside your browser that simplify the task.
You want to look at "Extracting data from HTML documents" in Dive Into Python, because it does (almost) exactly what you want.
The best modules for this task are lxml or html5lib; Beautiful Soup is, IMHO, not worth using anymore. And for recursive structures, regular expressions are definitely the wrong method.
If I understand your question correctly, this can be done simply with the urlopen function of urllib. Have a look at this function to open a URL and read the response, which will be the HTML code of that page.
The quickest way to get a usable sample of what a browser would display is to remove any tags from the html and print the rest. This can, for example, be done using python's re.
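
As a parser-based alternative to stripping tags with a regex, here is a minimal sketch that uses only the standard library's html.parser and collects the text nodes. The class name TextExtractor and the function visible_text are hypothetical, and skipping script and style contents is an assumption about what "strings that a browser would display" should exclude.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that keeps text nodes and drops all tags."""

    SKIPPED = {"script", "style"}  # contents a browser would not display

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(page):
    """Return the text of an HTML page with whitespace runs collapsed."""
    extractor = TextExtractor()
    extractor.feed(page)
    extractor.close()  # flush any text still buffered by the parser
    return " ".join("".join(extractor.chunks).split())


page = """<html><head><title>Page title</title></head>
<body><p id="firstpara" align="center">This is paragraph <b>one</b>.
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>"""

print(visible_text(page))
# Page title This is paragraph one. This is paragraph two.
```

Unlike a regex, this keeps working when attributes contain ">" characters inside quoted values, because the parser tokenizes the tags for you.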

Preventing BeautifulSoup from converting my XML tags to lowercase

I am using BeautifulStoneSoup to parse an XML document and change some attributes. I noticed that it automatically converts all XML tags to lowercase. For example, my source file has <DocData> elements, which BeautifulSoup converts to <docdata>.
This appears to be causing problems since the program I am feeding my modified XML document to does not seem to accept the lowercase versions. Is there a way to prevent this behavior in BeautifulSoup?
No, that's not a built-in option. The source is pretty straightforward, though. It looks like you want to change the value of encodedName in Tag.__str__.
Simple answer: switch from the default html.parser to the XML parser:
soup = BeautifulSoup(yourXmlStr, 'xml')
For a detailed explanation, refer to my answer in another post.
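
If the goal is only to change some attributes while keeping tag case intact, a strict XML parser is another option. The sketch below swaps in the standard library's xml.etree.ElementTree in place of BeautifulSoup. The element name DocData comes from the question; the attribute name status and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with a mixed-case tag name.
source = '<Root><DocData status="old">text</DocData></Root>'

root = ET.fromstring(source)
for element in root.iter("DocData"):  # lookups are case sensitive
    element.set("status", "new")

result = ET.tostring(root, encoding="unicode")
print(result)  # <Root><DocData status="new">text</DocData></Root>
```

Because XML is case sensitive, the parser never normalizes tag names, so the modified document can be handed to the downstream program unchanged in that respect.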
