Set lxml as default BeautifulSoup parser - python

I'm working on a web scraping project and have run into problems with speed. To try to fix this, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:
soup = bs4.BeautifulSoup(html, 'lxml')
but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?

According to the Specifying the parser to use documentation page:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
In other words, just installing lxml in the same Python environment makes it the default parser.
Note, though, that explicitly stating a parser is considered best practice. There are differences between parsers that can result in subtle errors, which would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember that you need to have lxml installed - and if you didn't have it installed, you would not even notice: BeautifulSoup would just pick the next available parser without raising any errors.
If you still don't want to specify the parser explicitly, at least leave a note for your future self, or for others who will use the code, in the project's README/documentation, and list lxml in your project requirements alongside beautifulsoup4.
Besides: "Explicit is better than implicit."
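To illustrate the difference (a minimal sketch; the HTML string here is made up), the explicit form fails loudly when lxml is missing, while the implicit form silently falls back:
import bs4

html = "<p>Hello</p>"

# Implicit: quietly uses the "best" installed parser (lxml, if available).
soup_default = bs4.BeautifulSoup(html)

# Explicit: raises bs4.FeatureNotFound if lxml is not installed,
# instead of silently falling back to another parser.
soup_lxml = bs4.BeautifulSoup(html, 'lxml')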

Obviously take a look at the accepted answer first. It is pretty good, and as for this technicality:
but I don't want to have to repeatedly type 'lxml' every time I call
BeautifulSoup. Is there a way I can set which parser to use once at
the beginning of my program?
If I understood your question correctly, I can think of two approaches that will save you some keystrokes:
- Define a wrapper function, or
- Create a partial function.
# V1 - define a wrapper function - most straightforward.
import bs4

def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')

# ...
html = ...
bs_parse(html)
Or if you feel like showing off ...
import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)

Related

lxml xpath and find return nothing

Python 2.7
I assume I'm missing something incredibly basic to do with lxml, but I have no idea what it is. By way of background, I have not used lxml much before, but I have used XPaths extensively in Selenium and have also done a bit of parsing with BS4.
So, I'm making a call to this API that returns some XML as a string. Easy enough:
from lxml import etree
from io import StringIO
myXML = 'xml here'
tree = etree.parse(StringIO(myXML))
print tree.xpath('/IKnowThisTagExistsInMyXML')
It always returns [] or None. I've tried tree.find() and tree.findall() as well, to no avail.
I'm hoping someone has seen this before and can tell me what's going on.
An XPath of /IKnowThisTagExistsInMyXML assumes the tag IKnowThisTagExistsInMyXML is the top-level element of your XML document, which I really doubt it is.
Try searching your XML document for this tag at any depth instead:
print tree.xpath('//*/IKnowThisTagExistsInMyXML')
See: XPath Syntax
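To make the difference concrete, here is a minimal sketch with a made-up document, where the tag is nested rather than at the root:
from lxml import etree

# Made-up document: the tag is nested, not the root element.
xml = "<Root><Outer><IKnowThisTagExistsInMyXML>hi</IKnowThisTagExistsInMyXML></Outer></Root>"
tree = etree.fromstring(xml)

print(tree.xpath('/IKnowThisTagExistsInMyXML'))   # [] - only matches a root element with that name
print(tree.xpath('//IKnowThisTagExistsInMyXML'))  # [<Element ...>] - matches at any depth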

How can I use the built-in Python read() function in Julia using PyCall?

I'm using Julia, and right now I'm trying to use the PyCall package so that I can use the BeautifulSoup module for web parsing. My Julia code looks something like
using PyCall
pyinitialize("python3")
@pyimport bs4 #need BeautifulSoup
@pyimport urllib.request as urllib #need urlopen
url_base = "blah"
html = urllib.urlopen(url_base).read()
soup = bs4.BeautifulSoup(html, "lxml")
However, when I try to run it, I get complaints about the read() function. I first thought that read() would be a built-in Python function, but pybuiltin("read") didn't work.
I'm not sure what Python module I can import to get the read function. I tried importing the io module and using io.read(), but that didn't work. Additionally, using Julia's built-in read functions didn't work, since urllib.urlopen(url_base) is a PyObject.
You have a typo:
html = urllib.urlopen(url_base).read()
should be
html = urllib.urlopen(url_base)[:read]()
See the PyCall documentation:
Important: The biggest difference from Python is that object attributes/members are accessed with o[:attribute] rather than o.attribute, so that o.method(...) in Python is replaced by o[:method](...) in Julia. Also, you use get(o, key) rather than o[key]. (However, you can access integer indices via o[i] as in Python, albeit with 1-based Julian indices rather than 0-based Python indices.)
You need to split out reading the response. Instead of:
html = urllib.urlopen(url_base).read()
Try:
with urllib.urlopen(url_base) as response:
    html = response.read()
Python 3 goes a long way toward improving clarity and readability.

BeautifulSoup Object is different from request content

I make a call to the get function in Python using the requests module. I pass the response content to BeautifulSoup. But when I print the resulting BeautifulSoup object, it is quite different from the request content: some tags are missing, and some are repeated. Why does that happen? For example:
req1 = requests.get(url, headers=headers)
print req1.content
s1=BeautifulSoup(req1.content)
print s1
At the very least, this happens because the HTML may not be perfectly formed, and BeautifulSoup's underlying parser makes an attempt to fix it. The behavior varies from parser to parser; see more at:
Differences between parsers
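Here is a minimal sketch of how those repairs can differ; the broken fragment is made up, and lxml and html5lib must be installed for those two parsers to be available:
from bs4 import BeautifulSoup

# Deliberately malformed fragment: unclosed <p> tags and a stray </b>.
broken = "<p>one<p>two</b>"

# Each parser repairs the markup differently, so the resulting trees differ.
for parser in ("html.parser", "lxml", "html5lib"):
    print(parser, "->", BeautifulSoup(broken, parser))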

BeautifulSoup fails to parse long view state

I try to use BeautifulSoup4 to parse the html retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0 If I print out the resulting soup, it ends like this:
kZXI9IjAi"/></form></body></html>
Searching for the last characters 9IjaI in the raw html, I found that it's in the middle of a huge viewstate. BeautifulSoup seems to have a problem with this. Any hint what I might be doing wrong or how to parse such a page?
BeautifulSoup uses a pluggable HTML parser to build the 'soup'; you need to try out different parsers, as each will treat a broken page differently.
I had no problems parsing that page with any of the parsers, however:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0')
>>> for parser in ('html.parser', 'lxml', 'html5lib'):
...     print repr(str(BeautifulSoup(r.text, parser))[-60:])
...
';\r\npageTracker._trackPageview();\r\n</script>\n</body>\n</html>\n'
'();\r\npageTracker._trackPageview();\r\n</script>\n</body></html>'
'();\npageTracker._trackPageview();\n</script>\n\n\n</body></html>'
Make sure you have the latest BeautifulSoup4 package installed; I have seen consistent problems in the 4.1 series that were solved in 4.2.

Clean up ugly WYSIWYG HTML code? Python or *nix utility

I'm finally upgrading (rewriting ;) ) my first Django app, but I am migrating all the content.
I foolishly gave users a full WYSIWYG editor for certain tasks; the HTML it produced is, of course, terribly ugly, with more extra tags than content.
Does anyone know of a library or external shell app I could use to clean up the code?
I use tidy sometimes, but as far as I know it doesn't do what I'm asking. I want to simplify all the extra span and other garbage tags. I cleaned up the most offending styles with some regex, but it would take a really long time to do anything more using just regex.
Any ideas?
You could also take a look at Bleach, a whitelist-based HTML sanitizer. It uses html5lib to do what Kyle posted, but you'll get a lot more control over which elements and attributes are allowed in the final output.
Beautiful Soup will probably get you a more complete solution, but you might be able to get some cleanup done more simply with html5lib (if you're OK with html5 rules):
import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

my_html = "<i>Some html fragment</I>"  # intentional capital 'I'

# Parse the fragment into a DOM tree, walk it, and re-serialize it as cleaned-up HTML.
html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
dom_tree = html_parser.parseFragment(my_html)
walker = treewalkers.getTreeWalker("dom")
stream = walker(dom_tree)
s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False, quote_attr_values=True)
cleaned_html = s.render(stream)
cleaned_html == '<i>Some html fragment</i>'
You can also sanitize the html by initializing your html_parser like this:
html_parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"), tokenizer=sanitizer.HTMLSanitizer)
The standard answer is Beautiful Soup.
"Extra span" and "garbage tags" is something you'll need to define very, very carefully so you can remove the tags without removing content.
I would suggest you do two things:
1. Fix your app so that users don't provide HTML under any circumstances. Django can use RST markup, which is much more user-friendly: http://docs.djangoproject.com/en/1.3/ref/templates/builtins/#django-contrib-markup
2. Write a Beautiful Soup parser and transform the user's content into RST markup (a sketch of the tag-stripping part follows below). Keep the structural elements (headings, lists, etc.) and lose the formatting to the extent possible.
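For the tag-stripping part, here is a minimal sketch; the list of tags to unwrap is purely illustrative, and as noted above you would need to define your own very carefully:
from bs4 import BeautifulSoup

# Illustrative list only - decide carefully which tags count as "garbage".
UNWRAP_TAGS = ["span", "font"]

def strip_wysiwyg_cruft(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(UNWRAP_TAGS):
        tag.unwrap()  # keep the tag's contents, drop the tag itself
    return str(soup)

print(strip_wysiwyg_cruft('<p><span style="color:red"><span>Hello</span></span> world</p>'))
# -> <p>Hello world</p>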
