I'm trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0 If I print out the resulting soup, it ends like this:
kZXI9IjAi"/></form></body></html>
Searching for the last characters 9IjAi in the raw HTML, I found that they're in the middle of a huge viewstate. BeautifulSoup seems to have a problem with this. Any hint what I might be doing wrong, or how to parse such a page?
BeautifulSoup uses a pluggable HTML parser to build the 'soup'; you need to try out different parsers, as each will treat a broken page differently.
I had no problems parsing that page with any of the parsers, however:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0')
>>> for parser in ('html.parser', 'lxml', 'html5lib'):
...     print repr(str(BeautifulSoup(r.text, parser))[-60:])
...
';\r\npageTracker._trackPageview();\r\n</script>\n</body>\n</html>\n'
'();\r\npageTracker._trackPageview();\r\n</script>\n</body></html>'
'();\npageTracker._trackPageview();\n</script>\n\n\n</body></html>'
Make sure you have the latest BeautifulSoup4 package installed; I have seen consistent problems in the 4.1 series that were solved in 4.2.
HTML has a concept of empty elements, as listed on MDN. However, Beautiful Soup doesn't seem to handle them properly:
import bs4
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get Beautiful Soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in their documentation, html5lib parses the document the same way a web browser does (like lxml in this case). It'll try to fix your document tree by adding/closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added the html & body tags because they weren't present in the source; that is why I've printed the body contents.
I would say Beautiful Soup is doing what it can to fix this HTML structure, which is actually helpful on some occasions. Anyway, for your case I would use lxml, which will parse the HTML structure the way you want, or maybe give parsel a try.
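Part of what is going on: an HTML tokenizer only emits start/end tag events, and knowing that input is a void element that can never have children is the tree-builder's job, which each parser handles differently. A minimal Python 3 sketch using the standard library's tokenizer shows that no closing event ever arrives for the input, so the builder has to close it on its own:

```python
from html.parser import HTMLParser

class EventLogger(HTMLParser):
    """Record the raw start/end tag events the tokenizer emits."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

p = EventLogger()
p.feed('<div><input name=the-input><label for=the-input>My label</label></div>')
# The event stream contains no ('end', 'input') event at all.
print(p.events)
```

Whether the missing close is synthesized before or after the label is then purely a tree-builder decision, which is why the parsers disagree.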
I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:
soup = bs4.BeautifulSoup(html, 'lxml')
but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program?
According to the Specifying the parser to use documentation page:
The first argument to the BeautifulSoup constructor is a string or an
open filehandle–the markup you want parsed. The second argument is how
you’d like the markup parsed.
If you don’t specify anything, you’ll get the best HTML parser that’s
installed. Beautiful Soup ranks lxml’s parser as being the best, then
html5lib’s, then Python’s built-in parser.
In other words, just installing lxml in the same Python environment makes it the default parser.
Note, though, that explicitly stating a parser is considered best practice. There are differences between parsers that can result in subtle errors which would be difficult to debug if you let BeautifulSoup choose the best parser by itself. You would also have to remember that you need lxml installed; if you didn't have it installed, you would not even notice it - BeautifulSoup would just pick the next available parser without throwing any errors.
If you still don't want to specify the parser explicitly, at least make a note for your future self or others who will use the code in the project's README/documentation, and list lxml in your project requirements alongside beautifulsoup4.
Besides: "Explicit is better than implicit."
Obviously take a look at the accepted answer first. It is pretty good, and as for this technicality:
but I don't want to have to repeatedly type 'lxml' every time I call
BeautifulSoup. Is there a way I can set which parser to use once at
the beginning of my program?
If I understood your question correctly, I can think of two approaches that will save you some keystrokes: define a wrapper function, or create a partial function.
# V1 - define a wrapper function - most straightforward.
import bs4

def bs_parse(html):
    return bs4.BeautifulSoup(html, 'lxml')

# ...
html = ...
bs_parse(html)
Or if you feel like showing off ...
import bs4
from functools import partial
bs_parse = partial(bs4.BeautifulSoup, features='lxml')
# ...
html = ...
bs_parse(html)
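In case partial is new to you: it pre-binds arguments to any callable, not just BeautifulSoup. A tiny standard-library illustration:

```python
from functools import partial

# int() parses base 10 by default; pre-binding base=2 gives a binary parser.
from_binary = partial(int, base=2)
print(from_binary('1010'))  # 10
```

The bs_parse partial above works the same way: it is just BeautifulSoup with features='lxml' already filled in.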
I'm trying to pull NextBus data, specifically bus GPS real-time seen here: http://webservices.nextbus.com/service/publicXMLFeed?command=vehicleLocations&a=sf-muni&r=N&t=0
In it, there are tags that look like this:
<vehicle id="1534" routeTag="N" dirTag="N__OB1" lat="37.76931" lon="-122.43249"
secsSinceReport="99" predictable="true" heading="265" speedKmHr="37"/>
I'm learning Python and have managed to successfully pull a tag based on an attribute. But I'm struggling with any attribute besides id.
So this works:
soup.findAll("vehicle", {"id":"1521"})[1]
But this returns an empty set
soup.findAll("vehicle", {"routeTag":"N"})
Any reason why?
Also, as I mentioned, I'm brand new to Python, so if you have a favorite scraping tutorial feel free to leave a comment!
To make it work, you should pass "xml" to the BeautifulSoup constructor:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://webservices.nextbus.com/service/publicXMLFeed?command=vehicleLocations&a=sf-muni&r=N&t=0'
soup = BeautifulSoup(urlopen(url), "xml")
print soup.find_all("vehicle", {"routeTag":"N"})
prints:
[
<vehicle heading="-4" id="1431" lat="37.72223" lon="-122.44694" predictable="false" routeTag="N" secsSinceReport="65" speedKmHr="0"/>,
...
]
Or, thanks to @Martijn's comment, perform the search in lower case:
print soup.find_all("vehicle", {"routetag": "N"})
Also, note that you should use BeautifulSoup4 and the find_all() method; the 3rd version of BeautifulSoup is no longer maintained.
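For comparison, the same attribute filtering can be done with the standard library's xml.etree alone, since the feed is plain XML. A sketch against a small inline sample standing in for the live feed (the attribute values here are just copied from the question):

```python
import xml.etree.ElementTree as ET

# A small inline sample standing in for the live NextBus response.
sample = '''<body>
<vehicle id="1534" routeTag="N" dirTag="N__OB1" lat="37.76931" lon="-122.43249"
         secsSinceReport="99" predictable="true" heading="265" speedKmHr="37"/>
<vehicle id="1431" routeTag="J" lat="37.72223" lon="-122.44694"/>
</body>'''

root = ET.fromstring(sample)
# XML is case-sensitive, so the routeTag attribute keeps its original casing.
n_vehicles = [v.attrib for v in root.findall('vehicle') if v.get('routeTag') == 'N']
print(n_vehicles[0]['id'])  # 1534
```

This also illustrates why the "xml" parser matters above: an XML parser preserves routeTag's case, while HTML parsers lower-case attribute names.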
I am new to Python (using 2.7.3). I was trying to do web scraping with Python, but I am not getting the expected output:
import urllib
import re
regex='<title>(.+?)<\title>'
pattern=re.compile(regex)
dummy="fsdfsdf<title>Test<\title>dsf"
html=urllib.urlopen('http://www.google.com')
text=html.read()
print pattern.findall(text)
print pattern.findall(dummy)
The second print statement works fine, but the first one should print Google and instead gives an empty list.
Try changing:
regex='<title>(.+?)<\title>'
to
regex='<title>(.+?)</title>'
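To see the corrected pattern in action against a small inline sample (written for Python 3 here):

```python
import re

pattern = re.compile('<title>(.+?)</title>')
# With the forward slash in the closing tag, the title text is captured.
print(pattern.findall('<html><head><title>Google</title></head></html>'))  # ['Google']
```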
You mistyped the slash:
regex='<title>(.+?)<\title>'
should be:
regex='<title>(.+?)</title>'
HTML uses a forward slash in closing tags.
That said, don't use regular expressions to parse HTML. Matching HTML with such expressions gets too complicated, too fast.
Use an HTML parser instead; Python has several to choose from. I recommend you use BeautifulSoup, a popular third-party library.
BeautifulSoup example:
from bs4 import BeautifulSoup
response = urllib.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text
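If installing a third-party package isn't an option, the standard library's HTML tokenizer can pull the title out too. A minimal sketch (written for Python 3, where the class lives in html.parser; the sample markup is just an illustration):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed('<html><head><title>Google</title></head><body></body></html>')
print(parser.title)  # Google
```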
I have a snippet of HTML that contains paragraphs. (I mean p tags.) I want to split the string into the different paragraphs. For instance:
'''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''
Should become:
['<p class="my_class">Hello!</p>',
 "<p>What's up?</p>",
 '<p style="whatever: whatever;">Goodbye!</p>']
What would be a good way to approach this?
If your string only contains paragraphs, you may be able to get away with a nicely crafted regex and re.split(). However, if your string is more complex HTML, or not always valid HTML, you might want to look at the BeautifulSoup package.
Usage goes like:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(some_html)
paragraphs = list(unicode(x) for x in soup.findAll('p'))
Use lxml.html to parse the HTML into the form you want. This is essentially the same advice as the people recommending BeautifulSoup, except that lxml is still being actively developed while BeautifulSoup development has slowed.
Use BeautifulSoup to parse the HTML and iterate over the paragraphs.
The xml.etree module (standard library) or lxml.etree (enhanced) makes this easy to do, but I'm not going to get the answer cred for this because I don't remember the exact syntax. I keep mixing it up with similar packages and have to look it up afresh every time.
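Since the exact syntax came up: here is a sketch of the xml.etree approach, which works only because the example snippet happens to be well-formed XML (ElementTree is not a forgiving HTML parser, so broken markup would need one of the other answers):

```python
import xml.etree.ElementTree as ET

snippet = '''
<p class="my_class">Hello!</p>
<p>What's up?</p>
<p style="whatever: whatever;">Goodbye!</p>
'''

# ElementTree needs a single root element, so wrap the fragment first.
root = ET.fromstring('<root>%s</root>' % snippet)
for p in root.findall('p'):
    p.tail = None  # drop the whitespace that follows each paragraph
paragraphs = [ET.tostring(p, encoding='unicode') for p in root.findall('p')]
print(paragraphs)
```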