I recently switched from BeautifulSoup to lxml because lxml can handle broken HTML, which is my case. I wanted to know what the equivalent is, or a programmatic way of accomplishing BeautifulSoup's find(). You see, in BS I am able to find a tree node by searching like this:
bs = BeautifulSoup(html)
bs.find('span', {'class': 'some-class-name'})
lxml's find() only searches the current level of the tree; what if I want to search all the nodes in the tree?
Thanks
You can use cssselect:
import lxml.html

root = lxml.html.fromstring(html)
root.cssselect('span.some-class-name')
or xpath:
root.xpath('.//span[@class="some-class-name"]')
Both the cssselect and xpath methods return a list of matched elements, like the findAll/find_all method in BeautifulSoup.
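Since find() in BeautifulSoup returns only the first match, a minimal sketch of the equivalent, assuming a small sample HTML string:

import lxml.html

html = '<div><span class="some-class-name">hello</span></div>'  # assumed sample input
root = lxml.html.fromstring(html)

# xpath() returns a list like find_all(); take the first element to mimic find()
matches = root.xpath('.//span[@class="some-class-name"]')
first = matches[0] if matches else None
print(first.text)  # hello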
If you don't want to bother learning the API for lxml or XPath expressions, then here's another option:
From: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser [...]
And to specify a specific parser to use:
BeautifulSoup(markup, "lxml")
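For instance, a minimal sketch, assuming a deliberately broken sample snippet, that keeps the familiar find() API while letting lxml do the parsing:

from bs4 import BeautifulSoup

broken = '<span class="some-class-name">text'  # unclosed tag (assumed sample)
soup = BeautifulSoup(broken, "lxml")
print(soup.find("span", {"class": "some-class-name"}))
# <span class="some-class-name">text</span>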
HTML has a concept of empty elements, as listed on MDN. However, beautiful soup doesn't seem to handle them properly:
import bs4
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get beautiful soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in their documentation, html5lib parses the document the way a web browser does (like lxml does in this case). It'll try to fix your document tree by adding/closing tags when needed.
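For example, a quick sketch with the html5lib parser (assuming it is installed, e.g. pip install html5lib):

import bs4

soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html5lib'
)
# html5lib closes the void <input> element before <label>, as a browser would
print(soup.body.contents)
# [<div><input name="the-input"/><label for="the-input">My label</label></div>]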
In your example I've used lxml as the parser and it gave the following result:
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added html & body tags because they weren't present in the source; that is why I've printed the body contents.
I would say soup is doing what it can to fix this HTML structure, and that is actually helpful on some occasions.
Anyway, for your case I would say use lxml, which will parse the HTML structure the way you want, or maybe give parsel a try.
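A minimal parsel sketch, assuming parsel is installed (pip install parsel):

from parsel import Selector

sel = Selector(text='<div><input name=the-input><label for=the-input>My label</label></div>')
# parsel wraps lxml, so the void <input> element is handled correctly
print(sel.css('label::text').get())
# My label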
It's been a while since I've used regex, and I feel like this should be simple to figure out.
I have a web page full of links that looks like the string_to_match in the below code. I want to grab just the numbers in the links, like number "58" in the string_to_match. For the life of me I can't figure it out.
import re
string_to_match = '<a href="https://example.com/roster?teamId=58">Roster</a>'
re.findall('Roster', string_to_match)
Instead of using regular expressions on the raw HTML, you can parse it (using BeautifulSoup) to locate the desired link and extract the href attribute value, and then apply a regular expression to just the URL:
import re
from bs4 import BeautifulSoup
data = """
<body>
Roster
</body>
"""
soup = BeautifulSoup(data, "html.parser")
link = soup.find("a", text="Roster")["href"]
print(re.search(r"teamId=(\d+)", link).group(1))
Prints 58.
I would recommend using BeautifulSoup or lxml; it's worth the learning curve.
...But if you still want to use regexp
re.findall(r'href="[^"]*teamId=(\d+)', string_to_match)
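For instance, with the string from the question (href assumed as above), this would give:

import re

string_to_match = '<a href="https://example.com/roster?teamId=58">Roster</a>'
print(re.findall(r'href="[^"]*teamId=(\d+)', string_to_match))
# ['58']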
So let's say I have the following base URL http://example.com/Stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed. I want to find all the links matching this template in an HTML page.
I can use XPath to just match a part of the template, //a[contains(@href, "preview/v")], or just use regexes, but I was wondering if anyone knew a more elegant way to match the entire template using XPath and regexes, so it's fast and the matches are definitely correct.
Thanks.
Edit. I timed it on a sample page. With my internet connection and 100 trials the iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, one can use its Selectors.
from requests import get
from scrapy.selector import Selector

data = get(url).text
sel = Selector(text=data, type="html")
a = sel.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]/@href').extract()
The average time for this is also 0.467 seconds.
You cannot use regexes in plain XPath 1.0 expressions, which is what lxml implements, since XPath 1.0 has no built-in regular expression support. (lxml does expose the EXSLT regular-expression extensions through a namespace, as shown below.)
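For completeness, a sketch of that EXSLT route in lxml, assuming a sample document:

import lxml.html

data = '<a href="http://example.com/Stuff/preview/v/58/fl/1/t/">link</a>'
tree = lxml.html.fromstring(data)

# register the EXSLT regular-expressions namespace so re:test() is available
ns = {"re": "http://exslt.org/regular-expressions"}
links = tree.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]/@href', namespaces=ns)
print(links)
# ['http://example.com/Stuff/preview/v/58/fl/1/t/']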
Alternatively, you can find all the links on a page using iterlinks(), iterate over them and check the href attribute value:
import re
import lxml.html
data = "your html"
tree = lxml.html.fromstring(data)
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    if not pattern.match(link):
        continue
    print(link)
An alternative option would be to use BeautifulSoup html parser:
import re
from bs4 import BeautifulSoup
data = "your html"
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
print(soup.find_all('a', {'href': pattern}))
To make BeautifulSoup parsing faster you can let it use lxml:
soup = BeautifulSoup(data, "lxml")
Also, you can make use of a SoupStrainer class that lets you parse only specific web page parts instead of a whole page.
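A minimal SoupStrainer sketch, restricting parsing to anchor tags only (reusing data and pattern from above):

from bs4 import BeautifulSoup, SoupStrainer

only_links = SoupStrainer("a")  # only <a> tags are parsed; the rest is skipped
soup = BeautifulSoup(data, "lxml", parse_only=only_links)
print(soup.find_all('a', {'href': pattern}))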
Hope that helps.
I want to load the HTML contents of the page into an XML tree and remove elements from it using lxml in Python. I just want to know how I would remove the elements from the content.
You can use the combination of BeautifulSoup4 and lxml to reach your goal easily.
To parse your HTML into a tree/soup, you just need to have all the ingredients installed and do:
from bs4 import BeautifulSoup
html = """..."""
soup = BeautifulSoup(html, 'lxml')
...
Then you modify the tree; here is a whole list of references teaching you how to modify the contents/attributes of a tag, etc.:
BeautifulSoup/Modify The tree
Here is an example I did to modify the contents of an anchor tag.
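A minimal sketch along those lines, assuming a small sample snippet, that changes an anchor's contents and also removes an unwanted element (closer to what the question asks):

from bs4 import BeautifulSoup

html = '<div><a href="http://example.com">old text</a><span class="ad">remove me</span></div>'
soup = BeautifulSoup(html, 'lxml')

# change the anchor's text and href
a = soup.find('a')
a.string = 'new text'
a['href'] = 'http://example.org'

# remove an unwanted element entirely
soup.find('span', class_='ad').decompose()

print(soup.div)
# <div><a href="http://example.org">new text</a></div>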
I am using the python module HTMLParser.py
I am able to parse HTML correctly, but is there an option to change an HTML element's data (innerText)?
Do you know how I can do this with the module HTMLParser?
No, HTMLParser does just that: it parses through your HTML.
You're probably looking for Beautiful Soup. It'll create a parse tree, a Pythonic tree of objects representing the HTML elements of your document. Then you can look up the object (element) you want, assign it a new value, and voila!
Stolen shamelessly from the documentation:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Argh!</b>", "html.parser")
soup.find(text="Argh!").replace_with("Hooray!")
print(soup)
# <b>Hooray!</b>