I want to load the HTML contents of a page into an XML tree and remove elements from it using lxml in Python. I just want to know: how would I remove the elements from the content?
You can use BeautifulSoup4 together with lxml to reach your goal easily.
To parse your HTML into a tree/soup, you just need to have both packages installed and do:
from bs4 import BeautifulSoup
html = """..."""
soup = BeautifulSoup(html, 'lxml')
...
You then modify the tree; here is a whole list of references teaching you how to modify the contents/attributes of a tag, etc.:
BeautifulSoup: Modifying the tree
For example, you can modify the contents of an anchor tag or remove an element entirely, as sketched below.
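A minimal sketch, assuming made-up markup (the URL and the class name are placeholders): assigning to .string replaces a tag's text, decompose() destroys an element in place, and extract() removes it but returns it.
from bs4 import BeautifulSoup

html = '<div><a href="https://example.com">a link</a><span class="ad">remove me</span></div>'
soup = BeautifulSoup(html, 'lxml')

# change the text of the anchor tag
soup.a.string = 'new link text'

# remove the unwanted element entirely
soup.find('span', class_='ad').decompose()

print(soup.div)
# <div><a href="https://example.com">new link text</a></div>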
I have Beautiful Soup code that looks like:
for item in beautifulSoupObj.find_all('cite'):
    pagelink.append(item.get_text())
the problem is, the html code I'm trying to parse looks like:
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
My current selector above would get everything, including strong tags in it.
Thus, how can I parse only:
https://www.websiteurl.com/id=6
Note that <cite> appears multiple times throughout the page, and I want to extract and print everything.
Thank you.
Extracting only the text portion is as easy as calling .text on the object.
We can use basic BeautifulSoup methods to traverse the tree hierarchy.
A helpful explanation of how to do that: HERE
from bs4 import BeautifulSoup
html = '''<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.cite.text)
# is the same as soup.find('cite').text
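Since <cite> appears multiple times on the real page, a short sketch (with made-up extra markup) that prints the text of every <cite> could look like this; find_all() collects them all and .get_text() flattens the nested <strong>:
from bs4 import BeautifulSoup

html = '''
<cite>https://www.<strong>websiteurl.com/id=6</strong></cite>
<cite>https://www.<strong>websiteurl.com/id=7</strong></cite>
'''
soup = BeautifulSoup(html, 'html.parser')

for cite in soup.find_all('cite'):
    # .get_text() (or .text) returns only the text, without the nested tags
    print(cite.get_text())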
I am using BeautifulSoup to extract information from HTML files. I would like to be able to capture the location of the information, that is, the offset within the HTML file of the tag that corresponds to a BS tag object.
Is there a way to do this?
I am currently using the lxml parser as it is the default.
If I'm reading your question correctly, you are parsing some html with BeautifulSoup and then using the soup to identify a tag. Once you have the tag, you are trying to find the index position of the tag within the original html string.
The problem with capturing the index position of a tag using BeautifulSoup is that the soup will alter the structure of the HTML based on the given parser. The lxml parsing might not provide a character-for-character representation, especially after finding a tag within the soup.
It's iffy if this will consistently work, but you might try using a string's find method to find the position of your tag's text contents, which should remain largely unchanged.
from bs4 import BeautifulSoup

# html is a string containing your html document
soup = BeautifulSoup(html, 'lxml')
# target is the tag you want to find
target = soup.find('p')
# now we locate the text of the target inside of the html document
offset = html.find(target.text)
This method will not start at the beginning of the tag, but should be able to locate the tag's contents within the html.
If you wanted to know the index of a tag in the body of your soup, that would be much more feasible.
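For illustration only (a sketch, not part of the original answer, using made-up markup): the index of a tag among its parent's children can be read off with Tag.index():
from bs4 import BeautifulSoup

html = '<body><h1>Title</h1><p>first</p><p>second</p></body>'
soup = BeautifulSoup(html, 'lxml')

target = soup.find_all('p')[1]
# Tag.index() finds a child by identity within the parent's contents
print(target.parent.index(target))  # 2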
HTML has a concept of empty elements, as listed on MDN. However, Beautiful Soup doesn't seem to handle them properly:
import bs4
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'html.parser'
)
print(soup.contents)
I get:
[<div><input name="the-input"><label for="the-input">My label</label></input></div>]
I.e. the input has wrapped the label.
Question: Is there any way to get beautiful soup to parse this properly? Or is there an official explanation of this behaviour somewhere I haven't found yet?
At the very least I'd expect something like:
[<div><input name="the-input"></input><label for="the-input">My label</label></div>]
I.e. the input automatically closed before the label.
As stated in their documentation, html5lib parses the document the same way a web browser does (as lxml does in this case). It'll try to fix your document tree by adding/closing tags when needed.
In your example I've used lxml as the parser and it gave the following result:
soup = bs4.BeautifulSoup(
    '<div><input name=the-input><label for=the-input>My label</label></div>',
    'lxml'
)
print(soup.body.contents)
[<div><input name="the-input"/><label for="the-input">My label</label></div>]
Note that lxml added html & body tags because they weren't present in the source, which is why I've printed the body contents.
I would say the soup is doing what it can to fix this HTML structure, and it is actually helpful on some occasions.
Anyway, for your case I would suggest using lxml, which will parse the HTML structure the way you want, or maybe give parsel a try.
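A quick sketch for comparing parsers side by side (assuming the lxml and html5lib packages are installed); html5lib, like a browser, should also close the void <input> element before the <label>:
import bs4

html = '<div><input name=the-input><label for=the-input>My label</label></div>'

for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = bs4.BeautifulSoup(html, parser)
    # find('div') keeps the comparison fair, since lxml/html5lib add html/body wrappers
    print(parser, soup.find('div'))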
I recently switched from BeautifulSoup to lxml because lxml can work with broken HTML, which is my case. I wanted to know what the equivalent, or a programmatic way, of accomplishing BeautifulSoup's find() is. You see, in BS I am able to find a tree node by searching like this:
bs = BeautifulSoup(html)
bs.find('span', {'class': 'some-class-name'})
lxml's find() just searches the current level of the tree; what if I want to search all the tree nodes?
Thanks
You can use cssselect:
import lxml.html

root = lxml.html.fromstring(html)
root.cssselect('span.some-class-name')
or xpath:
root.xpath('.//span[@class="some-class-name"]')
Both the cssselect and xpath methods return a list of matched elements, like the findAll/find_all method in BeautifulSoup.
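A self-contained sketch of both approaches, using made-up markup (cssselect additionally requires the cssselect package to be installed):
import lxml.html

html = '<div><span class="some-class-name">hello</span></div>'
root = lxml.html.fromstring(html)

# CSS selector
for el in root.cssselect('span.some-class-name'):
    print(el.text_content())

# equivalent XPath query
for el in root.xpath('.//span[@class="some-class-name"]'):
    print(el.text_content())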
If you don't want to bother learning the lxml API or XPath expressions, then here's another option:
From: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser [...]
And to specify a specific parser to use:
BeautifulSoup(markup, "lxml")
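In other words, a sketch reusing the class name from the question: you keep the BeautifulSoup find() API, but lxml does the parsing, so broken HTML is still handled:
from bs4 import BeautifulSoup

# deliberately broken/unclosed HTML
html = '<div><span class="some-class-name">hello</span>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find('span', {'class': 'some-class-name'}))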
I am using the Python module HTMLParser.py.
I am able to parse HTML correctly, but is there an option to change an HTML element's data (innerText)?
Do you know how I can do this with the HTMLParser module?
No, HTMLParser does just that: it parses through your HTML.
You're probably looking for Beautiful Soup. It'll create a parse tree: a Pythonic tree of objects representing the HTML elements of your document. Then, you can look up the object (element) you want, assign it a new value, and voila!
Stolen shamelessly from the documentation:
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Argh!</b>", "html.parser")
soup.find(text="Argh!").replace_with("Hooray!")
print(soup)
# <b>Hooray!</b>