Parse nested HTML lists using lxml in Python

I am trying to parse the elements of an html list which looks like this:
<ol>
<li>r1</li>
<li>r2
<ul>
<li>n1</li>
<li>n2</li>
</ul>
</li>
<li>r3
<ul>
<li>d1
<ol>
<li>e1</li>
<li>e2</li>
</ol>
</li>
<li>d2</li>
</ul>
</li>
<li>r4</li>
</ol>
I am fine with parsing this for the most part, but the biggest problem for me is in getting the dom text back. Unfortunately lxml's node.text_content() returns the text form of the complete tree under it. Can I obtain the text content of just that element using lxml, or would I need to use string manipulation or regex for that?
For example, the node with d1 returns "d1e1e2", whereas I want it to return just "d1".

Each node has an attribute called text. That's what you are looking for.
e.g.:
for node in root.iter("*"):
    print(node.text)
    # print(node.tail)  # e.g.: <div> <span> abc </span> def </div> => abc def
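To make the difference concrete, here is a minimal sketch (using markup pared down from the question) showing that .text holds only the text before an element's first child, so the d1-style node yields just its own text rather than the whole subtree:

```python
from lxml import html

doc = html.fromstring("""
<ol>
  <li>r1</li>
  <li>r2
    <ul><li>n1</li><li>n2</li></ul>
  </li>
</ol>
""")

# .text is only the text that precedes the element's first child,
# unlike text_content(), which flattens the whole subtree under it.
own_texts = [node.text.strip() for node in doc.iter("li")]
```

Here own_texts comes out as ['r1', 'r2', 'n1', 'n2']; a node's tail similarly holds the text that follows its closing tag.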

Related

Extract text from children of next nodes with XPath and Scrapy

With Python Scrapy, I am trying to get contents in a webpage whose nodes look like this:
<div id="title">Title</div>
<ul>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
<li>
<span>blahblah</span>
<div>blahblah</div>
<p>CONTENT TO EXTRACT</p>
</li>
...
</ul>
I'm a newbie with XPath and couldn't get it for now. My last try was something like:
contents = response.xpath('[@id="title"]/following-sibling::ul[1]//li//p.text()')
... but it seems I cannot use /following-sibling after [@id="title"].
Any idea?
Try this XPath:
contents = response.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
It selects both "CONTENT TO EXTRACT" text nodes.
One XPath would be:
response.xpath('//*[@id="title"]/following-sibling::ul[1]//p/text()').getall()
which gets the text from every <p> tag that is a child or grandchild of the <ul> nearest to the node with id="title".
XPath syntax
Try this using a CSS selector:
response.css('#title ::text').extract()
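The accepted XPath can be checked outside Scrapy as well; response.xpath uses standard XPath semantics, so a plain lxml sketch (with the question's markup pared down) behaves the same way:

```python
from lxml import html

doc = html.fromstring("""
<div>
  <div id="title">Title</div>
  <ul>
    <li><span>blahblah</span><p>CONTENT TO EXTRACT</p></li>
    <li><span>blahblah</span><p>CONTENT TO EXTRACT</p></li>
  </ul>
</div>
""")

# following-sibling::ul[1] picks the first <ul> after the title div;
# /li/p/text() then collects each item's <p> text node.
contents = doc.xpath('//div[@id="title"]/following-sibling::ul[1]/li/p/text()')
```

Both "CONTENT TO EXTRACT" text nodes are selected.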

get sub elements with xpath of lxml.html (Python)

I am trying to get a sub-element with lxml.html; the code is below.
import lxml.html as LH
html = """
<ul class="news-list2">
<li>
<div class="txt-box">
<p class="info">Number:<label>cewoilgas</label></p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>NHYQZX</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>energyinfo</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>calgary_information</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>oilgas_pro</label>
</p>
</div>
</li>
</ul>
"""
To get the sub-element in each li:
htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath("//p/label/text()"))
I am curious why the outcome is:
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
And I also found the solution:
htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath(".//p/label/text()"))
the result is:
['cewoilgas']
['NHYQZX']
['energyinfo']
['calgary_information']
['oilgas_pro']
Should this be regarded as a bug in lxml? Why does the XPath still match through the whole root element (ul) when it is called on the sub-element (li)?
No, this is not a bug but intended behavior. If you start your expression with //, it does not matter whether you call it on the root of the tree or on any element of the tree: the expression is absolute and is applied from the root.
Just remember: if you call xpath() on an element and want it to work relative to that element, always start your expression with a dot, which refers to the current node.
By the way, absolutely (pun intended) the same happens in Selenium and its find_element(s)_by_xpath().
//para selects all the para descendants of the document root and thus
selects all para elements in the same document as the context node
//olist/item selects all the item elements in the same document as the
context node that have an olist parent
. selects the context node
.//para selects the para element descendants of the context node
You can find more examples in the XML Path Language (XPath) specification.
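A small self-contained sketch of the rule above, using markup shaped like the question's:

```python
from lxml import html

doc = html.fromstring(
    '<ul><li><p><label>a</label></p></li>'
    '<li><p><label>b</label></p></li></ul>'
)
first_li = doc.xpath('//ul/li')[0]

# // is absolute: evaluated from the document root, even when
# xpath() is called on an element deeper in the tree.
absolute = first_li.xpath('//p/label/text()')

# .// is relative: evaluated from first_li only.
relative = first_li.xpath('.//p/label/text()')
```

Here absolute contains both labels, while relative contains only the first one.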

Get all text in a tag unless it is in another tag

I'm trying to parse some HTML with BeautifulSoup, and I'd like to get all the text (recursively) in a tag, but I want to ignore all text that appears within a small tag. For example, this HTML:
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
should give the text Final definition. Note that this is a minimal example. In the real HTML there are many other tags involved, so small should be excluded rather than a included.
The text attribute of the tag is close to what I want, but it would include Fun fact. I could concatenate the text of all children except the small tags, but that would leave out definition. I couldn't find a method like get_text_until (the small tag is always at the end), so what can I do?
You can use find_all to find all the <small> tags, clear them, then use get_text():
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
>>> for el in soup.find_all("small"):
... el.clear()
...
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small></small>
</li>
>>> soup.get_text()
'\n\n\n Final\n \n definition.\n \n\n'
You can also get this with the recursive argument, by stating that you don't want to recurse into child tags:
soup.li.find(text=True, recursive=False)
So you can do it like:
' '.join(tag.find(text=True, recursive=False) for tag in soup.find_all(['li', 'a']))
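A variant of the first answer worth knowing: decompose() removes the <small> tags outright instead of just emptying them. A minimal sketch against the question's markup:

```python
from bs4 import BeautifulSoup

html = '<li><a href="/path">Final</a> definition. <small>Fun fact.</small></li>'
soup = BeautifulSoup(html, "html.parser")

# Drop every <small> tag (and its text) from the tree entirely.
for small in soup.find_all("small"):
    small.decompose()

# Collapse the leftover whitespace into single spaces.
text = " ".join(soup.get_text().split())
```

After this, text is "Final definition." with no trace of the small tag's content.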

Fetching list data from malformed HTML

I am attempting to fetch data from a <UL> where the list data is malformed. In other words the end tags (</LI>) are missing in the list:
<UL>
<LI>Blah2
<LI><A>Blah</A>
<LI><A>Blah2</A>
</UL>
<UL>
<LI><A>Blah</A>
<LI>Blah2
<LI><A>Blah2</A>
</UL>
<UL>
<LI><A>Blah</A>
</UL>
<UL>
<LI>Blah
</UL>
Can I somehow iterate through this? As shown in the example, there can be a mixture of links and no links. What's most important is that I fetch the links (if any) and the text.
Unfortunately, BeautifulSoup attempts to repair the malformed HTML and causes more damage than needed:
from bs4 import BeautifulSoup as bsoup
html = '<UL><LI>Blah><LI><A>Blah</A><LI><A>Blah2</A></UL>'
print(bsoup(html).prettify())
<ul>
<li>
Blah>
<li>
<a>
Blah
</a>
<li>
<a>
Blah2
</a>
</li>
</li>
</li>
</ul>
As seen in the example above, BeautifulSoup adds all the end tags at the end of the list items, nesting each item inside the previous one.
As per my comment, BS4 handles invalid HTML differently depending on which parser you use. The four parsers that are supported are:
html.parser (which is built in)
lxml's HTML parser
lxml's XML parser
html5lib (which works in this case)
You can use trial and error, or look specifically at your issue and the way each parser handles it (see each parser's documentation), and choose a parser that acts the way you want it to.
If there are no nested list items, you can manually close the <li> tags using regular expressions:
>>> xhtml = re.sub(r'\<li\>(.*?)(?=\<li\>)', r'<li>\1</li>', html,
... flags=re.IGNORECASE | re.DOTALL)
>>> xhtml
'<UL><li>Blah></li><li><A>Blah</A></li><LI><A>Blah2</A></UL>'
>>> print(BeautifulSoup(xhtml).prettify())
<html>
<body>
<ul>
<li>
Blah>
</li>
<li>
<a>
Blah
</a>
</li>
<li>
<a>
Blah2
</a>
</li>
</ul>
</body>
</html>
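An alternative worth trying for the original malformed lists: lxml's HTML parser also implies the missing </LI> tags (per HTML's auto-close rules), which makes iterating the items straightforward. A hedged sketch, with a hypothetical href added for illustration:

```python
from lxml import html

doc = html.fromstring('<UL><LI>Blah2<LI><A href="/x">Blah</A><LI><A>Blah2</A></UL>')

# libxml2 closes each <LI> when the next one starts, so every item is a
# separate element; collect its text and any link targets it contains.
items = [(li.text_content().strip(), li.xpath('./a/@href'))
         for li in doc.xpath('//ul/li')]
```

This yields one (text, links) pair per list item, with an empty links list for the items that have no anchor.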

How to extract value of attribute in a nested tag structure using beautifulsoup?

I have an HTML file which looks similar to this:
<html>
...
<li class="not a user"> </li>
<li class="user">
<a href="abs" ...> </a>
</li>
<li class="user">
<a href="bss" ...> </a>
</li>
...
</html>
Given the above input, I want to parse the li tags with class="user" and get the values of the hrefs as output. Is this possible using BeautifulSoup in Python?
My solution was:
data = "the above html code snippet"
soup = BeautifulSoup(data)
listset = soup("li", "user")
for list in listset:
    attrib_value = [a['href'] for a in list.findAll('a', {'href': True})]
Obviously I have an error somewhere, because it only lists the attribute values for the last anchor tag's href.
Your code is fine. There are three elements in listset, and attrib_value gets overwritten on each iteration of your loop, so at the end of the program it only contains the href values from the last element of listset, which is bss.
Try this instead to keep all values:
attrib_value += [a['href'] for a in list.findAll('a',{'href':True})]
and initialize attrib_value to the empty list before the loop (attrib_value = []).
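Putting the fix together, a self-contained sketch (note that class_="user" matches any element whose class list contains user, which is why the "not a user" item counts as one of the three; it just contributes no hrefs):

```python
from bs4 import BeautifulSoup

data = """
<html>
<li class="not a user"></li>
<li class="user"><a href="abs"></a></li>
<li class="user"><a href="bss"></a></li>
</html>
"""
soup = BeautifulSoup(data, "html.parser")

attrib_value = []  # initialize once, before the loop
for li in soup.find_all("li", class_="user"):
    # += extends the list instead of overwriting it on each iteration
    attrib_value += [a["href"] for a in li.find_all("a", href=True)]
```

attrib_value now accumulates every matching href, ending up as ['abs', 'bss'].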
