Get sub-elements with XPath of lxml.html (Python)

I am trying to get a sub-element with lxml.html; the code is below.
import lxml.html as LH
html = """
<ul class="news-list2">
<li>
<div class="txt-box">
<p class="info">Number:<label>cewoilgas</label></p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>NHYQZX</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>energyinfo</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>calgary_information</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>oilgas_pro</label>
</p>
</div>
</li>
</ul>
"""
To get the sub-element in each li:
htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath("//p/label/text()"))
I am curious why the outcome is:
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
I also found that the solution is:
htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath(".//p/label/text()"))
The result is:
['cewoilgas']
['NHYQZX']
['energyinfo']
['calgary_information']
['oilgas_pro']
Should this be regarded as a bug in lxml? Why does XPath still match through the whole root element (ul) when it is called on the sub-element (li)?

No, this is not a bug; it is intended behavior. If you start your expression with //, it does not matter whether you call it on the root of the tree or on any other element: the expression is absolute and is applied from the root.
Just remember: if you call xpath() on an element and want it to work relative to that element, always start your expression with a dot, which refers to the current node.
By the way, absolutely (pun intended) the same happens in Selenium and its find_element(s)_by_xpath().
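For example, a minimal Selenium sketch (the file path is hypothetical; it assumes the HTML above was saved locally and a Chrome driver is available):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is set up
driver.get("file:///tmp/news.html")  # hypothetical local copy of the HTML above
first_li = driver.find_element(By.XPATH, "//ul[@class='news-list2']/li[1]")
print(len(first_li.find_elements(By.XPATH, "//label")))   # 5: absolute, searches the whole page
print(len(first_li.find_elements(By.XPATH, ".//label")))  # 1: relative, searches inside first_li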

//para selects all the para descendants of the document root and thus
selects all para elements in the same document as the context node
//olist/item selects all the item elements in the same document as the
context node that have an olist parent
. selects the context node
.//para selects the para element descendants of the context node
You can find more examples in the XML Path Language (XPath) specification.
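These rules are easy to check with lxml; a small sketch reusing the html string defined in the question:
htm = LH.fromstring(html)
li = htm.xpath("//ul/li")[0]
print(len(li.xpath("//label")))    # 5: absolute, applied from the root
print(li.xpath(".")[0] is li)      # True: '.' is the context node itself
print(len(li.xpath(".//label")))   # 1: relative, only inside this <li>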

Related

Select dropdown in Selenium Python

I can't select values in a list. I tried using find_element_by_class_name() to open the menu, but when I try to select an <li>, it reports that the element doesn't have a click() function.
Here the code:
click_menu = driver.find_element_by_class_name("periodSelector")
click_menu[1].click()
Here is the HTML that I am trying to parse:
<div data-period-selector="" data-period="periodFilter">
<div class="periodSelectorContainer">
<div class="btn-group periodSelector">
<button class="flat-btn dropdown-toggle periodToggle ng-binding" data-toggle="dropdown"> 20/02/2021 - 22/03/2021 <span class="dropdown-arrow"></span> </button>
<ul class="dropdown-menu">
<li>
<a href="javascript:void(0);" class="new-financ" ng-click="selectToday()"><i></i>
<span class="pull-left">Hoje</span>
<span class="pull-right"></span>
</a>
</li>
<li>
<a href="javascript:void(0);" class="new-financ" ng-click="selectThisWeek()"><i>
</li>
There are multiple class names, so you have to use a CSS selector:
click_menu = driver.find_element_by_css_selector("button.flat-btn.dropdown-toggle.periodToggle.ng-binding")
click_menu.click()
This clicks the 1st li tag:
driver.find_element_by_xpath("//ul[@class='dropdown-menu']/li[1]").click()
periodSelector is a class on a DIV
<div class="btn-group periodSelector">
I'm assuming that you need to click on the BUTTON
<button class="flat-btn dropdown-toggle periodToggle ng-binding" data-toggle="dropdown">
Most of those classes seem generic (probably not unique) but I'm guessing that periodToggle might be unique given the date range. Try
driver.find_element_by_css_selector("button.periodToggle").click()
NOTE:
You have an error in your code. You are using .find_element_by_class_name() (singular) but have array notation on the next line, click_menu[1]. In this case, you can just use click_menu.click(). You'd only need the array notation if you were using .find_elements_by_*() (note the plural, elements).
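A minimal sketch of that difference, assuming driver is the WebDriver from the question and using the newer find_element/find_elements API:
from selenium.webdriver.common.by import By

# Singular: returns a single WebElement, so .click() applies directly
driver.find_element(By.CSS_SELECTOR, "button.periodToggle").click()

# Plural: returns a list, so an index is needed before .click()
buttons = driver.find_elements(By.CSS_SELECTOR, "button.periodToggle")
if buttons:
    buttons[0].click()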

Getting repeats in beautifulsoup nested tags

I'm trying to parse HTML using BeautifulSoup (called with lxml).
On nested tags I'm getting repeated text.
I've tried going through and only counting tags that have no children, but then I'm losing out on data.
Given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
from bs4 import BeautifulSoup

soup = BeautifulSoup(file_info, features="lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # False on empty string / all numbers
        print(tag.text)
I get "to post comments" 4 times.
Is there a BeautifulSoup way of getting the result just once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() is used to find all the <li> tags of class comment_forbidden first last, and the <span> child of each of these <li> tags is read through its string attribute.
For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer get repetitions. It is a costlier parser though, so it may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get your desired result only once.
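For example, a minimal sketch with the first snippet from this question:
from bs4 import BeautifulSoup

html_doc = """
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(html_doc, "lxml")
li = soup.find("li", class_="comment_forbidden first last")  # first match only
print(li.span.string)  # " to post comments"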

remove all data attributes with etree from all elements

So I'm attempting to clean some HTML. I've got the following function:
def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')
    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')
    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')
    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')
    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')
    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')
I'm looking to also remove all data attributes, e.g.
<li data-direction="ltr" data-listposition="center" data-data-id="dataItem-ifz7cqbs" data-state="menu idle link notMobile">sky</li>
But the attributes are unknown to me (the above is just an example).
So I'm looking to transform the above into just <li>sky</li>, and this would run on every element on the page.
In my code above I'm able to remove simple things like id and class, but I'm not sure how to handle the dynamic data-* attributes. Possibly regex?
EDIT
I should clarify a bit about the input. My example above shows <li> tags, but the actual input is the entire HTML of a page, so it would be something like:
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
<div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>
Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:
for el in tree.xpath("//*"):
    for attr in list(el.attrib):  # copy the keys: don't mutate attrib while iterating it
        if attr.startswith("data-"):
            el.attrib.pop(attr)
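A quick self-contained check of that loop (the sample markup is mine):
from lxml import etree

tree = etree.HTML('<ul><li data-i="sdfdsf">something</li>'
                  '<li data-i="dsfd" class="keep">something</li></ul>')
for el in tree.xpath("//*"):
    for attr in list(el.attrib):
        if attr.startswith("data-"):
            el.attrib.pop(attr)
print(etree.tostring(tree, encoding="unicode", method="html"))
# <html><body><ul><li>something</li><li class="keep">something</li></ul></body></html>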
You can clear the attributes with a regular expression like this (note that regex-based stripping is fragile on attribute values that contain quotes, so the tree-based approach above is safer):
import re

def strip_attribute(data):
    p = re.compile('data-[^=]*="[^"]*"')
    return p.sub('', data)

print(strip_attribute('<li data-i="sdfdsf">with attribute</li>'))  # <li >with attribute</li>
Maybe this is what you're looking for:
from lxml import etree
code = """
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
</html>
"""
xml = etree.XML(code)
elements = list(xml.iter())
for element in elements:
    if element.text and element.text.strip():  # guard: .text can be None
        print('<' + element.tag + '>' + element.text + '</' + element.tag + '>')
Output:
<li>something</li>
<li>something</li>
<p>content</p>

Python: XPATH search within node

I have HTML code that looks kind of like this (shortened):
<div id="activities" class="ListItems">
<h2>Standards</h2>
<ul>
<li>
<a class="Title" href="http://www.google.com" >Guidelines on management</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
</li>
</ul>
</div>
<div class="DocList">
<h3>Reports</h3>
<p class="SupLink">+ <a href="http://www.google.com/test" >View More</a></p>
<ul>
<li class="pdf">
<a class="Title" href="document.pdf" target="_blank" >Document</a>
<span class="Size">
[1,542.3KB]
</span>
<div class="Info">
<p>
text <a href="http://www.google.com" >Read more</a>
</p>
<p class="Date">
14/03/2018
</p>
</div>
</li>
</ul>
</div>
I am trying to select the href value under a class="Title" by using this code:
import requests
from lxml import html

def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
But I get two results; the one under div class="DocList" is also returned.
I have tried changing my XPath expression so that it only looks within the node, but I cannot get it to work.
Could someone please help me understand how to "search" within a specific node? I have gone through multiple XPath documentation pages but I cannot seem to figure it out.
Using // you are already selecting all the a elements in the document.
To search in a specific div, specify the parent first with // and then use //a again to look anywhere inside that div:
//div[@class="ListItems"]//a[@class="Title"]
for node in tree.xpath('//div[@class="ListItems"]//a[@class="Title"]'):
    url2.append(node.get("href"))
Try this XPath expression to select recursively within the div that has a specific id:
'//div[@id="activities"]//a[@class="Title"]'
so:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//div[@id="activities"]//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
Note:
It's always better to select by id than by class, because an id should be unique (in real life there is sometimes bad markup with the same id repeated in a page, but a class can legitimately appear N times).
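Putting it together, a self-contained sketch with a trimmed copy of the question's HTML in place of the requests call:
from lxml import html

page_source = """
<div id="activities" class="ListItems">
<ul><li><a class="Title" href="http://www.google.com">Guidelines on management</a></li></ul>
</div>
<div class="DocList">
<ul><li><a class="Title" href="document.pdf">Document</a></li></ul>
</div>
"""
tree = html.fromstring(page_source)
print([a.get("href") for a in tree.xpath('//div[@id="activities"]//a[@class="Title"]')])
# ['http://www.google.com'] -- the DocList link is no longer matched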

Select text from multiple sub nodes in an xpath

I need to use XPath with lxml in Python 2.6 to extract two text items:
-Name One Type 1 Description 1
-Name Two Type 2 Description 2
I've tried using the following XPath: '//*[@id="results"]/li/div/p/child::text()'
However, this gives me only the following text:
-Name One Type 1
-Name Two Type 2
Any suggestions on the correct Xpath to use?
<div id="container">
<ol id="results">
<li class="mod1" data-li-position="0">
<img src="image001.jpg">
<div class="bd">
<h3>
Category 1
</h3>
<p class="description">
<strong class="highlight">Name One</strong>
<strong class="highlight">Type 1</strong>
Description 1
</p>
</div>
</li>
<li class="mod2" data-li-position="1">
<img src="image002.jpg">
<div class="bd">
<h3>
Category 2
</h3>
<p class="description">
<strong class="highlight">Name Two</strong>
Description 2
<strong class="highlight">Type 2</strong>
</p>
</div>
</li>
</ol>
</div>
This last part of your XPath:
...../p/child::text()
selects only text nodes that are direct children of <p>. In this markup the text is split between direct text children of <p> (such as Description 1) and text wrapped inside the <strong> children, so a selection at a single level misses part of the content. You can change that part to:
...../p//text()
This XPath selects all text nodes that are descendants of <p>, in other words, all text nodes anywhere within <p>.
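To get each entry back as one string, the text nodes can be stripped and joined; a short sketch on a trimmed copy of the markup (the joining step is my addition):
from lxml import html

page = """
<ol id="results">
<li><div class="bd"><p class="description">
<strong class="highlight">Name One</strong>
<strong class="highlight">Type 1</strong>
Description 1
</p></div></li>
</ol>
"""
tree = html.fromstring(page)
for p in tree.xpath('//*[@id="results"]/li/div/p'):
    print(" ".join(t.strip() for t in p.xpath(".//text()") if t.strip()))
# Name One Type 1 Description 1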
