Select text from multiple sub nodes in an xpath - python

I need to use XPath with lxml in Python 2.6 to extract two text items:
-Name One Type 1 Description 1
-Name Two Type 2 Description 2
I've tried using the following XPath: '//*[@id="results"]/li/div/p/child::*/text()'
However this gives me only the following text
-Name One Type 1
-Name Two Type 2
Any suggestions on the correct Xpath to use?
<div id="container">
<ol id="results">
<li class="mod1" data-li-position="0">
<img src="image001.jpg">
<div class="bd">
<h3>
Category 1
</h3>
<p class="description">
<strong class="highlight">Name One</strong>
<strong class="highlight">Type 1</strong>
Description 1
</p>
</div>
</li>
<li class="mod2" data-li-position="1">
<img src="image002.jpg">
<div class="bd">
<h3>
Category 2
</h3>
<p class="description">
<strong class="highlight">Name Two</strong>
Description 2
<strong class="highlight">Type 2</strong>
</p>
</div>
</li>

The last part of your XPath:
...../p/child::*/text()
selects only text nodes that are children of child elements of <p>. That's why you missed, for example, Description 1: it is a direct child of <p>. Try changing that part to:
...../p//text()
This XPath selects all text nodes that are descendants of <p>, in other words, all text nodes anywhere within <p>.
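To make the difference concrete, here is a minimal sketch using lxml (assuming it is installed) on a trimmed copy of the first <li> from the question. It compares three endings for the path: /*/text() (text inside child elements of <p>), /text() (text directly under <p>) and //text() (every descendant text node). The clean helper and the variable names are illustrative, not part of the original post.

```python
from lxml import html

# Trimmed copy of the first <li> from the question.
snippet = """
<ol id="results">
  <li class="mod1">
    <div class="bd">
      <p class="description">
        <strong class="highlight">Name One</strong>
        <strong class="highlight">Type 1</strong>
        Description 1
      </p>
    </div>
  </li>
</ol>
"""

tree = html.fromstring(snippet)
base = '//*[@id="results"]/li/div/p'


def clean(nodes):
    # Drop whitespace-only text nodes and trim the rest.
    return [t.strip() for t in nodes if t.strip()]


in_children = clean(tree.xpath(base + '/*/text()'))  # text inside child elements only
direct = clean(tree.xpath(base + '/text()'))         # text directly under <p> only
everything = clean(tree.xpath(base + '//text()'))    # all descendant text nodes

print(in_children)           # ['Name One', 'Type 1']
print(direct)                # ['Description 1']
print(' '.join(everything))  # Name One Type 1 Description 1
```

The whitespace between tags also comes back as text nodes, which is why the sketch filters and strips the results before joining them.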

Related

Unable to locate element with Python Selenium find_element_by_xpath

I have HTML code like this:
<div class="A">
<div class="B"></div>
<div class="B">
<div class="C"></div>
<div class="C">
<p class="D"> Element 1 </p>
<div class="C"></div>
</div>
</div>
<div class="A">
<div class="B"></div>
<div class="B">
<div class="C"></div>
<div class="C">
<p class="D"> Element 2 </p>
<div class="C"></div>
</div>
</div>
(This is an example; there are more divs with class "A".)
I want to extract the text "Element 2" with Python Selenium.
I tried a lot of things but always get the same result: No such element: Unable to locate element...
I tried:
elem = driver.find_element_by_xpath("//div[@class='A:last-child']/p[@class='D']").text
same result...
Try this:
"(//div[@class='A']//p)[2]"
This collects every p element under a div with class "A" and then takes the second match.
Try this XPath:
"(//div[@class='A']//p)[last()]"
The main problem with your XPath, I think, is that the single slash before the p element means to look only for direct children of the div. You want the double slash to find any descendant.
For this structure, the XPath
(//div[@class="A"]//p[@class="D"])[2]
should work if you want the second match, or
(//div[@class="A"]//p[@class="D"])[last()]
if you want the last one.
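The expressions above can be checked without a browser. The following is a minimal sketch that uses lxml (an assumption, since the question uses Selenium) on a simplified, well-formed version of the asker's HTML. All variable names are illustrative.

```python
from lxml import html

# Simplified, well-formed version of the HTML from the question.
snippet = """
<div>
  <div class="A">
    <div class="B"></div>
    <div class="B">
      <div class="C"><p class="D"> Element 1 </p></div>
    </div>
  </div>
  <div class="A">
    <div class="B"></div>
    <div class="B">
      <div class="C"><p class="D"> Element 2 </p></div>
    </div>
  </div>
</div>
"""
tree = html.fromstring(snippet)

# The asker's attempt: 'A:last-child' is compared literally with the class
# value, and /p only looks at direct children, so nothing matches.
attempt = tree.xpath("//div[@class='A:last-child']/p[@class='D']")

# Parentheses collect all matches first; the index then applies to that list.
second = tree.xpath('(//div[@class="A"]//p[@class="D"])[2]')
last = tree.xpath('(//div[@class="A"]//p[@class="D"])[last()]')

# Without parentheses, [2] means "second p among its own siblings".
unparenthesized = tree.xpath('//div[@class="A"]//p[2]')

print(len(attempt))            # 0
print(second[0].text.strip())  # Element 2
print(last[0].text.strip())    # Element 2
print(len(unparenthesized))    # 0
```

The same expression strings can be passed to Selenium's XPath lookup; lxml is used here only so the sketch runs on its own.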

(Regex) how to retrieve the contents of a particular div (p, span, etc) by ignoring the HTML it contains, using Python

Hello Stack Overflow family,
Below are the elements whose content I want to retrieve.
The goal is to retrieve all the content of any given tag while ignoring the HTML code it contains.
My regex is: ((<p)([\s]+|([a-zA-Z=(\"|')_[\s]+|]+)([\s]+|)>)|<p>)([a-zA-Z ]+)<
<ol class="arabic">
<li>
<div class="first">
Start the notebook server from the
<a class="reference internal" href="glossary">
<span class="xref std std-term">command line</span>
</a>
: yes very good
</div>
<div class="highlight-default notranslate">
The notebook open in your browse.
<span>ok very good</span>
<span class="n">ok nice</span>
<span class="n">notebook</span>
</div>
</li>
</ol>

Python: XPATH search within node

I have HTML code that looks something like this (shortened):
<div id="activities" class="ListItems">
<h2>Standards</h2>
<ul>
<li>
<a class="Title" href="http://www.google.com" >Guidelines on management</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
</li>
</ul>
</div>
<div class="DocList">
<h3>Reports</h3>
<p class="SupLink">+ <a href="http://www.google.com/test" >View More</a></p>
<ul>
<li class="pdf">
<a class="Title" href="document.pdf" target="_blank" >Document</a>
<span class="Size">
[1,542.3KB]
</span>
<div class="Info">
<p>
text <a href="http://www.google.com" >Read more</a>
</p>
<p class="Date">
14/03/2018
</p>
</div>
</li>
</ul>
</div>
I am trying to select the value in 'href=' under 'a class="Title"' by using this code:
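import requests
from lxml import html

def sub_path02(url):
    # Download the page and parse it into an element tree.
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    # Matches every <a class="Title"> in the whole document.
    for node in tree.xpath('//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
But I get two results; the link under 'div class="DocList"' is also returned.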
I am trying to change my xpath expressions so that I would only look within the node but I cannot get it to work.
Could someone please help me understand how to "search" within a specific node. I have gone through multiple xpath documentations but I cannot seem to figure it out.
By using // you are already selecting all the a elements in the document.
To search within a specific div, specify the parent with // and then use //a again to look anywhere inside that div:
//div[@class="ListItems"]//a[@class="Title"]
for node in tree.xpath('//div[@class="ListItems"]//a[@class="Title"]'):
    url2.append(node.get("href"))
Try this XPath expression, which selects the div with a specific id and then searches all of its descendants:
'//div[@id="activities"]//a[@class="Title"]'
So:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    # Only links inside the div with id="activities" are matched.
    for node in tree.xpath('//div[@id="activities"]//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
Note:
It's always better to select by id than by class, because an id should be unique (in real life there is sometimes bad code that reuses the same id several times on a page), whereas a class can be repeated any number of times.
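Here is a minimal, offline sketch of the scoping idea, using lxml on a trimmed copy of the HTML from the question so that no network request is needed. It also shows a second option not mentioned above: selecting the container element first and then running a relative path (note the leading dot) on it. Variable names are illustrative.

```python
from lxml import html

# Trimmed copy of the HTML from the question.
snippet = """
<div>
  <div id="activities" class="ListItems">
    <ul><li><a class="Title" href="http://www.google.com">Guidelines on management</a></li></ul>
  </div>
  <div class="DocList">
    <ul><li class="pdf"><a class="Title" href="document.pdf">Document</a></li></ul>
  </div>
</div>
"""
tree = html.fromstring(snippet)

# Unscoped: matches the links in both divs.
all_links = tree.xpath('//a[@class="Title"]/@href')

# Scoped in a single expression.
scoped = tree.xpath('//div[@id="activities"]//a[@class="Title"]/@href')

# Scoped in two steps: pick the container, then search relative to it.
# The leading dot matters; a path starting with // searches the whole document.
container = tree.xpath('//div[@id="activities"]')[0]
relative = container.xpath('.//a[@class="Title"]/@href')

print(all_links)  # ['http://www.google.com', 'document.pdf']
print(scoped)     # ['http://www.google.com']
print(relative)   # ['http://www.google.com']
```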

How to implement following-sibling axis of xpath alternative in Beautifulsoup Python

I'm trying to collect text using bs4, Selenium and Python. I want to get the text "Lisa Staprans" using:
name = str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").div.get_text().encode("utf-8"))[2:-1]
Here is the HTML:
<div class="profile-about-right">
<div class="text-bold">
SF Peninsula Interior Design Firm
<br/>
Best of Houzz 2015
</div>
<br/>
<div class="page-tags" style="display:none">
page_type: pro_plus_profile
</div>
<div class="pro-info-horizontal-list text-m text-dt-s">
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span class="hide" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Professionals
</span>
</a>
</span>
<span itemprop="child" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/interior-designer/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Interior Designers & Decorators
</span>
</a>
</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
</div>
</div>
Please let me know how this could be done.
I assume you are using BeautifulSoup, since you are passing the class_ keyword argument.
If there is only one element with the class name hzi-font hzi-Man-Outline, then try:
str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").findNext('div').get_text().split(":")[-1]).strip()
This extracts 'Lisa Staprans'.
Here findNext navigates to the next div and extracts its text.
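As a closer match to XPath's following-sibling axis, BeautifulSoup also provides find_next_sibling, which only looks at later siblings of the starting element. The following is a minimal sketch, assuming bs4 is installed and using a trimmed copy of the relevant part of the HTML. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the relevant block from the question.
snippet = """
<div class="info-list-label">
  <i class="hzi-font hzi-Man-Outline"></i>
  <div class="info-list-text">
    <b>Contact</b>
    : Lisa Staprans
  </div>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# A class_ string containing a space is matched against the exact attribute value.
icon = soup.find("i", class_="hzi-font hzi-Man-Outline")

# Roughly equivalent to following-sibling::div[1] in XPath.
text_div = icon.find_next_sibling("div")

# Keep only what follows the last colon and trim the whitespace.
name = text_div.get_text().split(":")[-1].strip()
print(name)  # Lisa Staprans
```

The original attempt with .div fails because the <i> element has no div inside it; the div is its sibling, not its child.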
I can't test it right now, but I would do:
profilePageSource.find_element_by_class_name("info-list-text").get_attribute('innerHTML')
Then you will have to split the result on the : (if it's always there).
For more information: https://selenium-python.readthedocs.org/en/latest/navigating.html
Maybe something is wrong with this part:
find(class_="hzi-font hzi-Man-Outline")
An easy way to get the right information is to inspect the element in Google Chrome, right-click it in the page source, copy its XPath, and then use:
profilePageSource.find_element_by_xpath(<xpath copied from Chrome>).text
Hope it helps.

handling deeply nested tags in xpath

Please help me!
I don't know how to select a deeply nested tag in order to get the text inside it.
Could someone show me how to do it in a single line with an XPath query, and explain the answer?
Below is some HTML. Can anybody explain how to display "Hello world!", or whatever else may be inside those tags?
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div class="deep">
<span>
<strong class="select">Hello world!</strong>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
I assume that, since you asked for the text, the node you'd like to match is the strong tag (the only one with content).
If you are guaranteed only one <strong> tag in the document and the level of nesting is irrelevant, the simplest XPath would be:
//strong/text()
To match via class specifically as well:
//strong[@class="select"]/text()
// searches from the document root at any depth, and @ refers to an attribute, here inside a predicate that matches on its value.
http://www.b624.net/modelare-software-uml-si-xml/laboratoare-an-3-is/xpath-cheat-sheet
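A minimal sketch of these expressions with lxml (an assumption, since the question does not name a library), using a shortened version of the nested HTML. The third expression, anchored on the div with class "deep", is an extra illustration and not part of the answer above.

```python
from lxml import html

# Shortened version of the nested HTML; the depth does not matter to //.
snippet = """
<div><div><div><div>
  <div class="deep">
    <span><strong class="select">Hello world!</strong></span>
  </div>
</div></div></div></div>
"""
tree = html.fromstring(snippet)

by_tag = tree.xpath('//strong/text()')
by_class = tree.xpath('//strong[@class="select"]/text()')

# Anchoring on a known ancestor narrows the search if the page has other <strong> tags.
by_ancestor = tree.xpath('//div[@class="deep"]//strong/text()')

print(by_tag)       # ['Hello world!']
print(by_class)     # ['Hello world!']
print(by_ancestor)  # ['Hello world!']
```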
