Trouble creating an XPath to locate elements conditionally - python

I have been trying to create an XPath that locates the first three Yes texts within p elements, stopping before the h1 element whose text is Demarcation. The expression I've used in the script below locates all the text within p elements, but I can't figure out how to go further; just consider the one I've created a placeholder.
How can I create an XPath that locates the first three Yes within p elements and nothing else?
My attempt so far:
from lxml.html import fromstring
htmldoc="""
<li>
<a>Nope</a>
<a>Nope</a>
<p>Yes</p>
<p>Yes</p>
<p>Yes</p>
<h1>Demarcation</h1>
<p>No</p>
<p>No</p>
<h1>Not this</h1>
<p>No</p>
<p>Not this</p>
</li>
"""
root = fromstring(htmldoc)
for item in root.xpath("//li/p"):
    print(item.text)

Try the below to select the p elements that are preceding siblings of the h1 "Demarcation":
//li/p[following-sibling::h1[.="Demarcation"]]

It looks like you are trying to depend on the h1 tag containing Demarcation, so start from it:
//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")][position()<4]
The idea is to walk back to the previous p elements; position() on the preceding-sibling axis counts from the nearest sibling backwards, so [position()<4] keeps only the three closest. You can remove it if you just need all of the p elements:
//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")]
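As a quick sanity check, the expression can be run against a trimmed copy of the question's markup (a minimal sketch; only the relevant tags are kept):

```python
from lxml.html import fromstring

# Trimmed version of the question's markup
htmldoc = """
<li>
<a>Nope</a>
<p>Yes</p>
<p>Yes</p>
<p>Yes</p>
<h1>Demarcation</h1>
<p>No</p>
</li>
"""

root = fromstring(htmldoc)
# Anchor on the h1, then walk back to its preceding p siblings;
# position() on this axis counts from the nearest sibling backwards
texts = [p.text for p in root.xpath(
    '//h1[contains(., "Demarcation")]'
    '/preceding-sibling::p[contains(., "Yes")][position()<4]')]
print(texts)  # ['Yes', 'Yes', 'Yes']
```

Note that lxml returns the matched nodes in document order even though the axis walks backwards.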

Related

Python selenium: finding multiple elements with partially different names

I have a webpage full of elements like the two examples below (I gave 2; the page contains more than 10). I want to find all of the elements containing 'suda-data' and click on each of them. However, I am unable to define the search for all the elements properly.
Notes:
a. I cannot search by class="S_txt2" (it would include elements outside the criteria).
b. The numbers at the end of 'suda-data' change every time the page is refreshed.
<a suda-data="key=smart_feed&value=time_sort_comm:4611520921076523" href="javascript:void(0);" class="S_txt2" action-type="fl_comment" action-data="ouid=6430256035&location=page_100808_super_index"><span class="pos"><span class="line S_line1" node-type="comment_btn_text"><span><em class="W_ficon ficon_repeat S_ficon"></em><em>14</em></span></span></span></a>
.......
<a suda-data="key=smart_feed&value=time_sort_comm:4612135415451073" href="javascript:void(0);" class="S_txt2" action-type="fl_comment" action-data="ouid=7573331386&location=page_100808_super_index"><span class="pos"><span class="line S_line1" node-type="comment_btn_text"><span><em class="W_ficon ficon_repeat S_ficon"></em><em> 183</em></span></span></span></a>
Is there any way I can find all the elements containing this? Thanks for the help.
Try this on your actual html and see if it works:
targets = browser.find_elements_by_xpath('//a[contains(@suda-data,"smart_feed")]')
for target in targets:
    print(target.get_attribute('suda-data'))
For a page containing just the two <a> elements in your question, the output should be:
key=smart_feed&value=time_sort_comm:4611520921076523
key=smart_feed&value=time_sort_comm:4612135415451073
There is a jankier way to do this that I can think of off the top of my head.
First, get a list of all the elements with the "a" tag.
Second, filter them down to only the ones that have a "suda-data" attribute and save those in a list; the list is rebuilt dynamically, so it tracks the changing numbers every time the site is refreshed.
Third, do a soup.find("a", {"suda-data": value}) where value is an element of the list from step 2.
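The steps above can be sketched with BeautifulSoup like this (the markup is a shortened, hypothetical stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical trimmed markup; the real attribute values are much longer
html = """
<a suda-data="key=smart_feed&amp;value=time_sort_comm:1" class="S_txt2">a</a>
<a href="#" class="S_txt2">no suda-data here</a>
<a suda-data="key=smart_feed&amp;value=time_sort_comm:2" class="S_txt2">b</a>
"""

soup = BeautifulSoup(html, "html.parser")
# Steps 1 and 2: all <a> tags, kept only if they carry a suda-data attribute
targets = [a for a in soup.find_all("a") if a.has_attr("suda-data")]
values = [a["suda-data"] for a in targets]
print(values)
```

Because the filter checks only for the presence of the attribute, the changing numbers at the end never matter.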

Web scraping data inside an HTML h3 tag using Selenium Python

I wanted to grab certain data using Selenium. The data is located inside h3 tags that share a similar class, so how do I grab it?
Those 2 are the data, but they are inside the same class.
I tried to use
driver.find_elements_by_class_name
but it doesn't work. Is there a way to grab it? Thanks
Use the XPath "//*[@class='card-title']" with the function driver.find_elements_by_xpath. To check that the XPath is correct, inspect the page, press Control+F (or Command+F) in the inspector, and paste the XPath into the search bar; you will see whether it finds the elements you are looking for.
Then if you want the text inside:
elements = driver.find_elements_by_xpath("//*[@class='card-title']")
data = [element.text for element in elements]
Yes, there is. You can grab the first one like this:
driver.find_element_by_xpath("(//h3[@class='card-title'])[1]").find_element_by_tag_name('b').text
and the second one like this:
driver.find_element_by_xpath("(//h3[@class='card-title'])[2]").find_element_by_tag_name('b').text
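Since the real page isn't shown, the indexed XPaths can be checked outside the browser first, e.g. with lxml on hypothetical markup matching the description:

```python
from lxml.html import fromstring

# Hypothetical markup standing in for the page in the question:
# two h3 tags sharing the same class, each wrapping a <b> element
html = """
<div>
  <h3 class="card-title"><b>First value</b></h3>
  <h3 class="card-title"><b>Second value</b></h3>
</div>
"""

root = fromstring(html)
# The parentheses make [1]/[2] index the whole match set,
# not the position of each h3 among its own siblings
first = root.xpath("(//h3[@class='card-title'])[1]/b/text()")[0]
second = root.xpath("(//h3[@class='card-title'])[2]/b/text()")[0]
print(first, second)
```

The same expressions can then be dropped into driver.find_element_by_xpath unchanged.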

Empty list as output from scrapy response object

I am scraping this webpage and, while trying to extract text from one element, I have hit a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class name with spaces, the css selector will understand you are looking for a child element of that name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not in its children. Since there is also a strong element as a child of p, I would suggest a generalist approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's relevant to point out that extracting text with CSS selectors is a Scrapy extension of the standard selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
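A minimal sketch of the XPath version, run with lxml against hypothetical markup shaped as the question describes, shows that the descendant axis picks up text inside the nested p and strong elements as well:

```python
from lxml.html import fromstring

# Hypothetical markup with the structure described in the question
html = """
<div class="home-hero-blurb no-select">
  <p>Intro text with <strong>emphasis</strong> inside.</p>
  <p>Second paragraph.</p>
</div>
"""

root = fromstring(html)
# // after the div descends into the p and strong children,
# unlike ::text on the div alone, which stops at the div's own text nodes
texts = [t.strip() for t in
         root.xpath('//div[@class="home-hero-blurb no-select"]//text()')
         if t.strip()]
print(texts)
```

The whitespace-only nodes are stripped here; getall() in Scrapy would return them too.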

Web scraping Button BeautifulSoup Python

I'm trying to web-scrape the span from a button that has a particular class. This is the markup on the website:
<button class="sqdOP yWX7d _8A5w5 " type="button">altri <span>17</span></button>
I'd like to find "17", which obviously changes every time. Thanks.
I've tried this one, but it doesn't work:
for item in soup.find_all('button', {'class': 'sqdOP yWX7d _8A5w5 '}):
For complex selections, it's best to use selectors. These work very similarly to CSS.
p selects an element with the type p.
p.example selects an element with type p and class example.
p span selects any span inside a p.
There are also others, but only these are needed for this example.
These can be nested as you like. For example, p.example span.foo selects any span with class foo inside any p with class example.
Now, an element can have multiple classes, and they are separated by spaces. <p class="foo bar">Hello, World!</p> has both foo and bar as class.
I think I am safe to assume the class sqdOP is unique. You can build the selector pretty easily using the above:
button.sqdOP span
Now call select, and BeautifulSoup will return a list of matching elements. If this is the only one, you can safely use [0] to get the first item. So, the final code to select that span:
soup.select('button.sqdOP span')[0]
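Putting it together on the single button from the question (a minimal, runnable sketch):

```python
from bs4 import BeautifulSoup

html = ('<button class="sqdOP yWX7d _8A5w5 " type="button">'
        'altri <span>17</span></button>')

soup = BeautifulSoup(html, "html.parser")
# button.sqdOP matches on the single class sqdOP, ignoring the other
# classes and the trailing space that broke the exact-string find_all lookup
count = soup.select("button.sqdOP span")[0].text
print(count)  # 17
```

This is also why the original find_all attempt failed: it compared the full class string, trailing space included, instead of matching one class token.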

Using xpath to loop over all <h2> tags within a specific div

I am trying to loop over every <h2> tag (and get its text) inside divs with id="someid", using this code:
for k, div1 in enumerate(tree.xpath('//div[@id="someid"]')):
    print div1.xpath('.//h2['+str(k+1)+']/text()')
but it doesn't work. Why? However this works:
for i in range(5): #let's say there are 5 divs with id="someid" to make things easier
    print tree.xpath('//div[@id="someid"]/div/div[1]/div[2]/h2['+str(i+1)+']/text()')
The problem here is that I have to give the absolute path .../div/div[1]/div[2]..., which I don't want. My first solution looks nicer but does not produce the desired result; I can only retrieve some of the <h2> tags from each div. Can anyone tell me what I am doing wrong?
.// continues the search down the whole subtree of the current node, so your first attempt fails because of the [k+1] predicate: it selects only the (k+1)-th h2 within each div rather than all of them. A list of all h2 text nodes under your divs is just:
tree.xpath('//div[@id="someid"]//h2/text()')
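A runnable sketch of the corrected loop (Python 3 print; the markup is hypothetical, with one h2 nested deeper to show that .// descends to any depth):

```python
from lxml.html import fromstring

# Hypothetical page with two divs sharing the same id-style marker
html = """
<html><body>
<div id="someid"><div><div><h2>One</h2></div><h2>Two</h2></div></div>
<div id="someid"><h2>Three</h2></div>
</body></html>
"""

tree = fromstring(html)
# .// is relative to each matched div, so nested h2 elements
# are found at any depth, without any absolute /div/div[1]/... path
results = [div.xpath(".//h2/text()")
           for div in tree.xpath('//div[@id="someid"]')]
print(results)  # [['One', 'Two'], ['Three']]
```

Dropping the positional predicate is what makes each inner query return every h2 in that div instead of just one.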
