I have a webpage full of elements like the two examples below (I've given 2; the page contains more than 10). I want to find every element containing 'suda-data' and click on all of them. However, I am unable to work out how to locate all of these elements properly.
Notes:
a. I cannot search by class="S_txt2" (that would include elements outside the criteria).
b. The numbers at the end of 'suda-data' change every time the page is refreshed.
<a suda-data="key=smart_feed&value=time_sort_comm:4611520921076523" href="javascript:void(0);" class="S_txt2" action-type="fl_comment" action-data="ouid=6430256035&location=page_100808_super_index"><span class="pos"><span class="line S_line1" node-type="comment_btn_text"><span><em class="W_ficon ficon_repeat S_ficon"></em><em>14</em></span></span></span></a>
.......
<a suda-data="key=smart_feed&value=time_sort_comm:4612135415451073" href="javascript:void(0);" class="S_txt2" action-type="fl_comment" action-data="ouid=7573331386&location=page_100808_super_index"><span class="pos"><span class="line S_line1" node-type="comment_btn_text"><span><em class="W_ficon ficon_repeat S_ficon"></em><em> 183</em></span></span></span></a>
Is there any way I can find all the elements containing this? Thanks for the help.
Try this on your actual html and see if it works:
targets = browser.find_elements_by_xpath('//a[contains(@suda-data, "smart_feed")]')
for target in targets:
    print(target.get_attribute('suda-data'))
For a page containing just the two <a> elements in your question, the output should be:
key=smart_feed&value=time_sort_comm:4611520921076523
key=smart_feed&value=time_sort_comm:4612135415451073
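Since the goal is to click every match, a minimal follow-up sketch, assuming the links are clickable once located:

targets = browser.find_elements_by_xpath('//a[contains(@suda-data, "smart_feed")]')
for target in targets:
    # Click each matched comment link in turn.
    target.click()

If a click re-renders the page and raises a stale element exception, re-run the find_elements call before continuing the loop.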
Here is a jankier way to do this, off the top of my head.
First, get a list of all the elements with the "a" tag.
Second, filter the list down to only the ones containing "suda-data" and save them; this way the list rebuilds itself along with the changed numbers every time the website is refreshed.
Third, do a soup.find("a", {"suda-data": suda_data_instance}) where suda_data_instance is an element of the list from step 2.
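A rough sketch of those three steps with BeautifulSoup (assuming the page source is already in a string html; note that find_all with attrs={"suda-data": True} collapses steps one and two into a single call):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Steps 1 and 2: collect every <a> tag that carries a suda-data attribute,
# so the list rebuilds itself with the fresh numbers on each page load.
suda_values = [a["suda-data"] for a in soup.find_all("a", attrs={"suda-data": True})]

# Step 3: look each element up again by its exact suda-data value.
for suda_data_instance in suda_values:
    element = soup.find("a", {"suda-data": suda_data_instance})
    print(element)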
After a third attempt at solving this problem, I am unable to finish it on my own. I would appreciate it if someone would share their thoughts on the following issue.
Let us assume, we have such kind of HTML structure:
<div class="panel"></div>
<div class="title"></div>
<h3 class="title">HEADER NUMBER ONE</h3>
<div class="area"></div>
<div class="something">IO field</div>
<input class="input"></input>
<div class="panel"></div>
<div class="title"></div>
<h3 class="title">HEADER NUMBER TWO</h3>
<div class="area"></div>
<div class="something">IO field</div>
<input class="input"></input>
My intention is to identify an input element that belongs to the second panel.
As a reliability check: when I hardcoded an XPath taken directly from the browser, sometimes the wrong element was identified (I assume many scripts run while the page loads, which affects reliability and stability). Therefore I would like to distinguish between the elements based on the h3 headers, which are the one and only difference between the objects.
How can I do it?
When identifying the elements one by one (first the title, then its parent, and then moving down to the input), I receive an "element not interactable" exception, which does not depend on timing.
I am thinking of something like:
find //input[@class='input'] where one of its ancestors contains /div/h3 which contains(text(), 'HEADER NUMBER TWO')
Obviously, I have not found any working solution for that, despite spending more than a week on it.
Is it doable at all? If so, could you suggest something, please? The real structure is a little more complex, but I just need a pattern, a hint, or a clue.
Greetings!
You can locate the parent panel element based on its child h3 with the desired title, and then locate the input element inside it.
The XPath to do so can look like the following:
"//div[#class='panel' and(.//h3[contains(.,'HEADER NUMBER ONE')])]//input"
Or even
"//div[#class='panel' and(contains(.,'HEADER NUMBER ONE'))]//input"
The selenium command using that XPath can look like:
driver.find_element(By.XPATH, "//div[@class='panel' and(contains(.,'HEADER NUMBER ONE'))]//input")
More explanations
The XPath
"//div[#class='panel' and(.//h3[contains(.,'HEADER NUMBER ONE')])]//input"
literally means:
Find an element with a div tag and a class attribute value of panel that has some descendant element inside it (this is what the .// is for) containing the text content HEADER NUMBER ONE.
Inside the above div element, find the input child element.
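If timing is still an issue (you mentioned an "element not interactable" exception), here is a sketch of the same locator wrapped in an explicit wait, targeting the second panel per your original goal; the 20-second timeout is an arbitrary choice:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until the input inside the panel titled "HEADER NUMBER TWO" is clickable.
xpath = "//div[@class='panel' and(.//h3[contains(.,'HEADER NUMBER TWO')])]//input"
input_2 = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.XPATH, xpath)))
input_2.send_keys("some text")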
In Selenium you can find a set of elements:
inputs = driver.find_elements(By.CSS_SELECTOR, 'input[class="input"]')
input_2 = inputs[1]
The input you need is the 2nd element in the list.
In the image below, the first 'td' is for the X quote and the second is for the 2.
I'm scraping the HTML of betexplorer.com, and during this scraping I have to capture the quote variation of the 1 X 2 elements (the quote variation comes up in a new DOM element if you click on the quote). I have to do this iteratively (I wrote a for loop that iterates over this DOM to take all the 'td' elements inside it), but the problem is that the X and 2 quotes have the same structure in each element: same class, same structure. If I take one element and then the other, I still have to put these elements into two different dictionaries.
How can I do this using Selenium and XPath in Python?
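One way to keep the two quotes apart is to rely on position rather than the identical classes. A rough sketch; the row locator here is a placeholder, not betexplorer's real markup:

from selenium.webdriver.common.by import By

x_quotes, quotes_2 = {}, {}

# Placeholder locator: each row is whatever repeating element wraps one match's odds.
rows = driver.find_elements(By.CSS_SELECTOR, "tr.odds-row")  # hypothetical selector
for i, row in enumerate(rows):
    tds = row.find_elements(By.TAG_NAME, "td")
    # Per your description, the first td holds the X quote, the second the 2 quote.
    x_quotes[i] = tds[0].text
    quotes_2[i] = tds[1].text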
I have been trying to create an XPath meant to locate the first three Yes texts within p elements, up to the text Demarcation within an h1 element. The existing XPath used in the script below locates all the text within p elements, and I can't find a way to move forward; just consider the one I've created already to be a placeholder.
How can I create an XPath able to locate the first three Yes texts within p elements and nothing else?
My attempt so far:
from lxml.html import fromstring
htmldoc="""
<li>
<a>Nope</a>
<a>Nope</a>
<p>Yes</p>
<p>Yes</p>
<p>Yes</p>
<h1>Demarcation</h1>
<p>No</p>
<p>No</p>
<h1>Not this</h1>
<p>No</p>
<p>Not this</p>
</li>
"""
root = fromstring(htmldoc)
for item in root.xpath("//li/p"):
    print(item.text)
Try the XPath below to select the paragraphs that are preceding siblings of the header "Demarcation":
//li/p[following-sibling::h1[.="Demarcation"]]
It looks like you are trying to anchor on the h1 tag containing Demarcation, so start from it:
//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")][position()<4]
The idea is to get the preceding p elements; I added position()<4 so you only get three. You can remove that if you just need all of the p elements:
//h1[contains(., "Demarcation")]/preceding-sibling::p[contains(., "Yes")]
I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
In this example I would want to extract "05C80 (05C15)". My web scraper would not be able to search by xpath directly due to the xpaths of my desired elements changing between pages, so I am looking for a more roundabout approach.
My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:
Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))
To port this over to Python, the only analogous approach I could think of is the in operator, but I am not sure how the syntax would work with everything nested inside find_element_by_xpath. How would I bring all of these ideas together to obtain my desired text?
driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in #href]").text
If I understand you correctly, you want to locate all the elements that share the same partial href. You can use this:
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)
or if you want to locate one element:
driver.find_element_by_xpath("//a[contains(#href, '/mathscinet/search/mscdoc.html')]").text
This will give a list of all elements located.
As per the HTML you have shared, @AndreiSuvorkov's answer would possibly cater to your current requirement. Perhaps you can get much more granular and construct an optimized XPath by:
Using starts-with instead of contains
Including the ?code= part of the @href attribute
Your effective code block will be:
all_elements = driver.find_elements_by_xpath("//a[starts-with(@href, '/mathscinet/search/mscdoc.html?code=')]")
for elem in all_elements:
    print(elem.get_attribute("innerHTML"))
In the case that I want the first element using a given class, so I don't have to guess at find_elements_by_xpath(), what are my options? The goal is to write less code, ensuring any changes to the source I am scraping can be fixed easily. Is it possible to essentially do
find_elements_by_css_selector('source[1]')
This code does not work as-is, though.
I am using selenium with Python and will likely be using phantomJS as the webdriver (Firefox for testing).
In CSS selectors, square brackets select attributes, so your sample code is trying to select a 'source' element with an attribute named 1, e.g.
<source 1="your_element" />
Whereas I gather you're trying to find the first in a list that looks like this:
<source>Blah</source>
<source>Rah</source>
If you just want the first matching element, you can use the singular form:
element = driver.find_element_by_css_selector("source")
The form you were using returns a list, so you can also take the (n-1)th item to find the nth instance on the page (lists index from 0):
element = driver.find_elements_by_css_selector("source")[0]
Finally, if you want your CSS selectors to be completely explicit in which element they're finding, you can use the nth-of-type selector:
element = driver.find_element_by_css_selector("source:nth-of-type(1)")
You might find some other helpful information in this blog post from Sauce Labs on writing flexible selectors to replace your XPath.