Finding an element by partial href (Python Selenium)

I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
In this example I would want to extract "05C80 (05C15)". My web scraper would not be able to search by xpath directly due to the xpaths of my desired elements changing between pages, so I am looking for a more roundabout approach.
My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:
Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))
To port this over to python, the only analogous method I could think of would be to use the in operator, but I am not sure how the syntax will work when everything is nested in a find_element_by_xpath. How would I bring all of these ideas together to obtain my desired text?
driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in @href]").text

If I understand you correctly, you want to locate all elements that share the same partial href. You can use this:
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)
or if you want to locate one element:
driver.find_element_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]").text
This will give a list of all elements located.
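For context, a minimal end-to-end sketch of that approach (the driver choice and page URL below are placeholders, not part of the original question):
from selenium import webdriver

driver = webdriver.Chrome()  # placeholder driver
driver.get("https://example.com/results")  # hypothetical page containing the anchors

# Collect the visible text of every anchor whose href contains the shared fragment
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html?code=')]")
codes = [element.text for element in elements]
print(codes)  # e.g. ['65J22 (35R30 47A52 65J20 65R30 90C30)', '05C80 (05C15)']
driver.quit()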

As per the HTML you have shared, @AndreiSuvorkov's answer should cater to your current requirement. You can get more granular and construct an optimized XPath by:
Using starts-with() instead of contains()
Including the ?code= part of the @href attribute
Your effective code block will be:
all_elements = driver.find_elements_by_xpath("//a[starts-with(@href, '/mathscinet/search/mscdoc.html?code=')]")
for elem in all_elements:
    print(elem.get_attribute("innerHTML"))
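If you prefer CSS selectors, the same prefix match can also be written with the ^= ("starts with") attribute operator; a small sketch of that alternative, assuming the same driver object as above:
# CSS equivalent of the starts-with XPath above
all_elements = driver.find_elements_by_css_selector("a[href^='/mathscinet/search/mscdoc.html?code=']")
for elem in all_elements:
    print(elem.text)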

Related

Elements able to be found using XPATH but not using CSS Selector. Am I searching using the correct value?

I am trying to extract data from multiple pages of search results where the HTML in question looks like so:
<ul>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
</ul>
I want to extract the text from the "li" tags, so I have:
text_data = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, 'Card___StyledLi4-ulg8ho-7.jmevwM')))
print(text_data.text)
to wait for and target the "li" items. However, I get a "TimeoutException" error.
However, if I try to locate a single "li" item using the XPath under the same conditions, the data is returned, which leads me to question whether I am inputting the class correctly.
Can anyone tell me what I'm doing wrong? Please let me know if there is any further information you'd like me to provide.
I believe the XPath for these list items would be //li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] (or //*[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] if you want all elements with that class rather than just li tags). You can take a look at this cheatsheet and this tutorial for further rules and examples of XPath.
You can also just use CSS Selectors like (By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM') in this case.
You have mentioned the wrong locator type: it should be CSS_SELECTOR. Also put a dot '.' in front of each class name, because the element's property is a 'class':
text_data = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM')))
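For reference, a fuller sketch with the imports and list handling spelled out (assuming driver is already created); note that the condition returns a list of elements, so read .text from each element rather than from the list itself:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for every matching card, then print each one's text
cards = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".Card___StyledLi4-ulg8ho-7.jmevwM"))
)
for card in cards:
    print(card.text)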

Following sibling within an xpath is not working as intended

I've been trying to scoop a portion of text out of some HTML elements using XPath, but it seems I'm going wrong somewhere, which is why I can't make it work.
Html elements:
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
I would like to dig out International using XPath. I know I could get there using .next_sibling if I wanted to extract the same thing with a CSS selector, but I'm not interested in going that route.
That said, if I try like this I can get it using XPath:
tree.xpath("//*[@class='content']/p/following::text()")[0]
But the above expression is not what I'm after, because I can't use it within Selenium webdriver if I stick to driver.find_element_by_xpath().
The only way that I'm interested in is like the following but it is not working:
"//*[#class='content']/p/following::*"
Real-life example:
from lxml.html import fromstring
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)
item = tree.xpath("//*[@class='content']/p/following::text()")[0].strip()
elem = tree.xpath("//*[@class='content']/p/following::*")[0].text
print(elem)
In the above example, I can successfully print item but not elem. However, I would like to modify the expression used for elem.
How can I make it work so that the same xpath I can use within lxml library or within selenium?
Since the OP was looking for a solution which extracts the text that sits outside the element targeted by the XPath, the following should do that, albeit in a somewhat awkward manner:
tree.xpath("//*[@class='content']")[0][0].tail
Output:
International
The need for this approach is a result of the way lxml parses the html code:
tree.xpath("//*[@class='content']") results in a list of length 1.
The first (and only) element in the list - tree.xpath("//*[@class='content']")[0] - is an lxml.html.HtmlElement, which itself can be treated as a list and also has length 1.
The desired output hides in the tail of the first (and only) element of that lxml.html.HtmlElement.
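If the same value has to come through Selenium, where an XPath locator must return an element rather than a bare text node, one workaround (a sketch of my own under that assumption, not part of the answer above) is to read the container's text and strip off the <p> label:
# Selenium workaround: subtract the <p> text from the container's text
container = driver.find_element_by_xpath("//*[@class='content']")
label = container.find_element_by_tag_name("p")
print(container.text.replace(label.text, "").strip())  # International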

what to do for dynamically changing xpaths in python using selenium?

I have an xpath such as:
//*[#id="jobs-search-box-keyword-id-ember968"]
The number 968 constantly keeps on changing after every reload.
The rest of the string remains constant.
How do I locate the element when part of the xpath constantly changes?
You can use partial id with contains()
//*[contains(@id, "jobs-search-box-keyword-id-ember")]
You can try using starts-with, as below:
//*[starts-with(@id, 'jobs-search-box-keyword-id-ember')]
The details provided are insufficient to give an accurate result, but you can still follow the code references below.
In //*[@id="jobs-search-box-keyword-id-ember968"] the trailing number 968 keeps changing. If you rewrite it as //*[starts-with(@id, 'jobs-search-box-keyword-id-ember')], there may be more than one element with the same partial id, i.e. jobs-search-box-keyword-id-ember; in that case it will locate the first matching element, which may not be the one you expect.
Use the tag name. Let's say the element is an input tag whose id is jobs-search-box-keyword-id-ember968:
Xpath - //input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]
CSS - input[id^='jobs-search-box-keyword-id-ember']
Use a relevant parent element to make this more specific, e.g. the element is inside the parent tag <div class="container">:
Xpath - //div[@class='container']//input[starts-with(@id, 'jobs-search-box-keyword-id-ember')]
CSS - div.container input[id^='jobs-search-box-keyword-id-ember']
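As a usage sketch for the scoped locator above (div.container is the hypothetical parent from this answer, and driver is assumed to already exist):
# Locate the input via the scoped CSS selector suggested above
search_box = driver.find_element_by_css_selector("div.container input[id^='jobs-search-box-keyword-id-ember']")
search_box.send_keys("automation")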
This worked for me:
Locator:
JOBS_SEARCH_BOX_XPATH = "//*[contains(@id, 'jobs-search-box-keyword-id-ember')]"
Code:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, JOBS_SEARCH_BOX_XPATH)))
element.send_keys("SDET")

Using Selenium to select an anchor with specific content

I have a HTML element as follows:
<a class="country" href="/es-co">
Columbia
</a>
How do I select that anchor element based on the content 'Columbia'? I can't use find_element_by_class_css_selector because a.country represents half a dozen elements. How do I select that element and click it using Selenium with Python (through IE, if that has any bearing)?
As an aside, I could have any number of links with the same text and CSS selectors. How would Selenium differentiate?
There's no find_element_by_class_css_selector. But you are right, you can't use class names.
The best way is to use href="/es-co", if it's unique.
find_element_by_css_selector("a[href='/es-co']")
Otherwise you can find by text using XPath
find_element_by_xpath(".//a[contains(text(), 'Columbia')]")
If you have many links with the same locator, you can index them, either in the XPath directly or in the list returned by Selenium.
For example, if you have ten Columbia links:
find_element_by_xpath(".//a[contains(text(), 'Columbia')][10]") # one-based index, one element only
find_elements_by_xpath(".//a[contains(text(), 'Columbia')]")[9] # find_elements_* gives you zero-base index list
In the case of <a> with clickable text, Selenium provides APIs like find_element_by_link_text and find_element_by_partial_link_text.
If there are many <a> elements with the same text/CSS class, your best bet to locate them is an XPath expression, which is accepted by Selenium's APIs.
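For example, a short sketch of the link-text locators (Selenium 3-style Python API; the anchor text is the one from the question):
# Match the exact visible text of the link
element = driver.find_element_by_link_text("Columbia")
# Or match a substring of the visible text
element = driver.find_element_by_partial_link_text("Colum")
element.click()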

In Selenium, how do I include a specific node [1] using find_elements_by_css_selector()

In the case that I want the first element with a given class, so I don't have to guess with find_elements_by_xpath(), what are my options? The goal is to write less code, so that any changes to the source I am scraping can be fixed easily. Is it possible to essentially do
find_elements_by_css_selector('source[1]')
This code does not work as is though.
I am using selenium with Python and will likely be using phantomJS as the webdriver (Firefox for testing).
In CSS Selectors, square brackets select attributes, so your sample code is trying to select the 'source' type element with an attribute named 1, e.g.
<source 1="your_element" />
Whereas I gather you're trying to find the first in a list that looks like this:
<source>Blah</source>
<source>Rah</source>
If you just want the first matching element, you can use the singular form:
element = driver.find_element_by_css_selector("source")
The form you were using returns a list, so you can also take index n-1 to get the nth instance on the page (lists index from 0):
element = driver.find_elements_by_css_selector("source")[0]
Finally, if you want your CSS selectors to be completely explicit in which element they're finding, you can use the nth-of-type selector:
element = driver.find_element_by_css_selector("source:nth-of-type(1)")
You might find some other helpful information at this blog post from Sauce Labs to help you write flexible selectors to replace your XPath.
