How can I get a specific link using BeautifulSoup - python

I'm trying to get the specific link from this
<a href="/doi/10.1021/ed500712k" title="Next" class="header_contnav-next">
<i class="icon-angle-right"></i>
</a>
I'm only able to find all the links within the page and it would be helpful to extract this specific one.
Thank you !

I presume you're using the .find_all() function to get the links. To find one specific element, use .find() instead. If you're sure that only this link has its class attribute set to "header_contnav-next", all you need to do is pass that as a dictionary:
soup.find("a", {"class": "header_contnav-next"})['href']
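For reference, here is a minimal self-contained sketch of that lookup. The "Previous" link and its href are made up for illustration and only show that the class filter skips other anchors; the "Next" link is the one from the question.

```python
from bs4 import BeautifulSoup

# The "Previous" anchor is hypothetical; the "Next" anchor comes from the question.
html = """
<a href="/doi/previous-article" title="Previous" class="header_contnav-prev"></a>
<a href="/doi/10.1021/ed500712k" title="Next" class="header_contnav-next">
<i class="icon-angle-right"></i>
</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches,
# so guard before indexing into the result.
next_link = soup.find("a", class_="header_contnav-next")
href = next_link["href"] if next_link is not None else None

# The same lookup written as a CSS selector.
same_link = soup.select_one("a.header_contnav-next")

print(href)
```

Checking for None first avoids a TypeError on pages where the link is missing, for example on the last article of an issue.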

From what I can see, you just need to find the <a> element, save it to some_variable, and then read some_variable["href"].
Can't give you more information from what you've provided.

Related

Elements able to be found using XPATH but not using CSS Selector. Am I searching using the correct value?

I am trying to extract data from multiple pages of search results where the HTML in question looks like so:
<ul>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
</ul>
I want to extract the text from the "li" tags, so I have:
text_data = WebDriverWait(driver,10).until(EC.visibility_of_all_element_located((By.XPATH,'Card___StyledLi4-ulg8ho-7.jmevwM')
print(text_data.text)
to wait and target "li" item. However, I get a "TimeoutException" error.
However, if I try to locate a single "li" item using the XPATH under the same conditions, the data is returned which leads me to question if I am inputting the class correctly?
Can anyone tell me what I'm doing wrong? Please let me know if there is any further information, you'd like me to provide.
I believe the XPath for these list items would be //li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] (or //*[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] if you want all elements with that class rather than just li tags). You can take a look at this cheatsheet and this tutorial for further rules and examples of XPath.
You can also just use CSS Selectors like (By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM') in this case.
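You have used the wrong locator type. It should be CSS_SELECTOR, with a dot '.' in front of each class name, because you are matching on class. Note that the condition is spelled visibility_of_all_elements_located and returns a list, so read .text from each element rather than from the list itself:
text_data = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM')))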
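To make the difference between the two locator styles concrete, here is a small sketch that builds both strings from the class attribute and checks the XPath form against the sample markup. It uses only the standard library, so no browser is needed; the helper names css_selector_for and xpath_for are my own, not part of Selenium.

```python
import xml.etree.ElementTree as ET

CLASS_ATTR = "Card___StyledLi4-ulg8ho-7 jmevwM"


def css_selector_for(class_attr):
    """Turn a space-separated class attribute into a CSS class selector."""
    return "".join("." + name for name in class_attr.split())


def xpath_for(tag, class_attr):
    """Build an XPath that matches the whole class attribute exactly."""
    return f'//{tag}[@class="{class_attr}"]'


css = css_selector_for(CLASS_ATTR)
xpath = xpath_for("li", CLASS_ATTR)

# The sample list from the question, parsed as XML for an offline check.
snippet = """<ul>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
</ul>"""
root = ET.fromstring(snippet)

# ElementTree needs a leading "." for a search relative to the root element.
items = root.findall("." + xpath)

print(css)
print(xpath)
print(len(items))
```

In Selenium, the first string goes with By.CSS_SELECTOR and the second with By.XPATH. The XPath form only matches when the class attribute is exactly that string, whereas the CSS form still matches if extra classes are added later.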

Using Scrapy Python not able to extract data from response html with xpath due to namespace

I am using scrapy with xpath to extract data from a webpage. My html response looks like this,
I want to extract the href link present in the highlighted "a" tag.
Usually I use response.xpath('//a[@id="jr-alt-sw"]/@href') to get the data, but here I think due to the namespace problem the result is empty. How can I get the data if a namespace is present?
Any help is appreciated!!
Is the namespace really the problem? Either way, this is another reason to use CSS instead:
response.css('a#jr-alt-sw::attr(href)')
The selected a tag has no href attribute here. Look at the next a tag, which does contain one:
response.xpath('//a[@id="jr-pdf-sw"]/@href')

Finding an element by partial href (Python Selenium)

I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
In this example I would want to extract "05C80 (05C15)". My web scraper would not be able to search by xpath directly due to the xpaths of my desired elements changing between pages, so I am looking for a more roundabout approach.
My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:
Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))
To port this over to python, the only analogous method I could think of would be to use the in operator, but I am not sure how the syntax will work when everything is nested in a find_element_by_xpath. How would I bring all of these ideas together to obtain my desired text?
driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in @href]").text
If I understand correctly, you want to locate all elements that share the same partial href. You can use this, which gives a list of all elements located:
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)
or, if you want to locate only one element:
driver.find_element_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]").text
As per the HTML you have shared, @AndreiSuvorkov's answer would probably meet your current requirement. You can make the XPath more specific by:
Using starts-with instead of contains
Including the ?code= part of the @href attribute
Your effective code block will be:
all_elements = driver.find_elements_by_xpath("//a[starts-with(@href,'/mathscinet/search/mscdoc.html?code=')]")
for elem in all_elements:
    print(elem.get_attribute("innerHTML"))
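If you want to try the prefix match without starting a browser, here is a rough offline sketch using BeautifulSoup on the two links from the question. This swaps Selenium for BeautifulSoup and XPath starts-with for the CSS prefix operator ^=, so treat it as an equivalent check rather than the code from the answers above. The third link is made up to show that non-matching hrefs are skipped.

```python
from bs4 import BeautifulSoup

# The first two anchors come from the question; the third is hypothetical.
html = """
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
<a href="/some/other/page.html">unrelated link</a>
"""

PREFIX = "/mathscinet/search/mscdoc.html?code="

soup = BeautifulSoup(html, "html.parser")

# [href^="..."] keeps only anchors whose href starts with the prefix.
links = soup.select(f'a[href^="{PREFIX}"]')

# strip=True removes the newline and trailing space around each label.
codes = [link.get_text(strip=True) for link in links]

print(codes)
```

The same selector string should also work in Selenium when passed with By.CSS_SELECTOR, since ^= is standard CSS.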

Python/BeautifulSoup - Getting specific attribute in the same tag/element

I am new to Python and BeautifulSoup. So please forgive me if I'm using the wrong terminology.
I am trying to get a specific 'text' from a div tag/element that has multiple attributes in the same tag.
<div class="property-item" data-id="183" data-name="Brittany Apartments" data-street_number="240" data-street_name="Brittany Drive" data-city="Ottawa" data-province="Ontario" data-postal="K1K 0R7" data-country="Canada" data-phone="613-688-2222" data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/" data-type="High-rise-apartment" data-latitude="45.4461070" data-longitude="-75.6465360" >
Below is my code to loop through and find 'property-item'
for btnMoreDetails in citySoup.findAll(attrs= {"class":"property-item"}):
My question is, if I specifically want the 'data-name' and 'data-path' for example, how do I go about getting it?
I've searched google and even this website. Some were saying using the .contents[2]. But I still wasn't able to get any of it.
Once you have extracted the element (which findAll does one at a time) you can access attributes as though they were dictionary keys. So for example the following code:
data = """<div class="property-item" data-id="183" data-name="Brittany Apartments" data-street_number="240" data-street_name="Brittany Drive" data-city="Ottawa" data-province="Ontario" data-postal="K1K 0R7" data-country="Canada" data-phone="613-688-2222" data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/" data-type="High-rise-apartment" data-latitude="45.4461070" data-longitude="-75.6465360" >"""
import bs4
# Name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(data, "html.parser")
for btnMoreDetails in soup.findAll(attrs={"class": "property-item"}):
    print(btnMoreDetails["data-name"])
prints out
Brittany Apartments
If you want to get the data-name and data-path attributes, you can simply use the dictionary-like access to Tag's attributes:
for btnMoreDetails in citySoup.findAll(attrs={"class": "property-item"}):
    print(btnMoreDetails["data-name"])
    print(btnMoreDetails["data-path"])
Note that you can also use the CSS selector to match the property items:
for property_item in citySoup.select(".property-item"):
    print(property_item["data-name"])
    print(property_item["data-path"])
FYI, if you want to see all the attributes use .attrs property:
for property_item in citySoup.select(".property-item"):
    print(property_item.attrs)
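One thing to watch with the dictionary-style access: indexing a missing attribute raises KeyError. A minimal sketch, assuming a trimmed-down version of the div from the question and a hypothetical data-fax attribute that is not in the markup:

```python
from bs4 import BeautifulSoup

# Shortened copy of the div from the question; only three attributes are kept.
html = (
    '<div class="property-item" data-id="183" '
    'data-name="Brittany Apartments" '
    'data-path="/apartments-for-rent/brittany-apartments-240-brittany-drive-ottawa/">'
    "</div>"
)

soup = BeautifulSoup(html, "html.parser")
item = soup.select_one(".property-item")

# .get() works like dict.get(): it returns the value, or a default if absent.
name = item.get("data-name")
path = item.get("data-path")

# "data-fax" is a hypothetical attribute used only to show the fallback.
fax = item.get("data-fax", "n/a")

print(name, path, fax)
```

Use item["data-name"] when the attribute must be there and you want a loud failure, and item.get(...) when some listings may lack it.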

clicking on a link with the same href value using selenium python

I have a html code that has two links but both the links have the same href value, but the onclick and the text are different.
I wasn't sure as to how to access the second link.
I tried using driver.find_element_by_link_text('text'), but I get a no such element found error.
<div id="member">
<a href="#" onclick="add_member("abc"); return false;">run abc</a>
<br>
<a href="#" onclick="add_member("def"); return false;">run def</a>
</div>
There are multiple options to get the desired link.
One option would be to use find_element_by_xpath() and check the onclick attribute value. XPath has no backslash escapes, so the inner literal is wrapped in single quotes:
link = driver.find_element_by_xpath("//div[@id='member']/a[contains(@onclick, 'add_member(\"def\")')]")
link.click()
Another one would be to simply find both links and get the desired one by index:
div = driver.find_element_by_id('member')
links = div.find_elements_by_tag_name('a')
links[1].click()
Which option to choose depends on the whole HTML content. Hope at least one of two suggested solutions solves the issue.
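If you want to compare these options offline first, here is a rough sketch using BeautifulSoup instead of Selenium. The markup is my assumption: the quoting in the posted HTML is ambiguous, so I used single quotes inside onclick. It tries matching by link text, by onclick substring, and by index.

```python
from bs4 import BeautifulSoup

# Assumed markup: single quotes inside onclick are a guess, since the
# quoting in the posted HTML is ambiguous.
html = """
<div id="member">
<a href="#" onclick="add_member('abc'); return false;">run abc</a>
<br>
<a href="#" onclick="add_member('def'); return false;">run def</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: match on the exact link text.
by_text = soup.find("a", string="run def")

# Option 2: match on a substring of the onclick attribute.
by_onclick = soup.select_one('#member a[onclick*="def"]')

# Option 3: take the second link inside the div.
by_index = soup.select("#member a")[1]

for link in (by_text, by_onclick, by_index):
    print(link["onclick"])
```

The exact-text match fails if the real link text has extra whitespace or different casing, which may be why find_element_by_link_text reported no such element. The onclick selector should carry over to Selenium unchanged when used with By.CSS_SELECTOR.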
