How to print an href/URL using XPath? - python

My code navigates to a website, and in the website there is an article which contains its own link/url/href.
I want to print this field.
My current code highlights the container that it is in, and then I try to do a for loop to get the href.
from selenium import webdriver
driver = webdriver.Chrome()
import time
url = 'https://library.ehaweb.org/eha/#!*menu=6*browseby=8*sortby=2*media=3*ce_id=2035*label=21986*ot_id=25553*marker=1283*featured=17286'
driver.get(url)
time.sleep(3)
page_source = driver.page_source
container = driver.find_element_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']")
for j in container:
    link = j.find_element_by_css_selector('a').get_attribute('href')
    print(link)

If I understand correctly, you just need to print the href attribute of the element's child (a):
link = driver.find_element_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']/a").get_attribute("href")
print(link)
This prints:
https://library.ehaweb.org/eha/2021/eha2021-virtual-congress/324511/hanny.al-samkari.pazopanib.for.severe.bleeding.and.transfusion-dependent.html?f=menu%3D6%2Abrowseby%3D8%2Asortby%3D2%2Amedia%3D3%2Ace_id%3D2035%2Alabel%3D21986%2Aot_id%3D25553%2Amarker%3D1283%2Afeatured%3D17286
If you want to use a loop, change container = driver.find_element_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']") to
container = driver.find_elements_by_xpath("//div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']")
For this particular element, the following locator would be enough:
//div[contains(@class, 'test')]/a
With the following code:
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
url = 'https://library.ehaweb.org/eha/#!*menu=6*browseby=8*sortby=2*media=3*ce_id=2035*label=21986*ot_id=25553*marker=1283*featured=17286'
driver.get(url)
driver.implicitly_wait(10)
container = driver.find_elements_by_xpath("//div[contains(@class, 'test')]")
for j in container:
    link = j.find_element_by_css_selector('a').get_attribute('href')
    print(link)
driver.close()
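
One caveat with contains(@class, 'test'): it is a substring match, so it would also match a class such as latest. A common XPath 1.0 idiom for matching a whole class token is sketched below. The helper name has_class is made up for this illustration and is not part of Selenium.

```python
def has_class(token: str) -> str:
    """Return an XPath 1.0 predicate body that matches one whole class token."""
    # Pad the normalized class list with spaces so ' test ' cannot match 'latest'.
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {token} ')"


# Build the locator for the <a> child of any <div> carrying the class 'test'.
xpath = f"//div[{has_class('test')}]/a"
print(xpath)
```

The resulting string can be passed to the same find_element(s) calls as the locators above.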

That page contains a lot of inner URLs. To get the link of the EHA 2021 virtual container, you can use the code below.
eha_2021 = driver.find_element_by_css_selector('div#listing-main a')
eha_2021_link = eha_2021.get_attribute('href')
print(eha_2021_link)
In case you want the COVID-19 Outbreak link, you may try the code below.
Code:
from selenium.webdriver.common.by import By
covid_19_element = driver.find_element(By.ID, 'menu-8')
covid_19_url = covid_19_element.get_attribute('href')
print(covid_19_url)
Suggestion:
Try to avoid XPaths like //div[@class='list-box col-md-6 col-lg-6 col-xl-4 test']. This class list looks dynamic and may change by region. Always prefer locators in the following order:
ID
Name
TagName
Class Name
Link Text
Partial Link Text
CSS selector
XPATH

This works for me in getting the href
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
Loop through the list, take each element and fetch the required attribute value you want from it (in this case href).
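
The same select-then-read-attribute pattern can be tried without a browser. The sketch below uses the standard library's ElementTree on a small made-up snippet (the markup and the example.com URLs are assumptions, not taken from the real page). ElementTree supports only a subset of XPath and needs the leading dot in the path.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, loosely modelled on the listing page in the question.
SAMPLE = """
<div id="listing-main">
  <div class="list-box test"><a href="https://example.com/article-1">First</a></div>
  <div class="list-box"><a href="https://example.com/article-2">Second</a></div>
  <div class="list-box"><a name="anchor-only">No link</a></div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# [@href] keeps only <a> elements that actually carry an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
for href in hrefs:
    print(href)
```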

Related

Get text from div using Selenium and Python

Situation
I'm using Selenium and Python to extract info from a page
Here is the div I want to extract from:
I want to extract the "Registre-se" and the "Login" text.
My code
from selenium import webdriver
url = 'https://www.bet365.com/#/AVR/B146/R^1'
driver = webdriver.Chrome()
driver.get(url.format(q=''))
elements = driver.find_elements_by_class_name('hm-MainHeaderRHSLoggedOutNarrow_Join ')
for e in elements:
    print(e.text)
elements = driver.find_elements_by_class_name('hm-MainHeaderRHSLoggedOutNarrow_Login ')
for e in elements:
    print(e.text)
Problem
My code doesn't produce any output.
HTML
<div class="hm-MainHeaderRHSLoggedOutNarrow_Join ">Registre-se</div>
<div class="hm-MainHeaderRHSLoggedOutNarrow_Login " style="">Login</div>
By looking this HTML
<div class="hm-MainHeaderRHSLoggedOutNarrow_Join ">Registre-se</div>
<div class="hm-MainHeaderRHSLoggedOutNarrow_Login " style="">Login</div>
and your code, it looks okay to me, except that you are using find_elements for a single web element.
And from reading this comment:
The class name "hm-MainHeaderRHSLoggedOutMed_Login " only appear in
the inspect of the website, but not in the page source. What it's
supposed to do now?
it is clear that the element is inside either an iframe or a shadow root, because page_source does not include the content of iframes.
Please check whether it is in an iframe. If it is, you have to switch to the iframe first, and then you can use the code that you have.
Switch to it like this:
driver.switch_to.frame(driver.find_element_by_xpath('xpath here'))
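
If it is not known in advance which iframe holds the element, one option is to try the top-level document first and then each top-level iframe in turn. The function below is only a sketch: find_in_frames is a made-up name, it does not descend into nested iframes or shadow roots, and it passes the locator strategy as a plain string ("tag name" is the value of By.TAG_NAME in Selenium 4) so that the block has no imports.

```python
def find_in_frames(driver, by, value):
    """Return matching elements from the top document or the first iframe that has any.

    When a match is found inside an iframe, the driver is left switched into
    that iframe so the returned elements stay usable.
    """
    driver.switch_to.default_content()
    found = driver.find_elements(by, value)
    if found:
        return found

    # "tag name" is the string behind By.TAG_NAME.
    for frame in driver.find_elements("tag name", "iframe"):
        driver.switch_to.frame(frame)
        found = driver.find_elements(by, value)
        if found:
            return found
        # Nothing here: go back to the top document before trying the next frame.
        driver.switch_to.default_content()
    return []
```

With a real driver it would be called as find_in_frames(driver, "class name", "hm-MainHeaderRHSLoggedOutNarrow_Login").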

Selenium - Not being able to click links in search result

I am using Selenium with Python to search for a keyword, then open the top 5 URLs in the search results, collect the text of the p tags on each page, and go back, so that I end up storing the data from these 5 sites. However, after searching for the keyword I am not able to open the URLs and get the data, and I don't know what's wrong. This is the code I have written. Please help.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome(executable_path=r"E:\chromedriver\chromedriver.exe")
driver.get("https://www.google.com/")
print(driver.title)
driver.maximize_window()
time.sleep(2)
driver.find_element(By.XPATH, "//input[@name='q']").send_keys('selenium')
driver.find_element(By.XPATH, "//div[@class='FPdoLc tfB0Bf']//input[@name='btnK']").send_keys(Keys.ENTER)
a = driver.find_elements_by_xpath("//div[@class='g']/a[@href]")
links = []
for x in a:
    links.append(x.get_attribute('href'))
link_data = []
for new_url in links:
    print('new url : ', new_url)
    driver.get(new_url)
    link_data.append(driver.page_source)
    b = driver.find_elements(By.TAG_NAME, "p")
    for data in b:
        print(data.text)
    driver.back()
driver.close()
EDIT:
While navigating through the links, it also includes links from the "People also ask" box. I don't want to navigate through this box. How can I do that?
If you want the 16 or so links, use:
driver.get("https://www.google.com/")
print(driver.title)
driver.maximize_window()
time.sleep(2)
driver.find_element(By.XPATH, "//input[@name='q']").send_keys('selenium')
driver.find_element(By.XPATH, "//input[@name='btnK']").send_keys(Keys.ENTER)
a = driver.find_elements_by_xpath("//div[@class='g']/div/div/a")
links = []
for x in a:
    links.append(x.get_attribute('href'))
link_data = []
for new_url in links:
    print('new url : ', new_url)
    driver.get(new_url)
    link_data.append(driver.page_source)
    b = driver.find_elements(By.TAG_NAME, "p")
    for data in b:
        print(data.text)
    driver.back()
You have the wrong XPath for the links. It should be:
"//div[@class='yuRUbf']/a[@href]"
If you look at the relevant part of the page source, you'll see the <a> tag is not a child of <div class="g">, but of <div class="yuRUbf">:
<div class="g"><!--m-->
<div class="tF2Cxc" data-hveid="CAkQAA" data-ved="2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFSgAMAp6BAgJEAA">
<div class="yuRUbf"><a href="https://www.healthline.com/nutrition/selenium-benefits"
data-ved="2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFjAKegQICRAC"
ping="/url?sa=t&source=web&rct=j&url=https://www.healthline.com/nutrition/selenium-benefits&ved=2ahUKEwjphfjOoazuAhUO1VkKHVSkA_oQFjAKegQICRAC"><br>
<h3 class="LC20lb DKV0Md"><span>7 Science-Based Health Benefits of Selenium - Healthline</span></h3>
<div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">www.healthline.com<span
class="dyjrff qzEoUe"><span> › nutrition › selenium-benefits</span></span></cite></div>
</a>
...
</div>
</div>
</div>
You can also simplify your search lines a bit, but it doesn't change the overall effect:
driver.find_element_by_xpath("//input[@name='q']").send_keys('selenium', Keys.ENTER)
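
The child-versus-descendant distinction in this answer can be checked offline. The sketch below runs the three path variants against a trimmed copy of the markup above using the standard library's ElementTree (which needs the leading dot and supports only a subset of XPath). The surrounding <body> wrapper and the removal of most attributes are simplifications made for the example.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the search-result markup quoted in the answer.
RESULT = """
<body>
  <div class="g">
    <div class="tF2Cxc">
      <div class="yuRUbf">
        <a href="https://www.healthline.com/nutrition/selenium-benefits"><h3>7 Science-Based Health Benefits of Selenium - Healthline</h3></a>
      </div>
    </div>
  </div>
</body>
""".strip()

page = ET.fromstring(RESULT)


def hrefs(path):
    """Return the href of every element matched by an ElementTree path."""
    return [a.get("href") for a in page.findall(path)]


# '/' selects direct children only, so this finds nothing under div.g.
child_of_g = hrefs(".//div[@class='g']/a[@href]")
# The <a> is a direct child of div.yuRUbf, so this matches.
child_of_wrapper = hrefs(".//div[@class='yuRUbf']/a[@href]")
# '//' selects descendants at any depth, so this also matches.
descendant_of_g = hrefs(".//div[@class='g']//a[@href]")

print(child_of_g, child_of_wrapper, descendant_of_g)
```

Note that the descendant form would presumably also pick up any other links nested inside the same result block, which is why the answer targets the inner wrapper instead.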

How to find and click an image link using the image's src (Selenium, Python)

I would like to click an image link and I need to be able to find it by its src, however it's still not working for some reason. Is this even possible? This is what I'm trying:
#Find item
item = WebDriverWait(driver, 100000).until(EC.presence_of_element_located((By.XPATH, "//img[@src=link]")))
#item = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//img[@alt='Bzs36xl9 xa']")))
item.click()
link = //assets.supremenewyork.com/170065/vi/BZS36xl9-xA.jpg in the above code. This matches the HTML from below.
The second locator works (finding image using alt), but I will only have the image source when the program actually runs.
HTML for the webpage:
<article>
<div class="inner-article">
<a style="height:81px;" href="/shop/accessories/h68lyxo2h/llhxzvydj">
<img width="81" height="81" src="//assets.supremenewyork.com/170065/vi/BZS36xl9-xA.jpg" alt="Bzs36xl9 xa">
</a>
</div>
</article>
I don't see why finding by alt would work and not src, is this possible? I saw another similar question which is where I got my solution but it didn't work for me. Thanks in advance.
EDIT
To find the link I have to parse through a website in JSON format, here's the code:
#Loads Supreme JSON website into an object
url = urllib2.urlopen('https://www.supremenewyork.com/mobile_stock.json')
obj = json.load(url)
items = obj["products_and_categories"]["Accessories"]
itm_name = "Sock"
index = 0
for i in items:
    if itm_name in items[index]["name"]:
        found_url = i["image_url"]
        break
    index += 1
str_link = str(found_url)
link = str_link.replace("ca","vi")
Use WebDriverWait and element_to_be_clickable. Try the following XPath. Hope this works.
link ='//assets.supremenewyork.com/170065/vi/BZS36xl9-xA.jpg'
item = WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='inner-article']/a/img[@src='{}']".format(link))))
print(item.get_attribute('src'))
item.click()
item = WebDriverWait(driver, 100000).until(EC.presence_of_element_located((By.XPATH, "//img[@src=link]")))
Here's your problem; I can't believe it didn't jump out at me. You're asking the driver to match src against the literal text link inside the XPath string, NOT the value of the variable link that you defined earlier. I don't know how to pass variables into XPaths, but I do know that you can use String.Format to build the correct XPath string just before calling it.
I also don't speak Python, so here's some pseudo Java/C# to help you get the picture:
String xPathString = String.Format("//img[@src='{0}']", link);
item = WebDriverWait(driver, 100000).until(EC.presence_of_element_located((By.XPATH, xPathString)))
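
In Python the same idea can be written with an f-string or str.format. The sketch below builds the XPath from the link variable and checks it against the question's HTML with the standard library's ElementTree instead of a live browser. Two adjustments are assumptions of the example: the <img> tag is self-closed so the snippet parses as XML, and ElementTree needs a leading dot on the path, which Selenium does not.

```python
import xml.etree.ElementTree as ET

# HTML from the question, with <img> self-closed so it is well-formed XML.
ARTICLE = """
<article>
  <div class="inner-article">
    <a style="height:81px;" href="/shop/accessories/h68lyxo2h/llhxzvydj">
      <img width="81" height="81" src="//assets.supremenewyork.com/170065/vi/BZS36xl9-xA.jpg" alt="Bzs36xl9 xa" />
    </a>
  </div>
</article>
""".strip()

link = "//assets.supremenewyork.com/170065/vi/BZS36xl9-xA.jpg"

# Interpolate the variable's value and quote it inside the XPath.
xpath = f"//img[@src='{link}']"
print(xpath)

# Offline check: prefix '.' because ElementTree rejects absolute paths on elements.
img = ET.fromstring(ARTICLE).find("." + xpath)
print(img.get("alt"))
```

The string in xpath is what would be passed to By.XPATH in the wait call. A link value that itself contains a single quote would need different quoting.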

Making xpath more selective? [Web scraping]

I am trying to print off some housing prices and am having trouble using Xpath. Here's my code:
from selenium import webdriver
driver = webdriver.Chrome("my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/?pgsz=10")
for house_number in range(1,11):
    try:
        price = driver.find_element_by_xpath("""//*[@id="{}"]/div[2]/div[1]""".format(house_number))
        print(price.text)
    except:
        print('couldnt find')
I am on this website, trying to print off the housing prices of the first ten houses.
My output is that for all the houses that say "NEW", that gets taken as the price instead of the actual price. But for the bottom two, which don't have that NEW sticker, the actual price is recorded.
How do I make my Xpath selector so it selects the numbers and not NEW?
You can write it like this, without loading images, which can speed up fetching:
from selenium import webdriver
# Disable image loading
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_opt.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chrome_options=chrome_opt,executable_path="my/path/here")
driver.get("https://www.realtor.com/realestateandhomes-search/Bladen-County_NC/sby-6/pg-1?pgsz=10")
for house_number in range(1,11):
    try:
        price = driver.find_element_by_xpath('//*[@id="{}"]/div[2]/div[@class="srp-item-price"]'.format(house_number))
        print(price.text)
    except:
        print('couldnt find')
You're on the right track; you've just made an XPath that is too brittle. I would try making it a little more verbose, without relying on indices and wildcards.
Here's your XPath (I used id="1" for example purposes):
//*[@id="1"]/div[2]/div[1]
And here's the HTML (some attributes/elements removed for brevity):
<li id="1">
<div></div>
<div class="srp-item-body">
<div>New</div><!-- this is optional! -->
<div class="srp-item-price">$100,000</div>
</div>
</li>
First, replace the * wildcard with the element that you are expecting to contain the id="1". This simply serves as a way to help "self-document" the XPath a little bit better:
//li[@id="1"]/div[2]/div[1]
Next, you want to target the second <div>, but instead of searching by index, try to use the element's attributes if applicable, such as class:
//li[@id="1"]/div[@class="srp-item-body"]/div[1]
Lastly, you want to target the <div> with the price. Since the "New" text was in its own <div>, your XPath was targeting the first <div> ("New"), not the <div> with the price. Your XPath did, however, work when the "New" <div> did not exist.
We can use a similar method as the previous step, targeting by attribute. This forces the XPath to always target the <div> with the price:
//li[@id="1"]/div[@class="srp-item-body"]/div[@class="srp-item-price"]
Hope this helps!
And so... having said all of that, if you are just interested in the prices and nothing else, this would probably also work :)
for price in driver.find_elements_by_class_name('srp-item-price'):
    print(price.text)
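
To see the difference between the positional and the attribute-based XPath without a browser, the sketch below runs both against the simplified HTML from this answer using the standard library's ElementTree. The second listing (id="2", with no "New" badge) and its price are made up for the comparison, and the leading dot is an ElementTree requirement.

```python
import xml.etree.ElementTree as ET

# Listing 1 mirrors the answer's HTML; listing 2 is a hypothetical one without the badge.
LISTINGS = """
<ul>
  <li id="1">
    <div></div>
    <div class="srp-item-body">
      <div>New</div>
      <div class="srp-item-price">$100,000</div>
    </div>
  </li>
  <li id="2">
    <div></div>
    <div class="srp-item-body">
      <div class="srp-item-price">$86,500</div>
    </div>
  </li>
</ul>
""".strip()

root = ET.fromstring(LISTINGS)


def first_text(path):
    """Return the text of the first match, or None when nothing matches."""
    node = root.find(path)
    return node.text if node is not None else None


results = {}
for house in ("1", "2"):
    # Brittle: depends on the price being the first <div> in the body.
    positional = first_text(f".//li[@id='{house}']/div[2]/div[1]")
    # Robust: selects the price <div> by its class, wherever it sits.
    by_class = first_text(
        f".//li[@id='{house}']/div[@class='srp-item-body']/div[@class='srp-item-price']"
    )
    results[house] = (positional, by_class)
    print(house, positional, by_class)
```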
Can you try this code:
from selenium import webdriver
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.realtor.com/realestateandhomes-search/Bladen-County_NC/sby-6/pg-1?pgsz=10")
prices = driver.find_elements_by_xpath('//*[@class="data-price-display"]')
for price in prices:
    print(price.text)
It will print
$39,900
$86,500
$39,500
$40,000
$179,000
$31,000
$104,900
$94,900
$54,900
$19,900
Do let me know if any other details are also required

Print dom when selenium misses an element I see

How can I find out what Selenium sees in the DOM when it misses an image that I can see on screen?
Context: I have a Selenium python test
browser.wait_to_find_visible_element(By.ID, 'image')
that sometimes can't find an image that I see on the browser selenium launched for the test:
<div id="container">
<img id='image' src=''/>
</div>
To find out what Selenium sees instead, I get the enclosing div:
element = browser.find_displayed_elements(By.CSS_SELECTOR, '#container')
print element
which prints:
selenium.webdriver.remote.webelement.WebElement object at 0x9b3876c
and try to get the dom:
dom = browser.driver.execute_script('return arguments[0].parentNode', element)
print dom
which prints
None
What am I missing?
Have you tried this?
element = browser.find_displayed_elements(By.CSS_SELECTOR, '#container')
source_code = element.get_attribute("innerHTML")
# or
source_code = element.get_attribute("outerHTML")
