I want to download CSV files from a website, which is why I use the click() command from Selenium.
Elements have the following structure
code
csvList = browser.find_elements_by_class_name("csv")
for l in csvList:
    if 'error' not in l.text and 'after' not in l.text:
        # get link here
        l.click()
Question
My question is: how can I get the download link from the element before I download it? This is the link pointed to by the black arrow in the picture.
When I use l.get_attribute('href') it gives me None.
For each element l in csvList, get the parent element by xpath and then get that element's href:
csvList = browser.find_elements_by_class_name("csv")
for l in csvList:
    if 'error' not in l.text and 'after' not in l.text:
        # ".." selects the parent of the current element
        currentLink = l.find_element_by_xpath("..")
        href = currentLink.get_attribute("href")
Note: If you do a .click() in this loop and the link takes you to a new page, you will get a StaleElementReferenceException for each click after the first. In that case, extract each href and save it to a collection. Then navigate to each href (URL) in the collection.
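The collect-first approach from the note can be sketched roughly as follows. This is only an illustration: collect_parent_hrefs is a made-up helper name, the 'error'/'after' filter is copied from the question, and the string "xpath" is used because it is the value of Selenium's By.XPATH constant, passed to the generic find_element(by, value) form.

```python
def collect_parent_hrefs(elements, skip_words=("error", "after")):
    """Return the href of each element's parent, skipping unwanted rows."""
    hrefs = []
    for el in elements:
        text = el.text
        if any(word in text for word in skip_words):
            continue
        # "xpath" is the string value of By.XPATH; ".." selects the parent node
        parent = el.find_element("xpath", "..")
        href = parent.get_attribute("href")
        if href:  # the parent may not be a link at all
            hrefs.append(href)
    return hrefs


# Hypothetical usage with a live session (browser is assumed to exist):
#   csv_list = browser.find_elements("class name", "csv")
#   for url in collect_parent_hrefs(csv_list):
#       browser.get(url)
```

Because only plain strings are kept, nothing in the list can go stale when the browser navigates away from the page.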
The div does not have an href attribute; its parent, the "a" tag, does. I would use XPath to select the anchor that contains the div:
By.XPath("//a[div[@class='csv']]")
Related
I am trying to extract all the href links in an anchor tag using selenium for my web scraping project in python.
The listing spans multiple pages, and I am trying to collect the href values for a single page.
Below is the code:
url = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&pn=1"
driver.get(url)
links = driver.find_elements_by_xpath('//*[@href]')
for l in links:
    print(l.get_attribute('href'))
On running my code the same href element gets printed multiple times.
Snippet of Output of the code:
https://www.carwale.com/used/cars-in-chennai/ford-figo-2010-2012-d2115418/?slot=4&rk=1&isP=true
https://www.carwale.com/used/cars-in-chennai/ford-figo-2010-2012-d2115418/?slot=4&rk=1&isP=true
https://www.carwale.com/used/cars-in-chennai/ford-figo-2010-2012-d2115418/?slot=4&rk=1&isP=true
How do I get it to print only once?
Do something like this, keeping track of the href strings you have already printed:
url = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&pn=1"
driver.get(url)
processed = []
links = driver.find_elements_by_xpath('//*[@href]')
for link in links:
    href = link.get_attribute('href')
    # compare the href strings, not the WebElement objects
    if href not in processed:
        print(href)
        processed.append(href)
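A more compact way to de-duplicate is to key on the href string itself, since several different elements on a page can carry the same URL. A minimal sketch, assuming a made-up helper name unique_hrefs and any iterable of elements that support get_attribute:

```python
def unique_hrefs(elements):
    """Return each distinct href once, in first-seen order."""
    hrefs = (el.get_attribute("href") for el in elements)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(h for h in hrefs if h))


# Hypothetical usage (driver is assumed to exist):
#   for href in unique_hrefs(driver.find_elements("xpath", "//*[@href]")):
#       print(href)
```

Using a dict instead of a list membership test also avoids rescanning the whole list for every link.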
I have the xpath of an element on a website but I'm trying to get the aria-label value of that element.
# NO SUCCESS: print(WebDriverWait(browser, 20).until(EC.visibility_of_element_located((By.XPATH, "element_xpath_you_found"))).get_attribute("aria-label"))
# NO SUCCESS: first_rev = browser.find_element(By.xpath, "/html/body/span/g-lightbox/div[2]/div[3]/span/div/div/div/div[2]/div[1]/div/div[2]/div[1]/div[1]/div[3]/div[1]/g-review-stars/span")
first_rev = browser.find_element_by_xpath("/html/body/span/g-lightbox/div[2]/div[3]/span/div/div/div/div[2]/div[1]/div/div[2]/div[1]/div[1]/div[3]/div[1]/g-review-stars/span").click()
aria_label = first_rev.find_element_by_css_selector('span').get_attribute("aria-label")
print(aria_label)
On the browser, I inspect the element and get this html:
<span class="fTKmHE99XE4__star fTKmHE99XE4__star-s" aria-label="Rated 3.0 out of 5," style=""><span style="width:42px"></span></span>
However, can the problem be that this element is inside a pop-up on the page? Page source doesn't show any html for any element in the pop-up.
click() has no return value, so it returns None, which makes first_rev None. Split the lookup and the click into two statements. You also don't need first_rev.find_element_by_css_selector('span'): that call looks for a child span, but the aria-label is on first_rev itself.
first_rev = browser.find_element_by_xpath("/html/body/span/g-lightbox/div[2]/div[3]/span/div/div/div/div[2]/div[1]/div/div[2]/div[1]/div[1]/div[3]/div[1]/g-review-stars/span")
first_rev.click()
aria_label = first_rev.get_attribute("aria-label")
print(aria_label)
I have been searching across the site in the hope of finding an answer, however, every question I view doesn't have heavily nested HTML code like the page I am trying to scrape.
I am really hoping someone will spot my obvious error. I have the following code, which pulls the category headers but, annoyingly, not the href that goes with each one. When run, the code currently returns 'None' for all the hrefs and I cannot work out why. I think it may be because I am targeting the wrong element, tag or class in the HTML, but I cannot identify which one it should be.
from selenium import webdriver
import time
# The website to scrape
url = "https://www.jtinsight.com/JTIRA/JTIRA.aspx#!/full-category-list"
# Creating the WebDriver object using the ChromeDriver
driver = webdriver.Chrome()
# Directing the driver to the defined url
driver.get(url)
# driver.implicitly_wait(5)
time.sleep(1)
# Locate the categories
categories = driver.find_elements_by_xpath('//div[@class="subCatEntry ng-scope"]')
# Print out all categories on current page
num_page_items = len(categories)
print(num_page_items)
for headers in range(num_page_items):
    print(categories[headers].text)
for elem in categories:
    print(elem.get_attribute("a.divLink[href='*']"))
# Clean up (close browser once task is completed)
time.sleep(1)
driver.close()
I would really appreciate if anyone can point out my error.
Try the code below.
for elem in categories:
    print(elem.find_element_by_css_selector("a.divLink").get_attribute('href'))
You are passing a CSS selector to the get_attribute method. That won't work: get_attribute takes only the attribute name. If the web element elem has an attribute named href, it returns the value of that attribute; otherwise it returns None.
First, get the anchor <a> element. All the subcategory anchors have class divLink. For getting anchor elements try this,
categories = driver.find_elements_by_class_name('divLink')
Second, print the attribute value by passing the attribute name to get_attribute. Try this,
print(elem.get_attribute("href"))
This way you'll be able to print all the href values.
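Putting both answers together, a rough sketch that pairs each category's text with the href of its nested anchor might look like this. The helper name category_links is invented for illustration, the a.divLink selector comes from the answers above, and "css selector" is the string value of Selenium's By.CSS_SELECTOR.

```python
def category_links(categories, link_selector="a.divLink"):
    """Pair each category's visible text with the href of its nested anchor."""
    pairs = []
    for cat in categories:
        # find_elements returns an empty list instead of raising when nothing matches
        anchors = cat.find_elements("css selector", link_selector)
        href = anchors[0].get_attribute("href") if anchors else None
        pairs.append((cat.text, href))
    return pairs


# Hypothetical usage (driver and categories as in the question):
#   for text, href in category_links(categories):
#       print(text, href)
```

Looking the anchor up with find_elements rather than find_element means a category without a link yields None instead of aborting the loop with an exception.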
New to Python and Selenium WebDriver. I am trying to check all the links on my own webpage and use each link's HTTP status code to see whether it is broken. The code that I am running (reduced from original)...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
links = driver.find_elements_by_xpath("//a[@href]")
while len(links):
    url = links.pop()
    url = url.get_attribute("href")
    print(url)
The html looks like...
<ul>
<li>visit google</li>
<li>broken link ex</li>
</ul>
When I run my script, the only link that gets printed is the google link and not the broken link. I have done some test cases and it seems that only the links that include the phrase "http://www" in the link get printed. Although I can change the href links on my webpage to include this phrase, I have specific reasons as to why they cannot be included.
If I can just get all the links (with or without the "http://www" phrase) using driver.find_elements_by_xpath("//a[@href]"), then I can convert these later in the script to include the phrase and then get the http status codes.
I saw other posts but none that helped me get over this obstacle. Any clarification/workaround/hint would be appreciated.
The following list comprehension should get you a list of all links. It locates all anchor tags and builds a list containing the 'href' attribute of each element.
links = [elem.get_attribute("href") for elem in driver.find_elements_by_tag_name('a')]
Here is the same thing broken down into small steps and wrapped in a function:
def get_all_links(driver):
    links = []
    elements = driver.find_elements_by_tag_name('a')
    for elem in elements:
        href = elem.get_attribute("href")
        links.append(href)
    return links
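For the follow-up step the question describes (turning the collected hrefs into full URLs and checking their status codes), something along these lines could work. It is only a sketch: normalize_links and status_of are made-up names, the base URL is a placeholder, and the standard library's urllib stands in for the requests package imported in the question.

```python
from urllib.error import HTTPError
from urllib.parse import urljoin
from urllib.request import Request, urlopen


def normalize_links(hrefs, base_url):
    """Drop empty hrefs, resolve relative ones against base_url, de-duplicate."""
    resolved = (urljoin(base_url, h) for h in hrefs if h)
    return list(dict.fromkeys(resolved))


def status_of(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:  # 4xx/5xx responses still carry a status code
        return err.code
    except OSError:  # URLError, timeouts, refused connections
        return None


# Hypothetical usage (get_all_links as defined above, placeholder base URL):
#   for url in normalize_links(get_all_links(driver), "http://example.com/"):
#       print(url, status_of(url))
```

Note that anchors without an href attribute give None from get_attribute, which is why the empty values are filtered out before resolving.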
I am trying to get the src (URL) of the main image from the xkcd.com website. I am using the following code, but it returns something like session="2f69dd2e-b377-4d1f-9779-16dad1965b81", element="{ca4e825a-88d4-48d3-a564-783f9f976c6b}"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://xkcd.com')
assert 'xkcd' in browser.title
idlink= browser.find_element_by_id("comic")
#link = idlink.get_attribute("src") ## print link prints null
print(idlink)
Using the xpath method also returns the same as above.
browser.find_element_by_id returns a web element, and that is what you print.
In addition, the URL you want is on a child element of idlink. Try
idlink = browser.find_element_by_css_selector("#comic > img")
print(idlink.get_attribute("src"))
idlink is now the web element for the img tag whose parent has the ID comic.
The URL is in src, so that is the attribute we want.
Building off the answer here
You need to:
Select the img tag (you're currently selecting the div)
Get the contents of the source attribute of the img tag
img_tag = browser.find_element_by_xpath("//div[@id='comic']/img")
print(img_tag.get_attribute("src"))
The above should print the URL of the image.
More techniques for locating elements using selenium's python bindings are available here
For more on using XPath with Selenium, see this tutorial