Efficient download of images from website with Python and selenium - python

Disclaimer: I do not have any background in web scraping/HTML/JavaScript/CSS and the like, but I know a bit of Python.
My end goal is to download the 4th image view of each of the 3515 cars on the ShapeNet website, together with the associated tag.
For instance, the first of the 3515 pairs would be the image found in the collapse menu on the right of this picture (it can be loaded by clicking on the first item of the first page and then on Images), with the associated tag "sport utility", as can be seen in the first picture (first car, top left).
To do that, with the help of @DebanjanB, I wrote a snippet of code that clicks on the sport utility car in the first picture, opens the iframe, clicks on Images, and then downloads the 4th picture (link to my question). The full working code is this one:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()
browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
#Page is also long to display iframe
element = wait.until(EC.element_to_be_clickable((By.ID, "model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")))
linkElem = browser.find_element_by_id("model_3dw_bcf0b18a19bce6d91ad107790a9e2d51")
linkElem.click()
#iframe slow to be displayed
wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
#iframe = browser.find_elements_by_id('viewerIframe')
#browser.switch_to_frame(iframe[0])
element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
time.sleep(10)
linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
linkElem.click()
img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.bcf0b18a19bce6d91ad107790a9e2d51.3']/img[@class='enlarge']")
src = img.get_attribute('src')
os.system("wget %s --no-check-certificate"%src)
There are several issues with this. First, I need to know by hand the id model_3dw_bcf0b18a19bce6d91ad107790a9e2d51 for each model, and I also need to extract the tag; they both can be found at:
. So I need to extract them by inspecting every displayed image. Then I need to switch pages (there are 22 of them), and maybe even scroll down on each page to be sure I have everything. Secondly, I had to use time.sleep twice, because the other method, based on waiting for the element to be clickable, does not seem to work as intended.
I have two questions. The first one is obvious: is this the right way of proceeding? Even without the time.sleep calls this feels very much like what a human would do, and therefore seems terribly inefficient. Secondly, if it is indeed the way to go: how could I write a double for loop over pages and items to extract the tag and model id efficiently?
EDIT 1: It seems that:
l=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")
might be the first step towards completion
EDIT 2: Almost there, but the code is filled with time.sleep calls. I still need to get the tag name and to loop through the pages.
EDIT 3: Got the tag name; I still need to loop through the pages, and I will post a first draft of a solution.

So let me try to understand correctly what you mean, and then see if I can help you solve the problem. I do not know Python, so excuse my syntax errors.
You want to click on each and every one of the 183533 cars, and then download the 4th image within the iframe that pops up. Correct?
Now if this is the case, lets look at the first element you need, elements on the page with all the cars on it.
So to get all 160 cars of page 1, you are going to need:
elements = browser.find_elements_by_xpath("//img[@class='resultImg lazy']")
This is going to return 160 image elements for you, which is exactly the number of images displayed (on page 1).
Then you can say:
for el in elements:
    # here you place the code you need to download the 4th image,
    # e.g. switch to the iframe, click on the 4th image, etc.
Now, for the first page, you have a loop which downloads the 4th image for every vehicle on it.
This doesn't entirely solve your problem, as you have multiple pages. Thankfully, the page navigation links, previous and next, are greyed out on the first and last page respectively.
So you can just say:
browser.find_element_by_xpath("//a[@class='next']").click()
Just make sure you catch the case where the element is not clickable, as it will be greyed out on the last page.
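A sketch of that page loop with the last-page catch described above (`for_each_page` is an illustrative helper, and the broad `except` stands in for Selenium's specific exception, e.g. ElementNotInteractableException):

```python
def for_each_page(browser, handle_page):
    # Process the current page, then click "next"; stop once the link
    # can no longer be found or clicked (it is greyed out on the last page).
    while True:
        handle_page(browser)
        try:
            browser.find_element_by_xpath("//a[@class='next']").click()
        except Exception:  # in a real script, catch Selenium's specific exception
            break
```

You would pass in a function that downloads the 4th image for every car on the current page.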

Rather than scraping the site, you might consider examining the URLs that the webpage uses to query the data, then use the Python requests package to make API requests directly against the server. I'm not a registered user on the site, so I can't provide you with any examples, but the paper that describes the shapenet.org site specifically mentions:
"To provide convenient access to all of the model and an-
notation data contained within ShapeNet, we construct an
index over all the 3D models and their associated annota-
tions using the Apache Solr framework. Each stored an-
notation for a given 3D model is contained within the index
as a separate attribute that can be easily queried and filtered
through a simple web-based UI. In addition, to make the
dataset conveniently accessible to researchers, we provide a
batched download capability."
This suggests that it might be easier to do what you want via API, as long as you can learn what their query language provides. A search in their QA/Forum may be productive too.
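To make the idea concrete, Solr's standard select handler takes q/rows/start/wt parameters, with paging done by advancing `start`. The endpoint path and field name below are guesses, not a documented ShapeNet API; check the site's own network requests for the real ones:

```python
from urllib.parse import urlencode

def build_solr_query(base, q, rows=100, start=0):
    # Build a Solr select URL; page through results by advancing `start`.
    params = {"q": q, "rows": rows, "start": start, "wt": "json"}
    return base + "?" + urlencode(params)

# Hypothetical usage (endpoint and field name are assumptions):
# url = build_solr_query("https://www.shapenet.org/solr/models3d/select",
#                        "wnsynset:02958343")
# data = requests.get(url).json()
```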

I came up with this answer, which kind of works, but I don't know how to remove the several calls to time.sleep. I will not accept my own answer until someone finds something more elegant (also, it fails when it arrives at the end of the last page):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time
import os
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.socks", "yourproxy")
profile.set_preference("network.proxy.socks_port", yourport)
#browser = webdriver.Firefox(firefox_profile=profile)
browser = webdriver.Firefox()
browser.get('https://www.shapenet.org/taxonomy-viewer')
#Page is long to load
wait = WebDriverWait(browser, 30)
element = wait.until(EC.element_to_be_clickable((By.XPATH, "//*[@id='02958343_anchor']")))
linkElem = browser.find_element_by_xpath("//*[@id='02958343_anchor']")
linkElem.click()
tag_names=[]
page_count=0
while True:
    if page_count>0:
        browser.find_element_by_xpath("//a[@class='next']").click()
        time.sleep(2)
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[starts-with(@id,'model_3dw')]")))
    list_of_items_on_page=browser.find_elements_by_xpath("//div[starts-with(@id,'model_3dw')]")
    list_of_ids=[e.get_attribute("id") for e in list_of_items_on_page]
    for i,item in enumerate(list_of_items_on_page):
        #Page is also long to display iframe
        current_id=list_of_ids[i]
        element = wait.until(EC.element_to_be_clickable((By.ID, current_id)))
        car_image=browser.find_element_by_id(current_id)
        original_tag_name=car_image.find_element_by_xpath("./div[@style='text-align: center']").get_attribute("innerHTML")
        count=0
        tag_name=original_tag_name
        while tag_name in tag_names:
            tag_name=original_tag_name+"_"+str(count)
            count+=1
        tag_names.append(tag_name)
        car_image.click()
        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, 'viewerIframe')))
        element = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[3]/div[3]/h4")))
        time.sleep(10)
        linkElem = browser.find_element_by_xpath("/html/body/div[3]/div[3]/h4")
        linkElem.click()
        img = browser.find_element_by_xpath("/html/body/div[3]/div[3]//div[@class='searchResult' and @id='image.3dw.%s.3']/img[@class='enlarge']"%current_id.split("_")[2])
        src = img.get_attribute('src')
        os.system("wget %s --no-check-certificate -O %s.png"%(src,tag_name))
        browser.switch_to.default_content()
        browser.find_element_by_css_selector(".btn-danger").click()
        time.sleep(1)
    page_count+=1
One can also import NoSuchElementException from selenium.common.exceptions and use a while True loop with try/except to get rid of the arbitrary time.sleep calls.
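A sketch of that idea: poll for the element until the lookup stops raising, instead of sleeping a fixed time. The helper takes the exception types as a parameter so it is not tied to Selenium; in the script above you would pass `(NoSuchElementException,)`:

```python
import time

def poll_until(find, exceptions, timeout=30, interval=0.5):
    # Call `find()` repeatedly until it stops raising one of `exceptions`,
    # instead of sleeping an arbitrary fixed amount of time.
    deadline = time.time() + timeout
    while True:
        try:
            return find()
        except exceptions:
            if time.time() >= deadline:
                raise
            time.sleep(interval)
```

For example: `img = poll_until(lambda: browser.find_element_by_xpath(xpath), (NoSuchElementException,))` after `from selenium.common.exceptions import NoSuchElementException`.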


Issue extracting specific data with Selenium

I am back with an essentially similar problem. I have now learnt that you cannot find elements inside an iframe without switching to it first, which helped a lot, but I still seem to have issues locating elements even when they are not in an iframe.
I would also like any advice regarding my script in general, or how one would go about improving it. Yes, I will change the implicitly_wait to WebDriverWait, but besides that: is it okay if a script is structured in this way, with task -> task -> task and so forth, or is it simply bad practice?
I don't really see how I would go about introducing some object-oriented programming, or what I would gain from it, other than if I wanted to customise the script in a major way, besides of course the learning aspect.
In any case, here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import accandpass as login
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import datetime
x = datetime.datetime.now()
x = x.strftime("%d")
driver = webdriver.Firefox()
driver.get("https://connect.garmin.com/modern/activities")
driver.implicitly_wait(2)
iframe = driver.find_element(By.ID, "gauth-widget-frame-gauth-widget")
driver.switch_to.frame(iframe)
element = driver.find_element("name", "username")
element.send_keys(login.username)
element = driver.find_element("name", "password")
element.send_keys(login.password)
element.send_keys(Keys.RETURN)
driver.switch_to.default_content()
driver.implicitly_wait(10)
element = driver.find_element("name", "search")
element.send_keys("Reading")
element.send_keys(Keys.RETURN)
element = driver.find_element(By.CLASS_NAME, "unit")
print(element)
So everything actually works fine so far, to my great surprise. Printing the element gives this: <selenium.webdriver.remote.webelement.WebElement (session="0ef84b2e-e0af-4b0c-b04c-94d5371356c5", element="a70f4ee1-e840-457c-a255-4b2df603efec")>, which wasn't really what I was looking for.
I am more looking for some check to see that the element with name unit has the same date as x, which is today. So basically:
Minutes read
minutes = 0
for i in element:
    if element == x:
        minutes += (element with time)
A for loop to run through all the elements and check each one for the same date; if the date matches, then add the minutes read that day to the integer minutes, for a sum of the total minutes read today, for example.
Then do the same for the activities I will do: running, hiking and meditating.
Questions:
How do I get the integer from the element, so I can check it with x?
Is the for loop -> if statement -> add time from element a good solution to the case at hand?
Is it bad practice to structure a script this way, and how would you improve it?
Thanks in advance
Part of your question sounds like you want a code review. If you do, you'll want to post your code over on https://codereview.stackexchange.com but review their question requirements carefully before posting.
I think the main issue you are asking about is that you want to compare the date on the page to the current system date, and you're getting some session and element GUIDs instead. You are printing the element object, not its contained text. You want
print(element.text)
or add an assert to compare it to the current system date in a specific format, something like...
assert element.text == datetime.datetime.today().strftime('%m %d')
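For the summing loop sketched in the question, here is a pure-Python version, assuming each element's text contains a day number, an activity name, and a minute count (the real format on the page may differ; adjust the parsing to what `element.text` actually prints):

```python
import datetime

def minutes_today(texts, today=None):
    # Sum the minute counts of entries whose day-of-month matches today.
    today = today or datetime.datetime.now().strftime("%d")
    total = 0
    for text in texts:
        day, _activity, minutes = text.split()[:3]  # e.g. "21 Reading 45"
        if day == today:
            total += int(minutes)
    return total
```

You would call it with `minutes_today([el.text for el in elements])`.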
Some quick additional feedback since you asked for some...
It sounds like you've already been informed that .implicitly_wait() is a bad practice and should be replaced with WebDriverWait for each instance where you need to wait.
If you aren't going to reuse a variable, don't declare one. In most cases you don't need to use one.
element = driver.find_element("name", "username")
element.send_keys(login.username)
can be written
driver.find_element("name", "username").send_keys(login.username)
If you are going to use variables, don't reuse the same name over and over, e.g. element. Give each variable a meaningful name so that the next person (or maybe yourself in a few weeks/months) will be able to more easily read and understand your code.
element = driver.find_element("name", "search")
element.send_keys("Reading")
element.send_keys(Keys.RETURN)
should instead be
search = driver.find_element("name", "search")
search.send_keys("Reading")
search.send_keys(Keys.RETURN)
If you are going to continue writing scripts, do some reading on the page object model. Done right, it will clean up your code significantly, make maintenance much faster and easier, and make writing new scripts much faster. Basically you create a class for each page of the site and then add methods to the class for actions you need to take on the page.
I don't have a garmin account so I can't log in but from the screenshot you posted you might have some methods like .search_activities(string), .get_search_results(), .filter_activities(string), etc. Once you've created those methods, they can be called repeatedly from the same script or many scripts.
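A minimal page-object sketch along those lines. The class and method names follow the suggestions above and are illustrative; the locators mirror the ones in the script, but the real page may need different ones:

```python
class ActivitiesPage:
    # Hypothetical page object for the activities page.
    SEARCH = ("name", "search")
    RESULTS = ("class name", "unit")

    def __init__(self, driver):
        self.driver = driver

    def search_activities(self, text):
        box = self.driver.find_element(*self.SEARCH)
        box.send_keys(text)
        box.send_keys("\ue006")  # same key code as Keys.RETURN
        return self

    def get_search_results(self):
        return self.driver.find_elements(*self.RESULTS)
```

A script then reads as intent rather than mechanics: `results = ActivitiesPage(driver).search_activities("Reading").get_search_results()`.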

How do I access specific or all text elements using xpath locators?

I am currently using Python and Selenium to scrape data, export it to a CSV and then manipulate it as needed. I am having trouble grasping how to build XPath statements to access specific text elements on a dynamically generated page.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
From the above page I would like to export the category (not part of each product, but a parent element) followed by all the text fields associated to a product card.
The following statement allows me to pull all the titles (sort of) under the "Flower" category, but from that I am unable to access all the child text elements within each product, only a weird variation of the title. The XPath approach seems ideal, as it allows me to pull this data without having to scroll the page with key presses/JavaScript.
products = driver.find_elements_by_xpath("//div[text()='Flower']/following-sibling::div/div")
for product in products:
    print("Flower", product.text)
What would I add to the above statement if I wanted to pull the full set of elements that contain text for all children within the 'consumer-product-card__InViewContainer', within each category, such as flower, pre-rolls and so on? I experimented with different approaches and different paths/nodes/predicates last night to try to access this information, building off the above code, but ultimately failed.
Also, is there a way for me to test or visualize in some way "where I am" in terms of the scope of a given XPath statement?
Thank you in advance!
I have tried some code for you; please take a look and let me know if it resolves your problem.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 60)
driver.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')
All_Heading = wait.until(
    EC.visibility_of_all_elements_located((By.XPATH, "//div[contains(@class,\"products-grid__ProductGroupTitle\")]")))
for heading in All_Heading:
    driver.execute_script("return arguments[0].scrollIntoView(true);", heading)
    print("------------- " + heading.text + " -------------")
    ChildElement = heading.find_elements_by_xpath("./../div/div")
    for child in ChildElement:
        driver.execute_script("return arguments[0].scrollIntoView(true);", child)
        print(child.text)
Hope this is what you are looking for. If it solves your query, then please mark it as the answer.

How to access text inside div tags using Selenium in Python?

I am trying to make a program in Python using Selenium which prints out the quotes from https://www.brainyquote.com/quote_of_the_day
EDIT:
I was able to access the quotes and the associated authors like so:
authors = driver.find_elements_by_css_selector("""div.col-xs-4.col-md-4 a[title="view author"]""")
for quote, author in zip(quotes, authors):
    print('Quote: ', quote.text)
    print('Author: ', author.text)
I was not able to group the topics similarly. Doing
total_topics = driver.find_elements_by_css_selector("""div.col-xs-4.col-md-4 a.qkw-btn.btn.btn-xs.oncl_list_kc""")
would make an undesired list.
Earlier I was using Beautiful Soup which did the job perfectly except the fact that the requests library was able to access only the static website. However, I wanted to be able to scroll the website continuously to keep accessing new quotes. For that purpose, I'm trying to use Selenium.
This is how I did it using Soup:
for quote_data in soup.find_all('div', class_='col-xs-4 col-md-4'):
    quote = quote_data.find('a', title='view quote').text
    print('Quote: ', quote)
However, I am unable to find the same using Selenium.
My code in Selenium for basic testing:
driver.maximize_window()
driver.get('https://www.brainyquote.com/quote_of_the_day')
elem = driver.find_element_by_tag_name("body")
elem.send_keys(Keys.PAGE_DOWN)
time.sleep(0.2)
quote = driver.find_element_by_xpath('//div[@title="view quote"]')
I also tried CSS Selectors
print(driver.find_element_by_css_selector('div.col-xs-4 col-md-4'))
The latter gave a NoSuchElementException and the former is not giving any output at all. I would love to get some tips on where I am going wrong and how I would be able to tackle this.
Thanks!
quotes = driver.find_elements_by_xpath('//a[@title="view quote"]')
First, scroll to the bottom of the page.
You might need to write some kind of loop to scroll and click on the quotes links until there are no more elements found. Here's a bit of an outline of how I would do that:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get('https://www.brainyquote.com/quote_of_the_day')
while True:
    # wait for all quote elements to appear
    quote_links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='view quote']")))
    # todo - need to check for the end condition. page has infinite scrolling
    # break
    # iterate the quote elements until we reach the end of this list
    for quote_link in quote_links:
        quote_link.click()
        driver.back()
        # now quote_links has gone stale because we are on a different page
        quote_links = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@title='view quote']")))
The above code enters a loop that searches for all of the 'view quote' links on the page. Then, we iterate the list of links and click on each one. At this point the elements in the quote_links list have gone stale, because the driver has navigated to a different page, so we re-find the elements with WebDriverWait before clicking another link.
This is just a rough outline and some extra work will need to be done to determine an end case for the infinite scrolling of the page, and you will need to write in the operations to perform on the quote pages themselves, but hopefully you see the idea here.

How do I click an element using selenium and beautifulsoup?

How do I click an element using Selenium and Beautiful Soup in Python? I got these lines of code and I find it difficult to achieve what I want. I want to click every element in every iteration. There is no pagination or next page. There are only about 10 elements, and after clicking the last element it should stop. Does anyone know what I should do? Here is my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import urllib
import urllib.request
from bs4 import BeautifulSoup
chrome_path = r"C:\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
url = 'https://www.99.co/singapore/condos-apartments/a-treasure-trove'
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
details = soup.select('.FloorPlans__container__rwH_w')  # whole container of the result
for d in details:
    picture = d.find('span', {'class': 'Tappable-inactive'}).click()  # the single element
    print(d)
driver.close()
Here is the site: https://www.99.co/singapore/condos-apartments/a-treasure-trove. I want to scrape the details and the image in every floor plans section, but it is difficult because the image only appears after you click the specific element. I can only get the details, not the image itself. Try it yourself so that you know what I mean.
EDIT:
I tried this method
for d in driver.find_elements_by_xpath('//*[@id="floorPlans"]/div/div/div/div/span'):
    d.click()
The problem is that it clicks too fast for the image to load. Also, I am using Selenium here. Is there any method for selecting an element the way Beautiful Soup does, in a format like picture = d.find('span',{'class':'Tappable-inactive'}).click()?
You cannot interact with website widgets by using BeautifulSoup; you need to work with Selenium. There are two ways to handle this problem.
The first is to get the main wrapper (class) of the 10 elements and then iterate over each child element of the main class.
Alternatively, you can get each element by XPath and increment the last number in the XPath by one in each iteration to move to the next element.
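The index-increment idea can be sketched like this. XPath positions are 1-based, so bumping the predicate moves to the next card; the path pattern is adapted from the question's edit, and the exact nesting on the live page may differ:

```python
# Hypothetical XPath pattern for the n-th floor-plan trigger.
template = '//*[@id="floorPlans"]/div/div/div/div[{i}]/span'

def nth_card_xpath(i):
    # Return the XPath for the i-th card (1-based, as XPath positions are).
    return template.format(i=i)

# e.g. driver.find_element_by_xpath(nth_card_xpath(2)).click()
```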
I printed some results to check your code:
"details" only has one item.
And "picture" is not an element (so it's not clickable).
details = soup.select('.FloorPlans__container__rwH_w')
print(details)
print(len(details))
for d in details:
    print(d)
    picture = d.find('span', {'class': 'Tappable-inactive'})
    print(picture)
For your edited version, you can check that the images are visible before you call click().
Use visibility_of_element_located to do so.
Reference: https://selenium-python.readthedocs.io/waits.html

Can't get "WebDriver" element data if not "eye-visible" in browser using Selenium and Python

I'm doing some scraping with Selenium in Python. My problem is that after I find all the WebElements, I'm unable to get their info (id, text, etc.) if the element is not actually VISIBLE in the browser window opened by Selenium.
What I mean is:
First image
Second image
As you can see from the first and second images, I have the first 4 "tables" that are "visible" to me and to the code. There are, however, 2 other tables (5 & 6, Gettho lucky dip & Sue Specs) that are not "visible" until I drag down the right scroll bar.
Here's what I get when I try to get the element info, without "seeing it" in the page:
Third image
Manually dragging the page to the bottom, and thereby making it "visible" to the human eye (and apparently also to the code?), is the only way I can get the data from the WebDriver element I need:
Fourth image
What am I missing? Why can't Selenium do it in the background? Is there a way to solve this problem without scrolling up and down the page?
PS: the page could be any kind of dog race page in http://greyhoundbet.racingpost.com/. Just click City - Time - and then FORM.
Here's part of my code:
# I call this function with the URL and it returns the driver object
def open_main_page(url):
    chrome_path = r"c:\chromedriver.exe"
    driver = webdriver.Chrome(chrome_path)
    driver.get(url)
    # Wait for page to load
    loading(driver, "//*[@id='showLandingLADB']/h4/p", 0)
    element = driver.find_element_by_xpath("//*[@id='showLandingLADB']/h4/p")
    element.click()
    # Wait for second element to load, after click
    loading(driver, "//*[@id='landingLADBStart']", 0)
    element = driver.find_element_by_xpath("//*[@id='landingLADBStart']")
    element.click()
    # Wait for main page to load.
    loading(driver, "//*[@id='whRadio']", 0)
    return driver
Now I have the browser "driver" which I can use to find the elements I want
url = "http://greyhoundbet.racingpost.com/#card/race_id=1640848&r_date=2018-09-21&tab=form"
browser = open_main_page(url)
# Find dog names
names = []
text: str
tags = browser.find_elements_by_xpath("//strong")
Now "tags" is a list of WebDriver elements, as in the figures.
I'm pretty new to this area.
UPDATE:
I've solved the problem with a code workaround.
tags = driver.find_elements_by_tag_name("strong")
for tag in tags:
    driver.execute_script("arguments[0].scrollIntoView();", tag)
    print(tag.text)
In this manner the browser will move to the element position and it will be able to get its information.
However, I still have no idea why, with this page in particular, I'm not able to read elements that are not visible in the browser area until I scroll to them and literally see them.
