Web scraping through multiple sites with Selenium - Python

I'm using Selenium in a project that consists of opening a range of websites that share pretty much the same structure, collecting data from each site and storing it.
The problem I ran into is that some of the sites I want to access are unavailable, and when the program gets to one of those it just stops.
What I want it to do is skip those and carry on with the next iterations, but so far my attempts have been fruitless... In my latest try I used the method is_displayed(), but apparently it only tells me whether an element is visible, not whether it is present at all.
if driver.find_element_by_xpath('//*[@id="main-2"]/div[2]/div[1]/div[1]/div/div[1]/strong').is_displayed():
The example above doesn't work, because the driver needs to find the element before it can tell me whether it is visible, and here the element is simply not there.
Have any of you dealt with something similar?
How one of the sites looks normally
How it looks when it is unavailable

You can use Selenium's expected conditions to wait for the element's presence.
I'm just giving an example below.
I have set the timeout to 5 seconds here, but you can use any timeout value.
Also, that element locator looks brittle.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element_xpath_locator = '//*[@id="main-2"]/div[2]/div[1]/div[1]/div/div[1]/strong'
# Wait up to 5 seconds for the element to be present in the DOM
wait = WebDriverWait(driver, 5)
wait.until(EC.presence_of_element_located((By.XPATH, element_xpath_locator)))
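To actually skip the unavailable sites, you can catch the TimeoutException that the wait raises and move on to the next iteration. A minimal sketch, assuming you loop over a list called urls:
from selenium.common.exceptions import TimeoutException

for url in urls:  # urls: your list of sites to scrape
    driver.get(url)
    try:
        WebDriverWait(driver, 5).until(
            EC.presence_of_element_located((By.XPATH, element_xpath_locator))
        )
    except TimeoutException:
        continue  # site unavailable or element missing: skip to the next one
    # ... collect and store the data here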

Accepting cookie popups using Selenium in Python

I'm trying to put together a small scraper for public trademark data. There is a database available that I'm accessing with Selenium and Python.
I can do just about anything I need to, but for some reason I can't actually click the "accept cookies" button on the website. The following code highlights the button, but it does not get rid of the popup.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://data.inpi.fr/recherche_avancee/marques')
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "tarteaucitronPersonalize2"))
).click()
I have looked up similar threads on this forum, and I have tried multiple things:
- adding a waiting period, which ended up highlighting the button, so at least I know it does something
- using JavaScript code to do the actual click, which did not work any better
- calling the button via its ID, its XPath, its CSS selector, anything I could find really
I even downloaded Selenium IDE to record my clicks to see exactly how I could replicate it, but it still only recorded a click.
I tried my best; does anyone know where my mistake lies? I am open to using other languages, or another platform.
Well, it looks like I managed to solve it just minutes after posting my question!
For some reason you need to resize the window. I just added the following line of code after opening the URL and it worked the first time.
driver.maximize_window()
I added this answer in case anyone stumbles upon this post and wants to avoid pulling their hair out over this!
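For anyone collecting the pieces, here is a minimal sketch of the working sequence. It also swaps presence_of_element_located for element_to_be_clickable, since presence alone does not guarantee the button can receive the click:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://data.inpi.fr/recherche_avancee/marques')
driver.maximize_window()  # the resize is what makes the click register
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "tarteaucitronPersonalize2"))
).click()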

Python, Selenium and Chrome - How can I detect the end of a page with dynamically generated content?

I have gone through existing questions and Google results of a similar nature; none of the solutions posed has worked for me on the particular website I am currently scraping.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
I am sending page-down keys to the body element, which loads each item to be scraped. I have two issues with this: first, I am unable to detect when the scrolling has stopped; second, I have to manually click the browser window as it opens to allow the keys to be sent, and I am not sure how to mimic this focus-giving behavior in code.
elem = driver.find_element_by_tag_name("body")
elem.send_keys(Keys.PAGE_DOWN)
I have tried the following, in many different iterations, and the number printed never changed regardless of how far down the page I was, or whether I used innerHeight, or body instead of documentElement.
height = driver.execute_script("return document.documentElement.scrollHeight")
If I attempt to scroll down the page using a similar approach, this page does not move.
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
I am unsure if this has to do with iframes or if I am simply misunderstanding the best approach.
I have still been unable to find a way to reliably detect the end of the page.
Thank you!
After adding the required imports
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
you can validate that the bottom of the page has been reached when the following element is visible:
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, "//p[contains(text(),'License')]"))
)
As for the second issue, try clicking on the following element with Selenium:
driver.find_element_by_id("products-container").click()
I have no environment to debug this, but I expect this will work.
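Putting the two together, a rough sketch of the loop, assuming the 'License' paragraph only appears in the footer once everything above it has loaded:
import time
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

body = driver.find_element_by_tag_name("body")
while True:
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)  # give the menu time to render the next batch of items
    try:
        # stop once the footer text has been rendered
        driver.find_element_by_xpath("//p[contains(text(),'License')]")
        break
    except NoSuchElementException:
        continue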

Selenium/Python - Finding Dynamically Created Fields

Newbie here... 2 days into learning this.
In a learning management system, there is an element (a plus-mark icon) that adds a form field on each click. The goal is to click the icon, which generates a new field, and then put text into that new field. This field does NOT exist when the page loads... it's added dynamically when the icon is clicked.
When I try to use "driver.find_element_by_*" (I have tried ID, name, and XPath), I get an error that the element can't be found. I'm assuming that's because it wasn't there when the page loaded. Any way to resolve this?
By the way, I've been successful in scripting the login process and navigating through the site to get to this point. So, I have actually learned how to find other elements that are static.
Let me know if I need to provide more info or a better description.
Thanks,
Bill
Apparently I needed to have patience and let something catch up...
I added:
import time
and then:
time.sleep(3)
after the click on the icon to add the field. It's working!
You can use time.sleep(3), but that forces you to wait the entire 3 seconds before using the element. In Selenium we instead use webdriver waits, which poll the DOM and let us use the element as soon as it becomes usable.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)
# Put the CSS selector for your plus-mark icon inside the empty string
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ""))).click()
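Continuing from the block above, a rough sketch of the full flow for this question, using hypothetical selectors ".add-field-icon" and ".new-field input" that you would replace with the real locators from your LMS:
# Hypothetical selectors -- replace with the real ones from your LMS.
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".add-field-icon"))).click()
new_field = wait.until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, ".new-field input"))
)
new_field.send_keys("text for the new field")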

Selenium driver not returning all web elements even though they are the same class

I'm trying to make a web scraper for a webpage using Selenium.
The webpage is https://www.supersport.hr/sport/dan/0/sport/1
I am able to get most of the web elements I'm after, but my script doesn't return all of them, even though they share the same class.
page example:
In this case my script returns all of the divs above, but cuts off at "SRL GRČKA 1." and doesn't get any of the leagues below it.
The HTML markup for each of these leagues is the same:
I'm getting a list of these elements in python like this:
football_leagues_elements = driver.find_elements_by_css_selector("div.sportska-liga.nogomet")
I've also tried with this code but it returns the same result:
football_leagues_elements = driver.find_elements_by_xpath("//div[contains(@class, 'sportska-liga-wrap')]//div[contains(@class, 'nogomet')]")
I think all the leagues are loaded to the page at the same time.
My question is why are some divs not included in the webelement list?
Any help is welcome.
The website is not accessible from where I am, so I can't verify. However, try inducing an explicit wait for all the elements to be visible:
football_leagues_elements = WebDriverWait(driver, 20).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.sportska-liga.nogomet"))
)
You need the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
If the above doesn't work, try scrolling the page first and then checking for the elements.
# Scroll to the bottom of the page.
driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
# Get the elements.
football_leagues_elements = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.sportska-liga.nogomet"))
)
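If the leagues are lazy-loaded as you scroll, a sketch of a loop that keeps scrolling until the element count stops growing, on the assumption that a stable count means everything has loaded:
import time

previous_count = 0
while True:
    driver.execute_script("window.scrollTo(0,document.body.scrollHeight)")
    time.sleep(2)  # give the page time to render newly loaded leagues
    elements = driver.find_elements_by_css_selector("div.sportska-liga.nogomet")
    if len(elements) == previous_count:
        break
    previous_count = len(elements)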

How to periodically re-check a webpage using selenium in python

I am new to selenium in python (and all web-interface applications of python) and I have a task to complete for my present internship.
My script successfully navigates to an online database and inputs information from my data tables, but then the webpage in question takes anywhere from 30 seconds to several minutes to compute an output.
How do I go about instructing Python to re-check the page every 30 seconds until the output appears, so that I can parse it for the data I need? For instance, which functions might I start with?
This will be part of a loop repeated for over 200 entries, and hundreds more if I am successful, so it is worth my time to automate it.
Thanks
You should use Selenium's waits, as pointed out by G_M and Sam Holloway.
The one I use most is expected_conditions:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()
It will wait up to 10 seconds for an element with id "myDynamicElement" to be present, then continue with the rest of your work, which should follow inside the try block.
I prefer to use By.XPATH, but if you use By.XPATH with the method presence_of_element_located, add another pair of parentheses so the locator is passed as the required tuple, as noted in this answer:
from selenium.webdriver.common.by import By
driver.find_element(By.XPATH, '//button[contains(text(),"Some text")]')
driver.find_element(By.XPATH, '//div[#id="id1"]')
driver.find_elements(By.XPATH, '//a')
The easiest way (for me) to find the XPath of an element is to open Chrome's developer tools (F12), press Ctrl+F in the Elements panel, and use Inspect with the mouse while composing an XPath specific enough to match just the expected element, or as few elements as possible.
All the examples are from (or based on) the great Selenium documentation.
If you just want to space out checks, the time.sleep() function should work.
However, as G_M's comment says, you should look into Selenium waits. Think about this: is there an element on the page that will indicate that the result is loaded? If so, use a Selenium wait on that element to make sure your program is only pausing until the result is loaded and not wasting any time afterwards.
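To match the "re-check every 30 seconds" requirement directly, WebDriverWait also accepts a poll_frequency argument. A minimal sketch, assuming a hypothetical element id "output" for the computed result:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Poll for the result every 30 seconds, for up to 10 minutes.
wait = WebDriverWait(driver, 600, poll_frequency=30)
result = wait.until(
    EC.presence_of_element_located((By.ID, "output"))  # hypothetical id
)
print(result.text)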
