I am trying to scrape data from a table on each of several hundred pages of a website. Here is part of what I have so far:
driver.get("link")
driver.maximize_window()
window_before = driver.window_handles[0]
driver.switch_to.window(window_before)
wait = WebDriverWait(driver, 10)
driver.execute_script("window.scrollTo(0, 350)")
games = driver.find_elements(By.XPATH, '//*[@id="schedule"]/tbody/tr')
This code only works sometimes. If I run this chunk 10 times, only 5 times will the website actually scroll down. I tried using this:
for i in range(0, 2): driver.find_element(By.XPATH, '//*[@id="meta"]/div[1]/p[1]/a').send_keys(Keys.DOWN)
but the same issue arises. Sometimes that scrolls down the amount I need, other times it does nothing, and other times it scrolls the entire page.
This part of my code navigates to the first link I need to click, and on the next page I need to scroll again, where the same issue is present. This is all part of a loop that goes through several hundred pages to read HTML tables, so even if it works the first 50 times, I won't get all the data I need.
Edit: Directly after the above snippet I have this:
for idx, game in enumerate(games):
    driver.find_element(By.XPATH, '/html/body/div[2]/div[6]/div[3]/div[2]/table/tbody/tr['+str(idx+1)+']/td[6]/a').click()
Which is where I get the "element is not clickable at point (X, Y)" error.
Am I doing something wrong here, or is there a workaround to accomplish my goal?
Here is one way to access the href attribute of every 'Box Score' link on that page (per the OP's clarification in the comments):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(browser, 20)
actions = ActionChains(browser)
url = 'https://www.basketball-reference.com/leagues/NBA_2014_games-october.html'
browser.get(url)
# print(browser.page_source)
# browser.maximize_window()
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@class="qc-cmp2-summary-section"]'))).click()
    print('clicked cookie parent')
    wait.until(EC.element_to_be_clickable((By.XPATH, '//button[@mode="primary"]'))).click()
    print('accepted cookies')
except Exception as e:
    print('no cookies')
wait.until(EC.element_to_be_clickable((By.XPATH, '//div[@id="all_schedule"]'))).location_once_scrolled_into_view
table_with_score_links = wait.until(EC.presence_of_element_located((By.XPATH, '//table[@id="schedule"]')))
# print(table_with_score_links.get_attribute('outerHTML'))
links_from_table = [x.get_attribute('href') for x in table_with_score_links.find_elements(By.TAG_NAME, 'a') if x.text == 'Box Score']
print(links_from_table)
Result printed in terminal:
clicked cookie parent
accepted cookies
['https://www.basketball-reference.com/boxscores/201310290IND.html', 'https://www.basketball-reference.com/boxscores/201310290MIA.html', 'https://www.basketball-reference.com/boxscores/201310290LAL.html', 'https://www.basketball-reference.com/boxscores/201310300CLE.html', 'https://www.basketball-reference.com/boxscores/201310300TOR.html', 'https://www.basketball-reference.com/boxscores/201310300PHI.html', 'https://www.basketball-reference.com/boxscores/201310300DET.html', 'https://www.basketball-reference.com/boxscores/201310300NYK.html', 'https://www.basketball-reference.com/boxscores/201310300NOP.html', 'https://www.basketball-reference.com/boxscores/201310300MIN.html', 'https://www.basketball-reference.com/boxscores/201310300HOU.html', 'https://www.basketball-reference.com/boxscores/201310300SAS.html', 'https://www.basketball-reference.com/boxscores/201310300DAL.html', 'https://www.basketball-reference.com/boxscores/201310300UTA.html', 'https://www.basketball-reference.com/boxscores/201310300PHO.html', 'https://www.basketball-reference.com/boxscores/201310300SAC.html', 'https://www.basketball-reference.com/boxscores/201310300GSW.html', 'https://www.basketball-reference.com/boxscores/201310310CHI.html', 'https://www.basketball-reference.com/boxscores/201310310LAC.html']
I tried to make the variable names as descriptive as possible, and I left some commented-out lines of code in place to show the thought process that builds up to the end goal.
You can now go through those links one by one, etc.
Selenium documentation can be found here: https://www.selenium.dev/documentation/
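Going through the links one by one can be done with plain navigation rather than clicks, which sidesteps the scrolling and "element is not clickable" problems. A minimal sketch, assuming `browser` is the WebDriver from the snippet above and `links_from_table` is the list it printed; `visit_all` and `extract` are hypothetical names introduced here, not part of Selenium:

```python
def visit_all(browser, urls, extract):
    """Open each URL directly and collect whatever `extract` returns for it.

    `browser` only needs a .get(url) method (a Selenium WebDriver has one);
    `extract` is a hypothetical callback that receives the browser after
    the page has been requested and returns the data to keep.
    """
    results = {}
    for url in urls:
        browser.get(url)          # navigate by URL, so no click and no scrolling is needed
        results[url] = extract(browser)
    return results


# Hypothetical usage with the objects from the answer above:
# pages = visit_all(browser, links_from_table, lambda b: b.page_source)
```

Because each page is requested by URL, a failure on one page does not depend on the scroll position left behind by the previous one.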
Related
I'm trying to scrape the list of services from this site, but I'm not able to click through to the next page.
This is what I've tried so far using selenium & bs4:
#attempt1
next_pg_btn = browser.find_elements(By.CLASS_NAME, 'ui-lib-pagination_item_nav')
next_pg_btn.click() # nothing happens
#attempt2
browser.find_element(By.XPATH, "//div[@role = 'button']").click() # nothing happens
#attempt3 - saw in some stackoverflow post that sometimes we need to scroll to the
#bottom of page to have the button clickable, so tried that
browser.execute_script("window.scrollTo(0,2500)")
browser.find_element(By.XPATH, "//div[@role = 'button']").click() # nothing happens
I'm not very experienced with scraping; please advise how to handle this and where I'm going wrong.
Thanks
Several issues with your code:
You used the wrong locators.
You probably need to wait for the element to load before clicking it. If you perform other actions on the page before clicking the pagination button, this is not needed, because the web elements will already have loaded while you were scraping the page content.
The pagination button is at the bottom of the page, so you need to scroll the page to bring the button into the visible part of the screen.
After scrolling, a short delay should be added, as you can see in the code below.
Now the pagination element can be clicked.
The following code works:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')  # raw string so the backslashes are not treated as escapes
driver = webdriver.Chrome(options=options, service=webdriver_service)
wait = WebDriverWait(driver, 10)
url = "https://www.tamm.abudhabi/en/life-events/individual/HousingProperties"
driver.get(url)
pagination = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "ui-lib-pagination__item_nav")))
pagination.location_once_scrolled_into_view
time.sleep(0.5)
pagination.click()
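If the fixed `time.sleep(0.5)` turns out to be flaky, one alternative (not what the code above does) is to center the element with JavaScript and click it through `execute_script`, so the click does not depend on where the viewport ended up. A rough sketch; `js_scroll_and_click` is a hypothetical helper name, and `driver` and `pagination` are assumed to be the objects from the code above:

```python
def js_scroll_and_click(driver, element):
    """Scroll `element` to the middle of the viewport, then click it via JavaScript.

    A JavaScript click skips Selenium's own "is it clickable" checks, so it is a
    fallback for stubborn elements, not a replacement for a normal click.
    """
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)


# Hypothetical usage with the objects from the answer above:
# js_scroll_and_click(driver, pagination)
```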
I click specific buttons on a page, but for some reason there is one button that I can't click, even though it is positioned exactly like the other buttons that I can click.
As you will notice, the code below opens a page and then clicks through to another page. I do this step because only then am I redirected to the real URL that has the //int.
import datetime
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
with open('my_user_agent.txt') as f:
    my_user_agent = f.read()
headers = {
    'User-Agent': my_user_agent
}
options = Options()
options.set_preference("general.useragent.override", my_user_agent)
options.set_preference("media.volume_scale", "0.0")
options.page_load_strategy = 'eager'
driver = webdriver.Firefox(options=options)
today = datetime.datetime.now().strftime("%Y/%m/%d")
driver.get(f"https://int.soccerway.com/matches/{today}/")
driver.find_element(by=By.XPATH, value="//div[contains(@class,'language-picker-trigger')]").click()
time.sleep(3)
driver.find_element(by=By.XPATH, value="//li/a[contains(@href,'https://int.soccerway.com')]").click()
time.sleep(3)
try:
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class,'tbl-read-more-btn')]")))
    driver.find_element(by=By.XPATH, value="//a[contains(@class,'tbl-read-more-btn')]").click()
    time.sleep(0.1)
except:
    pass
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//div[@data-exponload='']//button[contains(@class,'expand-icon')]")))
for btn in driver.find_elements(by=By.XPATH, value="//div[@data-exponload='']//button[contains(@class,'expand-icon')]"):
    btn.click()
    time.sleep(0.1)
I've tried adding btn.location_once_scrolled_into_view before each click to make sure the button is correctly in the click position, but the problem still persists.
I also tried using the options mentioned here:
Selenium python Error: element could not be scrolled into view
But the error persisted, and I couldn't work out what was causing it.
Error text:
selenium.common.exceptions.ElementNotInteractableException: Message: Element <button class="expand-icon"> could not be scrolled into view
Stacktrace:
RemoteError#chrome://remote/content/shared/RemoteError.jsm:12:1
WebDriverError#chrome://remote/content/shared/webdriver/Errors.jsm:192:5
ElementNotInteractableError#chrome://remote/content/shared/webdriver/Errors.jsm:302:5
webdriverClickElement#chrome://remote/content/marionette/interaction.js:156:11
interaction.clickElement#chrome://remote/content/marionette/interaction.js:125:11
clickElement#chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:204:29
receiveMessage#chrome://remote/content/marionette/actors/MarionetteCommandsChild.jsm:92:31
Edit 1:
I noticed that the error only happens when the element is colored orange (when they are colored orange it means that one of the competition games is happening now, in other words it is live).
But the button is still the same, it keeps the same element, so I don't know why it's not being clicked.
See the color difference:
Edit 2:
If you open the browser normally or without the settings I put in my code, the elements in orange are loaded already expanded, but using the settings I need to use, they don't come expanded. So please use the settings I use in the code so that the page opens the same.
What you are missing here is a try-except block around the click command in the loop that opens those sections.
The following code works; I tried running it several times.
import datetime
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("start-maximized")
# Set the page load strategy on the options object; the desired_capabilities argument is deprecated in Selenium 4
options.page_load_strategy = "eager"
webdriver_service = Service(r'C:\webdrivers\chromedriver.exe')
driver = webdriver.Chrome(service=webdriver_service, options=options)
wait = WebDriverWait(driver, 10)
today = datetime.datetime.now().strftime("%Y/%m/%d")
driver.get(f"https://int.soccerway.com/matches/{today}/")
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[contains(@class,'language-picker-trigger')]"))).click()
time.sleep(5)
wait.until(EC.element_to_be_clickable((By.XPATH, "//li/a[contains(@href,'https://int.soccerway.com')]"))).click()
time.sleep(5)
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@class,'tbl-read-more-btn')]")))
    driver.find_element(By.XPATH, "//a[contains(@class,'tbl-read-more-btn')]").click()
    time.sleep(0.1)
except:
    pass
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@data-exponload='']//button[contains(@class,'expand-icon')]")))
for btn in driver.find_elements(By.XPATH, "//div[@data-exponload='' and not(contains(@class,'status-playing'))]//button[contains(@class,'expand-icon')]"):
    btn.click()
    time.sleep(0.1)
UPD
We only need to open the closed sections; sections that are already open should stay open. In that case the click always works without throwing exceptions. To achieve this, we just add a condition to the locator: click only the buttons that are not inside a section whose status is currently playing.
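The try-except idea mentioned at the start of this answer can still be kept as a safety net alongside the narrower locator. A minimal sketch, assuming the buttons come from a `driver.find_elements(...)` call like the one above; `click_each` is a hypothetical helper, and it catches the broad `Exception` only to stay dependency-free here (with Selenium imported you would more likely catch `ElementNotInteractableException`, the error shown in the question):

```python
import time


def click_each(buttons, pause=0.1):
    """Click every button, skipping the ones that refuse to be clicked.

    Returns the list of buttons whose click raised, so they can be inspected
    or retried instead of aborting the whole loop.
    """
    failed = []
    for btn in buttons:
        try:
            btn.click()
        except Exception:         # assumption: narrow this to Selenium's exception classes in real code
            failed.append(btn)
            continue
        time.sleep(pause)         # same short pause as in the loop above
    return failed
```

Returning the failures rather than silently passing makes it easier to see whether the live sections are the only ones being skipped.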
Hello, I am using the Selenium WebDriver for Chrome in Python and trying to scrape some data from https://www.google.com/travel/things-to-do
I am focused here on the place description, which can be seen here:
So, to get to each individual description, I have to click on the attraction and save the HTML to a list for future parsing with BeautifulSoup.
Every click refreshes the page, so I was thinking about somehow counting all the attractions that are displayed and then, in a loop, clicking every attraction and saving its description.
Does anybody have an idea how to approach it?
Here's simple code that gets you to the place where I am stuck:
chrome_options = webdriver.ChromeOptions()
#chrome_options.headless = True
chrome_options.add_argument('--incognito')
#chrome_options.add_argument('--headless')
s=Service(ChromeDriverManager().install())
driver = webdriver.Chrome(executable_path=r"\\chromedriver.exe", options=chrome_options, service=s)
driver.get("https://www.google.com/travel/things-to-do/see-all?dest_mid=%2Fm%2F081m_&dest_state_type=sattd&dest_src=yts&q=Warszawa#ttdm=52.227486_21.004941_13&ttdmf=%252Fm%252F0862m")
# If you are not running webdriver in incognito mode you might skip the below button since it goes through accepting cookies
button = driver.find_element_by_xpath("/html/body/c-wiz/div/div/div/div[2]/div[1]/div[4]/form/div[1]/div/button")
button.click()
time.sleep(1)
objects = driver.find_elements_by_class_name('f4hh3d')
for k in objects:
    k.click()
    time.sleep(5)
For each attraction index you can click it to open the details, get the details, close the details, get the list of attractions again and go for the next attraction.
Something like this should work:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get("https://www.google.com/travel/things-to-do/see-all?dest_mid=%2Fm%2F081m_&dest_state_type=sattd&dest_src=yts&q=Warszawa#ttdm=52.227486_21.004941_13&ttdmf=%252Fm%252F0862m")
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "div.f4hh3d")))
time.sleep(1)
attractions = driver.find_elements(By.CSS_SELECTOR, 'div.f4hh3d')
for i in range(len(attractions)):
    # Re-locate the list on every pass: references found earlier go stale once the page refreshes
    attractions = driver.find_elements(By.CSS_SELECTOR, 'div.f4hh3d')
    attractions[i].click()
    description = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div[jsname="BEZjkb"] div[jsname="bN97Pc"]'))).text
    #do with the description what you want
    #close the attraction by clicking the button
    driver.find_element(By.CSS_SELECTOR, 'div.reh1ld button').click()
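The approach described in this answer (open an item, read it, close it, locate the list again) can also be written generically, since each click refreshes the page and invalidates previously found elements. A sketch in which `find_items`, `read_details`, and `close_details` are hypothetical callables you would supply, for example by wrapping the Selenium lookups and waits used in this thread:

```python
def collect_details(find_items, read_details, close_details):
    """Click the i-th item of a freshly located list, read its details, close it, repeat.

    find_items()    -> current list of clickable items (located anew on each call)
    read_details()  -> data for the item that was just opened
    close_details() -> returns the page to the list view
    """
    collected = []
    total = len(find_items())
    for i in range(total):
        items = find_items()      # fresh references; the old ones may be stale by now
        if i >= len(items):       # the list shrank after a refresh, so stop early
            break
        items[i].click()
        collected.append(read_details())
        close_details()
    return collected
```

Indexing into a fresh list on every pass is what avoids the stale element problem; iterating over the list found before the first click would not.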
I'm trying to accept a cookie popup and have looked at a similar question.
This is how the popup looks: https://i.stack.imgur.com/FxChW.png
I have tried different things. For instance, I tried following the other question's solution like so:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
browser.get('https://www.volkskrant.nl/best-gelezen?utm_source=pocket_mylist')
wait = WebDriverWait(browser, 4)
element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.message-component:nth-child(1)')))
This, however, gives this error.
I tried a bunch of different things, but I can't seem to select anything on the page (at least nothing from the popup).
I know this question has already been asked a couple of times, but I did not find a solution.
Has anyone else encountered this problem and knows how to accept the cookies in order to get to the regular site?
thanks in advance!
There's an iframe:
iframe[title='Iframe title']
You need to switch to it first in Selenium:
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[title='Iframe title']")))
After this you can click on the accept-cookies button.
Full code:
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.volkskrant.nl/best-gelezen?utm_source=pocket_mylist")
wait = WebDriverWait(driver, 10)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.CSS_SELECTOR, "iframe[title='Iframe title']")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[title='Akkoord']"))).click()
This element is inside an iframe.
So you have to switch to that iframe first; only after that will you be able to accept the cookies.
browser = webdriver.Firefox()
browser.get('https://www.volkskrant.nl/best-gelezen?utm_source=pocket_mylist')
wait = WebDriverWait(browser, 20)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[contains(@src,'preload')]")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button[title="Akkoord"]'))).click()
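One step neither snippet shows: after the click, the driver is still focused on the iframe, so locators for the main page will not match until you switch back with `driver.switch_to.default_content()`. A small sketch that makes the switch back automatic; `inside_frame` is a hypothetical helper, and `frame` is assumed to be an iframe element you have already located:

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame):
    """Run a block of code inside `frame`, then return focus to the top-level page."""
    driver.switch_to.frame(frame)
    try:
        yield driver
    finally:
        # Runs even if the body raises, so later lookups target the main document again
        driver.switch_to.default_content()


# Hypothetical usage (iframe_element is an assumption, located beforehand):
# with inside_frame(browser, iframe_element):
#     browser.find_element(By.CSS_SELECTOR, 'button[title="Akkoord"]').click()
```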
I am trying to get a list of proxies from https://sslproxies.org/ using Selenium (headless via PhantomJS) and BeautifulSoup:
This is what I did so far:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550)
driver.get("https://sslproxies.org/")
while True:
    try:
        next_button = driver.find_element_by_xpath("//li[@class='paginate_button next'][@id='proxylisttable_next']")
    except:
        break
    next_button.click()
    soup = BeautifulSoup(next_button.get_attribute('innerHTML'),'html.parser')
But I get this error:
"errorMessage":"Element is no longer attached to the DOM"
You are defining next_button, then clicking said button, then trying to reference the next_button variable again. Your click has caused you to navigate to another page with a brand new DOM, so your definition of next_button no longer works. To avoid this, you can simply redefine the variable after the click, or just always use the whole expression:
driver.find_element_by_xpath("//li[@class='paginate_button next'][@id='proxylisttable_next']")
1 You can iterate through the pages using a for loop, but for this you need to get the number of pages. The way to get the number of pages may differ from site to site. In your case it is easy.
The number of pages is the length of the list of page locators plus 1; the length is obtained like this: len(driver.find_elements_by_xpath("//li[@class='paginate_button ']")).
2 Your locator was incorrect, so I changed it to: //li[@class='paginate_button next'][@id='proxylisttable_next']/a (added /a)
3 After finding the button, you click it in the finally block.
SOLUTION
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.implicitly_wait(10)
driver.set_window_size(1120, 550)
driver.get("https://sslproxies.org/")
wait = WebDriverWait(driver, 10)
length = len(driver.find_elements_by_xpath("//li[@class='paginate_button ']"))
print(f"List length is: {length}")
for j in range(1, length+1):
    try:
        print("Clicking Page " + str(j+1))
        wait.until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "section[id='list']")))
        wait.until(EC.element_to_be_clickable((By.XPATH, "//li[@class='paginate_button next'][@id='proxylisttable_next']/a")))
    finally:
        next_button = driver.find_element_by_xpath(
            "//li[@class='paginate_button next'][@id='proxylisttable_next']/a")
        next_button.click()
P.S. I tested it on Chrome, but it should work in any browser, as I use stable locators and waits.
My output for debug:
List length is: 4
Clicking Page 2
Clicking Page 3
Clicking Page 4
Clicking Page 5
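As an alternative to counting the pages up front, the loop can locate the next button afresh on every pass and stop when it is gone or disabled. A sketch under assumptions: `find_next` and `read_page` are hypothetical callables (for example wrapping the XPath lookup for the next button and a BeautifulSoup parse of `driver.page_source`), and the check for a `disabled` class is a guess about how this table marks its last page, not something confirmed in this thread:

```python
def paginate(find_next, read_page, max_pages=100):
    """Read the current page, click "next", and repeat until there is no usable next button.

    find_next() -> the next-page element located afresh, or None when absent
    read_page() -> data scraped from the page currently shown
    """
    pages = []
    for _ in range(max_pages):    # hard cap so a page that never ends cannot loop forever
        pages.append(read_page())
        button = find_next()      # never reuse the element found before the previous click
        if button is None:
            break
        classes = button.get_attribute("class") or ""
        if "disabled" in classes.split():   # assumption: the last page marks the button this way
            break
        button.click()
    return pages
```

Reading the page before looking for the button means the last page is still scraped even though it has no working next button.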