I'm trying to scrape some information from a website. The loop works the first time, but on the second iteration the exception occurs. I've tried several approaches to solve it with an implicit wait or WebDriverWait, but the exception keeps appearing. Would you give me a hand?
Here is the code:
website = 'https://www.elempleo.com/co/ofertas-empleo/55-6-millones?'
driver = webdriver.Chrome(path)
driver.get(website)
empleos = driver.find_elements_by_tag_name('div.result-list.js-result-list.js-results-container')
data = []
i = 0
data = empleos[0].text.splitlines()
while i < 4:
    data.append(empleos[0].text.splitlines())
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    siguiente = driver.find_elements_by_tag_name('a.js-btn-next')
    siguiente[0].click()
    i += 1
I tried using:
siguiente = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.TAG_NAME, "a.js-btn-next")))
and
driver.implicitly_wait(10) with different times, but it didn't work.
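For reference, a minimal sketch of a loop that re-locates both elements on every iteration (a hedged example: it assumes the CSS classes shown above stay stable, and it uses CSS selectors, since the locator strings in the code are CSS selectors rather than tag names):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
data = []
for page in range(5):
    # re-locate the results container on every page so a stale reference is never reused
    results = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, 'div.result-list.js-result-list.js-results-container')))
    data.extend(results.text.splitlines())
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # re-locate the "next" button as well before clicking it
    siguiente = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.js-btn-next')))
    siguiente.click()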
Here's the link to the website: website
I would like to get all the links to the hotels in this location.
Here's my script:
import pandas as pd
import numpy as np
from selenium import webdriver
import time
PATH = "driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
driver = webdriver.Chrome(options=options, executable_path=PATH)
driver.get('https://fr.hotels.com/search.do?destination-id=10398359&q-check-in=2021-06-24&q-check-out=2021-06-25&q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER')
cookie = driver.find_element_by_xpath('//button[@class="uolsaJ"]')
try:
    cookie.click()
except:
    pass
for i in range(30):
    driver.execute_script("window.scrollBy(0, 1000)")
    time.sleep(5)
time.sleep(5)
my_elems = driver.find_elements_by_xpath('//a[@class="_61P-R0"]')
links = [my_elem.get_attribute("href") for my_elem in my_elems]
X = np.array(links)
print(X.shape)
#driver.close()
But I cannot find a way to tell the script: scroll down until there is nothing more to scroll.
I tried to change these parameters:
for i in range(30):
    driver.execute_script("window.scrollBy(0, 1000)")
    time.sleep(30)
I changed the time.sleep(), the number 1000, and so on, but my output keeps changing, and not in the right way.
(screenshot of the varying output counts omitted)
As you can see, the number of scraped links differs on every run. How can I make my script scrape the same amount each time? Not necessarily every link, but at least a stable number.
The page scrolls, but at some point it seems to get stuck and the script only scrapes the links it has loaded at that moment. That's not what I want.
There are several issues here.
You are getting the elements and their links only AFTER you have finished scrolling, while you should do that inside the scrolling loop.
You should wait for the cookies alert to appear before closing it.
You can scroll until the footer element is present.
Something like this:
import pandas as pd
import numpy as np
from selenium import webdriver
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
PATH = "driver\chromedriver.exe"
options = webdriver.ChromeOptions()
options.add_argument("--disable-gpu")
options.add_argument("--window-size=1200,900")
options.add_argument('enable-logging')
driver = webdriver.Chrome(options=options, executable_path=PATH)
wait = WebDriverWait(driver, 20)
driver.get('https://fr.hotels.com/search.do?destination-id=10398359&q-check-in=2021-06-24&q-check-out=2021-06-25&q-rooms=1&q-room-0-adults=2&q-room-0-children=0&sort-order=BEST_SELLER')
wait.until(EC.visibility_of_element_located((By.XPATH, '//button[@class="uolsaJ"]'))).click()
def is_element_visible(xpath):
    wait1 = WebDriverWait(driver, 2)
    try:
        wait1.until(EC.visibility_of_element_located((By.XPATH, xpath)))
        return True
    except Exception:
        return False
while not is_element_visible("//footer[@id='footer']"):
    my_elems = driver.find_elements_by_xpath('//a[@class="_61P-R0"]')
    links = [my_elem.get_attribute("href") for my_elem in my_elems]
    X = np.array(links)
    print(X.shape)
    driver.execute_script("window.scrollBy(0, 1000)")
    time.sleep(5)
#driver.close()
You can also try querying the DOM directly and locating some element that is present only at the bottom of the page, using Selenium's .is_displayed() method, which returns true/false:
# https://stackoverflow.com/a/57076690/15164646
while True:
    # it will keep returning False until the element is located
    # the "#message" id ("No more results") is at the bottom of the YouTube search
    end_result = driver.find_element_by_css_selector('#message').is_displayed()
    driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
    # further code below
    # once the element is found it returns True; if so, break out of the while loop
    if end_result == True:
        break
I wrote a blog post where I used this method to scrape YouTube Search.
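Another common, site-agnostic pattern (a sketch, not specific to this page) is to compare the page height before and after each scroll and stop once it no longer grows:

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # give lazily loaded results time to appear; tune for the site
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing, nothing more to scroll
    last_height = new_height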
I'm scraping an e-commerce website, Lazada, using Selenium and bs4. I managed to scrape the first page, but I am unable to iterate to the next page. What I'm trying to achieve is to scrape all the pages for the categories I've selected.
Here is what I've tried:
# Run the argument with incognito
option = webdriver.ChromeOptions()
option.add_argument('--incognito')
driver = webdriver.Chrome(executable_path='chromedriver', chrome_options=option)
driver.get('https://www.lazada.com.my/')
driver.maximize_window()
# Select category item #
element = driver.find_elements_by_class_name('card-categories-li-content')[0]
webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
t = 10
try:
    WebDriverWait(driver,t).until(EC.visibility_of_element_located((By.ID,"a2o4k.searchlistcategory.0.i0.460b6883jV3Y0q")))
except TimeoutException:
    print('Page Refresh!')
    driver.refresh()
    element = driver.find_elements_by_class_name('card-categories-li-content')[0]
    webdriver.ActionChains(driver).move_to_element(element).click(element).perform()
print('Page Load!')

#Soup and select element
def getData(np):
    soup = bs(driver.page_source, "lxml")
    product_containers = soup.findAll("div", class_='c2prKC')
    for p in product_containers:
        title = (p.find(class_='c16H9d').text)  # title
        selling_price = (p.find(class_='c13VH6').text)  # selling price
        try:
            original_price = (p.find("del", class_='c13VH6').text)  # original price
        except:
            original_price = "-1"
        if p.find("i", class_='ic-dynamic-badge ic-dynamic-badge-freeShipping ic-dynamic-group-2'):
            freeShipping = 1
        else:
            freeShipping = 0
        try:
            discount = (p.find("span", class_='c1hkC1').text)
        except:
            discount = "-1"
        if p.find(("div", {'class':['c16H9d']})):
            url = "https:" + (p.find("a").get("href"))
        else:
            url = "-1"
        nextpage_elements = driver.find_elements_by_class_name('ant-pagination-next')[0]
        np = webdriver.ActionChains(driver).move_to_element(nextpage_elements).click(nextpage_elements).perform()
        print("- -"*30)
        toSave = [title, selling_price, original_price, freeShipping, discount, url]
        print(toSave)
        writerows(toSave, filename)

getData(np)
The problem might be that the driver is trying to click the button before the element has loaded correctly.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(PATH, chrome_options=option)
# use this code after driver initialization
# this makes the driver wait up to 5 seconds for elements to appear
driver.implicitly_wait(5)
url = "https://www.lazada.com.ph/catalog/?q=phone&_keyori=ss&from=input&spm=a2o4l.home.search.go.239e359dTYxZXo"
driver.get(url)
next_page_path = "//ul[@class='ant-pagination ']//li[@class=' ant-pagination-next']"
# the following code waits up to 5 seconds for
# the element to become clickable
# and then tries clicking it
try:
    next_page = WebDriverWait(driver, 5).until(
        EC.element_to_be_clickable((By.XPATH, next_page_path)))
    next_page.click()
except Exception as e:
    print(e)
EDIT 1
Changed the code to make the driver wait for the element to become clickable. You can put this code inside a while loop to iterate multiple times, and break out of the loop once the button is not found or not clickable.
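A rough sketch of that loop (hedged: the per-page scraping is left as a placeholder, and next_page_path is the XPath defined above):

while True:
    # scrape the current page here (e.g. parse driver.page_source with BeautifulSoup)
    try:
        next_page = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, next_page_path)))
        next_page.click()
    except Exception:
        # the "next" button is missing or never became clickable: assume this was the last page
        break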
I am using Selenium with Python and trying to move the cursor to and click on a specific element. This works for the first link, and the structure of the HTML is the same for the next link, but I get a StaleElementReferenceException for the second link when accessing it through the same webdriver. Why does this happen, and how do I fix it? Below is the code I am running. Thank you so much!
def getZest(url):
    zestlist = []
    yearlist = []
    driver.get(url)
    time.sleep(5)
    result = False
    attempts = 0
    while (attempts < 5):
        try:
            Home_Value = wait.until(EC.presence_of_element_located((By.XPATH, "//a[text()='Home value']")))
            action.move_to_element(Home_Value).click().perform()
            zestimate = driver.find_element_by_xpath('//*[@id="ds-home-values"]/div/div[3]/button')
            action.move_to_element(zestimate).perform()
            result = True
            break
        except exceptions.StaleElementReferenceException as e:
            print(e)
            attempts = attempts + 1

fivenums = ["https://www.zillow.com/homedetails/212-Haddrell-St-Mount-Pleasant-SC-29464/10922911_zpid/", "https://www.zillow.com/homedetails/20-Grove-St-Hicksville-NY-11801/31127407_zpid/"]
for num in fivenums:
    getZest(num)
I was able to get information from the first and second link with the following code, without any error:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.common.by import By
import time
from selenium.common import exceptions
u = 'https://www.oddsportal.com/moving-margins/'
driver = webdriver.Chrome(executable_path=r"chromedriver.exe")
driver.maximize_window()
def getZest(url):
    zestlist = []
    yearlist = []
    driver.get(url)
    time.sleep(5)
    result = False
    attempts = 0
    action = webdriver.ActionChains(driver)
    wait = WebDriverWait(driver, 300)
    while (attempts < 5):
        try:
            Home_Value = wait.until(EC.presence_of_element_located((By.XPATH, "//a[text()='Home value']")))
            action.move_to_element(Home_Value).click().perform()
            zestimate = driver.find_element_by_xpath('//*[@id="ds-home-values"]/div/div[3]/button')
            action.move_to_element(zestimate).perform()
            result = True
            break
        except exceptions.StaleElementReferenceException as e:
            print(e)
            attempts = attempts + 1

fivenums = ["https://www.zillow.com/homedetails/212-Haddrell-St-Mount-Pleasant-SC-29464/10922911_zpid/", "https://www.zillow.com/homedetails/20-Grove-St-Hicksville-NY-11801/31127407_zpid/"]
for num in fivenums:
    getZest(num)
Your code does not show where some variables (for example action and wait) are instantiated, so maybe that is where the problem lies.
However, when opening the first link, the website showed me the Google Captcha protection, so I assume you have some kind of authorization to scrape this information with the permission of the owner.
I am trying to loop through 42 sites. The script works fine for the first 4-5 sites (sometimes only 3, sometimes it gets as far as site no. 15), then I get the error shown in the picture.
My code is given below:
import time
import requests
from selenium import webdriver
sites = []
userid=[]
password=[]
settings=[]
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--disable-infobars")
print(len(sites))
print(len(userid))
print(len(password))
print(len(settings))
count=5
for x in range(len(sites)):
    try:
        requests.get(sites[x])
        driver = webdriver.Chrome(chrome_options=chrome_options)
        driver.get(sites[x])
        inputElement = driver.find_element_by_id("user_login")
        inputElement.send_keys(userid[x])
        inputElement = driver.find_element_by_id("user_pass")
        inputElement.send_keys(password[x])
        inputElement.submit()
        link = driver.find_element_by_id('menu-plugins')
        link.click()
        driver.find_element_by_xpath('//a[@href="' + settings[x] + '"]').click()
        driver.find_element_by_id('save_and_import').click()
        count = count + 2
        time.sleep(count)
        driver.quit()
    except requests.ConnectionError:
        print(sites[x] + " DOWN !!")
        continue
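It is hard to tell which step fails without the screenshot, but one pattern worth trying (a sketch based on the code above) is to catch any per-site exception and always quit the driver, so one broken site does not stop the whole loop or leak browser instances:

for x in range(len(sites)):
    driver = webdriver.Chrome(chrome_options=chrome_options)
    try:
        driver.get(sites[x])
        driver.find_element_by_id("user_login").send_keys(userid[x])
        password_field = driver.find_element_by_id("user_pass")
        password_field.send_keys(password[x])
        password_field.submit()
        driver.find_element_by_id('menu-plugins').click()
        driver.find_element_by_xpath('//a[@href="' + settings[x] + '"]').click()
        driver.find_element_by_id('save_and_import').click()
    except Exception as e:
        print(sites[x] + " FAILED: " + str(e))
    finally:
        driver.quit()  # always release the browser, even when a step fails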
I'm having issues with Selenium WebDriver's runtime. I'm iterating over an array of 10 URLs and scraping some content from each.
As time goes on and Selenium opens the fourth URL, it gets extremely slow; if I let the task continue, it can't finish, and Python aborts the process because the run time is exceeded.
Imagine: the first URL takes 1 minute to scrape, the second 1-2 minutes, the third 4 minutes, ..., and then it breaks.
I need some workaround for this issue. I'm using IPython Notebook on Python 2.7.
PS: Do you think opening the URLs in different tabs could help?
Edit: This is how I create the browser:
chromeOptions = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images":2,
"profile.default_content_setting_values.notifications" : 2,}
chromeOptions.add_experimental_option("prefs",prefs)
chromeOptions.add_argument("--window-position=0,0")
browser = webdriver.Chrome(chrome_options=chromeOptions)
This is the task that is run for each URL in the array:
browser.get(url)
lastHeight = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    newHeight = browser.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
start = 'Por '
end = ' com'
html_source = browser.page_source
soup = BeautifulSoup(html_source)
cl = soup.find_all('div', attrs={'class': 'cl'})
names = [None] * len(cl)
for i in range(len(cl)):
    try: names[i] = re.search('%s(.*)%s' % (start, end), cl[i].text).group(1)
    except: continue
photosof = list(set(names))
Unfortunately, Selenium's performance is highly dependent on run time and degrades very quickly. The only solution I found was to close and reopen the driver.
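A minimal sketch of that workaround (hedged: urls stands for the array of 10 URLs mentioned above, and scrape_page is a hypothetical wrapper around the scrolling/parsing code shown earlier):

results = {}
for url in urls:
    browser = webdriver.Chrome(chrome_options=chromeOptions)  # fresh driver per URL
    try:
        results[url] = scrape_page(browser, url)  # the scroll + BeautifulSoup logic above
    finally:
        browser.quit()  # free the browser's memory before the next URL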