I'm having issues with Selenium WebDriver's runtime. I'm iterating over an array of 10 URLs and scraping some content from each.
As time goes on, by the time Selenium opens the fourth URL it gets extremely slow... if I let the task continue, it can't finish, and Python aborts the process for exceeding the run time.
For example, the first URL takes 1 minute to scrape, the second 1-2 minutes, the third 4 minutes, and so on, until it breaks.
I need some workaround for this issue. I'm using IPython Notebook with Python 2.7.
PS: Do you think opening the URLs in different tabs could help?
Edit: This is how I create the browser:
chromeOptions = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2,
         "profile.default_content_setting_values.notifications": 2}
chromeOptions.add_experimental_option("prefs", prefs)
chromeOptions.add_argument("--window-position=0,0")
browser = webdriver.Chrome(chrome_options=chromeOptions)
This is the task being run for each URL in the array:
browser.get(url)
lastHeight = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    newHeight = browser.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight

start = 'Por '
end = ' com'
html_source = browser.page_source
soup = BeautifulSoup(html_source, 'html.parser')  # pass an explicit parser
cl = soup.find_all('div', attrs={'class': 'cl'})
names = [None] * len(cl)
for i in range(len(cl)):
    try:
        names[i] = re.search('%s(.*)%s' % (start, end), cl[i].text).group(1)
    except AttributeError:  # re.search returned None: no match in this div
        continue
photosof = list(set(names))
Unfortunately, Selenium's performance is highly time-dependent and degrades very fast. The only solution I found was to close and reopen the driver.
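To make that workaround less painful, the restart can be wrapped in a small batching loop. This is only a sketch: `make_driver` and `scrape_one` are hypothetical stand-ins for the browser setup and the scroll-and-parse task shown in the question.

```python
def chunked(urls, size):
    """Split the URL list into batches of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def scrape_all(urls, make_driver, scrape_one, batch_size=3):
    """Use a fresh browser for every batch so memory is released regularly.

    `make_driver` is a zero-argument factory (e.g. the webdriver.Chrome(...)
    setup above) and `scrape_one(driver, url)` is the per-URL task.
    """
    results = []
    for batch in chunked(urls, batch_size):
        driver = make_driver()  # fresh browser per batch
        try:
            for url in batch:
                results.append(scrape_one(driver, url))
        finally:
            driver.quit()  # always release the browser before the next batch
    return results
```

Restarting every 3-4 URLs keeps each session short enough that the slowdown never accumulates.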
I'm trying to scrape some information from a website. The loop works the first time, but on the second iteration an exception occurs. I've tried several approaches to solve it with an implicit wait or WebDriverWait, but the exception keeps appearing. Would you give me a hand?
Here is the code:
website = 'https://www.elempleo.com/co/ofertas-empleo/55-6-millones?'
driver = webdriver.Chrome(path)
driver.get(website)
empleos = driver.find_elements_by_tag_name('div.result-list.js-result-list.js-results-container')
data = []
i = 0
data = empleos[0].text.splitlines()
while i < 4:
    data.append(empleos[0].text.splitlines())
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    siguiente = driver.find_elements_by_tag_name('a.js-btn-next')
    siguiente[0].click()
    i += 1
I tried using:
siguiente = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.TAG_NAME, "a.js-btn-next")))
and
driver.implicitly_wait(10) with different timeouts, but neither worked.
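The usual cause of an error like this is a StaleElementReferenceException: the element was located before the page re-rendered, so the reference is dead by the time click() runs. Below is a hedged sketch of the re-locate-and-retry pattern, not site-specific code: `StaleElementError` is a stand-in for selenium's StaleElementReferenceException, and `locate` is any zero-argument callable that runs the find again.

```python
import time

class StaleElementError(Exception):
    """Stand-in for selenium.common.exceptions.StaleElementReferenceException."""

def click_with_retry(locate, attempts=3, delay=0.5):
    """Re-run the locator on every attempt so a stale reference is refreshed.

    `locate` would be something like:
        lambda: driver.find_elements_by_tag_name('a.js-btn-next')[0]
    """
    for _ in range(attempts):
        try:
            locate().click()
            return True
        except StaleElementError:
            time.sleep(delay)  # give the page time to finish re-rendering
    return False
```

Applied to the loop above, you would re-find both empleos and siguiente inside every iteration instead of reusing the lists captured before the first click.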
There is a web page that I want to run my scraping script on. However, because the page loads additional content as you scroll down, I need to add a function to my script that scrolls the page all the way to the bottom before the scraping runs.
To achieve this, please find my entire script below, which seems to stop at a scroll height of 5287.
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv
import pandas as pd

# Initialize a Chrome browser
driver = webdriver.Chrome("C:.............chromedriver_win32/chromedriver.exe")
# Go to the page we want to scrape
driver.get('https://icodrops.com/category/ended-ico/')
# Open csv file to write in
csv_file = open('icodrops_ended_icos.csv', 'w')
writer = csv.writer(csv_file)
writer.writerow(['Project_Name', 'Interest', 'Category', 'Received', 'Goal', 'End_Date', 'Ticker'])
page_url = 'https://icodrops.com/category/ended-ico/'
# Although only one page to scrape - need to scroll to the bottom to pull all data
lastHeight = driver.execute_script("return document.documentElement.scrollHeight")
print('lastHeight', lastHeight)
while True:
    driver.execute_script(f"window.scrollTo(0, {lastHeight});")
    time.sleep(15)
    # height = driver.execute_script("return document.documentElement.scrollHeight")
    newHeight = driver.execute_script("return document.documentElement.scrollHeight")
    print('newHeight', newHeight)
    if newHeight == lastHeight:
        break
    lastHeight = newHeight

try:
    # print the url that we are scraping
    print('Scraping this url:' + page_url)
    # Extract a list object where each element of the list is a row in the table
    rows = driver.find_elements_by_xpath('//div[@class="col-md-12 col-12 a_ico"]')
    # Extract detail in columns from each row
    for row in rows:
        # Initialize a dictionary for each row
        row_dict = {}
        # Use relative xpaths to locate desired data
        project_name = row.find_element_by_xpath('.//div[@class="ico-row"]/div[2]/h3/a').text
        interest = row.find_element_by_xpath('.//div[@class="interest"]').text
        category = row.find_element_by_xpath('.//div[@class="categ_type"]').text
        received = row.find_element_by_xpath('.//div[@id="new_column_categ_invisted"]/span').text
        goal = row.find_element_by_xpath('.//div[@id="categ_desctop"]').text
        end_date = row.find_element_by_xpath('.//div[@class="date"]').text
        ticker = row.find_element_by_xpath('.//div[@id="t_tikcer"]').text
        # Add extracted data to the dictionary
        row_dict['project_name'] = project_name
        row_dict['interest'] = interest
        row_dict['category'] = category
        row_dict['received'] = received
        row_dict['goal'] = goal
        row_dict['end_date'] = end_date
        row_dict['ticker'] = ticker
        writer.writerow(row_dict.values())
except Exception as e:
    print(e)
    csv_file.close()
    driver.close()
Without being able to scroll to the bottom of the page, my script only scrapes data from the initial view, which constitutes only about 10% of what is available.
I always use the piece of code below to scroll to the bottom, and I have never seen it fail.
driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
So your effective code will be:
while True:
    driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
    height = driver.execute_script("return document.documentElement.scrollHeight")
    newHeight = driver.execute_script("window.scrollTo(0, " + str(height) + ");")
    time.sleep(15)
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
If you use print() to see the values in these variables, you will see that scrollTo returns None, so you can't use it to get newHeight.
Minimal working code.
I tested it on the page http://quotes.toscrape.com/scroll, which was created for learning scraping.
from selenium import webdriver
import time

url = 'http://quotes.toscrape.com/scroll'
driver = webdriver.Firefox()
driver.get(url)
lastHeight = driver.execute_script("return document.documentElement.scrollHeight")
print('lastHeight', lastHeight)
while True:
    driver.execute_script(f"window.scrollTo(0, {lastHeight});")
    time.sleep(1)
    newHeight = driver.execute_script("return document.documentElement.scrollHeight")
    print('newHeight', newHeight)
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
BTW:
I found a Stack Overflow answer from 2015 which uses the same method, but with document.body instead of document.documentElement:
How can I scroll a web page using selenium webdriver in python?
So if this code works for you, this question could be closed as a duplicate.
I'm creating a web crawler for Zillow in order to practice using Selenium. All I'm trying to do is get the price, address, and link for each home, but when I use find_elements_by_class_name() or find_elements_by_css_selector(), it only finds the first 9 elements, even though there are many more.
Normally my Selenium code works fine. Does anyone know why this occurs?
from selenium import webdriver
import time
zillow_url = "https://www.zillow.com/manhattan-new-york-ny/houses/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22Manhattan%2C%20New%20York%2C%20NY%22%2C%22mapBounds%22%3A%7B%22west%22%3A-74.21047920019531%2C%22east%22%3A-73.73669379980468%2C%22south%22%3A40.626191262639644%2C%22north%22%3A40.933477919520115%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A12530%2C%22regionType%22%3A17%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22beds%22%3A%7B%22min%22%3A0%2C%22max%22%3A0%7D%2C%22price%22%3A%7B%22max%22%3A400000%7D%2C%22mp%22%3A%7B%22max%22%3A1300%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D"
address = "My chrome driver address"
driver = webdriver.Chrome(executable_path=address)
driver.get(zillow_url)
time.sleep(2)
prices = driver.find_elements_by_class_name("list-card-price")
addresses = driver.find_elements_by_class_name("list-card-addr")
links = driver.find_elements_by_class_name("list-card-link")
Try this.
from selenium import webdriver
import time
zillow_url = "https://www.zillow.com/manhattan-new-york-ny/houses/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22Manhattan%2C%20New%20York%2C%20NY%22%2C%22mapBounds%22%3A%7B%22west%22%3A-74.21047920019531%2C%22east%22%3A-73.73669379980468%2C%22south%22%3A40.626191262639644%2C%22north%22%3A40.933477919520115%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A12530%2C%22regionType%22%3A17%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22beds%22%3A%7B%22min%22%3A0%2C%22max%22%3A0%7D%2C%22price%22%3A%7B%22max%22%3A400000%7D%2C%22mp%22%3A%7B%22max%22%3A1300%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%7D"
address = "My chrome driver address"
driver = webdriver.Chrome(executable_path=address)
driver.get(zillow_url)
prices = []
addresses = []
links = []
time.sleep(2)
SCROLL_PAUSE_TIME = 0.5
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while (condition):
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    prices = driver.find_elements_by_class_name("list-card-price")
    addresses = driver.find_elements_by_class_name("list-card-addr")
    links = driver.find_elements_by_class_name("list-card-link")
    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
Just set the condition to len(prices) <= the number of houses you want to scrape.
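That stopping rule can also be factored into a small helper. This is only a sketch, not Zillow-specific code: `fetch` and `grow` are hypothetical callables standing in for the find_elements_by_class_name call and the scrollTo script above.

```python
def collect_until(fetch, grow, target, max_rounds=50):
    """Keep calling `grow()` (scroll) until `fetch()` yields at least
    `target` items, or until growing stops adding anything new."""
    items = fetch()
    for _ in range(max_rounds):
        if len(items) >= target:
            break
        grow()
        new_items = fetch()
        if len(new_items) == len(items):
            break  # nothing new loaded; we hit the real end of the list
        items = new_items
    return items
```

With Selenium, fetch would be something like lambda: driver.find_elements_by_class_name("list-card-price"), and grow would execute the scroll script followed by time.sleep(SCROLL_PAUSE_TIME).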
So I came from the question here.
Now I am able to interact with the page: scroll down, close the popup that appears, and click at the bottom to expand the page.
The problem is that when I count the items, the code only returns 20 when it should be 40.
I have checked the code again and again; I'm missing something, but I don't know what.
See my code below:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time
import datetime

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
# options.add_argument('--headless')
driver = webdriver.Chrome(executable_path=r"C:\\chromedriver.exe", options=options)
url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'
driver.get(url)
iter = 1
while True:
    scrollHeight = driver.execute_script("return document.documentElement.scrollHeight")
    Height = 10 * iter
    driver.execute_script("window.scrollTo(0, " + str(Height) + ");")
    if Height > scrollHeight:
        print('End of page')
        break
    iter += 1
time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()
time.sleep(3)
ver_mas = driver.find_elements_by_class_name('button-load-more')
for x in range(len(ver_mas)):
    if ver_mas[x].is_displayed():
        driver.execute_script("arguments[0].click();", ver_mas[x])
        time.sleep(10)
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
# print(soup)
items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
What is wrong? I'm a newbie in the scraping world.
Regards
Your while and for statements don't work as intended:
Using while True: is bad practice.
You scroll to the bottom, but the button-load-more button isn't displayed there, so Selenium will not find it as displayed.
find_elements_by_class_name looks for multiple elements, but the page has only one element with that class.
if ver_mas[x].is_displayed(): if you are lucky, this will be executed only once, because the range is 1.
Below you can find the solution: the code looks for the button, moves to it instead of scrolling, and performs a click. If it fails to find the button, meaning all the items were loaded, it breaks the while loop and moves forward.
from selenium.webdriver.common.action_chains import ActionChains
from selenium.common.exceptions import NoSuchElementException

url = 'https://www.coolmod.com/componentes-pc-procesadores?f=375::No'
driver.get(url)
time.sleep(3)
popup = driver.find_element_by_class_name('confirm').click()
iter = 1
while iter > 0:
    time.sleep(3)
    try:
        ver_mas = driver.find_element_by_class_name('button-load-more')
        actions = ActionChains(driver)
        actions.move_to_element(ver_mas).perform()
        driver.execute_script("arguments[0].click();", ver_mas)
    except NoSuchElementException:
        break
    iter += 1
page_source = driver.page_source
soup = BeautifulSoup(page_source, 'lxml')
# print(soup)
items = soup.find_all('div', class_='col-xs-12 col-sm-6 col-sm-6 col-md-6 col-lg-3 col-product col-custom-width')
print(len(items))
I have this script to download images from Instagram. The only issue I am having is that as Selenium scrolls down to the bottom of the page, BeautifulSoup keeps grabbing the same img src links on each pass of the loop.
Although it continues to scroll down and download pictures, after everything is done I end up with 2 or 3 duplicates. So my question is: is there a way to prevent this duplication?
import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import time

url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)
scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")
    soup = BeautifulSoup(driver.page_source, 'lxml')
    imgs = soup.find_all('img', class_='_2di5p')
    for img in imgs:
        img_url = img["src"]
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1
    if new_height == last_height:
        break
    last_height = new_height
Update:
So I placed this part of the code outside of the while True loop and let Selenium load the whole page first, hoping bs4 would then scrape all the images. It only gets to number 30 and then stops.
soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    # tn = datetime.now().strftime('%H:%M:%S')
    img_url = img["src"]
    print('=> [+] img_{}'.format(counter))
    screens(counter)
    counter = counter + 1
The reason it only loads 30 in the second version of your script is that the remaining elements are removed from the page DOM and are no longer part of the source BeautifulSoup sees. The solution is to keep doing what you did the first time, but remove any duplicate elements before you iterate through the list and call screens(). You can do this using sets as shown below, though I'm not sure it's the absolute most efficient way:
import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()
url = ('https://www.instagram.com/cats/?hl=en')
driver.get(url)
scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

old_imgs = set()
while True:
    imgs = driver.find_elements_by_class_name('_2di5p')
    imgs_dedupe = set(imgs) - set(old_imgs)
    for img in imgs_dedupe:
        img_url = img.get_attribute("src")
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")
    old_imgs = imgs
    if new_height == last_height:
        break
    last_height = new_height
driver.quit()
As you can see, I used a different page to test it, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
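One caveat with set(imgs) - set(old_imgs) is that set difference does not preserve order, so images can be downloaded out of feed sequence. A hedged, order-preserving alternative (plain Python, no Selenium needed to demonstrate it):

```python
def new_items(current, seen):
    """Return the elements of `current` not already in `seen`,
    keeping the original page order."""
    seen_set = set(seen)
    return [x for x in current if x not in seen_set]
```

In the loop above, imgs_dedupe = new_items(imgs, old_imgs) would keep the download order matching the order the posts appear on the page.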
I would use the os library to check whether the file already exists before downloading it:
import os

def screens(get_name):
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    if os.path.isfile(path):  # True for an existing file; False for a directory
        return
    # or use os.path.exists(path) to check for a file or a directory
    with open(path, 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)
Note that the existence check must come before open(), since opening with 'wb' creates (or truncates) the file.