Reposted: the earlier suggestions did not produce the desired behavior.
I have been looking for a solution to my problem for several days.
I have a Python script which iteratively scrapes a dynamic website with Selenium, using geckodriver.
For about two hours it manages to collect all of the data I ask it to recover, then it begins to slow down and eventually crashes.
The crash is caused by Firefox's RAM usage: the longer the script scrapes, the more memory Firefox occupies.
I scoured the net and found various solutions, but none of them worked.
If you can help me find a solution that lets it scrape for at least 24 hours, that would be great.
A bit of code
import os
import time
import datetime

import joblib
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

binary = FirefoxBinary('/opt/firefox/firefox')
start_time = time.time()
options = Options()
options.add_argument("--headless")
firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
driver = webdriver.Firefox(executable_path="/usr/bin/geckodriver", options=options,
                           firefox_binary=binary, firefox_profile=firefox_profile)
driver.get("https:********************************")
time.sleep(5)
print("WebSite OPENED ready to connect")

def auth(t_end):
    print("entering in auth function")
    login_button = driver.find_element_by_xpath("******************").click()
    time.sleep(5)
    username = driver.find_element_by_xpath("******************")
    username.clear()
    username.send_keys("*****")
    password = driver.find_element_by_xpath("*****************")
    password.clear()
    password.send_keys("*****")
    driver.find_element_by_xpath("**************").click()
    time.sleep(5)
    print("Connected ready for Scraping")

    def scraping(i, t_end):
        print("entering in scraping function")
        os.system("free -h && sysctl vm.drop_caches=3 && free -h")
        maps = driver.find_elements_by_class_name("************")
        t_end = t_end * 3600
        t_end = time.time() + t_end
        l_a = []
        dit_l_a = {}
        time.sleep(5)
        while time.time() < t_end:
            dict_tempo = {}
            total_a = driver.find_element_by_xpath("***************").text
            if total_a == '0.00':
                time.sleep(10.5)
            final_a = driver.find_element_by_xpath("**********").text
            l_a.append(final_a)
            history_a1 = driver.find_element_by_xpath('******').text
            scrape = driver.find_element_by_xpath('*******').text
            while scrape == history_a1:
                scrape = driver.find_element_by_xpath('****').text
            scrape = scrape.split('\n')
            dict_tempo["Final a"] = final_a
            dict_tempo["List Of All a"] = scrape
            now = datetime.datetime.now()
            now = now.strftime("%d/%m/%Y %H:%M:%S")
            dict_tempo["Date"] = now
            dit_l_a[scrape[0]] = dict_tempo
        return dit_l_a

    for i in range(168):
        print("Iteration : ", i)
        try:
            returned_dict = scraping(i, t_end)
            joblib.dump(returned_dict, './returned_dict_' + str(i))
        except Exception as e:
            print(e)
    return returned_dict

if __name__ == '__main__':
    returned_dict = auth(2)
Environment: VPS 4 GB RAM - CentOS 8 - Python 3.8 - Firefox 84 Beta - geckodriver 0.28.0 - headless scraping
I can see you are using a Firefox profile, so one thing you can do is configure your profile to use less memory by following the settings recommended by Mozilla, and then use that profile:
https://support.mozilla.org/en-US/kb/firefox-uses-too-much-memory-or-cpu-resources
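For example, here is a minimal sketch of memory-related preferences you could set on the profile before creating the driver (the specific values are assumptions to illustrate the idea, not tested against your site):

firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
# Keep fewer pages alive in session history (assumed values, tune as needed)
firefox_profile.set_preference("browser.sessionhistory.max_entries", 5)
firefox_profile.set_preference("browser.sessionhistory.max_total_viewers", 0)
# Prefer the disk cache over the in-memory cache
firefox_profile.set_preference("browser.cache.memory.enable", False)
firefox_profile.set_preference("browser.cache.disk.enable", True)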
Additionally:
In the script, try adding an implicit wait of 3 or 4 seconds. This makes every find-element command wait up to that long for an element to appear, which slows the scraping process down when elements are not immediately available.
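For example:

# Every find_element*/find_elements* call will now retry for up to
# 4 seconds before raising NoSuchElementException
driver.implicitly_wait(4)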
Also add a sleep at the top of the while and for loops:

while time.time() < t_end:
    time.sleep(5)

and also:

for i in range(168):
    time.sleep(5)
This gives Firefox a chance to keep files on the hard disk rather than in RAM. If you don't sleep, Selenium drives the browser faster than Firefox can comfortably handle, and Firefox uses more RAM to serve that fast demand, since read/write operations are faster from RAM than from the hard disk.
In addition, add driver.close(). This mostly won't drop the whole browser session; if it doesn't help, try opening the URL in a different tab and closing the previous tab every 100 iterations or so. This will also free up memory.
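A rough sketch of that tab-recycling idea (the counter i comes from your loop, and the URL is a placeholder, not a tested recipe):

if i > 0 and i % 100 == 0:
    old_handle = driver.current_window_handle
    # Open a fresh tab and load the page there
    driver.execute_script("window.open('about:blank', '_blank');")
    new_handle = [h for h in driver.window_handles if h != old_handle][-1]
    driver.switch_to.window(new_handle)
    driver.get("https://site-you-are-scraping.example")  # placeholder URL
    # Close the old tab so its page can be released from memory
    driver.switch_to.window(old_handle)
    driver.close()
    driver.switch_to.window(new_handle)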
You can also add a sleep before each action by using event listeners:
from selenium import webdriver
from selenium.webdriver.support.events import EventFiringWebDriver, AbstractEventListener
import time

class MyListener(AbstractEventListener):
    def before_navigate_to(self, url, driver):
        time.sleep(5)
        print("Before navigate to %s" % url)

    def after_navigate_to(self, url, driver):
        time.sleep(5)
        print("After navigate to %s" % url)

    def before_find(self, by, value, driver):
        # Called before every find_element*/find_elements* call
        time.sleep(5)
        print("Before finding %s=%s" % (by, value))

tempdriver = webdriver.Chrome()
driver = EventFiringWebDriver(tempdriver, MyListener())
driver.get("http://www.google.co.in/")
# Triggers before_find (and will raise NoSuchElementException,
# since that id does not exist on the page)
driver.find_element_by_id("hi")
Read: 7.37. Event Firing WebDriver Support at https://selenium-python.readthedocs.io/api.html
The above example is only for the driver; add the same for WebElement if you need it. But in your case, time.sleep() within the loop would be more than enough.
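If you do want per-element hooks, AbstractEventListener also exposes click hooks, for example (the 5-second pause is an arbitrary example value):

class SlowClickListener(AbstractEventListener):
    # Runs before every click made through the EventFiringWebDriver
    def before_click(self, element, driver):
        time.sleep(5)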
Related
I'm writing code that can help my dad get tee times for his golf course. At the moment, it scans through n tabs for a button to book the tee time, and if it can't find one, it refreshes the page. The problem comes when it refreshes: the page is protected by Cloudflare, so my code gets blocked ten times more often than it manages to actually check for the tee time. Is there a better way to run my code so it doesn't get blocked so often?
(At the moment I'm running it in a headless Selenium Chrome browser, but I'm looking to see if I can run the code in a normal browser instead.)
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import pyautogui as pag

date = "2021-09-19"
st = "08"
et = "09"
golfers = "4"
tabs_opened = 10

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)

for _ in range(tabs_opened):
    driver.execute_script(f"window.open('about:blank', 'tab{_+1}');")
    driver.switch_to.window(f"tab{_+1}")
    driver.get(f"https://city-of-burnaby-golf.book.teeitup.com/?course=5fc6afcfd62a025a3123401a&date={date}&end={et}&golfers={golfers}&holes=18&start={st}")  # loads the website

driver.switch_to.window(driver.window_handles[0])  # assuming the original blank tab is at index 0
driver.close()

tcgbbc = 0  # tracks number of times cloudflare blocks me
wt = 0  # tracks actual times the website has been checked for the button
i = 0
_ = 0
starttest = time.perf_counter()

while i < 100:
    driver.switch_to.window(driver.window_handles[_ % tabs_opened])  # switch to the next tab
    try:
        driver.find_element_by_xpath('/html/body/table/tbody/tr/td/div[2]/span/code')  # checks if the cloudflare ray id text is there
        print("whoops got blocked by cloudflare")
        tcgbbc += 1
    except:
        print("no cloudflare yay")  # it wasn't. yay
        elem = driver.find_elements_by_xpath('//*[@id="app-container"]/div/div[2]/div/div[2]/div[2]/div[2]/div[1]/div/button')  # checks golf burnaby website to see if the button is there
        if len(elem) > 0:
            elem[0].click()
            print("page found")
            break
        else:
            if len(driver.find_elements_by_xpath('//*[@id="header"]/div/div[1]')) > 0:  # the website's title is there (sometimes it can't find the button because the page wasn't even loaded)
                print("didn't work")
                wt += 1
                i += 1
    driver.refresh()  # refreshes the page. seems to be the most time-consuming part.
    _ += 1
    print(time.perf_counter() - starttest)
    time.sleep(1)

driver.close()
print(f"Cloudflare blocks: {tcgbbc}, actual checks: {wt}")
How do I use driver.get to open several URLs in Chrome?
My code:
import requests
import json
import pandas as pd
from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
chromedriver = r"C:\Users\Harrison Pollock\Downloads\Python\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(executable_path=r"C:\Users\Harrison Pollock\Downloads\Python\chromedriver_win32\chromedriver.exe", chrome_options=chromeOptions)

links = []
request1 = requests.get('https://api.beta.tab.com.au/v1/recommendation-service/featured-events?jurisdiction=NSW')
json1 = request1.json()
for n in json1['nextToGoRaces']:
    if n['meeting']['location'] in ['VIC','NSW','QLD','SA','WA','TAS','IRL']:
        links.append(n['_links']['self'])

driver.get('links')
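Note that the last line passes the literal string 'links' rather than the URLs you collected; if the goal is just to visit each URL in turn in one window, a plain loop would do (a sketch, separate from the parallel-browser answer below):

# Visit each collected URL in sequence in the same window
for link in links:
    driver.get(link)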
Based on the comments - you'll want a class to manage your browsers, a class for your tests, then a runner to run in parallel.
Try this:
import unittest
import time
import testtools
from selenium import webdriver

class BrowserManager:
    browsers = []

    def createBrowser(self, url):
        browser = webdriver.Chrome()
        browser.get(url)
        self.browsers.append(browser)

    def getBrowserByPartialURL(self, url):
        for browser in self.browsers:
            if url in browser.current_url:
                return browser

    def CloseItAllDown(self):
        for browser in self.browsers:
            browser.close()

class UnitTest1(unittest.TestCase):
    def test_DoStuffOnGoogle(self):
        browser = b.getBrowserByPartialURL("google")
        # Point of this is to watch the output! You'll see this + the other test intermingled (proves the parallel run)
        for i in range(10):
            print(browser.current_url)
            time.sleep(1)

    def test_DoStuffOnYahoo(self):
        browser = b.getBrowserByPartialURL("yahoo")
        # Point of this is to watch the output! You'll see this + the other test intermingled (proves the parallel run)
        for i in range(10):
            print(browser.current_url)
            time.sleep(1)

# create a global variable for the browsers
b = BrowserManager()

# To run the tests
if __name__ == "__main__":
    # move this to an init/setup to create your browsers
    b.createBrowser("https://www.google.com")
    b.createBrowser("https://www.yahoo.com")
    time.sleep(5)  # This is so you can see both open at the same time
    suite = unittest.TestLoader().loadTestsFromTestCase(UnitTest1)
    concurrent_suite = testtools.ConcurrentStreamTestSuite(lambda: ((case, None) for case in suite))
    concurrent_suite.run(testtools.StreamResult())
This code doesn't do anything exciting - it's an example of how to manage multiple browsers and run tests in parallel. It goes to the specified urls (which you should move to an init/setup), then prints out the URL it's on 10 times.
This is how you add a browser to the manager: b.createBrowser("https://www.google.com")
This is how you retrieve your browser: browser = b.getBrowserByPartialURL("google") - note it's a partial URL so you can use the domain as a keyword.
This is the output (just the first few lines, not all of it) - it prints the URL for Google, then Yahoo, then Google, then Yahoo, showing that they're running at the same time:
PS C:\Git\PythonSelenium\BrowserManager> cd 'c:\Git\PythonSelenium'; & 'C:\Python38\python.exe' 'c:\Users\User\.vscode\extensions\ms-python.python-2020.7.96456\pythonFiles\lib\python\debugpy\launcher' '62426' '--' 'c:\Git\PythonSelenium\BrowserManager\BrowserManager.py'
DevTools listening on ws://127.0.0.1:62436/devtools/browser/7260dee3-368c-4f21-bd59-2932f3122b2e
DevTools listening on ws://127.0.0.1:62463/devtools/browser/9a7ce919-23bd-4fee-b302-8d7481c4afcd
https://www.google.com/
https://consent.yahoo.com/collectConsent?sessionId=3_cc-session_d548b656-8315-4eef-bb1d-82fd4c6469f8&lang=en-GB&inline=false
https://www.google.com/
https://consent.yahoo.com/collectConsent?sessionId=3_cc-session_d548b656-8315-4eef-bb1d-82fd4c6469f8&lang=en-GB&inline=false
https://www.google.com/
Running into something interesting when trying to set up a Selenium webdriver to scrape fantasy football stats from ESPN. When I execute the following cells in a Jupyter notebook, I can reach the page I'm looking for (the draft recap page of my fantasy league) and successfully log in to my account, accessing the page:
# cell 1
driver = webdriver.Firefox()
driver.get(url)

# cell 2
i = 0
iter_again = True
iframes = driver.find_elements_by_tag_name('iframe')
while i < len(iframes) and iter_again:
    driver.switch_to_frame(iframes[i])
    if len(driver.find_elements_by_class_name("input-wrapper")) > 0:
        username, password = driver.find_elements_by_class_name("input-wrapper")
        iter_again = False
    else:
        sleep(1)
        driver.switch_to_default_content()
        i += 1

# cell 3
username.find_elements_by_tag_name('input')[0].send_keys(espn_username)
password.find_elements_by_tag_name('input')[0].send_keys(espn_password)

# cell 4
driver.find_elements_by_tag_name('button')[0].click()

# cell 5
driver.refresh()
The strange thing, though, is that when I put all of this in a function and return the webdriver object, ESPN won't let me log in. I get an error message saying that ESPN is experiencing technical difficulties at this time and that I may not be able to log in (they're right, I can't).
I initially thought this could be some sort of rate-limiting thing, but I can't think of anything different in the HTTP requests between the functional form and the cell-by-cell approach. For what it's worth, I've tested the functional approach both in a Jupyter notebook environment and as a standalone script run from the CLI. Any thoughts? All help/feedback is greatly appreciated!
EDIT - Adding the script that doesn't execute properly
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep

def get_active_webdriver(url, espn_username, espn_password, headless=False):
    driver = webdriver.Firefox()
    driver.get(url)
    i = 0
    iter_again = True
    # find the iframe with the login form and log in
    iframes = driver.find_elements_by_tag_name('iframe')
    while i < len(iframes) and iter_again:
        driver.switch_to_frame(iframes[i])
        if len(driver.find_elements_by_class_name("input-wrapper")) > 0:
            username, password = driver.find_elements_by_class_name("input-wrapper")
            iter_again = False
        else:
            sleep(1)
            driver.switch_to_default_content()
            i += 1
    username.find_elements_by_tag_name('input')[0].send_keys(espn_username)
    password.find_elements_by_tag_name('input')[0].send_keys(espn_password)
    driver.find_elements_by_tag_name('button')[0].click()
    driver.refresh()
    return driver

if __name__ == "__main__":
    url = # url here
    espn_username = # username
    espn_password = # password
    driver = get_active_webdriver(url, espn_username, espn_password)
I am trying to write a Python script which automatically downloads a table from a webpage. The table is not fully loaded when I simply go to the specified URL; I have to click a "Load more" link. I tried to do that with the script below.
import time
import numpy as np
from selenium import webdriver

delay = 2
driver = webdriver.Chrome('chromedriver')
driver.get("url")
time.sleep(delay + np.random.rand())

click_except = 0
while click_except == 0:
    try:
        driver.find_element_by_id("id").click()
        time.sleep(delay + np.random.rand())
    except:
        click_except = 1
        time.sleep(delay + np.random.rand())

web = driver.find_element_by_id("id_table")
str = web.text
It worked before, but now the same code does not work! I moved to a different country and I am using different Wi-Fi; can this have any effect? The line with the click command still works when run separately and manually, but it does not work inside the while/try loop. Any idea what is wrong, or how to program it better? The delay should give the webpage enough time to load.
I recommend that you avoid waiting for a fixed time period; it is better to wait for a specific element, and Selenium supports that, see: https://selenium-python.readthedocs.io/waits.html#explicit-waits
You can do something like:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('chromedriver')

def wait_for_id(identifier):
    """
    Waits for the web element with the given identifier.
    :return: found selenium web element
    """
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, identifier))
    )
    return element

driver.get('url')
wait_for_id('id').click()
str = wait_for_id('id_table').text
I'm using webbrowser so I can open an HTML page for a performance test I'm currently doing.
This small piece of code is the beginning of the automation. The goal of the function perf_measure is to return how long it took to load the page at url entirely.
import webbrowser

def perf_measure(url=""):
    try:
        webbrowser.open(url)
    except webbrowser.Error as e:
        print("It couldn't open the url:", url)

url = "www.google.com"
perf_measure(url)
How can I accomplish that? I just need the value, in seconds, like:
www.google.com Total time to load page in (secs): 2.641
Do you need to use the web browser? As in, do you need to view the result?
Otherwise you could do this:
import urllib2
from time import time

start_time = time()  # start timing before the request so connection setup is included
stream = urllib2.urlopen('http://www.rarlab.com/rar/winrar-x64-420.exe')
output = stream.read()
end_time = time()
stream.close()
print(end_time - start_time)
If you want a more human-readable result you can use round.
print(round(end_time-start_time, 3))
Output
0.865000009537 # Without Round
0.865 # With Round
A fancy way, using a decorator:

import time

def time_it(func):
    def wrapper(*arg, **kw):
        t1 = time.time()
        res = func(*arg, **kw)
        t2 = time.time()
        return (t2 - t1), res, func.__name__
    return wrapper

@time_it
def perf_measure(url=""):
    # whatever you want
    pass
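Calling the decorated function then returns the timing alongside the result and the function name, for example (a sketch):

# Returns (elapsed_seconds, original_return_value, function_name)
elapsed, result, name = perf_measure("www.google.com")
print("%s took %.3f secs" % (name, elapsed))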
If you want to time the page load (including all of the resources it loads, rendering time etc.) in a real browser you can use Selenium Webdriver. This will open your browser of choice, load the URL and then extract timings:
from selenium import webdriver
def time_url(driver, url):
driver.get(url)
# Use the browser Navigation Timing API to get some numbers:
# https://developer.mozilla.org/en-US/docs/Web/API/Navigation_timing_API
navigation_start = driver.execute_script(
"return window.performance.timing.navigationStart")
dom_complete = driver.execute_script(
"return window.performance.timing.domComplete")
total_time = dom_complete - navigation_start
print(f"Time {total_time}ms")
driver = webdriver.Chrome()
try:
url = "https://httpbin.org/delay/"
time_url(driver, url + '1')
time_url(driver, url + '2')
finally:
driver.close()
There are many other metrics you can load if you want to know the render-time separately from the loading time etc.
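For example, a sketch that splits network time from processing time using the same PerformanceTiming object (the choice of split points here is an assumption; adjust it to what you want to measure):

def time_url_detailed(driver, url):
    driver.get(url)
    # The PerformanceTiming object comes back as a dict via execute_script
    timing = driver.execute_script("return window.performance.timing")
    network_ms = timing["responseEnd"] - timing["navigationStart"]
    processing_ms = timing["domComplete"] - timing["responseEnd"]
    print(f"Network {network_ms}ms, processing {processing_ms}ms")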