Leaking memory when using Selenium webdriver in a for loop - python

I'm running a Python script which uses Selenium, on an EC2 instance (ubuntu).
Under my AWS plan, I have 2 GB of memory. I upgraded to this from the free tier after some performance issues with my script. However, when I check memory while connected to the Ubuntu server, I see only 348 MB of available memory and 353 MB of free memory!
As of now, I'm only running two Python scripts, once a day, using crontab. The scripts run through a fairly long array of URLs and grab information from each of them.
base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='
events = []
eventContainerBucket = []

for i in range(1, 25):
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')
    # collect href attribute of events in event_list
    events.extend(list(event.get_attribute("href") for event in event_list))

allEvents = []
for event in events:
    driver.get(event)
    # do a bunch of other stuff

driver.quit()
Is there an inherent problem in this code that would be causing a memory leak? I would think free memory should go back up once the script has stopped running, but it doesn't.
I tried invoking driver.close() within the for-loop, so that the window closes after the information is retrieved from each URL. I was thinking this would help with the memory leak - unfortunately, this gave me the error selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id.
Am I on the right path with driver.close() or is something else entirely the problem?

Instead of driver.quit(), have you tried driver.close()?
close() - closes the current browser window.
quit() - shuts down the whole WebDriver instance, closing every window and ending the session.
for i in range(1, 25):
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')
    # collect href attribute of events in event_list
    events.extend(list(event.get_attribute("href") for event in event_list))

allEvents = []
for event in events:
    driver.get(event)

driver.close()
I have checked your code and it works fine without any issue.
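For reference, a minimal sketch (not from the answer above) illustrating the difference between the two calls; the URL and the second window are placeholders:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://example.com")
driver.execute_script("window.open('https://example.com');")  # open a second window

print(len(driver.window_handles))  # 2
driver.close()                     # closes only the current window; the session survives
print(len(driver.window_handles))  # 1

driver.quit()                      # ends the session and closes every remaining window
# any further driver call would now raise InvalidSessionIdException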

Related

Python undetectable_webdriver won't open in a loop

I am trying to open a site multiple times in a loop to test whether different credentials have expired so that I can notify our users. I'm achieving this by opening the database, getting the records, calling the chrome driver to open the site, and inputting the values into the site. The first loop works, but when the next one starts, the driver hangs and eventually outputs the error:
"unknown error: cannot connect to chrome at 127.0.0.1:XXXX from chrome not reachable"
This error commonly occurs when there is already an instance running. I have tried to prevent this by using both driver.close() and driver.quit() when the first loop is done but to no avail. I have taken care of all other possibilities of detection such as using proxies, different user agents, and also using the undetected_chromedriver by https://github.com/ultrafunkamsterdam/undetected-chromedriver.
The core issue I am looking to solve is being able to open an instance of the chrome driver, close it and open it back again all in the same execution loop until all the credentials I am testing have finished. I have abstracted the code and provided an isolated version that replicates the issue:
# INSTALL CHROMEDRIVER USING "pip install undetected-chromedriver"
import undetected_chromedriver.v2 as uc
# Python Libraries
import time

options = uc.ChromeOptions()
options.add_argument('--no-first-run')
driver = uc.Chrome(options=options)

length = 8
count = 0
if count < length:
    print("Im outside the loop")

while count < length:
    print("This is loop ", count)
    time.sleep(2)
    with driver:
        print("Im inside the loop")
        count += 1
        driver.get("https://google.com")
        time.sleep(5)
    print("Im at the end of the loop")

driver.quit()  # Used to exit the browser, and end the session
# driver.close()  # Only closes the window in focus
I recommend using a python virtualenv to keep packages consistent. I am using python3.9 on a Linux machine. Any solutions, suggestions, or workarounds would be greatly appreciated.
You are quitting your driver inside the loop and then trying to reuse it, so its executor address no longer exists, hence your error. You need to reinitialize the driver on each iteration by moving its creation down into the while loop, before the with statement.
from multiprocessing import Process, freeze_support
import undetected_chromedriver as uc
# Python Libraries
import time

chroptions = uc.ChromeOptions()
chroptions.add_argument('--no-first-run enable_console_log = True')
# driver = uc.Chrome(options=chroptions)

length = 8
count = 0
if count < length:
    print("Im outside the loop")

while count < length:
    print("This is loop ", count)
    driver = uc.Chrome(options=chroptions)
    time.sleep(2)
    with driver:
        print("Im inside the loop")
        count += 1
        driver.get("https://google.com")
        print("Session ID: ", end='')  # added to show your session ID is changing
        print(driver.session_id)
    driver.quit()

How does OpenWPM's run_custom_function affect the scope?

I'm trying to scrape some sites with OpenWPM and wrote a custom function get_pp_links() that appends something to a global list. But when I use it with OpenWPM's run_custom_function(), the things I append to the list disappear.
My script looks like this:
from csv import reader

from automation import CommandSequence, TaskManager

NUM_BROWSERS = 2
alist = [1]

def get_pp_links(**kwargs):
    alist.append(1)

def main():
    # The list of sites that we wish to crawl
    with open("top-100.csv", "r") as file:
        csv_reader = reader(file)
        sites = list(csv_reader)

    # Loads the default manager preferences
    manager_params, browser_params = TaskManager.load_default_params(NUM_BROWSERS)

    # Update browser configuration
    for i in range(NUM_BROWSERS):
        browser_params[i]['headless'] = True

    # Instantiates the measurement platform
    manager = TaskManager.TaskManager(manager_params, browser_params)

    for site in sites:
        command_sequence = CommandSequence.CommandSequence("https://" + site[1], reset=True)
        command_sequence.get(sleep=0, timeout=60)
        command_sequence.run_custom_function(get_pp_links, ())
        manager.execute_command_sequence(command_sequence)

    # Shuts down the browsers and waits for the data to finish logging
    manager.close()
    print(alist)

if __name__ == "__main__":
    main()
And top-100.csv contains a domain list:
92,twitch.tv
93,forbes.com
94,bbc.com
I'm expecting the list to grow with every scanned site, so the result would look like [1, 1, 1, 1], but instead the printed list is only [1].
I think this is somehow connected to run_custom_function(), because when I call get_pp_links() directly, the problem does not appear.
This is because OpenWPM creates a new process for each browser it spawns.
Since each process is isolated from the parent process and from the other processes, the following happens:
alist gets created in the main process.
alist gets copied into the browser process.
alist gets changed by get_pp_links in the browser process.
The changes stay in the browser process, and you can't observe them in the parent process.
You might be able to get around this by using a multiprocessing.SyncManager and syncing the list between the processes.
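As a minimal sketch of that manager idea outside OpenWPM (the worker and list below are illustrative only): a list proxy created by multiprocessing.Manager is shared with child processes, so appends made in a child are visible back in the parent.
from multiprocessing import Manager, Process

def get_pp_links(shared_list):
    shared_list.append(1)  # runs in the child process, but mutates the shared proxy

if __name__ == "__main__":
    with Manager() as manager:
        alist = manager.list([1])  # proxy object shared across processes
        workers = [Process(target=get_pp_links, args=(alist,)) for _ in range(3)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(list(alist))  # [1, 1, 1, 1] -- the child appends are visible here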

How to use a webscraper running on an EC2 instance with lambda functions?

I have built a web scraper using Python and Selenium with geckodriver; it is currently running on an EC2 instance on a crontab schedule.
My issue is that it takes more than 5 minutes to finish downloading, and I want to use Lambda functions to run my scraper, but they only allow 5 minutes of runtime.
So I have code similar to this.
from selenium import webdriver

def start_browser(url):
    browser = webdriver.Firefox(executable_path="./geckodriver")
    browser.get(url)
    return browser

def log_in(user, password, user_elem, pass_elem, login_elem, browser):
    # "pass" is a reserved word in Python, so the parameter is named "password"
    user_elem.click()
    user_elem.send_keys(user)
    pass_elem.click()
    pass_elem.send_keys(password)
    login_elem.click()
    return browser

def nav_to_data(browser, data_elem):
    data_elem.click()
    return browser

def find_data(browser, data_table):
    data_links = data_table.find_elements_by_tag_name("tr")
    return data_links, browser
I'm thinking these functions could be run as Lambda functions, passing the browser/webdriver instance from one to the other?
The part I'm struggling with is looping through the data and waiting for all downloads to finish, as this would take longer than 5 minutes (one way to wait for downloads is sketched after the code below).
Is there any way around this?
import time

def download_data(browser, link):
    link.click()
    time.sleep(2)
    download_elem = browser.find_element_by_id("download_xls_file")
    download_path = download_elem.click()
    return download_path

# THIS TAKES LONGER THAN 5 mins
download_paths = []
for link in data_links:
    # clicks a link to a new page with a download button and returns the path to the .xls file
    download = download_data(browser, link)
    download_paths.append(download)

upload_data()
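On the waiting-for-downloads part, a hedged helper sketch: Firefox writes in-progress downloads as .part files, so one way to block until everything has landed is to poll the download directory until no .part files remain. The download_dir argument and the timeout are assumptions, not taken from the question.
import os
import time

def wait_for_downloads(download_dir, timeout=300):
    # Poll until Firefox has no in-progress (.part) files left, or give up after timeout seconds.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not any(name.endswith(".part") for name in os.listdir(download_dir)):
            return True
        time.sleep(1)
    return False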
You can partition your data and use a recursive Lambda to process chunks of your list.
Taking an example from my blog:
import json
import boto3

def invoke_self_async(data_list, context):
    this_data_list = data_list[0:20]  # increase number as needed
    new_event = {
        'data': data_list[20:]  # needs to match above number
    }
    boto3.client('lambda').invoke_async(
        FunctionName=context.invoked_function_arn,
        InvokeArgs=json.dumps(new_event)
    )

    my_data = []
    for data in this_data_list:  # process only this invocation's chunk
        download = download_data(browser, data)  # returns path to .xls file
        my_data.append(download)
    return my_data
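For completeness, a hedged sketch of how the function above might be wired up as the Lambda entry point; lambda_handler and all_data_links are hypothetical names, not part of the original answer, and the 'data' key mirrors the event built above.
def lambda_handler(event, context):
    # First invocation: seed with the full list of links (gathered elsewhere, hypothetical name);
    # recursive invocations receive the remainder under the 'data' key.
    data_list = event.get('data', all_data_links)
    return invoke_self_async(data_list, context)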

How to reuse a selenium driver instance during parallel processing?

To scrape a pool of URLs, I am parallel-processing Selenium with joblib. In this context, I am facing two challenges:
Challenge 1 is to speed up this process. At the moment, my code opens and closes a driver instance for every URL (ideally it would be one per process).
Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong).
Pseudocode:
URL_list = [URL1, URL2, URL3, ..., URL100000]                 # List of URLs to be scraped

def scrape(URL):
    while True:                                               # Loop needed to use continue
        try:                                                  # Try scraping
            driver = webdriver.Firefox(executable_path=path)  # Set up driver
            website = driver.get(URL)                         # Get URL
            results = do_something(website)                   # Get results from URL content
            driver.close()                                    # Close worker
            if len(results) == 0:                             # If do_something() failed:
                continue                                      # THEN the worker skips this URL
            else:                                             # If do_something() worked:
                safe_results("results.csv")                   # THEN save results
                break                                         # Go to next worker/URL
        except Exception as e:                                # If something weird happens:
            save_exception(URL, e)                            # THEN save error message
            break                                             # Go to next worker/URL

Parallel(n_jobs=40)(delayed(scrape)(URL) for URL in URL_list)  # Run in 40 processes
My understanding is that in order to re-use a driver instance across iterations, the # Set up driver-line needs to be placed outside scrape(URL). However, everything outside scrape(URL) will not find its way to joblib's Parallel(n_jobs = 40). This would imply that you can't reuse driver instances while scraping with joblib which can't be true.
Q1: How to reuse driver instances during parallel processing in the above example?
Q2: How to get rid of the while-loop while maintaining functionality in the above-mentioned example?
Note: Flash and image loading is disabled in firefox_profile (code not shown)
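For reference only, since the question says the profile code is not shown: a sketch of what such a firefox_profile setup might look like with Selenium 3-style APIs. The preference names are the commonly used ones for blocking images and Flash; this is not the asker's actual configuration.
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("permissions.default.image", 2)  # 2 = block image loading
profile.set_preference("plugin.state.flash", 0)         # 0 = never activate the Flash plugin
driver = webdriver.Firefox(firefox_profile=profile, executable_path=path)  # path as in the question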
1) You should first create a set of drivers, one for each worker, and pass an instance to each. I don't know how to pass drivers to a Parallel object directly, but you can use the threading.current_thread().name key to identify drivers. To do that, use backend="threading", so each thread gets its own driver.
2) You don't need a loop at all; the Parallel object itself iterates over all your URLs (assuming I've understood why you used the loop).
import threading
from joblib import Parallel, delayed
from selenium import webdriver

def scrape(URL):
    try:
        driver = drivers[threading.current_thread().name]
    except KeyError:
        drivers[threading.current_thread().name] = webdriver.Firefox()
        driver = drivers[threading.current_thread().name]
    driver.get(URL)
    results = do_something(driver)
    if results:
        safe_results("results.csv")

drivers = {}
Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)

for driver in drivers.values():
    driver.quit()
But I don't really think you gain anything from using more n_jobs than you have CPUs, so n_jobs=-1 is best (of course I may be wrong; try it).

Using Python/Selenium/Best Tool For The Job to get URI of image requests generated through JavaScript?

I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.
To clarify: I might have this (overly simplified code).
<script>
suffix = magicNumberFunctionIDontHaveAccessTo();
url = "http://foobar.com/function?parameter=" + suffix
img = document.createElement('img'); img.src=url; document.all.body.appendChild(img);
</script>
Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be muchly appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.
That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instance to be started with the option captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick any one of the many HTTP proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them), e.g. by appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.
Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and records them wherever you wish; and shuts down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".
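As a rough, hedged illustration of that proxy idea (not a production-grade proxy): a minimal plain-HTTP forwarding proxy that appends every requested URL to XXX.txt. The port is a placeholder and HTTPS CONNECT tunnels are not handled.
import http.server
import urllib.request

LOG_FILE = "XXX.txt"

class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # When the browser uses us as a proxy, self.path is the absolute URL requested
        with open(LOG_FILE, "a") as log:
            log.write(self.path + "\n")
        try:
            with urllib.request.urlopen(self.path) as upstream:
                body = upstream.read()
                status = upstream.status
                headers = upstream.getheaders()
            self.send_response(status)
            for name, value in headers:
                if name.lower() not in ("transfer-encoding", "connection"):
                    self.send_header(name, value)
            self.end_headers()
            self.wfile.write(body)
        except Exception:
            self.send_error(502)

if __name__ == "__main__":
    # Point Firefox's HTTP proxy at 127.0.0.1:8899 and watch XXX.txt grow
    http.server.ThreadingHTTPServer(("127.0.0.1", 8899), LoggingProxy).serve_forever()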
Use the Firebug Firefox plugin. It will show you all requests in real time, and you can even debug the JS in your browser or run it step by step.
Ultimately, I did it in python, using Selenium-RC. This solution requires the python files for selenium-rc, and you need to start the java server ("java -jar selenium-server.jar")
from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):
        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute, link, pos) in lxml.html.iterlinks(htmldoc)]
        for url in url_list:
            try:
                sel = self.selenium
                sel.open(url)
                sel.select_window("null")
                js_code = '''
                myDomainWindow = this.browserbot.getUserWindow();
                for(obj in myDomainWindow) {
                    /* This code grabs the OMNITURE tracking pixel img */
                    if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {
                        var ret = myDomainWindow[obj].src;
                    }
                }
                ret;
                '''
                omniture_url = sel.get_eval(js_code)  # parse & process this however you want
            except Exception as e:
                print('We ran into an error: %s' % (e,))

        # self.assertEqual("expectedValue", observedValue)  # placeholder assertion from the test template

    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is considered):
// bind() keeps the original appendChild callable; the wrapper records img sources before delegating
var ac = document.body.appendChild.bind(document.body);
var sources = [];
document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    return ac(child);
}
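If you end up driving this from Python anyway, here is a hedged sketch of injecting a hook like the one above through Selenium's execute_script and reading the recorded sources back. It assumes a WebDriver-style driver rather than the Selenium RC API used earlier in this thread, it only catches images appended after the hook is installed, and the URL is a placeholder.
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.example.com")  # placeholder URL

hook = """
var ac = document.body.appendChild.bind(document.body);
window.sources = [];
document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        window.sources.push(child.getAttribute('src'));
    }
    return ac(child);
};
"""
driver.execute_script(hook)
# ... wait for the third-party script to add its image, then:
image_urls = driver.execute_script("return window.sources;")
print(image_urls)
driver.quit()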
