How does OpenWPM's run_custom_function affect scope? - python

I'm trying to scrape some sites with OpenWPM and wrote a custom function get_pp_links() that appends something to a global list. But when I use it with OpenWPM's
run_custom_function(), the items I append to the list disappear.
My script looks like this:
from csv import reader

from automation import CommandSequence, TaskManager

NUM_BROWSERS = 2
alist = [1]

def get_pp_links(**kwargs):
    alist.append(1)

def main():
    # The list of sites that we wish to crawl
    with open("top-100.csv", "r") as file:
        csv_reader = reader(file)
        sites = list(csv_reader)

    # Loads the default manager preferences
    manager_params, browser_params = TaskManager.load_default_params(NUM_BROWSERS)

    # Update browser configuration
    for i in range(NUM_BROWSERS):
        browser_params[i]['headless'] = True

    # Instantiates the measurement platform
    manager = TaskManager.TaskManager(manager_params, browser_params)

    for site in sites:
        command_sequence = CommandSequence.CommandSequence("https://" + site[1], reset=True)
        command_sequence.get(sleep=0, timeout=60)
        command_sequence.run_custom_function(get_pp_links, ())
        manager.execute_command_sequence(command_sequence)

    # Shuts down the browsers and waits for the data to finish logging
    manager.close()
    print(alist)

if __name__ == "__main__":
    main()
And top-100.csv contains a domain list:
92,twitch.tv
93,forbes.com
94,bbc.com
I'm expecting the list to grow with every scanned site, so the result would look like [1, 1, 1, 1], but instead the printed list is only [1].
I think this is somehow connected to run_custom_function(), because when I call get_pp_links() directly, the problem does not appear.

This is because OpenWPM creates a new process for each browser it spawns.
Since each process is isolated from the parent process and from the other browser processes, the following happens:

1. alist gets created in the main process.
2. alist gets copied into the browser process.
3. alist gets changed by get_pp_links in the browser process.
4. The changes stay in the browser process, so you can't observe them in the parent process.

You might be able to get around this by using a multiprocessing.SyncManager and syncing the list between the processes, roughly as sketched below.
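For illustration, here is a minimal, OpenWPM-independent sketch of the mechanism: a multiprocessing manager owns the list, child processes receive a proxy to it, and appends made in the children become visible in the parent. The function and list names mirror the question, but wiring this into run_custom_function is an assumption left to the reader.

from multiprocessing import Manager, Process

def get_pp_links(shared_list):
    # Runs in a child process; the proxy forwards the append to the
    # manager process, so the parent can observe the change.
    shared_list.append(1)

if __name__ == "__main__":
    with Manager() as manager:
        alist = manager.list([1])
        workers = [Process(target=get_pp_links, args=(alist,)) for _ in range(3)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        print(list(alist))  # [1, 1, 1, 1]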

Related

Why is the code returning a TypeError - Python

#app.route("/admin/3")
def admin3_p():
return render_template("input_test.html")
#app.route("/admin/3", methods=['POST'])
def student_name():
with app.test_request_context('/admin/3', data='student'):
variable = request.form.get('student', list(''))
return variable
# Connect to CSV
def csv_func():
variable = student_name()
csv_f = "names.csv"
titles = ["Event", "Student", "Grade"]
students = [["Ev", "St", "Gr"], [variable]]
with open(csv_f, 'w') as csvfile:
csvwriter = csv.writer(csvfile)
csvwriter.writerow(titles)
csvwriter.writerows(students)
with open(csv_f, 'r') as csvfile:
csvreader = csv.reader(csvfile)
titles = next(csvreader)
for student in csvreader:
students.append(students)
print('Fields: ' + ', '.join(title for title in titles))
print(students)
csv_func()
I am trying to make a website with Flask. The csv_func method is supposed to take the input from the HTML and write it to a CSV file.
When it runs, it returns "TypeError: The view function did not return a valid response. The return type must be a string, dict, tuple, Response instance, or WSGI callable, but it was a list."
Technically, the error occurs because a function with a route decorator is considered 'a view' and is supposed to return a page, yet your student_name returns a list (of student names).
Still, I have to tell you that you have the wrong idea of web app syntax and structure. Your flow of control is the opposite of what it should be. You should initiate model and CSV changes from the controller (the student_name function), but you are doing it vice versa, by calling student_name from csv_func. The main code usually just starts the web app with something like
app.run(host='0.0.0.0', port=81)
So you should restructure your code so that the student_name function invokes the CSV-changing function.
I guess you think that a web app form is akin to the input command in Python, yet a web app is very different from Python console input. The main difference is that a website normally offers several different pages, and the user is free to land on any page they like. So a normal web server just waits for the user to land on one page or another, or to send one form or another. Thus the structure of a web app is a set of pages, routes, and controllers for those pages, and the main code just starts the Flask server. Go through some introductory Flask tutorial if it is still unclear, e.g. https://flask.palletsprojects.com/en/1.1.x/quickstart/
Most web apps follow a UI design pattern called Model-View-Controller, where user actions, such as opening a webpage at a specific address or submitting a form, first hit some controlling code, which then initiates changes in the model (data).
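To illustrate the restructuring suggested above (the POST view acts as the controller and calls a plain CSV-writing helper), here is a minimal sketch; the route, template, and field names are carried over from the question, and the returned string is only a placeholder.

import csv
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/admin/3")
def admin3_p():
    return render_template("input_test.html")

@app.route("/admin/3", methods=['POST'])
def student_name():
    # Controller: read the submitted field and hand it to the model code.
    student = request.form.get('student', '')
    write_student_csv(student)
    # A view must return a valid response, so return a string (or a template).
    return f"Saved {student}"

def write_student_csv(student):
    # Model/helper: a plain function with no route decorator.
    with open("names.csv", 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(["Event", "Student", "Grade"])
        writer.writerow(["Ev", student, "Gr"])

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=81)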
Get rid of the app.route(...) decorator above def student_name():.

Leaking memory when using Selenium webdriver in a for loop

I'm running a Python script which uses Selenium, on an EC2 instance (ubuntu).
Under my AWS plan, I have 2 GB of memory. I upgraded to this from the free version after some performance issues with my script. However, when I check memory while connected to the Ubuntu server, I'm only seeing 348 MB of available memory and 353 MB of free memory!
As of now, I'm only running two Python scripts, once a day, using crontab. The scripts run through a fairly long array of URLs and grabs information from each of them.
base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='
events = []
eventContainerBucket = []

for i in range(1, 25):
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)

    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')

    # collect href attribute of events in event_list
    events.extend(list(event.get_attribute("href") for event in event_list))

allEvents = []
for event in events:
    driver.get(event)
    # do a bunch of other stuff

driver.quit()
Is there an inherent problem in this code that would be causing a memory leak? I would think free memory should go up again once the script has stopped running, but it doesn't.
I tried to invoke driver.close() within the for-loop, so that after the information is retrieved from each URL, the window closes. I was thinking this would help with the memory leak - unfortunately, this gave me the error selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id.
Am I on the right path with driver.close() or is something else entirely the problem?
Instead of driver.quit(), have you tried driver.close()?
close() - closes the current browser window.
quit() - shuts down the WebDriver instance and all of its windows.
for i in range(1, 25):
    # cycle through pages in range
    driver.get(base_url + str(i))
    pageURL = base_url + str(i)

    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')

    # collect href attribute of events in event_list
    events.extend(list(event.get_attribute("href") for event in event_list))

allEvents = []
for event in events:
    driver.get(event)

driver.close()
I have checked your code and it works fine without any issue on my side.
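For reference, a minimal sketch of the usual pattern for this kind of crawl: keep one driver for the whole run, call quit() exactly once at the end (inside finally, so the browser process is released even if the script crashes), and avoid close() inside the loop, since closing the last window invalidates the session. The URL and selectors are copied from the question; the per-event processing is elided.

from selenium import webdriver

base_url = 'https://www.bandsintown.com/en/c/san-francisco-ca?page='
driver = webdriver.Firefox()  # assumes geckodriver is on PATH

try:
    events = []
    for i in range(1, 25):
        driver.get(base_url + str(i))
        event_list = driver.find_elements_by_css_selector(
            'div[class^=_3buUBPWBhUz9KBQqgXm-gf] a[class^=_3UX9sLQPbNUbfbaigy35li]')
        events.extend(event.get_attribute("href") for event in event_list)

    for event in events:
        driver.get(event)
        # ... do a bunch of other stuff ...
finally:
    # quit() ends the browser and driver processes and frees their memory
    driver.quit()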

Function in Django views runs 2 times without reason

I have a problem: I cannot find the reason why my function in Django views.py sometimes runs twice. When I go to the URL that calls the create_db view, the function reads JSON files from a directory, parses them, and writes the data to the database. Most of the time it works perfectly, but sometimes, for no apparent reason, it runs twice and writes the same data to the database twice. Does anyone know what the reason might be and how I can solve the problem?
Here is my create_db function:
import json
import os
import time

from django.http import HttpResponse

# import_item is the author's own helper; its import is omitted in the question.

def create_db(request):
    response_data = {}
    try:
        start = time.time()
        files = os.listdir()
        print(files)
        for filename in files:
            if filename.endswith('.json'):
                print(filename)
                with open(f'{filename.strip()}', encoding='utf-8') as f:
                    data = json.load(f)
                    for item in data["CVE_Items"]:
                        import_item(item)
        response_data['result'] = 'Success'
        response_data['message'] = 'Baza podatkov je ustvarjena.'  # "The database has been created."
    except KeyError:
        response_data['result'] = 'Error'
        response_data['message'] = 'Prislo je do napake! Podatki niso bili uvozeni!'  # "An error occurred! The data was not imported!"
    return HttpResponse(json.dumps(response_data), content_type='application/json')
The console output that I expect:
['nvdcve-1.0-2002.json', 'nvdcve-1.0-2003.json', 'nvdcve-1.0-2004.json', 'nvdcve-1.0-2005.json', 'nvdcve-1.0-2006.json', 'nvdcve-1.0-2007.json', 'nvdcve-1.0-2008.json', 'nvdcve-1.0-2009.json', 'nvdcve-1.0-2010.json', 'nvdcve-1.0-2011.json', 'nvdcve-1.0-2012.json', 'nvdcve-1.0-2013.json', 'nvdcve-1.0-2014.json', 'nvdcve-1.0-2015.json', 'nvdcve-1.0-2016.json', 'nvdcve-1.0-2017.json']
nvdcve-1.0-2002.json
nvdcve-1.0-2003.json
nvdcve-1.0-2004.json
nvdcve-1.0-2005.json
nvdcve-1.0-2006.json
nvdcve-1.0-2007.json
nvdcve-1.0-2008.json
nvdcve-1.0-2009.json
nvdcve-1.0-2010.json
nvdcve-1.0-2011.json
nvdcve-1.0-2012.json
nvdcve-1.0-2013.json
nvdcve-1.0-2014.json
nvdcve-1.0-2015.json
nvdcve-1.0-2016.json
nvdcve-1.0-2017.json
Console output when the error happened:
['nvdcve-1.0-2002.json', 'nvdcve-1.0-2003.json', 'nvdcve-1.0-2004.json', 'nvdcve-1.0-2005.json', 'nvdcve-1.0-2006.json', 'nvdcve-1.0-2007.json', 'nvdcve-1.0-2008.json', 'nvdcve-1.0-2009.json', 'nvdcve-1.0-2010.json', 'nvdcve-1.0-2011.json', 'nvdcve-1.0-2012.json', 'nvdcve-1.0-2013.json', 'nvdcve-1.0-2014.json', 'nvdcve-1.0-2015.json', 'nvdcve-1.0-2016.json', 'nvdcve-1.0-2017.json']
nvdcve-1.0-2002.json
['nvdcve-1.0-2002.json', 'nvdcve-1.0-2003.json', 'nvdcve-1.0-2004.json', 'nvdcve-1.0-2005.json', 'nvdcve-1.0-2006.json', 'nvdcve-1.0-2007.json', 'nvdcve-1.0-2008.json', 'nvdcve-1.0-2009.json', 'nvdcve-1.0-2010.json', 'nvdcve-1.0-2011.json', 'nvdcve-1.0-2012.json', 'nvdcve-1.0-2013.json', 'nvdcve-1.0-2014.json', 'nvdcve-1.0-2015.json', 'nvdcve-1.0-2016.json', 'nvdcve-1.0-2017.json']
nvdcve-1.0-2002.json
nvdcve-1.0-2003.json
nvdcve-1.0-2003.json
nvdcve-1.0-2004.json
nvdcve-1.0-2004.json
nvdcve-1.0-2005.json
nvdcve-1.0-2005.json
nvdcve-1.0-2006.json
nvdcve-1.0-2006.json
nvdcve-1.0-2007.json
nvdcve-1.0-2007.json
nvdcve-1.0-2008.json
nvdcve-1.0-2008.json
nvdcve-1.0-2009.json
nvdcve-1.0-2009.json
nvdcve-1.0-2010.json
nvdcve-1.0-2010.json
nvdcve-1.0-2011.json
nvdcve-1.0-2011.json
nvdcve-1.0-2012.json
nvdcve-1.0-2012.json
nvdcve-1.0-2013.json
nvdcve-1.0-2013.json
nvdcve-1.0-2014.json
nvdcve-1.0-2014.json
nvdcve-1.0-2015.json
nvdcve-1.0-2015.json
nvdcve-1.0-2016.json
nvdcve-1.0-2016.json
nvdcve-1.0-2017.json
nvdcve-1.0-2017.json
The problem is not in the code you show us. Enable logging for the HTTP requests your application receives to make sure the browser sends just a single request. If you see two requests, check whether they use the same session (maybe another user is clicking at the same time).
If they come from the same user, maybe you're clicking the button twice; it could even be a hardware problem with the mouse. To prevent this, use JavaScript to disable the button after the first click.
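To illustrate the first suggestion (enable request logging), here is a minimal sketch of a LOGGING setting for settings.py that makes the Django development server print one line per incoming request to the console; under a production setup you would check the web server's access log instead. This is standard Django logging configuration, not code taken from the question.

# settings.py (sketch)
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        # runserver logs one line per handled request through this logger.
        'django.server': {
            'handlers': ['console'],
            'level': 'INFO',
            'propagate': False,
        },
    },
}

If the view really is hit twice, you will see two request lines with nearly identical timestamps.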

How to reuse a selenium driver instance during parallel processing?

To scrape a pool of URLs, I am parallel-processing Selenium with joblib. In this context, I am facing two challenges:
Challenge 1 is to speed up this process. At the moment, my code opens and closes a driver instance for every URL (ideally it would be one per process).
Challenge 2 is to get rid of the CPU-intensive while loop that I think I need in order to continue on empty results (I know that this is most likely wrong).
Pseudocode:
URL_list = [URL1, URL2, URL3, ..., URL100000]  # List of URLs to be scraped

def scrape(URL):
    while True:  # Loop needed to use continue
        try:  # Try scraping
            driver = webdriver.Firefox(executable_path=path)  # Set up driver
            website = driver.get(URL)  # Get URL
            results = do_something(website)  # Get results from URL content
            driver.close()  # Close worker
            if len(results) == 0:  # If do_something() failed:
                continue  # THEN worker skips URL
            else:  # If do_something() worked:
                safe_results("results.csv")  # THEN save results
                break  # Go to next worker/URL
        except Exception as e:  # If something weird happens:
            save_exception(URL, e)  # THEN save error message
            break  # Go to next worker/URL

Parallel(n_jobs=40)(delayed(scrape)(URL) for URL in URL_list)  # Run in 40 processes
My understanding is that in order to re-use a driver instance across iterations, the "# Set up driver" line needs to be placed outside scrape(URL). However, everything outside scrape(URL) will not find its way into joblib's Parallel(n_jobs=40). This would imply that you can't reuse driver instances while scraping with joblib, which can't be true.
Q1: How to reuse driver instances during parallel processing in the above example?
Q2: How to get rid of the while-loop while maintaining functionality in the above-mentioned example?
Note: Flash and image loading are disabled in the firefox_profile (code not shown)
1) You should first create a bunch of drivers: one for each worker, and pass an instance to the worker. I don't know how to pass drivers to a Parallel object directly, but you can use threading.current_thread().name as a key to identify drivers. To do that, use backend="threading". Now each thread will have its own driver.
2) You don't need the loop at all. The Parallel object itself iterates over all your URLs (I hope I really understood your intention in using a loop).
import threading

from joblib import Parallel, delayed
from selenium import webdriver

def scrape(URL):
    try:
        driver = drivers[threading.current_thread().name]
    except KeyError:
        drivers[threading.current_thread().name] = webdriver.Firefox()
        driver = drivers[threading.current_thread().name]
    driver.get(URL)
    results = do_something(driver)
    if results:
        safe_results("results.csv")

drivers = {}
Parallel(n_jobs=-1, backend="threading")(delayed(scrape)(URL) for URL in URL_list)

for driver in drivers.values():
    driver.quit()
But I don't really think you gain anything by using more n_jobs than you have CPUs, so n_jobs=-1 is best (of course I may be wrong, try it).
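One caveat about the snippet above: unlike the original pseudocode, it no longer catches exceptions, so a single failing URL will abort the whole Parallel run. A sketch of the same per-thread-driver idea that keeps the question's save_exception behaviour (do_something, safe_results, and save_exception remain the question's own placeholders; the imports and drivers dict from the snippet above are reused):

def scrape(URL):
    # Reuse (or lazily create) one driver per worker thread.
    name = threading.current_thread().name
    if name not in drivers:
        drivers[name] = webdriver.Firefox()
    driver = drivers[name]
    try:
        driver.get(URL)
        results = do_something(driver)
        if results:
            safe_results("results.csv")
    except Exception as e:
        # Record the failure and move on instead of killing the whole run.
        save_exception(URL, e)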

Celery-Redis worker showing weird results on Heroku

I have built a Flask application hosted on Heroku, with Celery as the worker and Redis as the broker (the result backend is saved on Redis itself). It has the following code:
import csv
import os

def create_csv_group(orgi, mx):
    # Write a csv file with filename 'group'
    # (presumably registered as a Celery task, since it is called with apply_async below;
    # the decorator is not shown in the question)
    cols = []
    maxx = int(mx) + 1
    cols.append(['SID', 'First', 'Last', 'Email'])
    for i in range(0, int(mx)):
        cols.append(['SID' + str(i), 'First' + str(i), 'Last' + str(i), 'Email' + str(i)])
    with open(os.path.join('uploads/', 'groupfile_t.csv'), 'wb') as f:
        writer = csv.writer(f)
        for i in range(len(max(cols, key=len))):
            writer.writerow([(c[i] if i < len(c) else '') for c in cols])

@app.route('/mark', methods=['POST'])
def mark():
    task = create_csv_group.apply_async(args=[orig, mx])
    tsk_id = task.id
If I try to access the variable tsk_id, sometimes it gives the error:
variable used before being initialized.
I thought the reason was that the task was not being sent to the queue before I accessed tsk_id, so I moved the function call to after two form-filling pages.
But now it is not updating/saving the file correctly; it shows weird output in the file (it seems to be the old file data, which should get updated when the new form is filled). When I run the same code locally, it runs perfectly fine. I logged the worker: it goes into the task function and runs properly too.
Why is this weird output being displayed? How can I fix both of the issues, so that it writes properly to the file and I can check on the task id?
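As background for the task-id part of the question, a minimal, generic sketch of looking up a Celery task by its id with AsyncResult; celery_app is a hypothetical name for your configured Celery() instance, and this does not address the Heroku file-writing issue.

from celery.result import AsyncResult

def check_task(task_id):
    # Look up the task in the configured result backend (Redis here).
    result = AsyncResult(task_id, app=celery_app)  # celery_app: your Celery() instance
    if result.ready():
        return result.get()   # task finished; fetch its return value
    return result.state       # e.g. 'PENDING' or 'STARTED'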
