I have a script in Playwright 1.23.1 with Python 3.10.2, running on Ubuntu 20.04 LTS, to extract data from a page. The script runs for about an hour because there is a lot of data. The problem is that sometimes the execution freezes and I can't catch it, so I have to send SIGINT (Ctrl+C) to stop the execution. I don't know whether the problem is in Playwright or in the page it is browsing. The page is really not that good: it's slow and has some problems in general.
My code is something like this:
from playwright.sync_api import sync_playwright
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError
from playwright.sync_api import Error as PlaywrightError

with sync_playwright() as spw:
    # connect to the page
    browser, page = connect(spw)
    # sign in with credentials
    sign_in(username, page)
    data_code = 1
    while True:
        try:
            # Extract the data from the page using Playwright methods,
            # including select_option, frame_locator, evaluate, click, and reload
            end_signal = extract_data(page, data_code)
            if end_signal:
                break
        except (TimeoutError, PlaywrightTimeoutError, PlaywrightError) as e:
            print(f"Error is: {e}")
            page.reload(timeout=90000, wait_until='domcontentloaded')
        data_code += 1
    print("Exit cycle")
That's the basic structure of my code. The problem is that sometimes, when trying to use click, select_option, or reload, the execution freezes. There are some prints inside extract_data but they don't appear, and sometimes print(f"Error is: {e}") does not print either, so execution never reaches it; therefore, my conclusion is that the execution is frozen.
I have researched Playwright debugging a little, but in general I can't find the problem, and this error only happens sometimes, so I can't replicate it for a proper debugging session with Playwright's tools. The logs at least let me identify where it freezes, but nothing else so far.
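One mitigation I plan to try (a sketch, not yet in the script): set default timeouts on the page right after connecting, so that any action that hangs raises PlaywrightTimeoutError instead of blocking forever, and run the script with Playwright's API log enabled (DEBUG=pw:api python script.py) to see which call it was stuck in.

# right after connect() returns the page (connect/sign_in as in the code above)
page.set_default_timeout(60_000)             # click, select_option, etc. raise after 60 s
page.set_default_navigation_timeout(90_000)  # navigations and reloads get a bit longer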
Hope someone can help me, thanks.
Related
I've written a script that periodically scrapes Twitter data inside a while True loop.
In the except clause, I initiate a one-time scrape of a large chunk of data.
The only way I can trigger this is by pressing Ctrl+C. What I want to do is map the Ctrl+C behaviour to the button on my Raspberry Pi Pibrella.
I've looked around here, there, and everywhere, but had no joy. The only module I can find does not work on Raspberry Pi (Linux).
import time

def status_update():
    while True:
        try:
            scrape_some_stuff()
            time.sleep(1 * 60)  # wait a minute between scrapes
        except:  # a bare except also catches the KeyboardInterrupt from Ctrl+C
            scrape_lots_of_stuff()
            time.sleep(1 * 60)
pibrella has a package (pip install pibrella) that lets you easily monitor the button status, which you can use to raise your exception:
import pibrella

# add something like this to your code; the handler runs when the button is pressed
def on_press(pin):
    raise Exception

pibrella.button.pressed(on_press)
Their GitHub repo has some examples.
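If you want the button to behave exactly like Ctrl+C, note that GPIO event handlers typically run on a background thread, so an exception raised inside the callback may never land in the main while True loop. A sketch of an alternative (my assumption, not from the original answer): have the handler send the process a real SIGINT, which Python converts into a KeyboardInterrupt in the main thread, exactly as Ctrl+C would.

import os
import signal

import pibrella

def on_press(pin):
    # Deliver a real SIGINT to our own process; Python raises it as
    # KeyboardInterrupt in the main thread, just like a terminal Ctrl+C.
    os.kill(os.getpid(), signal.SIGINT)

pibrella.button.pressed(on_press)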
I'm getting some very strange behaviour in the following example code segment:
from selenium import webdriver

try:
    driver = webdriver.Firefox(executable_path='./bin/geckodriver.exe')
    variable = input("Enter something or Ctrl+C: ")
    driver.get(variable)
except:
    pass
finally:
    driver.close()
    driver.quit()
If I enter a valid URL, the webdriver fetches the page & then the browser instance is closed.
If I enter an invalid URL, a selenium.common.exceptions.InvalidArgumentException is thrown, but the code progresses & the browser is still closed.
If I press Ctrl + C to send the SIGINT during the input statement:
The exception hits the pass in the except block & proceeds to finally.
Calling just driver.quit() returns None and the Firefox instance is left open.
Calling driver.close() results in urllib3.exceptions.MaxRetryError: ... Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it, and the program terminates with Firefox left open.
This is the simplest example I could come up with, but I'm getting the same behaviour in some code that I'm writing when WebDriverWaits are interrupted or when seemingly unrelated code throws an exception; suddenly the webdriver instance is unresponsive. This is a problem since it leaves headless Firefox instances open. Is this a known issue when working with Selenium, or am I doing something I shouldn't be?
The Firefox version being used is Quantum v64 & Geckodriver is v0.23.0; both should be up-to-date.
Edit: Using pdb to step through the code, the driver instance is created & firefox opens, it prompts for the input and I press Ctrl+C, driver.get(variable) is not executed, the code moves to except and then to finally, and I receive a MaxRetryError out of nowhere. If I replace the input(..) line with: raise KeyboardInterrupt(), then the browser closes as expected; I'm not sure why the program has this behaviour in response to Ctrl+C.
Edit (2): Reported this as a bug on the Selenium GitHub repo. A Python version difference was suggested, but I retried under 3.7.2 (the most recent) and it still exhibits the same behaviour. Selenium, Firefox, Python, and Geckodriver are all now as up-to-date as they can be and I'm still running into this issue. Hopefully this gets resolved; seemingly, this is not an issue with the code I've written.
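One plausible explanation (my assumption, not confirmed in the thread) is that the console delivers Ctrl+C to every process attached to it, so geckodriver itself receives the interrupt and exits before driver.close() runs, which would explain the connection-refused MaxRetryError. Selenium keeps the subprocess handle for the driver executable, so this theory can be checked from the finally block:

# Sketch: driver.service.process is the subprocess.Popen handle Selenium
# keeps for geckodriver; poll() returns None while it is still running
# and its exit code once it has died.
print(driver.service.process.poll())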
So I have this script in Selenium that sometimes crashes for various reasons: it fails to click a button, gets confused, gets messed up, and displays an error.
How can I make the script re-run from the beginning whenever it crashes? I've heard about try and except blocks but I'm not sure how to use them.
Any help is appreciated! :)
[Using Python 2.7 with Selenium Webdriver]
A generic answer to retry on any exception:

while True:
    try:
        # run your selenium code which sometimes breaks
        run_your_selenium_code()
        break  # it completed without crashing, so stop retrying
    except Exception as e:
        print("something went wrong: " + repr(e))
You may want to refine the exception type so that you don't retry on a plain Python error such as ValueError or IOError: check the exception type and replace Exception with the more specific, qualified type.
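A slightly fuller sketch, compatible with the Python 2.7 mentioned in the question (run_your_selenium_code and the attempt limit are placeholders; WebDriverException is Selenium's base exception class, so plain Python errors like ValueError still propagate):

from selenium.common.exceptions import WebDriverException

MAX_ATTEMPTS = 5  # placeholder limit, tune to taste

for attempt in range(1, MAX_ATTEMPTS + 1):
    try:
        run_your_selenium_code()  # placeholder for the real script body
        break  # success, stop retrying
    except WebDriverException as e:
        print("attempt %d failed: %r" % (attempt, e))
else:
    print("giving up after %d failed attempts" % MAX_ATTEMPTS)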
I have a Python program which sends several (about 5-6) long-poll requests in parallel, using a different thread for each poll, via the requests package. I have realized that some of my threads sometimes just freeze. When this happens, the server I am sending the request to does not receive the request. I also set a timeout on the request, and it does not work.
import requests
from requests.exceptions import ReadTimeout, RequestException

try:
    print("This line prints")
    response = requests.head(poll_request_url, timeout=180)
    print("This line does not print when freeze occurs")
except ReadTimeout:
    print("Request exception.")
except RequestException as e:
    print("Request exception.")
except Exception:
    print("Unknown exception.")
print("This line does not print either when freeze occurs.")
I am doing this on Raspberry Pi 2 hardware with Raspbian OS.
I used this same program without a problem when I was using Python 2.7. Recently I switched to Python 3.5. I tested with both requests 2.8.1 and 2.9.1.
This problem does not occur very frequently but happens 2-3 times per day on different threads.
What might be the problem? How can I debug this?
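One way to see where a thread is stuck (a sketch using the standard-library faulthandler module, not something from the original post): register a handler that dumps every thread's stack on demand, then send the signal from another shell once a freeze happens.

import faulthandler
import signal

# Dump every thread's current stack to stderr when the process receives
# SIGUSR1; trigger it from another shell with: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, all_threads=True)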
Edit: The problem is solved by updating the Linux kernel.
According to the docs:
http://docs.python-requests.org/en/master/user/quickstart/#timeouts
It should be throwing a Timeout exception when the timeout happens. That would mean that the line:
print("This line does not print when freeze occurs")
would never be called if a timeout actually happens.
Are you catching the exception? Or any other exception? It might be that it's timing out fine, but you just aren't seeing it. Maybe try something like this:
import requests

try:
    response = requests.head(poll_request_url, timeout=180)
except requests.exceptions.Timeout:
    print("Timeout occurred")
So you can see if that's what is going on.
EDIT: possibly it's the "connect" step that's not timing out correctly. It may be that the large timeout value for the "connect" step is messing it up somehow. Perhaps try having a shorter timeout for that (as mentioned here):
http://docs.python-requests.org/en/master/user/advanced/#timeouts
e.g.
response = requests.head(poll_request_url, timeout=(3, 180))
Failing that, it might be some sort of DNS lookup issue? Maybe see if hardcoding the IPs presents the same problem.
Solved my problem using timers (from threading import Timer). If there is no result within 10 seconds, repeat the request; if there is still no result after another 10 seconds, print 'Error' and go on. You can't monitor the timer's status with an if statement when the request itself freezes, but you can do it through a while loop, extending the time whenever a result arrives OK (see Python: Run code every n seconds and restart timer on condition).
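A minimal sketch of that watchdog idea (the fetch helper and the 10-second budget are placeholders; poll_request_url is from the question): run the request in a daemon thread and join it with a timeout, so the main thread can give up even if the request itself never returns.

import threading
import requests

def fetch(url, box):
    # store the response (or the error) so the main thread can inspect it
    try:
        box["response"] = requests.head(url, timeout=180)
    except Exception as e:
        box["error"] = e

box = {}
worker = threading.Thread(target=fetch, args=(poll_request_url, box))
worker.daemon = True  # a frozen request won't keep the process alive on exit
worker.start()
worker.join(10)  # wait at most 10 seconds for the request to finish
if worker.is_alive():
    print("Error: request still hanging after 10 seconds, moving on")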
I'm still relatively new to Python, so if this is an obvious question, I apologize.
My question is in regard to the urllib2 library and its urlopen function. Currently I'm using this to load a large number of pages from another server (they are all on the same remote host), but the script is killed every now and then by a timeout error (I assume this comes from the large requests).
Is there a way to keep the script running after a timeout? I'd like to be able to fetch all of the pages, so I want a script that will keep trying until it gets a page, and then moves on.
On a side note, would keeping the connection open to the server help?
Next time the error occurs, take note of the error message. The last line will tell you the type of exception. For example, it might be a urllib2.HTTPError. Once you know the type of exception raised, you can catch it in a try...except block. For example:
import urllib2
import time

for url in urls:
    while True:
        try:
            sock = urllib2.urlopen(url)
        except (urllib2.HTTPError, urllib2.URLError) as err:
            # You may want to count how many times you reach here and
            # do something smarter if you fail too many times.
            # If a site is down, pestering it every 10 seconds may not
            # be very fruitful or polite.
            time.sleep(10)
        else:
            # Success
            contents = sock.read()
            # process contents
            break  # break out of the while loop
The urllib2 missing manual might help you.
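Since the question was specifically about timeouts, here is a variant of the loop above (a sketch; the 30-second timeout and the retry cap are arbitrary choices): it passes timeout= to urlopen so a hung connection gives up on its own, also catches socket.timeout for reads that stall, and stops retrying after a few attempts instead of looping forever.

import socket
import time
import urllib2

MAX_RETRIES = 5  # arbitrary cap, tune to taste

for url in urls:
    for attempt in range(MAX_RETRIES):
        try:
            # timeout= makes urlopen itself give up instead of hanging forever
            sock = urllib2.urlopen(url, timeout=30)
            contents = sock.read()
        except (urllib2.HTTPError, urllib2.URLError, socket.timeout):
            time.sleep(10)  # back off before retrying
        else:
            # process contents
            break  # success, move on to the next url
    else:
        print("giving up on %s" % url)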