I am scraping a website which has 2 versions at the moment, and when you visit the site you never know which one you are going to get. For this reason I have had to set up two separate files to scrape it.
For the sake of simplicity I have a master file which controls the running of the two files:
attempts = 0
while attempts < 10:
    try:
        try:
            runfile('file1.py')
        except SomeException:
            runfile('file2.py')
        break
    except:
        attempts += 1
So basically this keeps trying, up to a maximum of 10 times, until the correct scraper file meets the correct version of the site.
The problem with this is that the files launch a webdriver every time, so I can end up with several empty browsers clogging up the machine. Is there any command which can just close all webdriver instances? I cannot use driver.quit() because in the environment of this umbrella script, driver is not a recognized variable.
I also cannot use driver.quit() at the end of file1.py or file2.py, because when file1.py encounters an error it stops running, so the driver.quit() command will never be executed. And I can't wrap everything in a try/except there, because then my master file won't know that there was an error in file1.py and thus won't run file2.py.
Handle the exception in individual runners, close the driver and raise a common exception that you then handle in the caller.
In file1.py and file2.py:
try:
    # routine
except Exception:
    driver.quit()
    raise  # re-raise so the caller still sees the error
You can factor this out to the caller by initializing the driver in the caller, and passing the driver instance to functions instead of modules.
You can have a try..finally block in runfile.
def runfile(filename):
    driver = ...
    try:
        ...
    finally:
        # close the webdriver no matter how the routine exits
        driver.quit()
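Combining both suggestions into one minimal, hypothetical sketch - the caller owns the driver and passes it to per-version functions, and the finally block guarantees no browser is left behind. Note that scrape_v1, scrape_v2 and WrongVersionError are assumed names, not part of the original files:
from selenium import webdriver

class WrongVersionError(Exception):
    """Hypothetical: raised by a scraper that got the wrong site version."""

def scrape_v1(driver):
    ...  # assumed: scraper for version 1; raises WrongVersionError on mismatch

def scrape_v2(driver):
    ...  # assumed: scraper for version 2

def scrape_site(max_attempts=10):
    for attempt in range(max_attempts):
        driver = webdriver.Firefox()
        try:
            try:
                scrape_v1(driver)
            except WrongVersionError:
                scrape_v2(driver)
            return               # success: stop retrying
        except Exception:
            pass                 # any other error: retry with a fresh browser
        finally:
            driver.quit()        # runs on success and on failure alike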
Is there a way to detect whether there is a browser available on the system on which the script is run? Nothing happens when running the following code on a server:
try:
    webbrowser.open("file://" + os.path.realpath(path))
except webbrowser.Error:
    print "Something went wrong when opening webbrowser"
It's weird that no exception is caught, yet no browser opens. I'm running the script from the command line over an SSH connection, and I'm not very proficient in server-related stuff, so there may be another way of detecting this that I am missing.
Thanks!
Check out the documentation:
webbrowser.get([name])
Return a controller object for the browser type name. If name is empty, return a controller for a default browser appropriate to the caller’s environment.
This works for me:
try:
    # we are not really interested in the return value
    webbrowser.get()
    webbrowser.open("file://" + os.path.realpath(path))
except Exception as e:
    print "Webbrowser error: %s" % e
Output:
Webbrowser error: could not locate runnable browser
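If you'd rather have the check as a reusable test than a side effect, a small sketch along the same lines (the helper name browser_available is an assumption):
import webbrowser

def browser_available():
    """Hypothetical helper: True if webbrowser can find a runnable browser."""
    try:
        webbrowser.get()  # raises webbrowser.Error if no browser is found
        return True
    except webbrowser.Error:
        return False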
There's something weird happening with my code, I have a first function that goes like this :
def function1():
    try:  #1
        # try to open a file
        # read file
        # return info variable from the file
    except:  #1
        try:  #2
            # try to open a web page
            # read web page
            if directory1 not in directorylist:
                # create directory1
                # change working directory to directory1
            else:
                # change working directory to directory1
            # write web page content in a file
            # return info variable from the file
        except:  #2
            try:  #3
                # try to open a second web page
                # print error message 1
            except:  #3
                # print error message 2
                # set info variable to None
    # return info variable
So this function works perfectly when called in the main program, but when I try to call function1 inside another function2, both try #2 and except #2 are executed! directory1 is created and error message 1 is printed, and my info variable ends up as None.
How can calling function1 from a second function mess up the try and except clauses?
Thank you!
Why is it surprising? The try block executes until some exception is raised, after which the except block runs. So why does it look like both blocks got executed in spite of an exception?
One of the most likely reasons is that there is code in the try block that has nothing to do with the exception being handled. That's the primary purpose of the else block. Refactoring your code as follows might help:
try:
    # only statements that might raise the exception
except SomeException:
    # except block
else:
    # everything you wanted to do if no exception was raised
If it's a big chunk of code, the more of it you can move into the else block, the more smoothly things are likely to go.
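To make the pattern concrete, a minimal runnable sketch (the file name data.txt is just an illustration):
def read_info(path="data.txt"):
    """Minimal illustration: keep only the risky call in the try block."""
    try:
        f = open(path)              # the only statement expected to raise
    except IOError:
        print("could not open " + path)
        return None
    else:
        # runs only when open() succeeded; unrelated errors raised here
        # won't be swallowed by the except clause above
        with f:
            return f.read()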
If an exception is raised while executing the body of try #2, then of course except #2 will be executed. You should check what kind of exception is raised, and at which line.
I have a problem with identifying an exception.
I'm writing a scraper that scrapes a lot of different websites; some errors I want to handle and some I only want to ignore.
I catch my exceptions like this:
except Exception as e:
Most of the exceptions I can identify like this:
type(e).__name__ == "IOError"
But I have one exception, "[Errno 10054] An existing connection was forcibly closed by the remote host", whose type name is just "error", which is too vague - I'm guessing other exceptions have that name too. I'm guessing I can somehow get the errno number from my exception and thus identify it, but I don't know how.
First, you should not rely on the exception's class name but on the class itself - two classes from two different modules can have the same value for the __name__ attribute while being different exceptions. So what you want is:
try:
    something_that_may_raise()
except IOError as e:
    handle_io_error(e)
except SomeOtherError as e:
    handle_some_other_error(e)
etc...
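As for extracting the errno from the question's "[Errno 10054]" error: that exception is normally a socket.error, and the numeric code is available on its errno attribute, so you can match on the number rather than the class name. A hedged sketch (scrape_page and handle_connection_reset are hypothetical names):
import socket

try:
    scrape_page()                    # hypothetical scraping call
except socket.error as e:
    # 10054 is WSAECONNRESET on Windows: the remote host
    # forcibly closed the connection
    if e.errno == 10054:
        handle_connection_reset()    # hypothetical handler
    else:
        raise                        # not ours to handle - re-raise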
Then you have two kinds of exceptions: the ones that you can actually handle one way or another, and the other ones. If the program is only for your personal use, the best way to handle "the other ones" is usually to not handle them at all - the Python runtime will catch them and display a nice traceback with all relevant information (so you know what happened and where, and can eventually add some handling for that case).
If it's a "public" program and/or if you do have some things to clean up before the program crashes, you can add a last "catch all" except clause at the program's top level that logs the error and traceback somewhere so they aren't lost (logging.exception is your friend), cleans up what has to be cleaned up, and terminates with a friendlier error message.
There are very few cases where one would really want to just ignore an exception (I mean pretending nothing wrong or unexpected happened and happily continuing). At the very least you will want to notify the user that one of the actions failed and why. In your case that might be a top-level loop iterating over a set of sites to scrape, with an inner try/except block catching "expected" error cases, i.e.:
# config:
config = [
    # ('url', {params})
    ('some.site.tld', {"param1" : value1, "param2" : value2}),
    ('some.other.tld', {"param1" : value1, "answer" : 42}),
    # etc
]

def run():
    for url, params in config:
        try:
            results = scrap(url, **params)
        except (SomeKnownError, SomeOtherExceptedException) as e:
            # things that are to be expected and mostly harmless
            #
            # you configured your logger so that warnings only
            # go to stderr
            logger.warning("failed to scrap %s : %s - skipping", url, e)
        except (MoreSeriousError, SomethingIWannaKnowAbout) as e:
            # things that are more annoying and you want to know
            # about but that shouldn't prevent continuing
            # with the remaining sites
            #
            # you configured your logger so that exceptions go
            # to both stderr and your email
            logger.exception("failed to scrap %s : %s - skipping", url, e)
        else:
            do_something_with(results)
Then have a top-level handler around the call to run() that takes care of unexpected errors :
def main(argv):
    parse_args()
    try:
        set_up_everything()
        run()
        return 0
    except Exception as e:
        logger.exception("oops, something unexpected happened : %s", e)
        return 1
    finally:
        do_some_cleanup()

if __name__ == "__main__":
    sys.exit(main(sys.argv))
Note that the logging module has an SMTPHandler - but since mail can easily fail too, you'd better still have a reliable local log (stderr, plus tee to a file?). The logging module takes some time to learn but it really pays off in the long run.
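The comment lines in run() above assume a logger wired up so that warnings go only to stderr while errors also go to email. A minimal, hypothetical configuration sketch (the SMTP host and addresses are placeholders):
import logging
from logging.handlers import SMTPHandler

logger = logging.getLogger("scraper")
logger.setLevel(logging.DEBUG)

# warnings and above go to stderr
console = logging.StreamHandler()
console.setLevel(logging.WARNING)
logger.addHandler(console)

# everything is kept in a local file as the reliable record
logfile = logging.FileHandler("scraper.log")
logfile.setLevel(logging.DEBUG)
logger.addHandler(logfile)

# errors (including tracebacks from logger.exception) go to email
mail = SMTPHandler(mailhost="localhost",              # placeholder
                   fromaddr="scraper@example.com",    # placeholder
                   toaddrs=["you@example.com"],       # placeholder
                   subject="scraper error")
mail.setLevel(logging.ERROR)
logger.addHandler(mail)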
I'm writing a script in Python to do a smoke test on a social network that implements a post feed.
I wrote a method that looks for the topmost post and returns it (it has the class "media"). You'll see that there are some time.sleep() and refresh() calls; that's because the server we use is horrible, to say the least, and loading fails very often, only rendering partial content, making a refresh necessary.
Here's where the problem is: when, and only when, br.refresh() is called, the object returned is None. If the page loads correctly and refresh() is not called, the object returned is correct.
Does anyone have any idea why this might happen? I tried implementing the method without the use of exceptions (in case that was what broke the return, somehow) without any success.
PS: Curiously enough, if instead of waiting for br.refresh() to be called I manually press the Refresh button in the "driven" browser, the object is returned perfectly.
Here's the code:
def getLastPost(br, count=0):
    try:
        elapsed = 0
        while br.find_elements_by_class_name("media") == [] and elapsed < 15:
            if elapsed % 5 == 0:
                log("Waiting...", "w")
            time.sleep(0.5)
            elapsed += 0.5
        if br.find_elements_by_class_name("media") == []:
            raise NoSuchElementException
        return br.find_elements_by_class_name("media")[0]
    except NoSuchElementException:
        if count >= 5:
            raise Exception("Element not found after 5 page reloads.")
        log("Element not loaded! Retrying.", "w")
        count += 1
        br.refresh()
        time.sleep(count)  # Wait a bit more each time.
        getLastPost(br, count)
And the error it gives when trying to read the returned object:
Traceback (most recent call last):
File "Smoke.py", line 37, in <module>
assert ("MESSAGE") in getLastPost(br).text
AttributeError: 'NoneType' object has no attribute 'text'
refresh() reloads the browser page. After that, all WebElements are invalid and you must locate them again.
Background: Selenium / WebDriver doesn't remember how you got the element; it asks the browser for a unique internal ID of the element, and when you, say, click on an element, it sends a message to the browser: "click on 34987563424563.34675".
A reload invalidates all internal IDs.
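For illustration, a minimal sketch of the consequence (br is the WebDriver instance, as in the question):
posts = br.find_elements_by_class_name("media")
first = posts[0]

br.refresh()

# 'first' now refers to a node from the page before the reload; using it
# (e.g. first.text) raises StaleElementReferenceException
first = br.find_elements_by_class_name("media")[0]  # locate it again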
The reason why you get None in the assert is that there is no return statement in the except clause (last line). Without an explicit return, all Python functions return None. Try return getLastPost(br, count)
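Concretely, the corrected except clause from the question's getLastPost would end like this (a sketch of just the fix, the rest of the function unchanged):
    except NoSuchElementException:
        if count >= 5:
            raise Exception("Element not found after 5 page reloads.")
        log("Element not loaded! Retrying.", "w")
        count += 1
        br.refresh()
        time.sleep(count)                # Wait a bit more each time.
        return getLastPost(br, count)    # without this return, the result
                                         # of the recursion is discarded and
                                         # the original call returns None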
When I use Selenium RC, I sometimes get an error, but sometimes not. I guess it's related to the timeout of wait_for_page_to_load(), but I don't know how long it needs.
The error information:
Exception: Timed out after 30000ms
File "C:\Users\Herta\Desktop\test\newtest.py", line 9, in <module>
sel.open(url)
File "C:\Users\Herta\Desktop\test\selenium.py", line 764, in open
self.do_command("open", [url,])
File "C:\Users\Herta\Desktop\test\selenium.py", line 215, in do_command
raise Exception, data
This is my program:
from selenium import selenium
url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.start()
sel.open(url)
sel.wait_for_page_to_load(1000)
f = sel.get_html_source()
sav = open('test.html','w')
sav.write(f)
sav.close()
sel.stop()
Timing is a big issue when automating UI pages. You want to make sure you use timeouts when needed and provide the needed time for certain events. I see that you have
sel.open(url)
sel.wait_for_page_to_load(1000)
The sel.wait_for_page_to_load command after a sel.open call is redundant: all sel.open commands have a built-in wait. This may be the cause of your problem, because selenium waits as part of the built-in process of the sel.open command and is then told to wait again for a page to load. Since no page is loading, it throws an error.
However, this is unlikely, since the trace shows the error being thrown on the sel.open command itself. Wawa's response may be your best bet.
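For illustration, a hedged sketch of the call sequence this advice implies - set the timeout first and let sel.open do its own waiting (the 60-second value is just an example):
from selenium import selenium

url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.start()
sel.set_timeout('60000')  # milliseconds; applies to sel.open's built-in wait
sel.open(url)             # waits for the page itself; no extra
                          # wait_for_page_to_load call needed
f = sel.get_html_source()
sel.stop()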
The "Timed out after 30000ms" message is coming from the sel.open(url) call which uses the selenium default timeout. Try increasing this time using sel.set_timeout("timeout"). I would suggest 60 seconds as a good starting point, if 60 seconds doesn't work, try increasing the timeout. Also make sure that you can get to the page normally.
from selenium import selenium
url = 'http://receptome.stanford.edu/hpmr/SearchDB/getGenePage.asp?Param=4502931&ProtId=1&ProtType=Receptor#'
sel = selenium('localhost', 4444, '*firefox', url)
sel.set_timeout('60000')
sel.start()
sel.open(url)
sel.wait_for_page_to_load(1000)
f = sel.get_html_source()
sav = open('test.html','w')
sav.write(f)
sav.close()
sel.stop()
I had this problem and it was Windows Firewall blocking the Selenium server. Have you tried adding an exception to your firewall?