I have a data-intensive Python script that uses HTTP connections to download data. I usually run it overnight. Sometimes the connection will fail, or a website will be unavailable momentarily. I have basic error-handling that catches these exceptions and tries again periodically, exiting gracefully (and logging errors) after 5 minutes of retrying.
However, I've noticed that sometimes the job just freezes. No error is thrown, and the job is still running, sometimes hours after the last print message.
What is the best way to:
monitor a Python script,
detect if it is unresponsive after a given interval,
exit it if it is unresponsive,
and start another one?
UPDATE
Thank you all for your help. As a few of you have pointed out, the urllib and socket modules don't have timeouts set properly. I am using Python 2.5 with the Freebase and urllib2 modules, and catching and handling MetawebErrors and urllib2.URLErrors. Here is a sample of the error output after the last script hung for 12 hours:
File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/session.py", line 369, in _httpreq_json
resp, body = self._httpreq(*args, **kws)
File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/session.py", line 355, in _httpreq
return self._http_request(url, method, body, headers)
File "/home/matthew/dev/projects/myapp_module/project/app/myapp/contrib/freebase/api/httpclients.py", line 33, in __call__
resp = self.opener.open(req)
File "/usr/lib/python2.5/urllib2.py", line 381, in open
response = self._open(req, data)
File "/usr/lib/python2.5/urllib2.py", line 399, in _open
'_open', req)
File "/usr/lib/python2.5/urllib2.py", line 360, in _call_chain
result = func(*args)
File "/usr/lib/python2.5/urllib2.py", line 1107, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.5/urllib2.py", line 1080, in do_open
r = h.getresponse()
File "/usr/lib/python2.5/httplib.py", line 928, in getresponse
response.begin()
File "/usr/lib/python2.5/httplib.py", line 385, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.5/httplib.py", line 343, in _read_status
line = self.fp.readline()
File "/usr/lib/python2.5/socket.py", line 372, in readline
data = recv(1)
KeyboardInterrupt
You'll notice at the bottom that the script was blocked in a socket read (the KeyboardInterrupt is just me breaking the hang with Ctrl-C). Since I'm using Python 2.5 and don't have access to the third urllib2.urlopen argument, is there another way to watch for and catch this error? For example, I'm catching URLErrors - is there another type of error in urllib2 or socket that I can catch which will help me?
It sounds like there is a bug in your script. The answer is not to monitor the bug, but to hunt down the bug and fix it.
We can't help you find the bug without seeing some code. But as a general idea, you might want to use logging to pinpoint where the problem is occurring, and write unit tests to help you build confidence about which parts of your code do not have the bug.
Another idea is to break your "stuck" program with Ctrl-C and to study the traceback message. It will show you what line your program was last executing.
That may give you a clue where the script is going wrong.
Since the program is doing web communication, I'd fire up a debugging proxy like Charles (http://www.charlesproxy.com/) and see if there's anything kooky happening in the back-and-forth between your script and the server.
Also consider that the socket module has no timeout set by default and can therefore hang. As of Python 2.6, however, you can pass a third argument to urllib2.urlopen (if you are using urllib2, that is), specifying a request timeout period in seconds. That way the script will error out rather than go catatonic waiting for a response from a possibly uncooperative server. If you haven't already, I'd check these things out before trying anything more elaborate.
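On Python 2.6, for example, the call might look like this (a minimal sketch; the URL and the 10-second value are placeholders):

import urllib2

try:
    # the third positional argument is the timeout in seconds (Python 2.6+)
    response = urllib2.urlopen('http://www.example.com/', None, 10)
    data = response.read()
except urllib2.URLError, e:
    print "request failed or timed out:", e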
Update for Python 2.5:
To do this in Python < 2.6, you have to set the timeout value directly in the socket module, which urllib2 uses. I haven't tried this, but it presumably works. I found this info at http://www.voidspace.org.uk/python/articles/urllib2.shtml:
import socket
import urllib2
# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)
# this call to urllib2.urlopen now uses the default timeout
# we have set in the socket module
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
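With that default in place, a hung read should surface as an exception instead of blocking forever. A sketch of what to catch (exactly which exception fires depends on whether the timeout hits during connect or during read):

import socket
import urllib2

socket.setdefaulttimeout(10)

try:
    response = urllib2.urlopen('http://www.example.com/')
    data = response.read()
except urllib2.URLError, e:
    # connect-phase timeouts tend to be wrapped in URLError
    print "URLError:", e
except socket.timeout, e:
    # read-phase timeouts can surface as socket.timeout directly
    print "socket timeout:", e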
A simple way to do what you ask is to have your current program send UDP packets to a separate watchdog program that monitors them. If the watchdog doesn't receive a packet within a certain amount of time, it kills the other Python process and then starts a new one. A sketch of the idea follows.
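This is only a sketch, not tested code: the port, the timeout, and the 'worker.py' script name are all arbitrary assumptions. First, the worker sends a heartbeat after each unit of work:

import socket

HEARTBEAT_ADDR = ('127.0.0.1', 9876)  # arbitrary local port
beacon = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def heartbeat():
    # fire-and-forget; UDP won't block even if nobody is listening
    beacon.sendto('alive', HEARTBEAT_ADDR)

Then the watchdog kills and restarts the worker whenever the heartbeats stop:

import os
import signal
import socket
import subprocess

TIMEOUT = 300  # five minutes of silence means "hung"

listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(('127.0.0.1', 9876))
listener.settimeout(TIMEOUT)

def start_worker():
    return subprocess.Popen(['python', 'worker.py'])

worker = start_worker()
while True:
    try:
        listener.recv(1024)                  # heartbeat arrived; all is well
    except socket.timeout:
        os.kill(worker.pid, signal.SIGKILL)  # silent too long: assume hung
        worker.wait()
        worker = start_worker()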
You could run your script in pdb and break in when you suspect it's frozen. It won't work on its own, but might help you figure out why it's freezing.
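For example (assuming pdb's usual behaviour of entering post-mortem mode on an uncaught exception, KeyboardInterrupt included):

python -m pdb myscript.py

When the script looks frozen, press Ctrl-C; at the resulting (Pdb) prompt, the where command should print the stack that was executing at the moment of the hang.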
Related
OS: Windows 7; Python 2.7.3 using the Python GUI Shell
I'm trying to read a website through Python, and several authors use the urllib and urllib2 libraries. To store the site in a variable, I've seen this approach proposed:
import urllib
import urllib2
g = "http://www.google.com/"
read = urllib2.urlopen(g)
The last line generates an error after 120+ seconds:
Traceback (most recent call last):
File "<pyshell#27>", line 1, in <module>
r = urllib2.urlopen(o)
File "C:\Python27\lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "C:\Python27\lib\urllib2.py", line 400, in open
response = self._open(req, data)
File "C:\Python27\lib\urllib2.py", line 418, in _open
'_open', req)
File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Python27\lib\urllib2.py", line 1177, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
I tried bypassing the g variable and calling urlopen("http://www.google.com/") directly, with no success either (it generates the same error after the same length of time).
Error code 10060 means the client cannot connect to the remote peer. It might be caused by a network problem or, more likely, by local configuration issues such as proxy settings.
You could try connecting to the same host with other tools (such as ncat) and/or from another PC within your local network to find out where the problem is occurring.
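You can also test raw TCP reachability from Python itself (2.6+), sidestepping urllib2 and any proxy layer; a sketch, with host and port as placeholders:

import socket

# if this also fails with 10060, the problem is below urllib2
# (network/firewall); if it connects, suspect proxy settings
s = socket.create_connection(('www.google.com', 80), timeout=10)
s.close()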
For proxy issues, there is some material here:
Using an HTTP PROXY - Python
Why can't I get Python's urlopen() method to work on Windows?
Hope it helps!
Answer (basic is advanced!):
Error: 10060
Adding a timeout parameter to request solved the issue for me.
Example 1
import urllib
import urllib2
g = "http://www.google.com/"
read = urllib2.urlopen(g, timeout=20)
Example 2
A similar error also occurred while I was making a GET request. Again, passing a timeout parameter solved the 10060 Error.
import requests

response = requests.get(param_url, timeout=20)
This is because of the proxy settings.
I also had the same problem, and while it lasted I could not use any of the modules that fetch data from the internet.
There are simple steps to follow:
1. Open the Control Panel.
2. Open Internet Options.
3. Under the Connections tab, open LAN settings.
4. Go to advanced settings and uncheck everything; delete every proxy in there. Or you can just uncheck the checkbox under "Proxy server", which does the same thing.
5. Save all the settings by clicking OK.
You are done. Try running the program again; it should work now. It worked for me, at least. (If you'd rather not touch system settings, see the sketch below.)
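A programmatic alternative, I believe, is to tell urllib2 to ignore proxies for just this script by installing an opener built around an empty ProxyHandler (a sketch):

import urllib2

# an empty dict disables proxy detection for everything opened
# through this opener
no_proxy_opener = urllib2.build_opener(urllib2.ProxyHandler({}))
urllib2.install_opener(no_proxy_opener)

response = urllib2.urlopen('http://www.google.com/', timeout=20)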
Just change your internet connection; that made it work for me.
Since we started running selenium UI tests in jenkins, we noticed a small but annoying frequency of errors during tests. We get BadStatusLine and CannotSendRequest errors on seemingly random selenium actions (click, quit, visit, etc.).
They would usually look something like:
File "/usr/lib/python2.7/unittest/case.py", line 327, in run
testMethod()
File "/home/jenkins/workspace/Create and Upload Functional Testing/shapeways/test_suite/Portal/CreateAndUpload/TestUploadWhenNotLoggedIn_ExpectLoginModal.py", line 22, in runTest
self.dw.visit(ShapewaysUrlBuilder.build_model_upload_url())
File "/home/jenkins/workspace/Create and Upload Functional Testing/webdriver/webdriverwrapper/WebDriverWrapper.py", line 212, in visit
return self.execute_and_handle_webdriver_exceptions(lambda: _visit(url))
File "/home/jenkins/workspace/Create and Upload Functional Testing/webdriver/webdriverwrapper/WebDriverWrapper.py", line 887, in execute_and_handle_webdriver_exceptions
return function_to_execute()
File "/home/jenkins/workspace/Create and Upload Functional Testing/webdriver/webdriverwrapper/WebDriverWrapper.py", line 212, in <lambda>
return self.execute_and_handle_webdriver_exceptions(lambda: _visit(url))
File "/home/jenkins/workspace/Create and Upload Functional Testing/webdriver/webdriverwrapper/WebDriverWrapper.py", line 205, in _visit
return self.driver.get(url)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 185, in get
self.execute(Command.GET, {'url': url})
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 171, in execute
response = self.command_executor.execute(driver_command, params)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
return self._request(command_info[0], url, body=data)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/remote_connection.py", line 380, in _request
resp = self._conn.getresponse()
File "/usr/lib/python2.7/httplib.py", line 1030, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
raise BadStatusLine(line)
This particular case came from the following stack:
selenium==2.44.0
python==2.7.3
firefox==34.0
jenkins
xvfb (using the jenkins plugin for headless displays)
though we've seen these errors popping up all the time across many different version permutations of firefox/selenium.
I ran a tcpdump to capture the actual request sent right before the BadStatusLine error came up and got the following.
POST /hub/session/ab64574a-4a17-447a-b2e8-5b0f5ed5e923/url HTTP/1.1
Host: 127.0.0.1:41246
Accept-Encoding: identity
Content-Length: 102
Connection: keep-alive
Content-type: application/json;charset="UTF-8"
POST: /hub/session/ab64574a-4a17-447a-b2e8-5b0f5ed5e923/url
Accept: application/json
User-Agent: Python http auth
{"url": "http://example.com/login", "sessionId": "ab64574a-4a17-447a-b2e8-5b0f5ed5e923"}
Response comes back with 0 bytes. So the BadStatusLine was caused by an empty response, which makes sense.
The question is, why would selenium's server return an empty response. If the server died, wouldn't we get a ConnectionError or something along those lines?
For a while, I had no repro and no idea what the cause was. I was finally able to repro by running:
import requests
import json

while True:
    requests.post('http://127.0.0.1/hub/session/',
                  data=json.dumps({"url": "http://example.com/login",
                                   "sessionId": "ab64574a-4a17-447a-b2e8-5b0f5ed5e923"}))
While this was running, I quit the browser and got a BadStatusLine error! When I tried making that request again, that's when I got the expected "ConnectionError" that you would see from any dead server.
So, what I suspect happens is that when the browser is sent the kill signal, there is a short window during its shutdown where any response will still be returned, but with 0 bytes. That's why you get different types of exceptions for essentially the same problem (the browser dies). It turns out we had a cron job that was killing our browsers in the background.
I'm getting some weird behavior -- when the application starts up a new instance for the first time, I get a DeadlineExceededError. When I hit refresh in the browser, it works just fine, and it doesn't matter which page I try. The strange thing is that I can see all my debugging code just fine. In fact, I write to the log just prior to calling self.response, and it shows up in the console's log. This is pretty hard to troubleshoot, since I'm not having any page-load problems in the development environment, and the traceback is a bit opaque to me:
E 2013-09-29 00:10:03.975
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 267, in Handle
for chunk in result:
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/appstats/recording.py", line 1286, in appstats_wsgi_wrapper
end_recording(status)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/appstats/recording.py", line 1410, in end_recording
rec.save()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/appstats/recording.py", line 654, in save
key, len_part, len_full = self._save()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/appstats/recording.py", line 678, in _save
namespace=config.KEY_NAMESPACE)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/memcache/__init__.py", line 1008, in set_multi
namespace=namespace)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/memcache/__init__.py", line 907, in _set_multi_with_policy
status_dict = rpc.get_result()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 612, in get_result
return self.__get_result_hook(self)
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/memcache/__init__.py", line 974, in __set_with_policy_hook
rpc.check_success()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 578, in check_success
self.__rpc.CheckSuccess()
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/api/apiproxy_rpc.py", line 133, in CheckSuccess
raise self.exception
DeadlineExceededError
I 2013-09-29 00:10:03.988
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
I'm not sure how to even go about debugging this, since the error seems to be after all my code has already run.
Edit: I should add this:
I 2013-09-29 00:09:06.919
DEBUG: Writing output!
E 2013-09-29 00:10:03.975
You can see there's nearly a full minute between logging "Writing output!" just before self.response is called, and when the error occurs.
A DeadlineExceededError happens in App Engine if a request to a frontend instance does not get a response within 60 seconds. So what must be happening in your case is this: when there is no running instance and your app receives a new user request, a new instance is started to handle it. The overall response time is then the instance startup time (library loading, initial data access, and so on) plus the time to process the user request, and that total blows past the deadline. When you access your app again immediately afterwards, an instance is already running, so the response time is just the request-processing time and you get no error.
Please check the suggested approaches for handling DeadlineExceededError, including warmup requests, which keep an instance ready before a live user request arrives (a minimal sketch follows).
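A sketch of a warmup handler for the Python 2.7 runtime, assuming webapp2 and that warmup requests are enabled in app.yaml (via inbound_services: warmup); the handler name and its body are illustrative:

import webapp2

class WarmupHandler(webapp2.RequestHandler):
    def get(self):
        # import heavy libraries / prime caches / touch the datastore here,
        # so a live user request doesn't pay the cold-start cost
        self.response.write('warmed up')

app = webapp2.WSGIApplication([('/_ah/warmup', WarmupHandler)])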
I've used two different python oauth libraries with Django to authenticate with twitter. The setup is on apache with WSGI. When I restart the server everything works great for about 10 minutes and then the httplib seems to lock up (see the following error).
I'm running only 1 process and 1 thread of WSGI but that seems to make no difference.
I cannot figure out why it's locking up and giving this CannotSendRequest error. I've spent a lot of hours on this frustrating problem. Any hints/suggestions of what it could be would be greatly appreciated.
File "/usr/lib/python2.5/site-packages/django/core/handlers/base.py", line 92, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "mypath/auth/decorators.py", line 9, in decorated
return f(*args, **kwargs)
File "mypath/auth/views.py", line 30, in login
token = get_unauthorized_token()
File "/root/storm/eye/auth/utils.py", line 49, in get_unauthorized_token
return oauth.OAuthToken.from_string(oauth_response(req))
File "mypath/auth/utils.py", line 41, in oauth_response
connection().request(req.http_method, req.to_url())
File "/usr/lib/python2.5/httplib.py", line 866, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.5/httplib.py", line 883, in _send_request
self.putrequest(method, url, **skips)
File "/usr/lib/python2.5/httplib.py", line 770, in putrequest
raise CannotSendRequest()
CannotSendRequest
This exception is raised when you reuse an httplib.HTTP object for a new request while you haven't called its getresponse() method for the previous request. Probably there was some other error before this one that left the connection in a broken state. The simplest reliable way to fix the problem is to create a new connection for each request instead of reusing one. Sure, it will be a bit slower, but I think that's not an issue given that you are running the application in a single process and a single thread.
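For example, something along these lines, reusing the names from your traceback (the Twitter API host is an assumption):

import httplib

def oauth_response(req):
    # a brand-new connection per request means a half-used or broken
    # keep-alive connection can never poison the next request
    conn = httplib.HTTPSConnection('api.twitter.com')  # assumed host
    try:
        conn.request(req.http_method, req.to_url())
        return conn.getresponse().read()
    finally:
        conn.close()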
Also check your Python version. I had a similar situation after updating from Py-2.6 to Py-2.7. In Py-2.6 everything worked without any problems. Py-2.7's httplib uses HTTP/1.1 by default, which meant the server did not send back the Connection: close option in the response, so the connection handling broke. In my case it worked with HTTP/1.0, though.
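For reference, httplib can be forced down to HTTP/1.0 by overriding two class attributes. They are private (note the leading underscores), so treat this as a workaround rather than a supported API:

import httplib

# makes every HTTPConnection created in this process speak HTTP/1.0
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'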
http.client.CannotSendRequest: Request-sent
I ran into this error while using the http.client module's HTTPConnection class because my host name was incorrect.
I need to set the value of a field in an XML file which exists on a remote Linux box. How do I find out which port I should connect to?
But even a simple ping call fails:
import xmlrpclib
server = xmlrpclib.ServerProxy('http://10.77.21.240:9000')
print server.ping()
print "I'm in hurray"
But instead I got:
Traceback (most recent call last):
File "ping.py", line 3, in <module>
print server.ping()
File "C:\Python26\lib\xmlrpclib.py", line 1199, in __call__
return self.__send(self.__name, args)
File "C:\Python26\lib\xmlrpclib.py", line 1489, in __request
verbose=self.__verbose
File "C:\Python26\lib\xmlrpclib.py", line 1235, in request
self.send_content(h, request_body)
File "C:\Python26\lib\xmlrpclib.py", line 1349, in send_content
connection.endheaders()
File "C:\Python26\lib\httplib.py", line 892, in endheaders
self._send_output()
File "C:\Python26\lib\httplib.py", line 764, in _send_output
self.send(msg)
File "C:\Python26\lib\httplib.py", line 723, in send
self.connect()
File "C:\Python26\lib\httplib.py", line 704, in connect
self.timeout)
File "C:\Python26\lib\socket.py", line 514, in create_connection
raise error, msg
socket.error: [Errno 10061] No connection could be made because the target machine actively refused it.
What did I do wrong?
A couple of things to try / think about:
Go to a command prompt on the remote host and type "netstat -nap | grep 9000". If you don't get back something interesting it means that nothing is running at port 9000.
You show the remote host at 10.77.21.240. This is an unroutable address on the net (AKA a private network address), so is the server itself (not just your app) pingable? If you are on Windows, go to Start -> Run and type "cmd". At the prompt, type "ping 10.77.21.240" and see what you get.
One more thought: the process may be up and running at 9000 on a reachable host, but it may have opened the port as 127.0.0.1:9000 instead of 0.0.0.0:9000. The first address will only be reachable by processes on the same machine, the second one will open the port on all available IP addresses the machine has.
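The difference shows up in how the server socket is bound. A minimal SimpleXMLRPCServer sketch (the ping function is hypothetical, mirroring the call above):

from SimpleXMLRPCServer import SimpleXMLRPCServer

# ('127.0.0.1', 9000) would be reachable only from the same machine;
# ('0.0.0.0', 9000) listens on every interface the host has
server = SimpleXMLRPCServer(('0.0.0.0', 9000))

def ping():
    return 'pong'

server.register_function(ping)
server.serve_forever()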
Update in response to comment: The fact that it shouldn't be a problem doesn't eliminate the possibility it is. When you are debugging something that should be working, but isn't, you need to get fairly pedantic about checking each step, allowing yourself no room for 'Oh, I know that couldn't be the problem.' -- this is a verbal 'handwave' (often accompanied by a real handwave). You'd be surprised how often the problem exists in exactly the area you are handwaving! It takes 3 seconds to do the ping test. If it works, you move on, if it doesn't work ...
The first three steps in dealing with any system problem are:
Is it plugged in?
Is it turned on?
Is it configured properly?
And you have to do this for each and every piece of hardware/software in the food chain from your keyboard to the app. I'd guess 80% of 'sudden' failures are items 1 or 2 -- yes, really. Cables are a huge pain in the ass.
When on the phone with novices I normally start by going for the long pass -- if they can get news.google.com in a browser and then click on a random story, then I know that in general the network is OK. Why the news and why a random story? To sidestep browser cache issues. I've lost count of the number of times my older sister has called me up and announced "The Internet is broken!" The first thing we do is the google news test. 99% of the time it works, so I have her fire up reverse-WinVNC (UltraVNC's SingleClick is a God-send), I get on her machine, and then we see what the real problem is.
If the long pass doesn't work, then I see if they can get to their router. Etc. etc.