I know that requests.get() provides an HTTP interface so that the programmer can make various requests to an HTTP server.
That tells me that somewhere a port must be opened so that the request can happen.
Taking that into account, what would happen if the script is stopped (say, by a KeyboardInterrupt, so the machine that is executing the script remains connected to the internet) before the request is answered/completed?
Would the port/connection remain opened?
Does the port/connection close automatically?
The short answer to the question is: requests will close a connection in the case of any exception, including KeyboardInterrupt and SystemExit.
A little digging into the requests source code reveals that requests.get ultimately calls the HTTPAdapter.send method (which is where all the magic happens).
There are two ways in which a request might be made within the send method: chunked or not chunked. Which send we perform depends on the value of the request.body and the Content-Length header:
chunked = not (request.body is None or 'Content-Length' in request.headers)
In the case where the request body is None or the Content-Length is set, requests will make use of the high-level urlopen method of urllib3:
if not chunked:
    resp = conn.urlopen(
        method=request.method,
        url=url,
        body=request.body,
        # ...
    )
The finally block of the urllib3.HTTPConnectionPool.urlopen method (conn here is a connection pool, not a PoolManager) has code that handles closing the connection in the case where the try block didn't execute successfully:
clean_exit = False
# ...
try:
    # ...
    # Everything went great!
    clean_exit = True
finally:
    if not clean_exit:
        # We hit some kind of exception, handled or otherwise. We need
        # to throw the connection away unless explicitly told not to.
        # Close the connection, set the variable to None, and make sure
        # we put the None back in the pool to avoid leaking it.
        conn = conn and conn.close()
        release_this_conn = True
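The pattern is easy to try outside urllib3. Here is a minimal sketch, assuming a made-up FakeConnection class and a hypothetical do_request helper (neither is part of requests or urllib3). It shows that a finally block guarded by a flag still runs for KeyboardInterrupt, which is not a subclass of Exception:

```python
class FakeConnection:
    """Stand-in for a pooled connection; not a urllib3 class."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def do_request(conn, work):
    """Run work(conn); close conn unless work finished normally."""
    clean_exit = False
    try:
        result = work(conn)
        clean_exit = True  # only reached when nothing was raised
        return result
    finally:
        if not clean_exit:
            # Runs for any exception, including KeyboardInterrupt.
            conn.close()


def interrupted(conn):
    raise KeyboardInterrupt  # simulate Ctrl-C in the middle of a request


conn = FakeConnection()
try:
    do_request(conn, interrupted)
except KeyboardInterrupt:
    pass
print(conn.closed)  # True
```

A successful call leaves the connection open, which matches the idea of handing it back to a pool for reuse.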
In the case where the request is chunked, requests goes a bit lower level and uses the underlying low-level connection provided by urllib3. Here requests still handles the exception itself, with a try / except block that starts immediately after grabbing a connection and finishes with:
low_conn = conn._get_conn(timeout=DEFAULT_POOL_TIMEOUT)
try:
    # ...
except:
    # If we hit any problems here, clean up the connection.
    # Then, reraise so that we can handle the actual exception.
    low_conn.close()
    raise
Interestingly the connection may not be closed if there are no errors, depending on how you have configured connection pooling for urllib3. In the case of a successful execution, the connection is put back into the connection pool (though I cannot find a _put_conn call in the requests source for the chunked send, which might be a bug in the chunked work-flow).
On a much lower level, when a program exits, the OS kernel closes all file descriptors opened by that program. These include network sockets.
This is the first time I am working with a REST API in a Jupyter notebook, and I don't know what I am doing wrong here. When I try to execute the following code in a cell, the cell runs forever without throwing any errors. At first I did not include the close method from the requests package, but then I thought the problem might be the open connection. However, including the close method did not help either. Do you know what the reason could be?
api_key = "exampletoken"
header = {'authorization':"Bearer {}".format(api_key)}
payload = {}
r = request.post('exampleurl', headers = header, data = payload)
r.close()
Thanks in advance!
runs forever without throwing any errors.
By default, requests does not time out, so it can wait indefinitely. That could cause the behavior you describe, and it would mean the server did not respond. To check whether that is the cause, set a timeout, for example
r = request.post('exampleurl', headers = header, data = payload, timeout=180)
would raise a requests.exceptions.Timeout if the server sent no response within 180 seconds (i.e. 3 minutes). If you want to know more about timeouts in requests, I suggest reading the realpython.com tutorial.
To simplify things, assume a TCP client-server app where the client sends a request and the server responds. The server uses sendall to respond to each client.
Now assume a bad client that sends requests to the server but doesn't really handle the responses. I.e. the client never calls socket.recv. (It doesn't have to be a bad client btw...it may be a slow consumer on the other end).
What ends up happening is that the server keeps sending responses using sendall until, I assume, a buffer fills up; at some point sendall blocks and never returns.
This seems like a common problem to me so what would be the recommended solution?
Is there something like a try-send that would raise or return an EWOULDBLOCK (or similar) if the recipient's buffer is full? I'd like to avoid non-blocking select type calls if possible (happy to go that way if there are no alternatives).
Thank you in advance.
Following rveed's comment, here's a solution that works for my case:
def send_to_socket(self, sock: socket.socket, message: bytes) -> bool:
    try:
        sock.settimeout(10.0)  # protect against bad clients / slow consumers by making this timeout (instead of blocking)
        res = sock.sendall(message)
        sock.settimeout(None)  # put back to blocking (if needed for subsequent calls to recv, etc. using this socket)
        if res is not None:
            return False
        return True
    except socket.timeout as st:
        # do whatever you need to here
        return False
    except Exception as ex:
        # handle other exceptions here
        return False
If needed, instead of setting the timeout to None afterwards (i.e. back to blocking), you can store the previous timeout value (using gettimeout) and restore to that.
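A minimal sketch of that variant, assuming a standalone helper named send_with_timeout (a hypothetical name, not from the code above). It saves the current timeout with gettimeout and restores it in a finally block, so the socket returns to its earlier mode even when the send times out:

```python
import socket


def send_with_timeout(sock, message, timeout=10.0):
    """Send message with a temporary timeout; return True on success."""
    previous = sock.gettimeout()  # None means blocking mode
    sock.settimeout(timeout)
    try:
        sock.sendall(message)  # returns None on success, raises on failure
        return True
    except socket.timeout:
        # Peer did not drain its buffer in time (slow or stuck consumer).
        return False
    finally:
        sock.settimeout(previous)  # restored on every path
```

After a timed-out sendall there is no way to tell how much of the message went out, so the usual next step is to close that client's socket rather than keep writing to it.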
I am trying to download a batch of files with Python. I use the requests module with stream turned on; in other words, I retrieve each file in 200K blocks.
However, the download sometimes just gets stuck (no response) without any error. I guess this is because the connection between my computer and the server was not stable enough. My question is: how can I detect this kind of stall and make a new connection?
You probably don't want to detect this from outside when you can just use timeouts to make requests fail, instead of hanging, if the server stops sending bytes.
Since you didn't show us your code, it's hard to show you how to change it… but I'll show you how to change some other code:
import requests

# hanging
text = requests.get(url).text

# not hanging
try:
    text = requests.get(url, timeout=10.0).text
except requests.exceptions.Timeout:
    text = None  # failed, do something else

# trying until success
while True:
    try:
        text = requests.get(url, timeout=10.0).text
        break
    except requests.exceptions.Timeout:
        pass
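For the streaming case in the question, the same idea might look like the sketch below. It rests on several assumptions: the download function and its get parameter are hypothetical names (get exists only so the HTTP call can be swapped out), the server is assumed to honor Range requests, and a read timeout in the middle of the body tends to surface as ConnectionError rather than Timeout, so both are caught:

```python
import requests

CHUNK = 200 * 1024  # 200 KB blocks, as in the question


def download(url, path, timeout=10.0, max_attempts=5, get=requests.get):
    """Stream url into path; after a stall, reconnect and resume."""
    done = 0  # bytes written so far
    with open(path, "wb") as f:
        for _ in range(max_attempts):
            headers = {"Range": "bytes=%d-" % done} if done else {}
            try:
                with get(url, stream=True, timeout=timeout, headers=headers) as r:
                    r.raise_for_status()
                    if done and r.status_code != 206:
                        # Server ignored the Range header: start over.
                        f.seek(0)
                        f.truncate()
                        done = 0
                    for block in r.iter_content(chunk_size=CHUNK):
                        f.write(block)
                        done += len(block)
                return done
            except (requests.exceptions.Timeout,
                    requests.exceptions.ConnectionError):
                continue  # stalled or dropped: open a new connection
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Note that the timeout here bounds the wait for each block, not the whole download, which is what you want for detecting a stalled connection.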
If you do want to detect it from outside for some reason, you'll need to use multiprocessing or similar to move the requests-driven code to a child process. Ideally you'll want it to post updates on some Queue or set and notify some Condition-protected shared flag variable every 200KB, then the main process can block on the Queue or Condition and kill the child process if it times out. For example (pseudocode):
def _download(url, q):
    create request
    for each 200kb block downloaded:
        q.put(buf)
    q.put(None)  # sentinel: download finished

def download(url):
    q = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_download, args=(url, q))
    proc.start()
    try:
        # iter() stops at the None sentinel; q.get raises queue.Empty
        # if no block arrives within 10 seconds
        return b''.join(iter(functools.partial(q.get, timeout=10.0), None))
    except queue.Empty:
        proc.kill()
        # failed, do something else
I am trying to force Python to retry loading the page when I get a timeout error. Is there a way that I can make it retry a specific number of times, possibly after a specific time delay?
Any help would be appreciated.
Thank you.
urllib2 doesn't have anything built-in for that, but you can write it yourself.
The tricky part is that, as the urlopen docs say, no matter what goes wrong, you just get a URLError. So, how do you know whether it was a timeout, or something else?
Well, if you look up URLError, it says it will have a reason which will be a socket.error for remote URLs. And if you look up socket.error it tells you that it's a subclass of either IOError or OSError (depending on your Python version). And if you look up OSError, it tells you that it has an errno that represents the underlying error.
So, which errno value do you get for timeout? I'm willing to bet it's EINPROGRESS, but let's find out for sure:
>>> urllib2.urlopen('http://127.0.0.1', timeout=0)
urllib2.URLError: <urlopen error [Errno 36] Operation now in progress>
>>> errno.errorcode[36]
'EINPROGRESS'
(You could just use the number 36, but that's not guaranteed to be the same across platforms; errno.EINPROGRESS should be more portable.)
So:
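import errno
import socket
import urllib2

def retrying_urlopen(retries, *args, **kwargs):
    for i in range(retries):
        try:
            return urllib2.urlopen(*args, **kwargs)
        except urllib2.URLError as e:
            # timeout=0 yields EINPROGRESS; a positive timeout that
            # expires shows up as a socket.timeout reason instead.
            if isinstance(e.reason, socket.timeout):
                continue
            if getattr(e.reason, 'errno', None) == errno.EINPROGRESS:
                continue
            raise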
If you think this sucks and should be a lot less clunky… well, I think everyone agrees. Exceptions have been radically improved twice, with another big one coming up, plus various small changes along the way. But if you stick with 2.7, you don't get the benefits of those improvements.
If moving to Python 3.4 isn't possible, maybe moving to a third-party module like requests or urllib3 is. Both of those libraries have a separate exception type for Timeout, instead of making you grub through the details of a generic URLError.
Check out the requests library. If you'd like to wait only for a specified amount of time (not for the entire download, just until you get a response from the server), just add the timeout argument to the standard URL request, in seconds:
r = requests.get(url, timeout=10)
If the timeout time is exceeded, it raises a requests.exceptions.Timeout exception, which can be handled however you wish. As an example, you could put the request in a try/except block, catch the exception if it's raised, and retry the connection again for a specified number of times before failing completely.
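A minimal sketch of such a retry loop, assuming a helper named get_with_retries (a hypothetical name; the get parameter exists only so the HTTP call can be swapped out) and a fixed delay between attempts:

```python
import time

import requests


def get_with_retries(url, attempts=3, delay=2.0, timeout=10.0, get=requests.get):
    """Call get(url, timeout=...) up to `attempts` times, sleeping in between."""
    for attempt in range(1, attempts + 1):
        try:
            return get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the Timeout
            time.sleep(delay)
```

Re-raising on the last attempt keeps the failure visible to the caller instead of quietly returning None.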
You might also want to check out requests.adapters.HTTPAdapter, which has a max_retries argument. It's typically used within a Requests Session, and according to the docs, it provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface.
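As a rough sketch of that approach, a session can be given an adapter with a retry policy. The numbers below are arbitrary choices for illustration, and the Retry class comes from urllib3, which requests builds on:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                           # at most 3 retries per request
    backoff_factor=0.5,                # growing sleep between attempts
    status_forcelist=[502, 503, 504],  # also retry on these responses
)
session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

# The timeout is still passed per request; the adapter does not set one.
# r = session.get("https://example.com/", timeout=10)
```

Passing a plain integer as max_retries also works, but a Retry object gives control over the delay between attempts.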
I am new to Python too, but I think even a simple solution like this could do the trick.
Begin by setting stuff to None, where stuff is the page source. Note that I have only handled the URLError exception; you might want to add more as needed.
import urllib2
import time

stuff = None
max_attempts = 4
r = 0
while stuff is None and r < max_attempts:
    try:
        response = urllib2.urlopen('http://www.google.com/ncr', timeout=10)
        stuff = response.read()
    except urllib2.URLError:
        r = r + 1
        print "Re-trying, attempt -- ", r
        time.sleep(5)
print stuff
Hope that helps.
Regards,
Md. Mohsin
I have a client that connects to an HTTP stream and logs the text data it consumes.
I send the streaming server an HTTP GET request... The server replies and continuously publishes data... It will either publish text or send a ping (text) message regularly... and will never close the connection.
I need to read and log the data it consumes in a non-blocking manner.
I am doing something like this:
import urllib2
req = urllib2.urlopen(url)
for dat in req:
    with open('out.txt', 'a') as f:
        f.write(dat)
My questions are:
1. Will this ever block when the stream is continuous?
2. How much data is read in each chunk, and can it be specified/tuned?
3. Is this the best way to read/log an HTTP stream?
Hey, that's three questions in one! ;-)
It could block sometimes - even if your server is generating data quite quickly, network bottlenecks could in theory cause your reads to block.
Reading the URL data using "for dat in req" will mean reading a line at a time - not really useful if you're reading binary data such as an image. You get better control if you use
chunk = req.read(size)
which can of course block.
Whether it's the best way depends on specifics not available in your question. For example, if you need to run with no blocking calls whatever, you'll need to consider a framework like Twisted. If you don't want blocking to hold you up and don't want to use Twisted (which is a whole new paradigm compared to the blocking way of doing things), then you can spin up a thread to do the reading and writing to file, while your main thread goes on its merry way:
import threading

def func(req):
    # code the read from URL stream and write to file here
    ...

t = threading.Thread(target=func, args=(req,))
t.start()  # will execute func in a separate thread
...
t.join()  # will wait for spawned thread to die
Obviously, I've omitted error checking/exception handling etc. but hopefully it's enough to give you the picture.
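To make the thread idea concrete, here is a minimal Python 3 sketch. The log_stream function is a hypothetical name, and io.BytesIO stands in for the HTTP response so the example runs without a server; any object with a read(size) method could be passed instead. The file is opened once rather than once per chunk:

```python
import io
import os
import tempfile
import threading


def log_stream(stream, path, chunk_size=1024):
    """Copy stream to path in chunks; the file is opened only once."""
    with open(path, "ab") as f:
        while True:
            chunk = stream.read(chunk_size)  # may block until data arrives
            if not chunk:  # empty read: the stream ended
                break
            f.write(chunk)
            f.flush()  # make the log visible while the stream is live


# io.BytesIO stands in for the HTTP response object here.
stream = io.BytesIO(b"ping\n" * 1000)
path = os.path.join(tempfile.mkdtemp(), "out.txt")
t = threading.Thread(target=log_stream, args=(stream, path), daemon=True)
t.start()
# ... the main thread is free to do other work here ...
t.join()
print(os.path.getsize(path))  # 5000
```

The thread is marked as a daemon so that a stream which never ends does not keep the process alive on exit; with a real endless stream you would skip the join.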
You're using too high-level an interface to have good control over such issues as blocking and buffer block sizes. If you're not willing to go all the way to an async interface (in which case twisted, already suggested, is hard to beat!), why not httplib, which is after all in the standard library? The .read(amount) method of an HTTPResponse instance is more likely to block for no longer than needed to read amount bytes than the similar method on the object returned by urlopen (although admittedly there are no documented specs about that on either module, hmmm...).
Another option is to use the socket module directly. Establish a connection, send the HTTP request, set the socket to non-blocking mode, and then read the data with socket.recv() handling 'Resource temporarily unavailable' exceptions (which means that there is nothing to read). A very rough example is this:
import socket, time

BUFSIZE = 1024
s = socket.socket()
s.connect(('localhost', 1234))
s.send('GET /path HTTP/1.0\r\n\r\n')  # HTTP line endings are CRLF
s.setblocking(False)
running = True
while running:
    try:
        print "Attempting to read from socket..."
        while True:
            data = s.recv(BUFSIZE)
            if len(data) == 0:    # remote end closed
                print "Remote end closed"
                running = False
                break
            print "Received %d bytes: %r" % (len(data), data)
    except socket.error, e:
        if e[0] != 11:    # 11 is EAGAIN on Linux: Resource temporarily unavailable
            print e
            raise
    # perform other program tasks
    print "Sleeping..."
    time.sleep(1)
However, urllib.urlopen() has some benefits if the web server redirects, if you need URL-based basic authentication, and so on. You could also make use of the select module, which will tell you when there is data to read.
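A minimal Python 3 sketch of the select approach, assuming a helper named recv_if_ready (a hypothetical name) and using socket.socketpair so that it runs without a server:

```python
import select
import socket


def recv_if_ready(sock, bufsize=1024, wait=1.0):
    """Return received bytes, b"" if the peer closed, or None if
    nothing arrived within `wait` seconds."""
    readable, _, _ = select.select([sock], [], [], wait)
    if not readable:
        return None  # nothing to read yet; do other work and poll again
    return sock.recv(bufsize)


a, b = socket.socketpair()
print(recv_if_ready(b, wait=0.1))  # None: nothing sent yet
a.sendall(b"ping")
print(recv_if_ready(b, wait=0.1))  # b'ping'
a.close()
print(recv_if_ready(b, wait=0.1))  # b'': remote end closed
b.close()
```

Compared with the sleep loop, select returns as soon as data is available, so there is no fixed polling delay and no need to catch the "temporarily unavailable" error.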
Yes, when you catch up with the server it will block until the server produces more data.
Each dat will be one line, including the newline on the end.
twisted is a good option.
I would swap the with and for around in your example. Do you really want to open and close the file for every line that arrives?