Python Requests hanging/freezing

I'm using the requests library to get a lot of webpages from somewhere. Here's the pertinent code:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

response = requests.Session()
retries = Retry(total=5, backoff_factor=.1)
response.mount('http://', HTTPAdapter(max_retries=retries))
response = response.get(url)
After a while it just hangs/freezes (never on the same webpage) while getting the page. Here's the traceback when I interrupt it:
File "/Users/Student/Hockey/Scrape/html_pbp.py", line 21, in get_pbp
response = r.read().decode('utf-8')
File "/anaconda/lib/python3.6/http/client.py", line 456, in read
return self._readall_chunked()
File "/anaconda/lib/python3.6/http/client.py", line 566, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/anaconda/lib/python3.6/http/client.py", line 612, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/anaconda/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt
Does anybody know what could be causing it? Or (more importantly) does anybody know a way to stop it if it takes more than a certain amount of time so that I could try again?

Seems like setting a (read) timeout might help you.
Something along the lines of:
response = response.get(url, timeout=5)
(This will set both connect and read timeout to 5 seconds.)
In requests, unfortunately, neither connect nor read timeouts are set by default, even though the docs say it's good to set it:
Most requests to external servers should have a timeout attached, in case the server is not responding in a timely manner. By default, requests do not time out unless a timeout value is set explicitly. Without a timeout, your code may hang for minutes or more.
Just for completeness, the connect timeout is the number of seconds requests will wait for your client to establish a connection to a remote machine, and the read timeout is the number of seconds the client will wait between bytes sent from the server.
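If you want different values for the two, requests also accepts a (connect, read) tuple:
# Wait up to 3.05 s to establish the connection, and up to 27 s between bytes of the response.
response = requests.get(url, timeout=(3.05, 27))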

Patching the documented send function will fix this for all requests - even in many dependent libraries and SDKs. When patching libraries, be sure to patch supported/documented functions; otherwise you may wind up silently losing the effect of your patch.
import requests

DEFAULT_TIMEOUT = 180

old_send = requests.Session.send

def new_send(*args, **kwargs):
    if kwargs.get("timeout", None) is None:
        kwargs["timeout"] = DEFAULT_TIMEOUT
    return old_send(*args, **kwargs)

requests.Session.send = new_send
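A quick check of the patch (the URL is just a placeholder):
r = requests.get("https://example.com/slow-endpoint")             # now bounded by DEFAULT_TIMEOUT
r = requests.get("https://example.com/slow-endpoint", timeout=5)  # an explicit value still wins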
The effects of not having any timeout are quite severe, and the use of a default timeout can almost never break anything - because TCP itself has timeouts as well.
On Windows the default TCP timeout is 240 seconds, and the TCP RFCs recommend a minimum of 100 seconds for RTO * retries. Somewhere in that range is a safe default.

To set timeout globally instead of specifying in every request:
import os

import requests
from requests.adapters import TimeoutSauce

REQUESTS_TIMEOUT_SECONDS = float(os.getenv("REQUESTS_TIMEOUT_SECONDS", 5))

class CustomTimeout(TimeoutSauce):
    def __init__(self, *args, **kwargs):
        if kwargs["connect"] is None:
            kwargs["connect"] = REQUESTS_TIMEOUT_SECONDS
        if kwargs["read"] is None:
            kwargs["read"] = REQUESTS_TIMEOUT_SECONDS
        super().__init__(*args, **kwargs)

# Set it globally, instead of specifying ``timeout=..`` kwarg on each call.
requests.adapters.TimeoutSauce = CustomTimeout

sess = requests.Session()
sess.get(...)
sess.post(...)


Gremlin Python - "Server disconnected - please try to reconnect" error

I have a Flask web app in which I want to keep a persistent connection to an AWS Neptune graph database. This connection is established as follows:
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
neptune_endpt = 'db-instance-x.xxxxxxxxxx.xx-xxxxx-x.neptune.amazonaws.com'
remoteConn = DriverRemoteConnection(f'wss://{neptune_endpt}:8182/gremlin','g')
self.g = traversal().withRemote(remoteConn)
The issue I'm facing is that the connection automatically drops off if left idle, and I cannot find a way to detect if the connection has dropped off (so that I can reconnect by using the code snippet above).
I have seen this similar issue: Gremlin server withRemote connection closed - how to reconnect automatically? However, that question has no solution, and this similar question has no answer either.
I've tried the following two solutions (both of which did not work):
I set up my web app behind four Gunicorn workers with a timeout of 100 seconds, hoping that worker restarts would take care of Gremlin timeouts.
I tried catching exceptions to detect if the connection has dropped off. Every time I use self.g to do some traversal on my graph, I try to "refresh" the connection, by which I mean this:
def _refresh_neptune(self):
    try:
        self.g = traversal().withRemote(self.conn)
    except:
        self.conn = DriverRemoteConnection(f'wss://{neptune_endpt}:8182/gremlin', 'g')
        self.g = traversal().withRemote(self.conn)
Here self.conn was initialized as:
self.conn = DriverRemoteConnection(f'wss://{neptune_endpt}:8182/gremlin','g')
Is there any way to get around this connection error?
Thanks
Update: Added the error message below:
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/process/traversal.py
", line 58, in toList
return list(iter(self))
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/process/traversal.py
", line 48, in __next__
self.traversal_strategies.apply_strategies(self)
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/process/traversal.py
", line 573, in apply_strategies
traversal_strategy.apply(traversal)
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/driver/remote_connec
tion.py", line 149, in apply
remote_traversal = self.remote_connection.submit(traversal.bytecode)
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/driver/driver_remote
_connection.py", line 56, in submit
results = result_set.all().result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/driver/resultset.py"
, line 90, in cb
f.result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/driver/connection.py
", line 83, in _receive
status_code = self._protocol.data_received(data, self._results)
File "/home/ubuntu/.virtualenvs/rundev/lib/python3.6/site-packages/gremlin_python/driver/protocol.py",
line 81, in data_received
'message': 'Server disconnected - please try to reconnect', 'attributes': {}})
gremlin_python.driver.protocol.GremlinServerError: 500: Server disconnected - please try to reconnect
I am not sure that this is the best way to solve this, but I'm also using gremlin-python and Neptune and I've had the same issue. I worked around it by implementing a Transport that you can provide to DriverRemoteConnection.
DriverRemoteConnection(
    url=endpoint,
    traversal_source=self._traversal_source,
    transport_factory=Transport
)
gremlin-python returns connections to the pool on exception, and the exception raised when a connection is closed is GremlinServerError, which is also raised for other errors (see gremlin_python/driver/connection.py#L69 and gremlin_python/driver/protocol.py#L80).
The custom transport is the same as gremlin-python's TornadoTransport but the read and write methods are extended to:
Reopen closed connections, if the web socket client is closed
Raise a StreamClosedError, if the web socket client returns None from read_message
Dead connections that are added back to the pool can then be reopened, and you can handle the StreamClosedError to apply some retry logic. I did it by overriding the submit and submitAsync methods in DriverRemoteConnection, but you could catch and retry anywhere (a minimal retry sketch follows at the end of this answer).
# Imports this transport relies on:
from gremlin_python.driver.transport import AbstractBaseTransport
from tornado import httpclient, ioloop, websocket
from tornado.iostream import StreamClosedError


class Transport(AbstractBaseTransport):
    def __init__(self):
        self._ws = None
        self._loop = ioloop.IOLoop(make_current=False)
        self._url = None
        # Because the transport will try to reopen the underlying ws connection,
        # track if the close() method has been called to prevent the transport
        # from reopening.
        self._explicit_closed = True

    @property
    def closed(self):
        return not self._ws.protocol

    def connect(self, url, headers=None):
        self._explicit_closed = False
        # Set the endpoint URL
        self._url = httpclient.HTTPRequest(url, headers=headers) if headers else url
        # Open the connection
        self._connect()

    def write(self, message):
        # Before writing, try to ensure that the connection is open.
        if self.closed:
            self._connect()
        self._loop.run_sync(lambda: self._ws.write_message(message, binary=True))

    def read(self):
        result = self._loop.run_sync(self._ws.read_message)
        # If the read call returns None, the stream has closed.
        if result is None:
            self._ws.close()  # Ensure we close the stream
            raise StreamClosedError()
        return result

    def close(self):
        self._ws.close()
        self._loop.close()
        self._explicit_closed = True

    def _connect(self):
        # If close() was called explicitly on the transport, don't allow
        # subsequent calls to write() to reopen the connection.
        if self._explicit_closed:
            raise TransportClosedError(
                "Transport has been closed and can not be reopened."
            )
        # Check if the ws is closed; if it is not, close it.
        if self._ws and not self.closed:
            self._ws.close()
        # Open the ws connection
        self._ws = self._loop.run_sync(
            lambda: websocket.websocket_connect(url=self._url)
        )


class TransportClosedError(Exception):
    pass
This will work with gremlin-python's connection pooling as well.
If you don't need pooling, an alternate approach is to set the pool size to 1 and implement some form of keep-alive, as discussed in TINKERPOP-2352.
It looks like the web socket ping/keep-alive in gremlin-python is not implemented yet (TINKERPOP-1886).
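As mentioned above, one place to hang the retry logic is DriverRemoteConnection itself. A minimal sketch, assuming the Transport above; the class name is illustrative and submitAsync could be wrapped the same way:
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from tornado.iostream import StreamClosedError

class RetryingRemoteConnection(DriverRemoteConnection):
    # Illustrative subclass: retry a traversal once if the stream was closed.
    def submit(self, bytecode):
        try:
            return super().submit(bytecode)
        except StreamClosedError:
            # The Transport above reopens the web socket on the next write,
            # so a single retry is usually enough for an idle-timeout drop.
            return super().submit(bytecode)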

requests process hangs

I'm using requests to get a URL, such as:
while True:
    try:
        rv = requests.get(url, timeout=1)
        doSth(rv)
    except socket.timeout as e:
        print e
    except Exception as e:
        print e
After it runs for a while, it quits working. No exception or error is raised; it just seems suspended. I then stop the process by typing Ctrl+C from the console. It shows that the process is waiting for data:
.............
httplib_response = conn.getresponse(buffering=True) #httplib.py
response.begin() #httplib.py
version, status, reason = self._read_status() #httplib.py
line = self.fp.readline(_MAXLINE + 1) #httplib.py
data = self._sock.recv(self._rbufsize) #socket.py
KeyboardInterrupt
Why is this happening? Is there a solution?
It appears that the server you're sending your request to is throttling you - that is, it's sending bytes with less than 1 second between each packet (thus never triggering your timeout parameter), but slowly enough to appear stuck.
The only fix I can think of is to reduce the timeout parameter, unless you can resolve the throttling with the server's provider.
Do keep in mind that you'll need to account for latency when setting the timeout parameter, otherwise your connection will be dropped too quickly and might not work at all.
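Because the timeout parameter only bounds the gap between bytes, another option is to stream the response and enforce a total wall-clock deadline yourself. A rough sketch (Python 3; the deadline values are arbitrary):
import time
import requests

def get_with_deadline(url, deadline=10, chunk_timeout=1):
    # chunk_timeout bounds the wait for each chunk; deadline bounds the whole download.
    start = time.monotonic()
    chunks = []
    with requests.get(url, stream=True, timeout=chunk_timeout) as rv:
        for chunk in rv.iter_content(chunk_size=8192):
            chunks.append(chunk)
            if time.monotonic() - start > deadline:
                raise TimeoutError("response exceeded total deadline")
    return b"".join(chunks)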
By default, requests does not set a timeout for connect or read.
If for some reason the server cannot get back to the client, the client will get stuck connecting or, more commonly, reading the response.
The quick resolution is to set a timeout value on the request; the approach is well described here: http://docs.python-requests.org/en/master/user/advanced/#timeouts

How can I trigger an IncompleteRead (on purpose) in a Python web application?

I've got some Python code that makes requests using the requests library and occasionally experiences an IncompleteRead error. I'm trying to update this code to handle this error more gracefully and would like to test that it works, so I'm wondering how to actually trigger the conditions under which IncompleteRead is raised.
I realize I can do some mocking in a unit test; I'd just like to actually reproduce the circumstances (if I can) under which this error was previously occurring and ensure my code is able to deal with it properly.
Adding a second answer, more to the point this time. I took a dive into some source code and found information that may help.
The IncompleteRead exception bubbles up from httplib, part of the Python standard library. Most likely, it comes from this function:
def _safe_read(self, amt):
    """
    Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).

    Note that we cannot distinguish between EOF and an interrupt when zero
    bytes have been read. IncompleteRead() will be raised in this
    situation.

    This function should be used when <amt> bytes "should" be present for
    reading. If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
So, either the socket was closed before the HTTP response was consumed, or the reader tried to get too many bytes out of it. Judging by search results (so take this with a grain of salt), there is no other arcane situation that can make this happen.
The first scenario can be debugged with strace. If I'm reading this correctly, the 2nd scenario can be caused by the requests module, if:
A Content-Length header is present that exceeds the actual amount of data sent by the server.
A chunked response is incorrectly assembled (has an erroneous length byte before one of the chunks), or a regular response is being interpreted as chunked.
This function raises the Exception:
def _update_chunk_length(self):
    # First, we'll figure out length of a chunk and then
    # we'll try to read it from socket.
    if self.chunk_left is not None:
        return
    line = self._fp.fp.readline()
    line = line.split(b';', 1)[0]
    try:
        self.chunk_left = int(line, 16)
    except ValueError:
        # Invalid chunked protocol response, abort.
        self.close()
        raise httplib.IncompleteRead(line)
Try checking the Content-Length header of your buffered responses, or the chunk format of your chunked responses.
To produce the error, try:
Forcing an invalid Content-Length
Using the chunked response protocol, with a too-large length byte at the beginning of a chunk
Closing the socket mid-response
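For the first and third options, here is a minimal, hypothetical local test server using only the standard library; it advertises a larger Content-Length than it actually sends and then closes the socket, so a client's read() raises IncompleteRead:
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("127.0.0.1", 8080))
server.listen(1)
print("listening on http://127.0.0.1:8080/ ...")

conn, _ = server.accept()
conn.recv(65536)  # read (and ignore) the request
conn.sendall(
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 10000\r\n"
    b"\r\n"
    b"fourteen chars"
)
conn.close()  # close mid-response: only 14 of the promised 10000 bytes were sent
server.close()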
By looking at the places where raise IncompleteRead appears at https://github.com/python/cpython/blob/v3.8.0/Lib/http/client.py, I think the standard library's http.client module (named httplib back in Python 2) raises this exception in only the following two circumstances:
When a response's body is shorter than claimed by the response's Content-Length header, or
When a chunked response claims that the next chunk is of length n, but there are fewer than n bytes remaining in the response body.
If you install Flask (pip install Flask), you can paste this into a file to create a test server you can run with endpoints that artificially create both of these circumstances:
from flask import Flask, make_response

app = Flask(__name__)

@app.route('/test')
def send_incomplete_response():
    response = make_response('fourteen chars')
    response.headers['Content-Length'] = '10000'
    return response

@app.route('/test_chunked')
def send_chunked_response_with_wrong_sizes():
    # Example response based on
    # https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding
    # but with the stated size of the second chunk increased to 900
    resp_text = """7\r\nMozilla\r\n900\r\nDeveloper\r\n7\r\nNetwork\r\n0\r\n\r\n"""
    response = make_response(resp_text)
    response.headers['Transfer-Encoding'] = 'chunked'
    return response

app.run()
and then test them with http.client:
>>> import http.client
>>>
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test')
>>> response = conn.getresponse()
>>> response.read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/http/client.py", line 467, in read
s = self._safe_read(self.length)
File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(14 bytes read, 9986 more expected)
>>>
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test_chunked')
>>> response = conn.getresponse()
>>> response.read()
Traceback (most recent call last):
File "/usr/lib/python3.8/http/client.py", line 571, in _readall_chunked
value.append(self._safe_read(chunk_left))
File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(28 bytes read, 2276 more expected)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/http/client.py", line 461, in read
return self._readall_chunked()
File "/usr/lib/python3.8/http/client.py", line 575, in _readall_chunked
raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(7 bytes read)
In real life, the most likely reason this might happen sporadically is if a connection was closed early by the server. For example, you can also try running this Flask server, which sends a response body very slowly, with a total of 20 seconds of sleeping:
from flask import Flask, make_response, Response
from time import sleep

app = Flask(__name__)

@app.route('/test_generator')
def send_response_with_delays():
    def generate():
        yield 'foo'
        sleep(10)
        yield 'bar'
        sleep(10)
        yield 'baz'

    response = Response(generate())
    response.headers['Content-Length'] = '9'
    return response

app.run()
If you run that server in a terminal, then initiate a request to it and start reading the response like this...
>>> import http.client
>>> conn = http.client.HTTPConnection('localhost', 5000)
>>> conn.request('GET', '/test_generator')
>>> response = conn.getresponse()
>>> response.read()
... and then flick back to the terminal running your server and kill it (e.g. with CTRL-C, on Unix), then you'll see your .read() call error out with a familiar message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.8/http/client.py", line 467, in read
s = self._safe_read(self.length)
File "/usr/lib/python3.8/http/client.py", line 610, in _safe_read
raise IncompleteRead(data, amt-len(data))
http.client.IncompleteRead: IncompleteRead(6 bytes read, 3 more expected)
Other, less probable causes include your server systematically generating an incorrect Content-Length header (maybe due to some broken handling of Unicode), or your Content-Length header (or the lengths included in a chunked message) being corrupted in transit.
Okay, that covers the standard library. What about Requests? Requests by default defers its work to urllib3 which in turn defers to http.client, so you might expect the exception from http.client to simply bubble up when using Requests. However, life is more complicated than that, for two reasons:
Both urllib3 and requests catch exceptions in the layer beneath them and raise their own versions. For instance, there are urllib3.exceptions.IncompleteRead and requests.exceptions.ChunkedEncodingError.
The current handling of Content-Length checking across all three of these modules is horribly broken, and has been for years. I've done my best to explain it in detail at https://github.com/psf/requests/issues/4956#issuecomment-573325001 if you're interested, but the short version is that http.client won't check Content-Length if you call .read(123) instead of just .read(), that urllib3 may or may not check depending upon various complicated details of how you call it, and that Requests - as a consequence of the previous two issues - currently doesn't check it at all, ever. However, this hasn't always been the case; there have been some attempts to fix it made and unmade, so perhaps at some point in the past - like when this question was asked in 2016 - the state of play was a bit different. Oh, and for extra confusion, while urllib3 has its own version it still sometimes lets the standard library's IncompleteRead exception bubble up, just to mess with you.
Hopefully, point 2 will get fixed in time - I'm having a go right now at nudging it in that direction. Point 1 will remain a complication, but the conditions that trigger these exceptions - whether the underlying http.client.IncompleteRead or the urllib3 or requests alternatives - should remain as I describe at the start of this answer.
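In practice, that means code calling Requests may need to catch the wrapped exceptions as well as the standard-library one. A rough sketch (the function name is made up):
import http.client
import requests
import urllib3

def read_body_or_none(url):
    # Catch the exceptions each layer may surface for a truncated or garbled body.
    try:
        return requests.get(url, timeout=5).content
    except requests.exceptions.ChunkedEncodingError:
        return None  # Requests' wrapper for a bad chunked body
    except urllib3.exceptions.IncompleteRead:
        return None  # urllib3's own version
    except http.client.IncompleteRead:
        return None  # the standard-library exception, when it leaks through
    except requests.exceptions.RequestException:
        return None  # any other Requests-level failure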
When testing code that relies on external behavior (such as server responses, system sensors, etc) the usual approach is to fake the external factors instead of working to produce them.
Create a test version of the function or class you're using to make HTTP requests. If you're using requests directly across your codebase, stop: direct coupling with libraries and external services is very hard to test.
You mention that you want to make sure your code can handle this exception, and you'd rather avoid mocking for this reason. Mocking is just as safe, as long as you're wrapping the modules you need to mock all across your codebase. If you can't mock to test, you're missing layers in your design (or asking too much of your testing suite).
So, for example:
from http.client import IncompleteRead

class FooService(object):
    def make_request(*args):
        # use requests.py to perform HTTP requests
        # NOBODY uses requests.py directly without passing through here
        ...

class MockFooService(FooService):
    def make_request(*args):
        raise IncompleteRead(b'')
The 2nd class is a testing utility written solely for the purpose of testing this specific case. As your tests grow in coverage and completeness, you may need more sophisticated language (to avoid incessant subclassing and repetition), but it's usually good to start with the simplest code that will read easily and test the desired cases.
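For instance, a hypothetical test exercising the error path with the mock above (the function and test names are made up):
def fetch_report(service):
    # Code under test: must survive a truncated response.
    try:
        return service.make_request()
    except IncompleteRead:
        return None

def test_fetch_report_handles_incomplete_read():
    assert fetch_report(MockFooService()) is None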

Periodic REST call failing using python httplib

To monitor the server heartbeat, I am calling a REST API periodically every 30 seconds.
import httplib

HEARTBEAT_URI = '/heartbeat'

class HearBeatCheck:
    """
    Has other code
    """

    @staticmethod
    def check_heartbeat(host, port, username, password, timeout=60):
        http_connection = httplib.HTTPConnection(host, port, timeout=timeout)
        http_connection.request('GET', HEARTBEAT_URI, headers={'username': username,
                                                               'password': password})
        response = http_connection.getresponse()
        response.read()
        if response.status in [200, 201, 202, 203, 204, 205, 206]:
            return True
        else:
            return False
Now when I make this call every 30 seconds with the service down (the heartbeat API would return 500 or some other code), the error I get with the default timeout value is CannotSendRequest. If I change the polling interval to 60 seconds, I get a BadStatusLine error instead.
Now I understand CannotSendRequest is raised when the same HTTP connection object is reused without reading the response, but in this case I am setting the timeout higher than the polling interval, and every time a periodic call is made a new http_connection object is used anyway. Can someone shed some light if I am missing something obvious? Thanks.

Why doesn't requests.get() return? What is the default timeout that requests.get() uses?

In my script, requests.get never returns:
import requests

print("requesting..")

# This call never returns!
r = requests.get(
    "http://www.some-site.example",
    proxies={'http': '222.255.169.74:8080'},
)

print(r.ok)
What could be the possible reason(s)? Any remedy? What is the default timeout that get uses?
What is the default timeout that get uses?
The default timeout is None, which means it'll wait (hang) until the connection is closed.
Just specify a timeout value, like this:
r = requests.get(
    'http://www.example.com',
    proxies={'http': '222.255.169.74:8080'},
    timeout=5
)
From requests documentation:
You can tell Requests to stop waiting for a response after a given
number of seconds with the timeout parameter:
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
Note:
timeout is not a time limit on the entire response download; rather,
an exception is raised if the server has not issued a response for
timeout seconds (more precisely, if no bytes have been received on the
underlying socket for timeout seconds).
It happens a lot to me that requests.get() takes a very long time to return, even if the timeout is 1 second. There are a few ways to overcome this problem:
1. Use the TimeoutSauce internal class
From: https://github.com/kennethreitz/requests/issues/1928#issuecomment-35811896
import requests
from requests.adapters import TimeoutSauce

class MyTimeout(TimeoutSauce):
    def __init__(self, *args, **kwargs):
        if kwargs['connect'] is None:
            kwargs['connect'] = 5
        if kwargs['read'] is None:
            kwargs['read'] = 5
        super(MyTimeout, self).__init__(*args, **kwargs)

requests.adapters.TimeoutSauce = MyTimeout
This code should cause us to set the read timeout as equal to the
connect timeout, which is the timeout value you pass on your
Session.get() call. (Note that I haven't actually tested this code, so
it may need some quick debugging, I just wrote it straight into the
GitHub window.)
2. Use a fork of requests from kevinburke: https://github.com/kevinburke/requests/tree/connect-timeout
From its documentation: https://github.com/kevinburke/requests/blob/connect-timeout/docs/user/advanced.rst
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read
timeouts. Specify a tuple if you would like to set the values
separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
NOTE: The change has since been merged to the main Requests project.
3. Using eventlet or signal, as already mentioned in the similar question:
Timeout for python requests.get entire response
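A rough sketch of the signal approach (Unix only, main thread only, whole-second granularity; the exception and function names are made up):
import signal
import requests

class TotalTimeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise TotalTimeout()

def get_with_total_timeout(url, total_seconds=10):
    # SIGALRM interrupts the blocking read even if bytes keep trickling in.
    old_handler = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(total_seconds)
    try:
        return requests.get(url)
    finally:
        signal.alarm(0)  # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)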
I wanted a default timeout easily added to a bunch of code (assuming that a timeout solves your problem).
This is the solution I picked up from a ticket submitted to the repository for Requests.
credit: https://github.com/kennethreitz/requests/issues/2011#issuecomment-477784399
The solution is the last couple of lines here, but I show more code for better context. I like to use a session for retry behaviour.
import requests
import functools
from requests.adapters import HTTPAdapter, Retry

def requests_retry_session(
    retries=10,
    backoff_factor=2,
    status_forcelist=(500, 502, 503, 504),
    session=None,
) -> requests.Session:
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    # set default timeout
    for method in ('get', 'options', 'head', 'post', 'put', 'patch', 'delete'):
        setattr(session, method, functools.partial(getattr(session, method), timeout=30))

    return session
then you can do something like this:
requests_session = requests_retry_session()
r = requests_session.get(url=url,...
In my case, the reason "requests.get never returns" is that requests.get() first attempts to connect to the host's resolved IPv6 address. If something goes wrong connecting to that IPv6 address and it gets stuck, it only retries the IPv4 address if I explicitly set timeout=<N seconds> and that timeout is hit.
My solution is monkey-patching the Python socket module to ignore IPv6 (or IPv4 if IPv4 is not working); either this answer or this answer works for me.
You might wonder why the curl command works: curl connects over IPv4 without waiting for IPv6 to complete. You can trace the socket syscalls with the strace -ff -e network -s 10000 -- curl -vLk '<your url>' command. For Python, strace -ff -e network -s 10000 -- python3 <your python script> can be used.
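A minimal sketch of that kind of monkey-patch, forcing IPv4 by filtering getaddrinfo results (apply it before any requests call; it assumes IPv4 is actually reachable):
import socket

_orig_getaddrinfo = socket.getaddrinfo

def _ipv4_only_getaddrinfo(host, port, family=0, type=0, proto=0, flags=0):
    # Resolve as usual, but request only IPv4 results so requests never
    # stalls trying an unreachable IPv6 address first.
    return _orig_getaddrinfo(host, port, socket.AF_INET, type, proto, flags)

socket.getaddrinfo = _ipv4_only_getaddrinfo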
Patching the documented send function will fix this for all requests - even in many dependent libraries and SDKs. When patching libraries, be sure to patch supported/documented functions, not TimeoutSauce - otherwise you may wind up silently losing the effect of your patch.
import requests

DEFAULT_TIMEOUT = 180

old_send = requests.Session.send

def new_send(*args, **kwargs):
    if kwargs.get("timeout", None) is None:
        kwargs["timeout"] = DEFAULT_TIMEOUT
    return old_send(*args, **kwargs)

requests.Session.send = new_send
The effects of not having any timeout are quite severe, and the use of a default timeout can almost never break anything - because TCP itself has default timeouts as well.
Reviewed all the answers and came to the conclusion that the problem still exists. On some sites requests may hang indefinitely, and using multiprocessing seems to be overkill. Here's my approach (Python 3.5+):
import asyncio
import aiohttp

async def get_http(url):
    async with aiohttp.ClientSession(conn_timeout=1, read_timeout=3) as client:
        try:
            async with client.get(url) as response:
                content = await response.text()
                return content, response.status
        except Exception:
            pass

loop = asyncio.get_event_loop()
task = loop.create_task(get_http('http://example.com'))
loop.run_until_complete(task)
result = task.result()
if result is not None:
    content, status = task.result()
    if status == 200:
        print(content)
UPDATE
If you receive a deprecation warning about using conn_timeout and read_timeout, check near the bottom of THIS reference for how to use the ClientTimeout data structure. One simple way to apply this data structure per the linked reference to the original code above would be:
async def get_http(url):
    timeout = aiohttp.ClientTimeout(total=60)
    async with aiohttp.ClientSession(timeout=timeout) as client:
        try:
            etc.
