I created a batch delayed HTTP (async) client which triggers multiple async HTTP requests and, most importantly, delays the start of each request so that, for example, 100 requests are not all fired at once.
But it has an issue. The .fetch() method takes a handleMethod parameter which handles the response, but I found out that if the delay (sleep) after the fetch isn't long enough, the handle method is never triggered (maybe the request gets killed in the meantime).
It is probably related to the .run_sync method. How can I fix that? I want to keep the delays but avoid this issue.
I need to parse the response regardless of how long the request takes and regardless of the following sleep call (that call serves another purpose, as I said, and should not be related to response handling at all).
from tornado import gen, httpclient, ioloop

class BatchDelayedHttpClient:
    def __init__(self, requestList):
        # class members
        self.httpClient = httpclient.AsyncHTTPClient()
        self.requestList = requestList
        ioloop.IOLoop.current().run_sync(self.execute)

    @gen.coroutine
    def execute(self):
        print("exec start")
        for request in self.requestList:
            print("requesting " + request["url"])
            self.httpClient.fetch(request["url"], request["handleMethod"], method=request["method"], headers=request["headers"], body=request["body"])
            yield gen.sleep(request["sleep"])
        print("exec end")
I have the following HTTP server written using Tornado:
import json
import subprocess

import tornado.ioloop
import tornado.web
from multiprocessing import Pool, Queue

def reindex(index):
    # After some initialization, we execute a process and wait for its output
    result = subprocess.check_output([indexerBinPath, arg])

class ReindexRequestHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def post(self):
        reindexRequest = json.loads(self.request.body)
        p = self.application.settings.get('pool')
        p.apply_async(reindex, [ reindexRequest['IndexName'] ], callback = self.onIndexingFinished)

    def onIndexingFinished(self, output):
        self.flush()
        self.finish()
        logger.info('Async callback: finished')

application = tornado.web.Application([
    (r"/reindex", ReindexRequestHandler)
], pool = Pool(8), queue = Queue())

if __name__ == "__main__":
    application.listen(8625)
    try:
        tornado.ioloop.IOLoop.instance().start()
    except KeyboardInterrupt:
        tornado.ioloop.IOLoop.instance().stop()
In the POST handler, I asynchronously execute the reindex function, which in turn launches a process and waits for it to finish. That works fine - the process is always executed correctly. The process may, depending on its arguments, take up to several minutes to finish. If it completes within seconds, everything works fine.
However, when it takes e.g. over 3 minutes to complete, the HTTP client which sent the POST request never gets the answer. From the standpoint of the server, it looks OK - I can see Async callback: finished logged. However, the HTTP client waits indefinitely for the response (until it fails with a timeout). I tried both Fiddler's request composer and the .NET HttpClient class.
Why does the HTTP client never get the response if the request takes a long time to process?
I had a similar handler, and self.finish() is what triggers the response back to the client. So if you move that line above your p.apply_async call, it ought to work as you intend.
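For illustration, a rough, untested sketch of that suggestion, reusing the names from the question; note that once the response is finished in post(), the callback should no longer call flush()/finish() a second time:

class ReindexRequestHandler(tornado.web.RequestHandler):
    def post(self):
        reindexRequest = json.loads(self.request.body)
        p = self.application.settings.get('pool')
        # respond to the client immediately, then let the job run in the pool
        self.finish()
        p.apply_async(reindex, [reindexRequest['IndexName']],
                      callback=self.onIndexingFinished)

    def onIndexingFinished(self, output):
        # the response has already been finished, so only log here
        logger.info('Async callback: finished')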
In my script, requests.get never returns:
import requests
print ("requesting..")
# This call never returns!
r = requests.get(
"http://www.some-site.example",
proxies = {'http': '222.255.169.74:8080'},
)
print(r.ok)
What could be the possible reason(s)? Any remedy? What is the default timeout that get uses?
What is the default timeout that get uses?
The default timeout is None, which means it'll wait (hang) until the connection is closed.
Just specify a timeout value, like this:
r = requests.get(
'http://www.example.com',
proxies={'http': '222.255.169.74:8080'},
timeout=5
)
From requests documentation:
You can tell Requests to stop waiting for a response after a given
number of seconds with the timeout parameter:
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
Note:
timeout is not a time limit on the entire response download; rather,
an exception is raised if the server has not issued a response for
timeout seconds (more precisely, if no bytes have been received on the
underlying socket for timeout seconds).
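As a small usage sketch (the URL and limit are placeholders), you can catch the exception so a slow or unresponsive host does not hang the script:

import requests

try:
    r = requests.get('http://www.example.com', timeout=5)
    print(r.status_code)
except requests.exceptions.Timeout:
    print("request timed out after 5 seconds")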
It happens a lot to me that requests.get() takes a very long time to return, even if the timeout is 1 second. There are a few ways to overcome this problem:
1. Use the TimeoutSauce internal class
From: https://github.com/kennethreitz/requests/issues/1928#issuecomment-35811896
import requests
from requests.adapters import TimeoutSauce

class MyTimeout(TimeoutSauce):
    def __init__(self, *args, **kwargs):
        if kwargs['connect'] is None:
            kwargs['connect'] = 5
        if kwargs['read'] is None:
            kwargs['read'] = 5
        super(MyTimeout, self).__init__(*args, **kwargs)

requests.adapters.TimeoutSauce = MyTimeout
This code should cause us to set the read timeout as equal to the
connect timeout, which is the timeout value you pass on your
Session.get() call. (Note that I haven't actually tested this code, so
it may need some quick debugging, I just wrote it straight into the
GitHub window.)
2. Use a fork of requests from kevinburke: https://github.com/kevinburke/requests/tree/connect-timeout
From its documentation: https://github.com/kevinburke/requests/blob/connect-timeout/docs/user/advanced.rst
If you specify a single value for the timeout, like this:
r = requests.get('https://github.com', timeout=5)
The timeout value will be applied to both the connect and the read
timeouts. Specify a tuple if you would like to set the values
separately:
r = requests.get('https://github.com', timeout=(3.05, 27))
NOTE: The change has since been merged to the main Requests project.
3. Using eventlet or signal, as already mentioned in the similar question:
Timeout for python requests.get entire response
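For reference, a hedged sketch of the eventlet variant from that linked question (eventlet must be installed; unlike requests' own timeout, this bounds the total time of the call):

import eventlet
eventlet.monkey_patch()

import requests

try:
    with eventlet.Timeout(10):  # hard limit on the whole request
        r = requests.get('http://www.example.com')
        print(r.status_code)
except eventlet.Timeout:
    print("request aborted after 10 seconds")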
I wanted a default timeout easily added to a bunch of code (assuming that timeout solves your problem)
This is the solution I picked up from a ticket submitted to the repository for Requests.
credit: https://github.com/kennethreitz/requests/issues/2011#issuecomment-477784399
The solution is the last couple of lines here, but I show more code for better context. I like to use a session for retry behaviour.
import requests
import functools
from requests.adapters import HTTPAdapter, Retry

def requests_retry_session(
    retries=10,
    backoff_factor=2,
    status_forcelist=(500, 502, 503, 504),
    session=None,
) -> requests.Session:
    session = session or requests.Session()
    retry = Retry(
        total=retries,
        read=retries,
        connect=retries,
        backoff_factor=backoff_factor,
        status_forcelist=status_forcelist,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    # set default timeout
    for method in ('get', 'options', 'head', 'post', 'put', 'patch', 'delete'):
        setattr(session, method, functools.partial(getattr(session, method), timeout=30))
    return session
then you can do something like this:
requests_session = requests_retry_session()
r = requests_session.get(url=url,...
In my case, the reason "requests.get never returns" was that requests.get() first tries to connect to the host's resolved IPv6 address. If something goes wrong with that IPv6 connection and it gets stuck, it only falls back to the IPv4 address if I explicitly set timeout=<N seconds> and the timeout is hit.
My solution is monkey-patching the Python socket module to ignore IPv6 (or IPv4, if IPv4 is not working); either this answer or this answer works for me.
You might be wondering why the curl command works: curl connects over IPv4 without waiting for IPv6 to complete. You can trace the socket syscalls with the strace -ff -e network -s 10000 -- curl -vLk '<your url>' command. For Python, the strace -ff -e network -s 10000 -- python3 <your python script> command can be used.
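One common form of that monkey-patch looks roughly like this (a sketch, not the linked answers verbatim): force socket.getaddrinfo to return only IPv4 results so requests never tries the stalled IPv6 route.

import socket

_orig_getaddrinfo = socket.getaddrinfo

def _getaddrinfo_ipv4_only(host, port, family=0, type=0, proto=0, flags=0):
    # always resolve as AF_INET so only IPv4 addresses are tried
    return _orig_getaddrinfo(host, port, socket.AF_INET, type, proto, flags)

socket.getaddrinfo = _getaddrinfo_ipv4_only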
Patching the documented "send" function will fix this for all requests - even in many dependent libraries and SDKs. When patching libraries, be sure to patch supported/documented functions, not TimeoutSauce - otherwise you may wind up silently losing the effect of your patch.
import requests

DEFAULT_TIMEOUT = 180

old_send = requests.Session.send

def new_send(*args, **kwargs):
    if kwargs.get("timeout", None) is None:
        kwargs["timeout"] = DEFAULT_TIMEOUT
    return old_send(*args, **kwargs)

requests.Session.send = new_send
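With the patch in place, an ordinary call now inherits the default timeout (a small usage sketch; the URL is a placeholder):

r = requests.get("http://www.example.com")  # implicitly capped at DEFAULT_TIMEOUT seconds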
The effects of not having any timeout are quite severe, and the use of a default timeout can almost never break anything - because TCP itself has default timeouts as well.
I reviewed all the answers and came to the conclusion that the problem still exists. On some sites, requests may hang indefinitely, and using multiprocessing seems to be overkill. Here's my approach (Python 3.5+):
import asyncio
import aiohttp

async def get_http(url):
    async with aiohttp.ClientSession(conn_timeout=1, read_timeout=3) as client:
        try:
            async with client.get(url) as response:
                content = await response.text()
                return content, response.status
        except Exception:
            pass

loop = asyncio.get_event_loop()
task = loop.create_task(get_http('http://example.com'))
loop.run_until_complete(task)
result = task.result()
if result is not None:
    content, status = task.result()
    if status == 200:
        print(content)
UPDATE
If you receive a deprecation warning about using conn_timeout and read_timeout, check near the bottom of THIS reference for how to use the ClientTimeout data structure. One simple way to apply this data structure per the linked reference to the original code above would be:
async def get_http(url):
    timeout = aiohttp.ClientTimeout(total=60)
    async with aiohttp.ClientSession(timeout=timeout) as client:
        try:
            etc.
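Putting it together, a fuller hedged sketch of the same coroutine with ClientTimeout applied (the 60-second total is just the value from the snippet above):

import asyncio
import aiohttp

async def get_http(url):
    timeout = aiohttp.ClientTimeout(total=60)  # overall limit for the whole request
    async with aiohttp.ClientSession(timeout=timeout) as client:
        try:
            async with client.get(url) as response:
                return await response.text(), response.status
        except Exception:
            return None

result = asyncio.get_event_loop().run_until_complete(get_http('http://example.com'))
if result is not None:
    content, status = result
    if status == 200:
        print(content)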
I'm currently testing something with threading and a worker pool; I create 400 threads which download a total of 5000 URLs... The problem is that some of the 400 threads are "freezing": looking at my processes, I see that around 15 threads freeze in every run, and after a while they eventually close one by one.
My question is whether there is a way to have some sort of 'timer'/'counter' that kills a thread if it isn't finished after x seconds.
# download2.py - Download many URLs using multiple threads.
import os
import urllib2
import workerpool
import datetime
from threading import Timer

class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url # The url we'll need to download when the job runs
    def run(self):
        try:
            url = urllib2.urlopen(self.url).read()
        except:
            pass

# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)

# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
    job = DownloadJob(url.strip())
    pool.put(job)

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()
The problem is that some of the 400 threads are "freezing"...
That's most likely because of this line...
url = urllib2.urlopen(self.url).read()
By default, Python will wait forever for a remote server to respond, so if one of your URLs points to a server which is ignoring the SYN packet, or is otherwise just really slow, the thread could potentially be blocked forever.
You can use the timeout parameter of urlopen() to set a limit on how long the thread will wait for the remote host to respond...
url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds
...or you can set it globally instead with socket.setdefaulttimeout() by putting these lines at the top of your code...
import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds
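Applied to the DownloadJob class from the question, run() might look roughly like this (a sketch in the question's Python 2 style; the 5-second limit is illustrative, and the failure is printed instead of being silently swallowed):

import socket
import urllib2
import workerpool

class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url
    def run(self):
        try:
            data = urllib2.urlopen(self.url, timeout=5).read()
        except (urllib2.URLError, socket.timeout) as e:
            print "failed %s: %s" % (self.url, e)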
urlopen accepts a timeout value; that would be the best way to handle it, I think.
But I agree with the commenter that 400 threads is probably way too many.
I need to do the following test:
send a GET request to a server (http://remote/...)
wait for the server to send a POST request in response (http://local/...)
parse the POST data and do some assertions
Selenium does not fit this case: it can't listen to connections, and I can send a GET without Selenium as well.
So, I wrote a unit test:
import requests
import SocketServer
from SimpleHTTPServer import SimpleHTTPRequestHandler

class MobiMoneyTestCase(TestCase):
    def test_can_send_response(self):
        resp = requests.post('http://url/api/', data={'callback': 'http://localhost:8000'})

        class Handler(SimpleHTTPRequestHandler):
            def do_GET(self):
                assert self.path == '...'

        httpd = SocketServer.ThreadingTCPServer(('localhost', 8000), Handler)
Basically, you want to time out a process if it takes too long? You should check out the signal module in that case.
There is a neat implementation (with a decorator) here: Timeout function if it takes too long to finish
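For completeness, a hedged sketch of the SIGALRM approach behind that link (Unix-only; SIGALRM is not available on Windows):

import signal

class TimeoutError(Exception):
    pass

def run_with_timeout(seconds, func, *args, **kwargs):
    def _handler(signum, frame):
        raise TimeoutError("timed out after %d seconds" % seconds)
    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return func(*args, **kwargs)
    finally:
        signal.alarm(0)                           # cancel the pending alarm
        signal.signal(signal.SIGALRM, old_handler)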