I have 600 URLs to request. When I use grequests, sometimes it finishes fast, within 10 seconds, but sometimes it just gets stuck (it never reaches the statement that prints 'done').
Here is my code:
import grequests

urls = [...]
reqs = [grequests.get(u) for u in urls]
rs = grequests.map(reqs)
print('done')
Is it because the majority of the requests have already finished within 10 seconds (with status 200), and just a few are still waiting for a response? I am using PyCharm, and in the variable monitor window I can indeed see that many requests already have status 200. How can I fix this? Is there a parameter I can set to limit the maximum request time for each request, so that if a request exceeds a certain time it returns anyway?
Just set the timeout:
import grequests

urls = [...]
reqs = [grequests.get(u, timeout=1) for u in urls]  # timeout in seconds
rs = grequests.map(reqs)
print(rs)
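If you also want to see which requests failed or timed out, rather than just getting None back for them in the result list, grequests.map() also accepts an exception_handler callback, and a size argument that caps concurrency. A minimal sketch; the handler name and the size of 50 are just illustrative:

import grequests

def on_error(request, exception):
    # called by grequests.map for each request that raised (e.g. a timeout)
    print('{} failed: {}'.format(request.url, exception))

reqs = [grequests.get(u, timeout=1) for u in urls]
rs = grequests.map(reqs, size=50, exception_handler=on_error)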
I am using the elasticsearch-dsl library for Python. I am running scroll scans in a threaded application like so:
s = (Search
     .from_dict(query)
     .using(es)
     .index(index)
     .doc_type(doc_type)
     .extra(slice=slice))

for hit in s.scan():
    yield hit
There can be 32 threads all running the same scroll, each with a different slice. Occasionally this hits a 429 Too Many Requests error, which kills the whole process. Heart-breakingly, this may happen an hour into the scroll, and the process just has to start over.
How can I recover from a 429 Too Many Requests error? Is it possible to retry at the last scroll offset without restarting the entire scroll?
Use the retry lib. I have a custom TooManyRequestsError so that I only retry on 429s:
import requests
from retry import retry  # pip install retry

class TooManyRequestsError(Exception):
    pass

@retry(TooManyRequestsError, delay=3, jitter=3, max_delay=30)
def get(url: str) -> requests.Response:
    res = requests.get(url)
    if res.status_code == 429:
        logger.warning(f'Too many requests! {url}')
        raise TooManyRequestsError()
    res.raise_for_status()
    return res
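The retry above works per HTTP call. For the scroll scenario you cannot resume a dead scroll from its last offset, but you can apply the same pattern one level up, per slice, so a 429 only restarts the slice that failed instead of all 32. A rough sketch, assuming elasticsearch-py surfaces the 429 as a TransportError (check what your client version actually raises), and that downstream processing tolerates duplicate hits from the restarted slice:

from elasticsearch.exceptions import TransportError
from elasticsearch_dsl import Search
from retry import retry

@retry(TooManyRequestsError, delay=3, jitter=3, max_delay=30)
def scan_slice(es, index, doc_type, query, slice_):
    # a retry restarts this one slice from the beginning
    s = (Search
         .from_dict(query)
         .using(es)
         .index(index)
         .doc_type(doc_type)
         .extra(slice=slice_))
    try:
        return list(s.scan())
    except TransportError as e:
        if e.status_code == 429:
            raise TooManyRequestsError()
        raise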
What do I mean by "deterministic time"? For example, AWS offers a service called "AWS Lambda". A process started as a Lambda function has a time limit; after that, the Lambda function stops executing and the task is assumed to have failed. An example task: sending data to an HTTP endpoint. Depending on the network connection to the endpoint, and other factors, sending the data can take a long time. If I need to send the same data to many endpoints, the full process time is one send time multiplied by the number of endpoints, which increases the chance that the Lambda function will be stopped before all the data has been sent to all endpoints.
To solve this I need to send the data to the different endpoints in parallel, using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes more time than the Lambda time limit allows, the function is aborted and returns an error. So I need to use a timeout on the HTTP request, to abort it if it takes more time than expected.
If the HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so that it isn't lost. The time needed to save unprocessed data is predictable, because I control the storage where the data will be saved.
And the last part that consumes time is the loop where threads are scheduled with executor.submit(). If there is only one endpoint, or a small number of them, the time consumed is small and there is no need to control it. But if I am dealing with many endpoints, I have to take it into account.
So basically the full time consists of:
scheduling threads
http request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import time

import requests

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        if result.status_code == 200:
            return {'status': 'ok'}
        return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, get_data(10))
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError:
        pass

main()
I was thinking about how I could use the asyncio library instead of threads to do the same thing:
import asyncio
import time
from functools import partial

import requests

start = time.time()

def send_data(data):
    ...  # same as in the previous snippet

def get_data(n):
    return {"wait": n}

def done_callback(a, b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start_appending > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async, I use run_in_executor() to create the future object. The main problem I have is that done_callback() is not executed when the thread started by the executor finishes its job, but only when the futures are "processed" by the asyncio.wait() expression.
Basically I am seeking a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas for how to write Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
And another question. If a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I have no result, so I have to use functools.partial() to pass extra data into done_callback() that tells me which data the callback was called for. If the passed data is small, this is not a problem; if it is big, I need to put the data in a list or dictionary and pass only an index into the callback, or else pass the full data into the callback.
Can I somehow get access to the variable that was passed to the future/thread from a done_callback() that was triggered on a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
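A minimal sketch of that approach, assuming aiohttp is installed; the endpoint URL and the save_unprocessed() helper are placeholders, not part of any library:

import asyncio
import aiohttp

async def send_data(session, data):
    try:
        async with session.post('http://127.0.0.1:5000/endpoint', json=data) as resp:
            return {'status': 'ok' if resp.status == 200 else 'error'}
    except asyncio.CancelledError:
        save_unprocessed(data)  # placeholder: the coroutine still sees its own data
        raise

async def main(items):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(send_data(session, d)) for d in items]
        try:
            return await asyncio.wait_for(asyncio.gather(*tasks), timeout=0.5)
        except asyncio.TimeoutError:
            return None  # wait_for has already cancelled every unfinished task

asyncio.get_event_loop().run_until_complete(main([{'wait': 1}, {'wait': 10}]))

This also answers the second question: because each coroutine closes over its own data argument, a cancelled task can save exactly the payload it was sending, with no need to smuggle it through functools.partial().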
I want to set a time limit on a request so that, if the queue is down, the client doesn't have to wait a long time to get a connection error. Right now, when I make a request to a queue that is down, the application hangs for a long time before I get an exception.
I tried setting time_limit, soft_time_limit, timeout and soft_timeout in the client requests, but none of them worked.
How do I set a timeout that a request can wait for a response before it fails?
Here is the code I use to make the call.
task = clusterWorking.apply_async(queue=q, soft_time_limit=2, time_limit=5)
task = clusterWorking.apply_async(queue=q, timeout=1, soft_timeout=1)
Here is the server code:
@task(name='manager.pingdaemon.clusterWorking')
def clusterWorking():
    return "up"
You can use get(timeout):
http://celery.readthedocs.org/en/latest/reference/celery.result.html?highlight=get#celery.result.ResultSet.get
from celery.exceptions import TimeoutError

try:
    task = clusterWorking.apply_async(queue=q).get(timeout=5)
except TimeoutError:
    pass
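Note that .get() blocks the caller until the worker replies or the timeout expires, so this makes the dispatch effectively synchronous. The TimeoutError caught here is celery.exceptions.TimeoutError, not the builtin one, which is why the import above matters.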
I'm currently testing something with threading / workerpool; I create 400 threads which download a total of 5000 URLs... The problem is that some of the 400 threads are "freezing"; looking at my processes I see that around 15 threads in every run freeze, and after a while they eventually close one by one.
My question is whether there is a way to have some sort of 'timer' / 'counter' that kills a thread if it isn't finished after x seconds.
# download2.py - Download many URLs using multiple threads.
import datetime
import urllib2
import workerpool

class DownloadJob(workerpool.Job):
    "Job for downloading a given URL."
    def __init__(self, url):
        self.url = url  # the URL we'll need to download when the job runs

    def run(self):
        try:
            data = urllib2.urlopen(self.url).read()
        except:
            pass

# Initialize a pool, 400 threads in this case
pool = workerpool.WorkerPool(size=400)

# Loop over urls.txt and create a job to download the URL on each line
print datetime.datetime.now()
for url in open("urls.txt"):
    job = DownloadJob(url.strip())
    pool.put(job)

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
print datetime.datetime.now()
The problem is that some of the 400 threads are "freezing"...
That's most likely because of this line...
url = urllib2.urlopen(self.url).read()
By default, Python will wait forever for a remote server to respond, so if one of your URLs points to a server which is ignoring the SYN packet, or is otherwise just really slow, the thread could potentially be blocked forever.
You can use the timeout parameter of urlopen() to set a limit on how long the thread will wait for the remote host to respond...
url = urllib2.urlopen(self.url, timeout=5).read() # Time out after 5 seconds
...or you can set it globally instead with socket.setdefaulttimeout() by putting these lines at the top of your code...
import socket
socket.setdefaulttimeout(5) # Time out after 5 seconds
urlopen accepts a timeout value; that would be the best way to handle it, I think.
But I agree with the commenter that 400 threads is probably way too many.
I have the following code, which creates an HTTPConnectionPool using the Twisted framework, and an Agent for HTTP requests:
self.pool = HTTPConnectionPool(reactor, persistent=True)
self.pool.retryAutomatically = False
self.pool.maxPersistentPerHost = 1
self.agent = Agent(reactor, pool=self.pool)
then I create requests to connect to a local server:
d = self.agent.request(
    "GET",
    url,
    Headers({"Host": ["localhost:8333"]}),
    None)
The problem is: the local server sometimes behaves incorrectly when multiple simultaneous requests are made, so I would like to limit the number of simultaneous requests to 1.
The additional requests should be queued until the pending request completes.
I've tried with self.pool.maxPersistentPerHost = 1 but it doesn't work.
Does twisted.web.client.Agent with HTTPConnectionPool support limiting the maximum number of connections per host, or do I have to implement a request FIFO queue myself?
The reason setting maxPersistentPerHost to 1 didn't help is that maxPersistentPerHost controls the maximum number of persistent connections to cache per host. It does not prevent additional connections from being opened to service new requests; it only causes them to be closed immediately after a response is received, once the maximum number of cached connections has been reached.
You can enforce serialization in a number of ways. One way to have a "FIFO queue" is with twisted.internet.defer.DeferredLock. Use it together with Agent like this:
lock = DeferredLock()
d1 = lock.run(agent.request, url, ...)
d2 = lock.run(agent.request, url, ...)
The second request will not run until after the first has completed.
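If the server can tolerate a few concurrent requests rather than just one, twisted.internet.defer.DeferredSemaphore offers the same run() interface with a configurable limit. A sketch, assuming a limit of 2 and a urls list to iterate over:

from twisted.internet.defer import DeferredSemaphore

sem = DeferredSemaphore(2)  # at most 2 requests in flight at any time
ds = [sem.run(agent.request, "GET", url) for url in urls]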