>>> import asyncio
>>> help(asyncio.wait)
Help on function wait in module asyncio.tasks:
wait(fs, *, loop=None, timeout=None, return_when='ALL_COMPLETED')
Wait for the Futures and coroutines given by fs to complete.
Coroutines will be wrapped in Tasks.
Returns two sets of Future: (done, pending).
Usage:
done, pending = yield from asyncio.wait(fs)
Note: This does not raise TimeoutError! Futures that aren't done
when the timeout occurs are returned in the second set.
(END)
I don't quite understand the last Note in this help. (What is the second set? Is it the pending/reprocessing set? How do I execute the pending tasks, combine the results of both done and pending, and then save them in the DB?)
My problem:
I'm using asyncio with aiohttp and have millions of URLs; a few of them might raise a timeout error. I want to send those to a queue for reprocessing, or have the event loop take care of them.
import asyncio
import aiohttp

sem = asyncio.Semaphore(10)

@asyncio.coroutine
def process_data(url):
    with (yield from sem):
        response = yield from aiohttp.request('GET', url)
        print(response)

loop = asyncio.get_event_loop()
c = asyncio.wait([process_data(url) for url in url_list], timeout=10)
loop.run_until_complete(c)
PS: I'm not using the wait_for method.
Here are the two sets from the help:
Returns two sets of Future: (done, pending).
The second set is the pending set: the jobs that haven't finished within the timeout. asyncio.wait returns a tuple with two sets of futures, one with those that are done and one with those that are still pending.
instead of:
c = asyncio.wait([process_data(url) for url in url_list], timeout=10)
loop.run_until_complete(c)
you should probably:
@asyncio.coroutine
def dostuff():
    done, pending = yield from asyncio.wait([process_data(url) for url in url_list], timeout=10)
    # do something with pending

loop.run_until_complete(dostuff())
Here is more information:
https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.wait
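To answer the asker's follow-up (how to execute the pending tasks and combine both sets of results before saving to the DB), here is a minimal, untested sketch in the same old-style coroutine syntax, assuming process_data and url_list from the question and a single extra retry round; mapping still-pending tasks back to their URLs for a real reprocessing queue is left out:

import asyncio

@asyncio.coroutine
def fetch_all():
    # wrap the coroutines in tasks so they can be inspected/cancelled later
    tasks = [asyncio.ensure_future(process_data(url)) for url in url_list]
    done, pending = yield from asyncio.wait(tasks, timeout=10)
    # the pending tasks keep running on the loop; give them one more round
    if pending:
        done_late, pending = yield from asyncio.wait(pending, timeout=10)
        done |= done_late
    for task in pending:
        task.cancel()  # or push the corresponding URL onto a reprocessing queue
    # combine everything that finished without an exception, then save to the DB
    return [t.result() for t in done if t.exception() is None]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(fetch_all())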
Objective:
I am trying to scrape multiple URLs simultaneously. I don't want to make too many requests at the same time so I am using this solution to limit it.
Problem:
Requests are being made for ALL tasks instead of for a limited number at a time.
Stripped-down Code:
async def download_all_product_information():
    # TO LIMIT THE NUMBER OF CONCURRENT REQUESTS
    async def gather_with_concurrency(n, *tasks):
        semaphore = asyncio.Semaphore(n)

        async def sem_task(task):
            async with semaphore:
                return await task

        return await asyncio.gather(*(sem_task(task) for task in tasks))

    # FUNCTION TO ACTUALLY DOWNLOAD INFO
    async def get_product_information(url_to_append):
        url = 'https://www.amazon.com.br' + url_to_append
        print('Product Information - Page ' + str(current_page_number) + ' for category ' +
              str(category_index) + '/' + str(len(all_categories)) + ' in ' + gender)
        source = await get_source_code_or_content(url, should_render_javascript=True)
        time.sleep(random.uniform(2, 5))
        return source

    # LOOP WHERE STUFF GETS DONE
    for current_page_number in range(1, 401):
        for gender in os.listdir(base_folder):
            all_tasks = []
            # check all products in the current page
            all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
            for product_specific_url in all_products_in_current_page:
                current_task = asyncio.create_task(get_product_information(product_specific_url))
                all_tasks.append(current_task)
            await gather_with_concurrency(random.randrange(8, 15), *all_tasks)

async def main():
    await download_all_product_information()

# just to make sure there are not any problems caused by two event loops
if asyncio.get_event_loop().is_running():  # only patch if needed (i.e. running in Notebook, Spyder, etc)
    import nest_asyncio
    nest_asyncio.apply()

# for asynchronous functionality
if __name__ == '__main__':
    asyncio.run(main())
What am I doing wrong? Thanks!
What is wrong is this line:
current_task = asyncio.create_task(get_product_information(product_specific_url))
When you create a "task", it is immediately scheduled for execution. As soon as your code yields execution to the asyncio loop (at any "await" expression), asyncio will loop over and execute all your tasks.
The semaphore, in the original snippet you pointed to, guarded the creation of the tasks itself, ensuring only "n" tasks would be active at a time. What is passed in to gather_with_concurrency in that snippet are co-routines.
Co-routines, unlike tasks, are objects that are ready to be awaited but are not yet scheduled. They can be passed around freely, just like any other object; they will only be executed when they are either awaited or wrapped by a task (and then when the code passes control to the asyncio loop).
In your code, you create the co-routine with the get_product_information call and immediately wrap it in a task. By the time the await expression on the line that calls gather_with_concurrency runs, they are all already running at once.
The fix is simple: do not create a task at this point; tasks should only be created inside the code guarded by your semaphore. Add just the raw co-routines to your list:
...
all_coroutines = []
# check all products in the current page
all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
for product_specific_url in all_products_in_current_page:
    current_coroutine = get_product_information(product_specific_url)
    all_coroutines.append(current_coroutine)
await gather_with_concurrency(random.randrange(8, 15), *all_coroutines)
There is still an unrelated problem in this code that will make concurrency fail: you are making a synchronous call to time.sleep inside get_product_information. This will stall the asyncio loop at that point until the sleep is over. The correct thing to do is to use await asyncio.sleep(...).
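For completeness, a sketch of get_product_information with that fix applied (names as in the question):

async def get_product_information(url_to_append):
    url = 'https://www.amazon.com.br' + url_to_append
    source = await get_source_code_or_content(url, should_render_javascript=True)
    # asyncio.sleep yields control to the event loop; time.sleep would freeze it
    await asyncio.sleep(random.uniform(2, 5))
    return source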
I have an API built with FastAPI whose endpoint submits a task to a Celery worker, waits for the worker to finish its job, and returns the result to the user.
The question is: what is the correct way to wait for the result?
Endpoint code
from tasks import celery_application, some_task
from celery.result import AsyncResult

@api.post('/submit')
async def submit(data: str):
    task = some_task.apply_async(kwargs={'data': data}, queue='some_queue')
    result = AsyncResult(id=task.task_id, app=celery_application).get()
    return {'task_result': result}
The problem with AsyncResult is that its get method blocks the application: it waits for the result synchronously, and the API freezes in the meantime.
One of the solutions I came up with is checking for the result in a loop for n seconds:
from tasks import celery_application, some_task
import asyncio
import redis

r = redis.Redis.from_url(REDIS_CONN_URI)

@api.post('/submit')
async def submit(data: str):
    task = some_task.apply_async(kwargs={'data': data}, queue='some_queue')
    result = None
    for _ in range(100):
        if r.exists(task.task_id):
            result = r.get(task.task_id)
            break
        await asyncio.sleep(0.3)
    return {'task_result': result}
But it only works partially: although the endpoint is not blocked and can be accessed, it gets blocked again when it tries to send a task again.
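For what it's worth, one common way to keep the endpoint responsive without a polling loop is to hand the blocking get() to a worker thread. A minimal, untested sketch, assuming Python 3.9+ for asyncio.to_thread and a hypothetical 30-second timeout:

import asyncio
from celery.result import AsyncResult
from tasks import celery_application, some_task

@api.post('/submit')
async def submit(data: str):
    task = some_task.apply_async(kwargs={'data': data}, queue='some_queue')
    async_result = AsyncResult(id=task.task_id, app=celery_application)
    # run the blocking .get() in a thread so the event loop stays responsive
    result = await asyncio.to_thread(async_result.get, timeout=30)
    return {'task_result': result}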
This may be a dumb question, but I cannot seem to run the Python google-cloud-bigquery client asynchronously.
My goal is to run multiple queries concurrently and wait for all of them to finish in an asyncio.wait() query gatherer. I'm using asyncio.create_task() to launch the queries.
The problem is that each query waits for the previous one to complete before starting.
Here is my query function (quite simple):
async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
    job = self.api.query(query, **kwargs)
    return job.result()
Since I cannot await job.result(), should I await something else?
If you are working inside a coroutine and want to run different queries without blocking the event loop, you can use the run_in_executor function, which basically runs your queries in background threads without blocking the loop. Here's a good example of how to use that.
Make sure, though, that that's exactly what you need; jobs created to run queries in the Python API are already asynchronous, and they only block when you call job.result(). This means that you don't need to use asyncio unless you are inside a coroutine.
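As a rough sketch of the run_in_executor approach (untested; client setup as in the example that follows):

import asyncio
import google.cloud.bigquery as bq

client = bq.Client.from_service_account_json('path/to/key.json')

async def exec_query(query, **kwargs):
    loop = asyncio.get_event_loop()
    job = client.query(query, **kwargs)  # starting the job does not block
    # job.result() blocks, so run it in the default thread pool instead
    return await loop.run_in_executor(None, job.result)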
Here's a quick possible example of retrieving results as soon as the jobs are finished:
from concurrent.futures import ThreadPoolExecutor, as_completed
import google.cloud.bigquery as bq

client = bq.Client.from_service_account_json('path/to/key.json')
query1 = 'SELECT 1'
query2 = 'SELECT 2'
threads = []
results = []
executor = ThreadPoolExecutor(5)

for job in [client.query(query1), client.query(query2)]:
    threads.append(executor.submit(job.result))

# Here you can run any code you like. The interpreter is free
for future in as_completed(threads):
    results.append(list(future.result()))
results will be:
[[Row((2,), {'f0_': 0})], [Row((1,), {'f0_': 0})]]
Just to share a different solution:
import numpy as np
from time import sleep
import google.cloud.bigquery as bigquery

bq = bigquery.Client.from_service_account_json('path/to/key.json')

query1 = """
SELECT
  language.name,
  AVG(language.bytes)
FROM `bigquery-public-data.github_repos.languages`
, UNNEST(language) AS language
GROUP BY language.name"""
query2 = 'SELECT 2'

def dummy_callback(future):
    global jobs_done
    jobs_done[future.job_id] = True

jobs = [bq.query(query1), bq.query(query2)]
jobs_done = {job.job_id: False for job in jobs}
[job.add_done_callback(dummy_callback) for job in jobs]

# blocking loop to wait for jobs to finish
while not (np.all(list(jobs_done.values()))):
    print('waiting for jobs to finish ... sleeping for 1s')
    sleep(1)

print('all jobs done, do your stuff')
Rather than using as_completed, I prefer to use the built-in async functionality of the bigquery jobs themselves. This also makes it possible for me to decompose the data pipeline into separate Cloud Functions, without having to keep the main ThreadPoolExecutor alive for the duration of the whole pipeline. Incidentally, this was the reason I was looking into this: my pipelines are longer than the max timeout of 9 minutes for Cloud Functions (or even 15 minutes for Cloud Run).
The downside is that I need to keep track of all the job_ids across the various functions, but that is relatively easy to solve when configuring the pipeline by specifying inputs and outputs such that they form a directed acyclic graph.
In fact I found a way to wrap my query in an async call quite easily, thanks to the asyncio.create_task() function.
I just needed to wrap the job.result() in a coroutine; here is the implementation. It does run asynchronously now.
class BQApi(object):
    def __init__(self):
        self.api = bigquery.Client.from_service_account_json(BQ_CONFIG["credentials"])

    async def exec_query(self, query, **kwargs) -> bigquery.table.RowIterator:
        job = self.api.query(query, **kwargs)
        task = asyncio.create_task(self.coroutine_job(job))
        return await task

    @staticmethod
    async def coroutine_job(job):
        return job.result()
I used @dkapitan's answer to provide an async wrapper:
async def async_bigquery(client, query):
    done = False

    def callback(future):
        nonlocal done
        done = True

    job = client.query(query)
    job.add_done_callback(callback)
    while not done:
        await asyncio.sleep(.1)
    return job
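A possible usage from inside a coroutine, to run several queries concurrently; since the returned jobs are already done, calling result() on them no longer blocks:

job1, job2 = await asyncio.gather(
    async_bigquery(client, 'SELECT 1'),
    async_bigquery(client, 'SELECT 2'),
)
rows = [list(job1.result()), list(job2.result())]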
What do I mean by "deterministic time"? For example, AWS offers a service called AWS Lambda. A process started as a Lambda function has a time limit; after that, the Lambda function stops execution and the task is assumed to have finished with an error. An example task: sending data to an HTTP endpoint. Depending on the network connection to the endpoint, or other factors, sending the data can take a long time. If I need to send the same data to many endpoints, the full processing time is one endpoint's time multiplied by the number of endpoints, which increases the chance that the Lambda function will be stopped before all the data has been sent to all endpoints.
To solve this I need to send data to the different endpoints in parallel, using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes more time than the Lambda function's time limit allows, the function will be aborted and return an error. So I need to use a timeout on the HTTP request to abort it if it takes more time than expected.
If the HTTP request is canceled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so it isn't lost. The time needed to save unprocessed data can be predicted, because I control the storage where the data will be saved.
The last part that consumes time is the procedure or loop where the threads are scheduled via executor.submit(). If there is only one endpoint, or a small number of them, the consumed time will be small and there is no need to control it. But if I am dealing with many endpoints, I have to take this into account.
So basically the full time consists of:
scheduling threads
http request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import requests
import time

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass

main()
I was thinking about how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial
import requests

start = time.time()

def send_data(data):
    ...

def get_data(n):
    return {"wait": n}

def done_callback(a, b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create a future object. The main problem I have is that done_callback() is not executed when the executor's thread finishes its job, but only when the futures are "processed" by the asyncio.wait() expression.
Basically, I am seeking a way to start executing asyncio futures the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression before done_callback() is called. If you have other ideas for writing Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
And another question. If a thread or future does its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was canceled, I don't have a result, and I have to use functools.partial() to pass additional data into done_callback() that helps me understand which data this callback was called for. If the passed data is small, this is not a problem. If the data is big, I need to put the data into an array/list/dictionary and pass only an index into the callback, or put the full data into the callback.
Can I somehow get access to the variable that was passed to the future/thread from a done_callback() that was triggered on a canceled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
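A minimal, untested sketch of that idea, using aiohttp and a hypothetical endpoint list; on timeout, wait_for cancels the gather, and the cancellation propagates into each task at its first blocking call:

import asyncio
import aiohttp

async def send_data(session, url, data):
    try:
        async with session.post(url, json=data) as response:
            return {'status': 'ok' if response.status == 200 else 'error'}
    except asyncio.CancelledError:
        # the time budget ran out: save the unprocessed data here, then re-raise
        raise

async def main():
    endpoints = ['http://127.0.0.1:5000/endpoint']  # hypothetical
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(send_data(session, url, {"wait": 1}))
                 for url in endpoints]
        try:
            return await asyncio.wait_for(asyncio.gather(*tasks), timeout=0.5)
        except asyncio.TimeoutError:
            # unfinished tasks were cancelled; their except blocks saved the data
            return None

asyncio.get_event_loop().run_until_complete(main())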
I have a tornado application which needs to run a blocking function on ProcessPoolExecutor. This blocking function employs a library which emits incremental results via blinker events. I'd like to collect these events and send them back to my tornado app as they occur.
At first, tornado seemed ideal for this use case because it's asynchronous. I thought I could simply pass a tornado.queues.Queue object to the function to be run on the pool and then put() events onto this queue as part of my blinker event callback.
However, reading the docs of tornado.queues.Queue, I learned they are not managed across processes like multiprocessing.Queue and are not thread safe.
Is there a way to retrieve these events from the pool as they occur? Should I wrap multiprocessing.Queue so it produces Futures? That seems unlikely to work as I doubt the internals of multiprocessing are compatible with tornado.
[EDIT]
There are some good clues here: https://gist.github.com/hoffrocket/8050711
To collect anything but the return value of a task passed to a ProcessPoolExecutor, you must use a multiprocessing.Queue (or another object from the multiprocessing library). Then, since multiprocessing.Queue only exposes a synchronous interface, you must use another thread in the parent process to read from the queue (without reaching into implementation details; there is a file descriptor that could be used here, but we'll ignore it for now since it's undocumented and subject to change).
Here's a quick untested example:
import multiprocessing
import concurrent.futures
from tornado import gen
from tornado.ioloop import IOLoop

queue = multiprocessing.Queue()
proc_pool = concurrent.futures.ProcessPoolExecutor()
thread_pool = concurrent.futures.ThreadPoolExecutor()

async def read_events():
    while True:
        # convert_yielded turns the concurrent.futures.Future into an awaitable
        event = await gen.convert_yielded(thread_pool.submit(queue.get))
        print(event)

async def foo():
    IOLoop.current().spawn_callback(read_events)
    await gen.convert_yielded(proc_pool.submit(do_something_and_write_to_queue))
You can do it more simply than that. Here's a coroutine that submits four slow function calls to subprocesses and awaits them:
from concurrent.futures import ProcessPoolExecutor
from time import sleep
from tornado import gen, ioloop

pool = ProcessPoolExecutor()

def calculate_slowly(x):
    sleep(x)
    return x

async def parallel_tasks():
    # Create futures in a randomized order.
    futures = [gen.convert_yielded(pool.submit(calculate_slowly, i))
               for i in [1, 3, 2, 4]]
    wait_iterator = gen.WaitIterator(*futures)
    while not wait_iterator.done():
        try:
            result = await wait_iterator.next()
        except Exception as e:
            print("Error {} from {}".format(e, wait_iterator.current_future))
        else:
            print("Result {} received from future number {}".format(
                result, wait_iterator.current_index))

ioloop.IOLoop.current().run_sync(parallel_tasks)
It outputs:
Result 1 received from future number 0
Result 2 received from future number 2
Result 3 received from future number 1
Result 4 received from future number 3
You can see that the coroutine receives results in the order they complete, not the order they were submitted: future number 1 resolves after future number 2, because future number 1 slept longer. convert_yielded transforms the Futures returned by ProcessPoolExecutor into Tornado-compatible Futures that can be awaited in a coroutine.
Each future resolves to the value returned by calculate_slowly: in this case it's the same number that was passed into calculate_slowly, and the same number of seconds as calculate_slowly sleeps.
To include this in a RequestHandler, try something like this:
from tornado import web

class MainHandler(web.RequestHandler):
    async def get(self):
        self.write("Starting....\n")
        self.flush()
        futures = [gen.convert_yielded(pool.submit(calculate_slowly, i))
                   for i in [1, 3, 2, 4]]
        wait_iterator = gen.WaitIterator(*futures)
        while not wait_iterator.done():
            result = await wait_iterator.next()
            self.write("Result {} received from future number {}\n".format(
                result, wait_iterator.current_index))
            self.flush()

if __name__ == "__main__":
    application = web.Application([
        (r"/", MainHandler),
    ])
    application.listen(8888)
    ioloop.IOLoop.instance().start()
If you curl localhost:8888, you can observe that the server responds incrementally to the client request.