I am working on a big data problem and am stuck on some concurrency and async IO issues. The problem is as follows:
1) I have multiple huge files (~4 GB each, up to 15 of them) which I am processing using ProcessPoolExecutor from the concurrent.futures module, this way:
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

def process(source):
    files = os.listdir(source)
    with ProcessPoolExecutor() as executor:
        future_to_file = {executor.submit(process_individual_file, source, input_file): input_file
                          for input_file in files}
        for future in as_completed(future_to_file):
            data = future.result()
2) Now, in each file, I want to go line by line, process each line to build a particular JSON object, group 2,000 such JSONs together, and hit an API with that request to get a response. Here is the code:
import requests

def process_individual_file(source, input_file):
    limit = 2000
    json_array = []
    with open(source + input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                response = requests.post(API_URL, json=json_array)
                # check response status here
                json_array = []  # reset the batch before collecting the next 2000 lines
                limit = 2000
3) Now the problem: the number of lines in each file is really large, and the API call is blocking and a bit slow to respond, so the program takes a huge amount of time to complete.
4) What I want to achieve is to make that API call asynchronous, so that I can keep processing the next batch of 2000 while the API call is in flight.
5) Things I have tried so far: I tried to implement this using asyncio, but there we need to collect the set of future tasks and wait for their completion using the event loop. Something like this:
async def process_individual_file(source, input_file):
    tasks = []
    limit = 2000
    json_array = []
    with open(source + input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                json_array = []
                limit = 2000
    await asyncio.wait(tasks)

ioloop = asyncio.get_event_loop()
ioloop.run_until_complete(process_individual_file(source, input_file))
ioloop.close()
6) I don't really understand this, because it is indirectly the same as before: it waits to collect all tasks before launching them. Can someone help me with the correct architecture for this problem? How can I call the API asynchronously, without collecting all tasks up front, and with the ability to process the next batch in parallel?
I am really not understanding this because this is indirectly the same as previous as it waits to collect all tasks before launching them.
No, you're wrong here. When you create an asyncio.Task with asyncio.ensure_future, it starts executing the call_api coroutine immediately. This is how tasks in asyncio work:
import asyncio

async def test(i):
    print(f'{i} started')
    await asyncio.sleep(i)

async def main():
    tasks = [asyncio.ensure_future(test(i)) for i in range(3)]
    await asyncio.sleep(0)
    print('At this moment tasks are already started')
    await asyncio.wait(tasks)

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
Output:
0 started
1 started
2 started
At this moment tasks are already started
The problem with your approach is that process_individual_file is not actually asynchronous: it does a large amount of CPU-bound work without returning control to your asyncio event loop. That's the problem: the function blocks the event loop, making it impossible for the tasks to execute.
A very simple but effective solution, I think, is to return control to the event loop manually using asyncio.sleep(0) at regular points during process_individual_file, for example on reading each line:
async def process_individual_file(source, input_file):
    tasks = []
    limit = 2000
    json_array = []
    with open(source + input_file) as sf:
        for line in sf:
            await asyncio.sleep(0)  # return control to the event loop so it can run pending tasks
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                tasks.append(asyncio.ensure_future(call_api(json_array)))
                json_array = []
                limit = 2000
    await asyncio.wait(tasks)
Upd:
there will be more than millions of requests to be done and hence I am feeling uncomfortable to store future objects for all of them in a list
It makes much sense. Nothing good will happen if you run a million network requests in parallel. The usual way to set a limit in this case is to use a synchronization primitive like asyncio.Semaphore.
I advise you to write a generator that yields each json_array from the file, and to acquire the semaphore before adding a new task and release it when the task is done. You will get clean code protected from having too many tasks running in parallel.
This will look like something like this:
def get_json_array(input_file):
    json_array = []
    limit = 2000
    with open(input_file) as sf:
        for line in sf:
            json_array.append(form_json(line))
            limit -= 1
            if limit == 0:
                yield json_array  # the generator lets us split the file-reading logic from adding tasks
                json_array = []
                limit = 2000
    if json_array:
        yield json_array  # don't forget the last, possibly smaller, batch

sem = asyncio.Semaphore(50)  # don't allow more than 50 parallel requests

async def process_individual_file(input_file):
    for json_array in get_json_array(input_file):
        await sem.acquire()  # file reading won't resume until there's room for newer tasks
        task = asyncio.ensure_future(call_api(json_array))
        task.add_done_callback(lambda t: sem.release())  # on task done, free a slot for the next task
        task.add_done_callback(lambda t: print(t.result()))  # print the result when a call_api finishes
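For completeness, here is a minimal sketch of what call_api and the driver might look like, assuming aiohttp on the HTTP side; API_URL and 'input.txt' are placeholders, and form_json is the question's own helper:
import asyncio
import aiohttp

API_URL = 'http://example.com/api'  # hypothetical endpoint

async def call_api(json_array):
    # one session per call keeps the sketch short; sharing a single
    # ClientSession across calls would be more efficient in real code
    async with aiohttp.ClientSession() as session:
        async with session.post(API_URL, json=json_array) as response:
            return response.status

loop = asyncio.get_event_loop()
# note: process_individual_file returns once the last task is *scheduled*;
# in real code you would also await the tasks still in flight before closing
loop.run_until_complete(process_individual_file('input.txt'))
loop.close()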
Related
Objective:
I am trying to scrape multiple URLs simultaneously. I don't want to make too many requests at the same time so I am using this solution to limit it.
Problem:
Requests are being made for ALL tasks instead of for a limited number at a time.
Stripped-down Code:
async def download_all_product_information():
    # TO LIMIT THE NUMBER OF CONCURRENT REQUESTS
    async def gather_with_concurrency(n, *tasks):
        semaphore = asyncio.Semaphore(n)

        async def sem_task(task):
            async with semaphore:
                return await task

        return await asyncio.gather(*(sem_task(task) for task in tasks))

    # FUNCTION TO ACTUALLY DOWNLOAD INFO
    async def get_product_information(url_to_append):
        url = 'https://www.amazon.com.br' + url_to_append
        print('Product Information - Page ' + str(current_page_number) + ' for category ' + str(
            category_index) + '/' + str(len(all_categories)) + ' in ' + gender)
        source = await get_source_code_or_content(url, should_render_javascript=True)
        time.sleep(random.uniform(2, 5))
        return source

    # LOOP WHERE STUFF GETS DONE
    for current_page_number in range(1, 401):
        for gender in os.listdir(base_folder):
            all_tasks = []
            # check all products in the current page
            all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
            for product_specific_url in all_products_in_current_page:
                current_task = asyncio.create_task(get_product_information(product_specific_url))
                all_tasks.append(current_task)
            await gather_with_concurrency(random.randrange(8, 15), *all_tasks)

async def main():
    await download_all_product_information()

# just to make sure there are not any problems caused by two event loops
if asyncio.get_event_loop().is_running():  # only patch if needed (i.e. running in Notebook, Spyder, etc)
    import nest_asyncio
    nest_asyncio.apply()

# for asynchronous functionality
if __name__ == '__main__':
    asyncio.run(main())
What am I doing wrong? Thanks!
What is wrong is this line:
current_task = asyncio.create_task(get_product_information(product_specific_url))
When you create a "task" it is immediately scheduled for execution. As soon as your code yields execution to the asyncio loop (at any "await" expression), asyncio will loop through and execute all your tasks.
The semaphore, in the original snippet you pointed to, guarded the creation of the tasks itself, ensuring only "n" tasks would be active at a time. What is passed in to gather_with_concurrency in that snippet are co-routines.
Co-routines, unlike tasks, are objects that are ready to be awaited but are not yet scheduled. They can be passed around for free, just like any other object; they will only be executed when they are either awaited or wrapped by a task (and then, when the code passes control to the asyncio loop).
In your code, you are creating the co-routine with the get_product_information call and immediately wrapping it in a task. So at the await expression in the line that calls gather_with_concurrency, they all run at once.
The fix is simple: do not create a task at this point; that should happen only inside the code guarded by your semaphore. Add just the raw co-routines to your list:
    ...
    all_coroutines = []
    # check all products in the current page
    all_products_in_current_page = open_list(os.path.join(base_folder, gender, category, current_page))
    for product_specific_url in all_products_in_current_page:
        current_coroutine = get_product_information(product_specific_url)
        all_coroutines.append(current_coroutine)
    await gather_with_concurrency(random.randrange(8, 15), *all_coroutines)
There is still one unrelated incorrectness in this code that will make concurrency fail: you are making a synchronous call to time.sleep inside get_product_information. This will stall the asyncio loop at that point until the sleep is over. The correct thing to do is to use await asyncio.sleep(...) instead.
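In other words, the tail of get_product_information would become something like this (only the sleep line changes):
        source = await get_source_code_or_content(url, should_render_javascript=True)
        await asyncio.sleep(random.uniform(2, 5))  # yields to the event loop instead of stalling it
        return source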
Question on asyncio. I have this working, just not sure if it's the correct way or if there is an easier way.
The short version of what I am trying to do is to continuously execute run() 10x concurrently.
To do this I had to create a function work_it() with a while True loop.
The run() function takes about 5 minutes to complete: database calls, processing, aiohttp requests, etc.
Is this the best way to do this, or is there another way to have asyncio continuously run a function over and over again with 10 concurrent workers?
Also, is asyncio.gather the correct function to use? Am I better off using an executor?
Thanks in advance.
Erik
import asyncio
import random
import time

db = Database()
conn = db.connect()

async def run(worker_id=None):
    """
    Using a shared database connection, create an object, query the
    database, process the data, and do an HTTP POST with aiohttp.
    Returns: True/False based on the HTTP POST.
    """
    # my_object = Object_Model(db)
    # await do_sql_queries
    # await process_data
    # Lots of processing
    # result = await aiohttp_requests
    nap_time = random.randint(1, 5)
    print(f'Worker-{worker_id} sleeping for {nap_time}')
    await asyncio.sleep(nap_time)
    return True

async def work_it(worker_id=None):
    """
    This worker should run forever.
    """
    while True:
        start = time.monotonic()
        result = await run(worker_id)
        duration = time.monotonic() - start
        print(f'Worker-{worker_id} ran for {duration:.6f} seconds')

async def main():
    """
    Start 10 "workers".
    """
    workers = 10
    tasks = []
    for worker_id in range(1, workers + 1):
        print(f'Building Task {worker_id}')
        tasks.append(work_it(worker_id))
    print('Await Gather')
    await asyncio.gather(*tasks)

asyncio.run(main())
What do I mean by "deterministic time"? For example, AWS offers the service "AWS Lambda". A process started as a lambda function has a time limit; after that, the lambda function stops execution, and it is assumed that the task finished with an error. An example task: sending data to an HTTP endpoint. Depending on the network connection to the endpoint, and other factors, sending the data can take a long time. If I need to send the same data to many endpoints, the full process time is one send time multiplied by the number of endpoints, which increases the chance that the lambda function will be stopped before all the data has been sent to all endpoints.
To solve this, I need to send data to the different endpoints in parallel, using threads.
The problem with threads: a started thread can't be stopped. If an HTTP request takes more time than the lambda function's time limit allows, the lambda function is aborted and returns an error. So I need to use a timeout with the HTTP request, to abort it if it takes more time than expected.
If the HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so it isn't lost. The time needed to save unprocessed data can be predicted, because I control the storage where the data will be saved.
And the last part that consumes time is the procedure, or loop, where threads are scheduled with executor.submit(). If there is only one endpoint, or a small number of them, the time consumed is small and there is no need to control it. But if I am dealing with many endpoints, I have to take this into account.
So basically the full time consists of:
scheduling threads
HTTP request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import requests
import time

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass

main()
I was thinking about how I could use the asyncio library, instead of threads, to do the same thing:
import asyncio
import time
from functools import partial
import requests

start = time.time()

def send_data(data):
    ...

def get_data(n):
    return {"wait": n}

def done_callback(a, b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create the future object. The main problem I have is that done_callback() is not executed when the thread started by the executor finishes its job, but only when the futures are "processed" by the asyncio.wait() expression.
Basically, I am seeking a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas for how to write Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
And one other question. If a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass done_callback additional data that can help me understand which data the callback was called for. If the data passed is small, this is not a problem. If the data is big, I need to put it in an array/list/dictionary and pass only the index to the callback, or put the full data in the callback.
Can I somehow get access to the variable that was passed to the future/thread, from the done_callback() that was triggered on the cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
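As a rough sketch of that idea, assuming aiohttp and keeping each input paired with its task so the data for cancelled requests can still be saved (all names here are illustrative, not from the question):
import asyncio
import aiohttp

async def send_data(session, url, data):
    async with session.post(url, json=data) as response:
        return {'url': url, 'status': response.status}

async def send_to_all(endpoints, data, time_budget):
    async with aiohttp.ClientSession() as session:
        pairs = [(url, asyncio.ensure_future(send_data(session, url, data)))
                 for url in endpoints]
        try:
            return await asyncio.wait_for(
                asyncio.gather(*(task for _, task in pairs)),
                timeout=time_budget)
        except asyncio.TimeoutError:
            # tasks not finished within the budget were cancelled by wait_for;
            # the paired inputs tell us which data still needs to be saved
            return [url for url, task in pairs if task.cancelled()]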
I have a tornado application which needs to run a blocking function on a ProcessPoolExecutor. This blocking function employs a library which emits incremental results via blinker events. I'd like to collect these events and send them back to my tornado app as they occur.
At first, tornado seemed ideal for this use case because it's asynchronous. I thought I could simply pass a tornado.queues.Queue object to the function to be run on the pool and then put() events onto this queue as part of my blinker event callback.
However, reading the docs of tornado.queues.Queue, I learned that they are not managed across processes like multiprocessing.Queue, and are not thread safe.
Is there a way to retrieve these events from the pool as they occur? Should I wrap multiprocessing.Queue so it produces Futures? That seems unlikely to work, as I doubt the internals of multiprocessing are compatible with tornado.
[EDIT]
There are some good clues here: https://gist.github.com/hoffrocket/8050711
To collect anything but the return value of a task passed to a ProcessPoolExecutor, you must use a multiprocessing.Queue (or another object from the multiprocessing library). Then, since multiprocessing.Queue only exposes a synchronous interface, you must use another thread in the parent process to read from the queue (without reaching into implementation details; there is a file descriptor that could be used here, but we'll ignore it for now since it's undocumented and subject to change).
Here's a quick untested example:
import concurrent.futures
import multiprocessing

from tornado import gen
from tornado.ioloop import IOLoop

queue = multiprocessing.Queue()
proc_pool = concurrent.futures.ProcessPoolExecutor()
thread_pool = concurrent.futures.ThreadPoolExecutor()

async def read_events():
    while True:
        # convert_yielded turns the concurrent.futures.Future into an awaitable
        event = await gen.convert_yielded(thread_pool.submit(queue.get))
        print(event)

async def foo():
    IOLoop.current().spawn_callback(read_events)
    await gen.convert_yielded(proc_pool.submit(do_something_and_write_to_queue))
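The example leaves do_something_and_write_to_queue undefined; a minimal sketch of the subprocess side might look like this (the progress signal and run_long_computation are hypothetical stand-ins for the library mentioned in the question):
import blinker

progress = blinker.signal('progress')  # hypothetical incremental-result signal

def forward_event(sender, **kwargs):
    # runs in the child process: push each incremental result to the parent
    queue.put(kwargs)

def do_something_and_write_to_queue():
    progress.connect(forward_event)
    run_long_computation()  # the blocking library call that emits the events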
You can do it more simply than that. Here's a coroutine that submits four slow function calls to subprocesses and awaits them:
from concurrent.futures import ProcessPoolExecutor
from time import sleep
from tornado import gen, ioloop

pool = ProcessPoolExecutor()

def calculate_slowly(x):
    sleep(x)
    return x

async def parallel_tasks():
    # Create futures in a randomized order.
    futures = [gen.convert_yielded(pool.submit(calculate_slowly, i))
               for i in [1, 3, 2, 4]]
    wait_iterator = gen.WaitIterator(*futures)
    while not wait_iterator.done():
        try:
            result = await wait_iterator.next()
        except Exception as e:
            print("Error {} from {}".format(e, wait_iterator.current_future))
        else:
            print("Result {} received from future number {}".format(
                result, wait_iterator.current_index))

ioloop.IOLoop.current().run_sync(parallel_tasks)
It outputs:
Result 1 received from future number 0
Result 2 received from future number 2
Result 3 received from future number 1
Result 4 received from future number 3
You can see that the coroutine receives results in the order they complete, not the order they were submitted: future number 1 resolves after future number 2, because future number 1 slept longer. convert_yielded transforms the Futures returned by ProcessPoolExecutor into Tornado-compatible Futures that can be awaited in a coroutine.
Each future resolves to the value returned by calculate_slowly: in this case it's the same number that was passed into calculate_slowly, and the same number of seconds as calculate_slowly sleeps.
To include this in a RequestHandler, try something like this:
from tornado import web

class MainHandler(web.RequestHandler):
    async def get(self):
        self.write("Starting....\n")
        self.flush()
        futures = [gen.convert_yielded(pool.submit(calculate_slowly, i))
                   for i in [1, 3, 2, 4]]
        wait_iterator = gen.WaitIterator(*futures)
        while not wait_iterator.done():
            result = await wait_iterator.next()
            self.write("Result {} received from future number {}\n".format(
                result, wait_iterator.current_index))
            self.flush()

if __name__ == "__main__":
    application = web.Application([
        (r"/", MainHandler),
    ])
    application.listen(8888)
    ioloop.IOLoop.instance().start()
You can observe, if you curl localhost:8888, that the server responds incrementally to the client request.
I've been trying to make a bot in Slack that remains responsive even if it hasn't finished processing earlier commands, so it could go and do something that takes some time without locking up. It should return whatever is finished first.
I think I'm getting part of the way there: it now doesn't ignore stuff that's typed in before an earlier command is finished running. But it still doesn't allow threads to "overtake" each other - a command called first will return first, even if it takes much longer to complete.
import asyncio
from slackclient import SlackClient
import time, datetime as dt

token = "my token"
sc = SlackClient(token)

@asyncio.coroutine
def sayHello(waitPeriod=5):
    yield from asyncio.sleep(waitPeriod)
    msg = 'Hello! I waited {} seconds.'.format(waitPeriod)
    return msg

@asyncio.coroutine
def listen():
    yield from asyncio.sleep(1)
    x = sc.rtm_connect()
    info = sc.rtm_read()
    if len(info) == 1:
        if r'/hello' in info[0]['text']:
            print(info)
            try:
                waitPeriod = int(info[0]['text'][6:])
            except:
                print('Can not read a time period. Using 5 seconds.')
                waitPeriod = 5
            msg = yield from sayHello(waitPeriod=waitPeriod)
            print(msg)
            chan = info[0]['channel']
            sc.rtm_send_message(chan, msg)
    asyncio.async(listen())

def main():
    print('here we go')
    loop = asyncio.get_event_loop()
    asyncio.async(listen())
    loop.run_forever()

if __name__ == '__main__':
    main()
When I type /hello 12 and /hello 2 into the Slack chat window, the bot does respond to both commands now. However, it doesn't process the /hello 2 command until it's finished with the /hello 12 command. My understanding of asyncio is a work in progress, so it's quite possible I'm making a very basic error. I was told in a previous question that things like sc.rtm_read() are blocking functions. Is that the root of my problem?
Thanks a lot,
Alex
What is happening is that your listen() coroutine is blocking at the yield from sayHello() statement. Only once sayHello() completes will listen() be able to continue on its merry way. The crux is that the yield from statement (or await, from Python 3.5+) is blocking: it chains the two coroutines together, and the 'parent' coroutine can't complete until the linked 'child' coroutine completes. (However, 'neighbouring' coroutines that aren't part of the same linked chain are free to proceed in the meantime.)
The simple way to release sayHello() without holding up listen() in this case is to use listen() as a dedicated listening coroutine and to offload all subsequent actions into their own Task wrappers, thus not hindering listen() from responding to subsequent incoming messages. Something along these lines:
@asyncio.coroutine
def sayHello(waitPeriod, sc, chan):
    yield from asyncio.sleep(waitPeriod)
    msg = 'Hello! I waited {} seconds.'.format(waitPeriod)
    print(msg)
    sc.rtm_send_message(chan, msg)

@asyncio.coroutine
def listen():
    # connect once only if possible:
    x = sc.rtm_connect()
    # use a while True block instead of repeatedly scheduling a new Task at the end
    while True:
        yield from asyncio.sleep(0)  # use 0 unless you need to wait a full second?
        # x = sc.rtm_connect()  # probably not necessary to reconnect each loop?
        info = sc.rtm_read()
        if len(info) == 1:
            if r'/hello' in info[0]['text']:
                print(info)
                try:
                    waitPeriod = int(info[0]['text'][6:])
                except:
                    print('Can not read a time period. Using 5 seconds.')
                    waitPeriod = 5
                chan = info[0]['channel']
                asyncio.async(sayHello(waitPeriod, sc, chan))
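A side note for modern readers: this answer uses the pre-3.5 generator-based coroutine style. On current Python the same structure would be written with async def and await, with asyncio.ensure_future (or asyncio.create_task) in place of the long-deprecated asyncio.async.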