Dask Distributed: access client futures from separate process - python

I have launched many simulations with Dask Distributed:
from time import sleep
from distributed import Client, as_completed
def simulation(x):
    """ Proxy function for simulation """
    sleep(60 * 60 * 24)  # wait one day
    return hash(x)

def save(result):
    with open("result", "w") as f:
        print(result, file=f)

if __name__ == "__main__":
    client = Client("localhost:8786")
    futures = client.map(simulation, range(1000))
    for future in as_completed(futures):
        result = future.result()
        save(result)
However, this code has a bug: open("result", "w") should be open(str(result), "w"). I'd like to correct that mistake and then re-process the client's futures.
However, I do not know of a way to do that without stopping the Python process with a keyboard interrupt and re-submitting the jobs to the Dask cluster. I don't want to do that because these simulations have already taken a couple of days.
I want to access all the futures the client has and save all the existing results. How do I make that happen?
Possibly relevant questions
"Dask Distributed Getting Futures after Client Closed" isn't relevant because the client connection is still open:

client.has_what is the method you're looking for:
from distributed import Client, Future, as_completed

if __name__ == "__main__":
    client = Client("localhost:8786")
    futures = [Future(key) for keys in client.has_what().values() for key in keys]
    for future in as_completed(futures):
        ...
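Once the futures are reconstructed, a minimal follow-up sketch (assuming the results are still held in cluster memory) is to run the corrected save function over them:

def save(result):
    # corrected version of the buggy function: one file per result
    with open(str(result), "w") as f:
        print(result, file=f)

for future in as_completed(futures):
    save(future.result())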

Related

Unable to parallelise workloads on the KubeCluster operator for Dask

I want to be able to run multiple "workflows" in parallel, where each workflow submits Dask tasks and waits for its own Dask tasks to complete. Some of these workflows will then need to use the results of their first set of tasks to run more tasks in Dask. I want the workflows to share a single Dask cluster running in Kubernetes.
I've implemented a basic proof of concept, which works with a local cluster but fails on KubeCluster with the error AttributeError: 'NoneType' object has no attribute '__await__'. This is because the operator version of KubeCluster doesn't seem to accept asynchronous=True as an argument like the old version did.
I'm fairly new to python and very new to Dask, so might be doing something very daft. I got a little further than this with the legacy KubeCluster approach but couldn't get Client.upload_file() to work asynchronously, which I'm definitely also going to need.
Very grateful for any direction in getting this to work - I'm not attached to any particular implementation strategy, as long as I can wait on a subset of Dask tasks to complete whilst others run in parallel on the same cluster, which seems like a fairly basic requirement for a distributed computing platform.
import logging
import time
import asyncio

import dask
from dask.distributed import Client
from dask_kubernetes.operator import KubeCluster

logging.basicConfig(format='%(asctime)s %(levelname)-8s %(name)-20s %(message)s',
                    level=logging.DEBUG, datefmt='%Y-%m-%d %H:%M:%S')
logger = logging.getLogger('AsyncTest')

def work(workflow, instance):
    time.sleep(5)
    logger.info('work done for workflow ' + str(workflow) + ', instance ' + str(instance))

async def workflow(id, client):
    tasks = []
    for x in range(5):
        tasks.append(client.submit(work, id, x))
    for task in tasks:
        await task
    return 'workflow 1 done'

async def all_workflows():
    cluster = KubeCluster(custom_cluster_spec='cluster-spec.yml', namespace='dask')
    cluster.adapt(minimum=5, maximum=50)
    client = Client(cluster, asynchronous=True)
    dask.config.set({"distributed.admin.tick.limit": "60s"})
    cluster.scale(15)
    logger.info('is the client async? ' + str(client.asynchronous))  # Returns false
    # work is parallelised as expected if I use this local client instead
    # client = await Client(processes=False, asynchronous=True)
    tasks = [asyncio.create_task(workflow(1, client)),
             asyncio.create_task(workflow(2, client)),
             asyncio.create_task(workflow(3, client))]
    await asyncio.gather(*tasks)
    await client.close()
    logger.info('all workflows done')

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(all_workflows())
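Not an authoritative answer, but one way to sidestep the asynchronous=True issue entirely is to keep the client synchronous and run each workflow in its own thread: distributed.Client can be shared across threads, and distributed.wait blocks only the calling thread, so workflows still overlap on the cluster. A minimal sketch under those assumptions (the scheduler address and workload are placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

from dask.distributed import Client, wait

def work(workflow, instance):
    time.sleep(5)
    return (workflow, instance)

def run_workflow(wf_id, client):
    # first wave of tasks for this workflow
    first = [client.submit(work, wf_id, x) for x in range(5)]
    wait(first)                      # blocks only this thread, not the other workflows
    # second wave can use the first wave's results
    second = [client.submit(work, wf_id, f.result()) for f in first]
    wait(second)
    return 'workflow {} done'.format(wf_id)

if __name__ == '__main__':
    client = Client('tcp://scheduler-address:8786')  # placeholder address
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(lambda i: run_workflow(i, client), [1, 2, 3]))
    print(results)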

How to use BigQuery Storage API to concurrently read streams in Python threads

I have a large table (external to BigQuery, as the data lives in Google Cloud Storage). I want to scan the table using BigQuery and pull the rows down to a client machine. For throughput, I fetch multiple streams concurrently in multiple threads.
From all I can tell, the concurrency is not working; there is actually some penalty when using multiple threads.
import concurrent.futures
import logging
import queue
import threading
import time

from google.cloud.bigquery_storage import types
from google.cloud import bigquery_storage

PROJECT_ID = 'abc'
CREDENTIALS = {....}

def main():
    table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')
    requested_session = types.ReadSession()
    requested_session.table = table
    requested_session.data_format = types.DataFormat.AVRO
    requested_session.read_options.selected_fields = ["a", "b"]

    client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
    session = client.create_read_session(
        parent="projects/{}".format(PROJECT_ID),
        read_session=requested_session,
        max_stream_count=0,
    )
    if not session.streams:
        return

    n_streams = len(session.streams)
    print("Total streams", n_streams)  # this prints 1000

    q_out = queue.Queue(1024)
    concurrency = 4

    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        tasks = [
            pool.submit(download_row,
                        client._transport.__class__,
                        client._transport._grpc_channel,
                        s.name,
                        q_out)
            for s in session.streams
        ]

        t0 = time.perf_counter()
        ntotal = 0
        ndone = 0
        while True:
            page = q_out.get()
            if page is None:
                ndone += 1
                if ndone == len(tasks):
                    break
            else:
                for row in page:
                    ntotal += 1
                    if ntotal % 10000 == 0:
                        qps = int(ntotal / (time.perf_counter() - t0))
                        print(f'QPS so far: {qps}')

        for t in tasks:
            t.result()

def download_row(transport_cls, channel, stream_name, q_out):
    try:
        transport = transport_cls(channel=channel)
        client = bigquery_storage.BigQueryReadClient(
            transport=transport,
        )
        reader = client.read_rows(stream_name)
        for page in reader.rows().pages:
            q_out.put(page)
    finally:
        q_out.put(None)

if __name__ == '__main__':
    main()
The Google BigQuery Storage API docs and multiple sources claim one can fetch multiple "streams" concurrently for higher throughput, yet I didn't find any functional example. I've followed the advice to share a gRPC "channel" across the threads.
The data items are large. The QPS I got is roughly:
concurrency=1: ~150
concurrency=2: ~120
concurrency=4: ~140
Each "page" contains about 200 rows.
Thoughts:
BigQuery quota? I only saw a request rate limit, and did not see a limit on the volume of data traffic per second. The quotas do not appear to be limiting for my case.
BigQuery server-side options? Doesn't seem to be relevant; BigQuery should be able to accept concurrent requests.
gRPC usage? I think this is the main direction to dig into, but I don't know what's wrong in my code.
Can anyone shed some light on this? Thanks.
Python threads do not run in parallel because of the GIL.
You are creating threads, not separate processes, and because of the GIL a Python process effectively runs on a single core.
ThreadPoolExecutor has been available since Python 3.2, but it is not widely used, perhaps because of misunderstandings of the capabilities and limitations of threads in Python, which are enforced by the Global Interpreter Lock ("GIL").
Look into using the multiprocessing module; a good read is here.
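As a rough illustration of that advice (not part of the original answer), the per-stream reads could be moved into worker processes with a ProcessPoolExecutor, with each worker constructing its own BigQueryReadClient since gRPC channels should not be shared across forked processes. The function names below are illustrative:

import concurrent.futures

from google.cloud import bigquery_storage

def read_stream(stream_name):
    # each worker process builds its own client (and hence its own gRPC channel)
    client = bigquery_storage.BigQueryReadClient()
    reader = client.read_rows(stream_name)
    nrows = 0
    for page in reader.rows().pages:
        nrows += sum(1 for _ in page)
    return nrows

def read_all(stream_names, concurrency=4):
    # fan the streams out over separate processes instead of threads
    with concurrent.futures.ProcessPoolExecutor(concurrency) as pool:
        return sum(pool.map(read_stream, stream_names))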
UPDATE:
Also, in your code you need one more parameter: requested_streams (this snippet uses the older bigquery_storage_v1beta1 API):
n_streams = 2
session = client.create_read_session(
    table_ref,
    parent,
    requested_streams=n_streams,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)

Python Multiprocessing: Signal job completion without passing Event object through a queue

Problem Outline
I have a Python Flask server where one of the endpoints has a moderate amount of work to do (the real code reads, resizes and returns an image). I want to optimise the endpoint so that it can be called multiple times in parallel.
The code I currently have (shown below) does not work because it relies on passing a multiprocessing.Event object through a multiprocessing.JoinableQueue which is not allowed and results in the following error:
RuntimeError: Condition objects should only be shared between processes through inheritance
How can I use a separate process to compute some jobs and notify the main thread when a specific job is complete?
Proof of Concept
Flask can be multithreaded so if one request is waiting on a result other threads can continue to process other requests. I have a basic proof of concept here that shows that parallel requests can be optimised using multiprocessing: https://github.com/alanbacon/flask_multiprocessing
The example code on GitHub spawns a new process for every request, which I understand has considerable overhead. I've also noticed that my proof-of-concept server crashes if there are more than 10 or 20 concurrent requests; I suspect this is because too many processes are being spawned.
Current Attempt
I have tried to create a set of workers that pick jobs off a queue. When a job is complete the result is written to a shared memory area. Each job contains the work to be done and an Event object that can be set when the job is complete to signal the main thread.
Each request thread passes in a job with a newly created Event object, it then immediately waits on that event before returning the result. While one server request thread is waiting the server is able to use other threads to continue to serve other requests.
The problem as mentioned above is that Event objects can not be passed around in this way.
What approach should I take to circumvent this problem?
from flask import Flask, request, Response
import multiprocessing
import uuid

app = Flask(__name__)

# flask config
app.config['PROPAGATE_EXCEPTIONS'] = True
app.config['DEBUG'] = False

def simpleWorker(complexity):
    temp = 0
    for i in range(0, complexity):
        temp += 1

mgr = multiprocessing.Manager()
results = mgr.dict()
joinableQueue = multiprocessing.JoinableQueue()
lock = multiprocessing.Lock()

def mpWorker(joinableQueue, lock, results):
    while True:
        next_task = joinableQueue.get()         # blocking call
        if next_task is None:                   # poison pill to kill worker
            break
        simpleWorker(next_task['complexity'])   # pretend to do heavy work
        result = next_task['val'] * 2           # compute result
        ID = next_task['ID']
        with lock:
            results[ID] = result                # output result to shared memory
        next_task['event'].set()                # tell main process result is calculated
        joinableQueue.task_done()               # remove task from queue

@app.route("/work/<ID>", methods=['GET'])
def work(ID=None):
    if request.method == 'GET':
        # send a task to the consumer and wait for it to finish
        uid = str(uuid.uuid4())
        event = multiprocessing.Event()
        # pass event to job so that job can tell this thread when processing is
        # complete
        joinableQueue.put({
            'val': ID,
            'ID': uid,
            'event': event,
            'complexity': 100000000
        })
        event.wait()  # wait for result to be calculated
        # get result from shared memory area, and clean up
        with lock:
            result = results[uid]
            del results[uid]
        return Response(str(result), 200)

if __name__ == "__main__":
    num_consumers = multiprocessing.cpu_count() * 2
    consumers = [
        multiprocessing.Process(
            target=mpWorker,
            args=(joinableQueue, lock, results))
        for i in range(num_consumers)
    ]
    for c in consumers:
        c.start()

    host = '127.0.0.1'
    port = 8080
    app.run(host=host, port=port, threaded=True)
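One possible workaround (a sketch under the assumptions below, not a tested answer): keep the Event objects entirely in the parent process, keyed by job ID, and have a single dispatcher thread read completed results from a plain multiprocessing.Queue and set the matching threading.Event. Only picklable data then crosses the process boundary. The names pending, results_queue, dispatcher and submit_and_wait are illustrative:

import threading
import multiprocessing
import uuid

joinableQueue = multiprocessing.JoinableQueue()   # jobs out to the workers
results_queue = multiprocessing.Queue()           # (job_id, result) back from the workers
pending = {}                                      # job_id -> threading.Event (parent only)
results = {}                                      # job_id -> result (parent only)
pending_lock = threading.Lock()

def mpWorker(joinableQueue, results_queue):
    while True:
        next_task = joinableQueue.get()
        if next_task is None:                     # poison pill
            break
        result = next_task['val'] * 2             # pretend to do heavy work
        results_queue.put((next_task['ID'], result))
        joinableQueue.task_done()

def dispatcher():
    # single thread in the parent process: routes results to waiting request threads
    while True:
        job_id, result = results_queue.get()
        with pending_lock:
            results[job_id] = result
            pending.pop(job_id).set()             # wake the request thread

def submit_and_wait(val):
    uid = str(uuid.uuid4())
    event = threading.Event()                     # never leaves this process
    with pending_lock:
        pending[uid] = event
    joinableQueue.put({'ID': uid, 'val': val})
    event.wait()
    return results.pop(uid)

The dispatcher would be started once at startup with threading.Thread(target=dispatcher, daemon=True).start(), and each Flask request thread would just call submit_and_wait().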

How to write python code that will work with threads or coroutines and will complete in deterministic time?

What I mean by "deterministic time"? For example AWS offer a service "AWS Lambda". The process started as lambda function has time limit, after that lambda function will stop execution and will assume that task was finished with error. And example task - send data to http endpoint. Depending of a network connection to http endpoint, or other factors, process of sending data can take a long time. If I need to send the same data to the many endpoints, then full process time will take one process time times endpoints amount. Which increase a chance that lambda function will be stopped before all data will be send to all endpoints.
To solve this I need to send data to different endpoints in parallel mode using threads.
The problem with threads - started thread can't be stopped. If http request will take more time than it dedicated by lambda function time limit, lambda function will be aborted and return error. So I need to use timeout with http request, to abort it, if it take more time than expected.
If http request will be canceled by timeout or endpoint will return error, I need to save not processed data somewhere to not lost the data. The time needed to save unprocessed data can be predicted, because I control the storage where data will be saved.
And the last part that consume time - procedure or loop where threads are scheduled executor.submit(). If there is only one endpoint or small number of them then the consumed time will be small. And there is no necessary to control this. But if I have deal with many endpoints, I have to take this into account.
So basically, the full time consists of:
scheduling the threads
HTTP request execution
saving unprocessed data
Here is an example of how I can manage the time budget using threads:
import concurrent.futures
from functools import partial
import requests
import time

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass
I was thinking about how I could use the asyncio library instead of threads to do the same thing.
import asyncio
import time
from functools import partial
import requests

start = time.time()

def send_data(data):
    ...

def get_data(n):
    return {"wait": n}

def done_callback(a, b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = loop.run_in_executor(None, send_data, event_data)
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        if time.time() - start > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async, I use run_in_executor() to create the future object. The main problem I have is that done_callback() is not executed when the thread started by the executor finishes its job, but only when the futures are "processed" by the asyncio.wait() expression.
Basically, I'm seeking a way to start executing asyncio futures the way ThreadPoolExecutor starts executing threads, without having to wait for the asyncio.wait() expression before done_callback() is called. If you have other ideas on how to write Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
And another question: if a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass done_callback additional data that helps me understand which data the callback was called for. If the passed data is small this is not a problem; if the data is big, I need to put it in a list or dictionary and pass only an index to the callback, or pass the full data in the callback.
Can I somehow get access, from done_callback(), to the variable that was passed to a future/thread when the callback is triggered for a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
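A minimal sketch of that approach (illustrative only; the endpoint URL, payload and time budget are placeholders), using aiohttp so the whole batch of sends can be cancelled by one overall timeout:

import asyncio
import aiohttp

async def send_data(session, url, data):
    try:
        async with session.post(url, json=data) as resp:
            return {'status': 'ok' if resp.status == 200 else 'error',
                    'msg': await resp.text()}
    except asyncio.CancelledError:
        # cancelled by the overall timeout: save unprocessed data here
        raise
    except aiohttp.ClientError as err:
        return {'status': 'error', 'msg': str(err)}

async def main():
    urls = ['http://127.0.0.1:5000/endpoint']  # placeholder endpoints
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(send_data(session, u, {"wait": 1})) for u in urls]
        try:
            # overall time budget for all sends
            await asyncio.wait_for(asyncio.gather(*tasks), timeout=0.5)
        except asyncio.TimeoutError:
            # wait_for cancels the gathered tasks on timeout
            pass

asyncio.get_event_loop().run_until_complete(main())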

Collect incremental results from Tornado's ProcessPoolExecutor

I have a tornado application which needs to run a blocking function on ProcessPoolExecutor. This blocking function employs a library which emits incremental results via blinker events. I'd like to collect these events and send them back to my tornado app as they occur.
At first, tornado seemed ideal for this use case because it's asynchronous. I thought I could simply pass a tornado.queues.Queue object to the function to be run on the pool and then put() events onto this queue as part of my blinker event callback.
However, reading the docs of tornado.queues.Queue, I learned they are not managed across processes like multiprocessing.Queue and are not thread safe.
Is there a way to retrieve these events from the pool as they occur? Should I wrap multiprocessing.Queue so it produces Futures? That seems unlikely to work as I doubt the internals of multiprocessing are compatible with tornado.
[EDIT]
There are some good clues here: https://gist.github.com/hoffrocket/8050711
To collect anything but the return value of a task passed to a ProcessPoolExecutor, you must use a multiprocessing.Queue (or another object from the multiprocessing library). Then, since multiprocessing.Queue only exposes a synchronous interface, you must use another thread in the parent process to read from the queue (without reaching into implementation details; there is a file descriptor that could be used here, but we'll ignore it for now since it's undocumented and subject to change).
Here's a quick untested example:
import multiprocessing
import concurrent.futures

from tornado.ioloop import IOLoop

queue = multiprocessing.Queue()
proc_pool = concurrent.futures.ProcessPoolExecutor()
thread_pool = concurrent.futures.ThreadPoolExecutor()

async def read_events():
    while True:
        event = await thread_pool.submit(queue.get)
        print(event)

async def foo():
    IOLoop.current().spawn_callback(read_events)
    await proc_pool.submit(do_something_and_write_to_queue)
You can do it more simply than that. Here's a coroutine that submits four slow function calls to subprocesses and awaits them:
from concurrent.futures import ProcessPoolExecutor
from time import sleep

from tornado import gen, ioloop

pool = ProcessPoolExecutor()

def calculate_slowly(x):
    sleep(x)
    return x

async def parallel_tasks():
    # Create futures in a randomized order.
    futures = [gen.convert_yielded(pool.submit(calculate_slowly, i))
               for i in [1, 3, 2, 4]]
    wait_iterator = gen.WaitIterator(*futures)
    while not wait_iterator.done():
        try:
            result = await wait_iterator.next()
        except Exception as e:
            print("Error {} from {}".format(e, wait_iterator.current_future))
        else:
            print("Result {} received from future number {}".format(
                result, wait_iterator.current_index))

ioloop.IOLoop.current().run_sync(parallel_tasks)
It outputs:
Result 1 received from future number 0
Result 2 received from future number 2
Result 3 received from future number 1
Result 4 received from future number 3
You can see that the coroutine receives results in the order they complete, not the order they were submitted: future number 1 resolves after future number 2, because future number 1 slept longer. convert_yielded transforms the Futures returned by ProcessPoolExecutor into Tornado-compatible Futures that can be awaited in a coroutine.
Each future resolves to the value returned by calculate_slowly: in this case it's the same number that was passed into calculate_slowly, and the same number of seconds as calculate_slowly sleeps.
To include this in a RequestHandler, try something like this:
from tornado import web  # in addition to the imports above

class MainHandler(web.RequestHandler):
    async def get(self):
        self.write("Starting....\n")
        self.flush()
        futures = [gen.convert_yielded(pool.submit(calculate_slowly, i))
                   for i in [1, 3, 2, 4]]
        wait_iterator = gen.WaitIterator(*futures)
        while not wait_iterator.done():
            result = await wait_iterator.next()
            self.write("Result {} received from future number {}\n".format(
                result, wait_iterator.current_index))
            self.flush()

if __name__ == "__main__":
    application = web.Application([
        (r"/", MainHandler),
    ])
    application.listen(8888)
    ioloop.IOLoop.instance().start()
You can observe if you curl localhost:8888 that the server responds incrementally to the client request.
