I'm trying to delete a lot of files in s3. I am planning on using a multiprocessing.Pool for doing all these deletes, but I'm not sure how to keep the s3.client alive between jobs. I'm wanting to do something like
import boto3
import multiprocessing as mp

def work(key):
    s3_client = boto3.client('s3')
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool() as pool:
    pool.map(work, lazy_iterator_of_billion_keys)
But the problem with this is that a significant amount of time is spent doing the s3_client = boto3.client('s3') at the start of each job. The documentation says to make a new resource instance for each process so I need a way to make a s3 client for each process.
Is there any way to make a persistent s3 client for each process in the pool or cache the clients?
Also, I am planning on optimizing the deletes by sending batches of keys and using s3_client.delete_objects, but showed s3_client.delete_object in my example for simplicity.
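The batching mentioned above could be sketched roughly as follows. delete_objects accepts at most 1000 keys per request, so the keys are grouped lazily into lists of that size. The helper names chunked and delete_batch are hypothetical, and the client is passed in as a parameter so the sketch does not depend on how it is created.

```python
from itertools import islice


def chunked(iterable, size=1000):
    """Yield lists of up to `size` items without materialising the input."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def delete_batch(s3_client, bucket, keys):
    """Delete one batch of keys (hypothetical helper); returns reported errors."""
    response = s3_client.delete_objects(
        Bucket=bucket,
        Delete={"Objects": [{"Key": key} for key in keys], "Quiet": True},
    )
    # Keys that could not be deleted are listed under "Errors", if any.
    return response.get("Errors", [])
```

Each worker would then receive a batch from chunked(...) instead of a single key.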
Check this snippet from the RealPython concurrency tutorial. They create a single requests Session per process, since resources cannot be shared because each process has its own memory space. A global session object is created by the pool's initializer; otherwise, every call to the function would instantiate a new Session object, which is an expensive operation.
So, following that logic, you could instantiate the boto3 client that way and you would only create one client per process.
import requests
import multiprocessing
import time

session = None

def set_global_session():
    global session
    if not session:
        session = requests.Session()

def download_site(url):
    with session.get(url) as response:
        name = multiprocessing.current_process().name
        print(f"{name}:Read {len(response.content)} from {url}")

def download_all_sites(sites):
    with multiprocessing.Pool(initializer=set_global_session) as pool:
        pool.map(download_site, sites)

if __name__ == "__main__":
    sites = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")
I ended up solving this using functools.lru_cache and a helper function for getting the s3 client. An LRU cache will stay consistent in a process, so it will preserve the connection. The helper function looks like
from functools import lru_cache
import boto3

@lru_cache()
def s3_client():
    return boto3.client('s3')
and then that is called in my work function like
def work(key):
    # Use a different local name so the helper function is not shadowed.
    client = s3_client()
    client.delete_object(Bucket='bucket', Key=key)
I was able to test this and benchmark it in the following way:
import os
from time import time

def benchmark(key):
    t1 = time()
    s3 = s3_client()
    print(f"[{os.getpid()}] [{s3.head_object(Bucket='bucket', Key=key)}] :: Total time: {time() - t1} s")

with mp.Pool() as p:
    p.map(benchmark, big_list_of_keys)
And this result showed that the first function call for each pid would take about 0.5 seconds and then subsequent calls for the same pid would take about 2e-6 seconds. This was proof enough to me that the client connection was being cached and working as I expected.
Interestingly, if I don't have @lru_cache() on s3_client() then subsequent calls would take about 0.005 seconds, so there must be some internal caching that happens automatically with boto3 that I wasn't aware of.
And for testing purposes, I benchmarked Milton's answer in the following way
s3 = None

def set_global_session():
    global s3
    if not s3:
        s3 = boto3.client('s3')

with mp.Pool(initializer=set_global_session) as p:
    p.map(benchmark, big_list_of_keys)
This also averaged about 3e-6 seconds per job, so pretty much the same as using functools.lru_cache on a helper function.
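Both approaches come down to building the expensive object once per process and reusing it afterwards. A minimal sketch of the cached-helper variant, using a stand-in class in place of boto3.client('s3') (FakeClient and get_client are hypothetical names, not part of boto3):

```python
import os
from functools import lru_cache


class FakeClient:
    """Stand-in for an expensive client such as boto3.client('s3')."""

    instances = 0  # how many clients this process has built

    def __init__(self):
        FakeClient.instances += 1
        self.pid = os.getpid()


@lru_cache(maxsize=None)
def get_client():
    # The body runs once per process; later calls return the cached object.
    return FakeClient()


def work(key):
    client = get_client()
    return (client.pid, key)
```

One caveat, assuming the fork start method: if get_client() is called in the parent before the pool is created, the workers inherit the already-cached client instead of building their own, so it is safer to call it only inside the worker function.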
Related
I have a Python function that requests data via API and involves a rotating expiring key. The volume of requests necessitates some parallelization of the function. I am doing this with the multiprocessing.pool module ThreadPool. Example code:
import json
import requests
from multiprocessing.pool import ThreadPool
from tqdm import tqdm

# Input is a list-of-dicts results of a previous process.
results = [...]

# Process starts by retrieving an authorization key.
headers = {"authorization": get_new_authorization()}

# api_call() is called on each existing result with the retrieved key.
results = thread(api_call, [(headers, result) for result in results])

# Function calls API with passed headers for given URL and returns dict.
def api_call(headers_plus_result):
    headers, result = headers_plus_result
    r = requests.get(result["url"], headers=headers)
    return json.loads(r.text)

# Threading function with default num_threads.
def thread(worker, jobs, num_threads=5):
    pool = ThreadPool(num_threads)
    results = list()
    for result in tqdm(pool.imap_unordered(worker, jobs), total=len(jobs)):
        if result:
            results.append(result)
    pool.close()
    pool.join()
    if results:
        return results

# Function to get new authorization key.
def get_new_authorization():
    ...
    return auth_key
I am trying to modify my mapping process so that, when the first worker fails (i.e. the authorization key expires), all other processes are paused until a new authorization key is retrieved. Then, the processes proceed with the new key.
Should this be inserted into the actual thread() function? If I put an exception in the api_call function itself, I don't see how I can stop the pool manager or update the header being passed to other workers.
Additionally: is using ThreadPool even the best method if I want this kind of flexibility?
A simpler possibility might be to use a multiprocessing.Event and a shared variable. The Event would indicate whether the authentication was legit or not, and the shared variable would contain the authentication.
import multiprocessing as mp

event = mp.Event()
# A 'c' (char) array exposes .value as bytes; 100 = max length
sharedAuthentication = mp.Array('c', 100)
So a worker would run:
event.wait()
authentication = sharedAuthentication.value.decode()
Your main thread would initially set the authentication (as bytes) with
sharedAuthentication.value = ....
event.set()
and later modify the authentication with
event.clear()
... calculate new authentication
sharedAuthentication.value = .....
event.set()
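Since the question uses ThreadPool, the workers are threads in one process, so a threading.Event and an ordinary attribute are enough. The following is only a sketch of the same idea under that assumption: SharedAuth, call_with_retry, fetch_key and request are hypothetical names, and request is assumed to return (ok, payload) with ok set to False when the key has expired.

```python
import threading
from multiprocessing.pool import ThreadPool


class SharedAuth:
    """Holds the current key; one thread refreshes it while the others wait."""

    def __init__(self, fetch_key):
        self._fetch_key = fetch_key      # hypothetical callable returning a new key
        self._lock = threading.Lock()
        self._valid = threading.Event()
        self._key = fetch_key()
        self._valid.set()

    def get(self):
        self._valid.wait()               # blocks while a refresh is in progress
        return self._key

    def refresh(self, stale_key):
        with self._lock:
            if self._key != stale_key:   # another thread already refreshed
                return
            self._valid.clear()
            try:
                self._key = self._fetch_key()
            finally:
                self._valid.set()


def call_with_retry(auth, request, job, max_attempts=3):
    """Run request(key, job), refreshing the key when it is rejected."""
    for _ in range(max_attempts):
        key = auth.get()
        ok, payload = request(key, job)
        if ok:
            return payload
        auth.refresh(key)
    raise RuntimeError(f"giving up on {job!r}")
```

A worker passed to ThreadPool.imap_unordered would then call call_with_retry(auth, request, job) instead of receiving the headers as part of each job.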
I have a large table (external to BigQuery as the data is in Google Cloud Storage). I want to scan the table using BigQuery to a client machine. For throughput, I fetch multiple streams concurrently in multiple threads.
From all I can tell, concurrency is not working. There's actually some penalty when using multiple threads.
import concurrent.futures
import logging
import queue
import threading
import time

from google.cloud.bigquery_storage import types
from google.cloud import bigquery_storage

PROJECT_ID = 'abc'
CREDENTIALS = {....}

def main():
    table = "projects/{}/datasets/{}/tables/{}".format(PROJECT_ID, 'db', 'tb')
    requested_session = types.ReadSession()
    requested_session.table = table
    requested_session.data_format = types.DataFormat.AVRO
    requested_session.read_options.selected_fields = ["a", "b"]
    requested_session.read_options
    client = bigquery_storage.BigQueryReadClient(credentials=CREDENTIALS)
    session = client.create_read_session(
        parent="projects/{}".format(PROJECT_ID),
        read_session=requested_session,
        max_stream_count=0,
    )
    if not session.streams:
        return
    n_streams = len(session.streams)
    print("Total streams", n_streams)  # this prints 1000
    q_out = queue.Queue(1024)
    concurrency = 4
    with concurrent.futures.ThreadPoolExecutor(concurrency) as pool:
        tasks = [
            pool.submit(download_row,
                        client._transport.__class__,
                        client._transport._grpc_channel,
                        s.name,
                        q_out)
            for s in session.streams
        ]
        t0 = time.perf_counter()
        ntotal = 0
        ndone = 0
        while True:
            page = q_out.get()
            if page is None:
                ndone += 1
                if ndone == len(tasks):
                    break
            else:
                for row in page:
                    ntotal += 1
                    if ntotal % 10000 == 0:
                        qps = int(ntotal / (time.perf_counter() - t0))
                        print(f'QPS so far: {qps}')
        for t in tasks:
            t.result()

def download_row(transport_cls, channel, stream_name, q_out):
    try:
        transport = transport_cls(channel=channel)
        client = bigquery_storage.BigQueryReadClient(
            transport=transport,
        )
        reader = client.read_rows(stream_name)
        for page in reader.rows().pages:
            q_out.put(page)
    finally:
        q_out.put(None)

if __name__ == '__main__':
    main()
The Google BigQuery Storage API docs and multiple sources claim one can fetch multiple "streams" concurrently for higher throughput, yet I didn't find any functional example. I've followed the advice to share a gRPC "channel" across the threads.
The data items are large. The QPS I got is roughly
150, concurrency=1
120, concurrency=2
140, concurrency=4
Each "page" contains about 200 rows.
Thoughts:
BigQuery quota? I only saw a request rate limit, and did not see a limit on the volume of data traffic per second. The quotas do not appear to be limiting in my case.
BigQuery server-side options? Doesn't seem to be relevant. BigQuery should accept concurrent requests with enough capacity.
gRPC usage? I think this is the main direction for digging, but I don't know what's wrong in my code.
Can anyone shed some light on this? Thanks.
Python threads do not run Python bytecode in parallel because of the GIL.
You are creating threads, not processes, so any CPU-bound work in your code (such as decoding and iterating over the rows) is effectively limited to a single core.
ThreadPoolExecutor has been available since Python 3.2, it is not widely used, perhaps because of misunderstandings of the capabilities and limitations of Threads in Python. This is enforced by the Global Interpreter Lock ("GIL").
Look into using the multiprocessing module instead.
UPDATE:
Also in your code you need one more param: requested_streams
n_streams = 2
session = client.create_read_session(
    table_ref,
    parent,
    requested_streams=n_streams,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.BALANCED),
)
I made an API for my AI model, but I would like to avoid any downtime when I update the model. I am looking for a way to load the new model in the background and, once it is loaded, swap it in for the old one. I tried passing values between subprocesses, but that doesn't work well. Do you have any idea how I can do that?
You can place the serialized model in a raw storage, like an S3 bucket if you're on AWS. In S3's case, you can use bucket versioning which might prove helpful. Then setup some sort of trigger. You can definitely get creative here, and I've thought about this a lot. In practice, the best options I've tried are:
1. Set up an endpoint that, when called, will open the new model at whatever location you store it. Set up a webhook on the storage/S3 bucket that will send a quick automated call to the given endpoint and auto-load that new item.
2. Same thing as #1, but instead you just manually load it. In both cases you'll really want some security on that endpoint, or anyone that finds your site can just absolutely abuse your stack.
3. Set a timer at startup that calls a given function nightly, internally running within the application itself. The function is invoked and then goes and reloads.
Could be other ideas I'm not smart enough (yet!) to use, just trying to start some dialogue.
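Whichever trigger is used, the swap itself can be kept small. Below is a minimal sketch, assuming the model can be loaded in a background thread of the serving process: the slow load happens outside the lock and the switch is a single reference assignment, so requests keep using the old model until the new one is ready. ModelHolder and loader are hypothetical names.

```python
import threading


class ModelHolder:
    """Keeps the model currently being served and swaps it without downtime."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._model

    def reload_in_background(self, loader):
        """Start loading a new model; `loader` is a hypothetical callable."""

        def _load_and_swap():
            new_model = loader()          # slow part runs outside the lock
            with self._lock:
                self._model = new_model   # the swap is one reference assignment

        thread = threading.Thread(target=_load_and_swap, daemon=True)
        thread.start()
        return thread
```

A request handler would call holder.get() on every request, and the reload endpoint or timer would call holder.reload_in_background(...). If loading is CPU-heavy enough to slow down request handling, loading in a separate process may be preferable.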
Found a way to do it with async and multiprocessing
import asyncio
import random
from uvicorn import Server, Config
from fastapi import FastAPI
import time
from multiprocessing import Process, Manager

app = FastAPI()
value = {"latest": 1, "b": 2}

@app.get("/")
async def root():
    global value
    return {"message": value}

def background_loading(d):
    time.sleep(2)
    d["test"] = 3

async def update():
    global value
    while True:
        manager = Manager()
        d = manager.dict()
        p1 = Process(target=background_loading, args=(d,))
        p1.daemon = True
        p1.start()
        while p1.is_alive():
            await asyncio.sleep(5)
        print(f'Update to value to {d}')
        # Copy into a plain dict so the served value does not depend on the manager.
        value = dict(d)

if __name__ == "__main__":
    loop = asyncio.new_event_loop()
    config = Config(app=app, loop=loop)
    server = Server(config)
    loop.create_task(update())
    loop.run_until_complete(server.serve())
I have launched many simulations with Dask Distributed:
from time import sleep
from distributed import Client, as_completed

def simulation(x):
    """ Proxy function for simulation """
    sleep(60 * 60 * 24)  # wait one day
    return hash(x)

def save(result):
    with open("result", "w") as f:
        print(result, file=f)

if __name__ == "__main__":
    client = Client("localhost:8786")
    futures = client.map(simulation, range(1000))
    for future in as_completed(futures):
        result = future.result()
        save(result)
However, this code has a bug: open("result", "w") should be open(str(result), "w"). I'd like to correct that mistake and then re-process the client's futures.
However, I do not know of a way to do that without stopping the Python process with a keyboard interrupt and then re-submitting the jobs to the Dask cluster. I don't want to do that because these simulations have taken a couple of days.
I want to access all the futures the client has and save all the existing results. How do I make that happen?
Possibly relevant questions
"Dask Distributed Getting Futures after Client Closed" isn't relevant because the client connection is still open:
client.has_what is the method you're looking for:
from distributed import Client, Future, as_completed

if __name__ == "__main__":
    client = Client("localhost:8786")
    futures = [Future(key) for keys in client.has_what().values() for key in keys]
    for future in as_completed(futures):
        ...
What do I mean by "deterministic time"? Take AWS Lambda as an example. A process started as a Lambda function has a time limit; once it is exceeded, the function stops executing and the task is treated as failed. An example task is sending data to an HTTP endpoint. Depending on the network connection to the endpoint and other factors, sending the data can take a long time. If I need to send the same data to many endpoints one after another, the total time is the time for one request multiplied by the number of endpoints, which increases the chance that the Lambda function is stopped before all data has been sent to all endpoints.
To solve this I need to send the data to the different endpoints in parallel, using threads.
The problem with threads is that a started thread can't be stopped. If an HTTP request takes longer than the Lambda time limit allows, the function is aborted and returns an error. So I need to use a timeout on the HTTP request to abort it if it takes longer than expected.
If an HTTP request is cancelled by the timeout, or the endpoint returns an error, I need to save the unprocessed data somewhere so that it is not lost. The time needed to save unprocessed data can be predicted, because I control the storage where it is saved.
The last part that consumes time is the loop where the threads are scheduled with executor.submit(). If there is only one endpoint, or a small number of them, the time consumed is small and there is no need to control it. But if I have to deal with many endpoints, I have to take it into account.
So the total time consists of:
scheduling threads
http request execution
saving unprocessed data
Here is an example of how I can manage time using threads:
import concurrent.futures
from functools import partial
import requests
import time

start = time.time()

def send_data(data):
    host = 'http://127.0.0.1:5000/endpoint'
    try:
        result = requests.post(host, json=data, timeout=(0.1, 0.5))
        # print('done')
        if result.status_code == 200:
            return {'status': 'ok'}
        if result.status_code != 200:
            return {'status': 'error', 'msg': result.text}
    except requests.exceptions.Timeout as err:
        return {'status': 'error', 'msg': 'timeout'}

def get_data(n):
    return {"wait": n}

def done_cb(a, b, future):
    pass  # save unprocessed data

def main():
    executor = concurrent.futures.ThreadPoolExecutor()
    futures = []
    max_time = 0.5
    for i in range(1):
        future = executor.submit(send_data, *[{"wait": 10}])
        future.add_done_callback(partial(done_cb, 2, 3))
        futures.append(future)
        # Compare against the module-level start time defined above.
        if time.time() - start > max_time:
            print('stopping creating new threads')
            # save unprocessed data
            break
    try:
        for item in concurrent.futures.as_completed(futures, timeout=1):
            item.result()
    except concurrent.futures.TimeoutError as err:
        pass
I was thinking of how I can use asyncio library instead of threads, to do the same thing.
import asyncio
import time
from functools import partial
import requests

start = time.time()

def send_data(data):
    ...

def get_data(n):
    return {"wait": n}

def done_callback(a, b, future):
    pass  # save unprocessed data

def main(loop):
    max_time = 0.5
    futures = []
    start_appending = time.time()
    for i in range(1):
        event_data = get_data(1)
        future = (loop.run_in_executor(None, send_data, event_data))
        future.add_done_callback(partial(done_callback, 2, 3))
        futures.append(future)
        # Compare against the module-level start time defined above.
        if time.time() - start > max_time:
            print('stopping creating new futures')
            # save unprocessed data
            break
    finished, unfinished = loop.run_until_complete(
        asyncio.wait(futures, timeout=1)
    )

_loop = asyncio.get_event_loop()
result = main(_loop)
The send_data() function is the same as in the previous code snippet.
Because the requests library is not async code, I use run_in_executor() to create the future objects. The main problem is that done_callback() is not executed when the thread started by the executor finishes its job, but only when the futures are "processed" by the asyncio.wait() expression.
Basically, I am looking for a way to start executing an asyncio future the way ThreadPoolExecutor starts executing threads, without waiting for the asyncio.wait() expression to call done_callback(). If you have other ideas on how to write Python code that works with threads or coroutines and completes in deterministic time, please share them; I will be glad to read them.
One more question. If a thread or future finishes its job, it can return a result that I can use in done_callback(), for example to remove a message from a queue by the id returned in the result. But if the thread or future was cancelled, I don't have a result, and I have to use functools.partial() to pass additional data to done_callback() so that I can tell which data the callback was called for. If the passed data is small, this is not a problem. If the data is big, I need to put it in an array/list/dictionary and pass only the index to the callback, or put the "full data" in the callback.
Can I somehow get access to the variable that was passed to the future/thread from a done_callback() that was triggered on a cancelled future/thread?
You can use asyncio.wait_for to wait for a future (or multiple futures, when combined with asyncio.gather) and cancel them in case of a timeout. Unlike threads, asyncio supports cancellation, so you can cancel a task whenever you feel like it, and it will be cancelled at the first blocking call it makes (typically a network call).
Note that for this to work, you should be using asyncio-native libraries such as aiohttp for HTTP. Trying to combine requests with asyncio using run_in_executor will appear to work for simple tasks, but it will not bring you the benefits of using asyncio, such as being able to spawn a massive number of tasks without encumbering the OS, or the possibility of cancellation.
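A rough sketch of that approach, with asyncio.sleep standing in for the HTTP request (in real code this would be an aiohttp call). The names send, send_all, budget and the payload layout are assumptions made for illustration. Keeping a dict from task to payload also shows one way to recover the input of a cancelled task without passing it through a callback.

```python
import asyncio


async def send(payload):
    # Stand-in for an async HTTP request; "delay" simulates network latency.
    await asyncio.sleep(payload["delay"])
    return payload["id"]


async def send_all(payloads, budget):
    """Send every payload, giving up on whatever is unfinished after `budget` seconds."""
    tasks = {asyncio.create_task(send(p)): p for p in payloads}
    try:
        await asyncio.wait_for(
            asyncio.gather(*tasks, return_exceptions=True), timeout=budget
        )
    except asyncio.TimeoutError:
        pass  # gather was cancelled, which cancels every unfinished task

    sent, unprocessed = [], []
    for task, payload in tasks.items():
        if task.done() and not task.cancelled() and task.exception() is None:
            sent.append(task.result())
        else:
            unprocessed.append(payload)  # timed out or failed: save for later
    return sent, unprocessed
```

Calling asyncio.run(send_all(payloads, budget=0.5)) returns the ids that were sent and the payloads that still need to be saved, so the total running time is bounded by the budget plus the time needed to save the leftovers.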