Async upload with the Python Swift client from FastAPI

I am trying to upload files asynchronously to OpenStack object storage using the Python Swift client. It seems the put_object() method from the Swift client is not async, so although I am running it in an async loop, the uploads execute sequentially and are particularly slow. Note this code is running in a FastAPI app.
Here is the code:
import asyncio
import time
from io import BytesIO

from PIL import Image
from fastapi import FastAPI

# Skipping the Cloud Storage authentication for conciseness
# [...]

app = FastAPI()

async def async_upload_image(idx, img):
    print("Start upload image:{}".format(idx))
    # Convert image to bytes
    data = BytesIO()
    img.save(data, format='png')
    data = data.getvalue()
    # Upload image to cloud storage (object name derived from idx)
    swift_connection.put_object('dev', 'image_{}.png'.format(idx), data)
    print("Finish upload image:{}".format(idx))
    return True

@app.get("/test-async")
async def test_async():
    image = Image.open('./sketch-mountains-input.jpg')
    images = [image, image, image, image]
    start = time.time()
    futures = [async_upload_image(idx, img) for idx, img in enumerate(images)]
    await asyncio.gather(*futures)
    end = time.time()
    print('It took {} seconds to finish execution'.format(round(end - start)))
    return True
The logs return something like this which seems particularly slow:
Start upload image:0
Finish upload image:0
Start upload image:1
Finish upload image:1
Start upload image:2
Finish upload image:2
Start upload image:3
Finish upload image:3
It took 10 seconds to finish execution
I even tried to run that same code synchronously and it was faster:
Start upload image:0
Finish upload image:0
Start upload image:1
Finish upload image:1
Start upload image:2
Finish upload image:2
Start upload image:3
Finish upload image:3
It took 6 seconds to finish execution
Is there a solution to do an async put_object() / upload to OpenStack?
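One common workaround, since put_object() from python-swiftclient is a blocking call, is to push each upload onto a worker thread so the event loop stays free. Below is a minimal sketch of that idea, reusing swift_connection from the snippet above (asyncio.to_thread needs Python 3.9+; on older versions loop.run_in_executor does the same job):

import asyncio
from io import BytesIO

async def async_upload_image(idx, img):
    print("Start upload image:{}".format(idx))
    data = BytesIO()
    img.save(data, format='png')
    data = data.getvalue()
    # put_object() is synchronous, so run it in a thread instead of calling it directly
    await asyncio.to_thread(
        swift_connection.put_object, 'dev', 'image_{}.png'.format(idx), data)
    print("Finish upload image:{}".format(idx))
    return True

With the blocking call moved off the event loop, asyncio.gather() can actually overlap the four uploads.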

Related

asyncio.gather() and aiohttp not behaving as expected

I am pretty new to asyncio and I am trying to take learning it seriously, so don't bash me too much.
I have written this code where I am trying to use asyncio and aiohttp to fetch input files from an external API while another coroutine fetches a file from an AWS s3 bucket and processes it.
Stripped down, my code looks like this:
async def fetchFromAPI():
    ...
    print("Fetching from my API...")
    try:
        chunk_counter = 1
        async with aiohttp.ClientSession(timeout=timeoutObject) as session:
            async with session.get(url) as r:
                r.raise_for_status()
                with open(rawResponseFilepath, "w+") as f:
                    async for chunk in r.content.iter_chunked(8192):
                        chunk = chunk.decode('ASCII')
                        f.write(chunk)
                        print(f"Chunk #{chunk_counter} download completed.")
                        chunk_counter += 1
        print(f"Streaming complete. Datarecord file available at {rawResponseFilepath}")
    except Exception as e:
        print(f"Could not query API. Error: {tb.format_exc()}.")
        raise
    ...

async def fetchFromS3AndProcess():
    ...
    print("Fetching from s3...")
    s3_client.download_file(filename, localfile)
    print("Done fetching from S3.")
    processFile(filename)
    print("Done processing.")

async def main():
    await asyncio.gather(fetchFromAPI(), fetchFromS3AndProcess())

if __name__ == '__main__':
    asyncio.run(main())
What I am observing from the print statements (exemplified):
Fetching from my API...
Fetching from s3...
Done fetching from S3.
Done processing.
Chunk #1 download completed.
Chunk #2 download completed.
.
.
.
Chunk #10000 download completed.
Streaming complete. Datarecord file available at dataRecord.dr.
What I am expecting instead is to see the download counter message be printed in between the S3 Fetching and processing print statements, signifying that the two downloads are actually starting at the same time.
Am I missing something obvious? Should I provide more info on the underlying coroutines to see if I am blocking somewhere?
If there is too little context to exemplify well I shall expand. Thanks!
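For what it's worth, the usual culprit in code shaped like this is that boto3 is synchronous: s3_client.download_file() and processFile() block the event loop, so the aiohttp coroutine cannot run until they return. One possible fix, sketched below with the names from the stripped-down snippet above (asyncio.to_thread needs Python 3.9+):

import asyncio

async def fetchFromS3AndProcess():
    print("Fetching from s3...")
    # download_file() is a blocking boto3 call; run it in a worker thread so the
    # aiohttp download in the other coroutine can keep receiving chunks
    await asyncio.to_thread(s3_client.download_file, filename, localfile)
    print("Done fetching from S3.")
    # processFile() is synchronous too, so offload it as well
    await asyncio.to_thread(processFile, filename)
    print("Done processing.")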

How to send `files_upload_session_append_v2` in parallel? (Dropbox API, Python)

I want to upload a large file to Dropbox via Dropbox API in a parallel way (so it will be uploaded faster than in a sequential way).
The documentation says
By default, upload sessions require you to send content of the file in
sequential order via consecutive :meth:`files_upload_session_start`,
:meth:`files_upload_session_append_v2`,
:meth:`files_upload_session_finish` calls. For better performance, you
can instead optionally use a ``UploadSessionType.concurrent`` upload
session. To start a new concurrent session, set
``UploadSessionStartArg.session_type`` to
``UploadSessionType.concurrent``. After that, you can send file data in
concurrent :meth:`files_upload_session_append_v2` requests. Finally
finish the session with :meth:`files_upload_session_finish`. There are
couple of constraints with concurrent sessions to make them work. You
can not send data with :meth:`files_upload_session_start` or
:meth:`files_upload_session_finish` call, only with
:meth:`files_upload_session_append_v2` call. Also data uploaded in
:meth:`files_upload_session_append_v2` call must be multiple of 4194304
bytes (except for last :meth:`files_upload_session_append_v2` with
``UploadSessionStartArg.close`` to ``True``, that may contain any
remaining data).
but I'm not sure how to implement this (since there is no files_upload_async_session_append_v2(), for example). I can't find any examples on the Internet.
I tried the following code, but there is no upload speedup compared to the sequential code:
async def upload_file(local_file_path: str, remote_folder_path: str, client: DropboxBase):
    """
    Uploads a file to Dropbox by chunks. This method uses v2 methods of Dropbox API.

    Example:
        upload_file('test.txt', '/Builds/', dropbox_client)

    :param local_file_path:
    :param remote_folder_path: A path to a folder on Dropbox, must end with a slash.
    :param client: Authorized Dropbox client.
    :return:
    """
    with open(local_file_path, 'rb') as file_stream:
        await __upload_file_by_chunks(file_stream, local_file_path, remote_folder_path, client)

async def test(data: bytes, cursor: UploadSessionCursor, client: DropboxBase, close: bool = False):
    client.files_upload_session_append_v2(data, cursor, close=close)

async def __upload_file_by_chunks(file_stream: BinaryIO, local_file_path: str, remote_folder_path: str, client: DropboxBase):
    # As default size for a chunk 4 MB were chosen. I think it's a good compromise between speed and reliability.
    # Also, Dropbox API guide (https://developers.dropbox.com/dbx-performance-guide) says "Consider uploading chunks in multiples of 4 MBs."
    # ATTENTION: The maximum value can be placed here is 150 MB.
    chunk_size_bytes = 4 * 1024 * 1024
    session_id = __start_upload_session(client)
    cursor = __create_upload_session_cursor(file_stream, session_id)
    file_length = path.getsize(local_file_path)
    test_pool = set()
    # TODO: In theory this can be done in parallel, that should speed up the file upload.
    # Maybe instead of while loop we can precalculate all chunks and then upload them in parallel.
    while file_stream.tell() < file_length:
        if __chunk_size_is_bigger_than_left_data(file_stream.tell(), file_length, chunk_size_bytes):
            chunk_size_bytes = file_length - file_stream.tell()
            test_pool.add(asyncio.create_task(test(file_stream.read(chunk_size_bytes), cursor, client, close=True)))
            continue
        test_pool.add(asyncio.create_task(test(file_stream.read(chunk_size_bytes), cursor, client)))
        cursor = __create_upload_session_cursor(file_stream, session_id)
    await asyncio.wait(test_pool)
    client.files_upload_session_finish(bytes(), cursor, commit=CommitInfo(
        path=__construct_remote_file_path(local_file_path, remote_folder_path)))

def __start_upload_session(client: DropboxBase) -> int:
    session_start_response = client.files_upload_session_start(bytes(), session_type=UploadSessionType.concurrent)
    return session_start_response.session_id

...

asyncio.run(upload_file(file_name, DROPBOX_TEST_FOLDER, client))
Using the "concurrent" mode is a good way to optimize the upload of a file to Dropbox using upload sessions, as it allows you to upload multiple pieces of the file in parallel.
There's a code sample here that shows how to use the "concurrent" mode, including using files_upload_session_append_v2:
https://github.com/dropbox/Developer-Samples/tree/master/Blog/performant_upload
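In case the link goes stale, here is a rough sketch of the same idea: since the Dropbox SDK calls are synchronous, real parallelism comes from a thread pool rather than from asyncio tasks wrapping blocking calls. The 4 MB chunking follows the constraint quoted above; the function and variable names are illustrative, not taken from the official sample:

import os
from concurrent.futures import ThreadPoolExecutor

import dropbox
from dropbox.files import CommitInfo, UploadSessionCursor, UploadSessionType

CHUNK = 4 * 1024 * 1024  # concurrent appends must be multiples of 4 MB

def upload_concurrently(dbx: dropbox.Dropbox, local_path: str, remote_path: str):
    size = os.path.getsize(local_path)
    # In concurrent mode no data may be sent in the start call itself
    session = dbx.files_upload_session_start(b'', session_type=UploadSessionType.concurrent)

    # Read the file into 4 MB chunks up front (fine for a sketch; stream for huge files)
    chunks = []
    with open(local_path, 'rb') as f:
        offset = 0
        while offset < size:
            data = f.read(CHUNK)
            chunks.append((offset, data))
            offset += len(data)

    def append(offset_and_data):
        offset, data = offset_and_data
        cursor = UploadSessionCursor(session_id=session.session_id, offset=offset)
        # close=True on the last (possibly short) chunk marks the end of the data
        dbx.files_upload_session_append_v2(data, cursor, close=(offset + len(data) == size))

    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(append, chunks))

    # The finish call carries no payload in concurrent mode
    cursor = UploadSessionCursor(session_id=session.session_id, offset=size)
    dbx.files_upload_session_finish(b'', cursor, CommitInfo(path=remote_path))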

Keeping boto3 session alive in multiprocessing pool

I'm trying to delete a lot of files in s3. I am planning on using a multiprocessing.Pool for doing all these deletes, but I'm not sure how to keep the s3.client alive between jobs. I'm wanting to do something like
import boto3
import multiprocessing as mp

def work(key):
    s3_client = boto3.client('s3')
    s3_client.delete_object(Bucket='bucket', Key=key)

with mp.Pool() as pool:
    pool.map(work, lazy_iterator_of_billion_keys)
But the problem with this is that a significant amount of time is spent on the s3_client = boto3.client('s3') call at the start of each job. The documentation says to make a new resource instance for each process, so I need a way to make an s3 client for each process.
Is there any way to make a persistent s3 client for each process in the pool, or to cache the clients?
Also, I am planning on optimizing the deletes by sending batches of keys and using s3_client.delete_objects, but I showed s3_client.delete_object in my example for simplicity.
Check this snippet from the RealPython concurrency tutorial. They create a single requests Session per process, since you cannot share resources between processes (each one has its own memory space). The global session object is set up in the pool initializer; otherwise, each call to the worker function would instantiate a new Session, which is an expensive operation.
So, following that logic, you could instantiate the boto3 client that way and you would only create one client per process.
import requests
import multiprocessing
import time

session = None

def set_global_session():
    global session
    if not session:
        session = requests.Session()

def download_site(url):
    with session.get(url) as response:
        name = multiprocessing.current_process().name
        print(f"{name}:Read {len(response.content)} from {url}")

def download_all_sites(sites):
    with multiprocessing.Pool(initializer=set_global_session) as pool:
        pool.map(download_site, sites)

if __name__ == "__main__":
    sites = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    start_time = time.time()
    download_all_sites(sites)
    duration = time.time() - start_time
    print(f"Downloaded {len(sites)} in {duration} seconds")
I ended up solving this using functools.lru_cache and a helper function for getting the s3 client. An LRU cache will stay consistent in a process, so it will preserve the connection. The helper function looks like
from functools import lru_cache

@lru_cache()
def s3_client():
    return boto3.client('s3')
and then that is called in my work function like
def work(key):
    s3 = s3_client()
    s3.delete_object(Bucket='bucket', Key=key)
I was able to test this and benchmark it in the following way:
import os
from time import time
def benchmark(key):
    t1 = time()
    s3 = s3_client()
    print(f'[{os.getpid()}] [{s3.head_object(Bucket="bucket", Key=key)}] :: Total time: {time() - t1} s')

with mp.Pool() as p:
    p.map(benchmark, big_list_of_keys)
And this result showed that the first function call for each pid would take about 0.5 seconds and then subsequent calls for the same pid would take about 2e-6 seconds. This was proof enough to me that the client connection was being cached and working as I expected.
Interestingly, if I don't have @lru_cache() on s3_client() then subsequent calls take about 0.005 seconds, so there must be some internal caching that happens automatically in boto3 that I wasn't aware of.
And for testing purposes, I benchmarked Milton's answer in the following way
s3 = None

def set_global_session():
    global s3
    if not s3:
        s3 = boto3.client('s3')

with mp.Pool(initializer=set_global_session) as p:
    p.map(benchmark, big_list_of_keys)
And this also averaged about 3e-6 seconds per job, so pretty much the same as using functools.lru_cache on a helper function.

How to get the result of a long-running Google Cloud Speech API operation later?

Below is a snippet that calls the Google Cloud Speech API long running operation to convert an audio file to text
import time

from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri=gcs_uri,
    encoding='FLAC',
    sample_rate_hertz=44100)
operation = audio_sample.long_running_recognize('en-US')

retry_count = 100
while retry_count > 0 and not operation.complete:
    retry_count -= 1
    time.sleep(60)
    operation.poll()
However, as it is a long running operation, it could take a while, and I ideally don't want to keep the session open while it waits. Is it possible to store some information and retrieve the result later?
As mentioned in another answer, you could use a separate thread to poll the operation while the main thread continues. Alternatively, you could pass operation.name of the returned operation to a separate service and have that other service handle polling. In practice the service calling the long running operation could publish operation.name to a Pub/Sub topic, for example.
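For example, handing the name off could be as small as the following sketch, assuming the google-cloud-pubsub client library and placeholder project/topic names:

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# 'my-project' and 'speech-operations' are placeholders
topic_path = publisher.topic_path('my-project', 'speech-operations')
# Message data must be bytes; a consumer can later poll the operation by this name
publisher.publish(topic_path, operation.name.encode('utf-8'))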
Below is a possible way of retrieving a long running operation by looking it up by name:
import time

from oauth2client.client import GoogleCredentials
from googleapiclient import discovery

credentials = GoogleCredentials.get_application_default()
speech_service = discovery.build('speech', 'v1', credentials=credentials)

operation_name = ....  # operation.name
get_operation_request = speech_service.operations().get(name=operation_name)

# response is a dictionary
response = get_operation_request.execute()

# handle polling
retry_count = 100
while retry_count > 0 and not response.get('done', False):
    retry_count -= 1
    time.sleep(60)
    response = get_operation_request.execute()
When the operation is finished, the response dict might look something like the following:
{u'done': True,
 u'metadata': {u'@type': u'type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata',
               u'lastUpdateTime': u'2017-06-21T19:38:14.522280Z',
               u'progressPercent': 100,
               u'startTime': u'2017-06-21T19:38:13.599009Z'},
 u'name': u'...................',
 u'response': {u'@type': u'type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse',
               u'results': [{u'alternatives': [{u'confidence': 0.987629,
                                                u'transcript': u'how old is the Brooklyn Bridge'}]}]}}
After reading the source, I found that gRPC has a 10 minute timeout. If you submit a large file, transcription can take over 10 minutes. The trick is to use the HTTP backend. The HTTP backend doesn't maintain a connection like gRPC does; instead, every time you poll it sends an HTTP request. To use HTTP, do
speech_client = speech.Client(_use_grpc=False)
No, there is not a built-in way to do that. What you could do is use the threading module so the polling runs in the background while your next task runs.
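A rough sketch of that idea, reusing the operation object from the snippet in the question:

import threading
import time

def poll_operation(operation, interval_seconds=60, max_retries=100):
    # Poll in the background until the operation completes or retries run out
    retries = max_retries
    while retries > 0 and not operation.complete:
        retries -= 1
        time.sleep(interval_seconds)
        operation.poll()

poller = threading.Thread(target=poll_operation, args=(operation,), daemon=True)
poller.start()
# ...continue with other work; check operation.complete (or join the thread) later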

Run function on Flask server every x seconds to update Redis cache without clients making separate calls

I currently have a Flask app that makes a call to S3 as well as an external API with the following structure before rendering the data in JavaScript:
from flask import Flask, render_template, make_response
from flask import request
import requests
import requests_cache
import redis
from boto3.session import Session
import json

app = Flask(__name__)

@app.route('/test')
def test1():
    bucket_root = 'testbucket'
    session = Session(
        aws_access_key_id='s3_key',
        aws_secret_access_key='s3_secret_key')
    s3 = session.resource('s3')
    bucket = s3.Bucket(bucket_root)
    testvalues = json.dumps(s3.Object(bucket_root, 'all1.json').get()['Body'].read())
    r = requests.get(api_link)
    return render_template('test_html.html', json_s3_test_response=r.content,
                           limit=limit, testvalues=testvalues)

@app.route('/test2')
def test2():
    bucket_root = 'testbucket'
    session = Session(
        aws_access_key_id='s3_key',
        aws_secret_access_key='s3_secret_key')
    s3 = session.resource('s3')
    bucket = s3.Bucket(bucket_root)
    testvalues = json.dumps(s3.Object(bucket_root, 'all2.json').get()['Body'].read())
    r = requests.get(api_link)
    return render_template('test_html.html', json_s3_test_response=r.content,
                           limit=limit, testvalues=testvalues)

@app.errorhandler(500)
def internal_error(error):
    return "500 error"

@app.errorhandler(404)
def not_found(error):
    return "404 error", 404

@app.errorhandler(400)
def custom400(error):
    return "400 error", 400

# catch all?
@app.errorhandler(Exception)
def all_exception_handler(error):
    return 'error', 500
Obviously I have a lot of inefficiencies here, but my main question is:
To me it seems like I'm calling S3 and the external API for each client, every time they refresh the page. This increases the chance for the app to crash due to timeouts (and my poor error handling) and diminishes performance. I would like to resolve this by periodically caching the S3 results (say every 10 mins) into a local redis server (already set up and running) as well as just pinging the external API just once from the server every few seconds before passing it onto ALL clients.
I have code that can store the data into Redis every 10 mins in a regular Python script; however, I'm not sure where to place this within the Flask server. Do I put it in its own function, or keep the call to Redis in the @app.route()?
Thank you everyone for your time and effort. Any help would be appreciated! I'm new to flask so some of this has been confusing.
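One common way to structure this, sketched below, is to have a background scheduler (APScheduler here; a plain thread or an external cron job would work too) refresh Redis on an interval so the routes only ever read cached values. The bucket, key, and API URL are placeholders standing in for the ones in the question:

import boto3
import redis
import requests
from apscheduler.schedulers.background import BackgroundScheduler
from flask import Flask

app = Flask(__name__)
cache = redis.Redis(host='localhost', port=6379)

def refresh_cache():
    # Pull from S3 and the external API once, and store the results for every client
    s3 = boto3.resource('s3')
    body = s3.Object('testbucket', 'all1.json').get()['Body'].read()
    cache.set('s3:all1', body)
    r = requests.get('https://example.com/api')  # placeholder for api_link
    cache.set('api:latest', r.content)

scheduler = BackgroundScheduler()
scheduler.add_job(refresh_cache, 'interval', minutes=10)
scheduler.start()
refresh_cache()  # warm the cache at startup

@app.route('/test')
def test1():
    # Serve only from Redis; no S3 or API call per client request
    return app.response_class(cache.get('s3:all1'), mimetype='application/json')

Every client then reads from Redis instead of hitting S3 or the API, and the refresh interval bounds how stale the data can get.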
