Best practice: send multiple files to a REST endpoint using Python requests

What is the best way to send a lot of POST requests to a REST endpoint via Python?
E.g. I want to upload ~500k files to a database.
What I've done so far is a loop that creates a new request for each file using the requests package.
from os import listdir

# get list of files
files = [f for f in listdir(folder_name)]

# loop through the list
for file_name in files:
    try:
        # open file and get content
        with open(folder_name + "\\" + file_name, "r") as file:
            f = file.read()
        # create request
        req = make_request(url, f)
    except Exception:
        # error handling, logging, ...
        pass
But this is quite slow: what is the best practice for doing this? Thank you.

First approach:
I don't know if it is the best practice, but you could split the files into batches of 1000, zip each batch, and send the zips as POST requests using threads (set the number of threads to the number of processor cores). The REST endpoint can then extract the zipped contents and process them. A rough sketch of this idea follows below.
Second approach:
Zip the files in batches and transfer them batch by batch. After the transfer is completed, validate on the server side, then start the database upload in one go.
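As a rough sketch of the first approach (batching, zipping, and a thread pool), assuming the endpoint accepts a zipped batch as a multipart upload; the URL, batch size, and form field name are placeholders, and folder_name comes from the question above:

# Hedged sketch: zip batches of files and POST them concurrently.
import io
import os
import zipfile
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "https://example.com/upload"   # hypothetical endpoint
BATCH_SIZE = 1000

def zip_batch(paths):
    # build an in-memory zip archive for one batch of files
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            zf.write(path, arcname=os.path.basename(path))
    buf.seek(0)
    return buf

def send_batch(paths):
    # POST one zipped batch; the server is assumed to unzip and process it
    return requests.post(URL, files={"batch": ("batch.zip", zip_batch(paths))})

paths = [os.path.join(folder_name, f) for f in os.listdir(folder_name)]
batches = [paths[i:i + BATCH_SIZE] for i in range(0, len(paths), BATCH_SIZE)]

with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    results = list(pool.map(send_batch, batches))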

The first thing you want to do is determine exactly which part of your script is the bottleneck. You have both disk and network I/O here (reading files and sending HTTP requests, respectively).
Assuming that the HTTP requests are the actual bottleneck (highly likely), consider using aiohttp instead of requests. The docs have some good examples to get you started and there are plenty of "Quick Start" articles out there. This would allow your network requests to be cooperative, meaning that other Python code can run while one of your network requests is waiting. Just be careful not to overwhelm whatever server is receiving the requests.
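As a rough sketch of what that could look like, assuming the endpoint accepts the raw file content in the POST body; the URL and concurrency cap are placeholders, and folder_name comes from the question above:

# Hedged sketch: cooperative uploads with aiohttp, capped by a semaphore.
import asyncio
import os
import aiohttp

URL = "https://example.com/upload"   # hypothetical endpoint
CONCURRENCY = 50                     # cap to avoid overwhelming the server

async def upload_one(session, sem, path):
    async with sem:
        with open(path, "rb") as fh:   # blocking disk read, fine for a sketch
            data = fh.read()
        async with session.post(URL, data=data) as resp:
            return path, resp.status

async def upload_all(folder_name):
    sem = asyncio.Semaphore(CONCURRENCY)
    paths = [os.path.join(folder_name, f) for f in os.listdir(folder_name)]
    async with aiohttp.ClientSession() as session:
        tasks = [upload_one(session, sem, p) for p in paths]
        return await asyncio.gather(*tasks)

results = asyncio.run(upload_all(folder_name))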

Related

Uploading large number of files to S3 with boto3 [duplicate]

Hey, there were some similar questions, but none exactly like this one, and a fair number of them were several years old and out of date.
I have written some code on my server that uploads jpeg photos into an s3 bucket using a key via the boto3 method upload_file. Initially this seemed great. It is a super simple solution to uploading files into s3.
The thing is, I have users. My users are sending their jpegs to my server via a phone app. While I concede that I could generate presigned upload URLs and send them to the phone app, that would require a considerable rewrite of our phone app and API.
So I just want the phone app to send the photos to the server. I then want to send the photos from the server to s3. I implemented this but it is way too slow. I cannot ask my users to tolerate those slow uploads.
What can I do to speed this up?
I did some Google searching and found this: https://medium.com/#alejandro.millan.frias/optimizing-transfer-throughput-of-small-files-to-amazon-s3-or-anywhere-really-301dca4472a5
It suggests that the solution is to increase the number of TCP/IP connections. More TCP/IP connections means faster uploads.
Okay, great!
How do I do that? How do I increase the number of TCP/IP connections so I can upload a single jpeg into AWS s3 faster?
Please help.
Ironically, we've been using boto3 for years, as well as awscli, and we like them both.
But we've often wondered why awscli's aws s3 cp --recursive, or aws s3 sync, are often so much faster than trying to do a bunch of uploads via boto3, even with concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor (and don't even dare share the same s3.Bucket among your workers: it's warned against in the docs, and for good reason; nasty crashes will eventually ensue at the most inconvenient time).
Finally, I bit the bullet and looked inside the "customization" code that awscli introduces on top of boto3.
Based on that little exploration, here is a way to speed up the upload of many files to S3 by using the concurrency already built into boto3.s3.transfer, not just for the possible multiparts of a single, large file, but for a whole bunch of files of various sizes as well. That functionality is, as far as I know, not exposed through the higher-level APIs of boto3 that are described in the boto3 docs.
The following:
Uses boto3.s3.transfer to create a TransferManager, the very same one that is used by awscli's aws s3 sync, for example.
Extends the max number of threads to 20.
Augments the underlying urllib3 max pool connections capacity used by botocore to match (by default, it uses 10 connections maximum).
Gives you an optional callback capability (demoed here with a tqdm progress bar, but of course you can have whatever callback you'd like).
Is fast (over 100 MB/s, tested on an EC2 instance).
I put a complete example as a gist here that includes the generation of 500 random csv files for a total of about 360MB. Here below, we assume you already have a bunch of files in filelist, for a total of totalsize bytes:
import os
import boto3
import botocore
import boto3.s3.transfer as s3transfer

def fast_upload(session, bucketname, s3dir, filelist, progress_func, workers=20):
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for src in filelist:
        dst = os.path.join(s3dir, os.path.basename(src))
        s3t.upload(
            src, bucketname, dst,
            subscribers=[
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # wait for all the upload tasks to finish
Example usage
from tqdm import tqdm

bucketname = '<your-bucket-name>'
s3dir = 'some/path/for/junk'
filelist = [...]
totalsize = sum([os.stat(f).st_size for f in filelist])

with tqdm(desc='upload', ncols=60,
          total=totalsize, unit='B', unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, s3dir, filelist, pbar.update)

How do I download and upload files as effectively as possible using Python?

So here is what I want to do: I want to download JSON files from an API, then I want to upload them to a container in Azure.
Here is my current solution: I use the module requests and
file = requests.get('<link>')
to download the data, which I then write to a JSON file on my local computer like so
local_path = os.getcwd()
local_file_name = 'file'
open(local_file_name + '.json', 'wb').write(file.content)
then I find the local file, upload it to Azure, and delete the local file. This is a process I repeat for all the files, and I feel as if it is quite inefficient. Is there a better way to do it? I am wondering if I can upload the request files directly, with a specified name for every individual file, without having to store them locally. Thoughts?
You could probably use asyncio and aiohttp to download the files: asyncio provides the concurrency (for speed) and aiohttp fetches the data over HTTP, much like requests does. There are two good articles below about how to use aiohttp together with asyncio.
https://www.queness.com/post/17688/aiohttp-tutorial-how-does-it-worka
Aiohttp tutorial
https://realpython.com/async-io-python/
asyncio/async tutorial
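As a rough sketch of that idea, assuming the downloaded JSON is small enough to hold in memory and can be pushed straight into the Azure container with azure-storage-blob's ContainerClient.upload_blob; the connection string, container name, and links are placeholders, and the blocking SDK call is handed off to a worker thread:

# Hedged sketch: download with aiohttp, upload in memory, nothing written to disk.
import asyncio
import aiohttp
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "<your-connection-string>"
CONTAINER_NAME = "<your-container>"
urls = {"file1.json": "<link-1>", "file2.json": "<link-2>"}  # blob name -> API link

async def transfer_one(session, container_client, blob_name, url):
    async with session.get(url) as resp:
        data = await resp.read()          # JSON payload as bytes
    # upload_blob is a blocking call, so run it in a worker thread
    await asyncio.to_thread(
        container_client.upload_blob, name=blob_name, data=data, overwrite=True
    )

async def transfer_all():
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service.get_container_client(CONTAINER_NAME)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(transfer_one(session, container_client, name, url)
              for name, url in urls.items())
        )

asyncio.run(transfer_all())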

Downloading large .bz2 files with Python requests library

I am currently developing a Python script that calls a REST API to download data that is made available every day through the API. The files that I am trying to download have a .txt.bz2 extension.
The API documentation recommends using curl to download data from the API. In particular, the recommended command to download the data is:
curl --user Username:Password https://api.endpoint.com/data/path/to/file -o my_filename.txt.bz2
where, of course, the URL of the API data endpoint here is just fictitious.
Since the documentation recommends curl, my current implementation of the Python script leverages the subprocess library to call curl within Python:
import subprocess

def data_downloader(download_url, file_name, api_username, api_password):
    args = ['curl', '--user', f'{api_username}:{api_password}', f'{download_url}', '-o', f'{file_name}']
    subprocess.call(args)
    return file_name
Since, however, I am extensively using the requests library in other parts of the application that I am developing, mainly to send requests to the API and to walk the file-system-like structure of the API, I have tried to implement the download function using this library as well. In particular, I have been using this other Stack Overflow thread as the reference for my alternative implementation, and the two functions that I have implemented using the requests library look like this:
import requests
import shutil

def download_file(download_url, file_name, api_username, api_password, chunk_size):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return file_name

def shutil_download(download_url, file_name, api_username, api_password):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            shutil.copyfileobj(r.raw, f)
    return file_name
However, while the subprocess implementation downloads the entire file without any issue, both requests implementations always leave me with a downloaded file of about 1 KB, which is clearly wrong since most of the data that I am downloading is >10 GB.
I suspect that the issue is caused by the format of the data that I am attempting to download, as I have seen successful attempts at downloading .zip or .gzip files using the same logic that I am using in the two functions. Hence I am wondering if anyone has an explanation for the issue I am experiencing or can provide a working solution to the problem.
UPDATE
I had a chance to discuss the issue with the owner of the API and apparently, upon analysis of the logs on their side, they found some issues on their side that prevented the request from going through. On my side the status code of the request signalled a successful request, but the returned data was not correct.
The two functions that use the requests library work as expected and the issue can be considered solved.
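For reference, a small usage sketch of download_file with a 1 MiB chunk size, reading the result afterwards with Python's bz2 module; the endpoint and credentials are the fictitious placeholders from the question:

import bz2

# hypothetical values; substitute your own endpoint and credentials
path = download_file(
    'https://api.endpoint.com/data/path/to/file',
    'my_filename.txt.bz2',
    'Username', 'Password',
    chunk_size=1024 * 1024,   # 1 MiB per chunk
)

# bz2.open can read the compressed file directly, line by line
with bz2.open(path, 'rt') as fh:
    for line in fh:
        pass  # process each decompressed line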

Parallel file uploading via Browser using Python

The following code uploads multiple files from a browser in a serial fashion to a host. In order to make the process faster, how could the files be processed in parallel so there are multiple streams being sent to the server? Is this worth pursuing if the files are being sent to the same host? Will a browser allow multiple upload streams to the same host? If so, how might this work?
I assume that it is not possible to break the file into parts and write them in parallel, similar to this example, since they are being submitted by a browser rather than a Python client.
#!/usr/bin/python
import cgi, os
import shutil
import cgitb; cgitb.enable()  # for troubleshooting

form = cgi.FieldStorage()

print """\
Content-Type: text/html\n
<html><body>
"""

if 'file' in form:
    filefield = form['file']
    if not isinstance(filefield, list):
        filefield = [filefield]
    for fileitem in filefield:
        if fileitem.filename:
            fn = os.path.basename(fileitem.filename)
            # save file
            with open('/var/www/site/files/' + fn, 'wb') as f:
                shutil.copyfileobj(fileitem.file, f)
            # line breaks are not occurring between iterations
            print 'File "' + fn + '" was uploaded successfully<br/>'
    message = 'All files uploaded'
else:
    message = 'No file was uploaded'

print """
<p>%s</p>
</body></html>
""" % (message)
A form submit produces a single HTTP request that naturally uses a single TCP connection to deliver the request.
To upload multiple files in parallel, make multiple requests in parallel. You might need a Flash or Java applet on the client. Check whether JavaScript (AJAX) allows multiple concurrent form submissions.
Improving performance in general is an interesting subject. The key is measuring to find out what the bottleneck actually is.
I would imagine that the bottleneck is either the browser loading the files to upload or the available bandwidth during the upload (or both).
If the bottleneck is saving the files from buffers (or moving temp files to their final location), then converting the for loop into threads may help, presuming that the web server has the I/O capacity to write all those files. But threading would only help if there are a good number of files to write; otherwise the work of repeatedly spawning threads (or loading a queue and starting a set of threads) would be slower than looping to process a few small files.
For the fun of it, here is an article on python threads and a work queue: http://www.ibm.com/developerworks/aix/library/au-threadingpython/
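In the same spirit, here is a rough sketch of the save step using a thread pool (concurrent.futures, the modern stand-in for the hand-rolled worker-queue pattern in that article); fileitems here is a hypothetical stand-in for the parsed form fields from the CGI script above, and the upload directory is a placeholder:

# Hedged sketch: save uploaded files concurrently with a small thread pool.
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

UPLOAD_DIR = '/var/www/site/files/'

def save_one(fileitem):
    # write one uploaded file to its final location
    fn = os.path.basename(fileitem.filename)
    with open(os.path.join(UPLOAD_DIR, fn), 'wb') as f:
        shutil.copyfileobj(fileitem.file, f)
    return fn

with ThreadPoolExecutor(max_workers=4) as pool:
    saved = list(pool.map(save_one, fileitems))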
Reference on Python CGI and file uploading and multiple file uploading via POST:
http://www.tutorialspoint.com/python/python_cgi_programming.htm
http://php.net/manual/en/features.file-upload.multiple.php

Streaming audio using Python (without GStreamer)

I'm working on a project that involves streaming .OGG (or .mp3) files from my webserver. I'd prefer not to have to download the whole file and then play it; is there a way to do that in pure Python (no GStreamer, hoping to make it truly cross-platform)? Is there a way to use urllib to download the file a chunk at a time and load that into, say, PyGame to do the actual audio playing?
Thanks!
I suppose your server supports Range requests. You ask the server, via the Range header, for the start and end bytes of the range you want:
import urllib2
req = urllib2.Request(url)
req.headers['Range'] = 'bytes=%s-%s' % (startByte, endByte)
f = urllib2.urlopen(req)
f.read()
You can implement a file-like object that always downloads just the needed chunk of the file from the server. Almost every library accepts a file object as input.
It will probably be slow because of network latency. You would need to download bigger chunks of the file, preload the file in a separate thread, and so on. In other words, you would need to implement the streaming client logic yourself.
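A bare-bones sketch of such a file-like object, written with requests instead of urllib2; it only implements read(), which is often enough for decoders that pull data sequentially, and the URL is a placeholder:

# Hedged sketch: a minimal file-like wrapper over HTTP Range requests.
import requests

class HttpRangeFile:
    """Fetches the file in pieces via HTTP Range requests as read() is called."""

    def __init__(self, url, chunk_size=64 * 1024):
        self.url = url
        self.pos = 0
        self.chunk_size = chunk_size
        self.session = requests.Session()

    def read(self, size=None):
        size = size or self.chunk_size
        headers = {'Range': 'bytes=%d-%d' % (self.pos, self.pos + size - 1)}
        resp = self.session.get(self.url, headers=headers)
        data = resp.content   # if the server ignores Range, this is the whole file
        self.pos += len(data)
        return data

# e.g. hand it to a decoder that accepts a file-like object
stream = HttpRangeFile('http://example.com/audio.ogg')
first_chunk = stream.read()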
