Parallel file uploading via Browser using Python

The following code uploads multiple files from a browser in a serial fashion to a host. In order to make the process faster, how could the files be processed in parallel so there are multiple streams being sent to the server? Is this worth pursuing if the files are being sent to the same host? Will a browser allow multiple upload streams to the same host? If so, how might this work?
I assume that it is not possible to break a file into parts and write them in parallel, similar to this example, since the files are being submitted by a browser rather than a Python client.
#!/usr/bin/python
import cgi, os
import shutil
import cgitb; cgitb.enable()  # for troubleshooting

form = cgi.FieldStorage()

print """\
Content-Type: text/html\n
<html><body>
"""

if 'file' in form:
    filefield = form['file']
    if not isinstance(filefield, list):
        filefield = [filefield]
    for fileitem in filefield:
        if fileitem.filename:
            fn = os.path.basename(fileitem.filename)
            # save file
            with open('/var/www/site/files/' + fn, 'wb') as f:
                shutil.copyfileobj(fileitem.file, f)
            # line breaks are not occurring between iterations
            print 'File "' + fn + '" was uploaded successfully<br/>'
    message = 'All files uploaded'
else:
    message = 'No file was uploaded'

print """
<p>%s</p>
</body></html>
""" % (message)

A form submit produces a single HTTP request that naturally uses a single TCP connection to deliver the request.
To upload multiple files in parallel, make multiple requests in parallel. You might need Flash or a Java applet on the client. Check whether JavaScript (AJAX) allows multiple concurrent form submissions.

Improving performance in general is an interesting subject. The key is measuring to find out what the bottleneck actually is.
I would imagine that the bottleneck is either the browser reading the files to upload or the available bandwidth during the upload (or both).
If the bottleneck is saving the files from buffers (or moving temp files to their final location), then converting the for loop into threads may help, presuming that the web server has the I/O capacity to write all those files. But threading only helps if there is a good number of files to write; otherwise the work of repeatedly spawning threads (or loading a queue and starting a set of worker threads) would be slower than simply looping to process a few small files.
For the fun of it, here is an article on Python threads and a work queue: http://www.ibm.com/developerworks/aix/library/au-threadingpython/
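To make that thread-and-queue idea concrete, here is a rough sketch of the pattern applied to the save step above (Python 3 names; the worker count and destination paths are illustrative, and it only pays off when there are enough files to amortize the thread startup cost):
import queue
import shutil
import threading

NUM_WORKERS = 4
tasks = queue.Queue()

def writer():
    # Pull (source file object, destination path) pairs off the queue and write them to disk.
    while True:
        item = tasks.get()
        if item is None:          # sentinel: no more work
            break
        src, dst = item
        with open(dst, 'wb') as out:
            shutil.copyfileobj(src, out)
        tasks.task_done()

threads = [threading.Thread(target=writer) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# in the upload loop: tasks.put((fileitem.file, '/var/www/site/files/' + fn))
for _ in threads:
    tasks.put(None)               # tell each worker to stop
for t in threads:
    t.join()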
Reference on Python CGI and file uploading and multiple file uploading via POST:
http://www.tutorialspoint.com/python/python_cgi_programming.htm
http://php.net/manual/en/features.file-upload.multiple.php


Uploading large number of files to S3 with boto3 [duplicate]

Hey, there were some similar questions, but none exactly like this, and a fair number of them were several years old and out of date.
I have written some code on my server that uploads jpeg photos into an s3 bucket using a key via the boto3 method upload_file. Initially this seemed great. It is a super simple solution to uploading files into s3.
The thing is, I have users. My users are sending their jpegs to my server via a phone app. While I concede that I could generate presigned upload URLs and send them to the phone app, that would require a considerable rewrite of our phone app and API.
So I just want the phone app to send the photos to the server. I then want to send the photos from the server to s3. I implemented this but it is way too slow. I cannot ask my users to tolerate those slow uploads.
What can I do to speed this up?
I did some Google searching and found this: https://medium.com/@alejandro.millan.frias/optimizing-transfer-throughput-of-small-files-to-amazon-s3-or-anywhere-really-301dca4472a5
It suggests that the solution is to increase the number of TCP/IP connections. More TCP/IP connections means faster uploads.
Okay, great!
How do I do that? How do I increase the number of TCP/IP connections so I can upload a single jpeg into AWS s3 faster?
Please help.
Ironically, we've been using boto3 for years, as well as awscli, and we like them both.
But we've often wondered why awscli's aws s3 cp --recursive, or aws s3 sync, are often so much faster than trying to do a bunch of uploads via boto3, even with concurrent.futures's ThreadPoolExecutor or ProcessPoolExecutor (and don't you even dare sharing the same s3.Bucket among your workers: it's warned against in the docs, and for good reasons; nasty crashes will eventually ensue at the most inconvenient time).
Finally, I bit the bullet and looked inside the "customization" code that awscli introduces on top of boto3.
Based on that little exploration, here is a way to speed up the upload of many files to S3 by using the concurrency already built into boto3.s3.transfer, not just for the possible multiparts of a single, large file, but for a whole bunch of files of various sizes as well. That functionality is, as far as I know, not exposed through the higher-level APIs of boto3 that are described in the boto3 docs.
The following:
Uses boto3.s3.transfer to create a TransferManager, the very same one that is used by awscli's aws s3 sync, for example.
Extends the max number of threads to 20.
Augments the underlying urllib3 max pool connections capacity used by botocore to match (by default, it uses 10 connections maximum).
Gives you an optional callback capability (demoed here with a tqdm progress bar, but of course you can have whatever callback you'd like).
Is fast (over 100 MB/s, tested on an EC2 instance).
I put a complete example as a gist here that includes the generation of 500 random csv files for a total of about 360MB. Here below, we assume you already have a bunch of files in filelist, for a total of totalsize bytes:
import os
import boto3
import botocore
import boto3.s3.transfer as s3transfer

def fast_upload(session, bucketname, s3dir, filelist, progress_func, workers=20):
    botocore_config = botocore.config.Config(max_pool_connections=workers)
    s3client = session.client('s3', config=botocore_config)
    transfer_config = s3transfer.TransferConfig(
        use_threads=True,
        max_concurrency=workers,
    )
    s3t = s3transfer.create_transfer_manager(s3client, transfer_config)
    for src in filelist:
        dst = os.path.join(s3dir, os.path.basename(src))
        s3t.upload(
            src, bucketname, dst,
            subscribers=[
                s3transfer.ProgressCallbackInvoker(progress_func),
            ],
        )
    s3t.shutdown()  # wait for all the upload tasks to finish
Example usage
from tqdm import tqdm

bucketname = '<your-bucket-name>'
s3dir = 'some/path/for/junk'
filelist = [...]
totalsize = sum([os.stat(f).st_size for f in filelist])

with tqdm(desc='upload', ncols=60,
          total=totalsize, unit='B', unit_scale=1) as pbar:
    fast_upload(boto3.Session(), bucketname, s3dir, filelist, pbar.update)

Python: How to know if file is locked in FTP [duplicate]

My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to apply some processing, but I don't know how to detect that a file has not finished uploading.
Is there any way to detect whether a file has not yet been released by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file being uploaded, preventing you from accessing it while the file is still being uploaded. For example, the IIS FTP server does that. Most other FTP servers do not. See my answer at Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files. A minimal client-side sketch of this convention follows this list.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that the file is still incomplete.
Some FTP servers allow you to configure a hook to be called when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
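As an illustration of the ".filepart" convention above, here is a minimal client-side sketch using Python's standard ftplib; the host, credentials, and names are placeholders:
from ftplib import FTP

def upload_atomically(host, user, password, local_path, remote_name):
    # Upload under a temporary name, then rename once the transfer is complete.
    # The watcher on the server side simply ignores "*.filepart" files.
    ftp = FTP(host)
    ftp.login(user, password)
    try:
        with open(local_path, 'rb') as f:
            ftp.storbinary('STOR ' + remote_name + '.filepart', f)
        ftp.rename(remote_name + '.filepart', remote_name)
    finally:
        ftp.quit()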
I use ftputil to implement this workaround (a rough equivalent using plain ftplib is sketched below):
connect to the FTP server
list all files in the directory
call stat() on each file
wait N seconds
for each file: call stat() again; if the result is different, skip this file, since it was modified during the last few seconds
if the stat() result is unchanged, download the file
This whole FTP-fetching approach is old and obsolete technology. I hope that the customer will use a modern HTTP API next time :-)
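For reference, here is a minimal sketch of that polling approach using only the standard library's ftplib (the server details, wait interval, and download directory are placeholders):
import time
from ftplib import FTP

WAIT_SECONDS = 10

def download_stable_files(host, user, password, local_dir='.'):
    ftp = FTP(host)
    ftp.login(user, password)
    try:
        ftp.voidcmd('TYPE I')                 # binary mode, so SIZE is reliable
        names = ftp.nlst()
        sizes_before = {name: ftp.size(name) for name in names}
        time.sleep(WAIT_SECONDS)              # give in-progress uploads time to grow
        for name in names:
            if ftp.size(name) != sizes_before[name]:
                continue                      # still changing: skip it this round
            with open(local_dir + '/' + name, 'wb') as out:
                ftp.retrbinary('RETR ' + name, out.write)
    finally:
        ftp.quit()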
If your watcher filters on particular file extensions, you can use WinSCP for the transfer: it creates a temporary file with the extension .filepart and renames it to the actual file name once the file has fully transferred.
I hope it helps someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the same problem with the second file because you just wait until its size equals the checksum's size.)
And of course this only works if you can get the sender to send a second file.
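As a hedged sketch of this marker-plus-checksum idea on the sending side (again using ftplib; ftp is assumed to be an already connected ftplib.FTP instance, and the names are placeholders):
import hashlib
import io

def upload_with_checksum_marker(ftp, local_path, remote_name):
    # Compute the checksum, upload the data file, then upload a tiny marker file
    # whose content lets the receiver verify the data file once it sees the marker.
    md5 = hashlib.md5()
    with open(local_path, 'rb') as f:
        for block in iter(lambda: f.read(64 * 1024), b''):
            md5.update(block)
    with open(local_path, 'rb') as f:
        ftp.storbinary('STOR ' + remote_name, f)
    marker = io.BytesIO(md5.hexdigest().encode('ascii'))
    ftp.storbinary('STOR ' + remote_name + '.md5', marker)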

How to stream a file in a request?

We have a React application communicating with a Django backend. Whenever the React application wants to upload a file to the backend, we send a form request with one field being the handle of the file being uploaded. The field is received on the Django side as an InMemoryUploadedFile, which is an object with chunks that can be processed, for example, like this:
def save_uploaded_file(uploaded_file, handle):
    """
    Saves the uploaded file using the given file handle.
    We walk the chunks to avoid reading the whole file in memory.
    """
    for chunk in uploaded_file.chunks():
        handle.write(chunk)
    handle.flush()
    logger.debug(f'Saved file {uploaded_file.name} with length {uploaded_file.size}')
Now, I am creating some testing framework using requests to drive our API. I am trying to emulate this mechanism, but strangely enough, requests insists on reading from the open handle before sending the request. I am doing:
requests.post(url, data, headers=headers, **kwargs)
with:
data = {'content': open('myfile', 'rb'), ...}
Note that I am not reading from the file, I am just opening it. But requests insists on reading from it, and sends the data embedded, which has several problems:
it can be huge
by being binary data, it corrupts the request
it is not what my application expects
I do not want this: I want requests simply to "stream" that file, not to read it. There is a files parameter, but that will create a multipart with the file embedded in the request, which is again not what I want. I want all fields in the data to be passed in the request, and the content field to be streamed. I know this is possible because:
the browser does it
Postman does it
the django test client does it
How can I force requests to stream a particular file in the data?
Probably, this is no longer relevant, but I will share some information that I found in the documentation.
By default, if an uploaded file is smaller than 2.5 megabytes, Django will hold the entire contents of the upload in memory. This means that saving the file involves only a read from memory and a write to disk and thus is very fast. However, if an uploaded file is too large, Django will write the uploaded file to a temporary file stored in your system's temporary directory.
This way, there is no need to create a streaming file upload. Rather, the solution might be to handle (read) the uploaded file using a buffer.
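For reference, the 2.5 MB threshold quoted above is controlled by Django's FILE_UPLOAD_MAX_MEMORY_SIZE setting (the value below is just its documented default); larger uploads are spooled to a temporary file that the chunks() loop shown earlier reads back in pieces:
# settings.py
# Uploads larger than this many bytes are written to a temporary file on disk
# instead of being held in memory (2621440 bytes = 2.5 MB is the documented default).
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440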

Best practice: send multiple files to rest endpoint using python requests

What is the best way to send a lot of POST requests to a REST endpoint via Python?
E.g. I want to upload ~500k files to a database.
What I've done so far is a loop that creates a new request for each file using the requests package.
from os import listdir

# get list of files
files = [f for f in listdir(folder_name)]

# loop through the list
for file_name in files:
    try:
        # open file and get content
        with open(folder_name + "\\" + file_name, "r") as file:
            f = file.read()
        # create request (make_request is a helper defined elsewhere)
        req = make_request(url, f)
    except Exception:
        # error handling, logging, ...
        pass
But this is quite slow: what is the best practice for doing this? Thank you.
First approach:
I don't know if it is the best practice, but you could split the files into batches of 1000, zip each batch, and send the archives as POST requests using threads (set the number of threads to the number of processor cores). The REST endpoint can extract the zipped contents and then process them. A sketch of this batching idea follows below.
Second approach:
Zip the files in batches and transfer them batch by batch. After the transfer is completed, validate on the server side, then start the database upload in one go.
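Here is a rough sketch of that first approach, assuming a hypothetical endpoint that accepts a zipped batch as a multipart upload (the URL, folder, batch size, and worker count are all placeholders):
import io
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

def send_batch(url, batch):
    # Zip one batch of files in memory and POST the archive as a single request.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in batch:
            zf.write(path, arcname=path.name)
    buf.seek(0)
    return requests.post(url, files={'archive': ('batch.zip', buf, 'application/zip')})

url = 'https://example.com/upload'            # hypothetical endpoint
files = sorted(p for p in Path('my_folder').iterdir() if p.is_file())
batches = [files[i:i + 1000] for i in range(0, len(files), 1000)]

with ThreadPoolExecutor(max_workers=8) as pool:
    responses = list(pool.map(lambda batch: send_batch(url, batch), batches))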
The first thing you want to do is determine exactly which part of your script is the bottleneck. You have both disk and network I/O here (reading files and sending HTTP requests, respectively).
Assuming that the HTTP requests are the actual bottleneck (highly likely), consider using aiohttp instead of requests. The docs have some good examples to get you started, and there are plenty of "Quick Start" articles out there. This would allow your network requests to be cooperative, meaning that other Python code can run while one of your network requests is waiting. Just be careful not to overwhelm whatever server is receiving the requests.
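A minimal sketch of what that could look like with aiohttp, assuming the endpoint accepts a raw file body in a POST (the URL, folder name, and concurrency limit below are placeholders):
import asyncio
from pathlib import Path

import aiohttp

async def upload_all(url, folder, concurrency=20):
    semaphore = asyncio.Semaphore(concurrency)   # cap the number of in-flight requests

    async def upload_one(session, path):
        async with semaphore:
            # Small files, so reading the whole body into memory is acceptable here.
            async with session.post(url, data=path.read_bytes()) as resp:
                return path.name, resp.status

    async with aiohttp.ClientSession() as session:
        tasks = [upload_one(session, p) for p in Path(folder).iterdir() if p.is_file()]
        return await asyncio.gather(*tasks)

results = asyncio.run(upload_all('https://example.com/upload', 'my_folder'))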

Streaming audio using Python (without GStreamer)

I'm working on a project that involves streaming .OGG (or .mp3) files from my webserver. I'd prefer not to have to download the whole file and then play it; is there a way to do that in pure Python (no GStreamer, hoping to make it truly cross-platform)? Is there a way to use urllib to download the file a chunk at a time and load that into, say, PyGame to do the actual audio playing?
Thanks!
I suppose your server supports Range requests. You ask the server, via the Range header, for the start byte and end byte of the range you want:
import urllib2
req = urllib2.Request(url)
req.headers['Range'] = 'bytes=%s-%s' % (startByte, endByte)
f = urllib2.urlopen(req)
f.read()
You can implement a file-like object that always downloads just the needed chunk of the file from the server. Almost every library accepts a file object as input.
It will probably be slow because of network latency. You would need to download bigger chunks of the file, preload the file in a separate thread, etc. In other words, you would need to implement the streaming-client logic yourself.
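A rough sketch of such a file-like object, building on the urllib2 snippet above (no buffering or end-of-file handling, so treat it only as a starting point):
import urllib2

class HttpRangeFile(object):
    """Read-only file-like object that fetches byte ranges on demand."""

    def __init__(self, url):
        self.url = url
        self.pos = 0

    def seek(self, offset, whence=0):
        # Only absolute and relative seeks are handled in this sketch.
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset

    def tell(self):
        return self.pos

    def read(self, size=64 * 1024):
        # Ask the server for just the bytes [pos, pos + size).
        req = urllib2.Request(self.url)
        req.headers['Range'] = 'bytes=%d-%d' % (self.pos, self.pos + size - 1)
        data = urllib2.urlopen(req).read()
        self.pos += len(data)
        return data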
