I am serving quite large files from a Pyramid Application I have written.
My only problem is that download managers don't want to play nicely.
I can't get resumed or segmented downloads to work with download managers like DownThemAll.
import os
from pyramid.response import Response

size = os.path.getsize(Path + dFile)  # Path and dFile are set elsewhere in the view
response = Response(content_type='application/force-download', content_disposition='attachment; filename=' + dFile)
response.app_iter = open(Path + dFile, 'rb')
response.content_length = size
I think the problem may lie with paste.httpserver but I am not sure.
The web server on the Python side needs to support partial downloads, which are negotiated through the HTTP Accept-Ranges and Range headers. This blog post digs a bit into the matter with an example in Python:
Python sample: Downloading file through HTTP protocol with multi-threads
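For illustration, here is a rough sketch of honoring a single-range request by hand in a Pyramid view, reusing the Path and dFile names from the question; it only handles the common bytes=start-end / bytes=start- forms and is not production code:

import os
from pyramid.response import Response

def download(request):
    path = Path + dFile  # Path and dFile as in the question
    size = os.path.getsize(path)
    f = open(path, 'rb')
    response = Response(content_type='application/force-download',
                        content_disposition='attachment; filename=' + dFile)
    response.headers['Accept-Ranges'] = 'bytes'  # advertise resume support
    range_header = request.headers.get('Range')
    if range_header and range_header.startswith('bytes='):
        start_s, _, end_s = range_header[len('bytes='):].partition('-')
        start = int(start_s)
        end = int(end_s) if end_s else size - 1
        f.seek(start)
        remaining = end - start + 1

        def body(fh, left, chunk=64 * 1024):
            # stream only the requested slice of the file
            while left > 0:
                data = fh.read(min(chunk, left))
                if not data:
                    break
                left -= len(data)
                yield data

        response.status_int = 206  # Partial Content
        response.headers['Content-Range'] = 'bytes %d-%d/%d' % (start, end, size)
        response.content_length = remaining
        response.app_iter = body(f, remaining)
    else:
        response.content_length = size
        response.app_iter = f
    return response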
Pyramid 1.3 adds new response classes, FileResponse and FileIter for manually serving files.
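A minimal sketch with FileResponse (the file path and view name here are just placeholders):

from pyramid.response import FileResponse

def download(request):
    # FileResponse streams the file in blocks and sets Content-Length for you
    return FileResponse('/path/to/file.bin', request=request,
                        content_type='application/octet-stream')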
Having been working on this problem for a while, I found
http://docs.webob.org/en/latest/file-example.html
to be a great help.
Related
I am currently developing a Python script that calls a REST API to download data that is made available every day through the API. The files that I am trying to download have a .txt.bz2 extension.
The API documentation recommends using curl to download data from the API. In particular, the recommended command to download the data is:
curl --user Username:Password https://api.endpoint.com/data/path/to/file -o my_filename.txt.bz2
where, of course, the URL of the API data endpoint is just fictitious.
Since the documentation recommends curl, my current implementation of the Python script leverages the subprocess library to call curl within Python:
import subprocess

def data_downloader(download_url, file_name, api_username, api_password):
    # Build the curl command recommended by the API docs and run it
    args = ['curl', '--user', f'{api_username}:{api_password}', download_url, '-o', file_name]
    subprocess.call(args)
    return file_name
However, since I am extensively using the requests library in other parts of the application, mainly to send requests to the API and to walk its file-system-like structure, I have tried to implement the download function with that library as well. In particular, I used this other Stack Overflow thread as the reference for my alternative implementation, and the two functions I implemented with requests look like this:
import requests
import shutil

def download_file(download_url, file_name, api_username, api_password, chunk_size):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return file_name

def shutil_download(download_url, file_name, api_username, api_password):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        with open(file_name, 'wb') as f:
            shutil.copyfileobj(r.raw, f)  # copy to the file object, not the file name
    return file_name
With the subprocess implementation I am able to download the entire file without any issue; with the two requests implementations, however, I always end up with a downloaded file of about 1 KB, which is clearly wrong since most of the data I am downloading is larger than 10 GB.
I suspect the issue is caused by the format of the data I am attempting to download, as I have seen successful attempts at downloading .zip or .gzip files using the same logic as in the two functions. Hence I am wondering if anyone has an explanation for the issue I am experiencing or can provide a working solution to the problem.
UPDATE
I had a chance to discuss the issue with the owner of the API. Upon analysis of the logs on their side, they found out that there were issues on their end that prevented the request from going through: on my side the status code signalled a successful request, but the returned data was not the correct one.
The two functions that use the requests library work as expected, and the issue can be considered solved.
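For anyone hitting a similar symptom, here is a minimal sketch of the kind of sanity checks that would have surfaced the problem earlier; the function name and the exact checks are just illustrative:

import requests

def download_file_checked(download_url, file_name, api_username, api_password, chunk_size=1024 * 1024):
    with requests.get(download_url, auth=(api_username, api_password), stream=True) as r:
        r.raise_for_status()  # fail loudly on 4xx/5xx instead of writing an error page to disk
        expected = r.headers.get('Content-Length')
        written = 0
        with open(file_name, 'wb') as f:
            for chunk in r.iter_content(chunk_size=chunk_size):
                f.write(chunk)
                written += len(chunk)
    if expected is not None and written != int(expected):
        raise IOError('expected %s bytes, wrote %d' % (expected, written))
    return file_name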
What is the best way to send a lot of POST requests to a REST endpoint via Python?
E.g. I want to upload ~500k files to a database.
What I've done so far is a loop that creates a new request for each file using the requests package.
from os import listdir

# get list of files
files = [f for f in listdir(folder_name)]

# loop through the list
for file_name in files:
    try:
        # open file and get content
        with open(folder_name + "\\" + file_name, "r") as file:
            f = file.read()
        # create request (make_request is the helper that POSTs the content to the endpoint)
        req = make_request(url, f)
    except Exception:
        # error handling, logging, ...
        pass
But this is quite slow: what is the best practice for doing this? Thank you.
First approach:
I don't know if it is the best practice, but you can split the files into batches of 1000, zip each batch, and send the archives as POST requests from multiple threads (set the number of threads to the number of processor cores); a rough sketch follows below.
(The REST endpoint can then extract the zipped contents and process them.)
Second approach:
Zip the files in batches and transfer them batch by batch.
After the transfer is completed, validate on the server side.
Then start the database upload in one go.
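A minimal sketch of the first approach, assuming the endpoint accepts a zipped batch at a hypothetical /upload-batch URL and reusing folder_name from the question:

import io
import os
import zipfile
import requests
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 1000
UPLOAD_URL = "https://example.com/upload-batch"  # hypothetical endpoint

def send_batch(batch):
    # zip the batch in memory and POST the archive as a single request
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in batch:
            zf.write(path, arcname=os.path.basename(path))
    buf.seek(0)
    return requests.post(UPLOAD_URL, files={"archive": ("batch.zip", buf)})

paths = [os.path.join(folder_name, f) for f in os.listdir(folder_name)]
batches = [paths[i:i + BATCH_SIZE] for i in range(0, len(paths), BATCH_SIZE)]
with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
    responses = list(pool.map(send_batch, batches))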
The first thing you want to do is determine exactly which part of your script is the bottleneck. You have both disk and network I/O here (reading files and sending HTTP requests, respectively).
Assuming that the HTTP requests are the actual bottleneck (highly likely), consider using aiohttp instead of requests. The docs have some good examples to get you started and there are plenty of "Quick Start" articles out there. This would allow your network requests to be cooperative, meaning that other Python code can run while one of your network requests is waiting. Just be careful not to overwhelm whatever server is receiving the requests.
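A minimal aiohttp sketch along those lines, with a semaphore to cap concurrency so the receiving server is not overwhelmed; url and folder_name are assumed to come from your existing code:

import asyncio
import os
import aiohttp

CONCURRENCY = 20  # cap on simultaneous requests

async def upload_one(session, sem, url, path):
    with open(path, "r") as fh:
        payload = fh.read()
    async with sem:
        async with session.post(url, data=payload) as resp:
            return resp.status

async def upload_all(url, folder_name):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [upload_one(session, sem, url, os.path.join(folder_name, f))
                 for f in os.listdir(folder_name)]
        return await asyncio.gather(*tasks)

# statuses = asyncio.run(upload_all(url, folder_name))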
I am using Flask for my web API service.
Finding that my service sometimes (about 1 in 100 requests) responds really slowly (seconds), I started debugging, which showed me that the service sometimes hangs while reading a request field.
import time
from flask import Flask, request

app = Flask(__name__)

@app.route('/scan', methods=['POST'])
def scan():
    start_time = time.time()
    request_description = request.form.get('requestDescription')
    end_time = time.time()
    app.logger.debug('delay is %s', end_time - start_time)
    return 'ok'
Here I found that delay between start_time and end_time can be up to 2 minutes.
I've read that Flask's built-in Werkzeug server shouldn't be used in production, so I tried Gunicorn as an alternative - same thing.
I feel that my problem is somehow similar to this one, with the difference that switching to another server didn't solve the problem.
I tried to profile the app using cProfile and SnakeViz, but only with the non-production Werkzeug server, as I don't get how to profile Python apps running on Gunicorn. (Maybe anyone here knows how to?)
My POST requests contain a description and a file. The file can vary in size, but the logs show that the issue reproduces regardless of the file size.
People also usually say that Flask should be run behind an Nginx-[normal server]-Flask combo, but as I use the service inside OpenShift, I doubt this matters. (HAProxy works as the balancer.)
So my settings:
Alpine 3.8.1
GUnicorn:
workers:3
threads:1
What happens under the hood when I call this?
request.form.get('requestDescription')
How can I profile Python code under GUnicorn?
Did anyone else encounter such a problem?
Any help will be appreciated
I did face this issue as well. I was uploading a video file via a POST request, and it turned out that the video upload itself was not the issue.
The timing bottleneck was the request.form.get() call. While I am still trying to figure out the root cause, you can use Flask Monitoring Dashboard to time-profile the code.
The profiler shows that under the hood the time is spent in return self._sock.recv_into(b), i.e. waiting on the socket read.
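On the "how do I profile under Gunicorn" question: one option, sketched here under the assumption of a standard Flask app object, is to wrap the WSGI app in Werkzeug's ProfilerMiddleware, which writes one .prof file per request that you can open in SnakeViz no matter which server runs the app:

from flask import Flask
from werkzeug.middleware.profiler import ProfilerMiddleware  # werkzeug.contrib.profiler on older Werkzeug

app = Flask(__name__)

# dump one .prof file per request into ./profiles; inspect them later with snakeviz
app.wsgi_app = ProfilerMiddleware(app.wsgi_app, profile_dir="./profiles")

# then start the server as usual, e.g.: gunicorn --workers 3 myapp:app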
I'm working on a simple app that takes images, optimizes them, and saves them in cloud storage. I found an example that takes the file and uses PIL to optimize it. The code looks like this:
import StringIO

from PIL import Image
from google.appengine.api import files

def inPlaceOptimizeImage(photo_blob):
    blob_key = photo_blob.key()
    new_blob_key = None
    img = Image.open(photo_blob.open())
    output = StringIO.StringIO()
    img.save(output, img.format, optimize=True, quality=90)  # PIL's keyword is "optimize"
    opt_img = output.getvalue()
    output.close()
    # Create the file
    file_name = files.blobstore.create(mime_type=photo_blob.content_type)
    # Open the file and write to it
    with files.open(file_name, 'a') as f:
        f.write(opt_img)
    # Finalize the file. Do this before attempting to read it.
    files.finalize(file_name)
    # Get the file's blob key
    return files.blobstore.get_blob_key(file_name)
This works fine locally (although I don't know how well it's being optimized, because when I run the uploaded image through something like http://www.jpegmini.com/ it still gets reduced by 2.4x). However, when I deploy the app and try uploading images, I frequently get 500 errors and these messages in the logs:
F 00:30:33.322 Exceeded soft private memory limit of 128 MB with 156 MB after servicing 7 requests total
W 00:30:33.322 While handling this request, the process that handled this request was found to be using too much memory and was terminated. This is likely to cause a new process to be used for the next request to your application. If you see this message frequently, you may have a memory leak in your application.
I have two questions:
Is this even the best way to optimize and save images in cloud storage?
How do I prevent these 500 errors from occurring?
Thanks in advance.
The error you're experiencing is happening due to the memory limits of your instance class.
What I would suggest is to edit your .yaml file in order to configure your module, and specify your instance class to be F2 or higher.
In case you are not using modules, you should also add "module: default" at the beginning of your app.yaml file to let GAE know that this is your default module.
You can take a look at this article from the docs to see the different instance classes available and the way to easily configure them.
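A minimal app.yaml sketch of that change; the values below are illustrative and you should keep your existing runtime and handler settings:

module: default
instance_class: F2
runtime: python27
api_version: 1
threadsafe: true

handlers:
- url: /.*
  script: main.app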
Another, more basic workaround would be to limit the image size when uploading, but you would eventually end up with a similar issue.
Regarding the previous matter and a way to optimize your images, you may want to take a look at the App Engine Images API that provides the ability to manipulate image data using a dedicated Images service. In your case, you might like the "I'm Feeling Lucky" transformation. By using this API you might not need to update your Instance class.
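A rough sketch of what that could look like with the legacy Images API, reusing photo_blob from the question; treat it as an outline rather than a drop-in replacement:

from google.appengine.api import images

def optimizeWithImagesApi(photo_blob):
    # Load the image by blob key so the raw bytes never have to be read into the app
    img = images.Image(blob_key=str(photo_blob.key()))
    img.im_feeling_lucky()  # the "I'm Feeling Lucky" auto-adjustment
    optimized = img.execute_transforms(output_encoding=images.JPEG, quality=90)
    return optimized  # write these bytes out to blobstore/Cloud Storage as before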
I'm maintaining an open-source document asset management application called NotreDAM, which is written in Django and runs on Apache and an instance of TwistedWeb.
Whenever any user downloads a file, the application hangs for all users for the entire duration of the download. I've tracked the download down to this point in the code, but I'm not versed enough in Python/Django to know why this may be happening.
# Note: .read() loads the entire file into memory before the response is built
response = HttpResponse(open(fullpath, 'rb').read(), mimetype=mimetype)
response["Last-Modified"] = http_date(statobj.st_mtime)
response["Content-Length"] = statobj.st_size
if encoding:
    response["Content-Encoding"] = encoding
return response
Do you know how I could fix the application hanging while a file downloads?
The web server reads the whole file into memory instead of streaming it. That is not well-written code, but not a bug per se.
This blocks the (pre-forked) Apache client for the duration of the whole file read. If I/O is slow and the file is large, this may take some time.
Usually you have several pre-forked Apache clients configured to satisfy this kind of request, but on a badly configured web server you may run into this kind of problem, and it is not a Django issue. Your web server is probably running only one pre-forked process, potentially in debug mode.
NotreDAM serves the asset files using the django.views.static.serve() view, about which the Django docs say: "Using this method is inefficient and insecure. Do not use this in a production setting. Use this only for development." So there we go. I have to use another approach.
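For reference, a minimal sketch of a streaming alternative using Django's FileResponse, which wraps the file in an iterator instead of reading it all into memory (on older Django versions, StreamingHttpResponse with a file iterator plays the same role); the view name and details are illustrative:

import os
import mimetypes
from django.http import FileResponse

def download(request, fullpath):
    content_type, encoding = mimetypes.guess_type(fullpath)
    response = FileResponse(open(fullpath, 'rb'),
                            content_type=content_type or 'application/octet-stream')
    response["Content-Length"] = os.path.getsize(fullpath)
    if encoding:
        response["Content-Encoding"] = encoding
    return response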