Library for sending HTTP requests in file-like objects - python

I'm currently using Python requests for HTTP requests, but due to limitations in its API, I'm unable to keep using the library.
I need a library that will let me write the request body in a streaming, file-like fashion: the data I'll be sending won't all be available immediately, and I'd also like to use as little memory as possible while making a request. Is there an easy-to-use library that will let me send a PUT request like this:
request = HTTPRequest()
request.headers['content-type'] = 'application/octet-stream'
# etc

request.connect()

# send body
with open('myfile', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024)
        request.body.write(chunk)
        if not len(chunk) == 64 * 1024:
            break

# finish
request.close()
More specifically, I have one thread to work with. Using this thread, I receive callbacks as I receive a stream over the network. Essentially, those callbacks look like this:
class MyListener(Listener):
    def on_stream_start(stream_name):
        pass

    def on_stream_chunk(chunk):
        pass

    def on_stream_end(total_size):
        pass
I need to essentially create my upload request in the on_stream_start method, upload chunks in the on_stream_chunk method, then finish the upload in the on_stream_end method. Thus, I need a library which supports a method like write(chunk) to be able to do something similar to the following:
class MyListener(Listener):
    request = None

    def on_stream_start(stream_name):
        request = RequestObject(get_url(), "PUT")
        request.headers.content_type = "application/octet-stream"
        # ...

    def on_stream_chunk(chunk):
        request.write_body(chunk + sha256(chunk).hexdigest())

    def on_stream_end(total_size):
        request.close()
The requests library supports file-like objects and generators for reading but nothing for writing out the requests: pull instead of push. Is there a library which will allow me to push data up the line to the server?

As far as I can tell, httplib's HTTPConnection.request does exactly what you want.
I tracked down the function which actually does the sending, and as long as you're passing a file-like object (and not a string), it chunks it up:
Definition: httplib.HTTPConnection.send(self, data)
Source:
def send(self, data):
    """Send `data' to the server."""
    if self.sock is None:
        if self.auto_open:
            self.connect()
        else:
            raise NotConnected()

    if self.debuglevel > 0:
        print "send:", repr(data)
    blocksize = 8192
    if hasattr(data, 'read') and not isinstance(data, array):
        if self.debuglevel > 0: print "sendIng a read()able"
        ## {{{ HERE IS THE CHUNKING LOGIC
        datablock = data.read(blocksize)
        while datablock:
            self.sock.sendall(datablock)
            datablock = data.read(blocksize)
        ## }}}
    else:
        self.sock.sendall(data)
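For reference, a minimal usage sketch (host and path are made up): pass the open file object itself as the body of the request, and httplib streams it through send() in 8192-byte blocks as shown above; from Python 2.6 on it also sets Content-Length automatically for file bodies.
import httplib  # http.client on Python 3

conn = httplib.HTTPConnection('example.com')  # hypothetical host
with open('myfile', 'rb') as f:
    # Pass the file object (not a string) so send() chunks it as shown above
    conn.request('PUT', '/upload', body=f,
                 headers={'Content-Type': 'application/octet-stream'})
resp = conn.getresponse()
print resp.status, resp.reason
conn.close()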

I do something like this in a few places in my codebase. You need an upload file wrapper, and you need another thread or a greenthread - I'm using eventlet for fake threading in my instance. Call requests.put with the wrapper as the body; the thread you call put in will block on read() of your file-like wrapper, so the callbacks that receive the stream need to run in another thread.
Sorry for not posting code, I just saw this when I was zipping through. I hope this is enough to help; if not, maybe I can edit and add more later.
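To make the idea concrete, here is a minimal sketch (names and URL are mine, not the answerer's): instead of a file-like wrapper, a generator pulling from a Queue does the same push-to-pull inversion. requests sends an iterator body with chunked transfer encoding, the blocking happens in a worker thread, and the listener callbacks keep pushing chunks.
import threading
import queue  # 'Queue' on Python 2
import requests

chunks = queue.Queue()

def body_generator():
    # Pull chunks pushed by the listener thread; None is the end-of-stream sentinel.
    while True:
        chunk = chunks.get()
        if chunk is None:
            return
        yield chunk

def do_upload():
    # An iterator body without a length makes requests use chunked transfer
    # encoding, so only one chunk at a time is held in memory.
    requests.put('http://example.com/upload',  # hypothetical URL
                 data=body_generator(),
                 headers={'Content-Type': 'application/octet-stream'})

uploader = threading.Thread(target=do_upload)
uploader.start()

# From the listener callbacks:
#   on_stream_chunk(chunk) -> chunks.put(chunk)
#   on_stream_end(total)   -> chunks.put(None); uploader.join()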

Requests actually supports multipart encoded requests with the files parameter:
Multipart POST example in the official documentation:
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
r.text
{
    ...
    "files": {
        "file": "<censored...binary...data>"
    },
    ...
}
You can create your own file-like streaming object if you like, too, but you cannot mix a stream and files in the same request.
A simple case that might work for you would be to open the file and return a chunking, generator-based reader:
def read_as_gen(filename, chunksize=-1):  # -1 reads the whole file at once, like a regular .read()
    with open(filename, mode='rb') as f:
        while True:
            chunk = f.read(chunksize)
            if chunk:
                yield chunk
            else:
                return  # end the generator; don't raise StopIteration inside it (PEP 479)
# Now that we can read the file as a generator with a chunksize, give it to the files parameter
files = {'file': read_as_gen(filename, 64*1024)}
# ... post as normal.
But if you had to block the chunking on something else, like another network buffer, you could handle that in the same manner:
def read_buffer_as_gen(buffer_params, chunksize=-1):  # -1 reads everything at once, like a regular .read()
    with buffer_open(*buffer_params) as buf:  # some function to open up your buffer
        # you could also just pass in the buffer itself and skip the `with` block
        while True:
            chunk = buf.read(chunksize)
            if chunk:
                yield chunk
            else:
                return  # again, return rather than raising StopIteration
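A hedged usage note (the URL is illustrative): besides the files parameter, the generator can also be handed to requests directly as the request body via data=. requests then sends it with chunked transfer encoding, which is closer to the raw streaming PUT the question asks about.
import requests

r = requests.put('http://example.com/upload',            # hypothetical URL
                 data=read_as_gen('myfile', 64 * 1024),   # generator from above
                 headers={'Content-Type': 'application/octet-stream'})
print(r.status_code)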

This may help
import urllib2
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT' # or 'DELETE'
response = urllib2.urlopen(request)

Related

How to Upload a large File (≥3GB) to FastAPI backend?

I am trying to upload a large file (≥3GB) to my FastAPI server, without loading the entire file into memory, as my server has only 2GB of free memory.
Server side:
async def uploadfiles(upload_file: UploadFile = File(...)):
Client side:
m = MultipartEncoder(fields={"upload_file": open(file_name, 'rb')})
prefix = "http://xxx:5000"
url = "{}/v1/uploadfiles".format(prefix)
try:
    req = requests.post(
        url,
        data=m,
        verify=False,
    )
which returns:
HTTP 422 {"detail":[{"loc":["body","upload_file"],"msg":"field required","type":"value_error.missing"}]}
I am not sure what MultipartEncoder actually sends to the server, so that the request does not match. Any ideas?
With the requests-toolbelt library, you have to pass the filename as well when declaring the field for upload_file, and you also need to set the Content-Type header—which is the main reason for the error you get, as you are sending the request without setting the Content-Type header to multipart/form-data, followed by the necessary boundary string—as shown in the documentation. Example:
filename = 'my_file.txt'
m = MultipartEncoder(fields={'upload_file': (filename, open(filename, 'rb'))})
r = requests.post(url, data=m, headers={'Content-Type': m.content_type})
print(r.request.headers) # confirm that the 'Content-Type' header has been set
However, I wouldn't recommend using a library (i.e., requests-toolbelt) that hasn't provided a new release for over three years now. I would suggest using Python requests instead, as demonstrated in this answer and that answer (also see Streaming Uploads and Chunk-Encoded Requests), or, preferably, use the HTTPX library, which supports async requests (if you had to send multiple requests simultaneously), as well as streaming File uploads by default, meaning that only one chunk at a time will be loaded into memory (see the documentation). Examples are given below.
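As a brief, hedged aside on the plain requests route mentioned above (this sends the file as the raw request body, which is a different style from the multipart endpoints shown below): passing an open binary file via data= makes requests stream it rather than read it into memory, as described in the Streaming Uploads section of its documentation. The URL here is illustrative only.
import requests

# Streaming upload: the open file is read in chunks, not loaded into memory.
# Note that the FastAPI endpoints further below expect multipart/form-data,
# so this is only a pointer to the raw-body option, not a drop-in client for them.
with open('bigFile.zip', 'rb') as f:
    r = requests.post('http://127.0.0.1:8000/raw-upload', data=f)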
Option 1 (Fast) - Upload File and Form data using .stream()
As previously explained in detail in this answer, when you declare an UploadFile object, FastAPI/Starlette, under the hood, uses a SpooledTemporaryFile with the max_size attribute set to 1MB, meaning that the file data is spooled in memory until the file size exceeds the max_size, at which point the contents are written to disk; more specifically, to a temporary file in your OS's temporary directory—see this answer on how to find/change the default temporary directory—that you later need to read the data from, using the .read() method. Hence, this whole process makes uploading a file quite slow; especially if it is a large file (as you'll see in Option 2 below later on).
To avoid that and speed up the process, as the linked answer above suggested, one can access the request body as a stream. As per Starlette documentation, if you use the .stream() method, the (request) byte chunks are provided without storing the entire body to memory (and later to a temporary file, if the body size exceeds 1MB). This method allows you to read and process the byte chunks as they arrive.
The below takes the suggested solution a step further, by using the streaming-form-data library, which provides a Python parser for parsing streaming multipart/form-data input chunks. This means that not only can you upload Form data along with File(s), but you also don't have to wait for the entire request body to be received in order to start parsing the data. The way it's done is that you initialise the main parser class (passing the HTTP request headers that help to determine the input Content-Type, and hence, the boundary string used to separate each body part in the multipart payload, etc.), and associate one of the Target classes to define what should be done with a field when it has been extracted out of the request body. For instance, FileTarget would stream the data to a file on disk, whereas ValueTarget would hold the data in memory (this class can be used for either Form or File data as well, if you don't need the file(s) saved to the disk). It is also possible to define your own custom Target classes.
I have to mention that the streaming-form-data library does not currently support async calls to I/O operations, meaning that the writing of chunks happens synchronously (within a def function). Though, as the endpoint below uses .stream() (which is an async function), it will give up control for other tasks/requests to run on the event loop, while waiting for data to become available from the stream. You could also run the function for parsing the received data in a separate thread and await it, using Starlette's run_in_threadpool()—e.g., await run_in_threadpool(parser.data_received, chunk)—which is used by FastAPI internally when you call the async methods of UploadFile, as shown here. For more details on def vs async def, please have a look at this answer.
You can also perform certain validation tasks, e.g., ensuring that the input size is not exceeding a certain value. This can be done using the MaxSizeValidator. However, as this would only be applied to the fields you defined—and hence, it wouldn't prevent a malicious user from sending an extremely large request body, which could result in consuming server resources in a way that the application may end up crashing—the below incorporates a custom MaxBodySizeValidator class that is used to make sure that the request body size is not exceeding a pre-defined value. Both validators described above solve the problem of limiting the upload file (as well as the entire request body) size in a likely better way than the one described here, which uses UploadFile, and hence, the file needs to be entirely received and saved to the temporary directory before performing the check (not to mention that the approach does not take into account the request body size at all)—using an ASGI middleware such as this would be an alternative solution for limiting the request body. Also, in case you are using Gunicorn with Uvicorn, you can define limits with regards to, for example, the number of HTTP header fields in a request, the size of an HTTP request header field, and so on (see the documentation). Similar limits can be applied when using reverse proxy servers, such as Nginx (which also allows you to set the maximum request body size using the client_max_body_size directive).
A few notes for the example below. Since it uses the Request object directly, and not UploadFile and Form objects, the endpoint won't be properly documented in the auto-generated docs at /docs (if that's important for your app at all). This also means that you have to perform some checks yourself, such as whether the required fields for the endpoint were received or not, and if they were in the expected format. For instance, for the data field, you could check whether data.value is empty or not (empty would mean that the user has either not included that field in the multipart/form-data, or sent an empty value), as well as if isinstance(data.value, str). As for the file(s), you can check whether file_.multipart_filename is not empty; however, since a filename could likely not be included in the Content-Disposition by some user, you may also want to check if the file exists in the filesystem, using os.path.isfile(filepath) (Note: you need to make sure there is no pre-existing file with the same name in that specified location; otherwise, the aforementioned function would always return True, even when the user did not send the file).
Regarding the applied size limits, the MAX_REQUEST_BODY_SIZE below must be larger than the MAX_FILE_SIZE (plus the size of all the Form values) you expect to receive, as the raw request body (that you get from using the .stream() method) includes a few more bytes for the --boundary and Content-Disposition header for each of the fields in the body. Hence, you should add a few more bytes, depending on the Form values and the number of files you expect to receive (hence the MAX_FILE_SIZE + 1024 below).
app.py
from fastapi import FastAPI, Request, HTTPException, status
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import FileTarget, ValueTarget
from streaming_form_data.validators import MaxSizeValidator
import streaming_form_data
from starlette.requests import ClientDisconnect
import os
MAX_FILE_SIZE = 1024 * 1024 * 1024 * 4 # = 4GB
MAX_REQUEST_BODY_SIZE = MAX_FILE_SIZE + 1024
app = FastAPI()
class MaxBodySizeException(Exception):
    def __init__(self, body_len: int):
        self.body_len = body_len

class MaxBodySizeValidator:
    def __init__(self, max_size: int):
        self.body_len = 0
        self.max_size = max_size

    def __call__(self, chunk: bytes):
        self.body_len += len(chunk)
        if self.body_len > self.max_size:
            raise MaxBodySizeException(body_len=self.body_len)
@app.post('/upload')
async def upload(request: Request):
    body_validator = MaxBodySizeValidator(MAX_REQUEST_BODY_SIZE)
    filename = request.headers.get('Filename')
    if not filename:
        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
                            detail='Filename header is missing')
    try:
        filepath = os.path.join('./', os.path.basename(filename))
        file_ = FileTarget(filepath, validator=MaxSizeValidator(MAX_FILE_SIZE))
        data = ValueTarget()
        parser = StreamingFormDataParser(headers=request.headers)
        parser.register('file', file_)
        parser.register('data', data)

        async for chunk in request.stream():
            body_validator(chunk)
            parser.data_received(chunk)
    except ClientDisconnect:
        print("Client Disconnected")
    except MaxBodySizeException as e:
        raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
                            detail=f'Maximum request body size limit ({MAX_REQUEST_BODY_SIZE} bytes) exceeded ({e.body_len} bytes read)')
    except streaming_form_data.validators.ValidationError:
        raise HTTPException(status_code=status.HTTP_413_REQUEST_ENTITY_TOO_LARGE,
                            detail=f'Maximum file size limit ({MAX_FILE_SIZE} bytes) exceeded')
    except Exception:
        raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
                            detail='There was an error uploading the file')

    if not file_.multipart_filename:
        raise HTTPException(status_code=status.HTTP_422_UNPROCESSABLE_ENTITY, detail='File is missing')

    print(data.value.decode())
    print(file_.multipart_filename)

    return {"message": f"Successfully uploaded {filename}"}
As mentioned earlier, to upload the data (on client side), you can use the HTTPX library, which supports streaming file uploads by default, and thus allows you to send large streams/files without loading them entirely into memory. You can pass additional Form data as well, using the data argument. Below, a custom header, i.e., Filename, is used to pass the filename to the server, so that the server instantiates the FileTarget class with that name (you could use the X- prefix for custom headers, if you wish; however, it is not officially recommended anymore).
To upload multiple files, use a header for each file (or use random names on the server side and, once a file has been fully uploaded, optionally rename it using the file_.multipart_filename attribute), and pass a list of files, as described in the documentation. Note: use a different field name for each file, so that they won't overlap when parsing them on the server side, e.g., files = [('file', open('bigFile.zip', 'rb')), ('file_2', open('bigFile2.zip', 'rb'))], and finally, define the Target classes on the server side accordingly (a hedged sketch of this multi-file client follows the test.py example below).
test.py
import httpx
import time
url = 'http://127.0.0.1:8000/upload'
files = {'file': open('bigFile.zip', 'rb')}
headers = {'Filename': 'bigFile.zip'}
data = {'data': 'Hello World!'}

with httpx.Client() as client:
    start = time.time()
    r = client.post(url, data=data, files=files, headers=headers)
    end = time.time()
    print(f'Time elapsed: {end - start}s')
    print(r.status_code, r.json(), sep=' ')
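As promised above, a hedged sketch of the multi-file variant (the field names, header names, and filenames are illustrative, and the server would need a FileTarget registered for each field, as described earlier):
import httpx

# Each file gets its own field name so the streaming parser can route it to
# its own Target on the server side; a separate header carries each filename.
files = [('file', open('bigFile.zip', 'rb')),
         ('file_2', open('bigFile2.zip', 'rb'))]
headers = {'Filename': 'bigFile.zip', 'Filename2': 'bigFile2.zip'}
data = {'data': 'Hello World!'}

with httpx.Client() as client:
    r = client.post('http://127.0.0.1:8000/upload', data=data,
                    files=files, headers=headers)
    print(r.status_code, r.json())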
Upload both File and JSON body
In case you would like to upload both file(s) and JSON instead of Form data, you can use the approach described in Method 3 of this answer, thus also saving you from performing manual checks on the received Form fields, as explained earlier (see the linked answer for more details). To do that, make the following changes in the code above.
app.py
#...
from fastapi import Form
from pydantic import BaseModel, ValidationError
from typing import Optional
from fastapi.encoders import jsonable_encoder

class Base(BaseModel):
    name: str
    point: Optional[float] = None
    is_accepted: Optional[bool] = False

def checker(data: str = Form(...)):
    try:
        model = Base.parse_raw(data)
    except ValidationError as e:
        raise HTTPException(detail=jsonable_encoder(e.errors()), status_code=status.HTTP_422_UNPROCESSABLE_ENTITY)
    return model
#...
@app.post('/upload')
async def upload(request: Request):
    #...
    # place this after the try-except block
    model = checker(data.value.decode())
    print(model.dict())
test.py
#...
import json
data = {'data': json.dumps({"name": "foo", "point": 0.13, "is_accepted": False})}
#...
Option 2 (Slow) - Upload File and Form data using UploadFile and Form
If you would like to use a normal def endpoint instead, see this answer.
app.py
from fastapi import FastAPI, File, UploadFile, Form, HTTPException, status
import aiofiles
import os
CHUNK_SIZE = 1024 * 1024 # adjust the chunk size as desired
app = FastAPI()
#app.post("/upload")
async def upload(file: UploadFile = File(...), data: str = Form(...)):
try:
filepath = os.path.join('./', os.path.basename(file.filename))
async with aiofiles.open(filepath, 'wb') as f:
while chunk := await file.read(CHUNK_SIZE):
await f.write(chunk)
except Exception:
raise HTTPException(status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail='There was an error uploading the file')
finally:
await file.close()
return {"message": f"Successfuly uploaded {file.filename}"}
As mentioned earlier, using this option would take longer for the file upload to complete, and as HTTPX uses a default timeout of 5 seconds, you will most likely get a ReadTimeout exception (as the server will need some time to read the SpooledTemporaryFile in chunks and write the contents to a permanent location on the disk). Thus, you can configure the timeout (see the Timeout class in the source code too), and more specifically, the read timeout, which "specifies the maximum duration to wait for a chunk of data to be received (for example, a chunk of the response body)". If set to None instead of some positive numerical value, there will be no timeout on read.
test.py
import httpx
import time
url = 'http://127.0.0.1:8000/upload'
files = {'file': open('bigFile.zip', 'rb')}
headers = {'Filename': 'bigFile.zip'}
data = {'data': 'Hello World!'}
timeout = httpx.Timeout(None, read=180.0)

with httpx.Client(timeout=timeout) as client:
    start = time.time()
    r = client.post(url, data=data, files=files, headers=headers)
    end = time.time()
    print(f'Time elapsed: {end - start}s')
    print(r.status_code, r.json(), sep=' ')

Upload large video file as chunks and send some parameters along with that using python flask?

I was able to upload a large file to the server using the code below:
#app.route("/upload", methods=["POST"])
def upload():
with open("/tmp/output_file", "bw") as f:
chunk_size = 4096
while True:
chunk = request.stream.read(chunk_size)
if len(chunk) == 0:
return
f.write(chunk)
But if I use request.form['userId'] or any parameter which is sent as form data in the above code it fails.
As per one of the blog posts: Flask's request has a stream that will have the file data you are uploading. You can read from it, treating it as a file-like object. The trick seems to be that you shouldn't use other request attributes like request.form or request.file, because this will materialize the stream into memory/file. Flask by default saves files to disk if they exceed 500Kb, so don't touch file.
Is there a way where we can send additional parameters like userId along with the file being uploaded in flask?
Use headers in requests.
If you want to send the user name along with the data:
headers['username'] = 'name of the user'
r = requests.post(url, data=chunk, headers=headers)
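A hedged server-side counterpart (Flask; the header name and paths are illustrative): the extra values arrive as request headers, which can be read without touching request.form, so the body can still be consumed as a stream exactly as in the question's code.
from flask import Flask, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    user_id = request.headers.get("userId")  # sent by the client as a header
    with open("/tmp/output_file", "bw") as f:
        while True:
            chunk = request.stream.read(4096)
            if len(chunk) == 0:
                break
            f.write(chunk)
    return "uploaded for %s" % user_id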

Flask streaming doesn't return back response until finished

So I am trying to stream the chunks of data returned from the SQL database. The chunks seem to be streamed; however, when I hit the endpoint, it shows the response only at the very end, when the request is completed, instead of showing the streamed data chunk by chunk. I know there are already questions about this, but adding the mimetype doesn't seem to work for me. I have the following code:
Any help is highly appreciated!
def generate_chunks():
    result = _get_query_service(repo_url, True).stream_query(qry)
    chunk_counter = 0
    while True:
        chunk = result.fetchmany(5)
        chunk_counter += 1
        if not chunk:
            break
        for value in chunk:
            yield str(value)

return Response(stream_with_context(generate_chunks()), content_type='application/json', status=200)
Actually it was a small thing. The above code works.
But tools like Postman and Insomnia do not support streaming data.
If you want to see your data streamed in action, use curl or Python requests.
For curl, you need to add the --no-buffer option to see the streamed data.
curl --no-buffer -v http://localhost:8082/healthy
For Python requests, you need to add stream=True. Example:
r = requests.post('http://localhost:8082/stream_query', json=dc, stream=True)
r.encoding = 'utf-8'
for line in r.iter_content(chunk_size=10):  # prints the streamed data in chunks
    print(line)

How to `pause`, and `resume` download work?

Usually, downloading a file from the server is something like this:
fp = open(file, 'wb')
req = urllib2.urlopen(url)
for line in req:
    fp.write(line)
fp.close()
During downloading, the download just has to run to completion; if the process is stopped or interrupted, the download has to start over from the beginning. I would like to enable my program to pause and resume the download. How do I implement that? Thanks.
The web server must support the Range request header to allow pausing/resuming downloads:
Range: <unit>=<range-start>-<range-end>
The client can then make a request with the Range header to retrieve the specified bytes, for example:
Range: bytes=0-1024
In this case the server can respond with a 200 OK, indicating that it doesn't support Range requests,
or it can respond with 206 Partial Content, like this:
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Content-Length: 449
Content-Range: bytes 64-512/1024

Response body... bytes 64 through 512 of the file
See:
Range request header
Content-Range response header
Accept-Ranges response header
206 Partial Content
HTTP 1.1 specification
In Python, you can do:
import urllib, os

class myURLOpener(urllib.FancyURLopener):
    """Create sub-class in order to override error 206. This error means a
    partial file is being sent, which is ok in this case.
    Do nothing with this error.
    """
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass

loop = 1
dlFile = "2.6Distrib.zip"
existSize = 0
myUrlclass = myURLOpener()

if os.path.exists(dlFile):
    outputFile = open(dlFile, "ab")
    existSize = os.path.getsize(dlFile)
    # If the file exists, then only download the remainder
    myUrlclass.addheader("Range", "bytes=%s-" % (existSize))
else:
    outputFile = open(dlFile, "wb")

webPage = myUrlclass.open("http://localhost/%s" % dlFile)

# If the file exists, but we already have the whole thing, don't download again
if int(webPage.headers['Content-Length']) == existSize:
    loop = 0
    print "File already downloaded"

numBytes = 0
while loop:
    data = webPage.read(8192)
    if not data:
        break
    outputFile.write(data)
    numBytes = numBytes + len(data)

webPage.close()
outputFile.close()

for k, v in webPage.headers.items():
    print k, "=", v

print "copied", numBytes, "bytes from", webPage.url
You can find the source: http://code.activestate.com/recipes/83208-resuming-download-of-a-file/
It only works for HTTP downloads.
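For a more modern take, here is a hedged sketch of the same Range-based resume idea using the requests library (the URL and filename are illustrative):
import os
import requests

url = 'http://localhost/2.6Distrib.zip'
dl_file = '2.6Distrib.zip'

# Only ask for the bytes we don't already have on disk
exist_size = os.path.getsize(dl_file) if os.path.exists(dl_file) else 0
headers = {'Range': 'bytes=%d-' % exist_size} if exist_size else {}

with requests.get(url, headers=headers, stream=True) as r:
    # 206 means the server honoured the Range header; 200 means it sent the full file
    mode = 'ab' if r.status_code == 206 else 'wb'
    with open(dl_file, mode) as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)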

S3 Backup Memory Usage in Python

I currently use WebFaction for my hosting with the basic package that gives us 80MB of RAM. This is more than adequate for our needs at the moment, apart from our backups. We do our own backups to S3 once a day.
The backup process is this: dump the database, tar.gz all the files into one backup named with the correct date of the backup, upload to S3 using the python library provided by Amazon.
Unfortunately, it appears (although I don't know this for certain) that either my code for reading the file or the S3 code is loading the entire file into memory. As the file is approximately 320MB (for today's backup), it is using about 320MB just for the backup. This causes WebFaction to quit all our processes, meaning the backup doesn't happen and our site goes down.
So this is the question: Is there any way to not load the whole file into memory, or are there any other Python S3 libraries that are much better with RAM usage? Ideally it needs to be about 60MB at the most! If this can't be done, how can I split the file and upload separate parts?
Thanks for your help.
This is the section of code (in my backup script) that caused the processes to be quit:
filedata = open(filename, 'rb').read()
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(filedata), {'x-amz-acl': 'public-read', 'Content-Type': content_type})
It's a little late but I had to solve the same problem so here's my answer.
Short answer: in Python 2.6+, yes! This is because httplib supports file-like objects as of v2.6. So all you need is...
fileobj = open(filename, 'rb')
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(fileobj), {'x-amz-acl': 'public-read', 'Content-Type': content_type})
Long answer...
The S3.py library uses python's httplib to do its connection.put() HTTP requests. You can see in the source that it just passes the data argument to the httplib connection.
From S3.py...
def _make_request(self, method, bucket='', key='', query_args={}, headers={}, data='', metadata={}):
    ...
    if (is_secure):
        connection = httplib.HTTPSConnection(host)
    else:
        connection = httplib.HTTPConnection(host)

    final_headers = merge_meta(headers, metadata);
    # add auth header
    self._add_aws_auth_header(final_headers, method, bucket, key, query_args)

    connection.request(method, path, data, final_headers)  # <-- IMPORTANT PART
    resp = connection.getresponse()
    if resp.status < 300 or resp.status >= 400:
        return resp
    # handle redirect
    location = resp.getheader('location')
    if not location:
        return resp
    ...
If we take a look at the python httplib documentation we can see that...
HTTPConnection.request(method, url[, body[, headers]])
This will send a request to the server using the HTTP request method method and the selector url. If the body argument is present, it should be a string of data to send after the headers are finished. Alternatively, it may be an open file object, in which case the contents of the file is sent; this file object should support fileno() and read() methods. The header Content-Length is automatically set to the correct value. The headers argument should be a mapping of extra HTTP headers to send with the request.
Changed in version 2.6: body can be a file object.
Don't read the whole file into your filedata variable. You could use a loop and just read ~60 MB at a time, submitting each part to Amazon.
backup = open(filename, 'rb')
while True:
    part_of_file = backup.read(60000000)  # not exactly 60 MB....
    if not part_of_file:
        break
    response = connection.put()  # submit part_of_file here to amazon
