S3 Backup Memory Usage in Python

S3 Backup Memory Usage in Python - python

I currently use WebFaction for my hosting with the basic package that gives us 80MB of RAM. This is more than adequate for our needs at the moment, apart from our backups. We do our own backups to S3 once a day.
The backup process is this: dump the database, tar.gz all the files into one backup named with the correct date of the backup, upload to S3 using the python library provided by Amazon.
Unfortunately, it appears (although I don't know this for certain) that either my code for reading the file or the S3 code is loading the entire file in to memory. As the file is approximately 320MB (for today's backup) it is using about 320MB just for the backup. This causes WebFaction to quit all our processes meaning the backup doesn't happen and our site goes down.
So this is the question: Is there any way to not load the whole file in to memory, or are there any other python S3 libraries that are much better with RAM usage. Ideally it needs to be about 60MB at the most! If this can't be done, how can I split the file and upload separate parts?
Thanks for your help.
This is the section of code (in my backup script) that caused the processes to be quit:
filedata = open(filename, 'rb').read()
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(filedata), {'x-amz-acl': 'public-read', 'Content-Type': content_type})

It's a little late but I had to solve the same problem so here's my answer.
Short answer: in Python 2.6+ yes! This is because the httplib supports file-like objects as of v2.6. So all you need is...
fileobj = open(filename, 'rb')
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(fileobj), {'x-amz-acl': 'public-read', 'Content-Type': content_type})
Long answer...
The S3.py library uses python's httplib to do its connection.put() HTTP requests. You can see in the source that it just passes the data argument to the httplib connection.
From S3.py...
def _make_request(self, method, bucket='', key='', query_args={}, headers={}, data='', metadata={}):
...
if (is_secure):
connection = httplib.HTTPSConnection(host)
else:
connection = httplib.HTTPConnection(host)
final_headers = merge_meta(headers, metadata);
# add auth header
self._add_aws_auth_header(final_headers, method, bucket, key, query_args)
connection.request(method, path, data, final_headers) # <-- IMPORTANT PART
resp = connection.getresponse()
if resp.status < 300 or resp.status >= 400:
return resp
# handle redirect
location = resp.getheader('location')
if not location:
return resp
...
If we take a look at the python httplib documentation we can see that...
HTTPConnection.request(method, url[, body[, headers]])
This will send a request to the server using the HTTP request method method and the selector url. If the body argument is present, it should be a string of data to send after the headers are finished. Alternatively, it may be an open file object, in which case the contents of the file is sent; this file object should support fileno() and read() methods. The header Content-Length is automatically set to the correct value. The headers argument should be a mapping of extra HTTP headers to send with the request.
Changed in version 2.6: body can be a file object.

don't read the whole file into your filedata variable. you could use a loop and then just read ~60 MB and submit them to amazon.
backup = open(filename, 'rb')
while True:
part_of_file = backup.read(60000000) # not exactly 60 MB....
response = connection.put() # submit part_of_file here to amazon

Related

How to return a PDF file from in-memory buffer using FastAPI?

I want to get a PDF file from s3 and then return it to the frontend from FastAPI backend.
This is my code:
#router.post("/pdf_document")
def get_pdf(document : PDFRequest) :
s3 = boto3.client('s3')
file=document.name
f=io.BytesIO()
s3.download_fileobj('adm2yearsdatapdf', file,f)
return StreamingResponse(f, media_type="application/pdf")
This API returns 200 status code, but it does not return the PDF file as a response.

As the entire file data are already loaded into memory, there is no actual reason for using StreamingResponse. You should instead use Response, by passing the file bytes (use BytesIO.getvalue() to get the bytes containing the entire contents of the buffer), defining the media_type, as well as setting the Content-Disposition header, so that the PDF file can be either viewed in the browser or downloaded to the user's device. For more details have a look at this answer, as well as this and this answer. Additionally, as the buffer is discarded when the close()method is called, you could also use FastAPI/Starlette's BackgroundTasks to close the buffer after returning the response, in order to release the memory. Alternatively, you could get the bytes using pdf_bytes = buffer.getvalue(), then close the buffer using buffer.close() and finally, return Response(pdf_bytes, headers=.... Example:
from fastapi import Response, BackgroundTasks
#app.get("/pdf")
def get_pdf(background_tasks: BackgroundTasks):
buffer = io.BytesIO()
# ...
background_tasks.add_task(buffer.close)
headers = {'Content-Disposition': 'inline; filename="out.pdf"'}
return Response(buffer.getvalue(), headers=headers, media_type='application/pdf')
To have the PDF file downloaded rather than viewed in the borwser, use:
headers = {'Content-Disposition': 'attachment; filename="out.pdf"'}

I want to send image from my computer to server [duplicate]

I'm performing a simple task of uploading a file using Python requests library. I searched Stack Overflow and no one seemed to have the same problem, namely, that the file is not received by the server:
import requests
url='http://nesssi.cacr.caltech.edu/cgi-bin/getmulticonedb_release2.cgi/post'
files={'files': open('file.txt','rb')}
values={'upload_file' : 'file.txt' , 'DB':'photcat' , 'OUT':'csv' , 'SHORT':'short'}
r=requests.post(url,files=files,data=values)
I'm filling the value of 'upload_file' keyword with my filename, because if I leave it blank, it says
Error - You must select a file to upload!
And now I get
File file.txt of size bytes is uploaded successfully!
Query service results: There were 0 lines.
Which comes up only if the file is empty. So I'm stuck as to how to send my file successfully. I know that the file works because if I go to this website and manually fill in the form it returns a nice list of matched objects, which is what I'm after. I'd really appreciate all hints.
Some other threads related (but not answering my problem):
Send file using POST from a Python script
http://docs.python-requests.org/en/latest/user/quickstart/#response-content
Uploading files using requests and send extra data
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow

If upload_file is meant to be the file, use:
files = {'upload_file': open('file.txt','rb')}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}
r = requests.post(url, files=files, data=values)
and requests will send a multi-part form POST body with the upload_file field set to the contents of the file.txt file.
The filename will be included in the mime header for the specific field:
>>> import requests
>>> open('file.txt', 'wb') # create an empty demo file
<_io.BufferedWriter name='file.txt'>
>>> files = {'upload_file': open('file.txt', 'rb')}
>>> print(requests.Request('POST', 'http://example.com', files=files).prepare().body.decode('ascii'))
--c226ce13d09842658ffbd31e0563c6bd
Content-Disposition: form-data; name="upload_file"; filename="file.txt"
--c226ce13d09842658ffbd31e0563c6bd--
Note the filename="file.txt" parameter.
You can use a tuple for the files mapping value, with between 2 and 4 elements, if you need more control. The first element is the filename, followed by the contents, and an optional content-type header value and an optional mapping of additional headers:
files = {'upload_file': ('foobar.txt', open('file.txt','rb'), 'text/x-spam')}
This sets an alternative filename and content type, leaving out the optional headers.
If you are meaning the whole POST body to be taken from a file (with no other fields specified), then don't use the files parameter, just post the file directly as data. You then may want to set a Content-Type header too, as none will be set otherwise. See Python requests - POST data from a file.

(2018) the new python requests library has simplified this process, we can use the 'files' variable to signal that we want to upload a multipart-encoded file
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
r.text

Client Upload
If you want to upload a single file with Python requests library, then requests lib supports streaming uploads, which allow you to send large files or streams without reading into memory.
with open('massive-body', 'rb') as f:
requests.post('http://some.url/streamed', data=f)
Server Side
Then store the file on the server.py side such that save the stream into file without loading into the memory. Following is an example with using Flask file uploads.
#app.route("/upload", methods=['POST'])
def upload_file():
from werkzeug.datastructures import FileStorage
FileStorage(request.stream).save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
return 'OK', 200
Or use werkzeug Form Data Parsing as mentioned in a fix for the issue of "large file uploads eating up memory" in order to avoid using memory inefficiently on large files upload (s.t. 22 GiB file in ~60 seconds. Memory usage is constant at about 13 MiB.).
#app.route("/upload", methods=['POST'])
def upload_file():
def custom_stream_factory(total_content_length, filename, content_type, content_length=None):
import tempfile
tmpfile = tempfile.NamedTemporaryFile('wb+', prefix='flaskapp', suffix='.nc')
app.logger.info("start receiving file ... filename => " + str(tmpfile.name))
return tmpfile
import werkzeug, flask
stream, form, files = werkzeug.formparser.parse_form_data(flask.request.environ, stream_factory=custom_stream_factory)
for fil in files.values():
app.logger.info(" ".join(["saved form name", fil.name, "submitted as", fil.filename, "to temporary file", fil.stream.name]))
# Do whatever with stored file at `fil.stream.name`
return 'OK', 200

You can send any file via post api while calling the API just need to mention files={'any_key': fobj}
import requests
import json
url = "https://request-url.com"
headers = {"Content-Type": "application/json; charset=utf-8"}
with open(filepath, 'rb') as fobj:
response = requests.post(url, headers=headers, files={'file': fobj})
print("Status Code", response.status_code)
print("JSON Response ", response.json())

#martijn-pieters answer is correct, however I wanted to add a bit of context to data= and also to the other side, in the Flask server, in the case where you are trying to upload files and a JSON.
From the request side, this works as Martijn describes:
files = {'upload_file': open('file.txt','rb')}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}
r = requests.post(url, files=files, data=values)
However, on the Flask side (the receiving webserver on the other side of this POST), I had to use form
#app.route("/sftp-upload", methods=["POST"])
def upload_file():
if request.method == "POST":
# the mimetype here isnt application/json
# see here: https://stackoverflow.com/questions/20001229/how-to-get-posted-json-in-flask
body = request.form
print(body) # <- immutable dict
body = request.get_json() will return nothing. body = request.get_data() will return a blob containing lots of things like the filename etc.
Here's the bad part: on the client side, changing data={} to json={} results in this server not being able to read the KV pairs! As in, this will result in a {} body above:
r = requests.post(url, files=files, json=values). # No!
This is bad because the server does not have control over how the user formats the request; and json= is going to be the habbit of requests users.

Upload:
with open('file.txt', 'rb') as f:
files = {'upload_file': f.read()}
values = {'DB': 'photcat', 'OUT': 'csv', 'SHORT': 'short'}
r = requests.post(url, files=files, data=values)
Download (Django):
with open('file.txt', 'wb') as f:
f.write(request.FILES['upload_file'].file.read())

Regarding the answers given so far, there was always something missing that prevented it to work on my side. So let me show you what worked for me:
import json
import os
import requests
API_ENDPOINT = "http://localhost:80"
access_token = "sdfJHKsdfjJKHKJsdfJKHJKysdfJKHsdfJKHs" # TODO: get fresh Token here
def upload_engagement_file(filepath):
url = API_ENDPOINT + "/api/files" # add any URL parameters if needed
hdr = {"Authorization": "Bearer %s" % access_token}
with open(filepath, "rb") as fobj:
file_obj = fobj.read()
file_basename = os.path.basename(filepath)
file_to_upload = {"file": (str(file_basename), file_obj)}
finfo = {"fullPath": filepath}
upload_response = requests.post(url, headers=hdr, files=file_to_upload, data=finfo)
fobj.close()
# print("Status Code ", upload_response.status_code)
# print("JSON Response ", upload_response.json())
return upload_response
Note that requests.post(...) needs
a url parameter, containing the full URL of the API endpoint you're calling, using the API_ENDPOINT, assuming we have an http://localhost:8000/api/files endpoint to POST a file
a headers parameter, containing at least the authorization (bearer token)
a files parameter taking the name of the file plus the entire file content
a data parameter taking just the path and file name
Installation required (console):
pip install requests
What you get back from the function call is a response object containing a status code and also the full error message in JSON format. The commented print statements at the end of upload_engagement_file are showing you how you can access them.
Note: Some useful additional information about the requests library can be found here

Some may need to upload via a put request and this is slightly different that posting data. It is important to understand how the server expects the data in order to form a valid request. A frequent source of confusion is sending multipart-form data when it isn't accepted. This example uses basic auth and updates an image via a put request.
url = 'foobar.com/api/image-1'
basic = requests.auth.HTTPBasicAuth('someuser', 'password123')
# Setting the appropriate header is important and will vary based
# on what you upload
headers = {'Content-Type': 'image/png'}
with open('image-1.png', 'rb') as img_1:
r = requests.put(url, auth=basic, data=img_1, headers=headers)
While the requests library makes working with http requests a lot easier, some of its magic and convenience obscures just how to craft more nuanced requests.

In Ubuntu you can apply this way,
to save file at some location (temporary) and then open and send it to API
path = default_storage.save('static/tmp/' + f1.name, ContentFile(f1.read()))
path12 = os.path.join(os.getcwd(), "static/tmp/" + f1.name)
data={} #can be anything u want to pass along with File
file1 = open(path12, 'rb')
header = {"Content-Disposition": "attachment; filename=" + f1.name, "Authorization": "JWT " + token}
res= requests.post(url,data,header)

S3 object returns octet-stream but was uploaded as png

I have this existing piece of code that is used to upload files to my s3 bucket.
def get_user_upload_url(customer_id, filename, content_type):
s3_client = boto3.client('s3')
object_name = "userfiles/uploads/{}/{}".format(customer_id, filename)
try:
url = s3_client.generate_presigned_url('put_object',
Params={'Bucket': BUCKET,
'Key': object_name,
"ContentType": content_type # set to "image/png"
},
ExpiresIn=100)
except Exception as e:
print(e)
return None
return url
This returns to my client a presigned URL that I use to upload my files without a issue. I have added a new use of it where I'm uploading a png and I have behave test that uploads to the presigned url just fine. The problem is if i go look at the file in s3 i cant preview it. If I download it, it wont open either. The s3 web client shows it has Content-Type image/png. I visual compared the binary of the original file and the downloaded file and i can see differences. A file type tool detects that its is an octet-stream.
signature_file_name = "signature.png"
with open("features/steps/{}".format(signature_file_name), 'rb') as f:
files = {'file': (signature_file_name, f)}
headers = {
'Content-Type': "image/png" # without this or with a different value the presigned url will error with a signatureDoesNotMatch
}
context.upload_signature_response = requests.put(response, files=files, headers=headers)
I would have expected to have been returned a PNG instead of an octet stream however I'm not sure what I have done wrong . Googling this generally results in people having a problem with the signature because there not properly setting or passing the content type and I feel like I've effectively done that here proven by the fact that if I change the content type everything fails . I'm guessing that there's something wrong with the way I'm uploading the file or maybe reading the file for the upload?

So it is todo with how im uploading. So instead it works if i upload like this.
context.upload_signature_response = requests.put(response, data=open("features/steps/{}".format(signature_file_name), 'rb'), headers=headers)
So this must have to do with the use of put_object. It must be expecting the body to be the file of the defined content type. This method accomplishes that where the prior one would make it a multi part upload. So I think it's safe to say the multipart upload is not compatible with a presigned URL for put_object.
Im still piecing it altogether, so feel free to fill in the blanks.

Locally save and parse suds response?

I am new to SOAP and suds. I am calling a non-XML SOAP API using suds. A given result contains a ton of different sub-arrays. I thought I would just locally save the whole response for parsing later but easier said than done. And I don't get this business with the built in cache option where you cache for x days or whatever. Can I permanently save and parse a response locally?

You can write the response to a local file:
client = Client("http://someWsdl?wsdl")
# Getting response object
method_response = client.service.someMethod(methodParams)
# Open local file
fd = os.open("response_file.txt",os.O_RDWR|os.O_CREAT)
# Convert response object into string
response_str = str(method_response)
# Write response to the file
ret = os.write(fd,response_str)
# Close the file
os.close(fd)

Posting only part of a file with Python's poster.encode

Using the poster.encode module, this works when I post a whole file to Solr:
f = open(filePath, 'rb')
datagen, headers = multipart_encode({'file': f})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
However, for some files I'd like to send only the first n bytes of the files. I thought this would work:
f = open(filePath, 'rb')
mp = MultipartParam('file', fileobj=f, filesize=f)
datagen, headers = multipart_encode({'file': mp})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
...but I get a timed out request (and the odd thing is that I then have to restart apache before requests to my web2py app work again). I get a 'http 400 content missing' error from urlopen() when I leave off the filesize argument. Am I just using MultipartParam incorrectly?
(The point of all this is that I'm using Solr to extract text content and metadata from files. For video and audio files, I'd like to get away with sending just the first 100-300k or so, as presumably the relevant data's all in the file headers.)

The reason you're having trouble is that mime encoding introduces sentinels in the post, if you don't specify the file size - that means that you have to do chunked transfer encoding so that the web server knows when to stop reading the file. But, that's the other problem - if you stop sending a MIME encoded POST to a server mid-stream, it'll just sit there waiting for the block to finish. Chunked transfer encoding and mixed-multipart mime encoding are both dead serious when it comes down to message segment sizes.
If you only want to send 100-300k of data, then only read that much, then every post you make to the server will terminate at the byte you want and the web server is expecting.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

S3 Backup Memory Usage in Python - python

don't read the whole file into your filedata variable. you could use a loop and then just read ~60 MB and submit them to amazon. backup = open(filename, 'rb') while True: part_of_file = backup.read(60000000) # not exactly 60 MB.... response = connection.put() # submit part_of_file here to amazon

Related

How to return a PDF file from in-memory buffer using FastAPI?

I want to send image from my computer to server [duplicate]

S3 object returns octet-stream but was uploaded as png

Locally save and parse suds response?

Posting only part of a file with Python's poster.encode

Categories

Resources