I have an attachment in my email.message.Message.
The attachment is of type email.message.Message so I can call get_payload() on it to return its associated data.
However, I want to load this data into a file-like object so I can read and write it as if I were reading the attachment from my desktop.
How can I do this without actually saving the attachment on my drive?
cStringIO was made specifically for this purpose.
You can use StringIO instead if you need to store Unicode data (cStringIO only accepts strings that can be encoded as plain ASCII), but cStringIO is MUCH faster.
Example usage:
>>> import cStringIO
>>> test = cStringIO.StringIO()
>>> test.write("test")
>>> test.getvalue()
'test'
I'm currently working on a simple proof of concept for a PDF-editor application. The example is supposed to be a simplified Python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
The idea is that file opening and closing is most likely going to be handled by external code, so I want all the edits in my files to be done in memory. I don't want to write the edited filestream to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
class FormFiller:
    def __fill_pdf__(input_pdf_filestream: bytes, data_dict: dict):
        template_pdf: pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf

    def fillForm(self, mapper: FieldMapper):
        value_mapping: dict = mapper.getValues()
        filled_pdf: pdfrw.PdfReader = self.__fill_pdf__(self.filesteam, value_mapping)
        # <this point is crucial>

    def __init__(self, filestream: bytes):
        self.filesteam: bytes = filestream
So, as you can see, the FormFiller constructor receives an array of bytes; in fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will) and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not seem to allow such an operation: it only provides a write(<filename>) method, which saves the PdfReader object to a PDF output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(
To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter.write() method to write your data to the empty BytesIO object:
pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# PdfWriter.write() accepts any object with a write() method, not just a filename
This works because io.BytesIO objects are file-like objects: they and related classes like io.StringIO behave like files held in memory, just like the object f created with the built-in function open below:
with open("output.txt", "a") as f:
f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to seek(0) back to the beginning, i.e. rewind it; otherwise the object will appear to be empty.
new_bytes_object.seek(0)
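Putting the pieces together, a minimal sketch of the in-memory round trip (the editing step is elided, as in the question, and the function name is made up):
import io
import pdfrw

def fill_pdf_to_stream(input_stream, data_dict):
    """Fill a PDF template and return the result as an in-memory stream."""
    template_pdf = pdfrw.PdfReader(input_stream)
    # <some editing magic here>
    out_stream = io.BytesIO()
    pdfrw.PdfWriter().write(out_stream, template_pdf)
    out_stream.seek(0)  # rewind so callers can read from the start
    return out_stream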
Let's say I create a ZipFile object like so:
with ZipFile(StringIO(), mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', "Some document")
zf.to_bytes()  # there is no such method
Can I convert zf into bytes?
Note: I mean getting the bytes of the zip file itself, not the contents of the files inside the archive.
I'd also prefer to do it in memory, without dumping to disk.
I need it to test a mocked request, standing in for what requests.get returns when downloading a zip file.
The data is stored to the StringIO object, which you didn't save a reference to. You should have saved a reference. (Also, unless you're on Python 2, you need a BytesIO, not a StringIO.)
memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    ...
data = memfile.getvalue()
Note that it's important to call getvalue outside the with block (or after the close, if you want to handle closing the ZipFile object manually). Otherwise, your output will be corrupt, missing final records that are written when the ZipFile is closed.
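A complete end-to-end sketch of the pattern, including a round-trip check that the bytes really are a valid archive:
import io
from zipfile import ZipFile, ZIP_DEFLATED

memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', 'Some document')

data = memfile.getvalue()  # bytes of the zip archive itself

# sanity check: the bytes round-trip back into a readable archive
with ZipFile(io.BytesIO(data)) as zf:
    assert zf.namelist() == ['data.json', 'document.docx']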
According to S3.Client.upload_file and S3.Client.upload_fileobj, upload_fileobj may sound faster. But does anyone know the specifics? Should I just upload the file, or should I open the file in binary mode and use upload_fileobj? In other words:
import boto3
s3 = boto3.resource('s3')
### Version 1
s3.meta.client.upload_file('/tmp/hello.txt', 'mybucket', 'hello.txt')
### Version 2
with open('/tmp/hello.txt', 'rb') as data:
    s3.upload_fileobj(data, 'mybucket', 'hello.txt')
Is version 1 or version 2 better? Is there a difference?
The main point with upload_fileobj is that the file doesn't have to be stored on local disk in the first place; it can be represented as a file-like object in RAM.
Python has a standard library module, io, for exactly that purpose.
The code will look like this:
import io
import boto3
s3 = boto3.client('s3')
fo = io.BytesIO(b'my data stored as file object in RAM')
s3.upload_fileobj(fo, 'mybucket', 'hello.txt')
In that case it will perform faster, since you don't have to read from local disk.
TL;DR
In terms of speed, both methods will perform roughly the same: both are written in Python, and the bottleneck will be either disk I/O (reading the file from disk) or network I/O (writing to S3).
Use upload_file() when writing code that only handles uploading files from disk.
Use upload_fileobj() when writing generic code that handles S3 uploads and may be reused in the future for more than just files from disk.
What is fileobj anyway?
There is a convention in multiple places, including the Python standard library, that when one uses the term fileobj she means a file-like object.
There are even some libraries exposing functions that can take a file path (str) or a fileobj (a file-like object) as the same parameter.
When using a file object, your code is not limited to disk. For example:
you can copy data from one S3 object to another in a streaming fashion (without using disk space or slowing the process down with read/write I/O to disk), as sketched below
you can (de)compress or decrypt data on the fly when writing objects to S3
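For instance, a sketch of that streaming copy (the bucket and key names are made up); get_object returns a StreamingBody, which is itself a file-like object:
import boto3

s3 = boto3.client('s3')

# the response Body is a StreamingBody: a file-like object read lazily
# from the network, so the copy never touches local disk
body = s3.get_object(Bucket='source-bucket', Key='big-file.bin')['Body']
s3.upload_fileobj(body, 'dest-bucket', 'big-file.bin')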
An example using the Python gzip module with a file-like object in a generic way:
import gzip, io

def gzip_greet_file(fileobj):
    """write gzipped hello message to a file"""
    with gzip.open(filename=fileobj, mode='wb') as fp:
        fp.write(b'hello!')

# using an opened file
gzip_greet_file(open('/tmp/a.gz', 'wb'))

# using a filename from disk
gzip_greet_file('/tmp/b.gz')

# using an io buffer
file = io.BytesIO()
gzip_greet_file(file)
file.seek(0)
print(file.getvalue())
tarfile, on the other hand, takes two separate parameters, name and fileobj:
tarfile.open(name=None, mode='r', fileobj=None, bufsize=10240, **kwargs)
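For example, a small sketch of building a tar archive entirely in memory via the fileobj parameter:
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    payload = b'hello!'
    info = tarfile.TarInfo(name='hello.txt')
    info.size = len(payload)  # TarInfo needs the size up front
    tar.addfile(info, io.BytesIO(payload))

tar_bytes = buf.getvalue()  # the gzipped tar archive as bytes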
Example compression on-the-fly with s3.upload_fileobj()
import gzip, io, boto3

s3 = boto3.client('s3')  # upload_fileobj is a client method

def upload_file(fileobj, bucket, key, compress=False):
    if compress:
        buf = io.BytesIO()  # compress into an in-memory buffer
        with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
            gz.write(fileobj.read())
        buf.seek(0)  # rewind so upload_fileobj reads from the start
        fileobj = buf
        key = key + '.gz'
    s3.upload_fileobj(fileobj, bucket, key)
Neither is better, because they're not comparable. While the end result is the same (an object is uploaded to S3), they source that object quite differently. One expects you to supply the path on disk of the file to upload while the other expects you to provide a file-like object.
If you have a file on disk and want to upload it, then use upload_file. If you have a file-like object (which could ultimately be many things including an open file, a stream, a socket, a buffer, a string) then use upload_fileobj.
A 'file-like object' in this context is anything that implements a read method and returns bytes.
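To illustrate, a toy sketch (the class and bucket name are made up, not part of boto3): any object with a suitable read method qualifies.
import boto3

class RepeatReader:
    """A minimal file-like object: read() returns bytes until exhausted."""
    def __init__(self, chunk, count):
        self.data = chunk * count
    def read(self, size=-1):
        if size < 0:
            size = len(self.data)
        out, self.data = self.data[:size], self.data[size:]
        return out

s3 = boto3.client('s3')
s3.upload_fileobj(RepeatReader(b'spam ', 1000), 'mybucket', 'spam.txt')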
As per the documentation in https://boto3.amazonaws.com/v1/documentation/api/1.9.185/guide/s3-uploading-files.html
"The upload_file and upload_fileobj methods are provided by the S3 Client, Bucket, and Object classes. The method functionality provided by each class is identical. No benefits are gained by calling one class's method over another's. Use whichever class is most convenient."
The answers above seem to be false.
I am trying to write some code that will convert a PDF that resides on the web into a series of JPGs.
I have working code that:
1) takes a PDF
2) saves it to disk
3) converts it to JPGs, which are saved to disk
Is there a way to write the same code (my attempt, which throws an error, is below) so that it takes the PDF from the internet but keeps it in memory (to keep the program from writing to and reading from disk), then converts it to JPGs (which are to be uploaded to AWS S3)?
I was thinking this would work:
f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf") #file to process
But I get the following error:
"Exception TypeError: TypeError("object of type 'NoneType' has no len()",) in > ignored"
The full code is below, along with the PDF file that I want converted. Note: the code works if I replace f= with the location of a PDF saved on disk:
from urllib2 import urlopen
from wand.image import Image
#location on disk
save_location = "/home/bob/Desktop/pdfs to convert/example1"
#file prefix
test_id = 'example'
print 1
f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf")
print 2
print type(f)
with Image(filename=f) as img:
    print('pages = ', len(img.sequence))
    with img.convert('jpg') as converted:
        converted.save(filename=save_location+"/"+test_id+".jpg")
The result of urlopen obviously isn't a filename, so you can't pass in filename=f and expect it to work.
I don't have Wand installed, but from the docs, there are clearly a bunch of alternative ways to construct it.
First, the result of urlopen is a file-like object. Of course "file-like object" is a somewhat vague term, and not all file-like objects work for all APIs that expect file-like objects (e.g., the API may expect to be able to call fileno and read from it at the POSIX level…), but this is at least worth trying (note file instead of filename):
with Image(file=f) as img:
If that doesn't work, you can always read the data into memory:
buf = f.read()
with Image(blob=buf) as img:
Not as ideal (if you have giant files), but at least you don't have to store it on disk.
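And if the JPG output should also stay in memory (say, for the S3 upload mentioned in the question), Wand can hand the result back as bytes via make_blob. A sketch for the single-page case (the bucket name is made up):
from io import BytesIO

import boto3
from urllib2 import urlopen  # Python 2, matching the question
from wand.image import Image

f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf")
with Image(blob=f.read()) as img:
    with img.convert('jpg') as converted:
        jpg_bytes = converted.make_blob()  # JPG data, never written to disk

s3 = boto3.client('s3')
s3.upload_fileobj(BytesIO(jpg_bytes), 'mybucket', 'example.jpg')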
I am trying to upload large CSV files to GAE inside a zip, via XML & HTTP POST.
Steps:
CSV is zipped & base64 encoded and sent to GAE via XML/HTTP POST
GAE - using minidom to parse XML
GAE - Base64 decode ZIP
GAE - Get CSV from Zip file.
I have tried using zipfile but can't figure out how to create a ZipFile object from the base64-decoded string.
I get: TypeError: unbound method read() must be called with ZipFile instance as first argument (got str instance instead)
myZipFile = base64.decodestring(base64ZipFile)
objZip = zipfile.ZipFile(myZipFile,'r')
strCSV = zipfile.ZipFile.read(objZip,'list.csv')
As Rob mentioned, ZipFile requires a file-like object. You can use StringIO to provide a file-like interface to a string.
For example:
import base64
import zipfile
import StringIO

myZipFile = base64.decodestring(base64ZipFile)
objZip = zipfile.ZipFile(StringIO.StringIO(myZipFile), 'r')
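With that in place, the failing line from the question becomes an ordinary bound-method call:
strCSV = objZip.read('list.csv')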
Yes you can. In fact, I wrote a blog post that describes how to do exactly that.
A simple approach might be to upload the zipped csv to the blobstore using the blob upload API, and process the zip file from there. You'd need to fake a form post, but life might be simpler for you on the appengine side.
There's an example of how to process zipped data in AppEngine MapReduce. See the BlobstoreZipInputReader class.
ZipFile does not take a string of data; it takes a filename or a file-like object.
One solution is to create a temporary file, write the string to it, and then pass that to ZipFile:
import tempfile
import zipfile
tmp = tempfile.TemporaryFile()
tmp.write(myZipFile) # myZipFile is your decoded string containing the zip-data
objZip = zipfile.ZipFile(tmp)