How do you convert a ZipFile object into bytes in Python?

Let's say I create a ZipFile object like so:
with ZipFile(StringIO(), mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', "Some document")
    zf.to_bytes() # there is no such method
Can I convert zf into bytes?
Note: I mean the bytes of the zip file itself, not the contents of the files inside the archive.
I would also prefer to do it in memory, without dumping to disk.
I need it to test a mocked requests.get call that downloads a zip file.

The data is stored to the StringIO object, which you didn't save a reference to. You should have saved a reference. (Also, unless you're on Python 2, you need a BytesIO, not a StringIO.)
import io
from zipfile import ZipFile, ZIP_DEFLATED
memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    ...
data = memfile.getvalue()
Note that it's important to call getvalue outside the with block (or after the close, if you want to handle closing the ZipFile object manually). Otherwise, your output will be corrupt, missing final records that are written when the ZipFile is closed.
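For the testing use case in the question, the whole round trip might be sketched as follows. The helper name build_zip_bytes is hypothetical (it is not part of zipfile); the entry names are the ones from the question. Reopening the bytes at the end is only there to confirm that they form a valid archive.

```python
import io
from zipfile import ZipFile, ZIP_DEFLATED


def build_zip_bytes(files):
    """Return the bytes of a zip archive built entirely in memory.

    `files` maps archive member names to their contents (str or bytes).
    This helper is a hypothetical convenience wrapper, not a zipfile API.
    """
    memfile = io.BytesIO()
    with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    # The ZipFile is closed at this point, so the final records are written.
    return memfile.getvalue()


zip_bytes = build_zip_bytes({
    'data.json': 'data_json',
    'document.docx': 'Some document',
})

# Reopen the bytes as an archive to confirm they form a valid zip file.
with ZipFile(io.BytesIO(zip_bytes)) as zf:
    names = zf.namelist()
    data_json = zf.read('data.json')
```

The resulting zip_bytes value is a plain bytes object, so it should be usable wherever a test needs the raw body of a downloaded zip file.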

Related

Writing a Python pdfrw PdfReader object to an array of bytes / filestream

I'm currently working on a simple proof of concept for a pdf-editor application. The example is supposed to be a simplified python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
The idea is that file opening and closing is going to most likely be handled by external code and so I want all the edits in my files to be done in memory. I don't want to write the edited filestream to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
class FormFiller:
    def __fill_pdf__(self, input_pdf_filestream : bytes, data_dict : dict):
        template_pdf : pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf
    def fillForm(self,mapper : FieldMapper):
        value_mapping : dict = mapper.getValues()
        filled_pdf : pdfrw.PdfReader = self.__fill_pdf__(self.filesteam, value_mapping)
        #<this point is crucial>
    def __init__(self, filestream : bytes):
        self.filesteam : bytes = filestream
So, as you see the FormFiller constructor receives an array of bytes. In fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will), and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a pdf output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(
To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter.write() method to write your data to the empty BytesIO object:
pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# I'm not sure about the syntax, I haven't used this lib before
This works because io.BytesIO objects act like a file object, also known as a file-like object. It and related classes like io.StringIO behave like files in memory, such as the object f created with the built-in function open below:
with open("output.txt", "a") as f:
    f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to seek(0) back to the beginning, or rewind it. Otherwise, read() starts at the end of the buffer and returns nothing, so the object seems empty.
new_bytes_object.seek(0)
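The effect of the stream position can be seen with a plain BytesIO and no PDF library at all. This is only an illustrative sketch; the variable names are made up.

```python
from io import BytesIO

buf = BytesIO()
buf.write(b"hello")        # the position is now at the end of the buffer

at_end = buf.read()        # reads from the current position: nothing left
buf.seek(0)                # rewind to the beginning
after_seek = buf.read()    # now the full contents come back

whole = buf.getvalue()     # returns everything, whatever the position is
```

Here at_end is empty while after_seek and whole both hold the written bytes, which is why a freshly written buffer looks blank until it is rewound.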

Writing then reading in-memory bytes (BytesIO) gives a blank result

I wanted to try out the python BytesIO class.
As an experiment I tried writing to a gzip file in memory, and then reading the bytes back out of it. So instead of passing a file object to gzip, I pass in a BytesIO object. Here is the entire script:
from io import BytesIO
import gzip
# write bytes to zip file in memory
myio = BytesIO()
with gzip.GzipFile(fileobj=myio, mode='wb') as g:
    g.write(b"does it work")
# read bytes from zip file in memory
with gzip.GzipFile(fileobj=myio, mode='rb') as g:
    result = g.read()
print(result)
But it is returning an empty bytes object for result. This happens in both Python 2.7 and 3.4. What am I missing?
You need to seek back to the beginning of the buffer after writing the initial in-memory file, before you open it again for reading:
myio.seek(0)
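Applied to the script from the question, the fix might look like this. It is the same code with the single seek(0) call added between the write and the read.

```python
from io import BytesIO
import gzip

# Write gzip-compressed bytes into an in-memory buffer.
myio = BytesIO()
with gzip.GzipFile(fileobj=myio, mode='wb') as g:
    g.write(b"does it work")

# Closing the GzipFile leaves the buffer open but positioned at its end,
# so rewind it before reading.
myio.seek(0)

# Read and decompress the bytes from the same buffer.
with gzip.GzipFile(fileobj=myio, mode='rb') as g:
    result = g.read()
```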
How about we write and read gzip content in the same context like this?
#!/usr/bin/env python
from io import BytesIO
import gzip
content = b"does it work"
# write bytes to zip file in memory
gzipped_content = None
with BytesIO() as myio:
    with gzip.GzipFile(fileobj=myio, mode='wb') as g:
        g.write(content)
    gzipped_content = myio.getvalue()
print(gzipped_content)
print(content == gzip.decompress(gzipped_content))
myio.getvalue() is an alternative to seek(0) followed by read(): it returns bytes containing the entire contents of the buffer, regardless of the current position.
It worked for me after facing a similar issue.

Adding a file-like object to a Zip file in Python

The Python ZipFile API seems to allow the passing of a file path to ZipFile.write or a byte string to ZipFile.writestr but nothing in between. I would like to be able to pass a file like object, in this case a django.core.files.storage.DefaultStorage but any file-like object in principle. At the moment I think I'm going to have to either save the file to disk, or read it into memory. Neither of these is perfect.
You are correct, those are the only two choices. If your DefaultStorage object is large, you may want to go with saving it to disk first; otherwise, I would use:
zipped = ZipFile(...)
zipped.writestr('archive_name', default_storage_object.read())
If default_storage_object is a StringIO object, you can use default_storage_object.getvalue().
While there's no option that takes a file-like object, there is an option to open a zip entry for writing (ZipFile.open with mode 'w', available since Python 3.6).
import zipfile
import shutil
with zipfile.ZipFile('test.zip','w') as archive:
    with archive.open('test_entry.txt','w') as outfile:
        with open('test_file.txt','rb') as infile:
            shutil.copyfileobj(infile, outfile)
You can use your input stream as the source instead, and not have to copy the file to disk first. The downside is that if something goes wrong with your stream, the zip file will be unusable. In my application, we bypass files with errors, so we end up getting a local copy of the file anyway to ensure integrity and keep a usable zip file.
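A sketch of the same idea with an arbitrary file-like object as the source, and with the archive itself kept in memory. The BytesIO named source is only a stand-in for whatever readable binary stream you have (a storage backend file, for example), and the entry name is made up.

```python
import io
import shutil
import zipfile

# Stand-in for any readable binary file-like object (hypothetical data).
source = io.BytesIO(b"contents coming from a stream\n" * 1000)

archive_buffer = io.BytesIO()
with zipfile.ZipFile(archive_buffer, 'w', zipfile.ZIP_DEFLATED) as archive:
    # ZipFile.open(name, 'w') returns a writable file-like object for one entry.
    with archive.open('streamed_entry.txt', 'w') as outfile:
        # copyfileobj copies chunk by chunk instead of one big read() call.
        shutil.copyfileobj(source, outfile)

# Reopen the archive to check that the entry round-trips.
with zipfile.ZipFile(io.BytesIO(archive_buffer.getvalue())) as archive:
    restored = archive.read('streamed_entry.txt')
```

Because the copy is done in chunks, this avoids holding a second full copy of the source in memory just to call writestr.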

Make in-memory copy of a zip by iterating over each file of the input

Python, as far as I know, does not allow modification of a file inside an archive. That is why I want to:
Unpack the zip in memory (zip_in).
Go over each file in the zip_in, and change it if needed, then copy it to zip_out. For now I'm happy with just making a copy of a file.
Save zip_out.
I was experimenting with zipfile and io but no luck. Partially because I'm not sure how all that works and which object requires which output.
Working Code
import os
import io
import codecs
import zipfile
# Make in-memory copy of a zip file
# by iterating over each file in zip_in
# archive.
#
# Check if a file is text, and in that case
# open it with codecs.
zip_in = zipfile.ZipFile(f, mode='a')
zip_out = zipfile.ZipFile(fn, mode='w')
for i in zip_in.filelist:
    if os.path.splitext(i.filename)[1] in ('.xml', '.txt'):
        c = zip_in.open(i.filename)
        c = codecs.EncodedFile(c, 'utf-8', 'utf-8').read()
        c = c.decode('utf-8')
    else:
        c = zip_in.read(i.filename)
    zip_out.writestr(i.filename, c)
zip_out.close()
Old Example, With a Problem
# Make in-memory copy of a zip file
# by iterating over each file in zip_in
# archive.
#
# This code below does not work properly.
zip_in = zipfile.ZipFile(f, mode='a')
zip_out = zipfile.ZipFile(fn, mode='w')
for i in zip_in.filelist:
    bc = io.StringIO() # what about binary files?
    zip_in.extract(i.filename, bc)
    zip_out.writestr(i.filename, bc.read())
zip_out.close()
The error is TypeError: '_io.StringIO' object is not subscriptable
ZipFile.extract() expects a filename, not a file-like object to write to. Instead, use ZipFile.read(name) to get the contents of the file. It returns the byte string so will work fine with binary files. Text files may require decoding to unicode.
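A minimal sketch of the read-based approach, keeping both archives in memory. The function copy_zip and its transform hook are hypothetical names introduced for illustration, and the sample archive is made up so that the snippet runs on its own.

```python
import io
import zipfile


def copy_zip(zip_bytes, transform=None):
    """Return the bytes of a new archive holding every entry of `zip_bytes`.

    `transform` is an optional, hypothetical hook: a function taking
    (filename, data_bytes) and returning the bytes to store instead.
    """
    out_buffer = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_in, \
            zipfile.ZipFile(out_buffer, 'w', zipfile.ZIP_DEFLATED) as zip_out:
        for info in zip_in.infolist():
            data = zip_in.read(info.filename)  # always bytes, text or binary
            if transform is not None:
                data = transform(info.filename, data)
            zip_out.writestr(info.filename, data)
    # Both archives are closed here, so the output is complete.
    return out_buffer.getvalue()


# Build a small sample archive to copy (names and contents are made up).
sample = io.BytesIO()
with zipfile.ZipFile(sample, 'w') as zf:
    zf.writestr('notes.txt', 'hello')
    zf.writestr('blob.bin', b'\x00\x01\x02')


def upper_text(name, data):
    # Decode, modify and re-encode text entries; pass other entries through.
    if name.endswith('.txt'):
        return data.decode('utf-8').upper().encode('utf-8')
    return data


copied = copy_zip(sample.getvalue(), transform=upper_text)

with zipfile.ZipFile(io.BytesIO(copied)) as zf:
    copied_names = zf.namelist()
    notes = zf.read('notes.txt')
    blob = zf.read('blob.bin')
```

Treating every entry as bytes and decoding only where a change is needed sidesteps the text-versus-binary question from the original attempt.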

Access in-memory unzipped file with codecs.open()

I'm trying to open in-memory unzipped files with codecs.open(). I've figured out how to unzip a file in memory, but I don't know how to create a file object and open it with codecs. I've experimented with different ZipFile properties, but no luck.
So, here is how I opened the zip in memory:
import zipfile, io
with open('somezipfile.zip', 'rb') as f:
    memory_object = io.BytesIO(f.read())
zip_in_memory = zipfile.ZipFile(memory_object)
You don't need codecs.open() to access data in memory -- it is meant for loading files from disk. You can get the file contents from your zipfile object using its read() method (extract() writes the file to disk instead) and decode the resulting byte string using decode(). If you insist on using the codecs module, you can also get a file-like object with zip_in_memory.open(...) and wrap the returned object with codecs.EncodedFile.
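Both routes might be sketched like this. The archive is built in memory first so the snippet is self-contained, and the entry name and text are made up. Note that the second option uses io.TextIOWrapper rather than codecs.EncodedFile, as a swapped-in way to get decoded text from the binary stream.

```python
import io
import zipfile

# Build a small archive in memory so the sketch is self-contained
# (the entry name and text are made up).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('readme.txt', 'first line\nsecond line\n'.encode('utf-8'))

zip_in_memory = zipfile.ZipFile(io.BytesIO(buffer.getvalue()))

# Option 1: read the whole entry as bytes, then decode it.
text = zip_in_memory.read('readme.txt').decode('utf-8')

# Option 2: wrap the entry's binary stream to iterate over decoded lines.
with zip_in_memory.open('readme.txt') as raw:
    wrapper = io.TextIOWrapper(raw, encoding='utf-8')
    lines = [line.rstrip('\n') for line in wrapper]

zip_in_memory.close()
```

Option 1 is the simplest when entries are small; option 2 may be preferable when an entry is large and should be processed line by line.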
