Adding a file-like object to a Zip file in Python

The Python ZipFile API seems to allow passing a file path to ZipFile.write or a byte string to ZipFile.writestr, but nothing in between. I would like to be able to pass a file-like object, in this case a django.core.files.storage.DefaultStorage, but in principle any file-like object. At the moment I think I'm going to have to either save the file to disk or read it into memory. Neither of these is ideal.

You are correct, those are the only two choices. If your DefaultStorage object is large, you may want to go with saving it to disk first; otherwise, I would use:
zipped = ZipFile(...)
zipped.writestr('archive_name', default_storage_object.read())
If default_storage_object is a StringIO object, you can use default_storage_object.getvalue() instead.

While there's no option that takes a file-like object, there is an option to open a zip entry for writing: ZipFile.open with mode 'w', available since Python 3.6.
import zipfile
import shutil

with zipfile.ZipFile('test.zip', 'w') as archive:
    with archive.open('test_entry.txt', 'w') as outfile:
        with open('test_file.txt', 'rb') as infile:
            shutil.copyfileobj(infile, outfile)
You can use your input stream as the source instead, and not have to copy the file to disk first. The downside is that if something goes wrong with your stream, the zip file will be unusable. In my application, we bypass files with errors, so we end up getting a local copy of the file anyway to ensure integrity and keep a usable zip file.
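As a minimal sketch of the approach above with a stream as the source, the following copies a file-like object into a zip entry without touching the disk. The `source` object and the entry name `stream_entry.txt` are hypothetical stand-ins; any readable binary file-like object should work in place of the `BytesIO`.

```python
import io
import shutil
import zipfile

# Hypothetical stand-in for any readable binary file-like object.
source = io.BytesIO(b"hello from a stream\n")

# The archive is kept in memory here; a path on disk works the same way.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    # Mode "w" on ZipFile.open requires Python 3.6 or later.
    with archive.open("stream_entry.txt", "w") as outfile:
        # Copies in chunks, so the whole source is never held in memory at once.
        shutil.copyfileobj(source, outfile)

# Read the entry back to check that the archive is usable.
with zipfile.ZipFile(io.BytesIO(zip_buffer.getvalue())) as archive:
    restored = archive.read("stream_entry.txt")
```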

Related

How do you convert ZipFile object in to binary in python?

Let's say I create a ZipFile object like so:
with ZipFile(StringIO(), mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', "Some document")
    zf.to_bytes() # there is no such method
Can I convert zf into bytes?
Note: I mean the bytes of the zip file itself, not the contents of the files inside the archive.
I also prefer to do it in memory without dumping to disk.
I need it to test a mocked response of the kind I get from requests.get when downloading a zip file.
The data is stored to the StringIO object, which you didn't save a reference to. You should have saved a reference. (Also, unless you're on Python 2, you need a BytesIO, not a StringIO.)
import io
from zipfile import ZipFile, ZIP_DEFLATED

memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    ...
data = memfile.getvalue()
Note that it's important to call getvalue outside the with block (or after the close, if you want to handle closing the ZipFile object manually). Otherwise, your output will be corrupt, missing final records that are written when the ZipFile is closed.
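A small self-contained sketch of this pattern, reusing the entry names from the question, and reading the bytes back to confirm they form a complete archive:

```python
import io
import zipfile

memfile = io.BytesIO()
with zipfile.ZipFile(memfile, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.json", "data_json")
    zf.writestr("document.docx", "Some document")

# Taken after the with block, so the central directory has been written.
data = memfile.getvalue()

# The bytes can be reopened as an archive, e.g. as the body of a mocked response.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    names = zf.namelist()
    payload = zf.read("data.json")
```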

Copy file handle so that there are two independent handles to the same file

I have a Python program which does the following:
It takes a list of files as input
It iterates through the list several times, each time opening the files and then closing them
What I would like is some way to open each file at the beginning, and then when iterating through the files make a copy of each file handle. Essentially this would take the form of a copy operation on file handles that allows a file to be traversed independently by multiple handles. The reason for wanting to do this is because on Unix systems, if a program obtains a file handle and the corresponding file is then deleted, the program is still able to read the file. If I try reopening the files by name on each iteration, the files might have been renamed or deleted so it wouldn't work. If I try using f.seek(0), then that might affect another thread/generator/iterator.
I hope my question makes sense, and I would like to know if there is a way to do this.
If you really want to get a copy of a file handle, you would need to use the POSIX dup system call. In Python, that is available as os.dup (see the docs). If you have a file object (e.g. from calling open()), call its fileno() method to get the file descriptor. Note that descriptors created by dup share a single file offset, so a read or seek through one of them moves the position seen by the other.
So the entire code will look like this:
import os

with open("myfile") as f:
    fd = f.fileno() # get descriptor
    fd2 = os.dup(fd) # duplicate descriptor
    f2 = os.fdopen(fd2) # get corresponding file object
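The following sketch, which assumes a Unix system, shows how duplicated descriptors behave after the file has been deleted. It uses raw os.read calls to avoid buffering effects, and os.pread (Unix only) as one possible way to read at an explicit offset without moving the shared position. The temporary file and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file to work with.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

fd = os.open(path, os.O_RDONLY)
fd2 = os.dup(fd)   # second descriptor for the same open file
os.remove(path)    # on Unix the data stays readable through open descriptors

first = os.read(fd, 4)    # reads from offset 0
second = os.read(fd2, 4)  # continues at offset 4: the offset is shared

# pread reads at an explicit offset and leaves the shared offset untouched.
chunk = os.pread(fd, 4, 0)

os.close(fd)
os.close(fd2)
```

If each iterator needs its own position, tracking an offset per reader and calling os.pread is one option that avoids f.seek() interfering with other readers.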

Writing a BytesIO object to a file, 'efficiently'

So a quick way to write a BytesIO object to a file would be to just use:
with open('myfile.ext', 'wb') as f:
    f.write(myBytesIOObj.getvalue())
myBytesIOObj.close()
However, if I wanted to iterate over the myBytesIOObj as opposed to writing it in one chunk, how would I go about it? I'm on Python 2.7.1. Also, if the BytesIO is huge, would it be a more efficient way of writing by iteration?
Thanks
shutil has a utility that will write the file efficiently. It copies in chunks, defaulting to 16K in Python 2.7 (newer Python 3 versions use a larger default). Any multiple of 4K should be a good cross-platform chunk size. I chose 131072 rather arbitrarily, because the data is written to the OS cache in RAM before going to disk, so the chunk size isn't that big of a deal.
import shutil

myBytesIOObj.seek(0)
with open('myfile.ext', 'wb') as f:
    shutil.copyfileobj(myBytesIOObj, f, length=131072)
BTW, there is no need to close the file object explicitly. The with statement uses the file as a context manager, so the file handle is closed automatically on exit from the with block.
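For the part of the question about iterating manually, a minimal sketch of a chunked loop is shown below. It does roughly what shutil.copyfileobj does internally. The `dest` object is a stand-in for a file opened with open('myfile.ext', 'wb'), and the chunk size is an arbitrary choice.

```python
import io

source = io.BytesIO(b"x" * 10000)  # stand-in for myBytesIOObj
dest = io.BytesIO()                # stand-in for an open binary file
chunk_size = 4096

source.seek(0)
chunks_written = 0
# iter() with a sentinel keeps calling read() until it returns b"" at EOF.
for chunk in iter(lambda: source.read(chunk_size), b""):
    dest.write(chunk)
    chunks_written += 1
```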
Since Python 3.2 it's possible to use the BytesIO.getbuffer() method as follows:
from io import BytesIO

buf = BytesIO(b'test')
with open('path/to/file', 'wb') as f:
    f.write(buf.getbuffer())
This way it doesn't copy the buffer's content, streaming it straight to the open file.
Note: StringIO does not provide a getbuffer() method (as of Python 3.9).
getbuffer() exposes the whole buffer regardless of the current position. If you instead stream the BytesIO to a file with read() or shutil.copyfileobj, set its position to the beginning first:
buf.seek(0)
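A short sketch of the getbuffer() approach, assuming Python 3.2 or later. The output path is a temporary file created only for the example. The view is used as a context manager so that it is released afterwards, since a BytesIO cannot be resized while a view on it is still alive.

```python
import os
import tempfile
from io import BytesIO

buf = BytesIO(b"test")
buf.seek(0, os.SEEK_END)  # move the position to the end on purpose

path = os.path.join(tempfile.mkdtemp(), "out.bin")  # hypothetical output path
with open(path, "wb") as f:
    # The view covers the whole buffer, whatever the current position is.
    with buf.getbuffer() as view:
        f.write(view)

with open(path, "rb") as f:
    written = f.read()
```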

Why does pyPdf2.PdfFileReader() require a file object as an input?

csv.reader() doesn't require a file object, nor does open(). Does pyPdf2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?
It's just a matter of how the library was written. csv.reader allows any iterable that returns strings (which includes files). open is opening the file, so of course it doesn't take an open file (although it can take an integer pointing at an open file descriptor). Typically, it is better to handle the file separately, usually within a with block so that it is closed properly.
with open('input.pdf', 'rb') as f:
    # do something with the file
    pass
pypdf can take a BytesIO stream or a file path as well. I actually recommend passing the file path in most cases as pypdf will then take care of closing the file for you.
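To illustrate the point about csv.reader accepting any iterable of strings, here is a small sketch that feeds it a plain list instead of a file. The row contents are made up for the example.

```python
import csv

# Any iterable that yields strings works; an open text file is just one case.
lines = ["name,size", "input.pdf,1024"]
rows = list(csv.reader(lines))
```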

Access in-memory unzipped file with codecs.open()

I'm trying to open in-memory unzipped files with codecs.open(). I've figured out how to unzip a file in memory, but I don't know how to create a file object and open it with codecs. I've experimented with different ZipFile properties, but no luck.
So, here is how I opened the zip in memory:
import zipfile, io

with open('somezipfile.zip', 'rb') as f:
    memory_object = io.BytesIO(f.read())
zip_in_memory = zipfile.ZipFile(memory_object)
You don't need codecs.open() to access data in memory; it is meant for loading files from disk. You can get a member's contents as bytes from your ZipFile object using its read() method (extract() writes the member to disk instead) and decode the result using decode(). If you insist on using the codecs module, you can also get a file-like object from zip_in_memory.open(...) and wrap it with a stream reader obtained from codecs.getreader(); codecs.EncodedFile is meant for transcoding between two byte encodings rather than for producing decoded text.
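A minimal sketch of both routes, assuming the member is UTF-8 text. The archive is built in memory here so the example is self-contained, and the member name `notes.txt` is hypothetical.

```python
import codecs
import io
import zipfile

# Build a small archive in memory to stand in for the real zip file.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as zf:
    zf.writestr("notes.txt", "café\n".encode("utf-8"))

zip_in_memory = zipfile.ZipFile(io.BytesIO(raw.getvalue()))

# Route 1: read the member's bytes and decode them in one step.
text = zip_in_memory.read("notes.txt").decode("utf-8")

# Route 2: wrap the member's file-like object in a codecs stream reader.
with zip_in_memory.open("notes.txt") as member:
    reader = codecs.getreader("utf-8")(member)
    streamed = reader.read()

zip_in_memory.close()
```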
