Writing then reading in-memory bytes (BytesIO) gives a blank result - python

I wanted to try out Python's BytesIO class.
As an experiment I tried writing to a gzip file in memory, and then reading the bytes back out of that gzip file. So instead of passing a file object to gzip, I pass in a BytesIO object. Here is the entire script:
from io import BytesIO
import gzip

# write bytes to gzip file in memory
myio = BytesIO()
with gzip.GzipFile(fileobj=myio, mode='wb') as g:
    g.write(b"does it work")

# read bytes from gzip file in memory
with gzip.GzipFile(fileobj=myio, mode='rb') as g:
    result = g.read()
print(result)
But it returns an empty bytes object for result. This happens in both Python 2.7 and 3.4. What am I missing?

You need to seek back to the beginning of the in-memory file after writing to it, before reading it back:
myio.seek(0)
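With that one-line fix, the script from the question works; here it is in full (the seek call is the only change):

```python
from io import BytesIO
import gzip

# write compressed bytes into the in-memory buffer
myio = BytesIO()
with gzip.GzipFile(fileobj=myio, mode='wb') as g:
    g.write(b"does it work")

# rewind to the start of the buffer before reading it back
myio.seek(0)

with gzip.GzipFile(fileobj=myio, mode='rb') as g:
    result = g.read()
print(result)  # b'does it work'
```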

Alternatively, how about writing and reading the gzip content in the same context, like this?
#!/usr/bin/env python
from io import BytesIO
import gzip

content = b"does it work"

# write bytes to gzip file in memory
gzipped_content = None
with BytesIO() as myio:
    with gzip.GzipFile(fileobj=myio, mode='wb') as g:
        g.write(content)
    gzipped_content = myio.getvalue()

print(gzipped_content)
print(content == gzip.decompress(gzipped_content))

myio.getvalue() is an alternative to seek: it returns a bytes object containing the entire contents of the buffer (docs).
This worked for me when I faced a similar issue.
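For completeness: on Python 3.2+ the gzip module also provides module-level gzip.compress() and gzip.decompress() helpers, which avoid the file object (and the seek question) entirely:

```python
import gzip

content = b"does it work"
gzipped = gzip.compress(content)            # bytes in, compressed bytes out
print(gzip.decompress(gzipped) == content)  # True
```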

Related

How do you convert ZipFile object in to binary in python?

Let's say I create a zipfile object like so:
with ZipFile(StringIO(), mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', "Some document")
zf.to_bytes()  # there is no such method
Can I convert zf into bytes?
Note: I mean getting the bytes of the zip file itself, not the contents of the files inside the archive.
I would also prefer to do it in memory, without dumping to disk.
I need it to test a mocked response from requests.get when downloading a zip file.
The data is stored to the StringIO object, which you didn't save a reference to. You should have saved a reference. (Also, unless you're on Python 2, you need a BytesIO, not a StringIO.)
memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    ...
data = memfile.getvalue()
Note that it's important to call getvalue outside the with block (or after the close, if you want to handle closing the ZipFile object manually). Otherwise, your output will be corrupt, missing the final records that are written when the ZipFile is closed.
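Putting that together, here is a minimal end-to-end sketch (the member names and contents are made up for the demo):

```python
import io
from zipfile import ZipFile, ZIP_DEFLATED

memfile = io.BytesIO()
with ZipFile(memfile, mode='w', compression=ZIP_DEFLATED) as zf:
    zf.writestr('data.json', 'data_json')
    zf.writestr('document.docx', "Some document")

# getvalue() is called after the with block, so the zip's
# closing records have already been written to the buffer
data = memfile.getvalue()

# sanity check: the bytes form a complete, readable zip archive
with ZipFile(io.BytesIO(data)) as zf:
    names = zf.namelist()
print(names)  # ['data.json', 'document.docx']
```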

Convert/Write PDF to RAM as file-like object for further working with it

My script generates a PDF (a PyPDF2.pdf.PdfFileWriter object) and stores it in a variable.
I need to work with it as a file-like object later in the script. But right now I have to write it to the HDD first, and then open it as a file to work with it.
To avoid these unnecessary write/read operations I found many candidate solutions - StringIO, BytesIO and so on - but I cannot find which one applies to my case.
As far as I understand, I need to "convert" the PyPDF2.pdf.PdfFileWriter object to a file-like object (i.e. write it to RAM) to work with it directly.
Or is there another method that fits my case exactly?
UPDATE - here is a code sample:
from pdfrw import PdfReader, PdfWriter, PageMerge
from PyPDF2 import PdfFileReader, PdfFileWriter

red_file = PdfFileReader(open("file_name.pdf", 'rb'))
large_pages_indexes = [1, 7, 9]
large = PdfFileWriter()
for i in large_pages_indexes:
    p = red_file.getPage(i)
    large.addPage(p)

# here the final data has to be written (I would like to avoid that)
with open("virtual_file.pdf", 'wb') as tmp:
    large.write(tmp)

# here I need to read the exported "virtual_file.pdf" (I would like to avoid that too)
with open("virtual_file.pdf", 'rb') as tmp:
    pdf = PdfReader(tmp)  # here I start working with this file using another module, "pdfrw"
    print(pdf)
To avoid slow disk I/O it appears you want to replace
with open("virtual_file.pdf", 'wb') as tmp:
    large.write(tmp)
with open("virtual_file.pdf", 'rb') as tmp:
    pdf = PdfReader(tmp)
with
buf = io.BytesIO()
large.write(buf)
buf.seek(0)
pdf = PdfReader(buf)
Also, buf.getvalue() is available to you.

Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first

I am using Linux; printing raw to port 9100 returns a bytes object. I was wondering whether it is possible to go from this straight into PyPDF2, rather than making a PDF file first and using the PdfFileReader method?
Thank you for your time.
PyPDF2.PdfFileReader() defines its first parameter as:
stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.
So you can pass it any data, as long as the data can be accessed as a file-like stream. A perfect candidate for that is io.BytesIO(): write your received raw bytes to it, seek back to 0, pass the object to PyPDF2.PdfFileReader(), and you're done.
Yes, the first comment is right. Here is a code example that generates PDF bytes without creating a PDF file:
import io
from typing import List

from PyPDF2 import PdfFileReader, PdfFileWriter


def join_pdf(pdf_chunks: List[bytes]) -> bytes:
    # Create an empty pdf-writer object for collecting all pages
    result_pdf = PdfFileWriter()
    # Iterate over all pdf byte chunks
    for chunk in pdf_chunks:
        # Read bytes
        chunk_pdf = PdfFileReader(
            stream=io.BytesIO(  # Create stream object
                initial_bytes=chunk
            )
        )
        # Add all pages to our result
        for page in range(chunk_pdf.getNumPages()):
            result_pdf.addPage(chunk_pdf.getPage(page))
    # Write all bytes to a bytes stream
    response_bytes_stream = io.BytesIO()
    result_pdf.write(response_bytes_stream)
    return response_bytes_stream.getvalue()
A few years later, I've added this to the PyPDF2 docs:
from io import BytesIO

# Prepare example
with open("example.pdf", "rb") as fh:
    bytes_stream = BytesIO(fh.read())

# Read from bytes_stream
reader = PdfFileReader(bytes_stream)

# Write to bytes_stream
writer = PdfFileWriter()
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)

Writing a BytesIO object to a file, 'efficiently'

So a quick way to write a BytesIO object to a file would be to just use:
with open('myfile.ext', 'wb') as f:
    f.write(myBytesIOObj.getvalue())
myBytesIOObj.close()
However, if I wanted to iterate over myBytesIOObj rather than write it out in one chunk, how would I go about it? I'm on Python 2.7.1. Also, if the BytesIO is huge, would writing by iteration be more efficient?
Thanks
shutil has a utility that will write the file efficiently. It copies in chunks, defaulting to 16K. Any multiple of 4K should be a good cross-platform chunk size. I chose 131072 rather arbitrarily, because the file is really written to the OS cache in RAM before going to disk, so the chunk size isn't that big of a deal.
import shutil

myBytesIOObj.seek(0)
with open('myfile.ext', 'wb') as f:
    shutil.copyfileobj(myBytesIOObj, f, length=131072)
BTW, there was no need to close the file object at the end. The with statement establishes a runtime context for the file object, which is therefore closed automatically on exit from the with block.
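For reference, a self-contained version of the same pattern (the buffer contents, file name, and chunk size here are made up for the demo):

```python
import io
import os
import shutil
import tempfile

# an in-memory buffer holding ~1 MB of data
buf = io.BytesIO(b"x" * 1_000_000)

# rewind first: copyfileobj reads from the current position
buf.seek(0)

path = os.path.join(tempfile.gettempdir(), 'bytesio_copy_demo.bin')
with open(path, 'wb') as f:
    shutil.copyfileobj(buf, f, length=131072)  # copy in 128 KiB chunks

print(os.path.getsize(path))  # 1000000
```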
Since Python 3.2 it's possible to use the BytesIO.getbuffer() method as follows:
from io import BytesIO

buf = BytesIO(b'test')
with open('path/to/file', 'wb') as f:
    f.write(buf.getbuffer())
This way it doesn't copy the buffer's content, streaming it straight to the open file.
Note: The StringIO buffer doesn't support the getbuffer() method (as of Python 3.9).
Also note that getbuffer() exposes the buffer's entire contents regardless of the current stream position, so no preliminary buf.seek(0) is needed here; seeking back to the beginning only matters when reading the buffer sequentially (e.g. with read() or shutil.copyfileobj()).

Make in-memory copy of a zip by iterating over each file of the input

Python, as far as I know, does not allow modifying a file inside an existing archive. That is why I want to:
Unpack the zip in memory (zip_in).
Go over each file in zip_in, change it if needed, then copy it to zip_out. For now I'm happy with just making a copy of each file.
Save zip_out.
I was experimenting with zipfile and io, but no luck - partially because I'm not sure how all of that works and which object expects which kind of data.
Working Code
import os
import io
import codecs
import zipfile

# Make in-memory copy of a zip file
# by iterating over each file in zip_in
# archive.
#
# Check if a file is text, and in that case
# open it with codecs.
zip_in = zipfile.ZipFile(f, mode='a')
zip_out = zipfile.ZipFile(fn, mode='w')
for i in zip_in.filelist:
    if os.path.splitext(i.filename)[1] in ('.xml', '.txt'):
        c = zip_in.open(i.filename)
        c = codecs.EncodedFile(c, 'utf-8', 'utf-8').read()
        c = c.decode('utf-8')
    else:
        c = zip_in.read(i.filename)
    zip_out.writestr(i.filename, c)
zip_out.close()
Old Example, With a Problem
# Make in-memory copy of a zip file
# by iterating over each file in zip_in
# archive.
#
# This code below does not work properly.
zip_in = zipfile.ZipFile(f, mode='a')
zip_out = zipfile.ZipFile(fn, mode='w')
for i in zip_in.filelist:
    bc = io.StringIO()  # what about binary files?
    zip_in.extract(i.filename, bc)
    zip_out.writestr(i.filename, bc.read())
zip_out.close()
The error is TypeError: '_io.StringIO' object is not subscriptable
ZipFile.extract() expects a filesystem path to extract to, not a file-like object to write to. Instead, use ZipFile.read(name) to get the contents of the file. It returns a byte string, so it works fine with binary files; text files may require decoding to unicode.
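As a minimal sketch of the fully in-memory round trip (the archive names and contents below are made up), using ZipFile.read() for every member so binary files survive unchanged:

```python
import io
import zipfile

# build a small input archive in memory to stand in for zip_in
in_buf = io.BytesIO()
with zipfile.ZipFile(in_buf, mode='w') as zf:
    zf.writestr('a.txt', 'hello')
    zf.writestr('b.bin', b'\x00\x01\x02')

# copy every member into a new in-memory archive;
# read() returns raw bytes, so binary files are copied as-is
out_buf = io.BytesIO()
with zipfile.ZipFile(in_buf) as zip_in, \
        zipfile.ZipFile(out_buf, mode='w') as zip_out:
    for info in zip_in.infolist():
        zip_out.writestr(info.filename, zip_in.read(info.filename))

# verify the copy
with zipfile.ZipFile(out_buf) as zf:
    copied = {name: zf.read(name) for name in zf.namelist()}
print(sorted(copied))  # ['a.txt', 'b.bin']
```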
