Writing a Python pdfrw PdfReader object to an array of bytes / filestream


I'm currently working on a simple proof of concept for a PDF editor application. The example is supposed to be a simplified Python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
File opening and closing will most likely be handled by external code, so I want all the edits to be done in memory rather than written to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
import pdfrw
class FormFiller:
    # FieldMapper is defined elsewhere in the project
    def __fill_pdf__(self, input_pdf_filestream: bytes, data_dict: dict):
        template_pdf: pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf
    def fillForm(self, mapper: FieldMapper):
        value_mapping: dict = mapper.getValues()
        filled_pdf: pdfrw.PdfReader = self.__fill_pdf__(self.filestream, value_mapping)
        #<this point is crucial>
    def __init__(self, filestream: bytes):
        self.filestream: bytes = filestream
So, as you see the FormFiller constructor receives an array of bytes. In fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will), and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a pdf output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(

To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter.write() method to write your data to the empty BytesIO object:
pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# write() is an instance method, so create a PdfWriter first.
# I haven't used this lib before, so check the pdfrw docs for the exact signature.
This works because io.BytesIO objects act like a file object, also known as a file-like object. It and related classes like io.StringIO behave like files in memory, such as the object f created with the built-in function open below:
with open("output.txt", "a") as f:
    f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to rewind it with seek(0). Otherwise, reading starts at the end of the stream and the object seems empty.
new_bytes_object.seek(0)
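The file-like idea above can be sketched without any PDF library. In this minimal example, write_report is a hypothetical stand-in for any writer that accepts either a path or an object with a write method; it is not part of pdfrw, and the payload is fake data used only for illustration.

```python
import io


def write_report(dest, payload: bytes) -> None:
    """Hypothetical writer: accepts a filesystem path or a file-like object."""
    if hasattr(dest, "write"):
        # Already file-like (for example io.BytesIO): write straight into it.
        dest.write(payload)
    else:
        # Otherwise treat dest as a path and open a real file.
        with open(dest, "wb") as f:
            f.write(payload)


buffer = io.BytesIO()
write_report(buffer, b"%PDF-1.4 fake content")

# The stream position now sits at the end, so read() finds nothing left.
unread = buffer.read()

# Rewinding makes the full contents readable again.
buffer.seek(0)
contents = buffer.read()
```

Because the caller only relies on write, seek and read, the same calling code should work whether the destination is a real file or an in-memory buffer.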

Related

xhtml2pdf: Output generated PDF as in-memory object (its bytes)

I'm working with Python 3, Django and the xhtml2pdf package.
I want to create a PDF from an HTML string. I don't want to write the PDF to disk; I just want to get its bytes from memory, as with BytesIO or StringIO.
I've read the xhtml2pdf documentation. This is the closest I've found related to what I need:
In-memory files can be generated by using StringIO or cStringIO instead of the file open. Advanced options will be discussed later in this document.
And this is one of the latest things I've tried:
def html_to_pdf(html):
    """Writes a PDF file using xhtml2pdf from a given HTML stream
    Parameters
    ----------
    html : str
        A HTML valid string.
    Returns
    -------
    bytes
        A bytes sequence containing the rendered PDF.
    """
    output = BytesIO()
    pisa_status = pisa.CreatePDF(html, dest=output)
    return new_output.read()
But this isn't working.
Any idea how to output the generated PDF as an in-memory object and thus return its bytes?

I think your return statement is using new_output instead of output.
However, the real issue is probably something else. Have you tried calling output.seek(0) before reading its bytes with output.read()?

What you can also do is output.getvalue(). This will get the entire contents of the BytesIO object.
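A small sketch of why getvalue() avoids the rewind step: it returns the whole buffer regardless of the current stream position and does not move that position. The byte strings below are placeholders, not real PDF output.

```python
from io import BytesIO

output = BytesIO()
output.write(b"first ")
output.write(b"second")

position_before = output.tell()   # the position sits at the end after writing
data = output.getvalue()          # whole buffer, independent of the position
position_after = output.tell()    # getvalue() leaves the position untouched
```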

how to write csv to "variable" instead of file?

I'm not sure how to word my question exactly, and I have seen some similar questions asked but not exactly what I'm trying to do. If there already is a solution please direct me to it.
Here is what I'm trying to do:
At my work, we have a few pkgs we've built to handle various data types. One I am working with is reading in a csv file into a std_io object (std_io is our all-purpose object class that reads in any type of data file).
I am trying to connect this to another pkg I am writing, so I can make an object in the new pkg and convert it to a std_io object.
The problem is, the std_io object is meant to read an actual file, not take in an object. To get around this, I can basically write my data to temp.csv file then read it into a std_io object.
I am wondering if there is a way to eliminate this step of writing the temp.csv file.
Here is my code:
x #my object
df = x.to_df() #object class method to convert to a pandas dataframe
df.to_csv('temp.csv') #write data to a csv file
std_io_obj = std_read('temp.csv') #read csv file into a std_io object
Is there a way to basically pass what the output of writing the csv file would be directly into std_read? Does this make sense?
The only reason I want to do this is to avoid having to code additional functionality into either of the pkgs to directly accept an object as input.
Hope this was clear, and thanks to anyone who contributes.

For those interested, or who may have this same kind of issue/objective, here's what I did to solve this problem.
I basically just created a temporary named file, linked a .csv filename to this temp file, then passed it into my std_read function which requires a csv filename as an input.
This basically tricks the function into thinking it's taking the name of a real file as an input, and it just opens it as usual and uses csvreader to parse it up.
This is the code:
import tempfile
import os
x #my object I want to convert to a std_io object
text = x.to_df().to_csv() #object class method to convert to a pandas dataframe then generate the 'text' of a csv file
filename = 'temp.csv'
with tempfile.NamedTemporaryFile(dir = os.path.dirname('.')) as f:
    f.write(text.encode())
    f.flush() #make sure the data is on disk before another reader opens the file
    os.link(f.name, filename)
    stdio_obj = std_read(filename)
    os.unlink(filename)
del f
FYI - the std_read function essentially just opens the file the usual way, and passes it into csvreader:
with open(filename, 'r') as f:
    rdr = csv.reader(f)
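If the reading function could be changed to take an open stream instead of a filename, the temporary file would not be needed at all. The sketch below assumes exactly that: read_rows is a hypothetical variant of std_read, and the rows written here stand in for the text that to_csv() would return.

```python
import csv
import io


def read_rows(source):
    """Hypothetical variant of std_read that takes an open text stream."""
    return list(csv.reader(source))


# Stand-in for the text returned by df.to_csv(): build CSV text in memory.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["name", "value"])
writer.writerow(["a", "1"])

# Rewind so the reader starts at the beginning of the in-memory text.
buffer.seek(0)
rows = read_rows(buffer)
```

This only helps when the consuming code can be modified; the linked temporary file remains the workaround when it must receive a filename.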

Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first

I am using Linux; printing raw to port 9100 returns a "bytes" type. I was wondering if it is possible to go from this straight into PyPDF2, rather than make a pdf file first and using method PdfFileReader?
Thank you for your time.

PyPDF2.PdfFileReader() defines its first parameter as:
stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.
So you can pass any data to it as long as it can be accessed as a file-like stream. A perfect candidate for that is io.BytesIO(). Write your received raw bytes to it, then seek back to 0, pass the object to PyPDF2.PdfFileReader() and you're done.

Yes, the first answer is right. Here is a code example that generates PDF bytes without creating a PDF file:
import io
from typing import List
from PyPDF2 import PdfFileReader, PdfFileWriter
def join_pdf(pdf_chunks: List[bytes]) -> bytes:
    # Create an empty PDF writer object to collect all pages
    result_pdf = PdfFileWriter()
    # Iterate over all PDF byte chunks
    for chunk in pdf_chunks:
        # Read the bytes
        chunk_pdf = PdfFileReader(
            stream=io.BytesIO(  # Create stream object
                initial_bytes=chunk
            )
        )
        # Add all pages to our result
        for page in range(chunk_pdf.getNumPages()):
            result_pdf.addPage(chunk_pdf.getPage(page))
    # Write all pages to an in-memory byte stream
    response_bytes_stream = io.BytesIO()
    result_pdf.write(response_bytes_stream)
    return response_bytes_stream.getvalue()

A few years later, I've added this to the PyPDF2 docs:
from io import BytesIO
from PyPDF2 import PdfFileReader, PdfFileWriter
# Prepare example
with open("example.pdf", "rb") as fh:
    bytes_stream = BytesIO(fh.read())
# Read from bytes_stream
reader = PdfFileReader(bytes_stream)
# Write to bytes_stream
writer = PdfFileWriter()
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
cPickle.load( ) error

I am working with cPickle to convert structured data into a data stream and pass it to a library. I have to read the contents of a manually written file named "targetstrings.txt" and convert them into the format the Netcdf library needs, in the following manner:
Note: targetstrings.txt contains latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The Netcdf library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error. I have googled around but did not find an appropriate solution.
Any suggestions?

pickle is not for reading/writing generic text files, but to serialize/deserialize Python objects to file. If you want to read text data you should use Python's usual IO functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines=[l for l in f]
# now in lines you have the lines read from the file

As stated, pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol. Although version 0 of that protocol is not binary and does not change across Python versions, it is not meant for hand editing, and pickle is not even the recommended method to record Python objects for persistence or communication (although it can be used for those purposes).
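The distinction drawn above can be shown with in-memory streams. This is a sketch using Python 3's pickle module (cPickle's successor); the sample text and the latin-1 encoding are assumptions made for illustration, not details from the question.

```python
import io
import pickle

# Plain text masquerading as a pickle file (the content is made up).
text_stream = io.BytesIO(b"Alpha\nBeta\n")

try:
    pickle.load(text_stream)
    error = None
except pickle.UnpicklingError as exc:
    # Plain text is not a pickle stream, so unpickling is rejected.
    error = exc

# Reading it as ordinary text works; latin-1 is an assumed encoding.
text_stream.seek(0)
lines = [line.decode("latin-1").rstrip("\n") for line in text_stream]

# Pickle only round-trips data that pickle itself produced.
pickled_stream = io.BytesIO(pickle.dumps(lines))
restored = pickle.load(pickled_stream)
```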

Using Python, how do I to read/write data in memory like I would with a file?

I'm used to C++, and I build my data handling classes/functions to handle stream objects instead of files. I'd like to know how I might modify the following code, so that it can handle a stream of binary data in memory, rather than a file handle.
def get_count(self):
    curr = self.file.tell()
    self.file.seek(0, 0)
    count, = struct.unpack('I', self.file.read(c_uint32_size))
    self.file.seek(curr, 0)
    return count
In this case, the code assumes self.file is a file, opened like so:
file = open('somefile.data', 'r+b')
How might I use the same code, yet instead do something like this:
file = get_binary_data()
Where get_binary_data() returns a string of binary data. Although the code doesn't show it, I also need to write to the stream (I didn't think it was worth posting the code for that).
Also, if possible, I'd like the new code to handle files as well.

You can use an instance of StringIO.StringIO (or cStringIO.StringIO, faster) to give a file-like interface to in-memory data. In Python 3 those modules are gone; use io.BytesIO for binary data and io.StringIO for text.
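A minimal Python 3 sketch of the question's get_count running against in-memory data. The Record class, the set_count method, and the explicit little-endian format "<I" are assumptions added for illustration; only get_count mirrors the original code.

```python
import io
import struct

UINT32_SIZE = struct.calcsize("<I")  # 4 bytes for an unsigned 32-bit int


class Record:
    """Hypothetical holder that works with any seekable binary stream."""

    def __init__(self, file):
        self.file = file  # a real file opened in 'r+b' mode or an io.BytesIO

    def get_count(self):
        curr = self.file.tell()
        self.file.seek(0, 0)
        (count,) = struct.unpack("<I", self.file.read(UINT32_SIZE))
        self.file.seek(curr, 0)  # restore the caller's position
        return count

    def set_count(self, count):
        curr = self.file.tell()
        self.file.seek(0, 0)
        self.file.write(struct.pack("<I", count))
        self.file.seek(curr, 0)


# In-memory stream: a 4-byte count header followed by placeholder payload.
stream = io.BytesIO(struct.pack("<I", 7) + b"payload")
stream.seek(4)

record = Record(stream)
first = record.get_count()
record.set_count(9)
second = record.get_count()
```

Since the class only calls tell, seek, read and write, passing open('somefile.data', 'r+b') instead of the BytesIO object should work unchanged.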

Take a look at Python's StringIO module, docs here, which could be pretty much what you're after.

Have a look at 'StringIO' (Read and write strings as files)

Use StringIO.

I like the timing of the answers (except mine).
Can we see response times in milliseconds?
Of course, StringIO.
