xhtml2pdf: Output generated PDF as in-memory object (its bytes) - python

I'm working with Python 3, Django and the xhtml2pdf package.
I want to create a PDF from an HTML string without writing it to disk. I just want to get its bytes from memory, as when using BytesIO or StringIO.
I've read the xhtml2pdf documentation. This is the closest I've found related to what I need:
In-memory files can be generated by using StringIO or cStringIO instead of the file open. Advanced options will be discussed later in this document.
And this is one of the latest things I've tried:
from io import BytesIO
from xhtml2pdf import pisa

def html_to_pdf(html):
    """Writes a PDF file using xhtml2pdf from a given HTML stream

    Parameters
    ----------
    html : str
        A valid HTML string.

    Returns
    -------
    bytes
        A bytes sequence containing the rendered PDF.
    """
    output = BytesIO()
    pisa_status = pisa.CreatePDF(html, dest=output)
    return new_output.read()
But this isn't working.
Any idea how to output the generated PDF as an in-memory object and thus return its bytes?

I think your return statement is using new_output instead of output.
However, the real issue is probably something else: after CreatePDF writes to output, the stream position is at the end, so read() returns nothing. Have you tried calling output.seek(0) before reading its bytes with output.read()?

What you can also do is output.getvalue(). This will get the entire contents of the BytesIO object.
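The difference between read() and getvalue() comes down to the stream position. Here is a minimal sketch using only the standard library; fake_render is a hypothetical stand-in for a PDF writer such as pisa.CreatePDF and is not part of xhtml2pdf:

```python
from io import BytesIO


def fake_render(dest):
    # Hypothetical stand-in for a PDF writer such as pisa.CreatePDF:
    # it simply writes some bytes to the destination stream.
    dest.write(b"%PDF-1.4 fake content")


output = BytesIO()
fake_render(output)

# After writing, the position is at the end, so read() finds nothing left.
at_end = output.read()

# getvalue() ignores the position and returns the whole buffer.
whole = output.getvalue()

# Rewinding makes read() return everything as well.
output.seek(0)
after_seek = output.read()
```

In this sketch at_end is empty, while whole and after_seek hold the same bytes, which is why either seek(0) or getvalue() should fix the function in the question.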

Related

Read PDF tables from memory with Python

I'm trying to read a PDF file extracted from a zip file in memory to get the tables inside the file. Camelot seems a good way to do it, but I'm getting the following error:
AttributeError: '_io.StringIO' object has no attribute 'lower'
Is there some way to read the file and extract the tables with camelot, or should I use another library?
z = zipfile.ZipFile(self.zip_file)
for file in z.namelist():
    if file.endswith(".pdf"):
        pdf = z.read(file).decode(encoding="latin-1")
        pdf = StringIO(pdf)
        pdf = camelot.read_pdf(pdf, codec='utf-8')
camelot.read_pdf(filepath,...)
accepts a file path as its first parameter, which explains the error: the library calls a string method on the StringIO object you passed. It appears to be a bad match for your requirements. Search for another library.
In any case, StringIO(pdf) returns a StringIO object, not its contents:
<_io.StringIO object at 0x000002592DD33E20>
For starters, when you read from a StringIO, do it by calling its read() method:
pdf = StringIO(pdf)
pdf.read()
That returns the contents as a str (use BytesIO if you need bytes). Next, think about the encoding that the library will accept.
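If the library only accepts a path, one possible workaround is to write the extracted bytes to a temporary file and pass its name. This is a sketch under assumptions, not Camelot-specific advice: process_pdfs_in_zip is a hypothetical helper, and read_tables is a hypothetical callback standing in for a path-based reader such as camelot.read_pdf. Whether a short-lived temporary file is acceptable depends on your constraints.

```python
import os
import tempfile
import zipfile


def process_pdfs_in_zip(zip_source, read_tables):
    """Call read_tables(path) for every PDF member of a zip archive.

    zip_source may be a path or a binary file-like object (e.g. BytesIO).
    read_tables is a hypothetical callback that expects a file path.
    """
    results = {}
    with zipfile.ZipFile(zip_source) as z:
        for name in z.namelist():
            if not name.endswith(".pdf"):
                continue
            # Keep the data as bytes; a PDF is binary, so nothing to decode.
            data = z.read(name)
            # delete=False so the file can be reopened by name on any OS.
            tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
            try:
                tmp.write(data)
                tmp.close()
                results[name] = read_tables(tmp.name)
            finally:
                tmp.close()
                os.remove(tmp.name)
    return results
```

With Camelot installed, the callback could be something like lambda path: camelot.read_pdf(path), since the source states that read_pdf takes a file path as its first parameter.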

How to convert an XML tree object to bytes stream? Python

I have a function that saves files to a db, but this one requires a bytes stream as parameter. Something like:
write_to_db("File name", stream_obj)
Now, I want to save an XML document; I am using the xml library.
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
Is there a function that converts the XML object to a bytes stream?
The solution I got was:
1) Save it locally with the write function
2) Reopen it with "rb" to get the file as bytes
3) Now that I have the bytes stream, save it with the function mentioned
4) Delete the file
Example:
# Saving xml as local file
tree = ET.ElementTree(ET.Element("Example"))
tree.write("/This/is/a/path.xml")
# Reading local file as bytes
f = open("/This/is/a/path.xml", "rb")
# Saving to DB
write_to_db("File name", f) # <--- See how I am using "f" because I opened it as bytes with rb
# Closing and deleting local file
f.close()
os.remove("/This/is/a/path.xml")
But is there a function in the xml library that automatically returns the bytes stream? Something like:
tree = ET.ElementTree(ET.Element("Example"))
bytes_file = tree.get_bytes() # <-- Like this?
# Writing to db
write_to_db("File name", bytes_file)
That way I can avoid creating and removing the file in my repository.
Thank you in advance.
Another quick question:
Is the term "bytes stream" correct? If not, what is the difference, and what is the correct term for what I am looking for?
So as Balmy mentioned in the comments, the solution is using:
ET.tostring()
My code at the end looked something like this:
# Here you build your xml
x = ET.Element("ExampleXML",{"a tag": "1", "another tag": "2"})
# Here I am saving it to my db by using the "tostring" function,
# which by default returns the XML as a bytes object.
write_to_db("File name", ET.tostring(x))
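On the terminology question: ET.tostring() returns a bytes object (the data itself), while a "stream" usually means a file-like object that you can read() from. Here is a minimal sketch of both, using only the standard library. Which of the two write_to_db needs is an assumption that depends on that function; the element names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("Example", {"a": "1"})
ET.SubElement(root, "Child").text = "hello"

# A bytes object: the serialized XML itself.
xml_bytes = ET.tostring(root)

# A bytes stream: a file-like object wrapping those bytes.
xml_stream = io.BytesIO(xml_bytes)

# Alternatively, let ElementTree write straight into an in-memory stream.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding="utf-8", xml_declaration=True)
buffer.seek(0)  # rewind before handing the stream to a reader
```

If write_to_db calls read() on its argument, pass xml_stream or buffer; if it expects raw bytes, pass xml_bytes.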

Writing a Python pdfrw PdfReader object to an array of bytes / filestream

I'm currently working on a simple proof of concept for a pdf-editor application. The example is supposed to be a simplified python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
The idea is that file opening and closing is going to most likely be handled by external code and so I want all the edits in my files to be done in memory. I don't want to write the edited filestream to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
import pdfrw

class FormFiller:
    def __fill_pdf__(self, input_pdf_filestream : bytes, data_dict : dict):
        template_pdf : pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf

    def fillForm(self, mapper : FieldMapper):
        value_mapping : dict = mapper.getValues()
        filled_pdf : pdfrw.PdfReader = self.__fill_pdf__(self.filestream, value_mapping)
        #<this point is crucial>

    def __init__(self, filestream : bytes):
        self.filestream : bytes = filestream
So, as you see the FormFiller constructor receives an array of bytes. In fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will), and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a pdf output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(
To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter.write() method to write your data to the empty BytesIO object:
pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# I'm not sure about the syntax, I haven't used this lib before
This works because io.BytesIO objects act like a file object, also known as a file-like object. It and related classes like io.StringIO behave like files in memory, such as the object f created with the built-in function open below:
with open("output.txt", "a") as f:
    f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to seek(0) back to the beginning, or rewind it. Otherwise, the object seems empty.
new_bytes_object.seek(0)

Python Wand.image PDF to JPG in memory converter

I am trying to write some code that will convert a PDF that resides on the web into a series of jpgs.
I got working code that:
1) takes pdf
2) saves it to disk
3) converts it to JPGs, which are saved to disk.
Is there a way to write the same code (my attempt below throws an error) so that it takes the PDF from the internet but keeps it in memory, to avoid writing to and reading from disk, and then converts it to JPGs (which are to be uploaded to AWS S3)?
I was thinking this would work:
f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf") #file to process
But I get the following error:
"Exception TypeError: TypeError("object of type 'NoneType' has no len()",) in > ignored"
Full code, along with the PDF file that I want converted. Note: the code works if I replace f= with the location of a PDF saved on disk:
from urllib2 import urlopen
from wand.image import Image
#location on disk
save_location = "/home/bob/Desktop/pdfs to convert/example1"
#file prefix
test_id = 'example'
print 1
f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf")
print 2
print type(f)
with Image(filename=f) as img:
    print('pages = ', len(img.sequence))
    with img.convert('jpg') as converted:
        converted.save(filename=save_location+"/"+test_id+".jpg")
The result of urlopen obviously isn't a filename, so you can't pass in filename=f and expect it to work.
I don't have Wand installed, but from the docs, there are clearly a bunch of alternative ways to construct it.
First, the object returned by urlopen is a file-like object. Of course "file-like object" is a somewhat vague term, and not all file-like objects work for all APIs that expect file-like objects (e.g., the API may expect to be able to call fileno and read from it at the POSIX level…), but this is at least worth trying (note file instead of filename):
with Image(file=f) as img:
If that doesn't work, you can always read the data into memory:
buf = f.read()
with Image(blob=buf) as img:
Not as ideal (if you have giant files), but at least you don't have to store it on disk.
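In Python 3, urllib2 became urllib.request. A minimal sketch of the read-into-memory approach follows. fetch_bytes is a hypothetical helper name, and the Wand call is left as a comment because it assumes Wand and ImageMagick are installed:

```python
from urllib.request import urlopen


def fetch_bytes(url):
    """Download a URL and return its body as bytes, without touching disk."""
    with urlopen(url) as response:
        return response.read()


# Intended use with Wand (not executed here; pdf_url is a placeholder):
#     from wand.image import Image
#     with Image(blob=fetch_bytes(pdf_url)) as img:
#         ...
```

The whole response is held in memory at once, so the same caveat about giant files applies.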

Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first

I am using Linux; printing raw to port 9100 returns a "bytes" type. I was wondering if it is possible to pass this straight into PyPDF2, rather than making a PDF file first and then using PdfFileReader?
Thank you for your time.
PyPDF2.PdfFileReader() defines its first parameter as:
stream – A File object or an object that supports the standard read and seek methods similar to a File object. Could also be a string representing a path to a PDF file.
So you can pass any data to it as long as it can be accessed as a file-like stream. A perfect candidate for that is io.BytesIO(). Write your received raw bytes to it, then seek back to 0, pass the object to PyPDF2.PdfFileReader() and you're done.
Yes, the first answer is right. Here is a code example that generates PDF bytes without creating a PDF file:
import io
from typing import List

from PyPDF2 import PdfFileReader, PdfFileWriter


def join_pdf(pdf_chunks: List[bytes]) -> bytes:
    # Create empty pdf-writer object for adding all pages here
    result_pdf = PdfFileWriter()
    # Iterate for all pdf-bytes
    for chunk in pdf_chunks:
        # Read bytes
        chunk_pdf = PdfFileReader(
            stream=io.BytesIO(  # Create stream object
                initial_bytes=chunk
            )
        )
        # Add all pages to our result
        for page in range(chunk_pdf.getNumPages()):
            result_pdf.addPage(chunk_pdf.getPage(page))
    # Writes all bytes to bytes-stream
    response_bytes_stream = io.BytesIO()
    result_pdf.write(response_bytes_stream)
    return response_bytes_stream.getvalue()
A few years later, I've added this to the PyPDF2 docs:
from io import BytesIO
from PyPDF2 import PdfFileReader, PdfFileWriter
# Prepare example
with open("example.pdf", "rb") as fh:
    bytes_stream = BytesIO(fh.read())
# Read from bytes_stream
reader = PdfFileReader(bytes_stream)
# Write to bytes_stream
writer = PdfFileWriter()
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
