Python/wand code causes "Killed" when converting large PDFs - python

I have been working on setting up a PDF conversion-to-png and cropping script with Python 3.6.3 and the wand library.
I tried Pillow, but it's lacking the conversion part. I am experimenting with extracting the alpha channel because I want to feed the images to an OCR, at a later point, so I turned to trying the code provided in this SO answer.
A couple of issues came out: the first is that if the file is large, I get a "Killed" message from the terminal. The second is that it seems rather picky with the file, i.e. files that get converted properly by imagemagick's convert or pdftoppm in the command line, raise errors with wand.
I am mostly concerned with the first one though, and would really appreciate a check from more knowledgeable coders. I suspect it might come from the way the loop is structured:
from wand.image import Image
from wand.color import Color
def convert_pdf(filename, path, resolution=300):
all_pages = Image(filename=path+filename, resolution=resolution)
for i, page in enumerate(all_pages.sequence):
with Image(page) as img:
img.format = 'png'
img.background_color = Color('white')
img.alpha_channel = 'remove'
image_filename = '{}.png'.format(i)
img.save(filename=path+image_filename)
I noted that the script outputs all files at the end of the process, rather than one by one, which I am guessing it might put unnecessary burden on memory, and ultimately cause a SEGFAULT or something similar.
Thanks for checking out my question, and for any hints.

Yes, your line:
all_pages = Image(filename=path+filename, resolution=resolution)
Will start a GhostScript process to render the entire PDF to a huge temporary PNM file in /tmp. Wand will then load that massive file into memory and hand out pages from it as you loop.
The C API to MagickCore lets you specify which page to load, so you could perhaps render a page at a time, but I don't know how to get the Python wand interface to do that.
You could try pyvips. It renders PDFs incrementally by making direct calls to libpoppler, so there are no processes being started and stopped and no temporary files.
Example:
#!/usr/bin/python3
import sys
import pyvips
def convert_pdf(filename, resolution=300):
# n is number of pages to load, -1 means load all pages
all_pages = pyvips.Image.new_from_file(filename, dpi=resolution, n=-1, \
access="sequential")
# That'll be RGBA ... flatten out the alpha
all_pages = all_pages.flatten(background=255)
# the PDF is loaded as a very tall, thin image, with the pages joined
# top-to-bottom ... we loop down the image cutting out each page
n_pages = all_pages.get("n-pages")
page_width = all_pages.width
page_height = all_pages.height / n_pages
for i in range(0, n_pages):
page = all_pages.crop(0, i * page_height, page_width, page_height)
print("writing {}.tif ..".format(i))
page.write_to_file("{}.tif".format(i))
convert_pdf(sys.argv[1])
On this 2015 laptop with this huge PDF, I see:
$ /usr/bin/time -f %M:%e ../pages.py ~/pics/Audi_US\ R8_2017-2.pdf
writing 0.tif ..
writing 1.tif ..
....
writing 20.tif ..
720788:35.95
So 35s to render the entire document at 300dpi, and a peak memory use of 720MB.

Related

img2pdf: one page of pdf with one image?

The souce file is here.The fetch code is sify .It's just one jpg. If you can't download it, please contact bbliao#126.com.
However this image doesn't work with fpdf package, I don't know why. You can try it.
Thus I have to use the img2pdf. With the following code I converted this image to pdf successfully.
t=os.listdir()
with open('bb.pdf','wb') as f:
f.write(img2pdf.convert(t))
However, when multiple images are combined into one pdf file, the img2pdf just combine each image by head_to_tail. This causes every pagesize = imgaesize. Briefly, the first page of pdf is 30 cm*40 cm while the second is 20 cm*10 cm the third is 15*13...That's ugly.
I want the same pagesize(A4 for example) and the same imgsize in every page of the pdf. One page of pdf with one image.
Glancing at the documentation for img2pdf, it allows you to set the paper size by including layout details to the convert call:
import img2pdf
letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
layout = img2pdf.get_layout_fun(letter)
with open('test.pdf', 'wb') as f:
f.write(img2pdf.convert(['image1.jpg','image2.jpg'], layout_fun=layout))

Python converting PDF

I have the following code to create multiple jpgs from a single multi-page PDF. However I get the following error: wand.exceptions.BlobError: unable to open image '{uuid}.jpg': No such file or directory # error/blob.c/OpenBlob/2841 but the image has been created. I initially thought it may be a race condition so I put in a time.sleep() but that didn't work either so I don't believe that's it. Has anyone seen this before?
def split_pdf(pdf_obj, step_functions_client, task_token):
print(time.time())
read_pdf = PyPDF2.PdfFileReader(pdf_obj)
images = []
for page_num in range(read_pdf.numPages):
output = PyPDF2.PdfFileWriter()
output.addPage(read_pdf.getPage(page_num))
generateduuid = str(uuid.uuid4())
filename = generateduuid + ".pdf"
outputfilename = generateduuid + ".jpg"
with open(filename, "wb") as out_pdf:
output.write(out_pdf) # write to local instead
image = {"page": str(page_num + 1)} # Start at 1 rather than 0
create_image_process = subprocess.Popen(["gs", "-o " + outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename], stdout=subprocess.PIPE)
create_image_process.wait()
time.sleep(10)
with(Image(filename=outputfilename)) as img:
image["image_data"] = img.make_blob('jpeg')
image["height"] = img.height
image["width"] = img.width
images.append(image)
if hasattr(step_functions_client, 'send_task_heartbeat'):
step_functions_client.send_task_heartbeat(taskToken=task_token)
return images
It looks like you aren't passing in a value when you try to open the PDF in the first place - hence the error you are receiving.
Make sure you format the string with the full file path as well, e.g. f'/path/to/file/{uuid}.jpg' or '/path/to/file/{}.jpg'.format(uuid)
I don't really understand why your using PyPDF2, GhostScript, and wand. You not parsing/manipulating any PostScript, and Wand sits on top of ImageMagick which sits on top of ghostscript. You might be able to reduce the function down to one PDF utility.
def split_pdf(pdf_obj, step_functions_client, task_token):
images = []
with Image(file=pdf_obj, resolution=300) as document:
for index, page in enumerate(document.sequence):
image = {
"page": index + 1,
"height": page.height,
"width": page.width,
}
with Image(page) as frame:
image["image_data"] = frame.make_blob("JPEG")
images.append(image)
if hasattr(step_functions_client, 'send_task_heartbeat'):
step_functions_client.send_task_heartbeat(taskToken=task_token)
return images
I initially thought it may be a race condition so I put in a time.sleep() but that didn't work either so I don't believe that's it. Has anyone seen this before?
The example code doesn't have any error handling. PDFs can be generated by many software vendors, and a lot of them do a sloppy job. It's more than possible that PyPDF or Ghostscript failed, and you never got a chance to handle this.
For example, when I use Ghostscript for PDFs generated by a random website, I often see the following message on stderr...
ignoring zlib error: incorrect data check
... which results in incomplete documents, or blank pages.
Another common example is that the system resources have been exhausted, and no additional memory can be allocated. This happens all the time with web servers, and the solution is usually to migrate the task over to a queue worker that can cleanly shutdown at the end of each task-completion.

Multi-Page tiff resizing python

First post here, although i already spent days of searching for various queries here. Python 3.6, Pillow and tiff processing.
I would like to automate one of our manual tasks, by resizing some of the images from very big to match A4 format. We're operating on tiff format, that sometimes ( often ) contains more than one page. So I wrote:
from PIL import Image,
...
def image_resize(path, dcinput, file):
dcfake = read_config(configlocation)["resize"]["dcfake"]
try:
imagehandler = Image.open(path+file)
imagehandler = imagehandler.resize((2496, 3495), Image.ANTIALIAS)
imagehandler.save(dcinput+file, optimize=True, quality=95)
except Exception:
But the very (not) obvious is that only first page of tiff is being converted. This is not exactly what I expect from this lib, however tried to dig, and found a way to enumerate each page from tiff, and save it as a separate file.
imagehandler = Image.open(path+file)
for i, page in enumerate(ImageSequence.Iterator(imagehandler)):
page = page.resize((2496, 3495), Image.ANTIALIAS)
page.save(dcinput + "proces%i.tif" %i, optimize=True, quality=95, save_all=True)
Now I could use imagemagick, or some internal commands to convert multiple pages into one, but this is not what I want to do, as it drives to code complication.
My question, is there a unicorn that can help me with either :
1) resizing all pages of given multi-page tiff in the fly
2) build a tiff from few tiffs
I'd like to focus only on python modules.
Thx.
Take a look at this example. It will make every page of a TIF file four times smaller (by halving width and height of every page):
from PIL import Image
from PIL import ImageSequence
from PIL import TiffImagePlugin
INFILE = 'multipage_tif_example.tif'
OUTFILE = 'multipage_tif_resized.tif'
print ('Resizing TIF pages')
pages = []
imagehandler = Image.open(INFILE)
for page in ImageSequence.Iterator(imagehandler):
new_size = (page.size[0]/2, page.size[1]/2)
page = page.resize(new_size, Image.ANTIALIAS)
pages.append(page)
print ('Writing multipage TIF')
with TiffImagePlugin.AppendingTiffWriter(OUTFILE) as tf:
for page in pages:
page.save(tf)
tf.newFrame()
It's supposed to work since late Pillow 3.4.x versions (works with version 5.1.0 on my machine).
Resources:
AppendingTiffWriter discussed here.
Sample TIF files can be downloaded here.

Using wand to convert image to pdf fails after a few successful results

My application works a few times and then errors on every pdf. This is the error I receive:
Exception TypeError: TypeError("object of type 'NoneType' has no len()",) in <bound method Image.__del__ of <wand.image.Image: (empty)>> ignored
And this is the function I use:
def read_pdf(file):
pre, ext = os.path.splitext(file)
filename = pre + '.png'
with Image(filename=file, resolution=200) as pdf:
amount_of_pages = len(pdf.sequence)
image = Image(
width=pdf.width,
height=pdf.height * amount_of_pages
)
for i in range(0, amount_of_pages):
image.composite(
pdf.sequence[i],
top=pdf.height * i,
left=0
)
image.compression_quality = 100
image.save(filename=filename)
logging.info('Opened and saved pdf to image: \'' + file + '\'.')
return filename
This function will correctly convert pdfs to images but after two or three times it will crash every time and throw that exception. If I restart the python script it works again for a few times.
The error is caused by the system running out of resources. Wand calls ImageMagick library; which in turn, passes the decoding work to Ghostscript delegate. Ghostscript is very stable, but does use a lot of resources, and is not happy when run in parallel (my opinion).
Any help?
Try to architect a solution that allows a clean shutdown between PDF conversions. Like a queue worker, or subprocess script. The smallest resources leak can grow out of hand quickly.
Avoid invoking wand.image.Image.sequance. There's been a few known memory leak issues reported. Although many have been fixed, it seems PDF tasks continue to have issues.
From the code posted, it looks like your just creating a tall image with all pages of a given PDF. I would suggest porting MagickAppendImages directly.
import ctypes
from wand.image import Image
from wand.api import library
# Map C-API to python
library.MagickAppendImages.argtypes = (ctypes.c_void_p, ctypes.c_bool)
library.MagickAppendImages.restype = ctypes.c_void_p
with Image(filename='source.pdf') as pdf:
# Reset image stack
library.MagickResetIterator(pdf.wand)
# Append all pages into one new image
new_ptr = library.MagickAppendImages(pdf.wand, True)
library.MagickWriteImage(new_ptr, b'output.png')
library.DestroyMagickWand(new_ptr)
It seems that I created a new image and did not destroy it. This filled up the memory.
I just had to use with new Image(...) as img instead of img = new Image(...).

When using the Python Image Library, does open() immediately decompress the image file?

Short question
When using the Python Image Library, does open() immediately decompress the image file?
Details
I would like to measure the decompression time of compressed images (jpeg, png...), as I read that it's supposed to be a good measure of an image's "complexity" (a blank image will be decompressed quickly, and so will a purely random image, since it will not have been compressed at all, so the most "interesting" images are supposed to have the longest decompression time). So I wrote the following python program:
# complexity.py
from PIL import Image
from cStringIO import StringIO
import time
import sys
def mesure_complexity(image_path, iterations = 10000):
with open(image_path, "rb") as f:
data = f.read()
data_io = StringIO(data)
t1 = time.time()
for i in xrange(iterations):
data_io.seek(0)
Image.open(data_io, "r")
t2 = time.time()
return t2 - t1
def main():
for filepath in sys.argv[1:]:
print filepath, mesure_complexity(filepath)
if __name__ == '__main__':
main()
It can be used like this:
#python complexity.py blank.jpg blackandwhitelogo.jpg trees.jpg random.jpg
blank.jpg 1.66653203964
blackandwhitelogo.jpg 1.33399987221
trees.jpg 1.62251782417
random.jpg 0.967066049576
As you can see, I'm not getting the expected results at all, especially for the blank.jpg file: it should be the one with the lowest "complexity" (quickest decompression time). So either the article I read is utterly wrong (I really doubt it, it was a serious scientific article), or PIL is not doing what I think it's doing. Maybe the actual conversion to a bitmap is done lazily, when it's actually needed? But then why would the open delays differ? The smallest jpg file is of course the blank image, and the largest is the random image. This really does not make sense.
Note 1: when running the program multiple times, I get roughly the same results: the results are absurd, but stable. ;-)
Note 2: all images have the same size (width x height).
Edit
I just tried with png images instead of jpeg, and now everything behaves as expected. Cool! I just sorted about 50 images by complexity, and they do look more and more "complex". I checked the article (BTW, it's an article by Jean-Paul Delahaye in 'Pour la Science', April 2013): the author actually mentions that he used only loss-less compression algorithms. So I guess the answer is that open does decompress the image, but my program did not work because I should have used images compressed with loss-less algorithms only (png, but not jpeg).
Glad you got it sorted out. Anyway, the open() method is indeed a lazy operation – as stated in the documentation, to ensure that the image will be loaded, use image.load(), as this will actually force PIL / Pillow to interpret the image data (which is also stated in the linked documentation).

Categories