Python converting PDF

I have the following code to create multiple JPGs from a single multi-page PDF. However, I get the following error: wand.exceptions.BlobError: unable to open image '{uuid}.jpg': No such file or directory # error/blob.c/OpenBlob/2841, even though the image has been created. I initially thought it might be a race condition, so I put in a time.sleep(), but that didn't work either, so I don't believe that's it. Has anyone seen this before?
def split_pdf(pdf_obj, step_functions_client, task_token):
    print(time.time())
    read_pdf = PyPDF2.PdfFileReader(pdf_obj)
    images = []

    for page_num in range(read_pdf.numPages):
        output = PyPDF2.PdfFileWriter()
        output.addPage(read_pdf.getPage(page_num))

        generateduuid = str(uuid.uuid4())
        filename = generateduuid + ".pdf"
        outputfilename = generateduuid + ".jpg"

        with open(filename, "wb") as out_pdf:
            output.write(out_pdf)  # write to local instead

        image = {"page": str(page_num + 1)}  # Start at 1 rather than 0

        create_image_process = subprocess.Popen(["gs", "-o " + outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename], stdout=subprocess.PIPE)
        create_image_process.wait()

        time.sleep(10)

        with Image(filename=outputfilename) as img:
            image["image_data"] = img.make_blob('jpeg')
            image["height"] = img.height
            image["width"] = img.width

        images.append(image)

        if hasattr(step_functions_client, 'send_task_heartbeat'):
            step_functions_client.send_task_heartbeat(taskToken=task_token)

    return images

It looks like you aren't passing in a value when you try to open the PDF in the first place, hence the error you are receiving.
Make sure you format the string with the full file path as well, e.g. f'/path/to/file/{uuid}.jpg' or '/path/to/file/{}.jpg'.format(uuid).
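For illustration, a minimal sketch of building an absolute output path before handing it to Ghostscript and wand; the base directory here is hypothetical and just stands in for wherever the files actually land:

import os

output_dir = "/path/to/file"  # hypothetical directory the JPGs are written to
outputfilename = os.path.join(output_dir, generateduuid + ".jpg")
filename = os.path.join(output_dir, generateduuid + ".pdf")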

I don't really understand why you're using PyPDF2, Ghostscript, and Wand. You're not parsing or manipulating any PostScript, and Wand sits on top of ImageMagick, which sits on top of Ghostscript. You might be able to reduce the function down to one PDF utility.
def split_pdf(pdf_obj, step_functions_client, task_token):
    images = []
    with Image(file=pdf_obj, resolution=300) as document:
        for index, page in enumerate(document.sequence):
            image = {
                "page": index + 1,
                "height": page.height,
                "width": page.width,
            }
            with Image(page) as frame:
                image["image_data"] = frame.make_blob("JPEG")
            images.append(image)
            if hasattr(step_functions_client, 'send_task_heartbeat'):
                step_functions_client.send_task_heartbeat(taskToken=task_token)
    return images
I initially thought it may be a race condition so I put in a time.sleep() but that didn't work either so I don't believe that's it. Has anyone seen this before?
The example code doesn't have any error handling. PDFs can be generated by many software vendors, and a lot of them do a sloppy job. It's more than possible that PyPDF or Ghostscript failed, and you never got a chance to handle this.
For example, when I use Ghostscript for PDFs generated by a random website, I often see the following message on stderr...
ignoring zlib error: incorrect data check
... which results in incomplete documents, or blank pages.
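A minimal sketch of guarding against that, wrapped around the Ghostscript call from the question (variable names come from the question; swapping subprocess.Popen for subprocess.run and passing the output name as its own argument are my own choices):

import os
import subprocess

proc = subprocess.run(
    ["gs", "-o", outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename],
    capture_output=True, text=True,
)
if proc.returncode != 0 or not os.path.exists(outputfilename):
    # surface Ghostscript's stderr here instead of failing later inside wand
    raise RuntimeError("Ghostscript failed for %s: %s" % (filename, proc.stderr))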
Another common example is that system resources have been exhausted and no additional memory can be allocated. This happens all the time with web servers, and the solution is usually to migrate the task over to a queue worker that can shut down cleanly after each task.

Related

How do I stop creating a broken png when converting from base64 using Python

I've tried this a number of ways and have searched high and low, but no matter what I try (including all posts I could find here on the subject) I can't manage to convert my base64 string of an HTML document / canvas containing JavaScript into a PNG.
I'm not getting the incorrect padding error which is quite common (I have ensured 'data:text/html;base64,' is not included at the start of the base64 string.)
I have also checked the base64 string both by checking and running the original .html file, which renders in browser with no issue, and decoding the string with an online decoder.
I know I must be missing something very simple here, but after several hours I'm ready to pull my hair out.
My encoding step is as follows:
htmlSource = bytes(htmlSource,'UTF-8')
fullBase64 = base64.b64encode(htmlSource)
The resultant base64 string is included in my attempts below, which should generate a turquoise oval with shadow on a dirty white background in 4k.
The following attempts all create a png file, only 1kb in size, which cannot be opened - 'It may be damaged or use a file format that Preview doesn’t recognise.':
import base64
img_data = b'PCFET0NUWVBFIGh0bWw+CjxodG1sPgogIDxoZWFkPgogICAgPG1ldGEgbmFtZT0ndmlld3BvcnQnIGNvbnRlbnQ9J3dpZHRoPWRldmljZS13aWR0aCwgaW5pdGlhbC1zY2FsZT0xLjAnPgogIDwvaGVhZD4KPGJvZHk+CjxzdHlsZT4KICAgIGJvZHksIGh0bWwgewogICAgICBwYWRkaW5nOiAwICFpbXBvcnRhbnQ7CiAgICAgIG1hcmdpbjogMCAhaW1wb3J0YW50OwogICAgICBtYXJnaW46IDA7CiAgICB9CiAgICAqIHsKICAgICAgcGFkZGluZzogMDsKICAgICAgbWFyZ2luOiAwOwogICAgfQo8L3N0eWxlPgoKPGNhbnZhcyBpZD0nbXlDYW52YXMnIHN0eWxlPSdvYmplY3QtZml0OiBjb250YWluOyB3aWR0aDogOTl2dzsgaGVpZ2h0OiA5OXZoOyc+CllvdXIgYnJvd3NlciBkb2VzIG5vdCBzdXBwb3J0IHRoZSBIVE1MNSBjYW52YXMgdGFnLjwvY2FudmFzPgoKPHNjcmlwdD4KdmFyIGNhbnZhcyA9IGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCdteUNhbnZhcycpOwpjYW52YXMud2lkdGggPSA0MDk2OwpjYW52YXMuaGVpZ2h0ID0gNDA5NjsKY2FudmFzLnN0eWxlLndpZHRoID0gJzk5dncnOwpjYW52YXMuc3R5bGUuaGVpZ2h0ID0gJzk5dmgnOwp2YXIgY3R4ID0gY2FudmFzLmdldENvbnRleHQoJzJkJyk7CnZhciBjYW52YXNXID0gY3R4LmNhbnZhcy53aWR0aDsKdmFyIGNhbnZhc0ggPSBjdHguY2FudmFzLmhlaWdodDsKCmN0eC5maWxsU3R5bGUgPSAncmdiYSgyMDAsIDE5NywgMTc3LCAxKSc7CmN0eC5maWxsUmVjdCgwLCAwLCBjYW52YXNXLCBjYW52YXNIKTsKCmN0eC5zaGFkb3dCbHVyID0gY2FudmFzVzsKY3R4LnNoYWRvd0NvbG9yID0gJ3JnYmEoMCwgMCwgMCwgMC4zKSc7CmN0eC5iZWdpblBhdGgoKTsKY3R4LmZpbGxTdHlsZSA9ICdyZ2JhKDUxLCAyMjAsIDE5MSwgMSknOwpjdHguZWxsaXBzZShjYW52YXNXIC8gMiwgY2FudmFzSCAvIDIgLCBjYW52YXNXICogLjQsIGNhbnZhc0ggKiAuNDUsIDAsIDAsIDIgKiBNYXRoLlBJKTsKY3R4LmZpbGwoKTsKCgoKPC9zY3JpcHQ+Cgo8L2JvZHk+CjwvaHRtbD4='
with open("turquoise egg.png", "wb") as fh:
    fh.write(base64.decodebytes(img_data))
Version 2
from binascii import a2b_base64
data = 'PCFET0NUWVBFIGh0bWw+CjxodG1sPgogIDxoZWFkPgogICAgPG1ldGEgbmFtZT0ndmlld3BvcnQnIGNvbnRlbnQ9J3dpZHRoPWRldmljZS13aWR0aCwgaW5pdGlhbC1zY2FsZT0xLjAnPgogIDwvaGVhZD4KPGJvZHk+CjxzdHlsZT4KICAgIGJvZHksIGh0bWwgewogICAgICBwYWRkaW5nOiAwICFpbXBvcnRhbnQ7CiAgICAgIG1hcmdpbjogMCAhaW1wb3J0YW50OwogICAgICBtYXJnaW46IDA7CiAgICB9CiAgICAqIHsKICAgICAgcGFkZGluZzogMDsKICAgICAgbWFyZ2luOiAwOwogICAgfQo8L3N0eWxlPgoKPGNhbnZhcyBpZD0nbXlDYW52YXMnIHN0eWxlPSdvYmplY3QtZml0OiBjb250YWluOyB3aWR0aDogOTl2dzsgaGVpZ2h0OiA5OXZoOyc+CllvdXIgYnJvd3NlciBkb2VzIG5vdCBzdXBwb3J0IHRoZSBIVE1MNSBjYW52YXMgdGFnLjwvY2FudmFzPgoKPHNjcmlwdD4KdmFyIGNhbnZhcyA9IGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCdteUNhbnZhcycpOwpjYW52YXMud2lkdGggPSA0MDk2OwpjYW52YXMuaGVpZ2h0ID0gNDA5NjsKY2FudmFzLnN0eWxlLndpZHRoID0gJzk5dncnOwpjYW52YXMuc3R5bGUuaGVpZ2h0ID0gJzk5dmgnOwp2YXIgY3R4ID0gY2FudmFzLmdldENvbnRleHQoJzJkJyk7CnZhciBjYW52YXNXID0gY3R4LmNhbnZhcy53aWR0aDsKdmFyIGNhbnZhc0ggPSBjdHguY2FudmFzLmhlaWdodDsKCmN0eC5maWxsU3R5bGUgPSAncmdiYSgyMDAsIDE5NywgMTc3LCAxKSc7CmN0eC5maWxsUmVjdCgwLCAwLCBjYW52YXNXLCBjYW52YXNIKTsKCmN0eC5zaGFkb3dCbHVyID0gY2FudmFzVzsKY3R4LnNoYWRvd0NvbG9yID0gJ3JnYmEoMCwgMCwgMCwgMC4zKSc7CmN0eC5iZWdpblBhdGgoKTsKY3R4LmZpbGxTdHlsZSA9ICdyZ2JhKDUxLCAyMjAsIDE5MSwgMSknOwpjdHguZWxsaXBzZShjYW52YXNXIC8gMiwgY2FudmFzSCAvIDIgLCBjYW52YXNXICogLjQsIGNhbnZhc0ggKiAuNDUsIDAsIDAsIDIgKiBNYXRoLlBJKTsKY3R4LmZpbGwoKTsKCgoKPC9zY3JpcHQ+Cgo8L2JvZHk+CjwvaHRtbD4='
binary_data = a2b_base64(data)
fd = open('turquoise egg.png', 'wb')
fd.write(binary_data)
fd.close()
Version 3
import base64
fileString = 'PCFET0NUWVBFIGh0bWw+CjxodG1sPgogIDxoZWFkPgogICAgPG1ldGEgbmFtZT0ndmlld3BvcnQnIGNvbnRlbnQ9J3dpZHRoPWRldmljZS13aWR0aCwgaW5pdGlhbC1zY2FsZT0xLjAnPgogIDwvaGVhZD4KPGJvZHk+CjxzdHlsZT4KICAgIGJvZHksIGh0bWwgewogICAgICBwYWRkaW5nOiAwICFpbXBvcnRhbnQ7CiAgICAgIG1hcmdpbjogMCAhaW1wb3J0YW50OwogICAgICBtYXJnaW46IDA7CiAgICB9CiAgICAqIHsKICAgICAgcGFkZGluZzogMDsKICAgICAgbWFyZ2luOiAwOwogICAgfQo8L3N0eWxlPgoKPGNhbnZhcyBpZD0nbXlDYW52YXMnIHN0eWxlPSdvYmplY3QtZml0OiBjb250YWluOyB3aWR0aDogOTl2dzsgaGVpZ2h0OiA5OXZoOyc+CllvdXIgYnJvd3NlciBkb2VzIG5vdCBzdXBwb3J0IHRoZSBIVE1MNSBjYW52YXMgdGFnLjwvY2FudmFzPgoKPHNjcmlwdD4KdmFyIGNhbnZhcyA9IGRvY3VtZW50LmdldEVsZW1lbnRCeUlkKCdteUNhbnZhcycpOwpjYW52YXMud2lkdGggPSA0MDk2OwpjYW52YXMuaGVpZ2h0ID0gNDA5NjsKY2FudmFzLnN0eWxlLndpZHRoID0gJzk5dncnOwpjYW52YXMuc3R5bGUuaGVpZ2h0ID0gJzk5dmgnOwp2YXIgY3R4ID0gY2FudmFzLmdldENvbnRleHQoJzJkJyk7CnZhciBjYW52YXNXID0gY3R4LmNhbnZhcy53aWR0aDsKdmFyIGNhbnZhc0ggPSBjdHguY2FudmFzLmhlaWdodDsKCmN0eC5maWxsU3R5bGUgPSAncmdiYSgyMDAsIDE5NywgMTc3LCAxKSc7CmN0eC5maWxsUmVjdCgwLCAwLCBjYW52YXNXLCBjYW52YXNIKTsKCmN0eC5zaGFkb3dCbHVyID0gY2FudmFzVzsKY3R4LnNoYWRvd0NvbG9yID0gJ3JnYmEoMCwgMCwgMCwgMC4zKSc7CmN0eC5iZWdpblBhdGgoKTsKY3R4LmZpbGxTdHlsZSA9ICdyZ2JhKDUxLCAyMjAsIDE5MSwgMSknOwpjdHguZWxsaXBzZShjYW52YXNXIC8gMiwgY2FudmFzSCAvIDIgLCBjYW52YXNXICogLjQsIGNhbnZhc0ggKiAuNDUsIDAsIDAsIDIgKiBNYXRoLlBJKTsKY3R4LmZpbGwoKTsKCgoKPC9zY3JpcHQ+Cgo8L2JvZHk+CjwvaHRtbD4='
decodeit = open('turquoise egg.png', 'wb')
decodeit.write(base64.b64decode((fileString)))
decodeit.close()
FWIW, I originally used the following code to create a PNG from the HTML without using base64, but it would only ever save the first element of JavaScript drawn on the canvas (i.e. the background), and since I require the information in base64 anyway, I thought I would approach it this way in order to capture the complete image:
file = open('html.html', 'r')
imgkit.from_file(file, 'png.png')
file.close()
Html2Image has provided the solution I was looking for.
Whilst imgkit wasn't saving the fully rendered canvas, taking a screenshot with html2image does. Documentation is here and I implemented it as follows:
from html2image import Html2Image

hti = Html2Image()
hti.screenshot(
    html_file='html.html',
    size=(imageW, imageH),
    save_as='png.png'
)

Convert all images into one PDF?

I would like to finish my script; I have tried a lot to solve this, but being a beginner I have failed.
I have a function using imageio which takes images from a website; after that, I would like to resize all the images to 63x88 and put them all into one PDF.
full_path = os.path.join(filePath1, name + ".png")
if os.path.exists(full_path):
    number = 1
    while True:
        full_path = os.path.join(filePath1, name + str(number) + ".png")
        if not os.path.exists(full_path):
            break
        number += 1
imageio.imwrite(full_path, im_padded.astype(np.uint8))
os.chmod(full_path, mode=0o777)
Thanks for any answer.
We (ImageIO) currently don't have a PDF reader/writer. There is a long-standing feature request for it, which hasn't been implemented yet because there is currently nobody willing to contribute it.
Regarding the loading of images, we have an example for this in the docs:
import imageio as iio
from pathlib import Path

images = list()
for file in Path("path/to/folder").iterdir():
    im = iio.imread(file)
    images.append(im)
The caveat is that this particular example assumes that you want to read all images in a folder, and that there are only images in said folder. If either of these doesn't apply to you, you can easily customize the snippet.
Regarding the resizing of images, you have several options, and I recommend scikit-image's resize function.
To then get all the images into a PDF, you could have a look at matplotlib, which can generate a figure that you can save as a PDF file. The exact steps will depend on the desired layout of your resulting PDF.
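For example, here is a minimal sketch combining the two suggestions, assuming the images list from the snippet above; the output file name and the choice of matplotlib's PdfPages backend with one page per image are my own:

from skimage.transform import resize
from matplotlib.backends.backend_pdf import PdfPages
import matplotlib.pyplot as plt

with PdfPages("cards.pdf") as pdf:
    for im in images:
        # resize to the 63x88 target from the question (63 wide, 88 high, in pixels)
        small = resize(im, (88, 63), preserve_range=True).astype("uint8")
        fig = plt.figure()
        plt.imshow(small)
        plt.axis("off")
        pdf.savefig(fig)  # one PDF page per image
        plt.close(fig)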

How to extract text from this compressed PDF/A?

For machine learning purposes (scikit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf's pdftotext to do this task:
exe = r'"'+os.path.join(xpdf_path,"pdftotext.exe")+'"'
cmd = exe+" "+"\""+pdf+"\""+" "+"\""+pdf+".txt"+"\""
subprocess.check_output(cmd)
with open(pdf+".txt") as f:
    texto_converted = f.read()
But unfortunately, for a few of them, I was unable to get the text because they are using "stream" in their PDF source, like this one.
The result is something like this:
59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$#"12!/ 21#$#A$3A$>#>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$#;A533> "/\ko/f\#e#e#p
I even tried using zlib + regex:
import re
import zlib

pdf = open("pdfa.pdf", "rb").read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in re.findall(stream, pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s).decode('UTF-8'))
        print("")
    except:
        pass
The result was something like this:
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
I even tried pdftopng (xpdf) so I could run tesseract on the images afterwards, without success.
So, Is there any way to extract pure text from a PDF like that using Python or a third party app?
If you want to decompress the streams in a PDF file, I can recommend using qpdf, but on this file
qpdf --decrypt --stream-data=uncompress document.pdf out.pdf
doesn't help either.
I am not sure though why your efforts with xpdf and tesseract did not work out. Using ImageMagick's convert to create PNG files in a temporary directory and then running tesseract on them, you can do:
import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess

DPI = 600

def call(*args):
    cmd = [str(x) for x in args]
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')

def ocr(docpath, lang):
    result = []
    abs_path = Path(docpath).expanduser().resolve()
    old_dir = os.getcwd()
    out = Path('out.txt')
    with TemporaryDirectory() as tmpdir:
        os.chdir(tmpdir)
        call('convert', '-density', DPI, abs_path, 'out.png')
        index = -1
        while True:
            # names have no leading zeros on the digits, would be difficult to sort glob() output
            # so just count them
            index += 1
            png = Path(f'out-{index}.png')
            if not png.exists():
                break
            call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
            result.append(out.read_text())
        os.chdir(old_dir)
    return result
pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))
which gives:
DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO
Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre
If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.
The final print is limited as the full output is quite large.
This converting and OCR-ing takes a while so you might want to uncomment the print in call() to get some sense of progress.
There are two fairly simple techniques you can use.
1) Google's "Tesseract" open source OCR (optical character recognition). You could apply this evenly to all PDFs, though converting all that data into pixels and then working magic upon them is going to be more computationally expensive. Which is more important, engineer time or CPU time? There's a pytesseract module. Note that this tool works on image formats, so you'd have to use something like Ghostscript (another open source project) to convert all of a PDF's pages to images, then run [py]tesseract on those images.
2) pyPDF can get each page and programmatically extract any text draw operations in the order they were drawn onto the page. This may be nothing like the logical reading order of the page... While a PDF could draw all the 'a's and then all the 'b's (and so forth), it's actually more efficient to draw everything in "font a" , then everything in "font b". It's important to note that "font b" might just be the italic version of "font a". This produces a shorter/more efficient stream of drawing commands, though probably not by such an amount as to be a good business decision to do so.
The kicker here is that a random pile of PDF files might require you to do some OCR. A poorly assembled PDF (one with a font subset that has no "to unicode" data) can't be properly mined for text even though it has nothing but text drawing operations. "Draw glyphs one through five from 'font C'" doesn't mean much if you don't know that those first five glyphs are "g-l-y-p-h", because that's the order they were used in.
On the other hand, if you've got home-grown PDFs or all your pdfs are from some known source (Word's pdf converter for example), you'll know what to expect in advance.
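If your PDFs do turn out to carry usable text operations, a minimal sketch of option 2 might look like this (using the same old PyPDF2 API as the first question above; the file name is just a placeholder):

import PyPDF2

with open("document.pdf", "rb") as fh:
    reader = PyPDF2.PdfFileReader(fh)
    for page_num in range(reader.numPages):
        # extractText() returns the text in content-stream order,
        # which, as noted above, may not match the reading order
        print(reader.getPage(page_num).extractText())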
Note that the only thing mentioned above that I've actually used is Ghostscript. I remember it having a solid command line interface we used to generate images for some online PDF viewer Many Years Ago.

Python/wand code causes "Killed" when converting large PDFs

I have been working on setting up a PDF conversion-to-png and cropping script with Python 3.6.3 and the wand library.
I tried Pillow, but it's lacking the conversion part. I am experimenting with extracting the alpha channel because I want to feed the images to an OCR, at a later point, so I turned to trying the code provided in this SO answer.
A couple of issues came out: the first is that if the file is large, I get a "Killed" message from the terminal. The second is that it seems rather picky with the file, i.e. files that get converted properly by imagemagick's convert or pdftoppm in the command line, raise errors with wand.
I am mostly concerned with the first one though, and would really appreciate a check from more knowledgeable coders. I suspect it might come from the way the loop is structured:
from wand.image import Image
from wand.color import Color

def convert_pdf(filename, path, resolution=300):
    all_pages = Image(filename=path+filename, resolution=resolution)
    for i, page in enumerate(all_pages.sequence):
        with Image(page) as img:
            img.format = 'png'
            img.background_color = Color('white')
            img.alpha_channel = 'remove'
            image_filename = '{}.png'.format(i)
            img.save(filename=path+image_filename)
I noted that the script outputs all files at the end of the process, rather than one by one, which I am guessing might put an unnecessary burden on memory and ultimately cause a SEGFAULT or something similar.
Thanks for checking out my question, and for any hints.
Yes, your line:
all_pages = Image(filename=path+filename, resolution=resolution)
will start a Ghostscript process to render the entire PDF to a huge temporary PNM file in /tmp. Wand will then load that massive file into memory and hand out pages from it as you loop.
The C API to MagickCore lets you specify which page to load, so you could perhaps render a page at a time, but I don't know how to get the Python wand interface to do that.
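One thing that may be worth trying (an assumption on my part, relying on ImageMagick's filename page-selection syntax rather than the C API) is asking wand for a single page per call:

from wand.image import Image

# ImageMagick accepts a page-index suffix on the filename, so this should
# render just the first page instead of the whole document
with Image(filename='document.pdf[0]', resolution=300) as first_page:
    first_page.format = 'png'
    first_page.save(filename='page-0.png')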
You could try pyvips. It renders PDFs incrementally by making direct calls to libpoppler, so there are no processes being started and stopped and no temporary files.
Example:
#!/usr/bin/python3
import sys
import pyvips
def convert_pdf(filename, resolution=300):
    # n is number of pages to load, -1 means load all pages
    all_pages = pyvips.Image.new_from_file(filename, dpi=resolution, n=-1,
                                           access="sequential")

    # That'll be RGBA ... flatten out the alpha
    all_pages = all_pages.flatten(background=255)

    # the PDF is loaded as a very tall, thin image, with the pages joined
    # top-to-bottom ... we loop down the image cutting out each page
    n_pages = all_pages.get("n-pages")
    page_width = all_pages.width
    page_height = all_pages.height / n_pages

    for i in range(0, n_pages):
        page = all_pages.crop(0, i * page_height, page_width, page_height)
        print("writing {}.tif ..".format(i))
        page.write_to_file("{}.tif".format(i))
convert_pdf(sys.argv[1])
On this 2015 laptop with this huge PDF, I see:
$ /usr/bin/time -f %M:%e ../pages.py ~/pics/Audi_US\ R8_2017-2.pdf
writing 0.tif ..
writing 1.tif ..
....
writing 20.tif ..
720788:35.95
So 35s to render the entire document at 300dpi, and a peak memory use of 720MB.

Using wand to convert image to pdf fails after a few successful results

My application works a few times and then errors on every pdf. This is the error I receive:
Exception TypeError: TypeError("object of type 'NoneType' has no len()",) in <bound method Image.__del__ of <wand.image.Image: (empty)>> ignored
And this is the function I use:
def read_pdf(file):
    pre, ext = os.path.splitext(file)
    filename = pre + '.png'
    with Image(filename=file, resolution=200) as pdf:
        amount_of_pages = len(pdf.sequence)
        image = Image(
            width=pdf.width,
            height=pdf.height * amount_of_pages
        )
        for i in range(0, amount_of_pages):
            image.composite(
                pdf.sequence[i],
                top=pdf.height * i,
                left=0
            )
    image.compression_quality = 100
    image.save(filename=filename)
    logging.info('Opened and saved pdf to image: \'' + file + '\'.')
    return filename
This function will correctly convert PDFs to images, but after two or three runs it will crash every time and throw that exception. If I restart the Python script it works again for a few runs. Any help?

The error is caused by the system running out of resources. Wand calls the ImageMagick library, which in turn passes the decoding work to the Ghostscript delegate. Ghostscript is very stable, but it does use a lot of resources and is not happy when run in parallel (my opinion).

Try to architect a solution that allows a clean shutdown between PDF conversions, like a queue worker or a subprocess script. The smallest resource leak can grow out of hand quickly.
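For example, a minimal sketch of the per-task-process idea, reusing the read_pdf function from the question (the multiprocessing approach and the maxtasksperchild setting are my own suggestion, not part of the original answer):

from multiprocessing import Pool

if __name__ == "__main__":
    pdf_files = ["a.pdf", "b.pdf", "c.pdf"]  # placeholder paths
    # maxtasksperchild=1 gives every conversion a fresh worker process,
    # so leaked ImageMagick/Ghostscript resources are released when it exits
    with Pool(processes=1, maxtasksperchild=1) as pool:
        png_files = pool.map(read_pdf, pdf_files)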
Avoid invoking wand.image.Image.sequence. There have been a few known memory leak issues reported. Although many have been fixed, it seems PDF tasks continue to have issues.
From the code posted, it looks like you're just creating a tall image with all pages of a given PDF. I would suggest porting MagickAppendImages directly.
import ctypes
from wand.image import Image
from wand.api import library

# Map C-API to python
library.MagickAppendImages.argtypes = (ctypes.c_void_p, ctypes.c_bool)
library.MagickAppendImages.restype = ctypes.c_void_p

with Image(filename='source.pdf') as pdf:
    # Reset image stack
    library.MagickResetIterator(pdf.wand)
    # Append all pages into one new image
    new_ptr = library.MagickAppendImages(pdf.wand, True)
    library.MagickWriteImage(new_ptr, b'output.png')
    library.DestroyMagickWand(new_ptr)
It seems that I created a new image and did not destroy it. This filled up the memory.
I just had to use with Image(...) as image instead of image = Image(...).
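A sketch of that fix applied to the relevant part of the read_pdf function above:

# Creating the composite inside a context manager means it is destroyed,
# and its memory released, as soon as the block exits.
with Image(width=pdf.width, height=pdf.height * amount_of_pages) as image:
    for i in range(amount_of_pages):
        image.composite(pdf.sequence[i], top=pdf.height * i, left=0)
    image.compression_quality = 100
    image.save(filename=filename)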
