Converting a PDF to a series of images with Python

I'm attempting to use Python to convert a multi-page PDF into a series of JPEGs. I can split the PDF up into individual pages easily enough with available tools, but I haven't been able to find anything that can convert PDFs to images.
PIL does not work, as it can't read PDFs. The two options I've found are using either Ghostscript or ImageMagick through the shell. Neither is a viable option for me, since this program needs to be cross-platform, and I can't be sure either of those programs will be available on the machines it will be installed and used on.
Are there any Python libraries out there that can do this?

ImageMagick has Python bindings.

Here's what worked for me using the Python ghostscript module (installed with 'pip install ghostscript'):
import ghostscript

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    # For a multi-page PDF, put %d in jpeg_output_path (for example
    # "page-%03d.jpg") so that Ghostscript writes one file per page.
    args = ["pdf2jpeg",  # program name; the actual value doesn't matter
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",  # resolution in DPI
            "-sOutputFile=" + jpeg_output_path,
            pdf_input_path]
    ghostscript.Ghostscript(*args)
I also installed Ghostscript 9.18 on my computer; the Python module only wraps the Ghostscript library, so it probably wouldn't have worked otherwise.
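Because the question stresses that Ghostscript may be missing on the target machine, a minimal sketch of the same conversion through the command-line executable could check for it first and fail with a clear message. The helper names and the list of executable names ("gs", "gswin64c", "gswin32c") are assumptions for illustration, not part of the answer above.

```python
import shutil
import subprocess

# Candidate executable names: "gs" on Unix-like systems, "gswin64c" and
# "gswin32c" for the Windows console builds (assumed install names).
GS_CANDIDATES = ("gs", "gswin64c", "gswin32c")


def find_ghostscript(which=shutil.which):
    """Return the path of the first Ghostscript executable found, or None."""
    for name in GS_CANDIDATES:
        path = which(name)
        if path:
            return path
    return None


def build_gs_command(gs_path, pdf_path, output_pattern, dpi=144):
    """Build the argument list that renders each PDF page to a JPEG.

    output_pattern should contain %d (for example "page-%03d.jpg") so that
    Ghostscript writes one file per page.
    """
    return [
        gs_path,
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=jpeg",
        "-r{}".format(dpi),
        "-sOutputFile=" + output_pattern,
        pdf_path,
    ]


def pdf_to_jpegs(pdf_path, output_pattern, dpi=144):
    """Render pdf_path to JPEG files, raising if Ghostscript is missing."""
    gs_path = find_ghostscript()
    if gs_path is None:
        raise RuntimeError("Ghostscript executable not found on PATH")
    subprocess.check_call(build_gs_command(gs_path, pdf_path, output_pattern, dpi))
```

A call such as pdf_to_jpegs("input.pdf", "page-%03d.jpg") would then write page-001.jpg, page-002.jpg and so on, provided Ghostscript is installed.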

You can't avoid the Ghostscript dependency. Even ImageMagick relies on Ghostscript for its PDF reading functions. The reason for this is the complexity of the PDF format: a PDF doesn't just contain bitmap information, but mostly vector shapes, transparencies, etc.
Furthermore, it is quite complex to figure out which of these objects appear on which page.
So the correct rendering of a PDF page is clearly out of scope for a pure Python library.
The good news is that Ghostscript is pre-installed on many Windows and Linux systems, because it is also needed by all those PDF printers (except Adobe Acrobat).

If you're using Linux, some distributions come with a command-line utility called 'pdftopbm' out of the box. Check out netpbm.

Perhaps relevant: http://www.swftools.org/gfx_tutorial.html

Related

Convert from qcow2 to raw with Python

How can I use Python to convert a qcow2 image file into a raw image file?
I know of qemu-img, but I'm curious about any Python libraries that might allow me to avoid asking my users to install that tool. It's not packaged with a default Fedora install, and that's what I'm developing for. If there are no other options however, I'll use qemu-img.

It seems that qemu-img is a necessity for converting qcow2 image files to raw images. I did not find a solution that avoided calling this tool. This isn't a big issue though, because qemu-img is widely available in distros' repositories, and is sometimes packaged with distros. To use this tool from Python, ensure that it's installed on the system and then call it via the subprocess module, like so:
import subprocess

# Assuming file_path is the path to a local qcow2 file
if file_path.endswith('.qcow2'):
    # Strip the '.qcow2' suffix (6 characters) and append '.raw'
    raw_file_path = file_path[:-len('.qcow2')] + '.raw'
    subprocess.call(['qemu-img', 'convert', file_path, raw_file_path])
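A slightly more defensive variant of the same idea is sketched below. It derives the output name with os.path.splitext instead of slicing, checks that qemu-img is on the PATH, and names the input and output formats explicitly with -f and -O. The function names are hypothetical and chosen only for this example.

```python
import os
import shutil
import subprocess


def raw_path_for(qcow2_path):
    """Return the output path: same name with the extension replaced by .raw."""
    root, _ext = os.path.splitext(qcow2_path)
    return root + ".raw"


def convert_qcow2_to_raw(qcow2_path):
    """Convert a qcow2 image to raw with qemu-img and return the new path."""
    if shutil.which("qemu-img") is None:
        raise RuntimeError("qemu-img is not installed or not on PATH")
    raw_path = raw_path_for(qcow2_path)
    # -f names the input format, -O the output format.
    subprocess.check_call(
        ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", qcow2_path, raw_path]
    )
    return raw_path
```

Using check_call rather than call means a failed conversion raises an exception instead of being silently ignored.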

Python3 pdf parsing

I have Python 3.2 installed (I'm not sure if it matters, but it's a 64-bit version) on a Windows machine. I need to open a bunch of PDF files, find certain numbers in the text, and store them. The most work I should have to (and would like to) put in is one day.
Can I parse the PDFs with plain Python without too much hassle?
Is there a library that would achieve this easily?
If it's too complicated to do with this Python installation, I can install a different version, but that requires a lot of extra work, so other solutions are appreciated.

solution to convert PDFs, DOCs, DOCXs into a textual format with python

I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it tough to find a solution that meets all of these criteria:
cross-platform
supports DOC, DOCX and PDF formats at once
easy to use with Python
can be set up on a major shared host

For PDFs, I recommend PDFMiner.
Try the docx module (I have not used it myself).
I am not aware of any pure Python module that can read .doc files.
There are command-line tools to extract text from .doc files: antiword and catdoc (and probably others). If the packages are installed on your shared host, you could use subprocess to shell out to these tools. They are available on Windows via Cygwin.
Apache POI is a Java library that can extract text from Office documents. If your shared host has Java installed, you could write a bit of Java (or Jython) code and execute it using subprocess.
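Shelling out to antiword or catdoc, as suggested above, could look roughly like the following sketch. It assumes both tools write the extracted text to standard output when given a file name, and that decoding the output as UTF-8 is acceptable; the function names are made up for this example.

```python
import shutil
import subprocess

# Command-line extractors named in the answer above, in order of preference.
DOC_TOOLS = ("antiword", "catdoc")


def pick_doc_tool(which=shutil.which):
    """Return the name of the first installed extractor, or None."""
    for tool in DOC_TOOLS:
        if which(tool):
            return tool
    return None


def doc_to_text(doc_path):
    """Extract the text of a .doc file with whichever tool is installed."""
    tool = pick_doc_tool()
    if tool is None:
        raise RuntimeError("neither antiword nor catdoc is installed")
    # Assumption: the tool prints the extracted text to standard output.
    output = subprocess.check_output([tool, doc_path])
    # Assumption: UTF-8 output; undecodable bytes are replaced, not fatal.
    return output.decode("utf-8", errors="replace")
```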

If you can run OpenOffice on the server, you can use unoconv, which converts between any document formats supported by OpenOffice.

One possible solution is to use Google Docs to extract the text contents from binary .doc files. You upload the document to Google Docs and then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it requires no external tools, only network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.

Textract uses the standard extraction tool for each kind of file:
https://github.com/deanmalmgren/textract

Using python to convert PNG's into PDF's

I want to write a python script to convert PNG's into 2-page pdfs (i.e. 2 PNGs per PDF). The software needs to run on both a Mac and Windows 7.
My current solution is using ReportLab, but that doesn't install easily on a Mac. According to its website, it only has a compiled version for Windows. It has a cross-platform version that requires a C compiler to install.
Is there a better way to do this (so that I don't have to install a C compiler on the Mac)? Should I use a different library, or a different language entirely? As long as I can call the program from a python script, I could use any language for the pdf creation. Or, is there a very straightforward (i.e a non-programmer could install it) C compiler that I could install on a Mac?
What do you recommend?

The Unix program convert (part of ImageMagick) can do the conversion:
convert file.png file.pdf
But you said you want it to work on Windows too. The PIL imaging library has a PDF module, so you could try a simple
pilconvert file.png file.pdf
To put more than one image in a PDF, you can work with the library directly; it is quite flexible for resizing, stitching, etc. It is also available for Mac and Windows.
Update 2011-02-15
On my Mac OS X machine, I installed reportlab without difficulty.
% sudo easy_install reportlab
Searching for reportlab
Reading http://pypi.python.org/simple/reportlab/
Reading http://www.reportlab.com/
Best match: reportlab 2.5
Downloading http://pypi.python.org/packages/source/r/reportlab/reportlab-2.5.tar.gz#md5=cdf8b87a6cf1501de1b0a8d341a217d3
Processing reportlab-2.5.tar.gz
So you could use a combination of PIL and Reportlab for your own needs.

ReportLab is one of the better tools for generating PDFs. If it serves your purposes, it is best to stick with it. For installation on a Mac, DarwinPorts (now MacPorts) has a port for ReportLab called py-reportlab. Follow the instructions to install it with the port command; it will install the dependencies by itself.

PyCairo could be a good alternative. You'll have full control and fewer dependencies, but ReportLab is a lot simpler.

Why hasn't anyone tried this?
If you are not too concerned about quality, try the following solution.
import PIL.Image

filepath = "temp.png"
newfilename = 'our.pdf'
# PDF output may reject images with an alpha channel, so convert to RGB first
im = PIL.Image.open(filepath).convert("RGB")
im.save(newfilename, "PDF", quality=100)
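The question asks for two PNGs per PDF, which the snippet above does not cover. A minimal sketch, assuming a reasonably recent Pillow (the maintained PIL fork) whose PDF writer accepts save_all and append_images, might pair the files up and write one PDF per pair. The function names and the output file name format are hypothetical.

```python
def pair_up(paths):
    """Group paths into consecutive pairs; a trailing odd item stands alone."""
    return [paths[i:i + 2] for i in range(0, len(paths), 2)]


def pngs_to_two_page_pdfs(png_paths, pdf_name_format="output_{:03d}.pdf"):
    """Write one PDF per pair of PNGs and return the PDF file names."""
    # Imported here so pair_up stays usable without Pillow installed.
    from PIL import Image

    written = []
    for index, group in enumerate(pair_up(png_paths), start=1):
        # Convert to RGB because PDF output may reject an alpha channel.
        pages = [Image.open(path).convert("RGB") for path in group]
        pdf_name = pdf_name_format.format(index)
        # save_all with append_images writes the extra images as extra pages.
        pages[0].save(pdf_name, "PDF", save_all=True, append_images=pages[1:])
        written.append(pdf_name)
    return written
```

Pillow installs from prebuilt wheels on both Mac and Windows, so this route should not require a C compiler.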

How Can I Programmatically Build a Multi-Page TIFF out of Many Single Page TIFFs, Using Python?

I've found, via Google, numerous people asking the same question, but no solutions. The Python Imaging Library (PIL) has tools for stepping through an already existing multi-page TIFF, but nothing for creating one.
Libraries would hopefully be available on Windows, for Python 2.6.
If there's some freeware out there which will do the trick, I wouldn't mind seeing it, but I was hoping I could accomplish this in Python.

You can use ImageMagick for this (available on Unix and Windows).
A Linux shell command would be
$ convert *.tif multipage.tif
where *.tif matches all your individual TIFF files.
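To drive that command from Python without relying on the shell to expand the wildcard, a rough sketch could expand the pattern with glob and pass the file list to convert through subprocess. It assumes ImageMagick's convert is on the PATH; on Windows a system tool is also named convert, so the full path to ImageMagick's executable may be needed there. The function names are illustrative only.

```python
import glob
import subprocess


def build_convert_command(tiff_paths, output_path):
    """Build the ImageMagick command; pages follow sorted file-name order."""
    return ["convert"] + sorted(tiff_paths) + [output_path]


def merge_tiffs(pattern, output_path):
    """Merge every file matching pattern into one multi-page TIFF."""
    # glob expands the wildcard in Python, so no shell is needed.
    tiff_paths = [p for p in glob.glob(pattern) if p != output_path]
    if not tiff_paths:
        raise ValueError("no files match " + pattern)
    subprocess.check_call(build_convert_command(tiff_paths, output_path))
```

Sorting the file names makes the page order predictable; zero-padded names such as page01.tif and page02.tif keep that order intuitive.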

A freeware option: Irfanview can do it, even via the command line; this allows you to call it from Python.
From the list of changes in version 3.90:
New command line option:
/multitif=(tifname,file1,...,fileN)
Example to create multipage TIF test.tif from 2 other files:
i_view32 /multitif=(c:\test.tif,c:\test1.bmp,c:\dummy.jpg)
New command line option:
/append=tiffile
Example to open c:\test.jpg and append it as (TIF) page to c:\test.tif
i_view32 c:\test.jpg /append=c:\test.tif
I have used it once and know it works, though limits on command-line length apply.

You can use the command-line utility "tiffutil".
