Edit pdf file with PDFMiner - python

I was wondering if it's possible to edit an existing PDF file with PDFMiner. It seems to be a powerful tool, but the documentation is poor or nonexistent.
I found some examples, but they don't match my goal. I want to make a search engine that changes the color of my keywords in the PDF file.

PDFMiner is not for altering existing PDF files, but for extracting text and metadata from them. The closest solution to what you're looking for using PDFMiner would probably be to use the included pdf2txt.py tool to extract the text and then mark that up to highlight your keywords.
There's also the simple option of just using a PDF viewer with the built-in ability to find and highlight multiple search terms. I think Adobe Acrobat can do it, but I'm not sure about others.
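As a rough sketch of the "extract the text and then mark it up" idea: the snippet below takes text that has already been extracted (for example with pdf2txt.py) and wraps each keyword in a colored HTML span. The function name `highlight_keywords` and the choice of HTML as the markup are my own assumptions, not part of PDFMiner.

```python
import html
import re


def highlight_keywords(text, keywords, color="red"):
    """Return an HTML fragment in which every keyword occurrence is colored.

    Hypothetical helper: `text` is plain text already extracted from the PDF.
    """
    # Ignore empty keywords; try longer keywords first so they win over prefixes.
    words = sorted((k for k in keywords if k), key=len, reverse=True)
    if not words:
        return html.escape(text)
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)

    pieces = []
    pos = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches, then wrap the match.
        pieces.append(html.escape(text[pos:match.start()]))
        pieces.append('<span style="color:%s">%s</span>'
                      % (color, html.escape(match.group(0))))
        pos = match.end()
    pieces.append(html.escape(text[pos:]))
    return "".join(pieces)
```

Calling `highlight_keywords(extracted_text, ["pdf", "miner"])` gives a string you can drop into an HTML page. The original PDF is left untouched.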

No, pdfminer doesn't support editing.
However, it might be a lot easier if you don't try to modify the pdf, but use PDFOpenParameters instead: http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf
You can use url fragment identifiers like this:
http://www.example.com/test.pdf#search=foo
Or even when opening Acrobat from the command line (Windows example):
AcroRd32.exe /A "search=foo" test.pdf
You could also open the PDF at a specific page and highlight a certain area of that page (but not different areas on different pages at the same time).
(ok, I know it's not really a solution for the question you asked, but if this is sufficient for your needs, it's a lot simpler)
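If you generate these links from a script, something like the following sketch could work. The helper name `pdf_search_url` is made up, and the quoting of multi-word search lists reflects my reading of Adobe's open-parameters document. Viewers other than Acrobat may ignore or handle the fragment differently.

```python
from urllib.parse import quote


def pdf_search_url(pdf_url, words, page=None):
    """Build a URL whose fragment asks the viewer to search for `words`.

    Hypothetical helper; `page` optionally adds a page=N open parameter.
    """
    params = []
    if page is not None:
        params.append("page=%d" % page)
    # The word list is space-separated and wrapped in quotation marks;
    # spaces are percent-encoded so the URL stays valid.
    params.append('search="%s"' % quote(" ".join(words)))
    return pdf_url + "#" + "&".join(params)


print(pdf_search_url("http://www.example.com/test.pdf", ["foo"]))
```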

Related

HTML to RTF string using Python

I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this job? I get HTML content dynamically in my project and need it rendered in RTF format. I am using an HTML parser to convert the HTML text to a plain string and am then trying to use PyRTF to convert it to RTF. Is there a better way to do this? Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, but trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found worth a look. If you're using a Mac, the textutil command-line program is actually one of the better and simpler tools I've seen.
textutil -convert html filename.rtf -output filename.html
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
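Since the question is about going from HTML to RTF from Python, here is a minimal sketch of driving textutil in that direction with subprocess. The helper names are my own; the only textutil options used are -convert and -output, as in the command above. It only runs on macOS, where textutil is installed.

```python
import shutil
import subprocess


def textutil_command(src, dst, fmt):
    """Build the textutil argument list (hypothetical helper)."""
    return ["textutil", "-convert", fmt, src, "-output", dst]


def convert_with_textutil(src, dst, fmt="rtf"):
    """Convert `src` to `dst` in format `fmt`; raises if textutil is missing."""
    if shutil.which("textutil") is None:
        raise RuntimeError("textutil not found (it ships with macOS)")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(textutil_command(src, dst, fmt), check=True)
```

For example, `convert_with_textutil("page.html", "page.rtf")` would write the RTF next to the HTML file.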
There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
Hope this helps.
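To illustrate the tag-by-tag idea without depending on a particular library version, here is a small sketch that swaps in the standard library's html.parser for Beautiful Soup and writes raw RTF control words by hand instead of using the PyRTF API. It only handles bold, italic, paragraphs and line breaks, and all names in it are my own. Treat it as a starting point, not a full converter.

```python
import struct
from html.parser import HTMLParser


def rtf_escape(text):
    """Escape RTF special characters and encode non-ASCII as \\uN? units."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # RTF expects signed 16-bit UTF-16 code units.
            for (unit,) in struct.iter_unpack("<h", ch.encode("utf-16-le")):
                out.append("\\u%d?" % unit)
    return "".join(out)


class HtmlToRtf(HTMLParser):
    """Walks the HTML tag by tag and emits matching RTF control words."""

    OPEN = {"b": "\\b ", "strong": "\\b ", "i": "\\i ", "em": "\\i "}
    CLOSE = {"b": "\\b0 ", "strong": "\\b0 ", "i": "\\i0 ", "em": "\\i0 "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.OPEN:
            self.parts.append(self.OPEN[tag])
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in self.CLOSE:
            self.parts.append(self.CLOSE[tag])
        elif tag == "p":
            self.parts.append("\\par\n")

    def handle_data(self, data):
        self.parts.append(rtf_escape(data))


def html_to_rtf(html_text):
    parser = HtmlToRtf()
    parser.feed(html_text)
    parser.close()
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Helvetica;}}\\f0 "
    return header + "".join(parser.parts) + "}"


print(html_to_rtf("<p>Hello <b>world</b></p>"))
```

Tables, lists, fonts and colors would each need their own handling, which is where a dedicated library earns its keep.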

text extraction project - best tool for extracting only specific rows / items out of a PDF?

I'm working on a project that is going to extract specified text from a pdf document. I have no experience with this type of extraction. One issue is that we don't just want a dump of all the text in the document. Rather, is there a way to extract only certain fields in the pdf? Is there a notion of pdf templates that could be used for something like this?
I'm trying to use Apple's Automator - this is able to get all the text but not specified text. Ideally, I would like someone in Pages to have, for example, 30 discrete rows of text, have 20 of those rows be specified as 'catalog item', and have our Automator script take ONLY those twenty lines.
Any ideas on best workflow / extraction tools for this? I would prefer only consumer level items be used such as Apple Pages, Automator, and ruby or python as a scripting language.
thx
edit #1
It looks like tagged PDFs might be one way to do this - not sure how well supported this is in Apple Pages.
With Python, the best choice would probably be PDFMiner. It can extract the coordinates for every text string, so you can work out the rectangles in your form on your own and pick out what falls within them. It's all pretty low level, but PDF is unfortunately a pretty low level format.
Be warned that unless you already know a lot about the structure of PDF, you'll find the API and documentation rather scanty. Look around for usage examples, including here on SO.
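A rough sketch of the "pick out what falls within your rectangles" approach is below. It assumes the maintained pdfminer.six fork (its `extract_pages` and `LTTextContainer`), which may differ from the PDFMiner version you have installed. The helper names and the rectangle values are made up for illustration.

```python
def inside(bbox, rect):
    """True if bbox (x0, y0, x1, y1) lies entirely within rect."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = rect
    return rx0 <= x0 and ry0 <= y0 and x1 <= rx1 and y1 <= ry1


def pick_text(items, page_no, rect):
    """Keep the text of items on `page_no` whose box falls inside `rect`."""
    return [text for page, bbox, text in items
            if page == page_no and inside(bbox, rect)]


def text_items(pdf_path):
    """Yield (page number, bbox, text) for each text box in the PDF.

    Assumes pdfminer.six is installed; imported lazily so the pure
    helpers above can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_no, page in enumerate(extract_pages(pdf_path), start=1):
        for element in page:
            if isinstance(element, LTTextContainer):
                # bbox is in PDF points with the origin at the bottom left.
                yield page_no, element.bbox, element.get_text().strip()
```

Usage would look like `pick_text(list(text_items("catalog.pdf")), 1, (0, 600, 612, 792))`, where the file name and the rectangle are placeholders you would replace with your own form layout.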
For Ruby you might try pdf-reader for parsing a PDF and accessing both metadata and content. Extracting the specific items you're interested in is another story, but how to go about doing that depends highly on what format of data you're expecting.
You can use Origami in Ruby, a framework designed to parse, analyze, and forge PDF documents, or the Python equivalent: Origapy, a simple Python interface for the Ruby based Origami.

Hide information in a PDF file in Python

In Python, I have files generated by ReportLab. Now I need to extract some pages from that PDF and hide confidential information.
I can create a PDF file with blacked-out spots and use pyPdf's mergePage, but people can still select and copy-paste the information under the blacked-out spots.
Is there a way to make those spots completely confidential?
For example, I need to hide addresses on the pages. How would I do it?
Thanks,
Basically you'll have to remove the corresponding text drawing commands in the PDF's page content stream. It's much easier to generate the pages twice, once with the confidential information and once without it.
It might be possible (I don't know ReportLab well enough) to specially craft the PDF in a way that makes the confidential information more easily accessible (e.g. as separate XObjects) for deletion. Still, you'd have to do pretty low-level operations on the PDF -- which I would advise against.
(Sorry, I was not able to log on when I posted the question...)
Unfortunately, the document cannot be regenerated at will (context sensitive), and those PDF files (about 35) are 3000+ pages.
I was thinking about using pdf2ps and then ps2pdf to convert back, but there is a lot of quality loss.
pdf2ps -dLanguageLevel=3 input.pdf - | ps2pdf14 - output.pdf
And if I use "pdftops" instead, the text is still selectable. If there is a way to make it non-selectable like with "pdf2ps" but with better quality, that will do too.

how to extract formatted text content from PDF

How can I extract the text content (not images) from a PDF while (roughly) maintaining the style and layout like Google Docs can?
To extract the text from the PDF AND get its position you can use PDFMiner. PDFMiner can also export the PDF directly to HTML, keeping the text in the right position.
I don't know your use case, but there are a lot of problems you can encounter when doing this because PDF is really presentation oriented and not content oriented; the text flow is not continuous. So, if you want the text to be editable, it will not be an easy task.
Have you tried the pyPDF or ReportLab PDF libraries? I personally have not used them but you can have a go at them.
Xpdf has a utility called pdftotext that does a great job. http://foolabs.com/xpdf/download.html
If you want to do it just like Google:
Google converts the PDF to an image, and then overlays the image, where text used to be, with JavaScript highlightable areas (which is about like Voodoo magic). The areas appear to be text when you scroll over them with your cursor, but they're not. This might not help you to know, but that's how they do it. If you want to reverse engineer it, you might start with https://www.mercurial-scm.org/ On the home page, they do the same thing with JavaScript to make the text highlightable and copyable. You can extract the text from the PDF and find its location on the page with one of the libraries mentioned in the other answers. Then you can overlay an extracted image of the file with the same style of JavaScript areas.
If you don't have your heart set on doing this with Python, Ghostscript can do this for you. Check out ps2ascii (a script that comes with Ghostscript and also accepts PDF input) to get the plain text. Styles are more complicated as they can be specified in a few different ways.
Acrobat Professional can do the job. In the "File" menu, choose export. Then, choose Text.

How to include a page from an existing PDF in a generated PDF document in Python

I am using the ReportLab toolkit in Python to generate some reports in PDF format. I want to include some predefined parts of documents already published in PDF format in the generated PDF file. Is it possible (and how) to accomplish this in ReportLab or with another Python library?
I know I can use other tools like PDF Toolkit (pdftk), but I am looking for a Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that you should be fine.
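For reference, placing an existing page onto a ReportLab canvas with pdfrw looks roughly like the sketch below. It follows the pattern from the examples on the pdfrw project page (`pagexobj` and `makerl`), but the helper names, the scale-to-fit logic and the file names are my own, so check them against the version you install.

```python
def fit_scale(src_w, src_h, box_w, box_h):
    """Largest uniform scale at which a src_w x src_h page fits the box."""
    return min(box_w / src_w, box_h / src_h)


def place_pdf_page(c, src_path, page_index, x, y, box_w, box_h):
    """Draw one page of an existing PDF onto ReportLab canvas `c`.

    Assumes pdfrw is installed; imported lazily so fit_scale() can be
    used on its own. (x, y) is the lower-left corner of the target box.
    """
    from pdfrw import PdfReader
    from pdfrw.buildxobj import pagexobj
    from pdfrw.toreportlab import makerl

    # Wrap the source page as a Form XObject that ReportLab can draw.
    xobj = pagexobj(PdfReader(src_path).pages[page_index])
    src_w = float(xobj.BBox[2]) - float(xobj.BBox[0])
    src_h = float(xobj.BBox[3]) - float(xobj.BBox[1])
    scale = fit_scale(src_w, src_h, box_w, box_h)

    c.saveState()
    c.translate(x, y)
    c.scale(scale, scale)
    c.doForm(makerl(c, xobj))
    c.restoreState()
```

With a `reportlab.pdfgen.canvas.Canvas("report.pdf")` as `c`, a call such as `place_pdf_page(c, "published.pdf", 0, 72, 72, 300, 400)` followed by `c.showPage()` and `c.save()` would embed the first page of the source file. Both file names are placeholders.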
There is an add-on for ReportLab — PageCatcher.
