What program to write pdf including other pdf on Linux from Python? - python

On an Ubuntu server, I want to create pdfs which include other static pdfs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but in order to import the pdfs requires their PageCatcher which has a large recurring fee.
So I use pyPdf to merge a page created with ReportLab and my other pdfs. The problem is that even though this looks fine in Acrobat and Foxit, part of one of the pages prints garbled on a Xerox 7400 color printer. I can't figure out the issue, but would be willing to buy a more integrated solution if it existed and was reasonably priced. I thought PDF Creator Pilot was it until I saw that it was Windows only.
So is there a reasonably priced ($1K or less) solution or a different suggestion?

I have had a lot of success with the Java library iText. They have a great library of samples for pretty much anything you could think of doing with PDF files. This example is for concatenating PDF files and sounds like it would do what you need: http://itextpdf.com/examples/index.php?page=example&id=123. There is also PDFBox which is another great Java based PDF manipulation library.
I realize that you are looking for a Python based solution but there may not be many other options. If you are using the Jython interpreter instead of CPython, integrating in iText should be trivial. If not, then you could consider calling out to it as a separate process. I realize that may not be idea for your situation but I figured I would mention it as an option.

Another non-Python answer. If you are just merging pages, then pdftk does that well (along with a lot of other things).

Related

How can Adobe Acrobat do a such a good job at merging PDF files?

In my work I often find myself at bulk merging multiple A4 event tickets PDF.
For the next example I want to bulk merge 500 pdfs of 2Mb each.
The issue is that no matter what tool I use (PDFtk, GhostScript, PDFSam,..) the resulting PDF is of approximately 1Gb.
If I do the same job with Acrobat Pro (trial version) the output is of 8Mb without any loss of quality what so ever.
How can Acrobat do such a great job? And how can I replicate this behaviour?
Does that have anything to do with the fact that the pdfs are all in the same layout with the same images and the only variables are texts? (seat and row deatails, etc)
I would like to be able to do it in a Linux environment with opensource tools, I even tried to build a Python script to do that using PyPDF2 library but apart from the big size I also lose some text details getting replaced by blank squares.
I've been on this for a couple of days now, does anyone have any idea on how to replicate the Acrobat work?
Thanks in advance
By spotting duplication of common structures (images, fonts) etc. Especially if your 500 files come from a common source, as you suggest they do.
Try...
cpdf in1.pdf in2.pdf in3.pdf etc... -o out.pdf
If the result is still large, you can explicitly do:
cpdf -squeeze out.pdf -o small.pdf
(Cpdf is free for non-commercial use, but only its core, CamlPDF, is fully open source.)

Automated OCR program or off-the-shelf tool to scan PDFs/images and output into excel documents

I'm new to programming and recently started getting into Python a lot more seriously. However, I've done some projects that required programming in my company, so I have some background on how it works (or how to scour the internet! lol).
Recently however, we have had a client that sends us invoices in PDF formats and we would like to automate all the invoices to compile into one .csv file.
I've been picking up a few OCR codes (I ran my first image-to-text output recently), however I don't think I'm 100% capable of creating such automation yet since I'm still very fresh in programming. It would require at least a few weeks, and I'm not sure if it's worth it if we could just ask the client to set up a more accurate excel spreadsheet to send over every time.
That's why I'm turning to an already available OCR tool. I recently found this gem: https://www.pdftoexcel.com/ however it is a very manual process and not as automated as we could like. If there is a way to program a script to upload an available PDF file from a certain folder in order to upload it to the website and export it into an Excel file every time we receive an invoice, would it be possible to share?
It would also be a big plus if there would be a way to upload a batch of invoices and identify different charges, providing a summary across the scanned invoices, particularly in the categories
I hope what I'm asking for makes sense. Let me know if you'd require more clarification.
Cheers
There's lots of stuff available for Python, if you have a quick search on Google or StackOverflow. I believe I used Tesseract OCR in the past.
My experience is that you will get serviceable OCR with some of the popular Python libraries, but the excellent stuff will come with a price tag.
Try some tests with your PDF invoices, but if you are getting even slightly questionable results, you may have to consider more expensive alternatives (or even standalone machinery!).
If the client is sending you clear, nicely formatted PDFs with a good font, I don't see why free Python libraries wouldn't be sufficient.

HTML to RTF string using Python

I am looking for a way to convert HTML text to RTF string. Is there any libraries that does this job. I get html content dynamically in my project and need it to be rendered in RTF format. I am using HTML parser to convert HTML text to normal string and then have trying to use PyRTF for conversion to RTF format. Is there any better way that this can be done.Thanks in advance.
RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found are worth a look: If you're using a Mac, the textutil command line program is actually one of the better and simpler tools I've seen.
textutil -convert html filename.rtf -output filename.html
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.
There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!
I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
Hope this helps.

Any good pdf417 Barcode libraries for Python?

I'm looking for a good python module to generate pdf417 barcodes. Has anyone used one they liked?
Ideally I would like one with as few dependencies as possible, and one that runs on both linux and MacOSX.
We recently had to approach this problem as well, and being a Python shop we wanted a Python solution. It become clear the elaphe is the project that had the potential to actually accomplish pdf 417 barcode.
However what we found was it errors by todays standards, and so we entered the hunt to fix the library. Turns out elaphe must generate an outdated form of *.eps post script that can't be interpreted by ghost script and this is where the bar code generation fails.
Well fortunately elphae uses a common library behind the scenes called Barcode Writer in Pure PostScript # http://bwipp.terryburton.co.uk
This common backend library which has many projects in multi-languages using it to generate projects. The fix specifically for us was to fork elaphe, and correct it's *.eps file generation.
To determine what is broken in the *.eps, look at this other site that is made using postscriptbarcode, and it let's you generate the pdf417 barcode online (as well as other formats): http://www.terryburton.co.uk/barcodewriter/generator/
Once you generate a pdf417 barcode it gives you the option to download the .png, .jpg, and YES the .eps file!
Using this .eps file you can pipe it to ghost script and tweak the parameterization to get the exact pdf417 barcode you are looking for. Then take this result and integrate it into the elaphe library and actually get a pull request on that thing ....
Seems to be a bit of work, but nothing that can't be knocked out in an afternoon. It is ideal to get the elaphe library back in shape to generate these without making this enhancement.
Please note that the performance of this approach for us is a few seconds to generate this barcode due to the fact it creates the 2000 line eps file and pipes it to ghost script which generates another image file that we send back as the final barcode result. This is not as performance as code128 with reportlab.
Perhaps room for optimizations: Is pillow faster than PIL in anyway? Do we need all the parts of the eps file to generate the barcode of type pdf417? Other ways to optimize?
Anyway, great question Ken and I hope you find this to be a great answer.
I guess the issue in elaphe reported by Matteius in 2013 has been fixed, since the issues and commit logs show updates on the pdf417 topic since then.
Anyway, there are now a few other options (got the list with either pip search elaphe or pip search pdf417) :
elaphe ;
elaphe3 (fork of elaphe tested against python3) ;
candybar (no documentation ? also a webservice) ;
pdf417gen ;
treepoem (about the name : barcode -> bark ode -> tree poem =D ) — edit : didn't dig the issue, but as of today generation of PDF417 seems broken.
All but pdf417gen support several types of barcodes.
Note that the documentation of bwipp (on which are based elaphe and treepoem) only mentions 5 levels of error correction (1 to 5), while pdf417gen claims to support 9 security levels (0 to 8).
Reportlab does have an extension called rlbarcode, but this one does not include support for pdf417 codes. I do not know of any other extension for reportlab including support for pdf417 bar codes.
Anyway, if you are interested in generation of pdf417 codes from python, you may be interested in this project: elaphe.
I have still not tested it (in fact, I need to generate pdf417 from python, and I found this thread as well as the elaphe project page) I am going to download the elaphe tools in order to test it right now.

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.

Categories