As an accountant, I produce A4 PDF financial reports for clients. The report contains a PDF cover page design, table of contents, blocks of text and many tables of financial data.
To date I have used a mixture of Microsoft Excel and Word to produce these reports, then saved them as PDF and added the PDF cover. The major disadvantage is that I have to edit the tables manually. I would much rather create automated reports based on existing data exported from my accounting software.
I would like to move away from Excel-Word and move towards (semi-) automating this through python (potentially pandas and markdown packages) - with markdown or html.
Previously I used LaTeX to produce these reports. However, I found LaTeX challenging when something went wrong: the errors are difficult to understand, and even basic table production can be hard.
I am trying to plan out how I could bring together python-markdown-html/css. I was wondering if anyone else had experience in producing A4 reports in this way and any advice that they could offer. Initially I was drawn to having text saved as .md files and data stored in either mongoDB, pandas dataframes or simply CSV. I would then use the combination of .md and the data to produce a complete report in HTML. However, could HTML be converted into A4 PDF easily? I understand that there are now page CSS functionality for printing, but is this applicable? How would you suggest I can automate the creation of A4 PDF reports?
To answer your questions plainly:
However, could HTML be converted into A4 PDF easily?
Yes, this is possible using pandoc.
I understand that there are now page CSS functionality for printing, but is this applicable?
Not needed if you use a pandoc template, but possible if desired.
How would you suggest I can automate the creation of A4 PDF reports?
I suggest using pandoc and pandoc templates. This will allow you to convert a file containing a mixture of markdown, LaTeX, HTML, and whatever else you would like directly into a PDF.
More details on how:
Pandoc is a document conversion tool that can do this job very well. It will allow you to convert from html or markdown or LaTeX or a mix of all 3 into pdf or a number of other desired formats. For additional control on how the output looks, you can use a pandoc template. You can find information on how to create a custom template here. Here is an example of how that command works:
pandoc /filepath/doc_name.md -o doc_name.pdf --template /file_path/pandoc-templates/article.latex
This process can be automated with some further effort. For example, you could write some Python code to generate your graphs or tables from source CSV files, then have that code call your pandoc command and build the document.
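As a rough sketch of that idea (not a finished tool): the code below reads a CSV export, turns it into a markdown pipe table, appends it to some introductory text, and calls pandoc. The function names and file paths are made up for illustration. `-V papersize=a4` is the pandoc variable used by the default LaTeX template, so a custom template may need to handle it itself.

```python
import csv
import io
import subprocess


def csv_to_markdown_table(csv_text):
    """Turn CSV text (first row is the header) into a markdown pipe table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def build_pandoc_command(markdown_path, pdf_path, template_path=None):
    """Assemble the pandoc call; the template is optional."""
    command = ["pandoc", markdown_path, "-o", pdf_path, "-V", "papersize=a4"]
    if template_path:
        command += ["--template", template_path]
    return command


def build_report(csv_path, intro_md_path, markdown_path, pdf_path, template_path=None):
    """Combine intro text and a CSV table into one .md file, then run pandoc."""
    with open(intro_md_path, encoding="utf-8") as handle:
        intro = handle.read()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        table = csv_to_markdown_table(handle.read())
    with open(markdown_path, "w", encoding="utf-8") as handle:
        handle.write(intro + "\n\n" + table + "\n")
    # Requires pandoc (and a LaTeX engine) to be installed and on PATH.
    subprocess.run(build_pandoc_command(markdown_path, pdf_path, template_path), check=True)
```

The table builder assumes the CSV has at least a header row and that no cell contains a `|` character.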
Here is how I convert my ipython files with graphics outputs and tables into nice looking PDF files, hiding the code segments:
First install jupyter_contrib_nbextensions with
pip install jupyter_contrib_nbextensions
and the wkhtmltopdf library from https://wkhtmltopdf.org/downloads.html. For example, I use macOS, so I had to install
wkhtmltox-0.12.6-2.macos-cocoa.pkg
from that site.
Now convert your file outputs to HTML hiding your code:
jupyter nbconvert --no-input --to html A4_REPORT.ipynb
(A4_REPORT.ipynb is a notebook you have already prepared. It should run in Jupyter Notebook and produce some kind of table or graph, or contain inline markdown segments.)
Now convert this HTML to PDF:
wkhtmltopdf A4_REPORT.html A4_REPORT.pdf
DONE !
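If you want to run both steps from Python, a small wrapper like the one below might do. This is only a sketch: the function names are mine, and it assumes `jupyter` and `wkhtmltopdf` are on your PATH and that nbconvert writes the HTML file next to the notebook. `--page-size A4` is a wkhtmltopdf option that makes the paper size explicit.

```python
import subprocess
from pathlib import Path


def build_commands(notebook_path):
    """Return the two command lines: notebook -> HTML, then HTML -> PDF."""
    notebook = Path(notebook_path)
    html_path = notebook.with_suffix(".html")
    pdf_path = notebook.with_suffix(".pdf")
    return [
        # --no-input hides the code cells, as in the command above.
        ["jupyter", "nbconvert", "--no-input", "--to", "html", str(notebook)],
        ["wkhtmltopdf", "--page-size", "A4", str(html_path), str(pdf_path)],
    ]


def notebook_to_pdf(notebook_path):
    """Run both conversions; raises CalledProcessError if either step fails."""
    for command in build_commands(notebook_path):
        subprocess.run(command, check=True)
```

Calling `notebook_to_pdf("A4_REPORT.ipynb")` should then leave A4_REPORT.html and A4_REPORT.pdf beside the notebook.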
Related
I am just getting started with Jupyter Notebook and I'm running into an issue when exporting.
In my current notebook, I alternate between code cells and markdown cells (which explain my code).
In the markdown cells, sometimes I will use a little HTML to display a table or a list. I will also use the bold tag <b></b> to emphasize a particular portion of text.
My problem is, when I export this notebook to PDF (via the menu in Jupyter Notebook) all of my HTML gets saved as plaintext.
For example, instead of displaying a table, when exporting to PDF, the HTML will be displayed instead. <tr>Table<tr> <th>part1</th>, etc.
I've tried exporting to HTML instead, but even the HTML file displays the HTML as plaintext.
I tried downloading nbconvert (which is probably what I'm using when I export from the Jupyter GUI anyway) and running it from the terminal, but I still get the same result.
Has anyone run into this problem before?
I tried to export it to html and it worked normally.
Where did you define your HTML? Did you use the Markdown text fields?
Alternatives:
I don't have the nbconverter, but what about exporting it to html and use another tool to convert it to a pdf?
Use markdown language, it provides tables. Link:
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
Consider upgrading your notebook
I fixed this myself.
It turns out that somewhere in the code, there was a <plaintext> tag.
Although it did not run the entire length of the cell, the fact that the plaintext tag was there at all changed the dynamic of the cell.
Next, I had strange formatting errors (Text was of different size and strangely emphasized) when using = as plaintext in the cell. When opening the cell for editing, these = symbols were big bold and blue. This probably has something to do with the markdown language.
This was solved by placing the = on the same line as other text.
I did have to convert the page to HTML, then use a firefox addon to convert to PDF.
Converting to PDF from jupyter notebook uses LaTeX to transcribe the page, and all html is converted to plaintext.
The page appeared as normal with html tables, and normal html in the markdown cell. I just had to be careful with any extraneous tags.
If anyone else encounters this problem, check your html tags, and make sure that you are not accidentally doing something in markdown language.
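For anyone who wants to check their tags automatically, here is a rough sketch using only the standard library. It loads the notebook JSON, looks at the markdown cells, and reports tags that are opened but never closed (such as a stray plaintext tag) or closed without being opened. The names are my own, the list of void tags is deliberately short, and it does not understand markdown itself, so treat its output as a hint rather than a verdict.

```python
import json
from html.parser import HTMLParser

# Tags that never take a closing tag (short, non-exhaustive list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagBalanceChecker(HTMLParser):
    """Track open tags on a stack and record mismatched closing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def check_markdown_cells(notebook):
    """Return {cell index: [problems]} for markdown cells with tag issues."""
    report = {}
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # nbformat stores source as a list of lines
            source = "".join(source)
        checker = TagBalanceChecker()
        checker.feed(source)
        checker.close()
        problems = checker.problems + [f"unclosed <{tag}>" for tag in checker.open_tags]
        if problems:
            report[index] = problems
    return report


def check_notebook_file(path):
    """Convenience wrapper: load an .ipynb file and check it."""
    with open(path, encoding="utf-8") as handle:
        return check_markdown_cells(json.load(handle))
```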
We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects, which were done in PHP, we used PHPExcel, PHPWord and tcpdf for PDF.
What libraries for Python would you recommend for creating these kinds of files? (For Excel and Word it's important to use the Open XML file formats, xlsx and docx.)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc, a Haskell library which supports a much wider range of output and input formats than Docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating Open XML files is as simple as generating a couple of XML files (you can use templates) and zipping them.
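To make that concrete, here is a minimal sketch with the standard library only. It writes the three parts that, as far as I know, make up the smallest valid docx: the content types file, the package relationships file, and the main document. The function name is hypothetical and real reports would need styles, tables and so on, so open the result in Word to confirm it suits you.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

PACKAGE_RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)


def build_minimal_docx(paragraphs):
    """Return the bytes of a docx containing one plain paragraph per string."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(text)}</w:t></w:r></w:p>" for text in paragraphs
    )
    document = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", PACKAGE_RELS)
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()
```

Writing the returned bytes to a file ending in .docx gives something you can rename back to .zip and inspect, just like the files described above.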
The simplest way to generate a PDF is to generate HTML (with CSS and images) and convert it using the wkhtmltopdf tool.
I want to print or save gantt-chart(in pdf format). These charts are generated on web after a particular input. Our chart is a plug-in for Trac. I have used Genshi library to generate charts.
There's an open source Python library for generating PDF files by ReportLab. I've not used it myself, but other questions and answers on SO have revolved around this library, the ReportLab Toolkit.
Can you give more information about your plugin? There is a gantt chart plugin on trac-hacks.org; is that the one you are using, or a custom one? If custom, is it available as Open Source somewhere so we can see what you are doing?
If you implemented this as a wiki macro, you can use the WikiToPdf plugin to do this.
You could use WeasyPrint to convert HTML to PDF. From their example website:
weasyprint http://www.w3.org/TR/CSS21/intro.html CSS21-intro.pdf -s http://weasyprint.org/samples/CSS21-print.css
creates a PDF file based on the HTML page and CSS provided. This is a python implementation.
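WeasyPrint can also be called from Python rather than the command line. The sketch below builds a small HTML report with a CSS `@page` rule that sets the paper size to A4, then hands it to WeasyPrint. The function names, the sample styling and the 2cm margin are my own choices for illustration. WeasyPrint is imported inside the function so the HTML part can be tried without it installed.

```python
from html import escape

# The @page rule is what makes the printed output A4.
PAGE_CSS = """
@page { size: A4; margin: 2cm; }
table { border-collapse: collapse; width: 100%; }
td.num { text-align: right; }
"""


def build_report_html(title, rows):
    """Build an HTML page; rows is a list of (label, amount) pairs."""
    body_rows = "".join(
        f'<tr><td>{escape(label)}</td><td class="num">{amount:,.2f}</td></tr>'
        for label, amount in rows
    )
    return (
        "<!DOCTYPE html><html><head><meta charset='utf-8'>"
        f"<style>{PAGE_CSS}</style></head>"
        f"<body><h1>{escape(title)}</h1><table>{body_rows}</table></body></html>"
    )


def write_pdf(html_text, pdf_path):
    """Render the HTML string to a PDF file with WeasyPrint."""
    from weasyprint import HTML  # third-party package, imported lazily

    HTML(string=html_text).write_pdf(pdf_path)
```

For example, `write_pdf(build_report_html("Profit and loss", [("Sales", 1234.5)]), "report.pdf")` should produce a one-page A4 PDF.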
I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From doing a bit of search, it seems that I can use reportlab for creating content and pyPdf for merging PDF's together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.
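A sketch of that overlay approach, under some assumptions: it uses pypdf (the maintained successor to pyPdf, so not the exact package named above), a single-page static PDF, and field positions that you measure yourself in millimetres from the bottom-left corner of the page. All function and field names are made up for illustration.

```python
import io

MM_TO_PT = 72 / 25.4  # PDF coordinates are in points (1/72 inch)


def mm_to_pt(mm):
    """Convert millimetres to PDF points."""
    return mm * MM_TO_PT


def field_positions(fields_mm):
    """Convert {name: (x_mm, y_mm)} into {name: (x_pt, y_pt)}."""
    return {name: (mm_to_pt(x), mm_to_pt(y)) for name, (x, y) in fields_mm.items()}


def make_overlay(values, positions):
    """Draw the dynamic text on an otherwise empty A4 page with ReportLab."""
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    buffer = io.BytesIO()
    page = canvas.Canvas(buffer, pagesize=A4)
    for name, text in values.items():
        x, y = positions[name]
        page.drawString(x, y, str(text))
    page.showPage()
    page.save()
    buffer.seek(0)
    return buffer


def stamp(static_pdf_path, overlay_buffer, output_path):
    """Merge the overlay page onto the first page of the static PDF."""
    from pypdf import PdfReader, PdfWriter

    base_page = PdfReader(static_pdf_path).pages[0]
    base_page.merge_page(PdfReader(overlay_buffer).pages[0])
    writer = PdfWriter()
    writer.add_page(base_page)
    with open(output_path, "wb") as handle:
        writer.write(handle)
```

The static PDF stays freely editable by the user. Only the field positions need updating if the layout changes.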
Look into Docutils and reStructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py
I've used this. It creates very beautiful documents and the markup is extensible! Later you could take the same source and run it through rst2html to create a website out of it!
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.
I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that, you should be fine.
There is an add-on for ReportLab — PageCatcher.