I've been reading about how to build reports and publication-ready articles using Jupyter and Markdown, and I am somewhat confused at this point. I have followed several posts and guides, including:
This blog post
Making publication ready Python Notebooks
Writing academic papers in plain text with Markdown and Jupyter notebook
Right now, I am using the following plugins to have a more readable format:
collapsible headings
python-markdown (breaks when outputting to pdf)
hide code
cite2c
This was all relatively easy to implement, though I am still having issues with python markdown breaking when outputting to pdf. I am wondering:
Given that most of the blog posts and tutorials I found are four or more years old, are there easier approaches or guidelines for outputting Markdown in a Jupyter notebook to PDF (or Word)?
Cross-references do not seem to be possible yet. For instance, I can reference any figure that I label manually, but the numbers are not updated automatically.
Any suggestions are welcomed!
Related
As an accountant, I produce A4 PDF financial reports for clients. The report contains a PDF cover page design, table of contents, blocks of text and many tables of financial data.
To date I have used a mixture of Microsoft Excel and Word to produce these reports, then saved them as PDF and added the PDF cover. The major disadvantage is that I have to edit the tables manually; I would much rather create automated reports from existing data exported from my accounting software.
I would like to move away from Excel-Word and move towards (semi-) automating this through python (potentially pandas and markdown packages) - with markdown or html.
Previously I used LaTeX to produce these reports. However, I found LaTeX challenging when something went wrong: the errors are difficult to understand, and even basic table production can be hard.
I am trying to plan out how I could bring together Python, Markdown and HTML/CSS. I was wondering if anyone else had experience in producing A4 reports in this way and any advice that they could offer. Initially I was drawn to having text saved as .md files and data stored in either MongoDB, pandas dataframes or simply CSV. I would then use the combination of .md and the data to produce a complete report in HTML. However, could HTML be converted into A4 PDF easily? I understand that there is now page CSS functionality for printing, but is this applicable? How would you suggest I can automate the creation of A4 PDF reports?
To answer your questions plainly:
However, could HTML be converted into A4 PDF easily?
Yes, this is possible using pandoc.
I understand that there are now page CSS functionality for printing, but is this applicable?
Not needed if you use a pandoc template, but possible if desired.
How would you suggest I can automate the creation of A4 PDF reports?
I suggest using pandoc and pandoc templates. This will allow you to convert a file containing a mixture of Markdown, LaTeX, HTML, and whatever else you would like directly into a PDF.
More details on how:
Pandoc is a document conversion tool that can do this job very well. It will allow you to convert from html or markdown or LaTeX or a mix of all 3 into pdf or a number of other desired formats. For additional control on how the output looks, you can use a pandoc template. You can find information on how to create a custom template here. Here is an example of how that command works:
pandoc /filepath/doc_name.md -o doc_name.pdf --template /file_path/pandoc-templates/article.latex
This process can be automated with some further effort. For example, you could write some Python code to generate your graphs or tables from source CSV files, then have that code call your pandoc command and build the document.
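As a rough sketch of that automation idea (not a finished tool): the function names, the report.md/report.pdf file names and the pipe-table layout below are my own assumptions for illustration. Only the pandoc call itself mirrors the command shown above, and it still needs pandoc plus a PDF engine such as LaTeX installed.

```python
import csv
import io
import shutil
import subprocess
from pathlib import Path


def csv_to_markdown_table(csv_text):
    """Turn CSV text into a Markdown pipe table (first row is the header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join(["---"] * len(header)) + "|",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def pandoc_command(md_path, pdf_path, template=None):
    """Build the pandoc call as an argument list; the template is optional."""
    cmd = ["pandoc", str(md_path), "-o", str(pdf_path)]
    if template:
        cmd += ["--template", str(template)]
    return cmd


def build_report(intro_md, csv_text, out_dir, template=None):
    """Write report.md (hypothetical name) and render it if pandoc is found."""
    out_dir = Path(out_dir)
    md_path = out_dir / "report.md"
    body = intro_md + "\n\n" + csv_to_markdown_table(csv_text) + "\n"
    md_path.write_text(body, encoding="utf-8")
    cmd = pandoc_command(md_path, out_dir / "report.pdf", template)
    if shutil.which("pandoc"):
        # Raises CalledProcessError if pandoc fails (e.g. no PDF engine).
        subprocess.run(cmd, check=True)
    return cmd
```

You would call something like build_report("# Quarterly report", open("figures.csv").read(), "."), where figures.csv stands in for whatever your accounting software exports.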
Here is how I convert my ipython files with graphics outputs and tables into nice looking PDF files, hiding the code segments:
First install jupyter_contrib_nbextensions with
pip install jupyter_contrib_nbextensions
and the wkhtmltopdf library from https://wkhtmltopdf.org/downloads.html
For example, I use macOS, so I had to install
wkhtmltox-0.12.6-2.macos-cocoa.pkg
from this site.
Now convert your file outputs to HTML hiding your code:
jupyter nbconvert --no-input --to html A4_REPORT.ipynb
(A4_REPORT.ipynb is the notebook you should already have prepared: one that generates some kind of table or graph, or contains inline Markdown segments, and that runs in Jupyter Notebook.)
Now convert this HTML to PDF:
wkhtmltopdf A4_REPORT.html A4_REPORT.pdf
DONE !
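If you want to run both steps from one script, a minimal sketch could look like the following. The function names are made up for illustration, and the --page-size A4 option is my addition to the command above (wkhtmltopdf accepts it, but check the result against your own layout). It assumes nbconvert writes the HTML file next to the notebook, which is its default.

```python
import subprocess
from pathlib import Path


def report_commands(notebook):
    """Return the nbconvert and wkhtmltopdf commands for one notebook."""
    nb = Path(notebook)
    html = nb.with_suffix(".html")
    pdf = nb.with_suffix(".pdf")
    return [
        # Step 1: notebook -> HTML with the code cells hidden.
        ["jupyter", "nbconvert", "--no-input", "--to", "html", str(nb)],
        # Step 2: HTML -> PDF; the page size is stated explicitly here.
        ["wkhtmltopdf", "--page-size", "A4", str(html), str(pdf)],
    ]


def notebook_to_pdf(notebook):
    """Run both steps in order, stopping at the first failure."""
    for cmd in report_commands(notebook):
        subprocess.run(cmd, check=True)
```

Calling notebook_to_pdf("A4_REPORT.ipynb") should then leave A4_REPORT.html and A4_REPORT.pdf beside the notebook.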
Does anyone know how to export RCloud notebooks (not RStudio) to a common format? I’m using it for a project to try to predict if a car will statistically be a good purchase or a lemon, and I’m learning - so I’m trying a lot of new/different code and packages to make graphs, regression charts etc.
I’m new to RCloud, so I want to save this notebook as a reference document/cheat sheet on my laptop so I can reuse the common R commands I used (e.g. how to use the "lapply" command to change vectors to numeric, like "mycarsub[, 1:6] <- lapply(mycarsub[, 1:6], as.numeric)", or "na.omit", etc.). I just want a reference to use for other projects or notebooks in RCloud, RStudio etc.
So I’m wondering if anyone knows how to export it in Text format that is searchable or easily read with common apps (outside of RCloud, or RStudio)? Like export to Word/Libreoffice, HTML etc?
I tried "Share" at the top, but I think it only exports R file types; I’m probably doing it wrong. Or perhaps you have another way to accomplish what I'm trying to do. I tried cutting and pasting, but that doesn't work all the time (user error?). I searched Stack Overflow but only found answers about RStudio or about developers exporting via APIs etc. Hope this is enough info, first post.
RCloud was created to make it easier to share code and for others to learn from existing code, so it includes the ability to search code using Lucene search syntax. Rather than creating another document to keep track of, I would suggest opening multiple RCloud tabs: use one to search, cut and paste from, and the other to code in. You can create multiple tabs by copying and pasting any notebook URL into a new tab.
If you prefer to have a separate document, you can export the RCloud notebooks as an R Source file or a Rmarkdown file using the Advanced menu located in the navigation bar at the far right.
This is literally day 1 of python for me. I've coded in VBA, Java, and Swift in the past, but I am having a particularly hard time following guides online for coding a pdf scraper. Since I have no idea what I am doing, I keep running into a wall every time I want to test out some of the code I've found online.
Basic Info
Windows 7 64bit
python 3.6.0
Spyder3
I have many of the PDF-related packages installed (PyPDF2, pdfminer, pdfquery, pdfrw, etc.)
Goals
To create something in Python that allows me to convert PDFs from a folder into an Excel file (ideally) OR a text file (from which I will use VBA to convert).
Issues
Every time I try some sample code from guides I've found online, I run into syntax errors on the lines where I am calling the PDF that I want to test the code on. Some guide links and error examples are below. Should I be putting my test.pdf into the same folder as the .py file?
How to scrape tables in thousands of PDF files?
I got an invalid syntax error due to "for" on the last line
PDFMiner guide (Link)
runfile('C:/Users/U587208/Desktop/pdffolder/pdfminer.py', wdir='C:/Users/U587208/Desktop/pdffolder')
File "C:/Users/U587208/Desktop/pdffolder/pdfminer.py", line 79
print pdf_to_csv('test.pdf', separator, threshold)
^
SyntaxError: invalid syntax
It seems that the tutorials you are following use Python 2. There are usually few noticeable differences; the biggest is that in Python 3, print became a function, so you need to call it as
print()
I would recommend either changing your version of Python or finding a tutorial for Python 3. Hope this helps.
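Applied to the failing line from the traceback in the question, the fix would look roughly like this. Note that pdf_to_csv below is only a stand-in I wrote so the snippet runs on its own, and the separator and threshold values are placeholders; the real function comes from the tutorial's script.

```python
def pdf_to_csv(filename, separator, threshold):
    # Hypothetical stand-in for the tutorial's function, so this runs alone.
    return separator.join([filename, str(threshold)])


separator = ";"   # placeholder value
threshold = 1.5   # placeholder value

# Python 2 statement form, which is a SyntaxError in Python 3:
#     print pdf_to_csv('test.pdf', separator, threshold)

# Python 3 function form: wrap the whole expression in parentheses.
print(pdf_to_csv("test.pdf", separator, threshold))
```

Other Python 2 idioms in the same tutorial may need similar changes, so fixing print alone might not be enough.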
Here is an example for pdfminer with Python 3.5 that shows how to extract information from a PDF.
But it does not solve the problem with tables you want to export to Excel. Commercial products are probably better in doing that...
I am trying to do this exact same thing! I have been able to convert my PDF to text; however, the formatting is extremely random and messy, and I need the tables to stay intact to be able to write them into Excel data sheets. I am now attempting to convert to XML to see if it will be easier to extract from. If I get anywhere on this I will let you know :)
btw, use python 2 if you're going to use pdfminer. Here's some help with pdfminer https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf
On an Ubuntu server, I want to create PDFs which include other static PDFs. I have tried using ReportLab with pyPdf. Ideally I would use ReportLab to do the whole thing, but importing the PDFs requires their PageCatcher, which has a large recurring fee.
So I use pyPdf to merge a page created with ReportLab and my other pdfs. The problem is that even though this looks fine in Acrobat and Foxit, part of one of the pages prints garbled on a Xerox 7400 color printer. I can't figure out the issue, but would be willing to buy a more integrated solution if it existed and was reasonably priced. I thought PDF Creator Pilot was it until I saw that it was Windows only.
So is there a reasonably priced ($1K or less) solution or a different suggestion?
I have had a lot of success with the Java library iText. They have a great library of samples for pretty much anything you could think of doing with PDF files. This example is for concatenating PDF files and sounds like it would do what you need: http://itextpdf.com/examples/index.php?page=example&id=123. There is also PDFBox which is another great Java based PDF manipulation library.
I realize that you are looking for a Python-based solution, but there may not be many other options. If you are using the Jython interpreter instead of CPython, integrating iText should be trivial. If not, then you could consider calling out to it as a separate process. I realize that may not be ideal for your situation, but I figured I would mention it as an option.
Another non-Python answer. If you are just merging pages, then pdftk does that well (along with a lot of other things).
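Since the question is about a Python workflow, here is a small sketch of driving pdftk from Python as a separate process. The function names and the file names in the usage note are made up for illustration; the argument order follows pdftk's "cat ... output" form, and pdftk has to be installed and on the PATH.

```python
import subprocess


def pdftk_merge_command(inputs, output):
    """Build a pdftk call that concatenates the input PDFs in order."""
    return ["pdftk", *inputs, "cat", "output", output]


def merge_pdfs(inputs, output):
    """Run pdftk; raises CalledProcessError if the merge fails."""
    subprocess.run(pdftk_merge_command(inputs, output), check=True)
```

For example, merge_pdfs(["generated.pdf", "static.pdf"], "merged.pdf") would put the ReportLab-generated page in front of the static one.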
I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So, for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that, you should be fine.
There is an add-on for ReportLab — PageCatcher.