I have a one-page OpenOffice document which is a POD template. Basically, I use this template to replace tags in the template with automatically generated data in the rendered document.
This is how it works:
# loads the POD OpenOffice template engine
from appy.pod.renderer import Renderer as OORenderer
# defines data for template tags
ctx = {'template_tag': 'my_data'}
# renders the template in my_model.odt using defined data
r = OORenderer('my_model.odt', ctx, 'rendered_file.pdf')
# saves into the specified PDF file
r.run()
I would like to generate several pages like that, and merge them into a single PDF file.
Is there a way to do it using POD?
Or maybe can I programmatically merge generated PDF files into a single file?
Thank you!
To "merge them into a single PDF file", just try pyPDF.
A Pure-Python library built as a PDF toolkit. It is capable of:
extracting document information (title, author, ...),
splitting documents page by page,
merging documents page by page,
cropping pages,
merging multiple pages into a single page,
encrypting and decrypting PDF files.
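As a rough illustration of the "merging documents page by page" item, here is a minimal sketch. It assumes the `pypdf` package (the maintained successor of pyPDF) rather than the original pyPDF module, and the function name `merge_pdfs` and the file names in the usage comment are made up for the example.

```python
def merge_pdfs(input_paths, output_path):
    """Append every page of each input PDF, in order, to one output PDF."""
    # Assumption: the `pypdf` package is installed. It is imported lazily so
    # that this sketch can be loaded even where the package is missing.
    from pypdf import PdfReader, PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
    with open(output_path, "wb") as fh:
        writer.write(fh)
    return output_path


# Hypothetical usage: render each page with POD first, then merge.
# merge_pdfs(["page1.pdf", "page2.pdf"], "merged.pdf")
```

Each page would be rendered to its own PDF with the POD renderer first, and the resulting files passed to the function in the order they should appear.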
I have a use case. Say there is a PDF report containing test data for some manufacturing components,
and this PDF report is loaded into a database by internally developed software.
We need to develop a reconciliation program that compares the data in the PDF report against the database. We can assume the PDF file follows a fixed template.
If the PDF contains many tables and some raw text, how would MySQL store this data: in one table or in several?
Please suggest an approach (preferably in Python) for comparing the data.
Have a look at the example in "Finding and extracting specific text from URL PDF files, without downloading or writing (solution)" and see if it helps. It worked efficiently for me. That example reads the PDF from a URL, but you could simply change the input source to your DB. In your case you can remove the two if statements under the if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal): line. You mention that your PDFs share the same template, so if you want to extract text from one specific area of the template, use the commented-out print statement to find the coordinates of the data you want. Then, as in the example, use those coordinates in if statements.
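The coordinate approach can be sketched roughly as follows. This is not the code from the linked example, only a minimal version of the same idea, assuming pdfminer.six is installed. The helper names `in_region` and `extract_region_text` and the region values in the usage comment are hypothetical.

```python
def in_region(bbox, region):
    """Return True if bbox (x0, y0, x1, y1) lies entirely inside region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def extract_region_text(pdf_file, region):
    """Collect the text of every horizontal text box inside region.

    pdf_file may be a path or a binary file-like object, for example
    io.BytesIO(blob) for a PDF loaded from a database column.
    """
    # Assumption: pdfminer.six is installed; imported lazily on purpose.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBoxHorizontal

    found = []
    for page_layout in extract_pages(pdf_file):
        for obj in page_layout:
            if isinstance(obj, LTTextBoxHorizontal):
                # Uncomment to discover the coordinates used by the template:
                # print(obj.bbox, obj.get_text())
                if in_region(obj.bbox, region):
                    found.append(obj.get_text().strip())
    return found


# Hypothetical usage: the region must be measured on the real template.
# values = extract_region_text("report.pdf", (50, 700, 300, 750))
```

The extracted strings can then be compared with the values queried from the database for the same report.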
I want to automate some report creation. Some elements that I need in the report are saved as rich text, that is, as an HTML file. There are a couple of libraries that convert HTML to PDF, such as html2pdf or pdfforge. However, I would also like to add extra information to the report that is not in this HTML file, for example a title or some information queried from the DB.
Does anyone have a suggestion for how to do this?
Thanks in advance.
I'm trying to convert a PDF into HTML so that I can extract the headings and subheadings from the tags. The PDF document was generated by Microsoft Word, so I'm pretty sure there must be a way to get those headings.
I have tried parsing with Apache Tika and PDFMiner.six, but the HTML I get doesn't contain any tags that I could use to extract the headings and subheadings of the document.
I wonder if there is a way to do it. I would appreciate any help. Thank you.
I suggest using GROBID, a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents, with a particular focus on technical and scientific publications.
A simple python client for GROBID REST services is available at https://github.com/kermitt2/grobid-client-python
This Python client can be used to process a set of PDFs in a given directory with the GROBID service. Results are written to a given output directory and include the resulting XML TEI representation of each PDF.
Hope this helps.
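Once the TEI XML is available, pulling out the headings might look like the sketch below. It uses only the standard library. The `SAMPLE_TEI` string is hand-written for illustration and is not real GROBID output; it assumes section titles appear as `head` elements in the TEI namespace, and the `n` numbering attribute may be absent in practice.

```python
import xml.etree.ElementTree as ET

TEI_NAMESPACE = "http://www.tei-c.org/ns/1.0"

# Hand-written sample, assumed to resemble the structure of a TEI result.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head n="1">Introduction</head><p>Some text.</p></div>
    <div><head n="1.1">Background</head><p>More text.</p></div>
  </body></text>
</TEI>"""


def extract_headings(tei_xml):
    """Return (number, title) pairs for every <head> element in the TEI."""
    root = ET.fromstring(tei_xml)
    headings = []
    for head in root.iter("{%s}head" % TEI_NAMESPACE):
        title = "".join(head.itertext()).strip()
        # The "n" attribute is None when the section is not numbered.
        headings.append((head.get("n"), title))
    return headings


for number, title in extract_headings(SAMPLE_TEI):
    print(number, title)
```

When numbering is present, the depth of a heading can be guessed from the number of dots in it, which would separate headings from subheadings.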
I've been using WeasyPrint for PDF generation successfully, until the document reaches a certain size, which is a common use case in my app. At that point the PDF generation takes so long (more than 10s) that the connection with the browser breaks and the download becomes impossible.
I suppose I must stream the file creation and return a Django StreamingHttpResponse (do you agree?). I would rather not generate the PDF in advance, because it is built from baskets whose items users frequently add or delete.
But how can I stream the file creation with WeasyPrint? Even if I cut my sourceHtml string into parts, how do I write the PDF step by step?
I render a django template and generate the pdf from it:
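from django.http import HttpResponse
from weasyprint import HTML
# `template`, `my_objects` and `name` are defined elsewhere in the view
sourceHtml = template.render(my_objects)
# write_pdf() called without a target returns the whole PDF as bytes
outhtml = HTML(string=sourceHtml).write_pdf()
response = HttpResponse(outhtml, content_type='application/pdf')
response['Content-Disposition'] = u'attachment; filename="{}.pdf"'.format(name)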
Or is there another way to solve this problem?
Thanks!
I asked on the issue tracker: https://github.com/Kozea/WeasyPrint/issues/416
Streaming the PDF while it is being generated is not doable. The suggested workaround is to
split the download into two steps: one route asynchronously generates the document and stores it on the filesystem, the second route downloads the generated document. When the document is not generated yet, you can hide the second link and display something like "the document is not generated yet" instead.
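The two-step idea could be sketched like this, independent of any web framework. It is a minimal sketch under assumptions: a background thread stands in for whatever task runner the app uses, and `start_generation`, `is_ready` and the `render_pdf` callable are hypothetical names. In the Django view, `render_pdf` would wrap the `HTML(string=sourceHtml).write_pdf()` call.

```python
import os
import threading


def start_generation(job_id, render_pdf, out_dir):
    """First route: start rendering in the background and return at once.

    render_pdf is any callable returning the PDF as bytes.
    """
    final_path = os.path.join(out_dir, "{}.pdf".format(job_id))

    def worker():
        data = render_pdf()
        # Write to a temporary name first, so the final file only appears
        # once it is complete.
        tmp_path = final_path + ".part"
        with open(tmp_path, "wb") as fh:
            fh.write(data)
        os.replace(tmp_path, final_path)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return final_path, thread


def is_ready(final_path):
    """Second route: offer the download link only when this returns True."""
    return os.path.exists(final_path)
```

The first route would call `start_generation` and respond immediately. The second route would check `is_ready` and either serve the file or show the "not generated yet" message.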
I have information that pops up in a fancybox, e.g. a user form (which includes images, the user's address, phone number, name, etc.). How can I convert that into a PDF and let the user download the PDF file? Can this be achieved using jQuery?
I use DataTables to show rows of user information; clicking an individual row pops it up in a fancybox. There is a DataTables plugin called datatools which lets the user convert the table into a CSV, XLS or PDF file, and which uses an 'swf' file. Is there some way to convert an HTML rendered page to PDF?
I am using Python 2.7 and Flask.
You can convert HTML to PDF using wkhtmltopdf.
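A minimal sketch of calling the wkhtmltopdf command-line tool from Python is shown below. It assumes the `wkhtmltopdf` binary is installed and on the PATH. It is written for Python 3 (`subprocess.run` does not exist in Python 2.7), and the helper names `build_command` and `html_file_to_pdf` are made up for the example.

```python
import shutil
import subprocess


def build_command(html_path, pdf_path, binary="wkhtmltopdf"):
    """Build the command line: wkhtmltopdf --quiet <input> <output>."""
    return [binary, "--quiet", html_path, pdf_path]


def html_file_to_pdf(html_path, pdf_path):
    """Convert an HTML file (or URL) to a PDF file using wkhtmltopdf."""
    if shutil.which("wkhtmltopdf") is None:
        raise RuntimeError("wkhtmltopdf was not found on the PATH")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(build_command(html_path, pdf_path), check=True)
    return pdf_path


# Hypothetical usage inside a Flask view: render the form to an HTML file,
# convert it, then return the PDF with flask.send_file.
# html_file_to_pdf("user_form.html", "user_form.pdf")
```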