How to convert docx to pdf in python dash - python

I am trying to deploy a python dash app to Heroku server. I perform some calculations based on user input and I create a docx report (I have a specific .docx template and I fill it with data each time). My problem is that before sending that report to the user, I want to convert it to pdf. I was already using for offline use, docx2pdf however I do not know if heroku has word embedded. Apart from that, I do not want to save the .docx and then convert it to pdf and then delete that .docx. I want to do all the conversions when the files are in memory, not stored.
Can you suggest me any other way to do that?
I tried docx2pdf without success, and I do not want this approach either because I have to save the file.
I tried another solution that I found, conver the docx to html and then to pdf with pdfkit, but the conversion from docx to html really messed up the document.
I also tried several other minor libraries without success.

Related

How to automate the conversion of PPT into PDFs?

I currently have a template for certificate in PPT format and wish to convert (or print) it to PDF.
I have a python list of people names that I am passing to python in the form of a list.
I need help with library and code to write a python script to automate this process by using a loop to edit the ppt with the names in the list and export the ppts to generate PDFs iteratively
You might be better off converting to PDF first with a placeholder text and then try to replace the text in the PDF using pdf-redactor: https://github.com/JoshData/pdf-redactor
A more reliable and future-proof option would be to replicate the template in Latex, where you can then very easily substitute the text.

How to convert XML Word Documents to DOCX?

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
currentDoc=Document(os.path.join(folderPath,xmlFileName))
docxFileName=xmlFileName.replace('.xml','.docx')
currentDoc.save(os.path.join(folderPath,docxFileName))
currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there are more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are a transitional format used I think in Word 2003, which was XML-based, but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-pptx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.

How to copy from Excel and paste as image in Word using Python?

How do I copy a set of cells from Excel and paste them as image in Word using Python ?
I am trying to automate some reports (in docx format). The excels are auto generated and then I have to manually copy cells from Excel and paste as image in Word.
I have done some research and found there are some libraries like python-docx (for word) and openpyxl (for excel). What I am not able to figure out is the copy and paste as image from excel to word.
I don't think this is an easy task. Because conversion Excel to image probably means the Python package/codes need to know how to render the excel content, which is a big step beyond read/write excel format. I don't know any Python package can do that.
Assume you're running Python codes in Windows, you may try to call COM to copy/paste directly from Excel to Word. I guess that would behave just like you manually Ctrl+C/V. Reference
If you prefer to convert to an image, may try a VBA script. Reference

Processing a PDF for information extraction

I am working on a project where I have a pdf file which describes one of the health policy. What I need to do is extract the information from this PDF and try to save it in some form such that I can answer the questions related to the policy by extracting info from this PDf.
This PDF is too big, so I want to divide the PDF according to the different sections so that when a query related to some particular area comes in then I wont have to go through the entire document.
I tried solving this using some pdf converters which converts the PDFs into the HTMLs. But these converters wont convert the PDF to HTML properly so that headings will have heading tag. Also even if I convert this properly and get the proper sections out of the document, I am not getting how to store this data.(I mean in which form should I store this Data).
Is there any other solution with which I can achieve this. I am using Python and also I can use NLTK if needed. Also the format is not fixed for the PDfs, I mean to say my code should work on any kind of PDFs.
PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.
There is an add-on for ReportLab — PageCatcher.

Categories