is there a way to bulk covert docx files into pdf - python

Is there a way to covert bulk docx files to PDF ?
I'm doing a mailmerge to generate huge number of letters in word extension and i'm struggling on having the pdf conversion for the same letters.

https://pandoc.org/demos.html - pandoc is a library and a command line tool, and it also has python bindings. Demo 35 - 36 shows using docx as an input file, so I think it would work.
If straight PDF output unavailable (I don't think I've used pandoc in that way), you can also output to LaTeX, which can then be compiled to PDF (using miktex or other latex distributions).

Related

How to convert docx to pdf in python dash

I am trying to deploy a python dash app to Heroku server. I perform some calculations based on user input and I create a docx report (I have a specific .docx template and I fill it with data each time). My problem is that before sending that report to the user, I want to convert it to pdf. I was already using for offline use, docx2pdf however I do not know if heroku has word embedded. Apart from that, I do not want to save the .docx and then convert it to pdf and then delete that .docx. I want to do all the conversions when the files are in memory, not stored.
Can you suggest me any other way to do that?
I tried docx2pdf without success, and I do not want this approach either because I have to save the file.
I tried another solution that I found, conver the docx to html and then to pdf with pdfkit, but the conversion from docx to html really messed up the document.
I also tried several other minor libraries without success.

I need to extract text from PDF file and make a new .txt file to put in

I need help in a PYTHON script to read PDF file and copy every word on it and put them in a new .txt file (every word must take 1 line) ; and then deleted the repeated words and count them after that and print the count in the last line
Install these libraries.
PyPDF2 (To convert simple, text-based PDF files into text readable by Python)
textract (To convert non-trivial, scanned PDF files into text readable by Python)
nltk (To clean and convert phrases into keywords)
Each of these libraries can be installed with the following commands in side terminal(on macOS):
pip install Libraryname
See this Tutorial https://medium.com/#rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f
Use texttrack it support many types of files also PDF. So texttrack better.
folow these links
https://github.com/deanmalmgren/textract
https://textract.readthedocs.io/en/latest/
Did you search the Stackoverflow for answers?
Here you can find some pretty good answers about how to extract text from a pdf file (Look at Jakobovski answer):
How to extract text from a PDF file?
Here you can find information about writing/editing/creating .txt files:
https://www.guru99.com/reading-and-writing-files-in-python.html

Extract text from pdf file with Python

I would like to extract text, including tables from pdf file.
I tried camelot, but it can only get table data not text.
I also tried PDF2, however it can't read Chinese characters.
Here is the pdf sample to read.
Are there any recommended text-extraction python packages?
By far the simplest way is to extract text in one OS shell command line using the poppler pdf utility tools (often included in python libraries) then modify that output in python.py as required.
>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
NOTE some of the text is embeded to right of the logo image and that can be extracted separately using pdftoppm -png or pdfimages then pass to inferior output quality OCR tools for those smaller areas.

How to convert XML Word Documents to DOCX?

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
currentDoc=Document(os.path.join(folderPath,xmlFileName))
docxFileName=xmlFileName.replace('.xml','.docx')
currentDoc.save(os.path.join(folderPath,docxFileName))
currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there are more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are a transitional format used I think in Word 2003, which was XML-based, but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-pptx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.

create office files from python

We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects which were done in PHP we used PHPexcel ,
PHPWord and tcpdf for PDF.
What libraries for python would you recommend for creating this kind of files ? (for excel and word its imortant to use the open xml file format xlsx , docx)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc a Haskell library which supports a much wider range of output and input formats than docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating openxml files is as simple as generating couple of XML files (you can use templates) and zipping it.
Simplest way to generate PDF is to generate HTML (with CSS+images) and convert it using wkhtmltopdf tool.

Categories