create office files from python


We have a project in Python with Django.
We need to generate complex Word, Excel and PDF files.
For the rest of our projects, which were done in PHP, we used PHPExcel, PHPWord and TCPDF for PDF.
Which Python libraries would you recommend for creating these kinds of files? (For Excel and Word it's important to use the Open XML file formats, xlsx and docx.)

Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly developed tools for manipulating Word documents. I've found the Java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is done efficiently via FOP, I also use docx4j.
To integrate this with my Python code, I use the Spark framework to wrap it in a simple web service, and use requests on the Python side to talk to the service.

For Excel, there's openpyxl, which is a Python port of PHPExcel, as far as I know. I haven't used it yet, but it sounds OK to me.

I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output formats. The package includes HTML, LaTeX and ODF (.odt) writers, and the sandbox contains a whole load of writers for other formats; see, for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc, a Haskell library which supports a much wider range of input and output formats than Docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. Pandoc also has an online demo you can try.

I have never used any libraries for this, but you can change the extension of any docx or xlsx file to zip and see the magic!
Generating Open XML files is as simple as generating a couple of XML files (you can use templates) and zipping them.
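As a rough illustration of the "XML files plus zip" idea, here is a minimal sketch that uses only the standard library. The function name build_docx and the three template constants are made up for this example; the part names and namespaces are the standard ones for a bare WordprocessingML package. It produces a plain-text document only (no styles, headers or images), so treat it as a starting point rather than a replacement for a real library.

```python
import io
import zipfile
from xml.sax.saxutils import escape

# The three parts a minimal .docx package needs.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
DOCUMENT_TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document '
    'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>{body}</w:body>'
    '</w:document>'
)


def build_docx(paragraphs):
    """Return the bytes of a .docx file with one plain paragraph per string."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">{}</w:t></w:r></w:p>'.format(escape(p))
        for p in paragraphs
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        zf.writestr("word/document.xml", DOCUMENT_TEMPLATE.format(body=body))
    return buffer.getvalue()


if __name__ == "__main__":
    with open("hello.docx", "wb") as fh:
        fh.write(build_docx(["Hello from Python", "Second paragraph"]))
```

In a Django view the returned bytes could be sent directly as the response body instead of being written to disk.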

The simplest way to generate a PDF is to generate HTML (with CSS and images) and convert it using the wkhtmltopdf tool.
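One way to drive wkhtmltopdf from Python is via subprocess. This is only a sketch: the helper names are invented for the example, it assumes the wkhtmltopdf binary is installed and on the PATH, and the A4 page size is an arbitrary choice.

```python
import subprocess


def wkhtmltopdf_command(html_path, pdf_path, page_size="A4",
                        executable="wkhtmltopdf"):
    """Build the command line: options first, then input and output files."""
    return [executable, "--page-size", page_size, html_path, pdf_path]


def html_to_pdf(html_path, pdf_path, **options):
    """Run wkhtmltopdf and raise CalledProcessError if it exits non-zero."""
    subprocess.run(wkhtmltopdf_command(html_path, pdf_path, **options),
                   check=True)


if __name__ == "__main__":
    html_to_pdf("report.html", "report.pdf")
```

The input can also be a URL, which is handy if the HTML is already served by a Django view.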

Related

How to convert XML Word Documents to DOCX?

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
    currentDoc=Document(os.path.join(folderPath,xmlFileName))
    docxFileName=xmlFileName.replace('.xml','.docx')
    currentDoc.save(os.path.join(folderPath,docxFileName))
    currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there a more suitable package or technique I should be using?

It's not clear just what sort of files those .xml files are, but I suspect they are in a transitional format, used I think in Word 2003, which was XML-based but is not the Open Packaging Convention (OPC) format used for Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-docx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.
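Before writing a conversion script, it may help to confirm what the .xml files actually are. The sketch below only sniffs the first few kilobytes of each file; the function names and return labels are made up for illustration, and it assumes the XML is stored as UTF-8 (a UTF-16 file would come back as "unknown"). The two namespace URIs are the ones used by Word 2003 XML and by the later "flat OPC" single-file XML format.

```python
import os

WORDML_2003_NS = b"http://schemas.microsoft.com/office/word/2003/wordml"
FLAT_OPC_NS = b"http://schemas.microsoft.com/office/2006/xmlPackage"
ZIP_MAGIC = b"PK\x03\x04"


def classify_word_bytes(head):
    """Guess the container format from the first bytes of a file."""
    if head.startswith(ZIP_MAGIC):
        return "opc-zip"          # a real .docx, whatever its extension
    if WORDML_2003_NS in head:
        return "word-2003-xml"    # python-docx cannot open this
    if FLAT_OPC_NS in head:
        return "flat-opc-xml"     # single-file XML form of an OPC package
    return "unknown"


def classify_folder(folder_path, probe_bytes=4096):
    """Map each file name in folder_path to its guessed format."""
    results = {}
    for name in sorted(os.listdir(folder_path)):
        full_path = os.path.join(folder_path, name)
        if os.path.isfile(full_path):
            with open(full_path, "rb") as fh:
                results[name] = classify_word_bytes(fh.read(probe_bytes))
    return results
```

If every file comes back as "word-2003-xml", that would confirm the PackageNotFoundError is about the format rather than the installation.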

How to create / Write STDF file with python?

I'm trying to convert a .txt data file to an STDF (ATE Standard Test Data Format, commonly used in semiconductor testing) file.
Is there any way to do that?
Are there any libraries in Python which would help in cases like this?
Thanks!

You can try Semi-ATE STDF library:
It supports only ver. 4. You can use conda-forge or pypi to install it.

It is of course possible since Python is Turing complete. However, you should use one of the available open source or commercial libraries to handle the STDF writing if you are not familiar with STDF. Even one mis-placed byte in the binary output will wreck your file.
It is impossible to say whether an existing tool can do this for you because a text file can have anything in it. Your text file will need to adhere to the tool's expectations of where the necessary header data (lot id, program name, etc.), test names and numbers, part identifiers, test results and so on will be in the text file.
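To give a feel for why a single misplaced byte matters, here is a sketch of how the first record of an STDF V4 file, the File Attributes Record (FAR), is laid out. It uses only struct and is not the API of the Semi-ATE library or any other package; the helper names are made up. Every record starts with a 4-byte header (2-byte length of the data that follows, 1-byte record type, 1-byte sub-type), and the FAR is type 0, sub-type 10, carrying a CPU type byte (2 for little-endian x86-style machines) and the STDF version byte.

```python
import struct


def stdf_record(rec_typ, rec_sub, payload, byteorder="<"):
    """Prefix payload with the STDF header: REC_LEN, REC_TYP, REC_SUB."""
    header = struct.pack(byteorder + "HBB", len(payload), rec_typ, rec_sub)
    return header + payload


def far_record(cpu_type=2, stdf_ver=4):
    """File Attributes Record: must be the first record in the file."""
    return stdf_record(0, 10, struct.pack("BB", cpu_type, stdf_ver))


if __name__ == "__main__":
    print(far_record().hex())
```

A complete file needs many more record types after the FAR (MIR, PIR, PTR, PRR, MRR and so on), each with its own field list, which is why an existing library is the safer route.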

HTML to RTF string using Python

I am looking for a way to convert HTML text to an RTF string. Are there any libraries that do this job? I get HTML content dynamically in my project and need it to be rendered in RTF format. I am using an HTML parser to convert the HTML text to a normal string and am then trying to use PyRTF to convert it to RTF. Is there a better way to do this? Thanks in advance.

RTF seems a dicey format to convert from/to. I've tried cutting and pasting among applications on Mac OS X, for example, where RTF is something of a lingua franca. Some of those apps are Microsoft apps (relevant in that RTF is a Microsoft-developed format), others are not. Even basic formatting information like font size, font face, line spacing, and list styling (ordered or unordered) is jumbled when copying from one ostensibly RTF-speaking app to another. Simply put, it's a mess.
I have searched for ways to programmatically read, write, and transform RTF, preferably from Python. I found a number of packages on PyPI, but trying them out has been a disappointing experience. They would support RTF 1.5, say, when the current version is 1.9.1. RTF has been around a long time, but a 2005-vintage spec is not very recent. There were lots of gotchas and incompatibilities. LOTS.
Now, I'm not saying it's impossible, or that there aren't other libraries out there that would do the trick. I have not tried the zopyx.convert mentioned by others here, for example. Maybe it's great. But looking at its dependencies--Java, FOP, etc.--it looks like a pretty complex (and thus likely fragile) toolchain. I read its code on github, and the Python is really only there as a coordination veneer. It organizes external tools XFC, XINC, FOP, and PrinceXML--three of the four of which are commercial software. That includes the key XFC part that deals with RTF. Color me skeptical.
There are two converters that I've found are worth a look: If you're using a Mac, the textutil command line program is actually one of the better and simpler tools I've seen.
textutil -convert html filename.rtf -output filename.html
The other formatting engine that's worth considering is LibreOffice. It's free, open source, reasonably amenable to automation, and a decent foundation as an interoperability hub. That's not just a guess; I've built complex, multi-format document workflows around it.
I would question why you're trying to get into RTF. That seems like a document format you'd be trying to escape from. But if you need to go there, textutil and LibreOffice are the least-worst mechanisms I've found.

There is a wonderful python library that comes as a tarball.
You can download it at https://pypi.python.org/pypi/zopyx.convert2/2.4.5.
Good luck!

I see this question is over a year old, but figured I'd contribute anyway. I recently had a similar requirement, and turned to PyRTF, a small but powerful Python module that can construct RTF documents from a text file. You could use Beautiful Soup to scrape the HTML, going down the parse tree tag by tag, and use the PyRTF API to construct appropriate objects (table, cell, paragraph, section or document).
The API itself is quite granular, and allows for a whole bunch of custom formatting (font text, alignment, color, headers, footers etc.)
Hope this helps.
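The tag-by-tag walk described above can be sketched without any third-party packages. Note the substitutions: this example uses the standard library's html.parser instead of Beautiful Soup and writes RTF control words directly instead of going through the PyRTF API. The names TAG_TO_RTF, escape_rtf, HtmlToRtf and html_to_rtf are invented for the example, and only a handful of inline tags plus paragraphs are handled.

```python
from html.parser import HTMLParser

# A few HTML tags and the RTF control words that switch them on and off.
TAG_TO_RTF = {
    "b": ("\\b ", "\\b0 "),
    "strong": ("\\b ", "\\b0 "),
    "i": ("\\i ", "\\i0 "),
    "em": ("\\i ", "\\i0 "),
    "u": ("\\ul ", "\\ulnone "),
}


def escape_rtf(text):
    """Escape RTF special characters and encode non-ASCII as \\uN? units."""
    out = []
    for ch in text:
        if ch in "\\{}":
            out.append("\\" + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            units = ch.encode("utf-16-le")
            for i in range(0, len(units), 2):
                value = int.from_bytes(units[i:i + 2], "little")
                if value > 32767:      # \u takes a signed 16-bit number
                    value -= 65536
                out.append("\\u%d?" % value)
    return "".join(out)


class HtmlToRtf(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_RTF:
            self.parts.append(TAG_TO_RTF[tag][0])
        elif tag == "br":
            self.parts.append("\\line ")

    def handle_endtag(self, tag):
        if tag in TAG_TO_RTF:
            self.parts.append(TAG_TO_RTF[tag][1])
        elif tag == "p":
            self.parts.append("\\par ")

    def handle_data(self, data):
        self.parts.append(escape_rtf(data))


def html_to_rtf(html):
    """Convert a small HTML fragment to an RTF document string."""
    parser = HtmlToRtf()
    parser.feed(html)
    parser.close()
    return ("{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0 "
            + "".join(parser.parts) + "}")


if __name__ == "__main__":
    print(html_to_rtf("<p>Hello <b>bold</b> and <i>italic</i></p>"))
```

Tables, lists, fonts and colours would each need their own mapping, which is where a dedicated library or an external converter starts to pay off.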

Output PCL from Word document using Python

I'm building a web application which will include functionality that takes MS Word (and possibly input from a web-based rich text editor) documents, substitutes values into the formfield placeholders in those documents, and generates a PCL document as output.
I'm developing in python and django on windows, but this whole solution will need to be deployed to a web host (yet to be chosen), which in practice means that the solution will need to run on linux.
I'm open to linux-only solutions if that's the only way. I'm open to solutions that involve talking to a server written in another language. I am able to write C++ or java if necessary to get this done. The final output does have to be in PCL format.
My question is: what is a good tool chain for generating PCL from word documents using python?
I am considering using some kind of interface to openoffice to open the word documents, do the substitutions, and send the output to some kind of printer driver. Does anyone have experience with this? What libraries would you recommend?
Options for interfacing that I have identified include the following; any other suggestions would be greatly welcomed:
Ulif.openoffice: http://pypi.python.org/pypi/ulif.openoffice/0.4
Py3o.renderserver: https://bitbucket.org/faide/py3o.renderserver
OpenOffice-python: http://openoffice-python.origo.ethz.ch/
A second approach would be to use something like paradocx ( https://bitbucket.org/yougov/paradocx/wiki/Home ) to open the word files, do the substitutions using that in python, then somehow interface with something that can output PCL. Again, any experience or comments on this approach would be appreciated.
I will very much appreciate any comments on tools and toolchains, and ideas or recipes that you may have.
This question covers similar ground to, but is not the same as: How to Create PCL file from MS word

Ghostscript can read PostScript (PS) or PDF and produce PCL. You can drive it from a Python library or simply call it with subprocess.
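A sketch of the subprocess route follows. The helper names are made up, and it assumes a Ghostscript executable called gs is on the PATH (on Windows the console binary is typically named gswin64c). The ljet4 device produces PCL 5 output; pxlmono and pxlcolor produce PCL XL if the target printer prefers that.

```python
import subprocess


def ghostscript_pcl_command(src, dst, device="ljet4", gs="gs"):
    """Build a Ghostscript command that renders src (PS or PDF) to PCL."""
    return [
        gs,
        "-dBATCH",              # exit when the input has been processed
        "-dNOPAUSE",            # do not prompt between pages
        "-dSAFER",              # restrict file access from the input
        "-sDEVICE=" + device,   # output driver, e.g. ljet4 or pxlcolor
        "-sOutputFile=" + dst,
        src,
    ]


def convert_to_pcl(src, dst, **options):
    """Run Ghostscript and raise CalledProcessError on failure."""
    subprocess.run(ghostscript_pcl_command(src, dst, **options), check=True)


if __name__ == "__main__":
    convert_to_pcl("letter.pdf", "letter.pcl")
```

This still leaves the Word-to-PDF step, which has to be handled by something else in the chain.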

OK, so my final solution involved creating a Java web service to perform my transcoding.
Docx4j provides a class, org.docx4j.convert.out.pdf.viaXSLFO.Conversion, which hooks into Apache FOP to convert docx to PDF; that can easily be hacked to convert to PCL instead (because FOP can output PCL).
Spark is a lightweight Java web framework, which allowed me to wrap my transcoder in a web service.
Because I also manipulate the document, I need to send some metadata with it, so a multipart form is the perfect fit. I decode that using Apache FileUpload.
In almost all cases, I had to upgrade to the development versions of the libraries to get this to work.
On the Python side I use:
requests to communicate with the web service
poster to prepare the multipart request

How to include a page from a PDF in a PDF document in Python

I am using the ReportLab toolkit in Python to generate some reports in PDF format. I want to include some predefined parts of documents, already published in PDF format, in the generated PDF file. Is it possible to accomplish this in ReportLab or in another Python library, and if so, how?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.

I'm currently using PyPDF to read, write, and combine existing PDFs, and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.

If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So, for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that, you should be fine.

There is an add-on for ReportLab — PageCatcher.
