How to convert XML Word Documents to DOCX? - python

I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
currentDoc=Document(os.path.join(folderPath,xmlFileName))
docxFileName=xmlFileName.replace('.xml','.docx')
currentDoc.save(os.path.join(folderPath,docxFileName))
currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there are more suitable package or technique I should be using?

It's not clear just what sort of files those .xml files are, but I suspect they are a transitional format used I think in Word 2003, which was XML-based, but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-pptx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.

Related

How to extract contents from WMI file in python?

I have WMI-files on my hdd, which I can open using some supporting tools (LogReader). In wmi-files, I have some report information as text module. I want to convert these files into .txt, .xml or some other datatype, which is suitable to use in python. So, I can use the information for further tasks. I tried to extract contents of WMI-Files in python but I couldn't accomplish it.
Is there any way to solve this problem of mine?

Extracting text from text documents with Poedit

I am making a quiz app which reads data from text files. The app works fine but I now want to translate it into English (from my native language). I can do that for strings defined in source files (.py) such as text on buttons etc., but have troubles with extracting text that needs translating from those text documents where all my questions and possible answers are.
I am using module gettext with Python and am using operator _ or _( to indicate translatable strings (which I have set in Poedit under Properties - Sources Keywords).
I have also set paths of my translatable sources to . (all files in that directory) and even tried setting those .txt files specifically for extracting.
My text file looks like this (one line of one file):
_(Koliko je 2/0?);_(0):_(ni definirano):_(2);_(ni definirano)
I tried to find which document type's Poedit extracts text from but did not find anything other than "from source" - should it be able to extract from .txt files or not? If not, how should I name them?
As I said, it does extracts strings from my .py files so it is working otherwise.
Poedit can't magically know the syntax of your homegrown file format, so simply adding .txt files can't possibly do anything. You'll have to write a custom extractor (see how xgettext works for reference) or switch to some standard syntax:
Be sufficiently similar to a supported programming language, such as C (where as luck would have it, ; and : are both valid syntax elements, although using e.g. , instead of : would be safer):
_("Koliko je 2/0?");_("0"):_("ni definirano"):_("2");_("ni definirano")
Use XML-based format, where xgettext supports extraction rules described with ITS.
I was confronted with the same problem when trying to extract strings to be translated from an options.txt file in a WordPress plugin. The only solution I found was to copy that options.txt file to options.php, which PoEdit was able to search for strings. When the translation operation is finished, the options.txt file can then be deleted.

Read all types of files in python

I am trying to extract information from different types of files in python(.pdf .doc .docx) and convert to .txt but while processing different files I am getting space and newlines when not required and many other issues. I have tried PyPDF2 and PDF manager.Please suggest me something with which i can extract information from files.
EDIT
Currently looking for something which can help me extract exact text from .pdf files. I have tried PyPDF, PDFMiner and PDF Manager and I am getting issues with some pdfs in all of them.
Personally I think pdfminer is the best python module for extracting information from pdfs Get it here
I think you can refer to this link
for corresponding file formats.

how to read ppt file using python?

I want to get the content (text only) in a ppt file. How to do it?
(It likes that if I want to get content in a txt file, I just need to open and read. What do I need to do to get information from ppt files?)
By the way, I know there is a win32com in windows system. But now I am working on linux, is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice to do this (and for .doc, .docx, .pptx, etc, etc.), and the Apache Tika Project (which appears to be the 5,000lb gorilla in this solution space).

create office files from python

We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects which were done in PHP we used PHPexcel ,
PHPWord and tcpdf for PDF.
What libraries for python would you recommend for creating this kind of files ? (for excel and word its imortant to use the open xml file format xlsx , docx)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc a Haskell library which supports a much wider range of output and input formats than docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating openxml files is as simple as generating couple of XML files (you can use templates) and zipping it.
Simplest way to generate PDF is to generate HTML (with CSS+images) and convert it using wkhtmltopdf tool.

Categories