Clean up XML of a DOCX document with Python / Linux binary

This may be similar to an existing question, but the methods described there aren't applicable to my situation. I'm looking for a tool I can use from Python, or just a standalone Linux binary. Everything I've found so far is Windows/MS Office-only :(
Is there any way to simply clean up DOCX tags on Linux?
Thanks!

I've tried using headless LibreOffice as a converter from DOCX to DOCX, and it seemed to help in most cases.
libreoffice --headless --convert-to docx ./Copyright\ license.docx
That said, this approach needs more testing.
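The same round trip can be driven from Python. The following is only a minimal sketch, assuming a `libreoffice` binary is on the PATH (some installs name it `soffice`); the helper names `build_convert_command` and `resave_docx` are made up for illustration. It writes the result to a separate directory via `--outdir` so the original file is not overwritten.

```python
import subprocess
from pathlib import Path


def build_convert_command(src, outdir, binary="libreoffice"):
    """Build the argument list for a headless DOCX-to-DOCX conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(outdir),
        str(src),
    ]


def resave_docx(src, outdir, timeout=120):
    """Re-save src through LibreOffice and return the expected output path.

    Assumption: LibreOffice names the output after the input file's stem.
    """
    src = Path(src)
    outdir = Path(outdir)
    # Refuse to write next to the input, which would overwrite it.
    if outdir.resolve() == src.resolve().parent:
        raise ValueError("outdir must differ from the input file's directory")
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(src, outdir), check=True, timeout=timeout)
    return outdir / (src.stem + ".docx")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces, such as `Copyright license.docx`.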

Related

How to convert pdf to docx using python while maintaining the layout

I need help converting PDF to DOCX. I found one library, pdf2docx, but it is under a GNU license, so I can't use it. Is there any other library I can use? I want to preserve the layout and all information, and the PDF files also contain images.
I have also tried converting the PDF to DOCX using subprocess. This runs, but the resulting file is corrupted.
Libraries that I have found are:
pikepdf
pdfminer.six
pymupdf
pdfrw
pdfplumber
etc...
In short, I am looking for an alternative to the pdf2docx library.
Please help.

Convert docx to PDF using Linux and Python

I am looking for a way to convert a DOCX to PDF using Python on Linux. So far, everything I have found that works requires Windows. Is there a way to do it in Python without using LibreOffice?

Solution to convert PDFs, DOCs, DOCXs into a textual format with Python

I am developing a full-text search engine for indexing popular binary formats. I know that there are hundreds of such questions (and solutions) already, but I found it tough to find one that is:
cross platform
supports DOC, DOCX and PDF formats at once
easy to use with Python
can be set up on a major shared host
For PDFs, I recommend PDFMiner.
Try the docx module (I have not used it myself)
I am not aware of any pure Python module that can read .doc files.
There are command-line tools to extract text from .doc files: antiword and catdoc (and probably others). If the packages are installed on your shared host, you could use subprocess to shell out to these tools. Both are available on Windows via Cygwin.
Apache POI is a Java library that can extract text from Office documents. If your shared host has Java installed, you could write a bit of Java (or Jython) code and execute using subprocess.
If you can use OpenOffice on the server side, you can use unoconv, which converts between any document formats supported by OpenOffice.
One possible solution is to use Google Docs to extract the text contents from binary .doc files. You upload the document to Google Docs and then download the text contents. It is a fairly slow process, but it is the only "pure Python" solution I know of, since it doesn't require any external tools except network access. An external tool such as catdoc or antiword is a much better solution if you are allowed to install it on your host.
textract delegates to an existing extraction tool for each kind of file behind a single interface.
https://github.com/deanmalmgren/textract
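The suggestions above can be combined into one dispatcher keyed on the file extension. This is a rough sketch under several assumptions: `.doc` files are handed to whichever of antiword or catdoc is installed, PDFs go through pdfminer.six (imported lazily, so it is only needed for PDFs), and `.docx` files are read with the standard library instead of the docx module, relying on the fact that a .docx is a zip archive whose main body lives in `word/document.xml`. All function names here are illustrative.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET
import zipfile
from pathlib import Path

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Join the text runs of each paragraph in word/document.xml."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return "\n".join(paragraphs)


def doc_to_text(path):
    """Shell out to antiword or catdoc, whichever is found first."""
    for tool in ("antiword", "catdoc"):
        exe = shutil.which(tool)
        if exe:
            result = subprocess.run(
                [exe, str(path)], capture_output=True, text=True, check=True
            )
            return result.stdout
    raise RuntimeError("neither antiword nor catdoc is installed")


def pdf_to_text(path):
    """Use pdfminer.six; imported here so other formats work without it."""
    from pdfminer.high_level import extract_text as pdf_extract
    return pdf_extract(str(path))


def extract_text(path):
    """Pick a handler from the file extension."""
    handlers = {".docx": docx_to_text, ".doc": doc_to_text, ".pdf": pdf_to_text}
    suffix = Path(path).suffix.lower()
    if suffix not in handlers:
        raise ValueError("unsupported file type: " + suffix)
    return handlers[suffix](path)
```

The .docx handler ignores headers, footers and footnotes, and text inside nested paragraphs (for example text boxes) may appear twice, which is usually acceptable for search indexing.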

How to convert text files to PDF files without reportlab in Python?

I have a problem when using reportlab with py2exe. It works normally under Python, but the reportlab modules raise many errors when I run the exe file built by py2exe. Can you suggest a Python library or code to convert text files (with tables) to PDF format without using reportlab? Thanks.
I used pyPdf in the past, which is quite good for quick and dirty solutions, though I would hesitate before using it for large projects.

Converting a PDF to a series of images with Python

I'm attempting to use Python to convert a multi-page PDF into a series of JPEGs. I can split the PDF up into individual pages easily enough with available tools, but I haven't been able to find anything that can convert PDFs to images.
PIL does not work, as it can't read PDFs. The two options I've found are using either GhostScript or ImageMagick through the shell. This is not a viable option for me, since this program needs to be cross-platform, and I can't be sure either of those programs will be available on the machines it will be installed and used on.
Are there any Python libraries out there that can do this?
ImageMagick has Python bindings.
Here's what worked for me using the Python ghostscript module (installed with '$ pip install ghostscript'):
import ghostscript

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    args = ["pdf2jpeg",  # actual value doesn't matter
            "-dNOPAUSE",
            "-sDEVICE=jpeg",
            "-r144",
            "-sOutputFile=" + jpeg_output_path,
            pdf_input_path]
    ghostscript.Ghostscript(*args)
I also installed Ghostscript 9.18 on my computer and it probably wouldn't have worked otherwise.
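For a multi-page PDF, the Ghostscript command-line program can also be called directly. This is a sketch only, assuming the executable is named `gs` (on Windows it is typically `gswin64c`) and using hypothetical helper names; the `%03d` in the output pattern is how Ghostscript is told to write one numbered file per page.

```python
import subprocess


def build_gs_command(pdf_path, output_pattern, dpi=144, binary="gs"):
    """Build a Ghostscript command that renders every page to a JPEG."""
    return [
        binary,
        "-dNOPAUSE",   # do not pause between pages
        "-dBATCH",     # exit after the last page instead of prompting
        "-dSAFER",     # restrict file access while rendering
        "-sDEVICE=jpeg",
        "-r" + str(dpi),
        "-sOutputFile=" + output_pattern,
        str(pdf_path),
    ]


def pdf_to_jpegs(pdf_path, output_pattern="page-%03d.jpg", dpi=144):
    """Run Ghostscript; raises CalledProcessError if rendering fails."""
    subprocess.run(build_gs_command(pdf_path, output_pattern, dpi), check=True)
```

Calling `pdf_to_jpegs("input.pdf")` would then produce `page-001.jpg`, `page-002.jpg`, and so on in the current directory.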
You can't avoid the Ghostscript dependency. Even ImageMagick relies on Ghostscript for its PDF reading functions. The reason for this is the complexity of the PDF format: a PDF doesn't just contain bitmap information, but mostly vector shapes, transparencies, etc.
Furthermore, it is quite complex to figure out which of these objects appear on which page.
So the correct rendering of a PDF page is clearly out of scope for a pure Python library.
The good news is that Ghostscript is pre-installed on many Windows and Linux systems, because it is also needed by all those PDF printers (except Adobe Acrobat).
If you're using Linux, some versions come with a command-line utility called 'pdftopbm' out of the box. Check out netpbm.
Perhaps relevant: http://www.swftools.org/gfx_tutorial.html
