python convert microsoft office docs to plain text on linux

Any recommendations on a method to convert .doc, .ppt, and .xls to plain text on Linux using Python? Really, any method of conversion would be useful. I have already looked at using OpenOffice, but I would like a solution that does not require installing OpenOffice.

I'd go for the command-line solution (and then use the Python subprocess module to run the tools from Python).
Converters for MS Word (catdoc), Excel (xls2csv) and PowerPoint (catppt) can be found (in source form) here: http://vitus.wagner.pp.ru/software/catdoc/.
I can't really comment on the usefulness of catppt, but catdoc and xls2csv work great!
But be sure to search your distribution's repositories first. On Ubuntu, for example, catdoc is just one quick apt-get away.
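For what it's worth, the subprocess route above could be sketched roughly like this. It assumes catdoc, xls2csv and catppt are installed and on PATH; the names CONVERTERS, build_command and to_text are made up for illustration, and no tool-specific options are passed.

```python
import os
import subprocess

# Extension -> command-line tool from the catdoc package (assumed to be on PATH).
CONVERTERS = {".doc": "catdoc", ".xls": "xls2csv", ".ppt": "catppt"}


def build_command(path):
    """Return the argv list for converting `path`, chosen by file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in CONVERTERS:
        raise ValueError("no converter known for %r" % ext)
    return [CONVERTERS[ext], path]


def to_text(path):
    """Run the matching tool and return whatever it writes to stdout."""
    result = subprocess.run(build_command(path), stdout=subprocess.PIPE, check=True)
    # The output encoding depends on how the tool is configured; UTF-8 is a guess.
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list (rather than one shell string) means file names with spaces or quotes need no escaping.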

You can access OpenOffice via Python API.
Try using this as a base: http://wiki.services.openoffice.org/wiki/Odt2txt.py

The usual tool for converting Microsoft Office documents to HTML or other formats was mswordview, which has since been renamed to wvWare.
If you're looking for a command-line tool, they actually recommend using AbiWord to perform the conversion:
AbiWord --to=txt
If you're looking for a library, start on the wvWare overview page. They also maintain a list of libraries and tools which read MS Office documents.

At the command line, antiword or wv work very nicely for .doc files. (Not a Python solution, but they're easy to install and fast.)

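Same problem here. Below is my simple script to convert all .doc files in the directory 'docs/' to text files in 'txts/' using catdoc. Hope it helps someone:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import glob, os, subprocess
# Collect .doc files with either lower-case or upper-case extension.
f = glob.glob('docs/*.doc') + glob.glob('docs/*.DOC')
outDir = 'txts'
if not os.path.exists(outDir):
    os.makedirs(outDir)
for i in f:
    # docs/name.doc -> txts/name.txt (also works for names containing dots)
    base = os.path.splitext(os.path.basename(i))[0]
    with open(os.path.join(outDir, base + '.txt'), 'wb') as out:
        # -w disables catdoc's line wrapping; stdout goes straight to the file.
        subprocess.call(['catdoc', '-w', i], stdout=out)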

For dealing with Excel spreadsheets, xlwt is good. But it won't help with .doc and .ppt files.
(You may have also heard of PyExcelerator. xlwt is a fork of it and better maintained, so I think you'd be better off with xlwt.)

I've had some success at using XSLT to process the XML-based office files into something usable in the past. It's not necessarily a python-based solution, but it does get the job done.
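To illustrate the idea for the newer XML-based formats only (.docx, not legacy .doc), here is a rough sketch that swaps XSLT for a plain ElementTree walk from the standard library. The function name docx_to_text is made up, and the sketch ignores tables, headers, footnotes and anything nested inside text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx package.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the text of a .docx, one line per paragraph.

    `source` is a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W + "p"):
        # Join every text run (<w:t>) found inside this paragraph.
        lines.append("".join(t.text or "" for t in paragraph.iter(W + "t")))
    return "\n".join(lines)
```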

Related

Internet Shortcut in python

I have a problem. Let's say I have a website (e.g. www.google.com). Is there any way to create a file with a .url extension linking to this website in Python? (I am currently looking for a flat, and I want to save shortcuts on my hard drive only to those apartment offers posted online that match my expectations.) I've tried to use the os and requests modules to create such files, but with no success. I would really appreciate the help. (I am using Python 3.9.6 on Windows 10.)

This is pretty straightforward. I had no idea what .URL files were before seeing this post, so I decided to drag its URL to my desktop. It created a file with the following contents which I viewed in Notepad:
[InternetShortcut]
URL=https://stackoverflow.com/questions/68304057/internet-shortcut-in-python
So, you just need to write out the same thing via Python, except replace the URL with the one you want:
test_url = r'https://www.google.com/'
with open('Google.url','w') as f:
    f.write(f"""[InternetShortcut]
URL={test_url}
""")
With regard to your current attempts:
I've tried to use os and requests module to create such file
It's not clear what you're using requests or os for, since you didn't provide a Minimal Reproducible Example of what you've tried so far. If there's a more complex element to this that you didn't specify, such as automatically generating the file while you're in your browser, then you need to update your question to include all of your requirements.
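Since the goal is to save many offers, the snippet from this answer could be wrapped in a small helper along these lines. This is only a sketch: write_shortcut and the offers dictionary are hypothetical names, and the file contains nothing beyond the two lines shown in the answer.

```python
import os


def write_shortcut(folder, name, url):
    """Write `<name>.url` into `folder` and return the path of the new file."""
    path = os.path.join(folder, name + ".url")
    with open(path, "w") as f:
        f.write("[InternetShortcut]\n")
        f.write("URL=%s\n" % url)
    return path


if __name__ == "__main__":
    # Hypothetical data: shortcut name -> link worth keeping.
    offers = {"Google": "https://www.google.com/"}
    for name, url in offers.items():
        print(write_shortcut(".", name, url))
```

Names taken from page titles may contain characters that Windows does not allow in file names, so they would need cleaning first.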

Is there any way to read .one (OneNote) files using a Python script?

I am trying to read a .one file (a OneNote file) and write its content into a text file, but I haven't found a single way to do it using Python. Please help me with this.

Try to get the content of your notes calling:
./me/onenote/pages/1-1c13bcbae2fdd747a95b3e5386caddf1!1-xxxxxxxx-xxxx-xxxx-xxxxxxxxxxxx/content?includeIDs=true&includeInkML=true&preAuthenticated=true
It will give you text/html, which you can parse with https://pypi.org/project/lxml/
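A rough sketch of that call, under several assumptions: the relative path above is taken to belong to the Microsoft Graph v1.0 endpoint, the access token is obtained elsewhere, the query parameters from the example are left out, and the standard library (urllib and html.parser) stands in for requests and lxml. The function names are made up for illustration.

```python
import urllib.request
from html.parser import HTMLParser

# Assumed base URL for the relative path shown above.
GRAPH = "https://graph.microsoft.com/v1.0"


def page_content_url(page_id):
    """Build the URL for the HTML content of one OneNote page."""
    return "%s/me/onenote/pages/%s/content" % (GRAPH, page_id)


def fetch_page_html(page_id, token):
    """Download the page HTML; `token` is an OAuth access token obtained elsewhere."""
    request = urllib.request.Request(
        page_content_url(page_id),
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text nodes of `html`, one per line, with tags dropped."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Writing the result of html_to_text(fetch_page_html(page_id, token)) to a file would then give the plain-text version the question asks for.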

I didn't find a good way to decode .one files, but I found a workaround:
1) Install OneNote 2016 and sync the contents.
2) Install the legacy version of Evernote.
3) Import the data from OneNote into Evernote (there's a button in the GUI).
4) Export the notes to HTML from Evernote.
Then you can do whatever you want. Yeah!

This isn't Python, but for other internet voyagers trying to escape Microsoft's iron grip, this PowerShell script is pure magic.
https://passbe.com/2019/bulk-export-onenote-2013-2016-pages-as-html/

printing to windows printer with python or shell command

I'm trying to script an annoying task that involves fetching, handling and printing loads of scanned docs (JPEG or PDF). I haven't managed to access the printer from Python or from the Windows shell (which I could script with the Python subprocess module). I succeeded in printing a text file from the command line with the lpr command, but not a JPG or PDF.
I'd be glad for any clues about this, including a more extensive Windows shell reference for printing, or a suitable Python library I missed in my Google and Stack Overflow searches (which turned up just one unanswered question).

Well, after a little research I found some links that might help you:
1) To print images using Python Shell, this link below has some code using PIL that will, hopefully, do what you want:
http://timgolden.me.uk/python/win32_how_do_i/print.html
2) To print PDF files, this link may have what you need:
http://www.darkcoding.net/software/printing-word-and-pdf-files-from-python/
I have never done any of these things, but after a quick look I found these links, and they seem to make a lot of sense. Hope it helps :)

I used this for an RTF file (just an idea):
subprocess.call(['loffice', '-pt', 'LaserJet', file])
I am using LibreOffice; it can print in batch mode.
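Extending that idea to a whole folder might look like the sketch below. It reuses the loffice -pt call from this answer as-is; the helper names, the "*.rtf" pattern and the assumption that several files can be passed in one call are mine, and whether LibreOffice prints a given JPEG or PDF the way you want is not something this sketch checks.

```python
import glob
import os
import subprocess


def print_command(paths, printer, program="loffice"):
    """Return the argv for one LibreOffice batch-print call covering `paths`."""
    return [program, "-pt", printer] + list(paths)


def print_folder(folder, printer, pattern="*.rtf"):
    """Send every file in `folder` matching `pattern` to `printer`."""
    paths = sorted(glob.glob(os.path.join(folder, pattern)))
    if paths:  # do not start LibreOffice when there is nothing to print
        subprocess.call(print_command(paths, printer))
    return paths
```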

With a default PDF viewer assigned to the system, you can do:
import win32api
fname="C:\\somePDF.pdf"
win32api.ShellExecute(0, "print", fname, None, ".", 0)
Note that this will only work on Windows and will not work with all PDF viewers, but it should be fine with Acrobat, Foxit and several other major ones.

create office files from python

We have a project in Python with Django.
We need to generate complex Word, Excel and PDF files.
For the rest of our projects, which were done in PHP, we used PHPExcel, PHPWord and TCPDF for PDF.
What libraries for Python would you recommend for creating these kinds of files? (For Excel and Word it's important to use the Open XML file formats, xlsx and docx.)

Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly developed tools for manipulating Word documents. I've found the Java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP, I also use docx4j.
To integrate this with my Python code, I use the Spark framework to wrap it up in a simple web service, and use requests on the Python side to talk to the service.

For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.

I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc, a Haskell library which supports a much wider range of output and input formats than Docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.

I have never used any libraries for this, but you can change the extension of any docx or xlsx file to zip, and see the magic!
Generating Open XML files is as simple as generating a couple of XML files (you can use templates) and zipping them.
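As a rough illustration of "a couple of XML files, zipped", here is a sketch that builds a bare-bones .docx in memory. The three part names and namespaces are, to my knowledge, the standard Open XML ones, but the sketch has not been checked against every Office version, and build_docx is a made-up name. Anything beyond plain paragraphs (styles, tables, images) needs more parts.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)
DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>%s</w:body></w:document>'
)


def build_docx(paragraphs):
    """Return the bytes of a minimal .docx with one paragraph per string."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("word/document.xml", DOCUMENT % body)
    return buffer.getvalue()
```

Writing the returned bytes to a file opened in "wb" mode with a .docx extension gives something you can try opening in a word processor.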

The simplest way to generate a PDF is to generate HTML (with CSS and images) and convert it using the wkhtmltopdf tool.

Is there a Python module for converting RTF to plain text? [closed]

Ideally, I'd like a module or library that doesn't require superuser access to install; I have limited privileges in my working environment.

I've been working on a library called Pyth, which can do this:
http://pypi.python.org/pypi/pyth/
Converting an RTF file to plaintext looks something like this:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
doc = Rtf15Reader.read(open('sample.rtf'))
print PlaintextWriter.write(doc).getvalue()
Pyth can also generate RTF files, read and write XHTML, generate documents from Python markup a la Nevow's stan, and has limited experimental support for latex and pdf output. Its RTF support is pretty robust -- we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.

OpenOffice has an RTF reader. You can use Python to script OpenOffice, see here for more info.
You could probably try using the magic COM object on Windows to read anything that smells MS-binary. I wouldn't recommend that though.
Actually, parsing the raw data probably won't be very hard, see this example written in .bat/QBasic.
DocFrac is a free, open-source converter between RTF, HTML and text. Windows, Linux, ActiveX and DLL versions are available. It will probably be pretty easy to wrap it up in Python.
RTF::TEXT::Converter - Perl extension for converting RTF into text (in case you have problems with DocFrac).
Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.
Good luck (with the limited privileges in your working environment).

If you are on a Mac, you can convert an RTF file file.rtf to TXT from the CLI like this:
textutil -convert txt file.rtf

Have you checked out pyrtf-ng?
Update: The parsing functionality is available if you do a Subversion checkout, but I'm not sure how full-featured it is. (Look in the rtfng.parser.base module.)

Here's a link to a script that converts rtf to text using regex:
Regular Expression for extracting text from an RTF string
Also, an updated link on GitHub:
Github link

There is a good library, pyrtf-ng, for all-purpose RTF handling.

PyRTF-ng 0.9.1 failed to parse either of my RTF documents; both raised a ParsingException.
The first document was generated with OpenOffice 3.4, the second one with Mac TextEdit.
Pyth 0.5.6 parsed both documents without problems, but did not process Cyrillic characters properly.
Yet each editor opens the other editor's document correctly and without trouble, so all these libraries seem to have weak RTF support.
So I'm writing my own parser, with blackjack and hookers.
(I've uploaded both files, so you can check RTF libraries yourself: http://yadi.sk/d/RMHawVdSD8O9 http://yadi.sk/d/RmUaSe5tD8OD)

I just came across pyrtflib - there's not much (any) documentation on it, it's kinda a case of installing it and then using the inbuilt help() function to find out what's available and what everything does.
Having said that, in my little trial run its rtf.Rtf2Html.getHtml() function went well enough. I haven't tried the Rtf2Txt function, but given the simpler nature of converting RTF to plain text, I'd expect it to do fine.

I ran into the same thing and I was trying to code it myself. It's not that easy, but here is what I had when I decided to go for a command-line app. It's Ruby, but you can adapt it to Python very easily.
There is some header garbage to clean up, but you can see more or less the idea.
f = File.open('r.rtf','r')
b=0       # current brace nesting depth
p=false   # true while inside a control word
str = ''
begin
  while (char = f.readchar)
    if char.chr=='{'
      b+=1
      next
    end
    if char.chr=='}'
      b-=1
      next
    end
    if char.chr=='\\'
      p=true
      next
    end
    # whitespace ends a control word (double quotes so the escapes are real characters)
    if p==true && (char.chr==' ' or char.chr=="\n" or char.chr=="\t" or char.chr=="\r")
      p=false
      next
    end
    if p==true && (char.chr=='\'')
      # this is the source of my headaches. you need to read the code page from the header and encode this.
      p=false
      str << '#'
      next
    end
    next if b>2
    next if p
    str << char.chr
  end
rescue EOFError
end
f.close
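Since the question is about Python, a fairly direct port of the Ruby loop above might look like this. It is a sketch with the same limitations as the original: \'xx escapes are replaced by '#' (the hex digits that follow are left in the output), header text can leak through, and a control word that is not followed by whitespace swallows the text after it. The name rtf_to_text is made up.

```python
def rtf_to_text(rtf):
    """Crudely strip RTF markup from the string `rtf`."""
    depth = 0           # current brace nesting depth
    in_control = False  # True while inside a control word such as \par
    out = []
    for ch in rtf:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == "\\":
            in_control = True
        elif in_control and ch in " \n\t\r":
            in_control = False  # whitespace ends the control word
        elif in_control and ch == "'":
            # \'xx is a code-page byte; decoding it needs the header's code page.
            in_control = False
            out.append("#")
        elif depth > 2 or in_control:
            continue  # skip deeply nested groups and control-word letters
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    with open("r.rtf") as f:
        print(rtf_to_text(f.read()))
```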

Conversely, if you want to write RTFs easily from Python, you can use the third-party module rtflib. It's a fairly new and incomplete module but still very powerful and useful. Below is an example that writes "hello world" in rich text to an RTF called helloworld.rtf. This is a very primitive example, and the module can also be used to add colors, italics, tables, and many other aspects of rich text to RTF files.
from rtflib import *
file = RTF("helloworld.rtf")
file.startfile()
file.addstrict()
file.addtext("hello world")
file.writeout()
