This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
solution to convert PDFs, DOCs, DOCXs into a textual format with python
I am making a document search engine which indexes popular binary formats. I am looking for python libraries for this purpose.
Reliable converters proved too hard to find. PyPDF never works accurately. Please reccomend:
python libraries that convert these formats to text
or cross-platform, standalone programs that can be called as a subprocess
You can sort of read .docx by unzipping it and then rootling around in the resulting folder structure. See How can I search a word in a Word 2007 .docx file?.
If pyPDF isn't working for you, you can use pdftotext as a subprocess.
.doc is probably the hardest. Is COM scripting an option for you? That is, asking Word to open the file and export it as text? There's a linux utility extracting text from MS word files in python.
You might try Open Office.
It's converting skills are above average. For editing PDF documents, you need to install the pdf import extension.
There are some extensions to work with python, such as the python-uno bridge, but I've had difficulty with it, and generally resort to calling open office as a subprocess.
Just noticed you opened a duplicate question at:
solution to convert PDFs, DOCs, DOCXs into a textual format with python...
Related
I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
currentDoc=Document(os.path.join(folderPath,xmlFileName))
docxFileName=xmlFileName.replace('.xml','.docx')
currentDoc.save(os.path.join(folderPath,docxFileName))
currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there are more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are a transitional format used I think in Word 2003, which was XML-based, but not the Open Packaging Convention (OPC) format used in Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-pptx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.
I checked mose question and answers in stackoverflow and others there is many way to open and read the .docx file not doc by using python
I already checked python-docx library but it only support to docx.
I wanna to open and extract text from .doc file(not docx). Plase help me Because I'm new in python
You can use Tika Python, it's an Apache Tika bindings for python. Another good library is a textract.
I created library to extract text from doc files. It works for C and Python
https://github.com/uvoteam/libdoc
usage example:
import extract_doc
with open('./test.doc', 'rb') as myfile:
data = bytearray(myfile.read())
print(extract_doc.extract_doc_text(data, len(data)))
I want to get the content (text only) in a ppt file. How to do it?
(It likes that if I want to get content in a txt file, I just need to open and read. What do I need to do to get information from ppt files?)
By the way, I know there is a win32com in windows system. But now I am working on linux, is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice to do this (and for .doc, .docx, .pptx, etc, etc.), and the Apache Tika Project (which appears to be the 5,000lb gorilla in this solution space).
We have a project in python with django.
We need to generate complex word, excel and pdf files.
For the rest of our projects which were done in PHP we used PHPexcel ,
PHPWord and tcpdf for PDF.
What libraries for python would you recommend for creating this kind of files ? (for excel and word its imortant to use the open xml file format xlsx , docx)
Python-docx may help ( https://github.com/mikemaccana/python-docx ).
Python doesn't have highly-developed tools to manipulate word documents. I've found the java library xdocreport ( https://code.google.com/p/xdocreport/ ) to be the best by far for Word reporting. Because I need to generate PCL, which is efficiently done via FOP I also use docx4j.
To integrate this with my python, I use the spark framework to wrap it up with a simple web service, and use requests on the python side to talk to the service.
For excel, there's openpyxl, which actually is a python port of PHPexcel, afaik. I haven't used it yet, but it sounds ok to me.
I would recommend using Docutils. It takes reStructuredText files and converts them to a range of output files. Included in the package are HTML, LaTeX and .odf file writers but in the sandbox there are a whole load of other writers for writing to other formats, see for example, the WordML writer (disclaimer: I haven't used it).
The advantage of this solution is that you can write plain text (reStructuredText) master files, which are human readable as is, and then convert to a range of other file formats as required.
Whilst not a Python solution, you should also look at Pandoc a Haskell library which supports a much wider range of output and input formats than docutils. One major advantage of Pandoc over Docutils is that you can do the reverse translation, i.e. WordML to reStructuredText. You can try Pandoc here.
I have never used any libraries for this, but you can change the extension of any docx, xlsx file to zip, and see the magic!
Generating openxml files is as simple as generating couple of XML files (you can use templates) and zipping it.
Simplest way to generate PDF is to generate HTML (with CSS+images) and convert it using wkhtmltopdf tool.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
Improve this question
Ideally, I'd like a module or library that doesn't require superuser access to install; I have limited privileges in my working environment.
I've been working on a library called Pyth, which can do this:
http://pypi.python.org/pypi/pyth/
Converting an RTF file to plaintext looks something like this:
from pyth.plugins.rtf15.reader import Rtf15Reader
from pyth.plugins.plaintext.writer import PlaintextWriter
doc = Rtf15Reader.read(open('sample.rtf'))
print PlaintextWriter.write(doc).getvalue()
Pyth can also generate RTF files, read and write XHTML, generate documents from Python markup a la Nevow's stan, and has limited experimental support for latex and pdf output. Its RTF support is pretty robust -- we use it in production to read RTF files generated by various versions of Word, OpenOffice, Mac TextEdit, EIOffice, and others.
OpenOffice has a RTF reader. You can use python to script OpenOffice, see here for more info.
You could probably try using the magic com-object on Windows to read anything that smells ms-binary. I wouldn't recommend that though.
Actually parsing the raw data probably won't be very hard, see this example written in .bat/QBasic.
DocFrac is a free open source converter betweeen RTF, HTML and text. Windows, Linux, ActiveX and DLL platforms available. It will probably be pretty easy to wrap it up in python.
RTF::TEXT::Converter - Perl extension for converting RTF into text. (in case You have problems withg DocFrac).
Official Rich Text Format (RTF) Specifications, version 1.7, by Microsoft.
Good luck (with the limited privileges in Your working environment).
If you are on Mac , you can convert an RTF file file.rtf to TXT from the CLI like:
textutil -convert txt file.rtf
Have you checked out pyrtf-ng?
Update: The parsing functionality is available if you do a Subversion checkout, but I'm not sure how full-featured it is. (Look in the rtfng.parser.base module.)
Here's a link to a script that converts rtf to text using regex:
Regular Expression for extracting text from an RTF string
Also, and updated link on github:
Github link
There is good library pyrtf-ng for all-purpose RTF handling.
PyRTF-ng 0.9.1 has not parsed any of my RTF documents, both with the ParsingException.
First document was generated with OpenOffice 3.4, the second one with Mac TextEdit.
Pyth 0.5.6 parsed without problems both documents, but has not processed cyrillic symbols properly.
But each editor opens other's editor document correctly and without trouble, so all libraries seems to have a weak rtf support.
So I'm writing my own parser with with blackjack and hookers.
(I've uploaded both files, so you can check RTF libraries by yourself: http://yadi.sk/d/RMHawVdSD8O9 http://yadi.sk/d/RmUaSe5tD8OD)
I just came across pyrtflib - there's not much (any) documentation on it, it's kinda a case of installing it and then using the inbuilt help() function to find out what's available and what everything does.
Having said that in my little trial run of its rtf.Rtf2Html.getHtml() function it went well enough. I haven't tried the Rtf2Txt function but given the simpler nature of converting rtf to plaintext it should do fine I'd expect.
I ran into the same thing ans I was trying to code it myself. It's not that easy but here is what I had when I decided to go for a commandline app. Its ruby but you can adapt to python very easily.
There is some header garbage to clean up, but you can see more or less the idea.
f = File.open('r.rtf','r')
b=0
p=false
str = ''
begin
while (char = f.readchar)
if char.chr=='{'
b+=1
next
end
if char.chr=='}'
b-=1
next
end
if char.chr=='\\'
p=true
next
end
if p==true && (char.chr==' ' or char.chr=='\n' or char.chr=='\t' or char.chr=='\r')
p=false
next
end
if p==true && (char.chr=='\'')
#this is the source of my headaches. you need to read the code page from the header and encode this.
p=false
str << '#'
next
end
next if b>2
next if p
str << char.chr
end
rescue EOFError
end
f.close
Conversely, if you want to write RTFs easily from Python, you can use the third-party module rtflib. It's a fairly new and incomplete module but still very powerful and useful. Below is an example that writes "hello world" in rich text to an RTF called helloworld.rtf. This is a very primitive example, and the module can also be used to add colors, italics, tables, and many other aspects of rich text to RTF files.
from rtflib import *
file = RTF("helloworld.rtf")
file.startfile()
file.addstrict()
file.addtext("hello world")
file.writeout()