I want to get the content (text only) of a ppt file. How can I do it?
(For a txt file, I just need to open it and read it. What do I need to do to get the information out of a ppt file?)
By the way, I know there is win32com on Windows, but I am working on Linux. Is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice (which also handles .doc, .docx, .pptx, and so on) and the Apache Tika project (which appears to be the 5,000 lb gorilla in this solution space).
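For the newer .pptx format specifically, the text can also be read with the Python standard library alone. This is a different technique from the LibreOffice and Tika routes named above: a .pptx file is a zip archive, each slide is stored as ppt/slides/slideN.xml, and the visible text sits in DrawingML a:t elements. The sketch below assumes that layout; the function name pptx_text is made up for illustration, and it will not read legacy binary .ppt files.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; text runs on a slide are stored in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_text(path):
    """Return one string of text per slide (hypothetical helper, .pptx only).

    Slides are ordered by the number in their file name, which usually but
    not necessarily matches the order shown in the presentation.
    """
    slides = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                root = ET.fromstring(archive.read(name))
                runs = [t.text for t in root.iter(A_NS + "t") if t.text]
                slides.append((int(match.group(1)), "\n".join(runs)))
    return [text for _, text in sorted(slides)]
```

Calling pptx_text("deck.pptx") (a hypothetical file name) would return a list of strings, one per slide. For old .ppt files, converting with LibreOffice or Tika first remains the safer option.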
Related
I am using Python 3 on Windows 10 (though OS X is also available). I am attempting to extract the text from lots of .pdf files, all in Chinese characters. I have had success with pdfminer and textract, except for certain files. These files are not images, but proper documents with selectable text. If I use Adobe Acrobat Pro X and export to .txt, my output looks like:
!!
F/.....e..................!
216.. ..... .... ....
........
If I output to .doc, .docx, .rtf, or even copy-paste into any text editor, it looks like this:
ҁϦљӢख़ε༊ݢቹៜϐѦჾѱ॥ᓀϩӵΠ
I have no idea why Adobe would display the text properly but not export it correctly or even let me copy-paste. I thought maybe it was a font issue, the font is DFKaiShu sb-estd-bf which I already have installed (it appears to automatically come with windows).
I do have a workaround, but it's ugly and very difficult to automate; I print the pdf to a pdf (or any sort of image), then use adobe pro's built-in OCR, then convert to a word document (it still does not convert correctly to .txt). Ultimately I need to do this for ~2000 documents, each of which can be up to 200 pages.
Is there any other way to do this? Why is exporting or copy-pasting not working correctly? I have uploaded a 2-page sample to google drive here.
I have been given a series of folders with large amounts of Word documents in .xml formatting. They each contain some VBA code, but the code on all of them has already been run, so I don't need to keep this.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document

folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
    currentDoc=Document(os.path.join(folderPath,xmlFileName))
    docxFileName=xmlFileName.replace('.xml','.docx')
    currentDoc.save(os.path.join(folderPath,docxFileName))
    currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there a more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are in a transitional format, used in Word 2003 I think, which was XML-based but was not the Open Packaging Convention (OPC) format used for Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
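One way to check that suspicion before choosing a conversion route is to look at what each file actually is. The sketch below is a rough guess at a classifier, not part of python-docx: it assumes that a real .docx is a zip archive, that Word 2003 XML has a wordDocument root in the 2003 wordml namespace, and that the single-file "Flat OPC" XML variant (an extra possibility not mentioned above) has a package root. The function name classify_word_file is hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# Root elements of the two XML-based Word formats that are not zip packages.
WORD_2003_ROOT = "{http://schemas.microsoft.com/office/word/2003/wordml}wordDocument"
FLAT_OPC_ROOT = "{http://schemas.microsoft.com/office/2006/xmlPackage}package"


def classify_word_file(path):
    """Guess the kind of Word file at `path` (hypothetical helper)."""
    if zipfile.is_zipfile(path):
        # A zip-based OPC package, i.e. what python-docx expects to open.
        return "opc-zip"
    try:
        # Only the first start tag is needed, so stop after one event.
        _, root = next(ET.iterparse(path, events=("start",)))
    except (ET.ParseError, StopIteration):
        return "unknown"
    if root.tag == WORD_2003_ROOT:
        return "word-2003-xml"
    if root.tag == FLAT_OPC_ROOT:
        return "flat-opc-xml"
    return "unknown"
```

Anything other than "opc-zip" would explain the PackageNotFoundError, since python-docx looks for a zip package at the given path.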
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-docx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script for that without the conversion step might be a good option. python-docx doesn't print .docx files, only read and write them.
I was wondering if it's possible to edit an existing pdf file with Pdfminer. It seems to be a powerful tool, but the documentation is poor or nonexistent.
I found some examples, but they don't match my goal. I want to make a search engine which changes the color of my keywords in the pdf file.
PDFMiner is not for altering existing PDF files, but for extracting text and metadata from them. The closest solution to what you're looking for using PDFMiner would probably be to use the included pdf2txt.py tool to extract the text and then mark that up to highlight your keywords.
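The "extract, then mark up" idea could look roughly like the following. It is only a sketch under assumptions: the text is taken to have been extracted already (for example by pdf2txt.py), HTML is chosen here as the markup, and the function name highlight_keywords is made up.

```python
import html
import re


def highlight_keywords(text, keywords):
    """Return `text` as HTML with every keyword occurrence wrapped in <mark>."""
    # Longest keywords first, so "pdfminer" wins over "pdf" when both are given.
    words = sorted((k for k in keywords if k), key=len, reverse=True)
    if not words:
        return html.escape(text)
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the plain text between matches, then wrap the match itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append("<mark>" + html.escape(match.group(0)) + "</mark>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

The result is a highlighted copy of the text, not a modified PDF, so the original layout is lost.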
There's also the simple option of just using a PDF viewer with the built-in ability to find and highlight multiple search terms. I think Adobe Acrobat can do it, but I'm not sure about others.
No, pdfminer doesn't support editing.
However, it might be a lot easier if you don't try to modify the pdf, but use PDFOpenParameters instead: http://partners.adobe.com/public/developer/en/acrobat/PDFOpenParameters.pdf
You can use url fragment identifiers like this:
http://www.example.com/test.pdf#search=foo
Or even when opening Acrobat on the command line (Windows example):
AcroRd32.exe /A "search=foo" test.pdf
You could also open the PDF at a specific page, and highlight a certain area of that page (but not different areas on different pages at the same time).
(ok, I know it's not really a solution for the question you asked, but if this is sufficient for your needs, it's a lot simpler)
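If the links are generated by a program, the fragment can be built like this. The sketch assumes the search and page parameters from the Adobe document linked above, joined with "&"; the function name pdf_open_url is hypothetical, and how a given viewer treats multi-word searches (Adobe's document shows the word list in quotation marks) is not something this code can guarantee.

```python
from urllib.parse import quote


def pdf_open_url(base_url, search=None, page=None):
    """Append PDF open parameters to `base_url` as a URL fragment."""
    params = []
    if page is not None:
        params.append("page=%d" % page)
    if search:
        # Percent-encode spaces and other characters unsafe in a fragment.
        params.append("search=" + quote(search))
    if not params:
        return base_url
    return base_url + "#" + "&".join(params)
```

For example, pdf_open_url("http://www.example.com/test.pdf", search="foo") reproduces the URL shown above.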
Possible Duplicate:
solution to convert PDFs, DOCs, DOCXs into a textual format with python
I am making a document search engine which indexes popular binary formats. I am looking for python libraries for this purpose.
Reliable converters have proved too hard to find. PyPDF never works accurately. Please recommend:
python libraries that convert these formats to text
or cross-platform, standalone programs that can be called as a subprocess
You can sort of read .docx by unzipping it and then rootling around in the resulting folder structure. See How can I search a word in a Word 2007 .docx file?.
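A minimal sketch of that unzip-and-look approach, assuming the usual .docx layout in which the main body lives in word/document.xml and text runs are w:t elements inside w:p paragraphs. The function name docx_paragraphs is made up, and headers, footers, footnotes and text boxes are not handled specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in the main document body."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the pieces back up.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return paragraphs
```

Joining the returned list with newlines gives plain text that an indexer can consume.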
If pyPDF isn't working for you, you can use pdftotext as a subprocess.
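Calling pdftotext from Python might look like the sketch below. It assumes the pdftotext binary (from Xpdf or Poppler) is on the PATH and uses "-" as the output file so that the text goes to standard output; the function names are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    cmd += [pdf_path, "-"]     # "-" as output file means standard output
    return cmd


def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext on `pdf_path` and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.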
.doc is probably the hardest. Is COM scripting an option for you? That is, asking Word to open the file and export it as text? There's also a Linux utility; see "extracting text from MS word files in python".
You might try Open Office.
Its conversion abilities are above average. For editing PDF documents, you need to install the PDF import extension.
There are some extensions to work with python, such as the python-uno bridge, but I've had difficulty with it, and generally resort to calling open office as a subprocess.
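The subprocess route could be sketched as follows. The flags shown are the ones LibreOffice's soffice binary accepts (--headless, --convert-to, --outdir); whether a particular OpenOffice build accepts the same spelling is an assumption to verify locally, and the function names are hypothetical.

```python
import subprocess


def soffice_convert_command(src, out_dir, target="txt:Text", binary="soffice"):
    """Build a headless conversion command for one input file."""
    return [binary, "--headless", "--convert-to", target, "--outdir", out_dir, src]


def convert_to_text(src, out_dir, timeout=120):
    """Convert `src` to plain text in `out_dir` by running the office suite."""
    # check=True raises CalledProcessError if the converter reports failure.
    subprocess.run(soffice_convert_command(src, out_dir), check=True, timeout=timeout)
```

The converted file is written into out_dir under the input's base name with the new extension, so the caller still has to read it back.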
Just noticed you opened a duplicate question at:
solution to convert PDFs, DOCs, DOCXs into a textual format with python...
I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.
I'm currently using PyPDF to read, write, and combine existing PDFs and ReportLab to generate new content. Using the two packages together seemed to work better than any single package I was able to find.
If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw to inspect a page to see if it contains a certain string of text in the lower right-hand corner. However, if you don't need to do anything like that you should be fine.
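Placing existing pages with pdfrw and ReportLab might be sketched like this, modelled on the pattern in pdfrw's own ReportLab examples. It requires the third-party packages pdfrw and reportlab; the function name place_pdf_pages and the footer text are made up for illustration.

```python
def place_pdf_pages(src_path, out_path, footer="stamped"):
    """Copy every page of `src_path` into a new ReportLab document."""
    # Imported inside the function so that merely loading this sketch does
    # not require pdfrw and reportlab to be installed.
    from pdfrw import PdfReader
    from pdfrw.buildxobj import pagexobj
    from pdfrw.toreportlab import makerl
    from reportlab.pdfgen.canvas import Canvas

    # Wrap each source page as a form XObject that ReportLab can draw.
    pages = [pagexobj(page) for page in PdfReader(src_path).pages]
    canvas = Canvas(out_path)
    for page in pages:
        canvas.setPageSize((page.BBox[2], page.BBox[3]))
        canvas.doForm(makerl(canvas, page))
        # New ReportLab content (a barcode, a label, ...) goes on top here.
        canvas.drawString(36, 36, footer)
        canvas.showPage()
    canvas.save()
    return len(pages)
```

Because the source page is embedded as an opaque form, this matches the caveat above: the code can draw over a page but cannot inspect what is on it.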
There is an add-on for ReportLab — PageCatcher.