I am trying to extract text from a PDF using the pdfminer.six library (like here). I have already installed it in my virtual environment. Here is my code:
import pdfminer as miner
text = miner.high_level.extract_text('file.pdf')
print(text)
but when I execute the code with python pdfreader.py I get the following error :
Traceback (most recent call last):
File ".\pdfreader.py", line 9, in <module>
text = miner.high_level.extract_text('pdfBulletins/corona1.pdf')
AttributeError: module 'pdfminer' has no attribute 'high_level'
I suspect it has something to do with the Python path, because I installed pdfminer inside my virtual environment, but I see that this installed pdf2txt.py outside in my system python install. Is this behavior normal? I mean something that happens inside my venv should not alter my system Python installation.
I successfully extracted the text using pdf2txt.py utility that comes with pdfminer.six library (from command line and using system python install), but not from the code inside my venv project. My pdfminer.six version is 20201018
What could be the problem with my code ?
extract_text from pdfminer.high_level also accepts optional parameters; only the file argument is required. The code below uses pdfminer.six, passes those parameters explicitly with their default values, and extracts the text from my PDF files.
from pdfminer.high_level import extract_text
with open('my_file.pdf', 'rb') as pdf_file:
    text = extract_text(pdf_file, password='', page_numbers=None, maxpages=0, caching=True, codec='utf-8', laparams=None)
print(text)
Here are a couple of additional posts that I wrote on extracting text from PDF files that might be useful:
Unsuccessful attempt to extract text data from PDF
How to convert whole pdf to text in python
how to write code to extract a specific text and integer on the same line from a pdf file using python?
Python Data Extraction from an Encrypted PDF
Your problem is trying to use a function from a module you have not imported. Importing pdfminer does NOT automatically also import pdfminer.high_level.
This works:
from pdfminer.high_level import extract_text
text = extract_text('file.pdf')
print(text)
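To see the behaviour described above without depending on pdfminer itself, here is a minimal sketch. It builds a throwaway package on disk (the names demo_pkg_sketch and high_level are made up for the illustration) and shows that the submodule only becomes an attribute of the package after it has been imported explicitly.

```python
import importlib
import os
import sys
import tempfile


def demo_submodule_import():
    """Return (attribute present before, attribute present after) importing the submodule."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, "demo_pkg_sketch")  # hypothetical package name
    os.makedirs(pkg_dir)
    # An empty __init__.py: the package does not import its own submodules.
    with open(os.path.join(pkg_dir, "__init__.py"), "w"):
        pass
    with open(os.path.join(pkg_dir, "high_level.py"), "w") as f:
        f.write("def extract_text(path):\n    return 'text from ' + path\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        pkg = importlib.import_module("demo_pkg_sketch")
        before = hasattr(pkg, "high_level")
        # Importing the submodule binds it as an attribute of the package.
        importlib.import_module("demo_pkg_sketch.high_level")
        after = hasattr(pkg, "high_level")
    finally:
        # Clean up so the demo can be run again in the same interpreter.
        sys.path.remove(root)
        sys.modules.pop("demo_pkg_sketch.high_level", None)
        sys.modules.pop("demo_pkg_sketch", None)
    return before, after


if __name__ == "__main__":
    print(demo_submodule_import())
```

The same mechanism explains the AttributeError in the question: a bare import pdfminer runs only the package's __init__.py, so high_level is not available until something imports pdfminer.high_level.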
Try pdfreader to extract texts (plain and containing PDF operators) from PDF document
Here is a sample code extracting all the above from all document pages.
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
You'll need to install pdfminer.six instead of just pdfminer:
pip install pdfminer.six
Only after that, you can import extract_text as:
from pdfminer.high_level import extract_text
Problem in my case
pdfminer and pdfminer.six are both installed,
from pdfminer.high_level import extract_text then tries to use the wrong package.
Solution
For me uninstalling pdfminer worked:
pip uninstall pdfminer
now you should only have pdfminer.six installed and should be able to import extract_text.
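A quick way to check whether both distributions are installed side by side is to ask the packaging metadata. This is only a sketch: the helper name installed_versions is made up here, and importlib.metadata requires Python 3.8 or later.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(distributions=("pdfminer", "pdfminer.six")):
    """Map each distribution name to its installed version, or None if it is absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, "->", ver or "not installed")
```

If both names report a version, the two packages are competing for the same pdfminer import name, which matches the situation described above.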
I am a recent graduate in pure mathematics who has only taken a few basic programming courses. I am doing an internship and I have an internal data analysis project. I have to analyze the internal PDFs of the last few years. The PDFs are "secured." In other words, they are encrypted. We do not have the PDF passwords; moreover, we are not sure whether passwords exist. But we have all these documents and we can read them manually. We can print them as well. The goal is to read them with Python because it is the language we are somewhat familiar with.
First, I tried to read the PDFs with some Python libraries. However, the Python libraries that I found do not read encrypted PDFs. At that time, I could not export the information using Adobe Reader either.
Second, I decided to decrypt the PDFs. I was successful using the Python library pikepdf. Pikepdf works very well! However, the decrypted PDFs cannot be read either with the Python libraries from the previous point (PyPDF2 and Tabula). At this point we have made some progress, because using Adobe Reader I can export the information from the decrypted PDFs, but the goal is to do everything with Python.
The code that I am showing works perfectly with unencrypted PDFs, but not with encrypted PDFs. It does not work with the decrypted PDFs produced by pikepdf either.
I did not write the code. I found it in the documentation of the Python libraries pikepdf and Tabula. The PyPDF2 solution was written by Al Sweigart in his book, "Automate the Boring Stuff with Python," which I highly recommend. I also checked that the code works, with the limitations that I explained before.
First question:
Why can I not read the decrypted files, if the programs work with files that have never been encrypted?
Second question:
Can we read the decrypted files with Python somehow? Which library can do it, or is it impossible? Are all decrypted PDFs extractable?
Thank you for your time and help!!!
I found these results using Python 3.7, Windows 10, Jupyter Notebooks, and Anaconda 2019.07.
import pikepdf
with pikepdf.open("encrypted.pdf") as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")
import tabula
tabula.read_pdf("decrypted.pdf", stream=True)
import PyPDF2
pdfFileObj=open("decrypted.pdf", "rb")
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
pdfReader.numPages
pageObj=pdfReader.getPage(0)
pageObj.extractText()
With Tabula, I am getting the message "the output file is empty."
With PyPDF2, I am getting only '/n'
UPDATE 10/3/2019 Pdfminer.six (Version November 2018)
I got better results using the solution posted by DuckPuncher. For the decrypted file, I got the labels, but not the data. The same happens with the encrypted file. For the file that has never been encrypted, it works perfectly. As I need the data and the labels of encrypted or decrypted files, this code does not work for me. For that analysis, I used pdfminer.six, a Python library that was released in November 2018. Pdfminer.six includes the library pycryptodome. According to its documentation, "PyCryptodome is a self-contained Python package of low-level cryptographic primitives."
The code is in the stack exchange question:
Extracting text from a PDF file using PDFMiner in python?
I would love if you want to repeat my experiment. Here is the description:
1) Run the code mentioned in this question with any PDF that has never been encrypted.
2) Do the same with a "Secure" PDF (this is a term that Adobe uses); I am calling it the encrypted PDF. Use a generic form that you can find using Google. After you download it, you need to fill in the fields. Otherwise, you would be checking for labels, but not fields. The data is in the fields.
3) Decrypt the encrypted PDF using pikepdf. This will be the decrypted PDF.
4) Run the code again using the decrypted PDF.
UPDATE 10/4/2019 Camelot (Version July 2019)
I found the Python library Camelot. Be careful that you need camelot-py 0.7.3.
It is very powerful, and works with Python 3.7. Also, it is very easy to use. First, you need also to install Ghostscript. Otherwise, it will not work.
You need also to install Pandas. Do not use pip install camelot-py. Instead use pip install camelot-py[cv]
The author of the program is Vinayak Mehta. Frank Du shares this code in a YouTube video "Extract tabular data from PDF with Camelot Using Python."
I checked the code and it is working with unencrypted files. However, it does not work with encrypted and decrypted files, and that is my goal.
Camelot is oriented to get tables from PDFs.
Here is the code:
import camelot
import pandas
name_table = camelot.read_pdf("unencrypted.pdf")
type(name_table)
# This is a camelot TableList, not a pandas DataFrame
name_table[0]
first_table = name_table[0]
# Get the camelot table object as a pandas DataFrame
first_table.df
first_table.to_excel("unencrypted.xlsx")
# This creates an Excel file.
# The same can be done with csv, json, html, or sqlite.
# To get all the tables of the PDF, use this loop.
for table in name_table:
    print(table.df)
UPDATE 10/7/2019
I found one trick. If I open the secured PDF with Adobe Reader, print it using Microsoft Print to PDF, and save it as a PDF, I can extract the data using that copy. I can also convert the PDF file to JSON, Excel, SQLite, CSV, HTML, and other formats. This is a possible solution to my question. However, I am still looking for an option to do it without that trick because the goal is to do it 100% with Python. I am also concerned that if a better method of encryption is used, the trick may not work. Sometimes you need to use Adobe Reader several times to get an extractable copy.
UPDATE 10/8/2019. Third question.
I now have a third question. Are all secured/encrypted PDFs password protected? Why is pikepdf not working? My guess is that the current version of pikepdf can break some types of encryption but not all of them.
@constt mentioned that PyPDF2 can break some types of protection. However, I replied to him that I found an article saying that PyPDF2 can break encryption made with Adobe Acrobat Pro 6.0, but not with later versions.
LAST UPDATED 10-11-2019
I'm unsure if I understand your question completely. The code below can be refined, but it reads in either an encrypted or unencrypted PDF and extracts the text. Please let me know if I misunderstood your requirements.
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
def extract_encrypted_pdf_text(path, encryption_true, decryption_password):
    output = StringIO()
    resource_manager = PDFResourceManager()
    laparams = LAParams()
    device = TextConverter(resource_manager, output, codec='utf-8', laparams=laparams)
    pdf_infile = open(path, 'rb')
    interpreter = PDFPageInterpreter(resource_manager, device)
    page_numbers = set()
    if encryption_true == False:
        for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, caching=True, check_extractable=True):
            interpreter.process_page(page)
    elif encryption_true == True:
        for page in PDFPage.get_pages(pdf_infile, page_numbers, maxpages=0, password=decryption_password, caching=True, check_extractable=True):
            interpreter.process_page(page)
    text = output.getvalue()
    pdf_infile.close()
    device.close()
    output.close()
    return text
results = extract_encrypted_pdf_text('encrypted.pdf', True, 'password')
print (results)
I noted that your pikepdf code used to open an encrypted PDF was missing a password, which should have thrown this error message:
pikepdf._qpdf.PasswordError: encrypted.pdf: invalid password
import pikepdf
with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")
You can use tika to extract the text from the decrypted.pdf created by pikepdf.
from tika import parser
parsedPDF = parser.from_file("decrypted.pdf")
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')
Additionally, pikepdf does not currently implement text extraction; this includes the latest release, v1.6.4.
I decided to run a couple of tests using various encrypted PDF files.
I named all the encrypted files 'encrypted.pdf' and they all used the same encryption and decryption password.
Adobe Acrobat 9.0 and later - encryption level 256-bit AES
pikepdf was able to decrypt this file
PyPDF2 could not extract the text correctly
tika could extract the text correctly
Adobe Acrobat 6.0 and later - encryption level 128-bit RC4
pikepdf was able to decrypt this file
PyPDF2 could not extract the text correctly
tika could extract the text correctly
Adobe Acrobat 3.0 and later - encryption level 40-bit RC4
pikepdf was able to decrypt this file
PyPDF2 could not extract the text correctly
tika could extract the text correctly
Adobe Acrobat 5.0 and later - encryption level 128-bit RC4
created with Microsoft Word
pikepdf was able to decrypt this file
PyPDF2 could extract the text correctly
tika could extract the text correctly
Adobe Acrobat 9.0 and later - encryption level 256-bit AES
created using pdfprotectfree
pikepdf was able to decrypt this file
PyPDF2 could extract the text correctly
tika could extract the text correctly
PyPDF2 was able to extract text from decrypted PDF files not created with Adobe Acrobat.
I would assume that the failures have something to do with embedded formatting in the PDFs created by Adobe Acrobat. More testing is required to confirm this conjecture about the formatting.
tika was able to extract text from all the documents decrypted with pikepdf.
import pikepdf
with pikepdf.open("encrypted.pdf", password='password') as pdf:
    num_pages = len(pdf.pages)
    del pdf.pages[-1]
    pdf.save("decrypted.pdf")
from PyPDF2 import PdfFileReader
def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        page = pdf.getPage(1)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text)
text_extractor('decrypted.pdf')
PyPDF2 cannot decrypt Acrobat PDF files >= 6.0
This issue has been open with the module owners since September 15, 2015. It is unclear from the comments related to this issue when the problem will be fixed by the project owners. The last commit was June 25, 2018.
PyPDF4 decryption issues
PyPDF4 is the replacement for PyPDF2. This module also has decryption issues with certain algorithms used to encrypt PDF files.
test file: Adobe Acrobat 9.0 and later - encryption level 256-bit AES
PyPDF2 error message: only algorithm code 1 and 2 are supported
PyPDF4 error message: only algorithm code 1 and 2 are supported. This PDF uses code 5
UPDATE SECTION 10-11-2019
This section is in response to your updates on 10-07-2019 and 10-08-2019.
In your update you stated that you could open a 'secured pdf with Adobe Reader' and print the document to another PDF, which removes the 'SECURED' flag. After doing some testing, I believe that I have figured out what is occurring in this scenario.
Adobe PDFs level of security
Adobe PDFs have multiple types of security controls that can be enabled by the owner of the document. The controls can be enforced with either a password or a certificate.
Document encryption (enforced with a document open password)
Encrypt all document contents (most common)
Encrypt all document contents except metadata (Acrobat 6.0 and later)
Encrypt only file attachments (Acrobat 7.0 and later)
Restrictive editing and printing (enforced with a permissions password)
Printing Allowed
Changes Allowed
The image below shows an Adobe PDF being encrypted with 256-Bit AES encryption. To open or print this PDF a password is required. When you open this document in Adobe Reader with the password, the title will state SECURED
This document requires a password to open with the Python modules that are mentioned in this answer. If you attempt to open an encrypted PDF with Adobe Reader, you should see this:
If you don't get this warning, then the document either has no security controls enabled or only has the restrictive editing and printing ones enabled.
The image below shows restrictive editing being enabled with a password in a PDF document. Note printing is enabled. To open or print this PDF a password is not required. When you open this document in Adobe Reader without a password, the title will state SECURED This is the same warning as the encrypted PDF that was opened with a password.
When you print this document to a new PDF the SECURED warning is removed, because the restrictive editing has been removed.
All Adobe products enforce the restrictions set by the permissions password. However, if third-party products do not support these settings, document recipients are able to bypass some or all of the restrictions set.
So I assume that the document that you are printing to PDF has restrictive editing enabled and does not have a password required to open enabled.
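Both kinds of protection described above are recorded in the file through an encryption dictionary that the trailer references under the /Encrypt key. A rough way to tell "no security controls at all" from "some security control" is therefore to look for that name in the raw bytes. This is only a heuristic sketch with a made-up helper name; it does not parse the PDF, and it cannot by itself distinguish a document-open password from a permissions-only password.

```python
def has_encrypt_entry(path):
    """Heuristic: True if the raw file bytes contain an /Encrypt name."""
    with open(path, "rb") as f:
        data = f.read()
    # The trailer (or cross-reference stream dictionary) is stored
    # uncompressed, so the key name is normally visible as plain bytes.
    return b"/Encrypt" in data


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "->", "has /Encrypt" if has_encrypt_entry(name) else "no /Encrypt")
```

A file that reports an /Encrypt entry but still opens without a password in a PDF library is a good candidate for the permissions-only case assumed above.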
Concerning breaking PDF encryption
Neither PyPDF2 nor PyPDF4 is designed to break the document open password function of a PDF document. Both modules will throw the following error if they attempt to open an encrypted, password-protected PDF file.
PyPDF2.utils.PdfReadError: file has not been decrypted
The opening password function of an encrypted PDF file can be bypassed using a variety of methods, but a single technique might not work and some will not be acceptable because of several factors, including password complexity.
PDF encryption internally works with encryption keys of 40, 128, or 256 bit depending on the PDF version. The binary encryption key is derived from a password provided by the user. The password is subject to length and encoding constraints.
For example, PDF 1.7 Adobe Extension Level 3 (Acrobat 9 - AES-256) introduced Unicode characters (65,536 possible characters) and bumped the maximum length to 127 bytes in the UTF-8 representation of the password.
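The length limit quoted above is counted in bytes of the UTF-8 encoding, not in characters, so a password with non-ASCII characters reaches it sooner than its character count suggests. A small sketch of that arithmetic (the helper names are made up for the illustration, and the 127-byte figure is taken from the paragraph above):

```python
MAX_PASSWORD_BYTES = 127  # limit quoted above for Acrobat 9 / AES-256 passwords


def utf8_password_length(password):
    """Number of bytes the password occupies once encoded as UTF-8."""
    return len(password.encode("utf-8"))


def fits_aes256_limit(password):
    """True if the UTF-8 form of the password is within the quoted limit."""
    return utf8_password_length(password) <= MAX_PASSWORD_BYTES


if __name__ == "__main__":
    for candidate in ("password", "é" * 64):
        print(len(candidate), "characters,", utf8_password_length(candidate), "bytes,",
              "fits" if fits_aes256_limit(candidate) else "too long")
```

Here 64 copies of "é" are only 64 characters but 128 bytes in UTF-8, which is one byte over the limit.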
The code below will open a PDF with restrictive editing enabled. It will save this file to a new PDF without the SECURED warning being added. The tika code will parse the contents from the new file.
from tika import parser
import pikepdf
# opens a PDF with restrictive editing enabled, but that still
# allows printing.
with pikepdf.open("restrictive_editing_enabled.pdf") as pdf:
    pdf.save("restrictive_editing_removed.pdf")
# plain text output
parsedPDF = parser.from_file("restrictive_editing_removed.pdf")
# XHTML output
# parsedPDF = parser.from_file("restrictive_editing_removed.pdf", xmlContent=True)
pdf = parsedPDF["content"]
pdf = pdf.replace('\n\n', '\n')
print (pdf)
This code checks whether a password is required for opening the file. The code can be refined and other functions can be added. There are several other features that could be added, but the documentation for pikepdf does not match the comments within the code base, so more research is required to improve this.
# this would be removed once logging is used
############################################
import sys
sys.tracebacklimit = 0
############################################
import pikepdf
from tika import parser

def create_pdf_copy(pdf_file_name):
    with pikepdf.open(pdf_file_name) as pdf:
        new_filename = f'copy_{pdf_file_name}'
        pdf.save(new_filename)
    return new_filename

def extract_pdf_content(pdf_file_name):
    # plain text output
    # parsedPDF = parser.from_file("restrictive_editing_removed.pdf")
    # XHTML output
    parsedPDF = parser.from_file(pdf_file_name, xmlContent=True)
    pdf = parsedPDF["content"]
    pdf = pdf.replace('\n\n', '\n')
    return pdf

def password_required(pdf_file_name):
    # returns None when the file opens without a password
    try:
        with pikepdf.open(pdf_file_name):
            pass
    except pikepdf.PasswordError as error:
        return ('password required')
    except pikepdf.PdfError as results:
        return ('cannot open file')

filename = 'decrypted.pdf'
password = password_required(filename)
if password is not None:
    print (password)
else:
    pdf_file = create_pdf_copy(filename)
    results = extract_pdf_content(pdf_file)
    print (results)
You can try to handle the error these files produce when you open these files without a password.
import pikepdf
def open_pdf(pdf_file_path, pdf_password=''):
    try:
        # first try to open the file without a password
        return pikepdf.Pdf.open(pdf_file_path)
    except pikepdf.PasswordError:
        # retry with the supplied password; raises again if it is wrong
        return pikepdf.Pdf.open(pdf_file_path, password=pdf_password)
You can use the returned pdf_obj for your parsing work.
Also, you can provide the password in case you have an encrypted PDF.
For tabula-py, you can try password option with read_pdf. It depends on tabula-java's function so I'm not sure which encryption is supported though.
I got a test for a job application; my task is to read some .doc files. Does anyone know a library to do this? I started with raw Python code:
f = open('test.doc', 'r')
f.read()
but this does not return a friendly string; I need to convert it to UTF-8.
Edit: I just want to get the text from this file.
One can use the textract library.
It takes care of both "doc" and "docx" files.
import textract
text = textract.process("path/to/file.extension")
You can even use 'antiword' (sudo apt-get install antiword) directly. It prints the text of a .doc file as plain text, so redirect its output to a text file:
antiword filename.doc > filename.txt
Ultimately, textract in the backend is using antiword.
You can use python-docx2txt library to read text from Microsoft Word documents. It is an improvement over python-docx library as it can, in addition, extract text from links, headers and footers. It can even extract images.
You can install it by running: pip install docx2txt.
Let's download and read the first Microsoft document on here:
import docx2txt
my_text = docx2txt.process("test.docx")
print(my_text)
EDIT:
This does NOT work for .doc files. The only reason I am keeping this answer is that it seems there are people who find it useful for .docx files.
I was trying to do the same. I found lots of information on reading .docx but much less on .doc. Anyway, I managed to read the text using the following:
import win32com.client
word = win32com.client.Dispatch("Word.Application")
word.visible = False
wb = word.Documents.Open("myfile.doc")
doc = word.ActiveDocument
print(doc.Range().Text)
The answer from Shivam Kotwalia works perfectly. However, the object is imported as a byte type. Sometimes you may need it as a string for performing REGEX or something like that.
I recommend the following code (two lines from Shivam Kotwalia's answer) :
import textract
text = textract.process("path/to/file.extension")
text = text.decode("utf-8")
The last line will convert the object text to a string.
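As a small illustration of why the decoding step matters, the sketch below decodes a bytes value and then runs a regular expression on the result. The bytes literal stands in for whatever textract.process returns, and the helper names are made up for the example; errors="replace" is a design choice that swaps undecodable bytes for a placeholder character instead of raising.

```python
import re


def bytes_to_text(raw, encoding="utf-8"):
    """Decode bytes to str, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


def find_numbers(raw):
    """Return every run of digits found in the decoded text."""
    return re.findall(r"\d+", bytes_to_text(raw))


if __name__ == "__main__":
    sample = "Grüße, invoice 123 and 456".encode("utf-8")  # stand-in for extracted bytes
    print(bytes_to_text(sample))
    print(find_numbers(sample))
```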
I agree with Shivam's answer, except that textract doesn't exist for Windows.
And, for some reason antiword also fails to read the '.doc' files and gives an error:
'filename.doc' is not a word document. # This happens when the file wasn't generated via MS Office. Eg: Web-pages may be stored in .doc format offline.
So, I've got the following workaround to extract the text:
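from bs4 import BeautifulSoup as bs
# 'filename' is the path to the HTML-based .doc file
with open(filename) as f:
    soup = bs(f.read(), 'html.parser')
# drop style and script elements so only visible text remains
for s in soup(['style', 'script']):
    s.extract()
tmpText = soup.get_text()
text = "".join("".join(tmpText.split('\t')).split('\n')).strip()
print(text)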
This script will work with most kinds of files.
Have fun!
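The "is not a word document" case above can often be detected before choosing a tool, because a binary Word .doc is an OLE2 compound file and starts with a fixed eight-byte signature, while a .docx is a ZIP archive and a saved web page is HTML. The helper below is a hypothetical sketch that only inspects the first bytes of the file; it does not validate the rest of the contents.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # signature of OLE2 compound files
ZIP_MAGIC = b"PK\x03\x04"                       # signature of ZIP archives (.docx)


def sniff_doc_kind(path):
    """Guess the real container format of a file from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(512)
    if head.startswith(OLE2_MAGIC):
        return "ole2"   # binary .doc: antiword or textract are candidates
    if head.startswith(ZIP_MAGIC):
        return "zip"    # probably .docx saved with the wrong extension
    if b"<html" in head.lower():
        return "html"   # web page stored as .doc: parse it as HTML
    return "unknown"


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, "->", sniff_doc_kind(name))
```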
Prerequisites :
install antiword : sudo apt-get install antiword
install docx : pip install docx
from subprocess import Popen, PIPE
from docx import opendocx, getdocumenttext
def document_to_text(filename, file_path):
    # run antiword on the .doc file and capture its plain-text output
    cmd = ['antiword', file_path]
    p = Popen(cmd, stdout=PIPE)
    stdout, stderr = p.communicate()
    return stdout.decode('ascii', 'ignore')
print(document_to_text('your_file_name','your_file_path'))
Notice – New versions of python-docx removed this function. Make sure to pip install docx and not the new python-docx
I looked for a solution for a long time. There is not much material about .doc files; I finally solved this problem by converting the .doc file to .docx.
import os
from win32com import client as wc
w = wc.Dispatch('Word.Application')
# Or use the following method to start a separate process:
# w = wc.DispatchEx('Word.Application')
doc=w.Documents.Open(os.path.abspath('test.doc'))
# 16 is the wdFormatDocumentDefault file format, i.e. .docx
doc.SaveAs("test_docx.docx",16)
I had to do the same to search through a ton of *.doc files for a specific number and came up with:
special_chars = {
"b'\\t'": '\t',
"b'\\r'": '\n',
"b'\\x07'": '|',
"b'\\xc4'": 'Ä',
"b'\\xe4'": 'ä',
"b'\\xdc'": 'Ü',
"b'\\xfc'": 'ü',
"b'\\xd6'": 'Ö',
"b'\\xf6'": 'ö',
"b'\\xdf'": 'ß',
"b'\\xa7'": '§',
"b'\\xb0'": '°',
"b'\\x82'": '‚',
"b'\\x84'": '„',
"b'\\x91'": '‘',
"b'\\x93'": '“',
"b'\\x96'": '-',
"b'\\xb4'": '´'
}
def get_string(path):
    string = ''
    with open(path, 'rb') as stream:
        stream.seek(2560) # Offset - text starts after byte 2560
        current_stream = stream.read(1)
        # stop at the 0xfa marker, or at end of file if the marker never appears
        while current_stream and not (str(current_stream) == "b'\\xfa'"):
            if str(current_stream) in special_chars.keys():
                string += special_chars[str(current_stream)]
            else:
                try:
                    char = current_stream.decode('UTF-8')
                    if char.isalnum():
                        string += char
                except UnicodeDecodeError:
                    string += ''
            current_stream = stream.read(1)
    return string
I'm not sure how 'clean' this solution is, but it works well with regex.
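Most of the entries in the special_chars table above coincide with the Windows-1252 code page, so for files where the text really is stored as 8-bit Windows-1252 (an assumption that does not hold for every .doc file), the built-in codec can replace the hand-written mapping. The sketch below only shows the decoding step; the table's own choices for \r, 0x07 and 0x96 would still need to be applied separately, and the helper name is made up.

```python
def decode_cp1252(raw):
    """Decode bytes as Windows-1252, replacing the few undefined byte values."""
    return raw.decode("cp1252", errors="replace")


if __name__ == "__main__":
    sample = b"Gr\xfc\xdfe \xa7 5\xb0 \x84quote\x93"  # made-up sample bytes
    print(decode_cp1252(sample))
```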
!pip install python-docx
import docx
#Creating a word file object
doc = open("file.docx","rb")
#creating word reader object
document = docx.Document(doc)
If you are looking for how to read a .doc file in Python, install all the related packages first; the code below then downloads the file, opens it through Word, and prints its text.
# assumed imports (not shown in the original snippet)
import os
from io import BytesIO
import pythoncom
import requests
import win32com.client as win32
# doc_file and request come from the surrounding web handler, which is not shown
if doc_file:
    _file=requests.get(request.values['MediaUrl0'])
    doc_file_link=BytesIO(_file.content)
    file_path=os.path.join(os.getcwd(), 'data.doc')
    E=open(file_path,'wb')
    E.write(doc_file_link.getbuffer())
    E.close()
    word = win32.gencache.EnsureDispatch('Word.Application',pythoncom.CoInitialize())
    doc = word.Documents.Open(file_path)
    doc.Activate()
    doc_data=doc.Range().Text
    print(doc_data)
    doc.Close(False)
    if os.path.exists(file_path):
        os.remove(file_path)
I found several questions that were similar to mine, but none of the answers came close to what I need.
Specifications: I'm working with Python 3 and do not have MS Word. My programming machine is running OS X and cloud machine is linux/ubuntu too.
I'm using python-docx to extract values from a .doc file that is sent to me nightly. However, python-docx only works with .docx files, so I need to convert the file to that extension first.
So, I've got a .doc file that I need to convert to .docx. This script might have to run in the cloud so I can't install any kind of Office or Office-like software. Can this be done?
Since you are working with Linux/Ubuntu, you can use LibreOffice's built-in converter.
SYNTAX
lowriter --convert-to docx *.doc
Example
lowriter --convert-to docx testdoc.doc
This will convert all doc files to docx and save in the same folder itself.
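To call the converter from a Python script, the command shown above can be run through subprocess. This is a sketch under the assumption that lowriter is on the PATH of the machine running the script; the function names are made up, and --outdir is LibreOffice's option for choosing the output folder.

```python
import subprocess


def build_convert_command(doc_path, out_dir="."):
    """Build the lowriter command line that converts one .doc file to .docx."""
    return ["lowriter", "--convert-to", "docx", "--outdir", out_dir, doc_path]


def convert_doc_to_docx(doc_path, out_dir="."):
    """Run the conversion; raises CalledProcessError if lowriter reports failure."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)


if __name__ == "__main__":
    # Only prints the command; call convert_doc_to_docx to actually run it.
    print(" ".join(build_convert_command("testdoc.doc", "converted")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.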
You could use unoconv - Universal Office Converter. Convert between any document format supported by LibreOffice/OpenOffice.
unoconv -d document --format=docx *.doc
subprocess.call(['unoconv', '-d', 'document', '--format=docx', filename])
Aspose.Words Cloud SDK for Python can convert DOC to DOCX. The package can open, generate, edit, split, merge, compare and convert a Word document in Python on any platform without depending on MS Word.
It is a paid product, but the free plan provides 150 free monthly API calls.
P.S: I'm developer evangelist at Aspose.
# Import module
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Get your credentials from https://dashboard.aspose.cloud (free registration is required).
words_api = asposewordscloud.WordsApi(app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx',app_key='xxxxxxxxxxxxxxxxxxxxxxxxx')
words_api.api_client.configuration.host = 'https://api.aspose.cloud'
filename = 'C:/Temp/02_pages.doc'
dest_name = 'C:/Temp/02_pages.docx'
# Convert DOC to DOCX
request = asposewordscloud.models.requests.ConvertDocumentRequest(document=open(filename, 'rb'), format='docx')
result = words_api.convert_document(request)
copyfile(result, dest_name)
import os
import aspose.words as aw
path1="doc file path"
path2="path to save converted file"
file="name of the doc file"  # placeholder: file name inside path1
file2=file.rsplit('.',1)[0]+'.docx'
filename1=os.path.join(path2,file2)
filename=os.path.join(path1,file)
doc = aw.Document(filename)
doc.save(filename1)
First you will need to be using Windows. If that is an acceptable barrier then please read on....
Next you need to install the Microsoft Office Compatibility Pack.
Now download and install the Microsoft Office Migration Planning Manager.
To run the tool you need to create a .ini file that controls the program. An example .ini file and further information is available on this blog post.
There is more detailed information from Microsoft here.
I'm trying to use python-docx module (pip install python-docx)
but it seems to be very confusing: in the GitHub repo test sample they use the opendocx function, but on readthedocs they use the Document class. Even then, they only show how to add text to a docx file, not how to read an existing one.
The first one (opendocx) is not working; it may be deprecated. For the second case I was trying to use:
from docx import Document
document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
    print(p.text)
It returned all the text, but a few things were missing. None of the URLs (CTRL+CLICK to go to URL) were present in the text on the console.
What is the issue? Why are the URLs missing?
How could I get the complete text without iterating in a loop (something like open().read())?
You can try this:
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
Without Installing python-docx
A docx file is basically a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from a docx file without relying on python-docx and lxml, the latter being sometimes hard to install:
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
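The idea can be sketched with the standard library alone. The function below is not the code from the linked post; it is a minimal illustration that opens the archive, parses word/document.xml, and joins every w:t text node per paragraph. Because it collects all w:t descendants, text inside hyperlinks is included as well. It ignores headers, footers and footnotes, and paragraphs nested inside other paragraphs (for example in text boxes) would be repeated.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        # Collect every text node under the paragraph, including hyperlink runs.
        pieces = [node.text or "" for node in para.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(docx_to_text(name))
```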
There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.
The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.
Neither reading nor writing hyperlinks are supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it. UPDATE Hyperlink support was added subsequent to this answer.
Using python-docx, as @Chinmoy Panda's answer shows:
for para in doc.paragraphs:
    fullText.append(para.text)
However, para.text will lose the text in w:smarttag (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328), so you should use the following function instead:
def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])
It seems that there is no official solution for this problem, but there is a workaround posted here
https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649
Just update this file:
"...\site-packages\docx\oxml\__init__.py"
# add
import re
import sys
# add
def remove_hyperlink_tags(xml):
    if (sys.version_info > (3, 0)):
        xml = xml.decode('utf-8')
    xml = xml.replace('</w:hyperlink>', '')
    xml = re.sub('<w:hyperlink[^>]*>', '', xml)
    if (sys.version_info > (3, 0)):
        xml = xml.encode('utf-8')
    return xml
# update
def parse_xml(xml):
    """
    Return root lxml element obtained by parsing XML character string in
    *xml*, which can be either a Python 2.x string or unicode. The custom
    parser is used, so custom element classes are produced for elements in
    *xml* that have them.
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element
And of course, don't forget to mention in the documentation that you are changing the official library.