How to extract text from an existing docx file using python-docx

How to extract text from an existing docx file using python-docx - python

I'm trying to use python-docx module (pip install python-docx)
but it seems to be very confusing as in github repo test sample they are using opendocx function but in readthedocs they are using Document class. Even though they are only showing how to add text to a docx file, not reading existing one?
1st one (opendocx) is not working, may be deprecated. For second case I was trying to use:
from docx import Document
document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
print(p.text)
It returned all text but there were few thing missing. All URLs (CTRL+CLICK to go to URL) were not present in text on console.
What is the issue? Why URLs are missing?
How could I get complete text without iterating over loop (something like open().read())

you can try this
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return '\n'.join(fullText)

You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.

Without Installing python-docx
docx is basically is a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from docx file, without the need to rely on python-docx and lxml the latter being sometimes hard to install:
http://etienned.github.io/posts/extract-text-from-word-docx-simply/

There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.
The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.
Neither reading nor writing hyperlinks are supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it. UPDATE Hyperlink support was added subsequent to this answer.

Using python-docx, as #Chinmoy Panda 's answer shows:
for para in doc.paragraphs:
fullText.append(para.text)
However, para.text will lost the text in w:smarttag (Corresponding github issue is here: https://github.com/python-openxml/python-docx/issues/328), you should use the following function instead:
def para2text(p):
rs = p._element.xpath('.//w:t')
return u" ".join([r.text for r in rs])

It seems that there is no official solution for this problem, but there is a workaround posted here
https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649
just update this file
"...\site-packages\docx\oxml_init_.py"
# add
import re
import sys
# add
def remove_hyperlink_tags(xml):
if (sys.version_info > (3, 0)):
xml = xml.decode('utf-8')
xml = xml.replace('</w:hyperlink>', '')
xml = re.sub('<w:hyperlink[^>]*>', '', xml)
if (sys.version_info > (3, 0)):
xml = xml.encode('utf-8')
return xml
# update
def parse_xml(xml):
"""
Return root lxml element obtained by parsing XML character string in
*xml*, which can be either a Python 2.x string or unicode. The custom
parser is used, so custom element classes are produced for elements in
*xml* that have them.
"""
root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
return root_element
and of course don't forget to mention in the documentation that use are changing the official library

Related

How to upgrade the docx format to the latest by docx？

👉The original file is docx format, which has multiple tables, but there may be format problems, so it cannot be read by python-docx.
✔️ 1.Solution by hand:
solve the question by click [save as ....] menu. A prompt box appears:
prompt box : appears upgrade to newest
❓2. Question:
How to implement [save as] function through Python-docx, upgrade the docx format to the latest?
😃Thanks for any suggestion！
3. appendix
from docx import Document
from win32com import client as wc
file = 'D:\\1.docx'
word = wc.Dispatch("Word.Application")
word.Visible = False
doc = word.Documents.Open(file)
doc.SaveAs("{}".format(file), 12)
doc.Close()
word.Quit()

With a compromise method,
we first created a blank DOCX,
then using Win32 libraries to copy the content as a whole to the blank DOCX,
testing available
Still looking forward to optimization methods

How to access remote and encrypted PDF text without writing to local drive [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 1 year ago.
Improve this question
I am very new to the coding world and have been stuck on this one problem for 3 days now, searching everywhere for an answer, so any help will be greatly appreciated. I am needing to extract a small amount of text from a url-located Pdf file. I'm using sessions.get(chart_PDF) as the driver for locating the URL where chart_PDF is the example url below.
Example url is https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
I know I am able to write it to my local drive but I don't want to do that, I want to be able to do it remotely, since I only need a couple of numbers from it.
I have tried finding the password from the url page for decrypting, couldn't find. I've tried to use PyPDF2, pdfminer and pikepdf (probably not well).
I only need to retrieve two numbers near the bottom of the PDF that can be used for the rest of my code. Please help, even if it is a simple fix, I'm new to all this and need some help. Thanks.
from io import BytesIO
from pikepdf import Pdf as PDF
from pdfminer import high_level
chart_PDF = https://www.airservicesaustralia.com/aip/pending/dap/PADGN01-166_09SEP2021.pdf
retrieve = s.get(chart_PDF)
content = retrieve.content
response =urllib.request.urlopen(chart_PDF)
p = BytesIO(content)
p.getbuffer()
check = PDFPage.get_pages(p, check_extractable=False)
extract = high_level.extract_text(p)
I'm getting:
PDFTextExtractionNotAllowedWarning: The PDF <_io.BytesIO object at 0x000001B007ABEC20> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding.warnings.warn(warning_msg, PDFTextExtractionNotAllowedWarning)
Alternately, if I try this:
from pikepdf import Pdf as PDF
from pdfminer.pdfpage import PDFPage
from PyPDF2 import PdfFileReader
new_pdf = PDF.new()
with PDF.open(p) as pdf:
print(len(pdf.pages))
page1 = pdf.pages[0]
if PdfFileReader.getIsEncrypted(pdf):
print(True)
PdfFileReader.decrypt(page1, password='')
pdf.close()
I get:
line 1987, in decrypt
return self._decrypt(password)
AttributeError: _decrypt
UPDATE 3/8/21
Thank you so much K J! You've seriously been a huge help!
from io import BytesIO
from pdfminer.pdfpage import PDFPage
from pdfminer import high_level
retrieve = s.get(chart_PDF)
content = retrieve.content
bytes = BytesIO(content)
bytes.getbuffer()
PDFPage.get_pages(bytes, check_extractable=False)
extract = high_level.extract_text(bytes, password='') #THIS LINE THROWS ERROR
joined = ''.join(extract)
find_txt = re.findall(r'[(]\d*[-]\d[.]\d[)]', joined)
print(find_txt)
bytes.close()
This is now working well and I have been able to pull the numbers that I need (I have basically pulled all numbers from inside brackets off the PDF). I'll sort through that to find which one I need.
Strangely enough, although its giving me what I need, my extract = high_level.extract_text(bytes, password='') line still throws the Warning: (warning_msg, PDFTextExtractionNotAllowedWarning) which is rather annoying. Not sure how this process works but its still letting the info out.
I can't use try except or it skips over it. What is the way around this? how can I stop that error coming up?
FINAL UPDATE
I got around the warning and it works well now.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
extract = high_level.extract_text(bytes)
Cheers fellas for putting up with my ignorance, you've helped so much.

The whole file has to be downloaded to a device via RAM so the blob as a FILE can be parsed at the very END for one OR more %%EOF and the location of page 0 (it gets converted to 1 or i) it could be ANYWHERE IN THE STREAM,.
THEN you can navigate to other sequential numbered pages in the RANDOM order they are built. Any complaints please contact Adobe.
However it is easiest if it is cached as a physical FILE object. If you dont want that on disk use a ram drive for your browser.
Again those two objects at bottom of page one could be anywhere mixed into the content of "page" 99's objects, or otherwise. each letter in a PDF can in its extreme be more than one object anywhere in the file. but a good authoring editor would try to keep them as lines by lines. (there is no such PDF thing as a word or paragraph.)
We can Print the file as Plain Text to see how it is composited and although (secured) that is allowed.
I tried printing from browser with little success but know that can depend on browser system and OS print drivers. Here I have printed the page as text using Acrobat portable, so we can see the sequential offsets of each text block from Left Hand margin JUST LIKE a PDF VIEWER would need to rebuild them.
UPDATE
You said your target is (1380-4.4) to the RIGHT of ALTERNATE but again A PDF has no concept of Left and Right or BEFORE or AFTER so we find IN THIS FILE the variable target is in 2 separate pieces PRIOR to the KNOWN characters which luckily is a complete single block (alternate). Thus here proximity of plain text could well work if the capture is confined to that nearby locality. However there is no guarantee that ALTERNATE would always be a single block.
It was perhaps not a good Idea To show the way a Printer would be given a stream of sequential data
Here is the way one PDF viewer goes about decrypting the file
As stated on this occasion the word ALTERNATE is defined as text however the next item is the "3" under "B" which is text as a vector path it is not called a "character" although it looks like one but a numbered glyph from a font table. We do see later that some of those numbers are stored as "text" and for your target it is mixed in with similar text in the same object.
Thus you need to call a PDF interpreter to give you a meaningful translation of all bits and pieces of objects so that you can extract the "right" text.
The easiest way for a "simple" one line target in a complex file is to use MuPDF to first tidy up the file
mutool clean -gggg -D infile.pdf outfile.pdf
combined with
PDFTOTXT -layout outfile.pdf outfile.txt
or similar to hopefully export that text on a line by line basis, such that you can consistently find your target instantly before ! or after ALTERNATE.
N.B Mutool convert to HTML would place the target value in a table entry AFTER the key word, and if the lines are consistent in number that would be a simpler way to find or grep.

element tree treats similar files differently

Here are two different files that my python (2.6) script encounters. One will parse, the other will not. I'm just curious as to why this happens.
This xml file will not parse and the script will fail:
<Landfire_Feedback_Point_xlsform id="fbfm40v10" instanceID="uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab" submissionDate="2013-04-30T23:03:32.881Z" isComplete="true" markedAsCompleteDate="2013-04-30T23:03:32.881Z" xmlns="http://opendatakit.org/submissions">
<date_test>2013-04-17</date_test>
<plot_number>10</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.2452830500 -118.2149402900 210.3000030518 3.0000000000</geopoint_plot><fbfm40_new>GS2</fbfm40_new>
<select_grazing>NONE</select_grazing>
<image_close>1366230030355.jpg</image_close>
<plot_note>No road present.</plot_note>
<n0:meta xmlns:n0="http://openrosa.org/xforms">
<n0:instanceID>uuid:9e062da6-b97b-4d40-b354-6eadf18a98ab</n0:instanceID>
</n0:meta>
</Landfire_Feedback_Point_xlsform>
This xml file will parse correctly and the script succeeds:
<Landfire_Feedback_Point_xlsform id="fbfm40v10">
<date_test>2013-05-14</date_test>
<plot_number>010</plot_number>
<select_multiple_names>BillyBob</select_multiple_names>
<geopoint_plot>43.26630563 -118.39881809 351.70001220703125 5.0</geopoint_plot>
<fbfm40_new>GR1</fbfm40_new>
<select_grazing>HIGH</select_grazing>
<image_close>fbfm40v10_PLOT_010_ID_6.jpg</image_close>
<plot_note>Heavy grazing</plot_note>
<meta><instanceID>uuid:90e7d603-86c0-46fc-808f-ea0baabdc082</instanceID></meta>
</Landfire_Feedback_Point_xlsform>
Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't. Specifically, the one that doesn't seem to parse fails with a "'NONE' type doesn't have a 'text' attribute" or something similar. But, it's because it doesn't seem to consider the file as xml or it can't see any elements beyond the opening line. Any explanation or direction with regard to this error would be appreciated. Thanks in advance.
Python script:
import os
from xml.etree import ElementTree
def replace_xml_attribute_in_file(original_file,element_name,attribute_value):
#THIS FUNCTION ONLY WORKS ON XML FILES WITH UNIQUE ELEMENT NAMES
# -DUPLICATE ELEMENT NAMES WILL ONLY GET THE FIRST ELEMENT WITH A GIVEN NAME
#split original filename and add tempfile name
tempfilename="temp.xml"
rootsplit = original_file.rsplit('\\') #split the root directory on the backslash
rootjoin = '\\'.join(rootsplit[:-1]) #rejoin the root diretory parts with a backslash -minus the last
temp_file = os.path.join(rootjoin,tempfilename)
et = ElementTree.parse(original_file)
author=et.find(element_name)
author.text = attribute_value
et.write(temp_file)
if os.path.exists(temp_file) and os.path.exists(original_file): #if both the original and the temp files exist
os.remove(original_file) #erase the original
os.rename(temp_file,original_file) #rename the new file
else:
print "Something went wrong."
replace_xml_attribute_in_file("testfile1.xml","image_close","whoopdeedoo.jpg");

Here is a little python script that demonstrates that one will work, while the other will not. I'm just looking for an explanation as to why one is seen by ElementTree as an xml file while the other isn't.
Your code doesn't demonstrate that at all. It demonstrates that they're both seen by ElementTree as valid XML files chock full of nodes. They both parse just fine, they both read past the first line, etc.
The only problem is that the first one doesn't have a node named 'image_close', so your code doesn't work.
You can see that pretty easily:
for node in et.getroot().getchildren():
print node.tag
You get 9 children of the root, with either version.
And the output to that should show you the problem. The node you want is actually named {http://opendatakit.org/submissions}image_close in the first example, rather than image_close as in the second.
And, as you can probably guess, this is because of the namespace=http://opendatakit.org/submissions in the root node. ElementTree uses the "James Clark notation" for mapping unknown-namespaced names to universal names.
Anyway, because none of the nodes are named image_close, the et.find(element_name) returns None, so your code stores author=None, then tries to assign to author.text, and gets an error.
As for how to fix this problem… well, you could learn how namespaces work by default in ElementTree, or you could upgrade to Python 2.7 or install a newer ElementTree for 2.6 that lets you customize things more easily. But if you want to do custom namespace handling and also stick with your old version… I'd start with this article (and its two predecessors) and this one.

Docx content and formatting extraction in python

I am trying to parse a docx folder and take specific elements base on wether or not a certain word is bolded. If this is the text in the document:
Foo: Hello
Boo:
Blah Blah
•Blah
•Blah
Choo: Hello
I would want to scan, line by line, and take all the text after the bolded word until the next bolded word.
As of right now I am using using an XML parser that parses based on newline charactrs. I cannot find anything in the Zipfile or the individual lines that would give me metadata like that.
Is it possible to do this?

I'd use a higher-level library that supports reading docx files rather than parsing the XML document.
One library that looks up to the task is python-docx.
If you're using Jython, Apache POI HWPF is another option.

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader
input = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
print page.extractText()

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
This works well for some PDF files,
but poorly for others, depending on
the generator used. This will be
refined in the future. Do not rely on
the order of text coming out of this
function, as it will change if this
function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.

This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and even sometimes when there's no non-ASCII characters. When pyPdf encounters strings that are not in standard Unicode encoding, it just sees a bunch of byte code; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.

You could also try the pdfminer library (also in python), and see if it's better at extracting the text. For splitting however, you will have to stick with pyPdf as pdfminer doesn't support that.

I find it sometimes useful to convert it to ps (try with pdf2psand pdftops for potential differences) then back to pdf (ps2pdf). Then try your original script again.

I had similar problem with some pdfs and for windows, this is working excellent for me:
1.- Download Xpdf tools for windows
2.- copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64
3.- use subprocess to run command from console:
import subprocess
try:
extInfo = subprocess.check_output('pdftotext.exe '+filePath + ' -',shell=True,stderr=subprocess.STDOUT).strip()
except Exception as e:
print (e)

I'm starting to think I should adopt a messy two-part solution. there are two sections to the PDF, pp 1-82 which have text page labels (pdftotext can extract), and pp 83-end which have no page labels but pyPDF can extract and it explicitly knows pages.
I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.