Docx content and formatting extraction in Python

I am trying to parse a docx file and take specific elements based on whether or not a certain word is bolded. If this is the text in the document:
Foo: Hello
Boo:
Blah Blah
•Blah
•Blah
Choo: Hello
I would want to scan, line by line, and take all the text after the bolded word until the next bolded word.
As of right now I am using an XML parser that parses based on newline characters. I cannot find anything in the zip file or the individual lines that would give me metadata like that.
Is it possible to do this?

I'd use a higher-level library that supports reading docx files rather than parsing the XML document.
One library that seems up to the task is python-docx.
If you're using Jython, Apache POI is another option (its XWPF component handles .docx; HWPF is for the older .doc format).
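For the bold-word splitting in particular, here is a minimal sketch with python-docx (the split_on_bold helper and its grouping logic are my own illustration, not from the answer): each paragraph exposes its runs, and each run carries character formatting such as run.bold, so you can collect the text that follows each bolded word:

from docx import Document

def split_on_bold(path):
    # Map each bolded word to the plain text that follows it,
    # up to the next bolded word.
    sections = {}
    current_key = None
    for para in Document(path).paragraphs:
        for run in para.runs:
            if run.bold:
                current_key = run.text.strip().rstrip(':')
                sections.setdefault(current_key, [])
            elif current_key is not None and run.text.strip():
                sections[current_key].append(run.text.strip())
    return sections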

Related

Html rich text to Microsoft Word

Right now I have this format:
"This is a bold word, this is in italic, this is regular"
Which translates to:
<p>This is a bold <strong>word</strong>, <em>this is in italic</em>, this is regular.</p>
Is there any Python library which turns the above code into Microsoft Word format? I couldn't find any; I found only pandoc and the subsequent pypandoc, which read HTML but can only translate it into .docx format by saving a .docx file, and this isn't helpful.
I figured I'd ask a question here before I put work into writing a parser to do this.
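A minimal sketch of such a parser, assuming python-docx is acceptable (the HtmlToDocx class and its tag handling are my own illustration): html.parser walks the markup, and each text fragment becomes a run whose bold/italic flags mirror the currently open tags:

from html.parser import HTMLParser
from docx import Document

class HtmlToDocx(HTMLParser):
    def __init__(self, paragraph):
        super().__init__()
        self.paragraph = paragraph
        self.bold = False
        self.italic = False

    def handle_starttag(self, tag, attrs):
        if tag in ('strong', 'b'):
            self.bold = True
        elif tag in ('em', 'i'):
            self.italic = True

    def handle_endtag(self, tag):
        if tag in ('strong', 'b'):
            self.bold = False
        elif tag in ('em', 'i'):
            self.italic = False

    def handle_data(self, data):
        # each text fragment becomes a run with the current formatting
        run = self.paragraph.add_run(data)
        run.bold = self.bold
        run.italic = self.italic

document = Document()
HtmlToDocx(document.add_paragraph()).feed(
    '<p>This is a bold <strong>word</strong>, <em>this is in italic</em>, this is regular.</p>')
document.save('out.docx')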

Get text from Gtk3 TextView/TextBuffer with formatting tags included in Python

I'm working on a Python 3 project that uses a Gtk3 TextView/TextBuffer to get a user's input, and I've got it to the point where the user can type rich text and format it as bold, italic, underline, or a combination of these.
However, I'm stuck on trying to figure out how to get the text from the TextBuffer with those flags included so I can use the formatting flags to convert the text to properly formatted HTML when I need to.
Calling textbuffer.get_text(start, end, True) simply returns the text without any flags.
Here's the code and the editor.glade file. Save them both in the same directory.
How can I get the text with the flags included? Or, alternatively, is there a way to automatically get the user's input formatted as HTML in another variable?
That's not very easy. Here is a link to some code that I once wrote to do the same thing for RTF output. You can probably adapt it to produce HTML output. If you manage to do so, I'd possibly integrate it into that library's successor.
Alternatively, if you prefer text processing to the above, you can export the rich text in GtkTextBuffer's internal serialization format and convert it to HTML yourself later:
# register the buffer's internal rich-text format, then serialize
# the [start, end) range to bytes in that format
format = textbuffer.register_serialize_tagset('my-tagset')
exported = textbuffer.serialize(textbuffer, format, start, end)

How to extract text from an existing docx file using python-docx

I'm trying to use the python-docx module (pip install python-docx), but it seems very confusing: the test sample in the GitHub repo uses the opendocx function, while the readthedocs documentation uses the Document class. And both only show how to add text to a docx file, not how to read an existing one.
The first one (opendocx) is not working; maybe it's deprecated. For the second case I was trying to use:
from docx import Document
document = Document('test_doc.docx')
print(document.paragraphs)
It returned a list of <docx.text.Paragraph object at 0x... >
Then I did:
for p in document.paragraphs:
    print(p.text)
It returned all the text, but a few things were missing: none of the URLs (CTRL+click to follow) showed up in the console output.
Why are the URLs missing?
And how can I get the complete text without iterating in a loop (something like open().read())?
You can try this:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
You can use python-docx2txt which is adapted from python-docx but can also extract text from links, headers and footers. It can also extract images.
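For example (a minimal sketch; the file and directory names are placeholders):

import docx2txt

# extract all text, including hyperlink text, headers and footers
text = docx2txt.process("file.docx")

# optionally also dump the embedded images into a directory
text = docx2txt.process("file.docx", "img_dir/")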
Without Installing python-docx
A docx file is basically a zip file with several folders and files within it. In the link below you can find a simple function to extract the text from a docx file without relying on python-docx and lxml, the latter being sometimes hard to install:
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
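In that spirit, a rough sketch of the zip-plus-XML approach (my own approximation, not the linked code), using only the standard library:

import zipfile
import xml.etree.ElementTree as ET

W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def get_docx_text(path):
    # word/document.xml holds the body; w:p are paragraphs, w:t text runs
    with zipfile.ZipFile(path) as z:
        xml_content = z.read('word/document.xml')
    tree = ET.fromstring(xml_content)
    paragraphs = []
    for para in tree.iter(W_NS + 'p'):
        texts = [node.text for node in para.iter(W_NS + 't') if node.text]
        if texts:
            paragraphs.append(''.join(texts))
    return '\n'.join(paragraphs)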
There are two "generations" of python-docx. The initial generation ended with the 0.2.x versions and the "new" generation started at v0.3.0. The new generation is a ground-up, object-oriented rewrite of the legacy version. It has a distinct repository located here.
The opendocx() function is part of the legacy API. The documentation is for the new version. The legacy version has no documentation to speak of.
Neither reading nor writing hyperlinks is supported in the current version. That capability is on the roadmap, and the project is under active development. It turns out to be quite a broad API because Word has so much functionality. So we'll get to it, but probably not in the next month unless someone decides to focus on that aspect and contribute it. UPDATE: Hyperlink support was added subsequent to this answer.
Using python-docx, as @Chinmoy Panda's answer shows:
for para in doc.paragraphs:
    fullText.append(para.text)
However, para.text will lose any text inside w:smarttag elements (the corresponding GitHub issue is here: https://github.com/python-openxml/python-docx/issues/328); you should use the following function instead:
def para2text(p):
    rs = p._element.xpath('.//w:t')
    return u" ".join([r.text for r in rs])
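For example, to dump a whole document with that helper (the file name is a placeholder):

from docx import Document
document = Document('test_doc.docx')
print('\n'.join(para2text(p) for p in document.paragraphs))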
It seems that there is no official solution for this problem, but there is a workaround posted here
https://github.com/savoirfairelinux/python-docx/commit/afd9fef6b2636c196761e5ed34eb05908e582649
just update this file:
...\site-packages\docx\oxml\__init__.py
# add
import re
import sys

# add
def remove_hyperlink_tags(xml):
    if sys.version_info > (3, 0):
        xml = xml.decode('utf-8')
    xml = xml.replace('</w:hyperlink>', '')
    xml = re.sub('<w:hyperlink[^>]*>', '', xml)
    if sys.version_info > (3, 0):
        xml = xml.encode('utf-8')
    return xml

# update
def parse_xml(xml):
    """
    Return root lxml element obtained by parsing XML character string in
    *xml*, which can be either a Python 2.x string or unicode. The custom
    parser is used, so custom element classes are produced for elements in
    *xml* that have them.
    """
    root_element = etree.fromstring(remove_hyperlink_tags(xml), oxml_parser)
    return root_element
And of course, don't forget to note in your documentation that you are changing the official library.

How can I say a file is SVG without using a magic number?

An SVG file is basically an XML file, so I could use the string <?xml (or its hex representation: 3C 3F 78 6D 6C) as a magic number, but there are a few reasons not to do that; for example, extra whitespace could break this check.
The other images I need/expect to check are all binary and have magic numbers. How can I quickly check whether a file is in SVG format without relying on the extension, ideally using Python?
XML is not required to start with the <?xml preamble, so testing for that prefix is not a good detection technique — not to mention that it would identify every XML as SVG. A decent detection, and really easy to implement, is to use a real XML parser to test that the file is well-formed XML that contains the svg top-level element:
import xml.etree.cElementTree as et

def is_svg(filename):
    tag = None
    with open(filename, "r") as f:
        try:
            for event, el in et.iterparse(f, ('start',)):
                tag = el.tag
                break
        except et.ParseError:
            pass
    return tag == '{http://www.w3.org/2000/svg}svg'
Using cElementTree ensures that the detection is efficient through the use of expat; timeit shows that an SVG file was detected as such in ~200μs, and a non-SVG in 35μs. The iterparse API enables the parser to forego creating the whole element tree (module name notwithstanding) and only read the initial portion of the document, regardless of total file size.
You could try reading the beginning of the file as binary; if you can't find any magic numbers, read it as a text file and match any textual patterns you wish. Or vice versa.
This is from man file (here), for the unix file command:
The magic tests are used to check for files with data in particular fixed formats. The canonical example of this is a binary executable ... These files have a “magic number” stored in a particular place near the beginning of the file that tells the UNIX operating system that the file is a binary executable, and which of several types thereof. The concept of a “magic” has been applied by extension to data files. Any file with some invariant identifier at a small fixed offset into the file can usually be described in this way. ...
(my emphasis)
And here's one example of the "magic" that the file command uses to identify an svg file (see source for more):
...
0 string \<?xml\ version=
>14 regex ['"\ \t]*[0-9.]+['"\ \t]*
>>19 search/4096 \<svg SVG Scalable Vector Graphics image
...
0 string \<svg SVG Scalable Vector Graphics image
...
As described by man magic, each line follows the format <offset> <type> <test> <message>.
If I understand correctly, the code above looks for the literal "<?xml version=". If that is found, it looks for a version number, as described by the regular expression. If that is found, it searches the next 4096 bytes until it finds the literal "<svg". If any of this fails, it looks for the literal "<svg" at the start of the file, and so on.
Something similar could be implemented in Python.
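For example, a rough sketch of those checks (my own approximation of the magic rules above, not a full implementation):

import re

def looks_like_svg(path):
    with open(path, 'rb') as f:
        head = f.read(4096)
    if head.startswith(b'<svg'):
        return True
    # '<?xml version=' is 14 bytes; the regex mirrors the version test above
    if head.startswith(b'<?xml version=') and re.match(rb'[\'" \t]*[0-9.]+', head[14:]):
        return b'<svg' in head
    return False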
Note there's also python-magic, which provides an interface to libmagic, as used by the unix file command.

pyPdf unable to extract text from some pages in my PDF

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:
http://www.4shared.com/document/kmJF67E4/forms.html
If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?
from pyPdf import PdfFileReader

input1 = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
    print page.extractText()
Note that extractText() still has problems extracting the text properly. From the documentation for extractText():
This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Since it is the text you want, you can use the Linux command pdftotext.
To invoke that using Python, you can do this:
>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
The text is extracted from forms.pdf and saved to output.
This works in the case of your PDF file and extracts the text you want.
This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and even sometimes when there's no non-ASCII characters. When pyPdf encounters strings that are not in standard Unicode encoding, it just sees a bunch of byte code; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time consuming, but I hope to send a patch to the maintainer some time around mid-2011.
You could also try the pdfminer library (also in python), and see if it's better at extracting the text. For splitting however, you will have to stick with pyPdf as pdfminer doesn't support that.
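If you do try pdfminer, note that the actively maintained pdfminer.six fork has a one-call high-level API; a minimal sketch:

from pdfminer.high_level import extract_text

text = extract_text('forms.pdf')  # the whole document as a single string
print(text)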
I find it sometimes useful to convert the PDF to PostScript (try both pdf2ps and pdftops, since their output can differ) and then back to PDF (ps2pdf). Then try your original script again.
I had a similar problem with some PDFs; on Windows, this works excellently for me:
1. Download the Xpdf tools for Windows
2. Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64
3. Use subprocess to run the command from the console:
import subprocess

try:
    extInfo = subprocess.check_output('pdftotext.exe ' + filePath + ' -', shell=True, stderr=subprocess.STDOUT).strip()
except Exception as e:
    print(e)
I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp. 1-82, which have text page labels (pdftotext can extract those), and pp. 83-end, which have no page labels but which pyPdf can extract, and it explicitly knows about pages.
I think I need to combine the two. Clunky, but I don't see any way around it. Sadly, I'm having to do this on a Windows machine.
