Extracting text from highlighted annotations in a PDF file

Since yesterday I have been trying to extract the text from some highlighted annotations in a PDF, using python-poppler-qt4.
According to the documentation, it looks like I have to get the text with the Page.text() method, passing it a rectangle from the highlight annotation, which I get from Annotation.boundary(). But I only get blank text. Can someone help me? My code is below, followed by a link to the PDF I am using. Thanks for any help!
import popplerqt4
import sys
import PyQt4

def main():
    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        page = doc.page(i)
        annotations = page.annotations()
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        print str(page.text(annotation.boundary()))
    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()
Test pdf:
https://www.dropbox.com/s/10plnj67k9xd1ot/test.pdf

Looking at the documentation for Annotation, the boundary property "returns this annotation's boundary rectangle in normalized coordinates". This seems a strange decision, but we can simply scale the coordinates by page.pageSize().width() and page.pageSize().height().
import popplerqt4
import sys
import PyQt4

def main():
    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        #print("========= PAGE {} =========".format(i+1))
        page = doc.page(i)
        annotations = page.annotations()
        (pwidth, pheight) = (page.pageSize().width(), page.pageSize().height())
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    total_annotations += 1
                    if(isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation)):
                        quads = annotation.highlightQuads()
                        txt = ""
                        for quad in quads:
                            # Scale the normalized quad corners to page coordinates
                            rect = (quad.points[0].x() * pwidth,
                                    quad.points[0].y() * pheight,
                                    quad.points[2].x() * pwidth,
                                    quad.points[2].y() * pheight)
                            bdy = PyQt4.QtCore.QRectF()
                            bdy.setCoords(*rect)
                            txt = txt + unicode(page.text(bdy)) + ' '
                        #print("========= ANNOTATION =========")
                        print(unicode(txt))
    if total_annotations > 0:
        print str(total_annotations) + " annotation(s) found"
    else:
        print "no annotations found"

if __name__ == "__main__":
    main()
Additionally, I decided to concatenate the text from each of the .highlightQuads() regions to get a better representation of what was actually highlighted.
Note the explicit space I append after the text of each quad region.
In the example document the returned QString could not be passed directly to print() or str(); the solution was to use unicode() instead.
I hope this helps someone as it helped me.
Note: Page rotation may affect the scaling values. I have not been able to test this.
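The scaling step on its own can be sketched without Poppler. This is only an illustration: denormalize_rect is a hypothetical helper name, and the 612 x 792 page size is an assumed example value (US Letter in points), not something taken from the test PDF.

```python
def denormalize_rect(left, top, right, bottom, page_width, page_height):
    """Scale a rectangle from normalized [0, 1] coordinates to page units."""
    return (left * page_width, top * page_height,
            right * page_width, bottom * page_height)

# Assumed example: a highlight quad on a 612 x 792 point page.
rect = denormalize_rect(0.25, 0.5, 0.75, 0.625, 612.0, 792.0)
print(rect)  # (153.0, 396.0, 459.0, 495.0)
```

The four values are in the order that QRectF.setCoords() expects in the answer's code (x1, y1, x2, y2).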

Related

underline text with odfpy

I'd like to generate an ODF file with odfpy, and I am stuck on underlining text.
Here is a minimal example inspired by the official documentation, where I can't find any information about which attributes can be used and where.
Any suggestions?
from odf.opendocument import OpenDocumentText
from odf.style import Style, TextProperties
from odf.text import H, P, Span
textdoc = OpenDocumentText()
ustyle = Style(name="Underline", family="text")
#uprop = TextProperties(fontweight="bold") #uncommented, this works well
#uprop = TextProperties(attributes={"fontsize":"26pt"}) #this either
uprop = TextProperties(attributes={"underline":"solid"}) # bad guess, wont work !!
ustyle.addElement(uprop)
textdoc.automaticstyles.addElement(ustyle)
p = P(text="Hello world. ")
underlinedpart = Span(stylename=ustyle, text="This part would like to be underlined. ")
p.addElement(underlinedpart)
p.addText("This is after the style test.")
textdoc.text.addElement(p)
textdoc.save("myfirstdocument.odt")

Here is how I finally got it:
I created a sample document with underlining in LibreOffice and unzipped it. Looking in the styles.xml part of the extracted files, I found the part that produces underlining in the document:
<style:style style:name="Internet_20_link" style:display-name="Internet link" style:family="text">
<style:text-properties fo:color="#000080" fo:language="zxx" fo:country="none" style:text-underline-style="solid" style:text-underline-width="auto" style:text-underline-color="font-color" style:language-asian="zxx" style:country-asian="none" style:language-complex="zxx" style:country-complex="none"/>
</style:style>
The interesting style attributes are named text-underline-style, text-underline-width and text-underline-color.
To use them in odfpy, the '-' characters must be removed, and the attribute keys must be passed as strings (with quotes), as in the following code. The correct style family (text in our case) must be specified in the Style constructor call.
from odf.opendocument import OpenDocumentText
from odf.style import Style, TextProperties
from odf.text import H, P, Span
textdoc = OpenDocumentText()
#underline style
ustyle = Style(name="Underline", family="text") #here style family
uprop = TextProperties(attributes={
    "textunderlinestyle":"solid",
    "textunderlinewidth":"auto",
    "textunderlinecolor":"font-color"
})
ustyle.addElement(uprop)
textdoc.automaticstyles.addElement(ustyle)
p = P(text="Hello world. ")
underlinedpart = Span(stylename=ustyle, text="This part would like to be underlined. ")
p.addElement(underlinedpart)
p.addText("This is after the style test.")
textdoc.text.addElement(p)
textdoc.save("myfirstdocument.odt")
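The renaming rule described above can be sketched as a small helper. odfpy_attribute_key is a hypothetical name, not part of odfpy, and dropping the namespace prefix is an assumption based only on the underline attributes shown here; it may not hold for every attribute.

```python
def odfpy_attribute_key(xml_name):
    """Turn an XML attribute name such as 'style:text-underline-style'
    into the key form used above ('textunderlinestyle')."""
    # Assumption: the namespace prefix before ':' is dropped.
    local_name = xml_name.split(":", 1)[-1]
    # Rule from the answer: remove the '-' characters.
    return local_name.replace("-", "")

xml_names = [
    "style:text-underline-style",
    "style:text-underline-width",
    "style:text-underline-color",
]
keys = [odfpy_attribute_key(name) for name in xml_names]
print(keys)
```

The resulting strings match the keys passed to TextProperties(attributes={...}) in the code above.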

R or Python: relating extracted data AND images from pdf

I am trying to extract data and images from a PDF and pass them to a database. I have tried several libraries/packages in R and Python, but I am still facing the problem that I cannot relate an extracted image to the data that describes it.
I attached an image of a pdf file as a sample to illustrate the problem.
My need is to finally have a dataframe as follows:
NUMBER    ORDER   IMAGE
09090087  345679  345679.jpg
09090087  535278  535278.jpg
And the files 345679.jpg, which is a cat, and 535278.jpg, which is a dog, extracted to some folder...
At the moment I have managed to extract images but I can not figure out how to relate the image with text labels.
My code:
from __future__ import print_function
import fitz
import sys, time, re

checkXO = r"/Type(?= */XObject)"
checkIM = r"/Subtype(?= */Image)"
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()
for i in range(1, lenXREF):
    text = doc._getObjectString(i)
    isXObject = re.search(checkXO, text)
    isImage = re.search(checkIM, text)
    if not isXObject or not isImage:
        continue
    imgcount += 1
    pix = fitz.Pixmap(doc, i)
    if pix.n < 5:
        pix.writePNG("pdfimg/img-%s.png" % (i,))
    else:
        pix0 = fitz.Pixmap(fitz.csRGB, pix)
        pix0.writePNG("pdfimg/img-%s.png" % (i,))
        pix0 = None
    pix = None
ANY ideas?

Pandoc: Markdown to Tex -- with using filter show error "failed to parse field blocks"

I am trying to adapt this pandoc filter, but I need to use Span instead of Div.
input file (myfile.md):
### MY HEADER
[File > Open]{.menu}
[\ctrl + C]{.keys}
Simply line
filter file (myfilter.py):
#!/usr/bin/env python
from pandocfilters import *

def latex(x):
    return RawBlock('latex', x)

def latex_menukeys(key, value, format, meta):
    if key == 'Span':
        [[ident, classes, kvs], contents] = value
        if classes[0] == "menu":
            return([latex('\\menu{')] + contents + [latex('}')])
        elif classes[0] == "keys":
            return([latex('\\keys{')] + contents + [latex('}')])

if __name__ == "__main__":
    toJSONFilter(latex_menukeys)
run:
pandoc myfile.md -o myfile.tex -F myfilter.py
pandoc:Error in $.blocks[1].c[0]: failed to parse field blocks: failed to parse field c: mempty
CallStack <fromHasCallStack>:
error, called at pandoc.hs:144:42 in main:Main
How should I use the variable "contents" correctly?

Suppose Span is inside a paragraph. Then you would be trying to replace it with a RawBlock, which is not going to work. Maybe try using RawInline instead?
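A minimal sketch of that suggestion, written with plain dictionaries so it runs without pandocfilters installed. raw_inline, COMMANDS and span_to_latex are hypothetical names; the dictionary shape is the JSON form pandoc uses for a RawInline element, which is what pandocfilters.RawInline('latex', x) builds. Checking membership in classes instead of classes[0] is an extra change, made so that a Span without classes does not raise an IndexError.

```python
import json

def raw_inline(text):
    """Build the JSON form of a pandoc RawInline element holding LaTeX."""
    return {"t": "RawInline", "c": ["latex", text]}

# Hypothetical mapping from Span class to the opening LaTeX command.
COMMANDS = {"menu": "\\menu{", "keys": "\\keys{"}

def span_to_latex(key, value):
    """Return the replacement inlines for a matching Span, else None."""
    if key != "Span":
        return None
    (ident, classes, kvs), contents = value
    for cls in classes:
        if cls in COMMANDS:
            # Inline elements before and after the Span's own inlines.
            return [raw_inline(COMMANDS[cls])] + contents + [raw_inline("}")]
    return None

# A Span with class "menu" wrapping the single word "File".
span = [["", ["menu"], []], [{"t": "Str", "c": "File"}]]
result = span_to_latex("Span", span)
print(json.dumps(result))
```

In the actual filter the same logic would sit inside latex_menukeys(key, value, format, meta), with RawInline from pandocfilters replacing the raw_inline helper.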

How to color specific part of the text with some separator

I'm trying to color a specific part of the text. I have tried this:
if word.strip().startswith(":"):
    self.setAttributesForRange(NSColor.greenColor(), None, highlightOffset, len(word))
When someone types the : sign, the text gets colored green. That is good, but it keeps coloring the words after it, like this:
:Hello Hello :Hello <---- all of this gets colored green, but I want something like:
:Hello Hello :Hello <---- where everything gets colored except the middle "Hello", because it doesn't start with the : sign. Please help me out.
from Foundation import *
from AppKit import *
import objc

class PyObjC_HighlightAppDelegate(NSObject):
    # The connection to our NSTextView in the UI
    highlightedText = objc.IBOutlet()
    # Default font size to use when highlighting
    fontSize = 12

    def applicationDidFinishLaunching_(self, sender):
        NSLog("Application did finish launching.")

    def textDidChange_(self, notification):
        """
        Delegate method called by the NSTextView whenever the contents of the
        text view have changed. This is called after the text has changed and
        been committed to the view. See the Cocoa reference documents:
        http://developer.apple.com/documentation/Cocoa/Reference/ApplicationKit/Classes/NSText_Class/Reference/Reference.html
        http://developer.apple.com/documentation/Cocoa/Reference/ApplicationKit/Classes/NSTextView_Class/Reference/Reference.html
        Specifically the sections on Delegate Methods for information on additional
        delegate methods relating to text control in NSTextView objects.
        """
        # Retrieve the current contents of the document and start highlighting
        content = self.highlightedText.string()
        self.highlightText(content)

    def setAttributesForRange(self, color, font, rangeStart, rangeLength):
        """
        Set the visual attributes for a range of characters in the NSTextView. If
        values for the color and font are None, defaults will be used.
        The rangeStart is an index into the contents of the NSTextView, and
        rangeLength is used in combination with this index to create an NSRange
        structure, which is passed to the NSTextView methods for setting
        text attributes. If either of these values are None, defaults will
        be provided.
        The "font" parameter is used as a key for the "fontMap", which contains
        the associated NSFont objects for each font style.
        """
        fontMap = {
            "normal" : NSFont.systemFontOfSize_(self.fontSize),
            "bold" : NSFont.boldSystemFontOfSize_(self.fontSize)
        }
        # Setup sane defaults for the color, font and range if no values
        # are provided
        if color is None:
            color = NSColor.blackColor()
        if font is None:
            font = "normal"
        if font not in fontMap:
            font = "normal"
        displayFont = fontMap[font]
        if rangeStart is None:
            rangeStart = 0
        if rangeLength is None:
            rangeLength = len(self.highlightedText.string()) - rangeStart
        # Set the attributes for the specified character range
        range = NSRange(rangeStart, rangeLength)
        self.highlightedText.setTextColor_range_(color, range)
        self.highlightedText.setFont_range_(displayFont, range)

    def highlightText(self, content):
        """
        Apply our customized highlighting to the provided content. It is assumed that
        this content was extracted from the NSTextView.
        """
        # Calling the setAttributesForRange with no values creates
        # a default that "resets" the formatting on all of the content
        self.setAttributesForRange(None, None, None, None)
        # We'll highlight the content by breaking it down into lines, and
        # processing each line one by one. By storing how many characters
        # have been processed we can maintain an "offset" into the overall
        # content that we use to specify the range of text that is currently
        # being highlighted.
        contentLines = content.split("\n")
        highlightOffset = 0
        for line in contentLines:
            if line.strip().startswith("#"):
                # Comment - we want to highlight the whole comment line
                self.setAttributesForRange(NSColor.greenColor(), None, highlightOffset, len(line))
            elif line.find(":") > -1:
                # Tag - we only want to highlight the tag, not the colon or the remainder of the line
                startOfLine = line[0: line.find(":")]
                yamlTag = startOfLine.strip("\t ")
                yamlTagStart = line.find(yamlTag)
                self.setAttributesForRange(NSColor.blueColor(), "bold", highlightOffset + yamlTagStart, len(yamlTag))
            elif line.strip().startswith("-"):
                # List item - we only want to highlight the dash
                listIndex = line.find("-")
                self.setAttributesForRange(NSColor.redColor(), None, highlightOffset + listIndex, 1)
            # Add the processed line to our offset, as well as the newline that terminated the line
            highlightOffset += len(line) + 1

It all depends on what word is.
In [6]: word = ':Hello Hello :Hello'
In [7]: word.strip().startswith(':')
Out[7]: True
In [8]: len(word)
Out[8]: 19
Compare:
In [1]: line = ':Hello Hello :Hello'.split()
In [2]: line
Out[2]: [':Hello', 'Hello', ':Hello']
In [3]: for word in line:
   ...:     print word.strip().startswith(':')
   ...:     print len(word)
   ...:
True
6
False
5
True
6
Notice the difference in len(word), which I suspect is causing your problem.
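Building on that, one way to get a separate range for each word is to let a regular expression report where every word starts. This is a sketch only: colon_word_ranges is a hypothetical helper, and the returned (start, length) pairs are meant to be added to highlightOffset before being passed to setAttributesForRange.

```python
import re

def colon_word_ranges(text):
    """Return (start, length) for each whitespace-delimited word that starts with ':'."""
    return [(m.start(), len(m.group()))
            for m in re.finditer(r"\S+", text)
            if m.group().startswith(":")]

ranges = colon_word_ranges(":Hello Hello :Hello")
print(ranges)  # [(0, 6), (13, 6)]
```

Only the first and third words are reported, so the middle "Hello" would keep its default color.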

Showing page count with ReportLab

I'm trying to add a simple "page x of y" to a report made with ReportLab. I found this old post about it, but maybe six years later something more straightforward has emerged?
I found this recipe too, but when I use it, the resulting PDF is missing the images.

I was able to implement the NumberedCanvas approach from ActiveState. It was very easy to do and did not change much of my existing code. All I had to do was add that NumberedCanvas class and add the canvasmaker attribute when building my doc. I also changed the measurements of where the "x of y" was displayed:
self.doc.build(pdf)
became
self.doc.build(pdf, canvasmaker=NumberedCanvas)
doc is a BaseDocTemplate and pdf is my list of flowable elements.

Use doc.multiBuild, and in the page header method (the one passed as onLaterPages=):
global TOTALPAGES
if doc.page > TOTALPAGES:
    TOTALPAGES = doc.page
else:
    canvas.drawString(270 * mm, 5 * mm, "Seite %d/%d" % (doc.page,TOTALPAGES))

Just digging up some code for you, we use this:
SimpleDocTemplate(...).build(self.story,
                             onFirstPage=self._on_page,
                             onLaterPages=self._on_page)
Now self._on_page is a method that gets called for each page like:
def _on_page(self, canvas, doc):
    # ... do any additional page formatting here for each page
    print doc.page

I came up with a solution for platypus, that is easier to understand (at least I think it is). You can manually do two builds. In the first build, you can store the total number of pages. In the second build, you already know it in advance. I think it is easier to use and understand, because it works with platypus level event handlers, instead of canvas level events.
import copy
import io
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch

styles = getSampleStyleSheet()
Title = "Hello world"
pageinfo = "platypus example"
total_pages = 0

def on_page(canvas, doc: SimpleDocTemplate):
    # Remember the highest page number seen so far; after the first
    # pass this holds the total page count.
    global total_pages
    total_pages = max(total_pages, doc.page)
    canvas.saveState()
    canvas.setFont('Times-Roman', 9)
    canvas.drawString(inch, 0.75 * inch, "Page %d of %d" % (doc.page, total_pages))
    canvas.restoreState()

Story = [Spacer(1, 2 * inch)]
style = styles["Normal"]
for i in range(100):
    bogustext = ("This is Paragraph number %s. " % i) * 20
    p = Paragraph(bogustext, style)
    Story.append(p)
    Story.append(Spacer(1, 0.2 * inch))

# You MUST use a deep copy of the story!
# https://mail.python.org/pipermail/python-list/2022-March/905728.html
# First pass
with io.BytesIO() as out:
    doc = SimpleDocTemplate(out)
    doc.build(copy.deepcopy(Story), onFirstPage=on_page, onLaterPages=on_page)
# Second pass
with open("test.pdf", "wb+") as out:
    doc = SimpleDocTemplate(out)
    doc.build(copy.deepcopy(Story), onFirstPage=on_page, onLaterPages=on_page)
You just need to make sure that you always render a deep copy of your original story. Otherwise it won't work. (You will either get an empty page as the output, or a render error telling that a Flowable doesn't fit in the frame.)
