pdfminer error message: pdfminer.pdfdocument.PDFTextExtractionNotAllowed: Text extraction is not allowed - python

I need to process some PDF files and add their form field contents in a database.
This document has not Security Method set, as I can see in the PDF Viewer document properties.
I tried the suggestions I found here.
When I test using pdfminer (or pdfminer.six), I didn't get an error message, but it didn't retrieve any field.
Using PyPDF2, I get the error message: "file has not been decrypted."
This is the pdfminer code:
import sys
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fname=r'D:\Atrium\Projects\CTFC\psgf\database\19022021\formulari-dinamic-redaccio-plans-simples-gestio-forestal_Filled.pdf'
fp = open(fname, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
for i in fields:
field = resolve1(i)
name, value = field.get('T'), field.get('V')
print('{0}: {1}'.format(name, value))
print('Done!')
A sample file can be download here.
How can I do to obtain field names and content?

As mkl explained, my PDF files store form data in XFA form, a deprecated format. The XFA is an array of XML docs and I have to procure field names in each one of these docs.
I used PyPDF2 library to do that:
import PyPDF2 as pypdf
import xml.etree.ElementTree as ET
fname=r'form.pdf'
def findInDict(needle, haystack):
xlas = []
for key in haystack.keys():
try:
value=haystack[key]
except:
continue
if key==needle:
return value
if isinstance(value,dict):
x=findInDict(needle,value)
if x is not None:
return x
pdfobject=open(fname,'rb')
pdf=pypdf.PdfFileReader(pdfobject)
xfaparts=findInDict('/XFA',pdf.resolvedObjects)
for xfa in xfaparts:
if isinstance(xfa,pypdf.generic.IndirectObject):
xml = str(xfa.getObject().getData())
## Then process XML to find form tags

Related

Fetching checkbox and radio fields with PyPDF2

My project involves reading text from a bunch of PDF form files for which I'm using PyPDF2 open source library. There is no issue in getting the text data as follows:
reader = PdfReader("data/test.pdf")
cnt = len(reader.pages)
print("reading pdf (%d pages)" % cnt)
page = reader.pages[cnt-1]
lines = page.extract_text().splitlines()
print("%d lines extracted..." % len(lines))
However, this text doesn't contain the checked statuses of the radio and checkboxes. I just get normal text (like Yes No or Check-1 Check-2 Check-3 for example) instead of these values.
I also tried the reader.get_fields() and reader.get_form_text_fields() methods as described in their documentation but they return empty values. I also tried reading it through annotations but no "/Annots" found on the page. When I open the PDF in a notepad++ to see its meta data, this is what I get:
%PDF-1.4
%²³´µ
%Generated by ExpertPdf v9.2.2
It appears to me that these checkboxes aren't usual form fields used in PDF but appear similar to HTML elements. Is there any way to extract these fields using python?
Finally, I've also tried pdfminer.six, the other popular pdf library for python with same results.
It seems like the PDF you have contains XFA forms, which are not supported by the pdfminer.six library.
You may try this tool to extract the XFA forms.
If your PDF does not contain XFA forms, you can extract the form field and value data using the following snippet from pdfminer.six docs
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSLiteral, PSKeyword
from pdfminer.utils import decode_text
data = {}
def decode_value(value):
# decode PSLiteral, PSKeyword
if isinstance(value, (PSLiteral, PSKeyword)):
value = value.name
# decode bytes
if isinstance(value, bytes):
value = decode_text(value)
return value
with open(file_path, 'rb') as fp:
parser = PDFParser(fp)
doc = PDFDocument(parser)
res = resolve1(doc.catalog)
if 'AcroForm' not in res:
raise ValueError("No AcroForm Found")
fields = resolve1(doc.catalog['AcroForm'])['Fields'] # may need further resolving
for f in fields:
field = resolve1(f)
name, values = field.get('T'), field.get('V')
# decode name
name = decode_text(name)
# resolve indirect obj
values = resolve1(values)
# decode value(s)
if isinstance(values, list):
values = [decode_value(v) for v in values]
else:
values = decode_value(values)
data.update({name: values})
print(name, values)
If the above code throws No AcroForm found error, that means your PDF most probably contains XFA forms.

How to read data from a PDF form using python

I need to read data from hundreds of PDF forms. These forms have all text entry boxes, the forms are not editable. I have been trying to use Python and PyPDF2 to read these forms to a CSV file (since the ultimate goal is an excel database.
I have tried using acrobats export as csv function, but this is extremely slow as each form has 4 embedded images that export as plaintext. I have the following code,
from PyPDF2 import PdfFileReader
infile = "FormSample.pdf"
pdf_reader = PdfFileReader(open(infile, "rb"))
with open('exportharvest.csv','w') as exportharvestcsv:
dictionary = pdf_reader.getFields(fileobj = exportharvestcsv)
textfields = pdf_reader.getFormTextFields()
dest = pdf_reader.getNamedDestinations()
print(dest)
The issue with the above code is as follows: the getFields command only gets the ~4 digital signature fields in the form (form has ~300 entries). Is there some way to instruct python to look through all the fields? I know the field names in the document as they are listed when I export to pdf.
getFormTextFields() returns a dictionary of {}
getNamedDestinations() returns a dictionary of {}
Thanks for any help.
From my experience pyPDF is slow as well.
this here should do what you want:
from PyPDF2 import PdfFileReader
from pprint import pprint
pdf_file_name = 'formdocument.pdf'
f = PdfFileReader(pdf_file_name)
fields = f.getFields()
fdfinfo = dict((k, v.get('/V', '')) for k, v in fields.items())
pprint(fdfinfo)
with open('test.csv', 'w') as f2:
for key in fdfinfo.keys():
if type(key)==type("string") and type(str(fdfinfo[key]))==type("string"):
f2.write('"'+key+'","'+fdfinfo[key]+'"\n')

How can I extract text from textboxes within a PDF in Python?

I'm not having any luck with pyPDF2 or PDFMiner. The tools always return _______________ for the textboxes even if they are filled in. Does anyone have any idea on how to extract the text within the textbox fields?
You need to extract text fields, not a text. So you need something like this:
import sys
import six
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fp = open("c:\\tmp\\test.pdf", "rb")
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog["AcroForm"])["Fields"]
for i in fields:
field = resolve1(i)
name, value = field.get("T"), field.get("V")
print ("{0}:{1}".format(name,value))

PDFMiner conditional extraction of text

So I've just played around with PDFMiner and can now extract text from a PDF and throw it into an html or textfile.
pdf2txt.py -o outputfile.txt -t txt inputfile.pdf
I have then written a simple script to extract all certain strings:
with open('output.txt', 'r') as searchfile:
for line in searchfile:
if 'HELLO' in line:
print(line)
And now I can use all these strings containing the word HELLO to add to my databse if that is what I wanted.
My questions is:
Is the only way or can PDFinder grab conditional stuff before even spitting it out to the txt, html or even straight into the database?
Well, yes, you can: PDFMiner has API.
The basic example sais
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
# Open a PDF file.
fp = open('mypdf.pdf', 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, password)
# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object.
device = PDFDevice(rsrcmgr)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
for page in PDFPage.create_pages(document):
interpreter.process_page(page)
# do stuff with the page here
and in the loop you should go with
# receive the LTPage object for the page.
layout = device.get_result()
and then use LTTextBox object. You have to analyse that. There's no full example in the docs, but you may check out the pdf2txt.py source which will help you to find missing pieces (although it does much more since it parses options and applies them).
And once you have the code that extracts the text, you can do what you want before saving a file. Including searching for certain parts of text.
PS looks like this was, in a way, asked before: How do I use pdfminer as a library which should be helpful, too.

Generate flattened PDF with Python

When I print a PDF from any of my source PDFs, the file size drops and removes the text boxes presents in form. In short, it flattens the file.
This is behavior I want to achieve.
The following code to create a PDF using another PDF as a source (the one I want to flatten), it writes the text boxes form as well.
Can I get a PDF without the text boxes, flatten it? Just like Adobe does when I print a PDF as a PDF.
My other code looks something like this minus some things:
import os
import StringIO
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
directory = os.path.join(os.getcwd(), "source") # dir we are interested in
fif = [f for f in os.listdir(directory) if f[-3:] == 'pdf'] # get the PDFs
for i in fif:
packet = StringIO.StringIO()
can = canvas.Canvas(packet, pagesize=letter)
can.rotate(-90)
can.save()
packet.seek(0)
new_pdf = PdfFileReader(packet)
fname = os.path.join('source', i)
existing_pdf = PdfFileReader(file(fname, "rb"))
output = PdfFileWriter()
nump = existing_pdf.getNumPages()
page = existing_pdf.getPage(0)
for l in range(nump):
output.addPage(existing_pdf.getPage(l))
page.mergePage(new_pdf.getPage(0))
outputStream = file("out-"+i, "wb")
output.write(outputStream)
outputStream.close()
print fName + " written as", i
Summing up: I have a pdf, I add a text box to it, covering up info and adding new info, and then I print a pdf from that pdf. The text box becomes not editable or moveable any longer. I wanted to automate that process but everything I tried still allowed that text box to be editable.
If installing an OS package is an option, then you could use pdftk with its python wrapper pypdftk like this:
import pypdftk
pypdftk.fill_form('filled.pdf', out_file='flattened.pdf', flatten=True)
You would also need to install the pdftk package, which on Ubuntu could be done like this:
sudo apt-get install pdftk
The pypdftk library can by downloaded from PyPI:
pip install pypdftk
Update: pdftk was briefly removed from Ubuntu in version 18.04, but it seems it is back since 20.04.
A simple but more of a round about way it to covert the pdf to images than to put those image into a pdf.
You'll need pdf2image and PIL
Like So
from pdf2image import convert_from_path
from PIL import Image
images = convert_from_path('temp.pdf')
im1 = images[0]
images.pop(0)
pdf1_filename = "flattened.pdf"
im1.save(pdf1_filename, "PDF" ,resolution=100.0, save_all=True, append_images=images)
Edit:
I created a library to do this called fillpdf
pip install fillpdf
from fillpdf import fillpdfs
fillpdfs.flatten_pdf('input.pdf', 'newflat.pdf')
Per the Adobe Docs, you can change the Bit Position of the Editable Form Fields to 1 to make the field ReadOnly. I provided a full solution here, but it uses Django:
https://stackoverflow.com/a/55301804/8382028
Adobe Docs (page 552):
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
Use PyPDF2 to fill the fields, then loop through the annotations to change the bit position:
from io import BytesIO
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, NumberObject
# open the pdf
input_stream = open("YourPDF.pdf", "rb")
reader = PdfFileReader(input_stream, strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
reader.trailer["/Root"]["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)}
)
writer = PdfFileWriter()
writer.set_need_appearances_writer()
if "/AcroForm" in writer._root_object:
# Acro form is form field, set needs appearances to fix printing issues
writer._root_object["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)}
)
data_dict = dict() # this is a dict of your form values
writer.addPage(reader.getPage(0))
page = writer.getPage(0)
# update form fields
writer.updatePageFormFieldValues(page, data_dict)
for j in range(0, len(page["/Annots"])):
writer_annot = page["/Annots"][j].getObject()
for field in data_dict:
if writer_annot.get("/T") == field:
# make ReadOnly:
writer_annot.update({NameObject("/Ff"): NumberObject(1)})
output_stream = BytesIO()
writer.write(output_stream)
# output_stream is your flattened PDF
A solution that goes for Windows as well, converts many pdf pages and flatens the chackbox values as well. For some reason #ViaTech code did not work in my pc (Windows7 python 3.8)
Followed #ViaTech indications and used extensively #hchillon code from this post
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, TextStringObject, NumberObject
def set_need_appearances_writer(writer):
try:
catalog = writer._root_object
# get the AcroForm tree and add "/NeedAppearances attribute
if "/AcroForm" not in catalog:
writer._root_object.update({
NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
need_appearances = NameObject("/NeedAppearances")
writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
return writer
except Exception as e:
print('set_need_appearances_writer() catch : ', repr(e))
return writer
class PdfFileFiller(object):
def __init__(self, infile):
self.pdf = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in self.pdf.trailer["/Root"]:
self.pdf.trailer["/Root"]["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)})
# newvals and newchecks have keys have to be filled. '' is not accepted
def update_form_values(self, outfile, newvals=None, newchecks=None):
self.pdf2 = MyPdfFileWriter()
trailer = self.pdf.trailer['/Root'].get('/AcroForm', None)
if trailer:
self.pdf2._root_object.update({
NameObject('/AcroForm'): trailer})
set_need_appearances_writer(self.pdf2)
if "/AcroForm" in self.pdf2._root_object:
self.pdf2._root_object["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)})
for i in range(self.pdf.getNumPages()):
self.pdf2.addPage(self.pdf.getPage(i))
self.pdf2.updatePageFormFieldValues(self.pdf2.getPage(i), newvals)
for j in range(0, len(self.pdf.getPage(i)['/Annots'])):
writer_annot = self.pdf.getPage(i)['/Annots'][j].getObject()
for field in newvals:
writer_annot.update({NameObject("/Ff"): NumberObject(1)})
self.pdf2.updatePageFormCheckboxValues(self.pdf2.getPage(i), newchecks)
with open(outfile, 'wb') as out:
self.pdf2.write(out)
class MyPdfFileWriter(PdfFileWriter):
def __init__(self):
super().__init__()
def updatePageFormCheckboxValues(self, page, fields):
for j in range(0, len(page['/Annots'])):
writer_annot = page['/Annots'][j].getObject()
for field in fields:
writer_annot.update({NameObject("/Ff"): NumberObject(1)})
origin = ## Put input pdf path here
destination = ## Put output pdf path here, even if the file does not exist yet
newchecks = {} # A dict with all checkbox values that need to be changed
newvals = {'':''} # A dict with all entry values that need to be changed
# newvals dict has to be equal to {'':''} in case that no changes are needed
c = PdfFileFiller(origin)
c.update_form_values(outfile=destination, newvals=newvals, newchecks=newchecks)
print('PDF has been created\n')
I had trouble flattening a form that I had entered content into using pdfrw (How to Populate Fillable PDF's with Python) and found that I had to add an additional step using generate_fdf (pdftk flatten loses fillable field data).
os.system('pdftk '+outtemp+' generate_fdf output '+outfdf)
os.system('pdftk '+outtemp+' fill_form '+outfdf+' output '+outpdf)
I came to this solution because I was able to flatten a file just fine using ghostscript's pdf2ps followed by ps2pdf on my Mac, but the quality had low resolution when I ran it on an Amazon Linux instance. I couldn't figure out why that was the case and so moved to the pdftk solution.

Categories