How to check / uncheck checkboxes in a PDF with Python (preferably PyPDF2)? - python

I have the code below
from PyPDF2 import PdfFileReader, PdfFileWriter
d = {
    "Name": "James",
    " Date": "1/1/2016",
    "City": "Wilmo",
    "County": "United States"
}
reader = PdfFileReader("medicareRRF.pdf")
inFields = reader.getFields()
watermark = PdfFileReader("justSign.pdf")
writer = PdfFileWriter()
page = reader.getPage(0)
page.mergePage(watermark.getPage(0))
writer.addPage(page)
written_page = writer.getPage(0)
writer.updatePageFormFieldValues(written_page, d)
This correctly fills in the PDF with the dictionary (d), but how can I check and uncheck boxes on the PDF? Here is the getFields() info for one of the boxes:
u'Are you ok': {'/FT': '/Btn','/Kids': [IndirectObject(36, 0),
IndirectObject(38, 0)],'/T': u'Are you ok','/V': '/No'}
I tried adding {'Are you ok' : '/Yes'} and several other similar ways, but nothing worked.

I came across the same issue, looked in several places, and was disappointed that I couldn't find the answer. After a few frustrating hours looking at my code, the PyPDF2 code, and the Adobe PDF 1.7 spec, I finally figured it out. If you debug into updatePageFormFieldValues, you'll see that it writes every value as a TextStringObject. Checkboxes are not text fields, and even their /V values are not text strings, which seemed counterintuitive, at least to me. Debugging into that function showed me that checkbox values are NameObjects instead, so I created my own function to handle them. I create two dicts: one with only text values, which I pass to the built-in updatePageFormFieldValues function, and a second with only checkbox values. I also set /AS to the same name so that the checked state is actually displayed (see the PDF spec). My function looks like this:
from PyPDF2.generic import NameObject
def updateCheckboxValues(page, fields):
    # Walk every annotation (form widget) on the page.
    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        for field in fields:
            if writer_annot.get('/T') == field:
                # Checkbox values are PDF names, not text strings.
                writer_annot.update({
                    NameObject("/V"): NameObject(fields[field]),
                    NameObject("/AS"): NameObject(fields[field])
                })
However, as far as I can tell, whether you use /1, /On, or /Yes depends on how the form was defined or perhaps what the PDF reader is looking for. For me, /1 worked.
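Rather than guessing between /1, /On and /Yes, one option is to read the name from the form itself. In a typical checkbox widget, the normal appearance dictionary (/AP, then /N) has one entry per state, usually /Off plus the "on" name. The helper below is a minimal sketch under that assumption. The name checkbox_on_states is made up for illustration, and it only relies on dictionary-style access, so it is shown here with plain dicts standing in for a PyPDF2 annotation object.

```python
def checkbox_on_states(annotation):
    """Return the appearance-state names other than /Off.

    Assumes `annotation` behaves like a dict and that /AP -> /N is a
    dictionary keyed by state name, as is usual for checkbox widgets.
    """
    if "/AP" not in annotation:
        return []
    appearance = annotation["/AP"]
    if "/N" not in appearance:
        return []
    normal = appearance["/N"]
    if not hasattr(normal, "keys"):
        return []
    return [str(name) for name in normal.keys() if str(name) != "/Off"]


# Plain-dict stand-in for a checkbox widget (hypothetical data).
sample_widget = {
    "/T": "Are you ok",
    "/AP": {"/N": {"/1": "on-appearance", "/Off": "off-appearance"}},
}
on_states = checkbox_on_states(sample_widget)
print(on_states)
```

With a real file, the same function could be called on each page['/Annots'][j].getObject(), and the returned name passed as the checkbox value.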

I would like to add to @rpsip's answer.
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import NameObject
reader = PdfReader(r"form2.pdf")  # the PDF is read from the current directory
writer = PdfWriter()
page = reader.pages[0]  # first page of the PDF
fields = reader.get_fields()
print(fields)  # check that the form fields are visible on that page
writer.add_page(page)  # this line is necessary, otherwise the PDF will be corrupted
for annot_ref in page["/Annots"]:  # walk the entries under the "/Annots" key
    writer_annot = annot_ref.get_object()
    print(writer_annot)  # shows which form fields are checkboxes and which are text fields
    # Filter for checkboxes ("/Btn") and for the one I want, i.e. "Check Box3".
    # .get() avoids a KeyError on annotations that have no /FT or /T entry.
    if writer_annot.get("/FT") == "/Btn" and writer_annot.get("/T") == "Check Box3":
        print(writer_annot)  # confirm that the filter matched the right field
        # NameObject is only for checkboxes; try "/Yes", "/1" or "/On" to see which works.
        writer_annot.update(
            {
                NameObject("/V"): NameObject("/Yes"),
                NameObject("/AS"): NameObject("/Yes"),
            }
        )
with open("filled-out.pdf", "wb") as output_stream:
    writer.write(output_stream)  # save the ticked PDF as another file named "filled-out.pdf"
Hope this helps.
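The two-dictionary approach from the first answer (text values go to the built-in form-filling function, checkbox values are handled separately) could be automated by looking at each field's /FT entry. This is only a sketch: split_field_values is a made-up helper name, field_info is assumed to have the same shape as the getFields() output shown in the question, and note that /Btn also covers radio buttons and push buttons, so a real form may need a finer filter.

```python
def split_field_values(field_info, values):
    """Split `values` into (text_values, checkbox_values) by field type.

    `field_info` maps field names to dict-like entries with an "/FT" key,
    in the shape returned by getFields(). Fields of type "/Btn" are
    treated as checkboxes; everything else is treated as text.
    """
    text_values = {}
    checkbox_values = {}
    for name, value in values.items():
        field_type = field_info.get(name, {}).get("/FT")
        if field_type == "/Btn":
            checkbox_values[name] = value
        else:
            text_values[name] = value
    return text_values, checkbox_values


# Field metadata in the shape shown in the question (abridged).
field_info = {
    "Name": {"/FT": "/Tx", "/T": "Name"},
    "Are you ok": {"/FT": "/Btn", "/T": "Are you ok", "/V": "/No"},
}
values = {"Name": "James", "Are you ok": "/Yes"}
text_values, checkbox_values = split_field_values(field_info, values)
print(text_values, checkbox_values)
```

The first dict can then go to the library's text-field update function and the second to a checkbox function like the ones shown in the answers.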

Related

How to Extract a PDF Form Including CheckBox ( X ) Data in C#

I'm working on a PDF. The main idea is to extract the PDF content, including images and text as well as checkboxes. I can extract the text content and the images,
but I am unable to extract the checkbox data. I have tried iTextSharp and another open-source tool for this, but could not get the check status (i.e. true or false).
My C# is rusty, but using the latest version of iText, it should be something like this:
PdfDocument doc = new PdfDocument(new PdfReader(@"c:\temp\form.pdf"));
PdfAcroForm form = PdfAcroForm.GetAcroForm(doc, false);
IDictionary<string, PdfFormField> fields = form.GetFormFields();
foreach (KeyValuePair<string, PdfFormField> entry in fields)
{
    PdfFormField field = entry.Value;
    if (field is PdfButtonFormField)
    {
        Console.WriteLine(entry.Key + " has " + field.GetValueAsString());
    }
}
where GetValueAsString() will typically have "Yes" for checked or "Off" or empty for unchecked.
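To turn that string into the true/false status the question asks for, the rule above (empty or "Off" means unchecked, anything else means checked) can be written as a small function. It is sketched in Python here to match the other answers on this page; checkbox_is_checked is a made-up name, and stripping a leading "/" is an assumption so that raw PDF names such as "/Yes" are handled too.

```python
def checkbox_is_checked(value):
    """Interpret a checkbox value string as a boolean.

    Empty, None, or "Off" counts as unchecked; any other name
    ("Yes", "On", "1", ...) counts as checked.
    """
    if value is None:
        return False
    normalized = str(value).strip().lstrip("/")
    return normalized not in ("", "Off")


for raw in ("Yes", "Off", "", "/1"):
    print(repr(raw), checkbox_is_checked(raw))
```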

Python PDF how to add bookmark url instead of page number

I'm using python 3.6 and PyPDF2 to create bookmarks in a pdf.
Instead of adding a bookmark to a page within the PDF, I want to add a URL (e.g. https://stackoverflow.com) as a bookmark.
Something like this?
output.addBookmark('TEST', 'https://stackoverflow.com', parent=None)
I don't think PyPDF2 supports something like this or does it? Is there another library that can support this?
from PyPDF2 import PdfFileReader, PdfFileWriter
output = PdfFileWriter()
input = PdfFileReader(open('test.pdf', 'rb'))
output.addPage(input.getPage(0))
output.addBookmark('TEST', 0, parent=None) # add bookmark
outputStream = open('output.pdf', 'wb')
output.write(outputStream)
outputStream.close()
I might be a bit late but hear me out. I had to solve the same problem and since there is literally 0 information on this anywhere, I've investigated the issue and have managed to find the answer. It's not my most efficient/nicest code but definitely one of my proudest since nobody before did this.
The problem one faces is that there is no PDF library for Python that solves all PDF-related problems. Each tackles problems differently, and some can do things that others can't. For this purpose, I had to use two libraries for the two functions below. PyPDF2 is here to add the bookmarks to the PDF. pdfrw is here to alter those bookmarks so that their action is to open a URL.
In short, we create a new PDF with the added bookmarks, and then another new one in which the bookmark actions have been changed to point to a URL.
One thing to mention is that for some reason (for me at least) PyPDF2 adds all bookmarks in such a way that, if you have multiple ones, each becomes a child element of the previous bookmark. This is why we have the while loop: we collect all the bookmarks into a list with it, and then we can select the one we want.
If you already have bookmarks and don't add them with PyPDF2, it might be enough to just loop over the metaObjects dictionary and get the values that contain the /Title key, which makes the code significantly smaller. I've added this part as a comment.
Here is an example of how to use the code below:
inputPdf = r"C:\......\first.pdf"
bookmarkedPdf = r"C:\......\second.pdf"
pdfWithWeblink = r"C:\......\final.pdf"
bookmarks = [
    {"Title": "The Phantom Menace", "Page": 5},
    {"Title": "Attack of the Clones", "Page": 10},
    {"Title": "Revenge Of The Sith", "Page": 13},
    {"Title": "A New hope", "Page": 18},
    {"Title": "The Empire Strikes Back", "Page": 26},
    {"Title": "Return of the Jedi", "Page": 32}
]
AddBookmarks(inputPdf, bookmarkedPdf, bookmarks)
AddWebLinkToBookmark(bookmarkedPdf, pdfWithWeblink, "Revenge Of The Sith", "https://stackoverflow.com")
The code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import pdfrw
def AddBookmarks(inputPdfPath: str, outputPdfPath: str, headers: list) -> str:
    """ Adds bookmarks to a PDF and returns the output path. """
    output = PdfFileWriter()
    input = PdfFileReader(open(inputPdfPath, 'rb'))
    for i in range(input.getNumPages()):
        output.addPage(input.getPage(i))
        for header in headers:
            if header["Page"] - 1 == i:
                output.addBookmark(header["Title"], header["Page"] - 1, parent=None)
    output.setPageMode("/UseOutlines")
    outputStream = open(outputPdfPath, 'wb')
    output.write(outputStream)
    outputStream.close()
    return outputPdfPath
def AddWebLinkToBookmark(inputPdfPath: str, outputPdfPath: str, bookmarkTitle: str, url: str) -> None:
    """ Changes the bookmark action to opening a web url. """
    # Reading the Pdf with pdfrw and collecting its meta objects. The bookmarks are among these.
    pdf = pdfrw.PdfReader(inputPdfPath, decompress=True)
    metaObjects = pdf.indirect_objects
    # If you did not add the bookmarks with PyPDF2 previously, use this part for getting the bookmarkToChange variable:
    # bookmarkToChange = None
    # for _, annotation in metaObjects.items():
    #     if '/Title' in annotation:
    #         if annotation["/Title"] == f"({bookmarkTitle})".replace(" ", "\\040"):
    #             bookmarkToChange = annotation
    # if bookmarkToChange is None:
    #     print(f"There is no bookmark called '{bookmarkTitle}' in this pdf.")
    #     return
    try:
        # Selecting the first, top parent bookmark.
        bookmark = [annotation for _, annotation in metaObjects.items() if '/Title' in annotation][0]
    except IndexError:
        print("There are no bookmarks in this pdf.")
        return
    # Each bookmark is the child of the previous bookmark. They can be accessed from the parent with the '/Next' key.
    bookmarkAnnotations = [bookmark]
    while "/Next" in bookmark:
        if "/Title" not in bookmark["/Next"]:
            break
        bookmark = bookmark["/Next"]
        bookmarkAnnotations.append(bookmark)
    try:
        # Selecting the bookmark we want to add the url to.
        bookmarkToChange = [annotation for annotation in bookmarkAnnotations if annotation["/Title"] == f"({bookmarkTitle})".replace(" ", "\\040")][0]
    except IndexError:
        print(f"There is no bookmark called '{bookmarkTitle}' in this pdf.")
        return
    # Changing the internal PDF commands to point to a url instead of a page.
    bookmarkToChange.A.D = None # Deletes the page information the 'Go to page' action is pointing to.
    bookmarkToChange.A.S = pdfrw.PdfName("URI") # Changes the 'Go to page' action to an 'Open a web link' action.
    bookmarkToChange.A.URI = pdfrw.objects.pdfstring.PdfString(f"({url})") # Specifies the url for the 'Open a web link' action.
    # Saving the end result into a new file.
    pdfrw.PdfWriter().write(outputPdfPath, pdf)
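The lookup logic in AddWebLinkToBookmark has two parts that are easy to miss: following the /Next chain from the first bookmark, and comparing against the title in its stored form, i.e. wrapped in parentheses with spaces written as \040. The sketch below isolates just those two steps using plain dicts instead of pdfrw objects, so it can be tried without a PDF. The helper names encode_title, collect_bookmarks and find_bookmark are made up for illustration.

```python
def encode_title(title):
    """Mimic the stored title form used above: "(Title)" with spaces as \\040."""
    return f"({title})".replace(" ", "\\040")


def collect_bookmarks(first):
    """Follow the /Next chain from `first` and return the bookmarks in order."""
    chain = []
    node = first
    while node is not None and "/Title" in node:
        chain.append(node)
        node = node.get("/Next")
    return chain


def find_bookmark(first, title):
    """Return the bookmark whose stored title matches `title`, or None."""
    wanted = encode_title(title)
    for node in collect_bookmarks(first):
        if node["/Title"] == wanted:
            return node
    return None


# Plain-dict stand-ins for three chained bookmarks (hypothetical data).
third = {"/Title": encode_title("Revenge Of The Sith")}
second = {"/Title": encode_title("Attack of the Clones"), "/Next": third}
first = {"/Title": encode_title("The Phantom Menace"), "/Next": second}

match = find_bookmark(first, "Revenge Of The Sith")
print(match is third, len(collect_bookmarks(first)))
```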

Python PDF Parser - Engineering Drawing

I am trying to write a Python script to parse a PDF file using PyPDF2. The only thing is, my PDF file isn't a traditional document; it's an engineering drawing.
Anyway, I need the code to parse the text that is written in the bottom right corner, as well as a red stamp that has text written on it. The drawing looks something like this: [image]
I tried to write some basic code to just parse it and extract the data, but it's not working.
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Does anyone have any recommendations?
Late to the party...
Nonetheless, we developed a commercial product to do exactly that: Werk24. It has a simple Python client: pip install werk24
With this, your task becomes very simple. You can read the title block with a single command. Imagine you want to obtain the Designation:
from werk24 import Hook, W24AskTitleBlock
from werk24.models.techread import W24TechreadMessage
from werk24.utils import w24_read_sync
from . import get_drawing_bytes # define your own
def recv_title_block(message: W24TechreadMessage) -> None:
    """ Print the Designation
    NOTE: Other fields like Drawing ID, Material etc are
    also available.
    """
    print(message.payload_dict.get('designation'))
if __name__ == "__main__":
    # submit the request to Werk24
    w24_read_sync(
        get_drawing_bytes(),
        [Hook(
            ask=W24AskTitleBlock(),
            function=recv_title_block
        )])
For the drawing that you provided, the response will be:
"designation": {
    "captions": [
        {
            "language": "eng",
            "text": "Descr"
        }
    ],
    "values": [
        {
            "language": "eng",
            "text": "Shaft"
        }
    ]
}
NOTE: Your file is very blurry, so I created the response manually. The API requires a minimum resolution of 180 dpi (it also works with TIF and DXF files).

Extracting CSV from Export Button

I apologize for not being able to give out the specific URL I'm dealing with. I'm trying to extract some data from a certain site, but it's not organized well enough. However, they do have an "Export To CSV file" button, and the code for that block is ...
<input type="submit" name="ctl00$ContentPlaceHolder1$ExportValueCSVButton" value="Export to Value CSV" id="ContentPlaceHolder1_ExportValueCSVButton" class="smallbutton">
In this type of situation, what's the best way to go about grabbing that data when there is no specific URL for the CSV? I'm using Mechanize and BS4.
If you're able to click a button that downloads the data as a CSV, it sounds like you might be able to wget that data, save it on your machine, and work with it there. I'm not sure if that's what you're getting at here, though. Any more details you can offer?
You should try Selenium. Selenium is a suite of tools for automating web browsers across many platforms. It can do a lot of things, including clicking buttons.
Well, you need SOME starting URL to feed br.open() to even start the process.
It appears that you have an aspnetForm type control there and the below code MAY serve as a bit of a starting point, even though it does not work as-is (it's a work in progress...:-).
You'll need to look at the headers and parameters via the network tab of your browser dev tools to see them.
import string
import mechanize
from bs4 import BeautifulSoup as BS
br = mechanize.Browser()
# Assumed outer loop: the original fragment used `letter` and `continue` without showing the loop.
for letter in string.ascii_uppercase:
    br.open("http://media.ethics.ga.gov/search/Lobbyist/Lobbyist_results.aspx?&Year=2016&LastName="+letter+"&FirstName=&City=&FilerID=")
    soup = BS(br.response().read())
    table = soup.find("table", { "id" : "ctl00_ContentPlaceHolder1_Results" }) # Need to add error check here...
    if table is None: # No lobbyist with last name starting with 'X' :-)
        continue
    records = table.find_all('tr') # List of all results for this letter
    for form in br.forms():
        print("Form name:", form.name)
        print(form)
    for row in records:
        rec_print = ""
        span = row.find_all('span', 'lblentry', 'value')
        for sname in span:
            if ',' in sname.get_text(): # They actually have a field named 'comma'!!
                continue
            rec_print = rec_print + sname.get_text() + "," # Create comma-delimited output
        print(rec_print[:-1]) # Strip final comma
        lnk = row.find('a', 'lblentrylink')
        if lnk is None: # For some reason, first record is blank.
            continue
        print("Lnk: ", lnk)
        newlnk = lnk['id']
        print("NEWLNK: ", newlnk)
        newstr = lnk['href']
        newctl = newstr[+25:-5] # Matching placeholder (strip javascript....)
        br.select_form('aspnetForm') # Tried (nr=0) also...
        print("NEWCTL: ", newctl)
        br["__EVENTTARGET"] = newctl # Key must be a string; the bare name raised a NameError
        response = br.submit(name=newlnk).read()
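For the export button itself, the request a browser sends when the button is clicked is usually a POST of the form's hidden fields plus the button's own name and value, which is what the network tab will show. The sketch below builds such a payload with only the standard library. The sample HTML and its hidden field values are invented for illustration (__VIEWSTATE and __EVENTVALIDATION are the usual ASP.NET WebForms hidden fields, but the real page may carry others), and the button name and value are taken from the snippet in the question.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields[attributes["name"]] = attributes.get("value") or ""


def build_export_payload(html, button_name, button_value):
    """Return the form data a click on the given submit button would send."""
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = dict(collector.fields)
    payload[button_name] = button_value
    return payload


# Hypothetical page fragment; only the submit button comes from the question.
SAMPLE_HTML = """
<form id="aspnetForm" method="post">
<input type="hidden" name="__VIEWSTATE" value="abc123">
<input type="hidden" name="__EVENTVALIDATION" value="xyz789">
<input type="submit" name="ctl00$ContentPlaceHolder1$ExportValueCSVButton" value="Export to Value CSV" id="ContentPlaceHolder1_ExportValueCSVButton" class="smallbutton">
</form>
"""

payload = build_export_payload(
    SAMPLE_HTML,
    "ctl00$ContentPlaceHolder1$ExportValueCSVButton",
    "Export to Value CSV",
)
body = urlencode(payload)  # URL-encoded body for the POST request
print(body)
```

The encoded body would then be posted back to the page URL with whatever HTTP client is in use; if the site relies on cookies or extra headers, those need to be carried over as well.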

Provide tab title with reportlab generated pdf

This question is really simple, but I can't find any data on it.
When I generate a pdf with reportlab, passing the httpresponse as a file, browsers that are configured to show files display the pdf correctly. However, the title of the tab remains "(Anonymous) 127.0.0.1/whatnot", which is kinda ugly for the user.
Since most sites are able to somehow display an appropriate title, I think it's doable... Is there some sort of title parameter that I can pass to the PDF? Or some header for the response? This is my code:
def render_pdf_report(self, context, file_name):
    response = HttpResponse(content_type='application/pdf')
    response['Content-Disposition'] = 'filename="{}"'.format(file_name)
    document = BaseDocTemplate(response, **self.get_create_document_kwargs())
    # pdf generation code
    document.build(story)
    return response
Seems that Google Chrome doesn't display the PDF titles at all.
I tested the link in your comment (biblioteca.org.ar) and it displays in Firefox as " - 211756.pdf", seems there's an empty title and Firefox then just displays the filename instead of the full URL path.
I reproduced the same behaviour using this piece of code:
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf")
c.setTitle("hello stackoverflow")
c.drawString(100, 750, "Welcome to Reportlab!")
c.save()
Opening it in Firefox yields the needed result:
I found out about setTitle in ReportLab's User Guide. It has it listed on page 16. :)
I was also looking for this and I found this in the source code.
reportlab/src/reportlab/platypus/doctemplate.py
# line - 467
We can set the document's title by
document.title = 'Sample Title'
I realise this is an old question but dropping in an answer for anyone using SimpleDocTemplate. The title property can be set in constructor of SimpleDocTemplate as a kwarg. e.g.
doc = SimpleDocTemplate(pdf_bytes, title="my_pdf_title")
If you are using trml2pdf, you will need to add the "title" attribute to the template tag, i.e., <template title="Invoices" ...
In addition to what others have said, you can use
Canvas.setTitle("yourtitle")
which shows up fine in Chrome.
