How to edit editable pdf using the pdfrw library? - python

I have been researching how to edit PDFs with Python and found this article:
How to Populate Fillable PDF's with Python
However, there is a problem: after the program runs and you open the PDF, the fields do not appear populated. The data shows only when you click on a field, and it disappears again when you click away. This is code that can be found online, written by someone else.
#! /usr/bin/python
import os
import pdfrw
INVOICE_TEMPLATE_PATH = 'invoice_template.pdf'
INVOICE_OUTPUT_PATH = 'invoice.pdf'
ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'
def write_fillable_pdf(input_pdf_path, output_pdf_path, data_dict):
    template_pdf = pdfrw.PdfReader(input_pdf_path)
    annotations = template_pdf.pages[0][ANNOT_KEY]
    for annotation in annotations:
        if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
            if annotation[ANNOT_FIELD_KEY]:
                key = annotation[ANNOT_FIELD_KEY][1:-1]
                if key in data_dict.keys():
                    annotation.update(
                        pdfrw.PdfDict(V='{}'.format(data_dict[key]))
                    )
    pdfrw.PdfWriter().write(output_pdf_path, template_pdf)
data_dict = {
    'business_name_1': 'Bostata',
    'customer_name': 'company.io',
    'customer_email': 'joe@company.io',
    'invoice_number': '102394',
    'send_date': '2018-02-13',
    'due_date': '2018-03-13',
    'note_contents': 'Thank you for your business, Joe',
    'item_1': 'Data consulting services',
    'item_1_quantity': '10 hours',
    'item_1_price': '$200/hr',
    'item_1_amount': '$2000',
    'subtotal': '$2000',
    'tax': '0',
    'discounts': '0',
    'total': '$2000',
    'business_name_2': 'Bostata LLC',
    'business_email_address': 'hi@bostata.com',
    'business_phone_number': '(617) 930-4294'
}
if __name__ == '__main__':
    write_fillable_pdf(INVOICE_TEMPLATE_PATH, INVOICE_OUTPUT_PATH, data_dict)

I figured out that adding the NeedAppearances parameter solves the problem:
template_pdf = pdfrw.PdfReader(TEMPLATE_PATH)
template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))

Updating the write function to set both the AP and V keys fixed the problem for me in Preview:
pdfrw.PdfDict(AP=data_dict[key], V=data_dict[key])

The error occurs because no appearance stream is associated with the field, but you've created it the wrong way: you've just assigned a stream to the AP dictionary. What you need to do is assign an indirect XObject to /N in the /AP dictionary, and you need to create that XObject from scratch.
The code should be something like the following:
from pdfrw import PdfWriter, PdfReader, IndirectPdfDict, PdfName, PdfDict
INVOICE_TEMPLATE_PATH = 'untitled.pdf'
INVOICE_OUTPUT_PATH = 'untitled-output.pdf'
field1value = 'im field_1 value'
template_pdf = PdfReader(INVOICE_TEMPLATE_PATH)
template_pdf.Root.AcroForm.Fields[0].V = field1value
# this depends on page orientation
rct = template_pdf.Root.AcroForm.Fields[0].Rect
height = round(float(rct[3]) - float(rct[1]), 2)
width = round(float(rct[2]) - float(rct[0]), 2)
# create XObject
xobj = IndirectPdfDict(
    BBox = [0, 0, width, height],
    FormType = 1,
    Resources = PdfDict(ProcSet = [PdfName.PDF, PdfName.Text]),
    Subtype = PdfName.Form,
    Type = PdfName.XObject
)
# assign a stream to it
xobj.stream = '''/Tx BMC
BT
/Helvetica 8.0 Tf
1.0 5.0 Td
0 g
(''' + field1value + ''') Tj
ET EMC'''
# put it all together
template_pdf.Root.AcroForm.Fields[0].AP = PdfDict(N = xobj)
# output to new file
PdfWriter().write(INVOICE_OUTPUT_PATH, template_pdf)
Note: /Type, /FormType and /Resources are optional (/Resources is strongly recommended).
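One caveat with the stream above: the value is concatenated straight into a PDF literal string, so a value containing parentheses or a backslash could produce a malformed stream. Below is a minimal sketch of escaping the value first, assuming the same single-line Helvetica 8 pt layout. The helper names escape_pdf_text and build_text_stream are hypothetical and not part of pdfrw.

```python
def escape_pdf_text(value):
    """Escape backslashes and parentheses for use inside a PDF literal string."""
    return (
        value.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def build_text_stream(value, font="/Helvetica", size=8.0, x=1.0, y=5.0):
    """Return content-stream text that draws `value` as one line of text."""
    return (
        "/Tx BMC\n"
        "BT\n"
        f"{font} {size} Tf\n"
        f"{x} {y} Td\n"
        "0 g\n"
        f"({escape_pdf_text(value)}) Tj\n"
        "ET EMC"
    )


stream = build_text_stream("Total (net) due")
print(stream)
```

The returned string could be assigned to xobj.stream in place of the hand-built one.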

To expand on Sergio's answer above, the following line:
template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))
should be placed after this line in the OP's example code:
template_pdf = pdfrw.PdfReader(input_pdf_path)

In case someone has dropdown fields on the form you want to populate with data you can use the code below. (Might save someone the hassle I went through)
if key in data_dict.keys():
    # see if it's a dropdown
    if '/I' in annotation.keys():
        # field is a dropdown
        # check if value is in preset list of dropdown, and at what index
        if data_dict[key] in annotation['/Opt']:
            # value is in dropdown list, select value from list
            annotation.update(pdfrw.PdfDict(I='[{}]'.format(annotation['/Opt'].index(data_dict[key]))))
        else:
            # value is not in dropdown list, add as 'free input'
            annotation.update(pdfrw.PdfDict(I='{}'.format(None)))
            annotation.update(pdfrw.PdfDict(V='{}'.format(data_dict[key])))
    else:
        # update the text field value
        annotation.update(pdfrw.PdfDict(V='{}'.format(data_dict[key])))
Also note that the OP's code only works for the first page because of:
template_pdf.pages[0]
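A rough sketch of visiting the widgets on every page instead of only the first one. Plain dicts stand in for pdfrw page and annotation objects here (an assumption made so the example runs without a PDF file); with pdfrw you would iterate template_pdf.pages and read '/Annots' from each page. The helper name iter_widgets is hypothetical. A page without form fields may have no '/Annots' entry, so the sketch guards against that.

```python
def iter_widgets(pages):
    """Yield every widget annotation on every page, skipping pages without annotations."""
    for page in pages:
        annotations = page.get('/Annots') or []
        for annotation in annotations:
            if annotation.get('/Subtype') == '/Widget':
                yield annotation


# Plain dicts standing in for pdfrw pages (illustration only).
pages = [
    {'/Annots': [{'/Subtype': '/Widget', '/T': '(customer_name)'}]},
    {},  # a page with no annotations
    {'/Annots': [{'/Subtype': '/Link'}, {'/Subtype': '/Widget', '/T': '(total)'}]},
]

# Strip the surrounding parentheses from each field name, as the OP's code does.
names = [a['/T'][1:-1] for a in iter_widgets(pages)]
print(names)
```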

Related

Extract Text from a word document

I am trying to scrape data from a Word document available at:
https://dl.dropbox.com/s/pj82qrctzkw9137/HE%20Distributors.docx
I need to scrape the Name, Address, City, State, and Email ID. I am able to scrape the E-mail using the below code.
import docx
content = docx.Document('HE Distributors.docx')
location = []
for i in range(len(content.paragraphs)):
    stat = content.paragraphs[i].text
    if 'Email' in stat:
        location.append(i)
for i in location:
    print(content.paragraphs[i].text)
I tried the steps described in:
How to read data from .docx file in python pandas?
I need to convert this into a DataFrame with all the columns mentioned above, but I am still facing issues.
There are some inconsistencies in the document: phone numbers start with Tel: sometimes, Tel.: other times, and even Te: once; one of the emails is just in the last line for that distributor without the Email: prefix; and the State isn't always in the last line. Still, for the most part, the data can be extracted with regex and/or splits.
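As a small illustration of the regex approach for the inconsistent phone prefixes, here is a sketch that accepts Tel:, Tel.: and Te: alike. The sample lines are made up for the example, and the helper name strip_tel_prefix is hypothetical. The pattern stops at the first colon, which is slightly stricter than a greedy "^Te.*:".

```python
import re

# Matches 'Tel:', 'Tel.:', 'Te:' and similar prefixes at the start of a line.
TEL_PREFIX = re.compile(r"^Te[^:]*:")


def strip_tel_prefix(line):
    """Return the text after a Tel-like prefix, or None if the line has none."""
    if not TEL_PREFIX.match(line):
        return None
    return TEL_PREFIX.sub("", line, count=1).strip()


# Made-up sample lines, not taken from the document.
samples = [
    "Tel: 011-23456789",
    "Tel.: 022 4567 8901",
    "Te: 033-1234567",
    "Email: info@example.com",
]
print([strip_tel_prefix(s) for s in samples])
```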
The distributors are separated by empty lines, and the names are in a different color - so I defined this function to get the font color of any paragraph from its xml:
from bs4 import BeautifulSoup
def getParaColor(para):
    try:
        return BeautifulSoup(
            para.paragraph_format.element.xml, 'xml'
        ).find('color').get('w:val')
    except Exception:
        return ''
The try...except hasn't been necessary yet, but just in case...
(The xml is actually also helpful for double-checking that .text hasn't missed anything - in my case, I noticed that the email for Shri Adhya Educational Books wasn't getting extracted.)
Then, you can process the paragraphs from docx.Document with a function like:
import re
def splitParas(paras):
    ptc = [(
        p.text, getParaColor(p), p.paragraph_format.element.xml
    ) for p in paras]
    curSectn = 'UNKNOWN'
    splitBlox = [{}]
    for pt, pc, px in ptc:
        # double-check for missing text
        xmlText = BeautifulSoup(px, 'xml').text
        xmlText = ' '.join([s for s in xmlText.split() if s != ''])
        if len(xmlText) > len(pt): pt = xmlText
        # initiate
        if not pt:
            if splitBlox[-1] != {}:
                splitBlox.append({})
            continue
        if pc == '20752E':
            curSectn = pt.strip()
            continue
        if splitBlox[-1] == {}:
            splitBlox[-1]['section'] = curSectn
            splitBlox[-1]['raw'] = []
            splitBlox[-1]['Name'] = []
            splitBlox[-1]['address_raw'] = []
        # collect
        splitBlox[-1]['raw'].append(pt)
        if pc == 'D12229':
            splitBlox[-1]['Name'].append(pt)
        elif re.search("^Te.*:.*", pt):
            splitBlox[-1]['tel_raw'] = re.sub("^Te.*:", '', pt).strip()
        elif re.search("^Mob.*:.*", pt):
            splitBlox[-1]['mobile_raw'] = re.sub("^Mob.*:", '', pt).strip()
        elif pt.startswith('Email:') or re.search(".*[@].*[.].*", pt):
            splitBlox[-1]['Email'] = pt.replace('Email:', '').strip()
        else:
            splitBlox[-1]['address_raw'].append(pt)
    # some cleanup
    if splitBlox[-1] == {}: splitBlox = splitBlox[:-1]
    for i in range(len(splitBlox)):
        addrsParas = splitBlox[i]['address_raw'] # for later
        # join lists into strings
        splitBlox[i]['Name'] = ' '.join(splitBlox[i]['Name'])
        for k in ['raw', 'address_raw']:
            splitBlox[i][k] = '\n'.join(splitBlox[i][k])
        # search address for City, State and PostCode
        apLast = addrsParas[-1].split(',')[-1]
        maybeCity = [ap for ap in addrsParas if '–' in ap]
        if '–' not in apLast:
            splitBlox[i]['State'] = apLast.strip()
        if maybeCity:
            maybePIN = maybeCity[-1].split('–')[-1].split(',')[0]
            maybeCity = maybeCity[-1].split('–')[0].split(',')[-1]
            splitBlox[i]['City'] = maybeCity.strip()
            splitBlox[i]['PostCode'] = maybePIN.strip()
        # add mobile to tel
        if 'mobile_raw' in splitBlox[i]:
            if 'tel_raw' not in splitBlox[i]:
                splitBlox[i]['tel_raw'] = splitBlox[i]['mobile_raw']
            else:
                splitBlox[i]['tel_raw'] += (', ' + splitBlox[i]['mobile_raw'])
            del splitBlox[i]['mobile_raw']
        # split tel [as needed]
        if 'tel_raw' in splitBlox[i]:
            tel_i = [t.strip() for t in splitBlox[i]['tel_raw'].split(',')]
            telNum = []
            for t in range(len(tel_i)):
                if '/' in tel_i[t]:
                    tns = [n.strip() for n in tel_i[t].split('/')]
                    tel1 = tns[0]
                    telNum.append(tel1)
                    for tn in tns[1:]:
                        telNum.append(tel1[:-1*len(tn)]+tn)
                else:
                    telNum.append(tel_i[t])
            splitBlox[i]['Tel_1'] = telNum[0]
            splitBlox[i]['Tel'] = telNum[0] if len(telNum) == 1 else telNum
    return splitBlox
(Since I was getting font color anyway, I decided to add another column called "section" to put East/West/etc in. And I added "PostCode" too, since it seems to be on the other side of "City"...)
Since "raw" is saved, any other value can be double checked manually at least.
The function combines "Mobile" into "Tel" even though they're extracted with separate regex.
I'd say "Tel_1" is fairly reliable, but some of the inconsistent patterns mean that other numbers in "Tel" might come out incorrect if they were separated with '/'.
Also, "Tel" is either a string or a list of strings depending on how many numbers there were in "tel_raw".
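To make the '/' caveat concrete, here is a sketch of the expansion rule the function applies, pulled out into a standalone helper. The name expand_slash_numbers and the sample numbers are made up for illustration. The rule assumes that whatever follows a '/' replaces the same number of trailing digits of the first number, which is exactly why an irregular pattern in the document can produce a wrong result.

```python
def expand_slash_numbers(raw):
    """Expand '2345678/79' into ['2345678', '2345679'] by reusing the leading digits."""
    parts = [p.strip() for p in raw.split('/')]
    first = parts[0]
    numbers = [first]
    for suffix in parts[1:]:
        # Replace the last len(suffix) characters of the first number with the suffix.
        numbers.append(first[:-len(suffix)] + suffix)
    return numbers


print(expand_slash_numbers("2345678/79"))
print(expand_slash_numbers("2345678 / 2345699"))
```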
After this, you can view the result as a DataFrame with:
import docx
import pandas
content = docx.Document('HE Distributors.docx')
# pandas.DataFrame(splitParas(content.paragraphs)) # <--all Columns
pandas.DataFrame(splitParas(content.paragraphs))[[
    'section', 'Name', 'address_raw', 'City',
    'PostCode', 'State', 'Email', 'Tel_1', 'tel_raw'
]]

Folium - add larger pop ups with data from XML file

I would like to create a table-like pop-up for my folium map but don't know how to do it (I'm a novice).
My data comes from an XML file that contains the gps coordinates, name, sales, etc. of stores.
Right now I can display the name of the stores in the pop-up, but I would also like to display the sales and other information below the name.
I reckon I should maybe use GeoJson but I don't know how to implement it in the code I already have (which contains clusterization) :
from xml.etree import ElementTree
import folium
import folium.plugins
import pandas as pd
xml_data = 'Data Stores.xml'
tree = ElementTree.parse(xml_data)
counter = tree.find('counter')
name = counter.find('Name')
counter.find('Latitude').text
name = []
latitude = []
longitude = []
for c in tree.findall('counter'):
    name.append(c.find('Name').text)
    latitude.append(c.find('Latitude').text)
    longitude.append(c.find('Longitude').text)
df_counters = pd.DataFrame(
    {'Name' : name,
     'Latitude' : latitude,
     'Longitude' : longitude,
    })
df_counters.head()
locations = df_counters[['Latitude', 'Longitude']]
locationlist = locations.values.tolist()
map3 = folium.Map(location=[31.1893,121.2781], tiles='CartoDB positron', zoom_start=6)
marker_cluster = folium.plugins.MarkerCluster().add_to(map3)
for point in range(0, len(locationlist)):
    popup=folium.Popup(df_counters['Name'][point], max_width=300,min_width=300)
    folium.Marker(locationlist[point],
                  popup=popup,
                  icon=folium.Icon(color='blue', icon_color='white',
                                   icon='fa-shopping-bag', angle=0, prefix='fa')
                  ).add_to(marker_cluster)
map3.save("WorldMap.html")
Right now I have 4 other columns in my XML file besides 'Name' with information that I want to appear in the popup as well, kind of like this:
[image: example popup]
Thank you for your help
Edit :
I did some digging and changed my code a little bit by adding the folium.features.GeoJsonPopup instead of the simple folium.Popup that I had before :
for point in range(0, len(locationlist)):
    popup=folium.features.GeoJsonPopup(
        fields=[['Name'],['Opening']],
        aliases=['Name','Opening'])
    folium.Marker(locationlist[point],
                  popup=popup,
                  icon=folium.Icon(color='blue', icon_color='white',
                                   icon='fa-shopping-bag', angle=0, prefix='fa')
                  ).add_to(marker_cluster)
I added the 'Opening' data, but I don't know how to pass it into the popup along with the 'Name', since it comes from a pandas DataFrame. Right now my popups are empty.
I have done something similar. The steps were:
1. Create an IFrame with the content you want to display (coded in HTML).
2. Use this IFrame in a popup.
3. Connect this popup with your marker.
htmlstr = ... # Here you can add your table, use HTML
# 1. iframe
iframe = folium.IFrame(htmlstr, # places your content in the iframe
                       width=200,
                       height=200 # adjust size to your needs
                       )
# 2. popup
fpop = folium.Popup(iframe)
# 3. marker
mrk = folium.Marker(location=latlng,
                    popup=fpop,
                    )
mrk.add_to( ... )
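The htmlstr placeholder above is where the table goes. A minimal sketch of building it from one store record follows, using only the standard library. The helper name build_popup_html and the sample values are assumptions for illustration; 'Sales' and 'Opening' are the kinds of columns the question mentions.

```python
from html import escape


def build_popup_html(record, title_key="Name"):
    """Return an HTML snippet: a bold title followed by a two-column table."""
    rows = "".join(
        f"<tr><th style='text-align:left'>{escape(str(key))}</th>"
        f"<td>{escape(str(value))}</td></tr>"
        for key, value in record.items()
        if key != title_key
    )
    title = escape(str(record.get(title_key, "")))
    return f"<b>{title}</b><table>{rows}</table>"


# Hypothetical store record with made-up values.
store = {"Name": "Store A", "Sales": 1200, "Opening": "2019-05-01"}
htmlstr = build_popup_html(store)
print(htmlstr)
```

With a DataFrame, df_counters.iloc[point].to_dict() would give one such record per marker, and the resulting string is what gets passed to folium.IFrame in step 1.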

How to check a checkbox and a radio button in a PDF with Python?

I need to use Python to fill in a complete PDF file, but I've been searching for 8 hours now and all I could find is how to fill text fields. I also need to check checkboxes and select radio buttons where only one option can be chosen, like Yes/No or Gender radio buttons.
Here is the code I've been using with the pdfrw Python module:
import pdfrw
template_pdf = pdfrw.PdfReader("template.pdf")
ANNOT_KEY = '/Annots'
ANNOT_FIELD_KEY = '/T'
ANNOT_VAL_KEY = '/V'
ANNOT_RECT_KEY = '/Rect'
SUBTYPE_KEY = '/Subtype'
WIDGET_SUBTYPE_KEY = '/Widget'
def fillPDF(data_dict):
    for page in template_pdf.pages:
        annotations = page[ANNOT_KEY]
        for annotation in annotations:
            if annotation[SUBTYPE_KEY] == WIDGET_SUBTYPE_KEY:
                if annotation[ANNOT_FIELD_KEY]:
                    key = annotation[ANNOT_FIELD_KEY][1:-1]
                    if key in data_dict.keys():
                        if type(data_dict[key]) == bool:
                            if data_dict[key] == True:
                                annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
                        else:
                            annotation.update(pdfrw.PdfDict(V='{}'.format(data_dict[key])))
                            annotation.update(pdfrw.PdfDict(AP=''))
                    else:
                        print(key)
    template_pdf.Root.AcroForm.update(pdfrw.PdfDict(NeedAppearances=pdfrw.PdfObject('true')))
    pdfrw.PdfWriter().write("output.pdf", template_pdf)
This code is supposed to fill text fields and checkboxes but it doesn't work with checkboxes and it does not even detect the radio buttons.
If you have a way to check radio buttons and checkboxes with python please let me know. Thank you.
I think it is something silly. My code is pretty similar to yours and it works.
if key in data_dict.keys():
    if type(data_dict[key]) is str:
        annotation.update(pdfrw.PdfDict(AP=data_dict[key], V=data_dict[key]))
    elif type(data_dict[key]) is bool:
        if data_dict[key] is True:
            annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Yes')))
        elif data_dict[key] is False:
            # /Off is the standard name of the unchecked state
            annotation.update(pdfrw.PdfDict(AS=pdfrw.PdfName('Off')))
    else:
        pass
    annotation.update(pdfrw.PdfDict(Ff=1)) # Field non-editable.
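One thing that can make the snippet above fail on other forms: the "on" state of a checkbox is not always named /Yes. It is whatever name the widget's normal appearance dictionary (/AP /N) defines besides /Off. Here is a sketch of picking that name and deriving the values to write, using plain strings in place of pdfrw objects. The helper names are hypothetical, and the assumption is that with pdfrw the state names would come from annotation['/AP']['/N'].keys() and be written back as pdfrw.PdfName values (without the leading slash).

```python
def find_on_state(appearance_states):
    """Return the first appearance state name that is not '/Off', or None."""
    for state in appearance_states:
        if state != '/Off':
            return state
    return None


def checkbox_update(appearance_states, checked):
    """Return the /V and /AS values to write for a checkbox widget."""
    on_state = find_on_state(appearance_states)
    if checked and on_state is None:
        raise ValueError("widget has no on-state appearance")
    state = on_state if checked else '/Off'
    return {'/V': state, '/AS': state}


print(checkbox_update(['/Off', '/Yes'], True))
print(checkbox_update(['/1', '/Off'], True))
print(checkbox_update(['/Off', '/Yes'], False))
```

Radio buttons commonly follow the same pattern per widget, with the field name stored on the parent (/Parent) rather than on the widget itself, which may be why a lookup on /T does not find them.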

How can I create a table in a DOCX file with Python?

I am trying to send the selected values from radiobuttons into a .docx file.
First I import what I need; the focus is on docx:
import tkinter as tk
from docx import Document
main = tk.Tk()
These are my options, which I need to place in the left column of the table in a Word document; they act as questions in a survey.
info = ["option 1", "option 2", "option 3", "option 4"]
Here I am placing radiobuttons labelled Yes, No and N/A, which are the answers to the options on the left (the info list above), plus a Label for each option, in other words each question.
vars = []
for idx,i in enumerate(info):
    var = tk.IntVar(value=0)
    vars.append(var)
    lblOption = tk.Label(main,text=i)
    btnYes = tk.Radiobutton(main, text="Yes", variable=var, value=2)
    btnNo = tk.Radiobutton(main, text="No", variable=var, value=1)
    btnNa = tk.Radiobutton(main, text="N/A", variable=var,value=0)
    lblOption.grid(column=0,row=idx)
    btnYes.grid(column=1,row=idx)
    btnNo.grid(column=2,row=idx)
    btnNa.grid(column=3,row=idx)
Here is my function. Creating a document and saving it is the easy part. My issue is that I am muddled about creating a table that has the options on the left (from info) and the headers at the top (the radiobuttons Yes, No and N/A), filled with the selected data. For example, if I have selected No for option 1, that selection should be saved into the .docx file (see the desired output at the bottom of the page).
def send():
    document = Document()
    section = document.sections[0]
    #add table
    table = document.add_table(1, 4)
    #style table
    table.style = 'Table Grid'
    #table data retrieved from Radiobuttons
    items = vars.get()
    #populate header row
    heading_cells = table.rows[0].cells
    heading_cells[0].text = "Options"
    heading_cells[1].text = btnYes.cget("text")
    heading_cells[2].text = btnNo.cget("text")
    heading_cells[3].text = btnNa.cget("text")
    for item in items:
        cells = table.add_row().cells
        cells[0].text = #Options
        cells[1].text = #Yes values
        cells[2].text = #No values
        cells[3].text = #N/A values
    #save doc
    document.save("test.docx")
#button to send data to docx file
btn = tk.Button(main, text="Send to File", command= send)
btn.grid()
main.mainloop()
This is what it opens up:
Here is the desired output:
The number 1 represents the selected items from the tkinter application. I will figure out later how to change it to a tick box.
I am kind of confused about where I am; I am new to docx. I have been trying to read the documentation, and this is where I dug myself into a hole.
In your current code, vars is a list of IntVars. You want to get each value individually instead of vars.get(). Also when writing to docx file, you need both info and values of radiobuttons, to track them both you can use an index.
With minimal changes to your code, you can use something like this.
def send():
    ...
    ...
    heading_cells[3].text = btnNa.cget("text")
    for idx, item in enumerate(vars):
        cells = table.add_row().cells
        cells[0].text = info[idx] # gets the option name
        val = item.get() #radiobutton value
        if val == 2: # checks if yes
            cells[1].text = "1"
        elif val == 1: # checks if no
            cells[2].text = "1"
        elif val == 0: # checks if N/A
            cells[3].text = "1"
    #save doc
    document.save("test.docx")
Or you can use a dictionary to map radiobutton values to cells:
valuesCells = {0: 3, 1: 2, 2: 1} # value of radiobutton: cell to write
# hard to read what's going on though
for idx, item in enumerate(vars):
    cells = table.add_row().cells
    cells[0].text = info[idx] # gets the option name
    val = item.get()
    cells[valuesCells[val]].text = "1"
#save doc
document.save("test.docx")
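To check the mapping without opening Word each time, the row contents can be built as plain lists first and written to the table afterwards. This is only a sketch of that idea: build_rows is a hypothetical helper, and the values list stands in for what [v.get() for v in vars] might return.

```python
HEADERS = ["Options", "Yes", "No", "N/A"]
VALUE_TO_COLUMN = {2: 1, 1: 2, 0: 3}  # radiobutton value -> column index


def build_rows(options, selected_values, mark="1"):
    """Return the table as a list of rows, header first, with one marked column per option."""
    rows = [HEADERS]
    for option, value in zip(options, selected_values):
        row = [option, "", "", ""]
        row[VALUE_TO_COLUMN[value]] = mark
        rows.append(row)
    return rows


info = ["option 1", "option 2", "option 3", "option 4"]
values = [1, 2, 0, 2]  # stand-in for [v.get() for v in vars]
rows = build_rows(info, values)
for row in rows:
    print(row)
```

Each inner list can then be copied into the cells of a row created with table.add_row().cells.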

Python Folium MarkerCluster Color Customization

I'm creating a leaflet map in folium using MarkerCluster. I have been all over the documentation and searched for examples, but I cannot figure out how to customize the color for a given MarkerCluster or FeatureGroup (e.g., one set in green rather than default blue).
I tried creating the markers individually and iteratively adding them to the MarkerCluster, and that gave me the color I wanted, but then the IFrame HTML table wouldn't function properly, and the popups were not appearing.
The code I've written works flawlessly (an html table used for popups is not supplied), but I'd really like to be able to change the color for one set of markers and retain the popups using the methods in my code. Any guidance would be greatly appreciated!
or_map = folium.Map(location=OR_COORDINATES, zoom_start=8)
res_popups, res_locations = [], []
com_popups, com_locations = [], []
for idx, row in geo.iterrows():
    if row['Type'] == 'Residential':
        res_locations.append([row['geometry'].y, row['geometry'].x])
        property_type = row['Type']
        property_name = row['Name']
        address = row['address']
        total_units = row['Total Unit']
        iframe = folium.IFrame(table(property_type, property_name,
                                     address, total_units), width=width,
                               height=height)
        res_popups.append(iframe)
    else:
        com_locations.append([row['geometry'].y, row['geometry'].x])
        property_type = row['Type']
        property_name = row['Name']
        address = row['address']
        total_units = row['Total Unit']
        iframe = folium.IFrame(table(property_type, property_name, address,
                                     total_units), width=width,
                               height=height)
        com_popups.append(iframe)
r = folium.FeatureGroup(name='UCPM Residential Properties')
r.add_child(MarkerCluster(locations=res_locations, popups=res_popups))
or_map.add_child(r)
c = folium.FeatureGroup(name='UCPM Commercial Properties')
c.add_child(MarkerCluster(locations=com_locations, popups=com_popups))
or_map.add_child(c)
display(or_map)
Instead of just dumping all your locations into the Cluster, you could loop over them and create a Marker for each of them - that way you can set the Marker's color. After creation, you can add the Marker to the desired MarkerCluster.
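# `cluster` is the MarkerCluster that should hold these markers
for com_location, com_popup in zip(com_locations, com_popups):
    folium.Marker(com_location,
                  popup=folium.Popup(com_popup),  # wrap the IFrame in a Popup
                  icon=folium.Icon(color='red', icon='info-sign')
                  ).add_to(cluster)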
A different approach would be to modify the style function, as shown here (In[4] and In[5]).
