I am trying to write a Python script to parse a PDF file using PyPDF2. The catch is that my PDF file isn't a traditional document; it's an engineering drawing.
Anyway, I need the code to extract the text written in the bottom right corner, as well as the text on a red stamp. The drawing will look something like this: [image]
I tried to write some basic code to just parse it and extract the data, but it's not working.
import PyPDF2
# creating a pdf file object
pdfFileObj = open('example.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object
pageObj = pdfReader.getPage(0)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Does anyone have any recommendations?
Late to the party...
Nonetheless, we developed a commercial product to do exactly that: Werk24. It has a simple Python client: pip install werk24
With this, your task becomes very simple. You can read the title block with a single command. Imagine you want to obtain the Designation:
from werk24 import Hook, W24AskTitleBlock
from werk24.models.techread import W24TechreadMessage
from werk24.utils import w24_read_sync
from . import get_drawing_bytes # define your own
def recv_title_block(message: W24TechreadMessage) -> None:
    """ Print the Designation
    NOTE: Other fields like Drawing ID, Material etc are
    also available.
    """
    print(message.payload_dict.get('designation'))
if __name__ == "__main__":
    # submit the request to Werk24
    w24_read_sync(
        get_drawing_bytes(),
        [Hook(
            ask=W24AskTitleBlock(),
            function=recv_title_block
        )])
For the drawing that you provided, the response will be:
"designation": {
    "captions": [
        {
            "language": "eng",
            "text": "Descr"
        }
    ],
    "values": [
        {
            "language": "eng",
            "text": "Shaft"
        }
    ]
}
NOTE: Your file is very blurry, so I created the response manually. The API requires a minimum resolution of 180 dpi (it also works with TIF and DXF files).
I have created an add-on that outputs a csv file with the ratings of anki cards and the time they were rated.
In the anki-review_results folder, I stored an empty __init__.py, anki_rating_time_record.py, and a file named manifest.json. After compressing the folder, I changed the extension to .ankiaddon.
However, I got the following error
An error occurred during the installation of anki-review-results.ankiaddon: Invalid add-on manifest.
Please report this to the add-on author in question.
Anki_rating_time_record.py is as follows
import anki_python_api
import csv
import datetime
# reset anki api
Anki = anki_python_api.
review_results = []
def review_did_answer_card(result, card, ease):
    now = datetime.datetime.now()
    review_results.append((result, card, ease, now))
# register Anki API
anki.add_review_did_answer_card_callback(review_did_answer_card)
# start anki
anki.run()
# save the results in csv format
with open("review_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "result", "card", "ease"])
    # each entry was stored as (result, card, ease, now)
    for result, card, ease, now in review_results:
        writer.writerow([now.strftime("%Y-%m-%d %H:%M:%S"), result, card, ease])
The manifest.json is as follows.
{
"id" : "anki-review-results",
"version": "0.1",
"name": "Anki Review Results",
"description": "Save the results of your Anki reviews in a CSV file",
"author": "Shohei",
"addonType":["background"],
"minimumAnkiVersion": "2.1.0"
"files":["Anki_rating_time_record.py","_init_.py"]
}
The version of anki in use is 2.1.58.
What can I do to make it function properly? Please let me know.
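One check that may help narrow this down, offered as a minimal sketch rather than a confirmed fix: the error says the manifest is invalid, so it is worth confirming that manifest.json parses as JSON at all. The manifest shown above has no comma after the "minimumAnkiVersion" line, which a strict JSON parser rejects. The helper name check_manifest is hypothetical and only the standard-library json module is used; whether Anki accepts the remaining keys is a separate question this sketch does not answer.

```python
import json
from typing import Optional


def check_manifest(text: str) -> Optional[str]:
    """Return None if text is valid JSON, else a short description of the error."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# Same shape as the manifest in the question: no comma after "minimumAnkiVersion".
broken = '''{
    "name": "Anki Review Results",
    "minimumAnkiVersion": "2.1.0"
    "files": ["Anki_rating_time_record.py"]
}'''
# Adding the missing comma makes the text parse.
fixed = broken.replace('"2.1.0"', '"2.1.0",')

print(check_manifest(broken))  # reports where the parser gave up
print(check_manifest(fixed))
```

Running the same check on the real manifest.json (read with open(...).read()) shows whether a syntax error is present before the add-on is packaged.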
I'm using python 3.6 and PyPDF2 to create bookmarks in a pdf.
Instead of adding a bookmark that points to a page within the PDF, I want to add a URL (e.g. https://stackoverflow.com) as a bookmark.
Something like this?
output.addBookmark('TEST', 'https://stackoverflow.com', parent=None)
I don't think PyPDF2 supports something like this, or does it? Is there another library that supports this?
from PyPDF2 import PdfFileReader, PdfFileWriter
output = PdfFileWriter()
input = PdfFileReader(open('test.pdf', 'rb'))
output.addPage(input.getPage(0))
output.addBookmark('TEST', 0, parent=None) # add bookmark
outputStream = open('output.pdf', 'wb')
output.write(outputStream)
outputStream.close()
I might be a bit late, but hear me out. I had to solve the same problem, and since I could find practically no information on this anywhere, I investigated the issue and managed to find the answer. It's not my most efficient or nicest code, but definitely one of my proudest, since I found nobody who had done this before.
The problem one faces is that there is no PDF library for Python that solves all PDF-related problems. Each tackles problems differently, and some can do things that others can't. For this purpose, I had to use two libraries for the two functions below. PyPDF2 adds the bookmarks to the PDF. pdfrw alters those bookmarks so that their action opens a URL.
In short, we create a new PDF with the added bookmarks, and then another new one in which a bookmark's action points to a URL.
One thing to mention is that for some reason (for me at least) PyPDF2 adds bookmarks in such a way that, if you have several, each one becomes a child element of the previous bookmark. This is why we have the while loop: it collects all the bookmarks into a list, from which we can select the one we want.
If you already have bookmarks and don't add them with PyPDF2, it might be enough to loop over the metaObjects dictionary and pick the values that contain the /Title key, which makes the code significantly smaller. I've added this part as a comment.
Here is an example of how to use the code below:
inputPdf = r"C:\......\first.pdf"
bookmarkedPdf = r"C:\......\second.pdf"
pdfWithWeblink = r"C:\......\final.pdf"
bookmarks = [
    {"Title": "The Phantom Menace", "Page": 5},
    {"Title": "Attack of the Clones", "Page": 10},
    {"Title": "Revenge Of The Sith", "Page": 13},
    {"Title": "A New hope", "Page": 18},
    {"Title": "The Empire Strikes Back", "Page": 26},
    {"Title": "Return of the Jedi", "Page": 32}
]
AddBookmarks(inputPdf, bookmarkedPdf, bookmarks)
AddWebLinkToBookmark(bookmarkedPdf, pdfWithWeblink, "Revenge Of The Sith", "https://stackoverflow.com")
The code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import pdfrw
def AddBookmarks(inputPdfPath: str, outputPdfPath: str, headers: list) -> str:
    """ Adds bookmarks to a PDF and returns the output path. """
    output = PdfFileWriter()
    input = PdfFileReader(open(inputPdfPath, 'rb'))
    for i in range(input.getNumPages()):
        output.addPage(input.getPage(i))
        # headers is a list of {"Title": ..., "Page": ...} dicts with 1-based page numbers
        for header in headers:
            if header["Page"] - 1 == i:
                output.addBookmark(header["Title"], header["Page"] - 1, parent=None)
    output.setPageMode("/UseOutlines")
    outputStream = open(outputPdfPath,'wb')
    output.write(outputStream)
    outputStream.close()
    return outputPdfPath
def AddWebLinkToBookmark(inputPdfPath: str, outputPdfPath: str, bookmarkTitle: str, url: str) -> None:
    """ Changes the bookmark action to opening a web url. """
    # Reading the Pdf with pdfrw and collecting its meta objects. The bookmarks are among these.
    pdf = pdfrw.PdfReader(inputPdfPath, decompress=True)
    metaObjects = pdf.indirect_objects
    # If you did not add the bookmarks with PyPDF2 previously, use this part for getting the bookmarkToChange variable:
    # bookmarkToChange = None
    # for _, annotation in metaObjects.items():
    #     if '/Title' in annotation:
    #         if annotation["/Title"] == f"({bookmarkTitle})".replace(" ", "\\040"):
    #             bookmarkToChange = annotation
    # if bookmarkToChange == None:
    #     print(f"There is no bookmark called '{bookmarkTitle}' in this pdf.")
    #     return
    try:
        # Selecting the first, top parent bookmark.
        bookmark = [annotation for _, annotation in metaObjects.items() if '/Title' in annotation][0]
    except IndexError:
        print("There are no bookmarks in this pdf.")
        return
    # Each bookmark is the child of the previous bookmark. They can be accessed from the parent with the '/Next' key.
    bookmarkAnnotations = [bookmark]
    while "/Next" in bookmark:
        if "/Title" not in bookmark["/Next"]:
            break
        bookmark = bookmark["/Next"]
        bookmarkAnnotations.append(bookmark)
    try:
        # Selecting the bookmark we want to add the url to.
        bookmarkToChange = [annotation for annotation in bookmarkAnnotations if annotation["/Title"] == f"({bookmarkTitle})".replace(" ", "\\040")][0]
    except IndexError:
        print(f"There is no bookmark called '{bookmarkTitle}' in this pdf.")
        return
    # Changing the internal PDF commands to point to a url instead of a page.
    bookmarkToChange.A.D = None # Deletes the page information the 'Go to page' action is pointing to.
    bookmarkToChange.A.S = pdfrw.PdfName("URI") # Changes the 'Go to page' action to an 'Open a web link' action.
    bookmarkToChange.A.URI = pdfrw.objects.pdfstring.PdfString(f"({url})") # Specifies the url for the 'Open a web link' action.
    # Saving the end result into a new file.
    pdfrw.PdfWriter().write(outputPdfPath, pdf)
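To make the '/Next' traversal in AddWebLinkToBookmark easier to follow, here is a minimal sketch of the same walk using plain dictionaries as stand-ins for the PDF outline objects. The names collect_chain and find_by_title and the sample data are hypothetical; real pdfrw objects store titles in encoded form (for example with spaces written as \040), so the comparison in the code above is what applies to an actual file.

```python
from typing import Dict, List, Optional


def collect_chain(first: Dict) -> List[Dict]:
    """Follow '/Next' links from the first outline item and return all items in order."""
    items = [first]
    current = first
    # Stop when there is no next item or the next object is not a titled bookmark.
    while "/Next" in current and "/Title" in current["/Next"]:
        current = current["/Next"]
        items.append(current)
    return items


def find_by_title(items: List[Dict], title: str) -> Optional[Dict]:
    """Return the first item whose '/Title' equals title, or None."""
    for item in items:
        if item["/Title"] == title:
            return item
    return None


# Hypothetical stand-in data: three outline items linked through '/Next'.
third = {"/Title": "Revenge Of The Sith"}
second = {"/Title": "Attack of the Clones", "/Next": third}
first = {"/Title": "The Phantom Menace", "/Next": second}

chain = collect_chain(first)
match = find_by_title(chain, "Revenge Of The Sith")
print([item["/Title"] for item in chain])
```

Collecting the chain into a list first keeps the lookup separate from the traversal, which is the same split the function above makes with bookmarkAnnotations.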
How can I get the fields from this PDF file? It is a dynamic PDF created by Adobe LiveCycle Designer. If you open the link in a web browser, you will probably see a single page starting with 'Please wait...'. If you download the file and open it in Adobe Reader (5.0 or higher), you should see all 8 pages.
So, when reading it via PyPDF2, you get an empty dictionary, because PyPDF2 sees the file as a single page, like the one you see in a web browser.
def print_fields(path):
    from PyPDF2 import PdfFileReader
    reader = PdfFileReader(str(path))
    fields = reader.getFields()
    print(fields)
You can use the Java-dependent library tika to read the contents of all 8 pages. However, the results are messy, and I would like to avoid the Java dependency.
def read_via_tika(path):
    from tika import parser
    raw = parser.from_file(str(path))
    content = raw['content']
    print(content)
So, basically, I can manually use Edit -> Form Options -> Export Data… in Adobe Acrobat DC to get a nice XML. Similarly, I need to get the form fields and their values via Python.
Thanks to this awesome answer, I managed to retrieve the fields using pdfminer.six.
Navigate through Catalog > AcroForm > XFA, then pdfminer.pdftypes.resolve1 the object right after b'datasets' element in the list.
In my case, the following code worked (source: ankur garg)
import PyPDF2 as pypdf
def findInDict(needle, haystack):
    for key in haystack.keys():
        try:
            value=haystack[key]
        except:
            continue
        if key==needle:
            return value
        if isinstance(value,dict):
            x=findInDict(needle,value)
            if x is not None:
                return x
pdfobject=open('CTRX_filled.pdf','rb')
pdf=pypdf.PdfFileReader(pdfobject)
xfa=findInDict('/XFA',pdf.resolvedObjects)
xml=xfa[7].getObject().getData()
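The index 7 above is specific to this file. Since an /XFA array alternates packet names and streams, a lookup by name may be less brittle. The following is only a sketch with a plain list standing in for the array; find_xfa_packet is a hypothetical helper, and with a real reader the returned entry would still need to be resolved (getObject().getData() in the code above, or resolve1 in pdfminer.six).

```python
from typing import Any, List, Optional


def find_xfa_packet(xfa: List[Any], name: str) -> Optional[Any]:
    """Return the entry that follows the packet called name, or None if absent."""
    # Entries come in (name, stream) pairs, so step through the names only.
    for index in range(0, len(xfa) - 1, 2):
        key = xfa[index]
        if isinstance(key, bytes):
            key = key.decode("latin-1")
        if key == name:
            return xfa[index + 1]
    return None


# Hypothetical stand-in for an /XFA array; real entries would be stream references.
xfa = ["preamble", "<stream 0>", b"datasets", "<stream 1>", "postamble", "<stream 2>"]
datasets = find_xfa_packet(xfa, "datasets")
print(datasets)
```

Accepting both str and bytes names covers the b'datasets' form mentioned for pdfminer.six as well as the string form other readers may return.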
I have the code below
from PyPDF2 import PdfFileReader, PdfFileWriter
d = {
"Name": "James",
" Date": "1/1/2016",
"City": "Wilmo",
"County": "United States"
}
reader = PdfFileReader("medicareRRF.pdf")
inFields = reader.getFields()
watermark = PdfFileReader("justSign.pdf")
writer = PdfFileWriter()
page = reader.getPage(0)
page.mergePage(watermark.getPage(0))
writer.addPage(page)
written_page = writer.getPage(0)
writer.updatePageFormFieldValues(written_page, d)
This correctly fills in the PDF with the dictionary (d), but how can I check and uncheck boxes on the PDF? Here is the getFields() info for one of the boxes:
u'Are you ok': {'/FT': '/Btn','/Kids': [IndirectObject(36, 0),
IndirectObject(38, 0)],'/T': u'Are you ok','/V': '/No'}
I tried adding {'Are you ok' : '/Yes'} and several other similar ways, but nothing worked.
I came across the same issue, looked in several places, and was disappointed that I couldn't find the answer. After a few frustrating hours looking at my code, the PyPDF2 code, and the Adobe PDF 1.7 spec, I finally figured it out. If you debug into updatePageFormFieldValues, you'll see that it uses only TextStringObjects. Checkboxes are not text fields; even their /V values are not text strings, which seemed counterintuitive, at least to me. Debugging into that function showed me that checkbox values are instead NameObjects, so I created my own function to handle them. I create two dicts: one with only text values, which I pass to the built-in updatePageFormFieldValues function, and a second with only checkbox values. I also set /AS to ensure visibility (see the PDF spec). My function looks like this:
from PyPDF2.generic import NameObject
def updateCheckboxValues(page, fields):
    # fields maps a checkbox name (/T) to its state name, e.g. {"Are you ok": "/1"}
    for j in range(0, len(page['/Annots'])):
        writer_annot = page['/Annots'][j].getObject()
        for field in fields:
            if writer_annot.get('/T') == field:
                writer_annot.update({
                    NameObject("/V"): NameObject(fields[field]),
                    NameObject("/AS"): NameObject(fields[field])
                })
However, as far as I can tell, whether you use /1, /On, or /Yes depends on how the form was defined or perhaps what the PDF reader is looking for. For me, /1 worked.
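One way to avoid guessing between /1, /On and /Yes is to look at the appearance states the form itself defines: in the PDF spec, a checkbox widget lists its states as keys of the /N dictionary inside /AP, with /Off as the unchecked state. Below is a minimal sketch using a plain dict as a stand-in for the annotation; on_state_names is a hypothetical helper, and with real PyPDF2 objects some entries may be indirect references that have to be resolved first.

```python
from typing import List, Mapping


def on_state_names(annotation: Mapping) -> List[str]:
    """List the appearance states under /AP /N other than /Off."""
    normal = annotation.get("/AP", {}).get("/N", {})
    return [str(state) for state in normal if str(state) != "/Off"]


# Hypothetical stand-in for a checkbox widget annotation.
checkbox = {
    "/FT": "/Btn",
    "/T": "Are you ok",
    "/AP": {"/N": {"/Off": None, "/1": None}},
}
states = on_state_names(checkbox)
print(states)
```

Whatever name shows up here is the candidate value to pass as both /V and /AS in updateCheckboxValues.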
I would like to add to the answer by @rpsip.
from PyPDF2 import PdfReader, PdfWriter
from PyPDF2.generic import NameObject
reader = PdfReader(r"form2.pdf") #where you read the pdf in the same directory
writer = PdfWriter()
page = reader.pages[0] #read page 1 of your pdf
fields = reader.get_fields()
print (fields) # this is to identify if you can see the form fills in that page
writer.add_page(page) #this line is necessary otherwise the pdf will be corrupted
for i in range(len(page["/Annots"])): #in order to access the "Annots" key
    print ((page["/Annots"][i].get_object())) #to find out which of the form fills are checkbox or text fill
    if (page["/Annots"][i].get_object())['/FT']=="/Btn" and (page["/Annots"][i].get_object())['/T']=='Check Box3': #this is my filter so that I can filter checkboxes and the checkbox I want i.e. "Check Box 3"
        print (page["/Annots"][i].get_object()) #further check if I got what I wanted as per the filter
        writer_annot = page["/Annots"][i].get_object()
        writer_annot.update(
            {
                NameObject("/V"): NameObject(
                    "/Yes"), #NameObject being only for checkbox, and please try "/Yes" or "/1" or "/On" to see which works
                NameObject("/AS"): NameObject(
                    "/Yes" #NameObject being only for checkbox, and please try "/Yes" or "/1" or "/On" to see which works
                )
            }
        )
with open("filled-out.pdf", "wb") as output_stream:
    writer.write(output_stream) #save the ticked pdf file as another file named "filled-out.pdf"
Hope this helps.
I have searched the web far and wide for a still-working example of uploading a photo to Facebook through the Python API (Python for Facebook). Questions like this have been asked on Stack Overflow before, but none of the answers I have found work anymore.
What I got working is:
import facebook as fb
cfg = {
"page_id" : "my_page_id",
"access_token" : "my_access_token"
}
api = get_api(cfg)
msg = "Hello world!"
status = api.put_wall_post(msg)
where I have defined the get_api(cfg) function like this:
def get_api(cfg):
    graph = fb.GraphAPI(cfg['access_token'], version='2.2')
    # Get page token to post as the page. You can skip
    # the following if you want to post as yourself.
    resp = graph.get_object('me/accounts')
    page_access_token = None
    for page in resp['data']:
        if page['id'] == cfg['page_id']:
            page_access_token = page['access_token']
    graph = fb.GraphAPI(page_access_token)
    return graph
And this does indeed post a message to my page.
However, if I instead want to upload an image everything goes wrong.
# Upload a profile photo for a Page.
api.put_photo(image=open("path_to/my_image.jpg",'rb').read(), message="Here's my image")
I get the dreaded GraphAPIError: (#324) Requires upload file, for which none of the solutions on Stack Overflow work for me.
If I instead issue the following command
api.put_photo(image=open("path_to/my_image.jpg",'rb').read(), album_path=cfg['page_id'] + "/picture")
I get GraphAPIError: (#1) Could not fetch picture for which I haven't been able to find a solution either.
Could someone out there please point me in the right direction or provide me with a currently working example? It would be greatly appreciated, thanks!
A 324 Facebook error can result from a few things, depending on how the photo upload call was made:
a missing image
an image not recognised by Facebook
an incorrect directory path reference
A raw cURL call looks like this:
curl -F 'source=@my_image.jpg' 'https://graph.facebook.com/me/photos?access_token=YOUR_TOKEN'
As long as the above call works, you can be sure that Facebook's servers accept the photo.
An example of how a 324 error can occur
touch meow.jpg
curl -F 'source=@meow.jpg' 'https://graph.facebook.com/me/photos?access_token=YOUR_TOKEN'
This can also occur for corrupted image files as you have seen.
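The causes listed above (missing file, empty file, unrecognised content) can be checked locally before any upload call is made. This is a rough sketch, not part of the Facebook SDK: looks_like_image is a hypothetical helper that only tests for the PNG and JPEG file signatures, so a file passing it can still be rejected by Facebook for other reasons.

```python
import os
import tempfile

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8"


def looks_like_image(path: str) -> bool:
    """Return True if path exists, is non-empty, and starts with a PNG or JPEG signature."""
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        header = handle.read(len(PNG_MAGIC))
    return header.startswith(PNG_MAGIC) or header.startswith(JPEG_MAGIC)


# Demonstration with throwaway files: an empty one (like `touch meow.jpg`)
# and one that begins with a PNG signature.
with tempfile.TemporaryDirectory() as folder:
    empty_path = os.path.join(folder, "meow.jpg")
    open(empty_path, "wb").close()
    png_path = os.path.join(folder, "how.png")
    with open(png_path, "wb") as handle:
        handle.write(PNG_MAGIC + b"\x00\x00\x00\rIHDR")
    empty_ok = looks_like_image(empty_path)
    png_ok = looks_like_image(png_path)
    missing_ok = looks_like_image(os.path.join(folder, "nope.jpg"))

print(empty_ok, png_ok, missing_ok)
```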
Using .read() will dump the actual data
Empty File
>>> image=open("meow.jpg",'rb').read()
>>> image
''
Image File
>>> image=open("how.png",'rb').read()
>>> image
'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00...
Neither of these will work with api.put_photo, as you have seen. As Klaus D. mentioned, the call should be made without read().
So this call
api.put_photo(image=open("path_to/my_image.jpg",'rb').read(), message="Here's my image")
actually becomes
api.put_photo('\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00...', message="Here's my image")
This is just a string, which is not what is wanted.
What is needed is the file object itself, e.g. <open file 'how.png', mode 'rb' at 0x1085b2390>
I know this is old and doesn't answer the question with the specified API, however, I came upon this via a search and hopefully my solution will help travelers on a similar path.
Using requests and tempfile
A quick example of how I do it using the tempfile and requests modules.
Download an image and upload to Facebook
The script below should grab an image from a given URL, save it to a file within a temporary directory, and automatically clean up when finished.
In addition, I can confirm this works running on a Flask service on Google Cloud Run. That comes with the container runtime contract so that we can store the file in-memory.
import tempfile
import requests
# setup stuff - certainly change this
filename = "your-desired-filename"
image_url = "your-image-url"
act_id = "your account id"
access_token = "your access token"
def upload_image(image_url, act_id, access_token, filename):
    # create the temporary directory; it is cleaned up automatically
    # when the with block exits, even if an error is raised
    with tempfile.TemporaryDirectory() as directory:
        filepath = f"{directory}/{filename}"
        # stream the image bytes
        res = requests.get(image_url, stream=True)
        # stop here on non 200 status codes
        res.raise_for_status()
        # write them to your filename at your temporary directory
        with open(filepath, "wb") as f:
            f.write(res.content)
        url = f"https://graph.facebook.com/v10.0/{act_id}/adimages?access_token={access_token}"
        # prep the payload for the facebook call and send the POST request
        with open(filepath, "rb") as image_file:
            res = requests.post(url, files={"filename": image_file})
        res.raise_for_status()
        # get your image data back
        image_upload_data = res.json()
    if "images" in image_upload_data:
        return image_upload_data["images"][filepath.split("/")[-1]]
    return image_upload_data