Edit text in PDF with python - python

I have a pdf file and I need to edit some text/values in the pdf. For example, In the pdf files that I have BIRTHDAY DD/MM/YYYY is always N/A. I want to change it to whatever value I desire and then save it as a new document. Overwriting existing document is also alright.
I have previously done this so far:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("abc.pdf")
page = reader.pages[0]
writer = PdfWriter()
writer.add_page(reader.pages[0])
pdf_doc = writer.update_page_form_field_values(
reader.pages[0], {"BIRTHDAY DD/MM/YYYY": "123"}
)
with open("new_abc1.pdf", "wb") as fh:
writer.write(fh)
But this update_page_form_field_values() doesn't change the desired value, maybe because this is not a form field?
Screenshot of pdf showing the value to be changed:
Any clues?

I'm the current maintainer of pypdf and PyPDF2 (Please use pypdf; PyPDF2 is deprecated)
It is not possible to change a text with pypdf at the moment.
Changing form contents is a different story. However, we have several issues with form fields: https://github.com/py-pdf/pypdf/labels/workflow-forms
The update_page_form_field_values is the correct function to use.

Related

Page number of added bookmarks with PyPDF2

I have added pdf files using PdfFileMerger from PyPDF2 and added a bookmark at the beginning of each pdf file using PdfFileMerger.addbookmark. When I open the new file with PdfFileReader and extract the pages, where the bookmarks wer placed, I get for the page number -1.
I use the following code for merging:
merger = PdfFileMerger
for path in paths:
merger.append(path, import_bookmarks=False)
merger.addBookmark(f"{title}", page)
merger.write(save_path)
merger.close()
For reading the file I use:
pdf = PdfFileReader(file, "rb")
for i in pdf.getOutlines():
pdf.getDestinationPageNumber(i)
Why is the page number -1 for the new bookmarks?
You are getting the -1 value because the method getDestinationPageNumber is not finding the page associated with the Destination object, see the documentation. In addition, outline iteration might be broken in PyPDF2, as suggested by this SO answer. You can achieve what you want with this code:
pdf = PdfFileReader(file, "rb")
for i in pdf.outline:
print(i.page)

pdftotext return blank but pdf has multiple lines and multiple pages why?

import pdftotext
# Load your PDF
with open("docs/doc1.pdf", "rb") as f:
docs = pdftotext.PDF(f)
print(docs[0])
this code print blank for this specific file, if i change the file it is giving me result.
I tried even apache Tika. Tika also return None, How to solve this problem?
One thing I would like to mention here is that pdf is made of multiple images
Here is the file
This is sample pdf, not the original one. but i want to extract text from the pdf something like this

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
reader = PyPDF2.PdfFileReader(f)
if reader.isEncrypted:
reader.decrypt('Password123')
print(f"Number of page: {reader.getNumPages()}")
for i in range(reader.numPages):
output = PdfFileWriter()
output.addPage(reader.getPage(i))
with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
output.write(outputStream)
print(outputStream)
for page in output.pages: # failing here
print page.extractText() # failing here
The entire program is decrypting a large pdf file from one location, and splitting into a separate pdf file per page in new directory -- this is working fine. However, after this I would like to convert each page to a raw .txt file in a new directory. i.e. /txt_versions/ (for which I'll use later)
Ideally, I can use my current imports, i.e. PyPDF2 without importing/installing more modules/. Any thoughts?
You have not described how the last two lines are failing, but extract text does not function well on some PDFs:
def extractText(self):
"""
Locate all text drawing commands, in the order they are provided in the
content stream, and extract the text. This works well for some PDF
files, but poorly for others, depending on the generator used. This will
be refined in the future. Do not rely on the order of text coming out of
this function, as it will change if this function is made more
sophisticated.
:return: a unicode string object.
"""
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.

add custom metadata to pdf using python

I want to add custom metadata to pdf file. This can be achieved by pypdf2 or pdrw library. I have referred Change metadata of pdf file with pypdf2
solution works fine, when there is no space between the two words of attribute.
In my case my metadata is like
meta = {'Page description' : 'description',
'create time' : '123455' }
when I try to add above metadata as :
reader = PdfFileReader(filename)
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
writer.addMetadata(meta)
with open(filename, 'wb') as fout:
writer.write(fout)
when we try to see the custom properties in document property of pdf, nothing is displayed.
The above solution works fine if attributes are without spaces
I do not know about pypdfw, but for pdfrw, you should use PdfName('Page description') as your dictionary key. This will properly encode the space.
Disclaimer: I am the primary pdfrw author.

How to create a PDF from a binary string?

There is a request has been made to the server using Python's requests module:
requests.get('myserver/pdf', headers)
It returned a status-200 response, which all contains PDF binary data in response.content
Question
How does one create a PDF file from the response.content?
You can create an empty pdf then save write to that pdf in binary like this:
from reportlab.pdfgen import canvas
import requests
# Example of path. This file has not been created yet but we
# will use this as the location and name of the pdf in question
path_to_create_pdf_with_name_of_pdf = r'C:/User/Oleg/MyDownloadablePdf.pdf'
# Anything you used before making the request. Since you did not
# provide code I did not know what you used
.....
request = requests.get('myserver/pdf', headers)
#Actually creates the empty pdf that we will use to write the binary data to
pdf_file = canvas.Canvas(path_to_create_pdf_with_name_of_pdf)
#Open the empty pdf that we created above and write the binary data to.
with open(path_to_create_pdf_with_name_of_pdf, 'wb') as f:
f.write(request.content)
f.close()
The reportlab.pdfgen allows you to make a new pdf by specifying the path you want to save the pdf in along with the name of the pdf using the canvas.Canvas method. As stated in my answer you need to provide the path to do this.
Once you have an empty pdf, you can open the pdf file as wb (write binary) and write the content of the pdf from the request to the file and close the file.
When using the path - ensure that the name is not the name of any existing files to ensure that you do not overwrite any existing files. As the comments show, if this name is the name of any other file then you risk overwriting the data. If you are doing this in a loop for example, you will need to specify the path with a new name at each iteration to ensure that you have a new pdf each time. But if it is a one-off thing then you do not run that risk so as long as it is not the name of another file.

Categories