Page number of added bookmarks with PyPDF2 - python

I have merged PDF files using PdfFileMerger from PyPDF2 and added a bookmark at the beginning of each PDF file using PdfFileMerger.addBookmark. When I open the new file with PdfFileReader and look up the pages where the bookmarks were placed, I get -1 for the page number.
I use the following code for merging:
merger = PdfFileMerger()
for path in paths:
    merger.append(path, import_bookmarks=False)
    merger.addBookmark(f"{title}", page)
merger.write(save_path)
merger.close()
For reading the file I use:
pdf = PdfFileReader(file, "rb")
for i in pdf.getOutlines():
    pdf.getDestinationPageNumber(i)
Why is the page number -1 for the new bookmarks?

You are getting the -1 value because the method getDestinationPageNumber is not finding the page associated with the Destination object, see the documentation. In addition, outline iteration might be broken in PyPDF2, as suggested by this SO answer. You can achieve what you want with this code:
pdf = PdfFileReader(file, "rb")
for i in pdf.outline:
    print(i.page)
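One thing to watch for when iterating: PyPDF2 returns the outline as a nested list, where the children of an entry appear as a sub-list right after it, so a flat for loop can hit a list instead of a bookmark. Here is a minimal sketch of a recursive walk. The helper name walk_outline and the stand-in dictionaries are assumptions for illustration, not part of PyPDF2.

```python
def walk_outline(outline, depth=0):
    """Yield (depth, item) for every entry of a nested outline list."""
    for item in outline:
        if isinstance(item, list):
            # A sub-list holds the children of the entry just before it.
            yield from walk_outline(item, depth + 1)
        else:
            yield depth, item


# Stand-in data shaped like a nested outline (hypothetical titles).
sample = [
    {"/Title": "Report A"},
    [{"/Title": "A.1"}, {"/Title": "A.2"}],
    {"/Title": "Report B"},
]
flat = [(depth, item["/Title"]) for depth, item in walk_outline(sample)]
print(flat)
```

With a real file you would pass pdf.outline (or pdf.getOutlines() in older releases) in place of sample and read i.page from each yielded item.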

Related

Analyzing a Specific Page of a PDF with Amazon Textract

I am using Amazon Textract to extract text from PDF files. For some of these documents, I want to be able to specify the pages from which data is to be extracted, rather than having to go through the entire thing. Is this possible? If so, how do I do it? I cannot seem to find an answer in the docs.
I do not believe Textract offers this feature, but you can easily implement it programmatically. Since your tags mention Python, I'll suggest a way to do this using Python.
You can use a library like PyPDF2 which lets you specify which pages you want to extract and creates a new pdf with just those pages.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pages = [0, 2, 4] # page 1, 3, 5
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
# The with block closes the file, so no explicit f.close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
This library can be used with AWS Lambda as a layer. You can save the file temporarily in the /tmp/ folder on lambda.
Source: https://learndataanalysis.org/how-to-extract-pdf-pages-and-save-as-a-separate-pdf-file-using-python/
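If the pages come from a user as 1-based numbers, something like the following could build the zero-based pages list used above. This is only a sketch: the function name parse_pages and the "1,3,5-7" spec format are my own assumptions, not something PyPDF2 or Textract provides.

```python
def parse_pages(spec):
    """Turn a 1-based spec such as "1,3,5-7" into sorted 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(n) for n in part.split("-", 1))
            if first > last:
                raise ValueError(f"descending range: {part!r}")
        else:
            first = last = int(part)
        if first < 1:
            raise ValueError(f"page numbers start at 1: {part!r}")
        # Shift to 0-based; range() excludes its end, so `last` stays as is.
        indices.update(range(first - 1, last))
    return sorted(indices)


pages = parse_pages("1,3,5")
print(pages)
```

The resulting list can be fed straight into the for page_num in pages loop.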

How to write separate DOCX files by page from one DOCX file?

I have a MS Word document that consists of several hundred pages.
Each page is identical apart from the name of a person which is unique across each page. (One page is one user).
I would like to take this word document and automate the process to save each page individually so I end up with several hundred word documents, one for each person, rather than one document that consists of everyone, that I can then distribute to the different people.
I have been using the module python-docx found here: https://python-docx.readthedocs.io/en/latest/
I am struggling on how to achieve this task.
As far as I have researched it is not possible to loop over each page as the pages are not determined in the .docx file itself but are generated by the program, i.e. Microsoft Word.
However python-docx can interpret the text and since each page is the same can I not say to python when you see this text (the last piece of text on a given page) consider this to be the end of a page and anything after this point is a new page.
Ideally I could write a loop that looks for such a point, creates a document up to that point, and repeats this over all pages. It would need to keep all formatting and pictures as well.
I am not against other methods, such as converting to PDF first if that is an option.
Any thoughts?
I had the exact same problem. Unfortunately I could not find a way to split .docx by page. The solution was to first use python-docx or docx2python (whatever you like) to iterate over each page and extract the unique (person) information and put it in a list so you end up with:
people = ['person_A', 'person_B', 'person_C', ....]
Then save the .docx as a PDF, split the PDF up by page, and save each page as person_A.pdf etc., like this:
from PyPDF2 import PdfFileWriter, PdfFileReader
inputpdf = PdfFileReader(open("document.pdf", "rb"))
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    with open(f"{people[i]}.pdf", "wb") as outputStream:
        output.write(outputStream)
The result is a bunch of one-page PDFs saved as person_A.pdf, person_B.pdf etc.
Hope that helps.
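For the first step (collecting one name per page), the end-of-page marker idea from the question can be sketched in plain Python. With python-docx the paragraph texts would come from something like [p.text for p in Document(path).paragraphs]. The marker "Kind regards", the "Dear ..." first line, and the function name split_on_marker are all made up for illustration.

```python
def split_on_marker(paragraphs, end_marker):
    """Group paragraph texts into pages, closing a page at end_marker."""
    pages, current = [], []
    for text in paragraphs:
        current.append(text)
        if text.strip() == end_marker:
            pages.append(current)
            current = []
    if current:
        # Trailing text without a closing marker becomes its own page.
        pages.append(current)
    return pages


# Hypothetical content: the name sits on the first line of each page.
paragraphs = [
    "Dear person_A", "Body text", "Kind regards",
    "Dear person_B", "Body text", "Kind regards",
]
pages = split_on_marker(paragraphs, "Kind regards")
people = [page[0].replace("Dear ", "") for page in pages]
print(people)
```

This only recovers the text, not formatting or pictures, which is why the actual splitting is still done on the PDF.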
I would suggest another package, aspose-words-cloud, to split a Word document into separate pages. Currently, it works with cloud storage (Aspose cloud storage, Amazon S3, Dropbox, Google Drive Storage, Google Cloud Storage, Windows Azure Storage and FTP Storage). However, in the near future it will support processing files from the request body (streams).
P.S.: I am a developer evangelist at Aspose.
# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile
# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxx-xxxxx-xxxx-xxxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxx'
words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'
remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = '02_pages.docx'
remoteFileName = '02_pages.docx'
#upload file
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName,'rb'),remoteFolder + '/' + remoteFileName))
#Split DOCX pages as a zip file
request = asposewordscloud.models.requests.SplitDocumentRequest(name=remoteFileName, format='docx', folder=remoteFolder, zip_output= 'true')
result = words_api.split_document(request)
print("Result {}".format(result.split_result.zipped_pages.href))
#download file
request_download=asposewordscloud.models.requests.DownloadFileRequest(result.split_result.zipped_pages.href)
response_download = words_api.download_file(request_download)
copyfile(response_download, 'C:/'+ result.split_result.zipped_pages.href)

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
    print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages:  # failing here
            print(page.extractText())  # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory -- this is working fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its docstring notes:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
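If the text does come out, writing it to /txt_versions/ needs nothing beyond the standard library. Below is a rough sketch. The helper name write_page_texts and the file naming are my assumptions, and the strings in the demo stand in for what reader.getPage(i).extractText() would return in your loop.

```python
import tempfile
from pathlib import Path


def write_page_texts(page_texts, out_dir, stem="document-page"):
    """Write one UTF-8 .txt file per page and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, text in enumerate(page_texts):
        target = out_dir / f"{stem}{i}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written


# Demo in a throwaway directory with placeholder page texts.
with tempfile.TemporaryDirectory() as tmp:
    paths = write_page_texts(["first page", "second page"],
                             Path(tmp) / "txt_versions")
    names = [p.name for p in paths]
    first = paths[0].read_text(encoding="utf-8")
print(names)
```

In your program you would collect the extracted strings inside the for i in range(reader.numPages) loop and pass the list to the helper once at the end.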

Edit text in PDF with python

I have a PDF file and I need to edit some text/values in it. For example, in the PDF files that I have, BIRTHDAY DD/MM/YYYY is always N/A. I want to change it to whatever value I desire and then save it as a new document. Overwriting the existing document is also alright.
This is what I have tried so far:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("abc.pdf")
page = reader.pages[0]
writer = PdfWriter()
writer.add_page(reader.pages[0])
pdf_doc = writer.update_page_form_field_values(
    reader.pages[0], {"BIRTHDAY DD/MM/YYYY": "123"}
)
with open("new_abc1.pdf", "wb") as fh:
    writer.write(fh)
But this update_page_form_field_values() doesn't change the desired value, maybe because this is not a form field?
Screenshot of pdf showing the value to be changed:
Any clues?
I'm the current maintainer of pypdf and PyPDF2. (Please use pypdf; PyPDF2 is deprecated.)
It is not possible to change text with pypdf at the moment.
Changing form contents is a different story. However, we have several issues with form fields: https://github.com/py-pdf/pypdf/labels/workflow-forms
The update_page_form_field_values is the correct function to use.

generating a local link/path to PDF File for direct access

I'm trying to write a program for data extraction from PDFs in Python (an Excel macro could be an option).
First, I want to select a text or a position in a PDF file and generate a local path/link to that file at that position. This link will be copied to an Excel cell. When I click on the link, the PDF document should open at the coordinates of the previously selected text.
I know the question is very broad. I'm an enthusiastic beginner and need a nudge in the right direction, and to know whether it is possible.
How can I get the path of the active PDF file on the desktop, and the coordinates of the selected text? I could then pass these automatically as parameters to my program.
Thank you !
There are a lot of ways to do this. I would suggest looking into Slate (https://pypi.python.org/pypi/slate) or PDFMiner (http://www.unixuser.org/~euske/python/pdfminer/index.html).
And yes, it's quite easy. Also look into pyPdf:
import pyPdf
def getPDFContent(path):
    content = ""
    # Load PDF into pyPdf (open() replaces the Python 2-only file() builtin)
    pdf = pyPdf.PdfFileReader(open(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace (u"\xa0" is the non-breaking space as a unicode literal)
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
print(getPDFContent("test.pdf"))
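For the link itself, one option is a file:// URI with a #page=N fragment, which many PDF viewers and browsers treat as "open at this page". Here is a small sketch using only the standard library. The function names are made up, and whether Excel passes the fragment through and whether your desktop viewer honors it are assumptions you would need to test. It targets a page, not the exact coordinates of a selection.

```python
from pathlib import Path


def pdf_page_link(pdf_path, page):
    """Return a file:// URI that asks the viewer to open at a 1-based page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{Path(pdf_path).resolve().as_uri()}#page={page}"


def excel_hyperlink(pdf_path, page, label):
    """Build an Excel HYPERLINK formula string for the link above."""
    return f'=HYPERLINK("{pdf_page_link(pdf_path, page)}", "{label}")'


link = pdf_page_link("report.pdf", 3)
print(link)
print(excel_hyperlink("report.pdf", 3, "Report, page 3"))
```

The formula string can be pasted into a cell or written with whatever Excel library you end up using.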
