Redact information in a pdf using Python - python

I need to redact some information in a PDF file using Python. Is there a library available to do it? I have tried PyPDF2 and ReportLab, but with no luck.

Try something like:
from PyPDF2 import PdfFileReader, PdfFileWriter
fin = open('source.pdf', 'rb')
reader = PdfFileReader(fin)
writer = PdfFileWriter()
Or consult a site like either of the following:
https://www.binpress.com/manipulate-pdf-python/
https://www.blog.pythonlibrary.org/2018/06/06/creating-and-manipulating-pdfs-with-pdfrw/
Let me know if that helps, given that your question is fairly open-ended.

Related

Analyzing a Specific Page of a PDF with Amazon Textract

I am using Amazon Textract to extract text from PDF files. For some of these documents, I want to be able to specify the pages from which data is to be extracted, rather than having to go through the entire thing. Is this possible? If so, how do I do it? I cannot seem to find an answer in the docs.
I do not believe Textract offers this feature, but you can easily implement it programmatically. Since your tags mention Python, I'll suggest a way to do this using Python.
You can use a library like PyPDF2 which lets you specify which pages you want to extract and creates a new pdf with just those pages.
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf_file_path = 'Unknown.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pages = [0, 2, 4] # page 1, 3, 5
pdfWriter = PdfFileWriter()
for page_num in pages:
    pdfWriter.addPage(pdf.getPage(page_num))
# the with block closes the file, so no explicit f.close() is needed
with open('{0}_subset.pdf'.format(file_base_name), 'wb') as f:
    pdfWriter.write(f)
This library can be used with AWS Lambda as a layer. You can save the file temporarily in the /tmp/ folder on lambda.
Source: https://learndataanalysis.org/how-to-extract-pdf-pages-and-save-as-a-separate-pdf-file-using-python/
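If the pages arrive as user-facing, 1-based numbers (for example from a config value), they need converting to the 0-based indices that the loop above expects. The following is only a sketch of one way to do that; `parse_page_spec` and its "1,3,5-7" syntax are hypothetical and are not part of PyPDF2 or Textract.

```python
def parse_page_spec(spec, page_count):
    """Convert a 1-based page spec such as "1,3,5-7" into sorted 0-based indices.

    Hypothetical helper: the spec syntax is an assumption made for this example.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"page range {part!r} is outside 1..{page_count}")
        # 1-based inclusive range -> 0-based indices
        indices.update(range(start - 1, end))
    return sorted(indices)


pages = parse_page_spec("1,3,5", page_count=6)
print(pages)  # [0, 2, 4]
```

The resulting list can be used in place of the hard-coded `pages` list, with `page_count` taken from the reader object.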

Create hyperlink from path in python

Kind of a noob question. Sorry for that.
I have a string path
W:\documents\files
Is it possible to create a hyperlink from that and store it in a CSV so that when I click it in Excel it opens the file?
How about this?
from pathlib import Path
link = Path('W:\\documents\\files\\sample.txt').as_uri()
It should return "file:///W:/documents/files/sample.txt"
I'm guessing you want something like this:
import csv
with open('my_csv_file.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["file:///W:/documents/files"])
The csv module lets you read and write CSV files. Adding file:/// before your file path turns it into a link target. Note that this code writes the text to a CSV file, but Excel won't treat it as a hyperlink until you put the cursor in the cell and hit Enter.
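The two answers above can be combined into one sketch. It builds the file URI with `PureWindowsPath`, so it behaves the same on any operating system, and wraps it in Excel's `HYPERLINK` worksheet function so the cell is clickable on open. The helper names are hypothetical, and the approach rests on assumptions: that Excel evaluates a CSV cell starting with `=` as a formula, and that the locale uses a comma as the argument separator (some locales use a semicolon).

```python
import csv
import io
from pathlib import PureWindowsPath
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Build a file:/// URI from a Windows drive path without touching the disk."""
    posix_style = PureWindowsPath(path).as_posix()  # W:\a\b -> W:/a/b
    # Percent-encode spaces and similar characters, keeping "/" and ":" as they are.
    return "file:///" + quote(posix_style, safe="/:")


def hyperlink_cell(path, label=None):
    """Return an Excel HYPERLINK formula (assumed comma-separated locale)."""
    uri = windows_path_to_file_uri(path)
    label = label or path
    # Inside an Excel string literal, a double quote is escaped by doubling it.
    return '=HYPERLINK("{}","{}")'.format(uri, label.replace('"', '""'))


def write_links(rows, out):
    """Write (name, path) pairs to an open text stream as CSV."""
    writer = csv.writer(out)
    writer.writerow(["name", "link"])
    for name, path in rows:
        writer.writerow([name, hyperlink_cell(path, name)])


buffer = io.StringIO()
write_links([("files", r"W:\documents\files")], buffer)
print(buffer.getvalue())
```

To produce a real file, pass `open('my_csv_file.csv', 'w', newline='')` instead of the in-memory buffer. The raw string (`r"..."`) matters here because `\f` in a normal string literal is a form feed character.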
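I'm not sure I fully understand the question.
I think you want to write the path into the CSV as a hyperlink.
In that case, just pass the path as a quoted string, like this (function is a placeholder; the r prefix stops \f from being read as a form feed):
function(r"W:\documents\files")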

Edit text in PDF with python

I have a PDF file and I need to edit some text/values in it. For example, in the PDF files that I have, BIRTHDAY DD/MM/YYYY is always N/A. I want to change it to whatever value I desire and then save it as a new document. Overwriting the existing document is also alright.
This is what I have tried so far:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("abc.pdf")
page = reader.pages[0]
writer = PdfWriter()
writer.add_page(reader.pages[0])
pdf_doc = writer.update_page_form_field_values(
    reader.pages[0], {"BIRTHDAY DD/MM/YYYY": "123"}
)
with open("new_abc1.pdf", "wb") as fh:
    writer.write(fh)
But this update_page_form_field_values() doesn't change the desired value, maybe because this is not a form field?
Screenshot of pdf showing the value to be changed:
Any clues?
I'm the current maintainer of pypdf and PyPDF2. (Please use pypdf; PyPDF2 is deprecated.)
It is not possible to change text with pypdf at the moment.
Changing form contents is a different story. However, we have several issues with form fields: https://github.com/py-pdf/pypdf/labels/workflow-forms
The update_page_form_field_values is the correct function to use.

How do I use ParaView's CSVReader in a Python Script?

How do I use ParaView's CSVReader in a Python Script? An example would be appreciated.
If you have a .csv file that looks like this:
x,y,z,attribute
0,0,0,0
1,0,0,1
0,1,0,2
1,1,0,3
0,0,1,4
1,0,1,5
0,1,1,6
1,1,1,7
then you can import it with a command that looks like this:
myReader = CSVReader(FileName=r'C:\foo.csv', guiName='foo.csv')  # raw string, so \f is not read as a form feed
Also, if you don't add that guiName parameter, you can change the name later using the RenameSource command like this:
RenameSource(proxy=myReader, newName='MySuperNewName')
Credit for the renaming part of this answer to Sebastien Jourdain.
Unfortunately, I don't know Paraview at all. But I found "... simply record your work in the desktop application in the form of a python script ..." at their site. If you import a CSV like that, it might give you a hint.
Improving on @GregNash's answer: if you want to include only a single file (called foo.csv):
outcsv = CSVReader(FileName='foo.csv')
Or, if you want to include all files matching a certain pattern, use glob. For example, if the file names start with the string foo (e.g. foo.csv.0, foo.csv.1, foo.csv.2):
myreader = CSVReader(FileName=glob.glob('foo*'))
To use glob, you need to import glob in the preamble. In general, FileName accepts any strings generated with Python, so you can build more complex file patterns and paths.
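One caveat with the glob approach: `glob.glob` returns matches in arbitrary order, and a plain alphabetical sort would put foo.csv.10 before foo.csv.2. A minimal sketch of ordering the files by their numeric suffix before handing them to the reader is shown below; `numbered_files` is a hypothetical helper, and the `CSVReader` call is left as a comment because it only runs inside ParaView's Python environment.

```python
import glob
import re


def numbered_files(pattern):
    """Return files matching pattern, ordered by their trailing integer suffix."""
    def sort_key(path):
        match = re.search(r"(\d+)$", path)
        # Files without a numeric suffix sort first, the rest by number.
        if match is None:
            return (0, 0, path)
        return (1, int(match.group(1)), path)

    return sorted(glob.glob(pattern), key=sort_key)


file_names = numbered_files("foo.csv.*")
# Inside ParaView's Python shell, the list could then be passed as in the answer above:
# myreader = CSVReader(FileName=file_names)
```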

PyPDF2, how to fix their example code to conform with Python 3

I am new to Python. I am using Python 3.4, but most of the online sample code I can find is for Python 2. In particular, the package I am trying to use, PyPDF2, has sample code:
https://github.com/mstamy2/PyPDF2/blob/master/Sample_Code/basic_features.py
that is for Python 2. I can't seem to run it. I fixed the print part (line 7) to include brackets. What I don't know how to fix are lines 44 and 45, the ones where you actually save a pdf with the modifications made. The relevant part of the code is:
from PyPDF2 import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(open("document1.pdf", "rb"))
output.addPage(input1.getPage(0))
outputStream = file("PyPDF2-output.pdf", "wb")
output.write(outputStream)
My Python doesn't understand file(). It gives an error on:
outputStream = file("PyPDF2-output.pdf", "wb")
Any suggestions? Is there a library I should have imported to run this or is there a difference between Python 2 and 3 in how one writes this?
Python 3 no longer allows you to open a file using the file constructor. Instead, use open.
You should follow the new pypdf docs: https://pypdf.readthedocs.io/en/latest/index.html
Updating your specific example:
from pypdf import PdfWriter, PdfReader
writer = PdfWriter()
reader = PdfReader("document1.pdf")
writer.add_page(reader.pages[0])
with open("PyPDF2-output.pdf", "wb") as fp:
    writer.write(fp)
