Saving a redacted PDF file in Python to mask underneath text - python

I read a PDF file in Python, drew a text box on top of the text that I'd like to redact, and saved the result as a new PDF file. When I search the redacted PDF in a PDF reader, the text can still be found.
Is there a way to save the PDF as a single-layer file? Or is there a way to ensure that the text under the text box is removed?
import PyPDF2
import re
import fitz
import io
import os
import pandas
import numpy as np
from PyPDF2 import PdfFileReader, PdfFileWriter
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
from reportlab.graphics import renderPDF
from reportlab.lib import colors
from reportlab.graphics.shapes import *
reader = PyPDF2.PdfReader(files)
packet = io.BytesIO()
can = canvas.Canvas(packet, pagesize = A4)
can.rect(65, 750, 40, 30, stroke=1, fill=1)
can.setFillColorRGB(1, 1, 1)
can.save()
packet.seek(0)
new_pdf = PdfFileReader(packet)
output = PyPDF2.PdfFileWriter()
pageToOutput = reader.getPage(1)
pageToOutput.mergePage(new_pdf.getPage(0))
output.addPage(pageToOutput)
outputStream = open('NewFile.pdf', "wb")
output.write(outputStream)
outputStream.close()

I used one of the solutions (pdf2image and PIL) in the link provided by @Matt Pitken, and it worked well.

Disclaimer: I am the author of borb, the library used in this answer
Redaction in PDF is done through annotations.
You can think of annotations as "something I added later to the PDF". For instance a post-it note with a remark.
Redaction annotations are basically a post-it with the implied meaning "this content needs to be removed from the PDF".
In borb, you can add redaction annotations and then apply them.
This is purposefully a two-step process. The idea is that you can send the document (with annotations) to someone else and ask them to review it (e.g. "Did I remove all the content that needed to be removed?").
Once your document is ready, you can apply the redaction annotations, which will effectively remove the content.
Step 1 (creating a PDF with content, and redaction annotations):
from decimal import Decimal
from borb.pdf.canvas.layout.annotation.redact_annotation import RedactAnnotation
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf import SingleColumnLayout
from borb.pdf import PageLayout
from borb.pdf import Paragraph
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf import PDF
def main():
    doc: Document = Document()
    page: Page = Page()
    doc.add_page(page)
    layout: PageLayout = SingleColumnLayout(page)
    layout.add(
        Paragraph(
            """
            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
            Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
            Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
            Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
            """
        )
    )
    page.add_annotation(
        RedactAnnotation(
            Rectangle(Decimal(405), Decimal(721), Decimal(40), Decimal(8)).grow(
                Decimal(2)
            )
        )
    )
    # store
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)
if __name__ == "__main__":
    main()
Of course, you can simply open an existing PDF and add a redaction annotation.
Step 2 (applying the redaction annotation):
import typing
from borb.pdf import Document
from borb.pdf import PDF
def main():
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)
    # apply redaction annotations
    doc.get_page(0).apply_redact_annotations()
    # store
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)
if __name__ == "__main__":
    main()

Related

Insert stamp PDF with different position uipath and python?

I have some PDF files that I want to stamp, but the stamp position is not the same in each file. Is there a way to find the position in the file and place the stamp there? I use UiPath and Python.
I haven't found a solution yet.
Disclaimer: I am the author of borb, the library used in this answer.
From what I understand of your question, you want to find a certain word on the page, and add a stamp on top of that.
Let's split that in two parts:
Finding the position of a word on the page
#!chapter_005/src/snippet_006.py
import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import RegularExpressionTextExtraction
def main():
    # read the Document
    # fmt: off
    doc: typing.Optional[Document] = None
    l: RegularExpressionTextExtraction = RegularExpressionTextExtraction("[lL]orem .* [dD]olor")
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])
    # fmt: on
    # check whether we have read a Document
    assert doc is not None
    # print matching groups
    for i, m in enumerate(l.get_matches()[0]):
        print("%d %s" % (i, m.group(0)))
        for r in m.get_bounding_boxes():
            print(
                "\t%f %f %f %f" % (r.get_x(), r.get_y(), r.get_width(), r.get_height())
            )
if __name__ == "__main__":
    main()
In this snippet we use RegularExpressionTextExtraction to process the Page events (rendering text, images, etc). This class acts as an EventListener, and keeps track of which text (being rendered) matches the given regex.
We can then print that text, and its position.
Putting a stamp on a page, at a given position
In the next snippet, we are going to:
create a PDF containing some text
add a rubber stamp (annotation) on that page, at precise coordinates
You can of course modify this snippet to only add the stamp, and to work from an existing PDF (rather than create one).
#!chapter_006/src/snippet_005.py
from decimal import Decimal
from borb.pdf.canvas.layout.annotation.rubber_stamp_annotation import (
RubberStampAnnotation,
RubberStampAnnotationIconType,
)
from borb.pdf.canvas.geometry.rectangle import Rectangle
from borb.pdf import SingleColumnLayout
from borb.pdf import PageLayout
from borb.pdf import Paragraph
from borb.pdf import Document
from borb.pdf import Page
from borb.pdf.page.page_size import PageSize
from borb.pdf import PDF
def main():
    doc: Document = Document()
    page: Page = Page()
    doc.add_page(page)
    layout: PageLayout = SingleColumnLayout(page)
    layout.add(
        Paragraph(
            """
            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
            Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
            Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
            Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
            """
        )
    )
    # This is where the stamp is added
    page_width: Decimal = PageSize.A4_PORTRAIT.value[0]
    page_height: Decimal = PageSize.A4_PORTRAIT.value[1]
    s: Decimal = Decimal(100)
    page.add_annotation(
        RubberStampAnnotation(
            Rectangle(
                page_width / Decimal(2) - s / Decimal(2),
                page_height / Decimal(2) - s / Decimal(2),
                s,
                s,
            ),
            name=RubberStampAnnotationIconType.CONFIDENTIAL,
        )
    )
    # store
    with open("output.pdf", "wb") as out_file_handle:
        PDF.dumps(out_file_handle, doc)
if __name__ == "__main__":
    main()
The result should be something like this:
In order to change the appearance of the stamp, I encourage you to check out the documentation.

How to extract specific portion of text from file in Python?

I would like to extract a specific portion from a text.
For example, I have this text:
"*Lorem ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.
Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint obcaecat cupiditat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum*",
I would like to extract the content from "Duis aute" up to the start of a new line (that is, up to "nulla pariatur").
How could I do this in Python? Thanks in advance to everyone.
Sorry for poor English.
You can use this.
with open('filename.txt') as f:  # open the file and read its contents
    data = f.read()
s_index = data.index('Duis aute')  # start index of the text
# end index: the first dot after s_index (s_index is passed as the search start)
e_index = data.index('.', s_index)
text = data[s_index:e_index]
print(text)
Output
Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur
If you want the text to end at a newline (\n) instead, use this:
with open('filename.txt') as f:
    data = f.read()
# str.index raises ValueError when the substring is not found, hence the try/except.
try:
    s_index = data.index('Duis aute')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
Testing
with open('filename.txt') as f:
    data = f.read()
# str.index raises ValueError when the substring is not found, hence the try/except.
try:
    s_index = data.index('ipsum dolor')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
output
ipsum dolor sit amet, consectetur adipisci elit, sed do eiusmod tempor incidunt ut labore et dolore magna aliqua.
with open('filename.txt') as f:
    data = f.read()
# str.index raises ValueError when the substring is not found, hence the try/except.
try:
    s_index = data.index('Ut enim ad minim')
    e_index = data.index('\n', s_index)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
output
Ut enim ad minim veniam, quis nostrum exercitationem ullamco laboriosam, nisi ut aliquid ex ea commodi consequatur.
And if you need only the one word that follows the given word, use this:
with open('filename.txt') as f:
    data = f.read()
# str.index raises ValueError when the substring is not found, hence the try/except.
try:
    s_index = data.index('Lorem')
    # skip past 'Lorem' and the space after it, then stop at the next space
    e_index = data.index(' ', s_index + len('Lorem') + 1)
except ValueError:
    print('Value Not Found.')
else:
    text = data[s_index:e_index]
    print(text)
output
Lorem ipsum
If you are trying to extract a particular "sentence", one way is to split the text (held in s below) on the sentence separator (\n, for example):
sentences = s.split('\n')
If you have multiple delimiters for a sentence - you can use the re module -
import re
sentences = re.split(r'\.|\n', s)
You can then extract the matches from sentences -
required = '\n'.join(_ for _ in sentences if _.strip().startswith('Duis aute'))
Of course, you can combine all of this into a one-liner (here splitting on '.' only) -
'\n'.join(_ for _ in s.split('.') if _.strip().startswith('Duis aute'))
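The index-based answers above rely on str.index, which raises ValueError when a marker is missing. As a minimal sketch of the same idea without exceptions, str.find can be used instead, since it returns -1 when nothing is found. The helper name extract_span and the sample text are made up for illustration and are not part of the original answers.

```python
def extract_span(text, start, end="\n"):
    """Return the text from `start` up to (not including) the next `end`.

    Returns None when `start` is absent. If `end` never occurs after
    `start`, the rest of the text is returned.
    """
    s_index = text.find(start)
    if s_index == -1:
        return None
    # Look for the end marker only after the start marker.
    e_index = text.find(end, s_index + len(start))
    if e_index == -1:
        return text[s_index:]
    return text[s_index:e_index]


sample = (
    "Lorem ipsum dolor sit amet.\n"
    "Duis aute irure reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.\n"
    "Excepteur sint obcaecat cupiditat non proident."
)
print(extract_span(sample, "Duis aute"))
```

Falling back to the rest of the text also covers a last line that has no trailing newline, a case in which the str.index versions report that the value was not found.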

How to use hex color value in python reportlab pdf generation

I am trying to generate a multi-page PDF document by reading some .py files and other doc files. I am trying to do it with SimpleDocTemplate instead of Canvas. Now I want to color the text with a hex value. I tried the following:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate
from reportlab.platypus.para import Paragraph
from reportlab.lib.styles import getSampleStyleSheet
doc_content = []
styles=getSampleStyleSheet()
doc = SimpleDocTemplate("form_letter.pdf", pagesize=letter,
                        rightMargin=72, leftMargin=72,
                        topMargin=72, bottomMargin=18)
titleFormat = '<font size="16" name="Helvetica" color="#FF8100"><b><i>%s</i></b></font>'
def generateDoc(docName):
    paraTitle = Paragraph(titleFormat % 'Title', styles["Normal"])
    doc_content.append(paraTitle)
    doc.build(doc_content)
generateDoc("temp.pdf")
But this gives me error
AttributeError: module 'reportlab.lib.colors' has no attribute '#FF8100'
I also tried 0xFF8100, but it was giving same error:
AttributeError: module 'reportlab.lib.colors' has no attribute '0xFF8100'
When I use a named color such as red, it works fine. How can I use hex color values?
It's generally better to create a custom ParagraphStyle if you need differently colored text in the PDF. You can pass your hex value to colors.HexColor, whose signature is HexColor(val, htmlOnly=False, hasAlpha=False):
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
doc_content = []
styles = getSampleStyleSheet()
# add a custom style to the stylesheet
styles.add(ParagraphStyle(name='Content',
                          fontName='Helvetica',
                          fontSize=8,
                          textColor=colors.HexColor("#FF8100")))
doc = SimpleDocTemplate("form_letter.pdf", pagesize=letter,
                        rightMargin=72, leftMargin=72,
                        topMargin=72, bottomMargin=18)
# using a sample text here
titleFormat = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."
def generateDoc(docName):
    paraTitle = Paragraph(titleFormat, styles["Content"])
    doc_content.append(paraTitle)
    doc.build(doc_content)
generateDoc("temp.pdf")

How to split a file which is delimited by bullet points

I'm trying to split a large file that has several paragraphs, each one is of variable length and the only delimiter would be the bullet point for the next paragraph...
Is there a way to get several different files with each individual paragraph?
The final thing is to write each individual paragraph to a MySQL DB...
example input:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
output: each paragraph is a separate entry in the DB
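This is how you split your file by bullet point:
# Split the source file on the bullet character (U+2022).
with open(source_file, encoding="utf-8") as f:
    paragraphs = f.read().split("\u2022")
for i, par in enumerate(paragraphs):
    # Write each paragraph to its own numbered file.
    with open("%d.txt" % i, "w", encoding="utf-8") as out:
        out.write(par)
    # Each file can then be loaded with a SQL statement such as:
    # LOAD DATA INFILE '<i>.txt' INTO TABLE your_DB_name.your_table;
This connects to the MySQL DB, reads the file, splits it at each bullet point, and inserts the data into a MySQL table.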
My Code:
# Server connection to MySQL:
import MySQLdb
conn = MySQLdb.connect(host="localhost",
                       user="root",
                       passwd="newpassword",
                       db="db")
x = conn.cursor()
try:
    with open("FILE_NAME_WITH_EXTENSION", encoding="utf-8") as f:
        file_data = f.read().split("\u2022")
    for text in file_data:
        print(text)
        # the parameters must be a sequence, hence the one-element tuple
        x.execute("""INSERT INTO TABLE_NAME VALUES (%s)""", (text,))
    conn.commit()
except Exception:
    # undo any partial inserts if something went wrong
    conn.rollback()
conn.close()
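Splitting on the bullet usually leaves an empty first piece (the text before the first bullet) and stray whitespace around each paragraph. A minimal sketch of cleaning the pieces before inserting one row per paragraph follows. To keep it self-contained it uses the standard-library sqlite3 module with an in-memory database as a stand-in for MySQL, and the table name, column name, and helper names are all hypothetical. With MySQLdb the placeholder would be %s rather than ?.

```python
import sqlite3

BULLET = "\u2022"


def split_paragraphs(text):
    """Split text on bullet characters, dropping empty pieces and outer whitespace."""
    return [part.strip() for part in text.split(BULLET) if part.strip()]


def store_paragraphs(conn, paragraphs):
    """Insert one row per paragraph using a parameterized query."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO paragraphs (body) VALUES (?)",
            [(p,) for p in paragraphs],
        )


text = "\u2022 Lorem ipsum dolor sit amet.\n\u2022 Duis aute irure dolor.\n"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE paragraphs (body TEXT)")
store_paragraphs(conn, split_paragraphs(text))
rows = [row[0] for row in conn.execute("SELECT body FROM paragraphs ORDER BY rowid")]
print(rows)
```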

Any way to search zlib-compressed text?

For a project I have to store a great deal of text and I was hoping to keep the database size small by zlib-compressing the text. Is there a way to search zlib-compressed text by testing for substrings without decompressing?
I would like to do something like the following:
>>> import zlib
>>> lorem = zlib.compress(b"Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.")
>>> test_string = zlib.compress(b"Lorem")
>>> test_string in lorem
False
No. You cannot compress a short string and expect to find the result of that compression in the compressed version of a file that contains that original short string. Compression codes the data differently depending on the data that precedes it. In fact, that's how most compressors work -- by using the preceding data for matching strings and statistical distributions.
To search for a string, you have to decompress the data. You do not have to store the decompressed data though. You can read in the compressed data and decompress on the fly, discarding that data as you go until you find your string or get to the end. If the compressed data is very large and on slow mass media, this may be faster than searching for the string in the same data uncompressed on the same media.
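The "decompress on the fly and discard as you go" approach can be sketched with zlib.decompressobj from the standard library. The function name compressed_contains and the chunk size are assumptions made for this illustration. The sketch keeps only the last len(needle) - 1 bytes of each decompressed piece so that a match spanning two pieces is still found. Note that a single piece can still expand to much more than chunk_size bytes, so memory use is bounded by the compression ratio rather than by chunk_size alone.

```python
import zlib


def compressed_contains(compressed, needle, chunk_size=4096):
    """Report whether `needle` occurs in the decompressed form of `compressed`.

    The data is decompressed piece by piece. Only a short tail of the
    previous piece is kept, so the full plaintext is never held at once.
    """
    if not needle:
        return True
    decomp = zlib.decompressobj()
    tail = b""
    for start in range(0, len(compressed), chunk_size):
        piece = decomp.decompress(compressed[start:start + chunk_size])
        window = tail + piece
        if needle in window:
            return True
        # Keep just enough bytes to catch a match that straddles two pieces.
        tail = window[-(len(needle) - 1):] if len(needle) > 1 else b""
    return needle in tail + decomp.flush()


text = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 200
blob = zlib.compress(text)
print(compressed_contains(blob, b"Lorem"))
print(compressed_contains(blob, b"not present"))
```

When the compressed data lives in a file or a database BLOB, the same loop works if each slice of compressed is replaced by a read of chunk_size bytes from that source.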
