Python - Remove watermark from pdf files


I created a simple script to convert RTF files to PDF. The script works, but every converted PDF has a watermark. I tried calling "watermark.remove()", but it does not seem to work in my script.
Could anyone take a look and let me know how to do this correctly? Thanks.
Here is my script:
import fnmatch
import os

import aspose.words as aw

source_path = "file_path"
dest_path = "output_path"

# convert every RTF file found in source_path
for in_file in os.listdir(source_path):
    if fnmatch.fnmatch(in_file, '*.rtf'):
        # load RTF document
        doc = aw.Document(os.path.join(source_path, in_file))
        out_name = os.path.splitext(in_file)[0]  # get file name only
        print(out_name)  # print the file name
        doc.watermark.remove()  # remove the watermark
        doc.save(os.path.join(dest_path, out_name + ".pdf"), aw.SaveFormat.PDF)

The section "Licensing and Subscription" of the documentation says:
The Trial version of Aspose.Words without the specified license provides full product functionality, but inserts an evaluative watermark at the top of the document upon loading and saving and limits the maximum document size to a few hundred paragraphs.
So you may consider purchasing a license or requesting a 30-day temporary license.
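For reference, applying a license before converting might look like the sketch below. It assumes that Aspose.Words for Python exposes `aw.License().set_license(path)` as its licensing documentation describes, and that a valid license file is available. The `license_file` argument and the `pdf_output_path` helper are hypothetical names introduced for illustration. `doc.watermark.remove()` appears to target only watermarks stored in the document itself, which would explain why it has no effect on the evaluation mark.

```python
import fnmatch
import os


def pdf_output_path(dest_dir, rtf_name):
    """Map 'report.rtf' to '<dest_dir>/report.pdf' (hypothetical helper)."""
    stem = os.path.splitext(rtf_name)[0]
    return os.path.join(dest_dir, stem + ".pdf")


def convert_all(source_dir, dest_dir, license_file):
    # Imported lazily so the helper above works without the library installed.
    import aspose.words as aw

    # Assumption: license_file points at a purchased or temporary .lic file.
    aw.License().set_license(license_file)

    for name in os.listdir(source_dir):
        if fnmatch.fnmatch(name, "*.rtf"):
            doc = aw.Document(os.path.join(source_dir, name))
            doc.save(pdf_output_path(dest_dir, name), aw.SaveFormat.PDF)
```

The license is set once, before any document is loaded, so every conversion in the loop runs in licensed mode.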

Related

Creating report in PDF with images from python

I want to make a PDF with report from the python plots I have made.
I am importing those images to pdf with the following lines:
import os
import img2pdf
with open("report.pdf", "wb") as f:
    f.write(img2pdf.convert([i for i in os.listdir('C:\\Users\\rysza\\Desktop\\python data analysis\\zajecia5') if i.endswith(".jpg")]))
My question is: how can I create an additional page in front of the images where I can put some text?
I tried the following, but it does not seem to work:
from reportlab.pdfgen import canvas
c = canvas.Canvas("report.pdf")
c.drawString(100,750,"Welcome to Reportlab!")
Any options are welcome.

You can do it with the fpdf library.
You can find code examples here:
https://python-scripts.com/create-pdf-pyfpdf
(You can also translate the page into English with https://translate.yandex.ru/translate )
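A minimal sketch of that approach, assuming the PyFPDF package (`fpdf`) and its `FPDF.add_page`, `set_font`, `cell`, `image` and `output` methods. The function names, the title text and the page layout values are hypothetical. Note that `os.listdir` returns bare file names, so the sketch joins them with the folder before passing them on.

```python
import os


def jpg_paths(folder):
    """Return full paths of the .jpg files in folder, sorted by name."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".jpg")
    )


def build_report(folder, out_pdf, title):
    # Imported lazily; assumes the PyFPDF package is installed.
    from fpdf import FPDF

    pdf = FPDF()  # defaults: A4 portrait, units in millimetres
    pdf.add_page()  # the first page holds the text
    pdf.set_font("Helvetica", size=16)
    pdf.cell(0, 10, title, 0, 1)  # width 0 = up to the right margin, then new line

    for path in jpg_paths(folder):
        pdf.add_page()  # one image per page
        pdf.image(path, x=10, y=10, w=190)  # 190 mm wide, height scaled to match

    pdf.output(out_pdf)
```

Building the whole report in one library avoids having to merge a separate title page with the img2pdf output afterwards.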

How can I save the document as a .psb file using win32com.client in Python?

In photoshop and using python, I cannot save the active document as PSB (Large Document Format) file
With win32com.client, I can save active documents as .psd files like this:
from win32com.client import Dispatch
psApp = Dispatch("Photoshop.Application")
activeDocument = psApp.Application.ActiveDocument
activeDocument.SaveAs("E:\\PSDCopy", PhotoshopSaveOptions, False)
However, I cannot force it to save as PSB no matter what I try.
I also could not find any clue in the VBScript documentation, not even a word about PSB files.
Any help would be deeply appreciated.

Adobe created a terrible API for interfacing with Photoshop. Worse than that, the documentation is outdated and doesn't cover newer additions like PSB files, EXR files, etc.
A good way to find out how to write code for this is to use the Photoshop ActionListener and hack your way around (it doesn't always work, but it gives you some good leads). You can read more about it here.
This should do what you are looking for:
import logging

import comtypes.client as ct

app = ct.CreateObject('Photoshop.Application')


def save_as_psb(path):
    """Save the current Document as PSB with maximised compatibility
    turned ON.

    Args:
        path (str): This is the filename of the output PSB
    """
    desc19 = ct.CreateObject("Photoshop.ActionDescriptor")
    desc20 = ct.CreateObject("Photoshop.ActionDescriptor")
    desc20.putBoolean(app.StringIDToTypeID('maximizeCompatibility'), True)
    desc19.putObject(
        app.CharIDToTypeID('As  '), app.CharIDToTypeID('Pht8'), desc20)
    desc19.putPath(app.CharIDToTypeID('In  '), path)
    logging.debug(path)
    desc19.putBoolean(app.CharIDToTypeID('LwCs'), True)
    app.executeAction(app.CharIDToTypeID('save'), desc19, 3)

Pdf creation and Writing content inside - PyPDF2

I am reading text from a PDF page by page, doing some operation on the extracted text at each step, and I want to create a new PDF that stores the edited text.
I tried the following with PyPDF2:
import PyPDF2
output = PdfFileWriter()
pdf="pdfte.pdf"
Obj_pdfFile = open(pdf, 'rb')
pdfReader = PyPDF2.PdfFileReader(Obj_pdfFile,strict = False)
pages=pdfReader.numPages
for page in range(pages):
    pageObj = pdfReader.getPage(page)
    pdf_text=pageObj.extractText()
    upper = pdf_text.upper()
    #print(pdf_text)
    output.addPage(input.getPage(upper)) . # I thought this will work but no use..
I know I need to pass a page, but basically I am looking for how to save the edited text in a new PDF. I know I am missing the code that actually writes the PDF, and that is exactly what I need help with; I have never worked with PDFs.
Also, is there a better way to do this?

PyPDF2 is great for handling PDF files as documents, but it is not an editor. I wanted to do the same thing you tried, and only managed it with reportlab, as many other answers here do. Note that here
output.addPage(input.getPage(upper)) . # I thought this will work but no use.
upper is a string, and addPage() expects a page object such as the one returned by
PyPDF2.PdfFileReader(pdffile).getPage(0)
Here is what worked for me on Python 2.7:
import PyPDF2
from StringIO import StringIO  # Python 2.7; on Python 3 use io.BytesIO instead
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A6  # choose your page size here

output = PyPDF2.PdfFileWriter()  # your PyPDF2 writer
temp = StringIO()
can = canvas.Canvas(temp, pagesize=A6)
can.drawString(10, 405, "Your string on this position")
can.save()
temp.seek(0)
lector = PyPDF2.PdfFileReader(temp)
output.addPage(lector.getPage(0))
Now output is your PDF with the string attached. I hope someone finds it useful.
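Putting the pieces together for the original question, a Python 3 sketch could look like the following. It assumes a PyPDF2 release before 3.0 (which still has the camel-case `PdfFileReader`/`PdfFileWriter` API used in this thread) and reportlab. The function names, the A4 page size, the 40-point margin and the 90-character wrap width are all hypothetical choices. Text that runs past the bottom of the page is simply cut off in this sketch.

```python
import io
import textwrap


def wrap_for_page(text, width=90):
    """Split text into lines short enough to fit across a page."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for an empty line, so keep it as "".
        lines.extend(textwrap.wrap(paragraph, width) or [""])
    return lines


def uppercase_pdf(src_path, dst_path):
    # Imported lazily; assumes PyPDF2 < 3.0 and reportlab are installed.
    import PyPDF2
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    writer = PyPDF2.PdfFileWriter()
    with open(src_path, "rb") as src:
        reader = PyPDF2.PdfFileReader(src, strict=False)
        for index in range(reader.numPages):
            text = reader.getPage(index).extractText().upper()

            # Draw the edited text onto a fresh one-page PDF held in memory.
            buffer = io.BytesIO()
            page_canvas = canvas.Canvas(buffer, pagesize=A4)
            text_object = page_canvas.beginText(40, A4[1] - 40)
            text_object.textLines(wrap_for_page(text))
            page_canvas.drawText(text_object)
            page_canvas.save()

            buffer.seek(0)
            writer.addPage(PyPDF2.PdfFileReader(buffer).getPage(0))

    with open(dst_path, "wb") as dst:
        writer.write(dst)
```

Each input page becomes one new page that contains only the edited text; the original layout, fonts and images are not carried over.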

Remove all images from docx files

I've searched the documentation for python-docx and other packages, as well as stack-overflow, but could not find how to remove all images from docx files with python.
My exact use case: I need to convert hundreds of Word documents to a "draft" format to be viewed by clients. Those drafts should be identical to the original documents, but all the images must be deleted / redacted from them.
Sorry for not including an example of things I tried, what I have tried is hours of research that didn't give any info. I found this question on how to extract images from word files, but that doesn't delete them from the actual document: Extract pictures from Word and Excel with Python
From there and other sources I've found out that docx files could be read as simple zip files, I don't know if that means that it's possible to "re-zip" without the images without affecting the integrity of the docx file (edit: simply deleting the images works, but prevents python-docx from continuing to work with this file because of missing references to images), but thought this might be a path to a solution.
Any ideas?

If your goal is to redact images maybe this code I used for a similar usecase could be useful:
import sys
import zipfile
from PIL import Image, ImageFilter
import io

blur = ImageFilter.GaussianBlur(40)

def redact_images(filename):
    outfile = filename.replace(".docx", "_redacted.docx")
    with zipfile.ZipFile(filename) as inzip:
        with zipfile.ZipFile(outfile, "w") as outzip:
            for info in inzip.infolist():
                name = info.filename
                print(info)
                content = inzip.read(info)
                if name.endswith((".png", ".jpeg", ".gif")):
                    fmt = name.split(".")[-1]
                    img = Image.open(io.BytesIO(content))
                    img = img.convert().filter(blur)
                    outb = io.BytesIO()
                    img.save(outb, fmt)
                    content = outb.getvalue()
                    info.file_size = len(content)
                    info.CRC = zipfile.crc32(content)
                outzip.writestr(info, content)
Here I used PIL to blur images in some files, but instead of the blur filter any other suitable operation could be used. This worked quite nicely for my usecase.
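If the pictures should disappear entirely rather than be blurred, the same zip-rewriting idea can swap each image for a tiny blank placeholder, which keeps the image references in the document valid. The sketch below is one possible variant under stated assumptions: it handles only PNG parts under `word/media/`, it builds a 1x1 white PNG with the standard library, and the function names are hypothetical. Other image formats would need their own placeholder.

```python
import struct
import zipfile
import zlib


def blank_png():
    """Build a valid 1x1 white PNG using only the standard library."""
    def chunk(tag, data):
        body = tag + data
        return (struct.pack(">I", len(data)) + body
                + struct.pack(">I", zlib.crc32(body)))

    header = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)  # 1x1, 8-bit RGB
    pixels = zlib.compress(b"\x00\xff\xff\xff")  # filter byte + one white pixel
    return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", header)
            + chunk(b"IDAT", pixels) + chunk(b"IEND", b""))


def blank_out_pngs(src_docx, dst_docx):
    """Copy src_docx to dst_docx, replacing each embedded PNG with a blank one.

    Returns the number of images that were replaced.
    """
    placeholder = blank_png()
    replaced = 0
    with zipfile.ZipFile(src_docx) as inzip, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as outzip:
        for info in inzip.infolist():
            name = info.filename
            content = inzip.read(info)
            if name.startswith("word/media/") and name.lower().endswith(".png"):
                content = placeholder
                replaced += 1
            outzip.writestr(name, content)
    return replaced
```

Because every entry is written back under its original name, the relationships inside the document still resolve, so the file should remain readable by python-docx.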

I don't think it's currently implemented in python-docx.
Pictures in the Word Object Model are defined as either floating shapes or inline shapes. The docx documentation states that it only supports inline shapes.
The Word Object Model for Inline Shapes supports a Delete() method, which should be accessible. However, it is not listed in the examples of InlineShapes and there is also a similar method for paragraphs. For paragraphs, there is an open feature request to add this functionality - which dates back to 2014! If it's not added to paragraphs it won't be available for InlineShapes as they are implemented as discrete paragraphs.
You could do this with win32com if you have a machine with Word and Python installed.
This would allow you to call the Word Object Model directly, giving you access to the Delete() method. In fact you could probably cheat - rather than scrolling through the document to get each image, you can call Find and Replace to clear the image. This SO question talks about win32com find and replace:
import win32com.client
from os import getcwd, listdir

docs = [i for i in listdir('.') if i[-3:]=='doc' or i[-4:]=='docx'] #All Word files
FromTo = {"First Name":"John",
          "Last Name":"Smith"} #You can insert as many as you want
word = win32com.client.DispatchEx("Word.Application")
word.Visible = True #Keep comment after tests
word.DisplayAlerts = False
for doc in docs:
    word.Documents.Open('{}\\{}'.format(getcwd(), doc))
    for From in FromTo.keys():
        word.Selection.Find.Text = From
        word.Selection.Find.Replacement.Text = FromTo[From]
        word.Selection.Find.Execute(Replace=2, Forward=True) #You made the mistake here=> Replace must be 2
    name = doc.rsplit('.',1)[0]
    ext = doc.rsplit('.',1)[1]
    word.ActiveDocument.SaveAs('{}\\{}_2.{}'.format(getcwd(), name, ext))
word.Quit() # releases Word object from memory
In this case, since we want to target images, we would use the special code ^g as the Find text and an empty string as the replacement:
find = word.Selection.Find
find.Text = "^g"
find.Replacement.Text = ""
find.Execute(Replace=2, Forward=True)  # 2 replaces all matches, 1 only the first

I don't know this library, but looking through the documentation I found this section about images. It mentions that it is currently not possible to insert images other than inline ones. If that is what you have in your documents, I assume you can also retrieve them by looking in the Document object and then remove them?
The Document is explained here.
Although not a duplicate, you might also want to look at this question's answer where user "scanny" explains how he finds images using the library.

Why geotiff could not be opened by gdal?

I'm a newbie in Python and in geoprocessing. I'm writing a program to calculate NDWI. To do this, I try to open a GeoTIFF dataset with GDAL, but the dataset can't be opened. I tried different TIFF files (Landsat8 multiple data, Landsat7 composite, etc.), but the dataset is always None.
What could be the reason for this? Or how can I find out?
Here's a part of code:
import sys, os, struct
import gdal, gdalconst
from gdalconst import *
import numpy as np
from numpy import *

class GDALCalcNDWI ():
    def calcNDWI(self, outFilePath):
        gdal.AllRegister()
        # this allows GDAL to throw Python Exceptions
        gdal.UseExceptions()
        filePath = "C:\\Users\\Daria\\Desktop.TIF\\170028-2007-05-21.tif"
        # Open
        dataset = gdal.Open(filePath, gdal.GA_ReadOnly)
        # Check
        if dataset is None:
            print ("can't open tiff file")
            sys.exit(-1)
Thanks

Whenever a well-known file reader returns None, first make sure the path to your file is correct. I doubt you have a directory called Desktop.TIF; I assume it is a typo in your source code. You probably want C:\\Users\\Daria\\Desktop\\TIF\\170028-2007-05-21.tif as the path (note that Desktop.TIF ==> Desktop\\TIF).
The safest thing to do is right click on the file, go to properties, and copy/paste that path into your python source code.
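The path can also be checked in code before GDAL is involved. The sketch below is one way to do that; the function names are hypothetical, and it assumes a GDAL build whose Python bindings are importable as `osgeo.gdal`. After `gdal.UseExceptions()`, a failed `gdal.Open` raises a `RuntimeError` with GDAL's own message instead of returning None, which usually names the actual problem.

```python
import os


def explain_missing_path(path):
    """Return None if path exists, else its deepest existing ancestor."""
    if os.path.exists(path):
        return None
    current = os.path.abspath(path)
    while not os.path.exists(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return current


def open_raster(path):
    last_existing = explain_missing_path(path)
    if last_existing is not None:
        raise FileNotFoundError(
            "%r does not exist; the last existing folder is %r"
            % (path, last_existing))

    # Imported lazily so the path check works without GDAL installed.
    from osgeo import gdal

    gdal.UseExceptions()  # failures raise RuntimeError instead of returning None
    return gdal.Open(path, gdal.GA_ReadOnly)
```

For the path in the question, the error message would point at the Users\\Daria folder as the last one that exists, which makes the Desktop.TIF typo easy to spot.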
