How to decode a pdf encrypted file from python - python

I have got a PDF file and associated password.
I would to convert an encrypted file to a clear version using python only.
I found here some python modules (pyPdf2 , PDFMiner)
to treat PDF file but none of them will work with encryption.
Someone have already done this ?

Now pyPDF2 support encryption, according to this answer, it may be implemented like this:
import os
import PyPDF2
from PyPDF2 import PdfFileReader
fp = open(filename)
pdfFile = PdfFileReader(fp)
password = "mypassword"
if pdfFile.isEncrypted:
try:
pdfFile.decrypt(password)
print('File Decrypted (PyPDF2)')
except:
command = ("cp "+ filename +
" temp.pdf; qpdf --password='' --decrypt temp.pdf " + filename
+ "; rm temp.pdf")
os.system(command)
print('File Decrypted (qpdf)')
fp = open(filename)
pdfFile = PdfFileReader(fp)
else:
print('File Not Encrypted')
Note that this code, use pyPDF2 by default and failback to qpdf in case of issue.

You'd also need to know the encryption algorithm and key length to be able to advise which tool might work... and depending on the answers, a python library may not be available.

Related

PyPDF2 error "PyCryptodome is required for AES algorithm"

I've got hundreds on PDFs I need to set password. I tried to use pyPDF2 to do that but I got an error:
"DependencyError: PyCryptodome is required for AES algorithm".
I've tried to google any other module like pikepdf but I found only how to crack the password using it and not to actually set password.
Any ideas how to deal with it? I get an error on that line: "input_pdf = PdfFileReader(in_file)"
file = directory + '\\passwords.xlsx'
df = pd.read_excel(file)
df['PDF'] = df.iloc[:,[0]] + '.pdf'
df = df.to_dict('records')
for i in df:
filename = i['PDF']
password = i['Password']
with open(filename, "rb") as in_file:
input_pdf = PdfFileReader(in_file)
output_pdf = PdfFileWriter()
output_pdf.appendPagesFromReader(input_pdf)
output_pdf.encrypt(password)
with open(filename, "wb") as out_file:
output_pdf.write(out_file)
I had the same problem.
You just need to install PyCryptodome package.
For example:
pip install pycryptodome==3.15.0
A.) Great way to do it:
https://roytuts.com/how-to-encrypt-pdf-as-password-protected-file-in-python/
import PyPDF2
#pdf_in_file = open("simple.pdf",'rb')
pdf_in_file = open("gre_research_validity_data.pdf",'rb')
inputpdf = PyPDF2.PdfFileReader(pdf_in_file)
pages_no = inputpdf.numPages
output = PyPDF2.PdfFileWriter()
for i in range(pages_no):
inputpdf = PyPDF2.PdfFileReader(pdf_in_file)
output.addPage(inputpdf.getPage(i))
output.encrypt('password')
#with open("simple_password_protected.pdf", "wb") as outputStream:
with open("gre_research_validity_data_password_protected.pdf", "wb") as outputStream:
output.write(outputStream)
pdf_in_file.close()
B.) If you want to fix your own bug:
solution for similar error message but during counting pages - Not able to find number of pages of PDF using Python 3.X: DependencyError: PyCryptodome is required for AES algorithm
ORIGINAL CODE
! pip install PyPDF2
! pip install pycryptodome
from PyPDF2 import PdfFileReader
from Crypto.Cipher import AES
if PdfFileReader('Media Downloaded Files/spk-10-3144 bro.pdf').isEncrypted:
print('This file is encrypted.')
else:
print(PdfFileReader('Media Downloaded Files/spk-10-3144-bro.pdf').numPages)
FIX
! pip install pikepdf
from pikepdf import Pdf
pdf = Pdf.open('Media Downloaded Files/spk-10-3144-bro.pdf')
len(pdf.pages)

Why doesn't the password open my zip file in s3 when passed as a bytes object in python?

I have a small but mysterious and unsolvable problem using python to open a password protected file in an AWS S3 bucket.
The password I have been given is definitely correct and I can download the zip to Windows and extract it to reveal the csv data I need.
However I need to code up a process to load this data into a database regularly.
The password has a pattern like this (includes mixed case letters, numbers and a single "#"):-
ABCD#Efghi12324567890
The code below works with other zip files I place in the location with the same password:-
import boto3
import pyzipper
from io import BytesIO
s3_resource = boto3.resource('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
zip_obj = s3_resource.Object(bucket_name=my_bucket, key=my_folder + my_zip)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = pyzipper.ZipFile(buffer)
my_newfile=z.namelist()[0]
s3_resource.meta.client.upload_fileobj(
z.open(my_newfile, pwd=b"ABCD#Efghi12324567890"), #HERE IS THE OPEN COMMAND
Bucket=my_bucket,
Key=my_folder + my_newfile)
I am told the password is incorrect:-
RuntimeError: Bad password for file 'ThisIsTheFileName.csv'
I resorted to using pyzipper rather than zipfile, since zipfile didn't support the compression method of the file in question:-
That compression method is not supported
In 7-zip I can see the following for the zip file:-
Method: AES-256 Deflate
Characteristics: WzAES: Encrypt
Host OS: FAT
So to confirm:-
-The password is definitely correct (can open it manually)
-The code seems ok - it opens my zip files with the same password
What is the issue here please and how do I fix it?
You would have my sincere thanks!
Phil
With some help from a colleague and a useful article, I now have this working.
Firstly as per the compression type, I have found it necessary to use the AESZipFile() method of pyzipper (although this method also seemed to work on other compression types).
Secondly the AESZipFile() method apparently accepts a BytesIO object as well as a file path, presumably because this is what it sees when it opens the file.
Therefore the zip file can be extracted in situ without having to download it first.
This method creates the pyzipper object which you can then read by specifying the file name and the password.
The final code looks like this:-
import pyzipper
import boto3
from io import BytesIO
my_bucket = ''
my_folder = ''
my_zip = ''
my_password = b''
aws_access_key_id=''
aws_secret_access_key=''
s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
s3_file = s3.get_object(Bucket=my_bucket, Key=my_folder + my_zip)
s3_iodata = BytesIO(s3_file['Body'].read())
f = pyzipper.AESZipFile(s3_iodata)
my_file = f.namelist()[0]
file_content = f.read(my_file, pwd = my_password)
response = s3.put_object(Body=file_content, Bucket=my_bucket, Key=my_folder + my_file)
Here is an article that was useful:-
https://www.linkedin.com/pulse/extract-files-from-zip-archives-in-situ-aws-s3-using-python-tom-reid
I hope this is helpful to someone,
Phil

Open a protected pdf file in python

I write a pdf cracking and found the password of the protected pdf file. I want to write a program in Python that can display that pdf file on the screen without password.I use the PyPDF library.
I know how to open a file without the password, but can't figure out the protected one.Any idea? Thanks
filePath = raw_input()
password = 'abc'
if sys.platform.startswith('linux'):
subprocess.call(["xdg-open", filePath])
The approach shown by KL84 basically works, but the code is not correct (it writes the output file for each page). A cleaned up version is here:
https://gist.github.com/bzamecnik/1abb64affb21322256f1c4ebbb59a364
# Decrypt password-protected PDF in Python.
#
# Requirements:
# pip install PyPDF2
from PyPDF2 import PdfFileReader, PdfFileWriter
def decrypt_pdf(input_path, output_path, password):
with open(input_path, 'rb') as input_file, \
open(output_path, 'wb') as output_file:
reader = PdfFileReader(input_file)
reader.decrypt(password)
writer = PdfFileWriter()
for i in range(reader.getNumPages()):
writer.addPage(reader.getPage(i))
writer.write(output_file)
if __name__ == '__main__':
# example usage:
decrypt_pdf('encrypted.pdf', 'decrypted.pdf', 'secret_password')
You should use pikepdf library nowadays instead:
import pikepdf
with pikepdf.open("input.pdf", password="abc") as pdf:
num_pages = len(pdf.pages)
print("Total pages:", num_pages)
PyPDF2 doesn't support many encryption algorithms, pikepdf seems to solve them, it supports most of password protected methods, and also documented and actively maintained.
You can use pdfplumber library. Super easy to use and reads machine written pdf files seamlessly, better than any other library i have used.
import pdfplumber
with pdfplumber.open(r'D:\examplepdf.pdf' , password = 'abc') as pdf:
first_page = pdf.pages[0]
print(first_page.extract_text())
I have the answer for this question. Basically, the PyPDF2 library needs to install and use in order to get this idea working.
#When you have the password = abc you have to call the function decrypt in PyPDF to decrypt the pdf file
filePath = raw_input("Enter pdf file path: ")
f = PdfFileReader(file(filePath, "rb"))
output = PdfFileWriter()
f.decrypt ('abc')
# Copy the pages in the encrypted pdf to unencrypted pdf with name noPassPDF.pdf
for pageNumber in range (0, f.getNumPages()):
output.addPage(f.getPage(pageNumber))
# write "output" to noPassPDF.pdf
outputStream = file("noPassPDF.pdf", "wb")
output.write(outputStream)
outputStream.close()
#Open the file now
if sys.platform.startswith('darwin'):#open in MAC OX
subprocess.call(["open", "noPassPDF.pdf"])

merging pdf files with pypdf

I am writing a script that parses an internet site (maya.tase.co.il) for links, downloads pdf file and merges them. It works mostly, but merging gives me different kinds of errors depending on the file. I cant seem to figure out why. I cut out the relevant code and built a test only for two specific files that are causing a problem. The script uses pypdf, but I am willing to try anything that works. Some files are encrypted, some are not.
def is_incry(pdf):
from pyPdf import PdfFileWriter, PdfFileReader
input=PdfFileReader(pdf)
try:
input.getNumPages()
return input
except:
input.decrypt("")
return input
def merg_pdf(to_keep,to_lose):
import os
from pyPdf import PdfFileWriter, PdfFileReader
if os.path.exists(to_keep):
in1=file(to_keep, "rb")
in2=file(to_lose, "rb")
input1 = is_incry(in1)
input2 = is_incry(in2)
output = PdfFileWriter()
loop1=input1.getNumPages()
for i in range(0,loop1):
output.addPage(input1.getPage(i))#
loop2=input2.getNumPages()
for i in range(0,loop2):
output.addPage(input2.getPage(i))#
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
pdflen=loop1+loop2
in1.close()
in2.close()
os.remove(to_lose)
os.remove(to_keep)
os.rename("document-output.pdf",to_keep)
else:
os.rename(to_lose,to_keep)
in1=file(to_keep, "rb")
input1 = PdfFileReader(in1)
try:
pdflen=input1.getNumPages()
except:
input1.decrypt("")
pdflen=input1.getNumPages()
in1.close()
#input1.close()
return pdflen
def test():
import urllib
urllib.urlretrieve ('http://mayafiles.tase.co.il/RPdf/487001-488000/P487028-01.pdf', 'temp1.pdf')
urllib.urlretrieve ('http://mayafiles.tase.co.il/RPdf/488001-489000/P488170-00.pdf', 'temp2.pdf')
merg_pdf('temp1.pdf','temp2.pdf')
test()
I thank anyone that even took the time to read this.
Al.
I once wrote a complex PDF generation/merging stuff which I have now open-sourced.
You can have a look at it: https://github.com/becomingGuru/nikecup/blob/master/reg/models.py#L71
def merge_pdf(self):
from pyPdf import PdfFileReader,PdfFileWriter
pdf_file = file_names['main_pdf']%settings.MEDIA_ROOT
pdf_obj = PdfFileReader(open(pdf_file))
values_page = PdfFileReader(open(self.make_pdf())).getPage(0)
mergepage = pdf_obj.pages[0]
mergepage.mergePage(values_page)
signed_pdf = PdfFileWriter()
for page in pdf_obj.pages:
signed_pdf.addPage(page)
signed_pdf_name = file_names['dl_done']%(settings.MEDIA_ROOT,self.phash)
signed_pdf_file = open(signed_pdf_name,mode='wb')
signed_pdf.write(signed_pdf_file)
signed_pdf_file.close()
return signed_pdf_name
It then works like a charm. Hope it helps.
I tried the documents with pyPdf - it looks like both the pdfs in this document are encrypted, and a blank password ("") is not valid.
Looking at the security settings in adobe, you're allowed to print and copy - however, when you run pyPdf's input.decrypt(""), the files are still encrypted (as in, input.getNumPages() after this still returns 0).
I was able to open this in OSX's preview, and save the pdfs without encryption, and then the assembly works fine. pyPdf deals in pages though, so I don't think you can do this through pyPdf. Either find the correct password, or perhaps use another application, such as using a batch job to OpenOffice, or another pdf plugin would work.

.doc to pdf using python

I'am tasked with converting tons of .doc files to .pdf. And the only way my supervisor wants me to do this is through MSWord 2010. I know I should be able to automate this with python COM automation. Only problem is I dont know how and where to start. I tried searching for some tutorials but was not able to find any (May be I might have, but I don't know what I'm looking for).
Right now I'm reading through this. Dont know how useful this is going to be.
A simple example using comtypes, converting a single file, input and output filenames given as commandline arguments:
import sys
import os
import comtypes.client
wdFormatPDF = 17
in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])
word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
You could also use pywin32, which would be the same except for:
import win32com.client
and then:
word = win32com.client.Dispatch('Word.Application')
You can use the docx2pdf python package to bulk convert docx to pdf. It can be used as both a CLI and a python library. It requires Microsoft Office to be installed and uses COM on Windows and AppleScript (JXA) on macOS.
from docx2pdf import convert
convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")
pip install docx2pdf
docx2pdf input.docx output.pdf
Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf
I have tested many solutions but no one of them works efficiently on Linux distribution.
I recommend this solution :
import sys
import subprocess
import re
def convert_to(folder, source, timeout=None):
args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]
process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
filename = re.search('-> (.*?) using filter', process.stdout.decode())
return filename.group(1)
def libreoffice_exec():
# TODO: Provide support for more platforms
if sys.platform == 'darwin':
return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
return 'libreoffice'
and you call your function:
result = convert_to('TEMP Directory', 'Your File', timeout=15)
All resources:
https://michalzalecki.com/converting-docx-to-pdf-using-python/
I have worked on this problem for half a day, so I think I should share some of my experience on this matter. Steven's answer is right, but it will fail on my computer. There are two key points to fix it here:
(1). The first time when I created the 'Word.Application' object, I should make it (the word app) visible before open any documents. (Actually, even I myself cannot explain why this works. If I do not do this on my computer, the program will crash when I try to open a document in the invisible model, then the 'Word.Application' object will be deleted by OS. )
(2). After doing (1), the program will work well sometimes but may fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" means that the COM Server may not be able to response so quickly. So I add a delay before I tried to open a document.
After doing these two steps, the program will work perfectly with no failure anymore. The demo code is as below. If you have encountered the same problems, try to follow these two steps. Hope it helps.
import os
import comtypes.client
import time
wdFormatPDF = 17
# absolute path is needed
# be careful about the slash '\', use '\\' or '/' or raw string r"..."
in_file=r'absolute path of input docx file 1'
out_file=r'absolute path of output pdf file 1'
in_file2=r'absolute path of input docx file 2'
out_file2=r'absolute path of outputpdf file 2'
# print out filenames
print in_file
print out_file
print in_file2
print out_file2
# create COM object
word = comtypes.client.CreateObject('Word.Application')
# key point 1: make word visible before open a new document
word.Visible = True
# key point 2: wait for the COM Server to prepare well.
time.sleep(3)
# convert docx file 1 to pdf file 1
doc=word.Documents.Open(in_file) # open docx file 1
doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 1
word.Visible = False
# convert docx file 2 to pdf file 2
doc = word.Documents.Open(in_file2) # open docx file 2
doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
doc.Close() # close docx file 2
word.Quit() # close Word Application
unoconv (writen in Python) and OpenOffice running as a headless daemon.
https://github.com/unoconv/unoconv
http://dag.wiee.rs/home-made/unoconv/
Works very nicely for doc, docx, ppt, pptx, xls, xlsx.
Very useful if you need to convert docs or save/convert to certain formats on a server.
As an alternative to the SaveAs function, you could also use ExportAsFixedFormat which gives you access to the PDF options dialog you would normally see in Word. With this you can specify bookmarks and other document properties.
doc.ExportAsFixedFormat(OutputFileName=pdf_file,
ExportFormat=17, #17 = PDF output, 18=XPS output
OpenAfterExport=False,
OptimizeFor=0, #0=Print (higher res), 1=Screen (lower res)
CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
DocStructureTags=True
);
The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'
It's worth noting that Stevens answer works, but make sure if using a for loop to export multiple files to place the ClientObject or Dispatch statements before the loop - it only needs to be created once - see my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs
If you don't mind using PowerShell have a look at this Hey, Scripting Guy! article. The code presented could be adopted to use the wdFormatPDF enumeration value of WdSaveFormat (see here).
This blog article presents a different implementation of the same idea.
I have modified it for ppt support as well. My solution support all the below-specified extensions.
word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]
My Solution: Github Link
I have modified code from Docx2PDF
I tried the accepted answer but wasn't particularly keen on the bloated PDFs Word was producing which was usually an order of magnitude bigger than expected. After looking how to disable the dialogs when using a virtual PDF printer I came across Bullzip PDF Printer and I've been rather impressed with its features. It's now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.
The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file which is used for one print job only and then removed automatically. When printing multiple PDFs we need to make sure one print job completes before starting another to ensure the settings are used correctly for each file.
import os, re, time, datetime, win32com.client
def print_to_Bullzip(file):
util = win32com.client.Dispatch("Bullzip.PDFUtil")
settings = win32com.client.Dispatch("Bullzip.PDFSettings")
settings.PrinterName = util.DefaultPrinterName # make sure we're controlling the right PDF printer
outputFile = re.sub("\.[^.]+$", ".pdf", file)
statusFile = re.sub("\.[^.]+$", ".status", file)
settings.SetValue("Output", outputFile)
settings.SetValue("ConfirmOverwrite", "no")
settings.SetValue("ShowSaveAS", "never")
settings.SetValue("ShowSettings", "never")
settings.SetValue("ShowPDF", "no")
settings.SetValue("ShowProgress", "no")
settings.SetValue("ShowProgressFinished", "no") # disable balloon tip
settings.SetValue("StatusFile", statusFile) # created after print job
settings.WriteSettings(True) # write settings to the runonce.ini
util.PrintFile(file, util.DefaultPrinterName) # send to Bullzip virtual printer
# wait until print job completes before continuing
# otherwise settings for the next job may not be used
timestamp = datetime.datetime.now()
while( (datetime.datetime.now() - timestamp).seconds < 10):
if os.path.exists(statusFile) and os.path.isfile(statusFile):
error = util.ReadIniString(statusFile, "Status", "Errors", '')
if error != "0":
raise IOError("PDF was created with errors")
os.remove(statusFile)
return
time.sleep(0.1)
raise IOError("PDF creation timed out")
I was working with this solution but I needed to search all .docx, .dotm, .docm, .odt, .doc or .rtf and then turn them all to .pdf (python 3.7.5). Hope it works...
import os
import win32com.client
wdFormatPDF = 17
for root, dirs, files in os.walk(r'your directory here'):
for f in files:
if f.endswith(".doc") or f.endswith(".odt") or f.endswith(".rtf"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-4]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
elif f.endswith(".docx") or f.endswith(".dotm") or f.endswith(".docm"):
try:
print(f)
in_file=os.path.join(root,f)
word = win32com.client.Dispatch('Word.Application')
word.Visible = False
doc = word.Documents.Open(in_file)
doc.SaveAs(os.path.join(root,f[:-5]), FileFormat=wdFormatPDF)
doc.Close()
word.Quit()
word.Visible = True
print ('done')
os.remove(os.path.join(root,f))
pass
except:
print('could not open')
# os.remove(os.path.join(root,f))
else:
pass
The try and except was for those documents I couldn't read and won't exit the code until the last document.
You should start from investigating so called virtual PDF print drivers.
As soon as you will find one you should be able to write batch file that prints your DOC files into PDF files. You probably can do this in Python too (setup printer driver output and issue document/print command in MSWord, later can be done using command line AFAIR).
import docx2txt
from win32com import client
import os
files_from_folder = r"c:\\doc"
directory = os.fsencode(files_from_folder)
amount = 1
word = client.DispatchEx("Word.Application")
word.Visible = True
for file in os.listdir(directory):
filename = os.fsdecode(file)
print(filename)
if filename.endswith('docx'):
text = docx2txt.process(os.path.join(files_from_folder, filename))
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
elif filename.endswith('doc'):
doc = word.Documents.Open(os.path.join(files_from_folder, filename))
text = doc.Range().Text
doc.Close()
print(f'{filename} transfered ({amount})')
amount += 1
new_filename = filename.split('.')[0] + '.txt'
try:
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
except:
os.mkdir(files_from_folder + r'\txt_files')
with open(os.path.join(files_from_folder + r'\txt_files', new_filename), 'w', encoding='utf-8') as t:
t.write(text)
word.Quit()
The Source Code, see here:
https://neculaifantanaru.com/en/python-full-code-how-to-convert-doc-and-docx-files-to-pdf-from-the-folder.html
I would suggest ignoring your supervisor and use OpenOffice which has a Python api. OpenOffice has built in support for Python and someone created a library specific for this purpose (PyODConverter).
If he isn't happy with the output, tell him it could take you weeks to do it with word.

Categories