I would like to print a PDF file with an external printer. However, since I'm about to open, create or transform multiple files in some loop, I would like to print the thing without the need of saving it as a PDF file in every iteration.
Simplified code looks like this:
import PyPDF2
import os
pdf_in = open('tubba.pdf', 'rb')
pdf_reader = PyPDF2.PdfFileReader(pdf_in)
pdf_writer = PyPDF2.PdfFileWriter()
page = pdf_reader.getPage(0)
page.rotateClockwise(90)
# Some other operations done on the page, such as scaling, cropping etc.
pdf_writer.addPage(page)
pdf_out = open('rotated.pdf', 'wb')
pdf_writer.write(pdf_out)
pdf_out.close()  # flush the file to disk before handing it to the printer
pdf_in.close()
os.startfile('rotated.pdf', 'print')  # Windows only
Is there any way to print "page", or "pdf_writer"?
Best regards
You can just use variables.
E.g.:
path = r'C:\yourfile.pdf'
os.startfile(path)  # just pass the variable here
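Note that os.startfile needs a path on disk, so an in-memory PDF still has to touch the filesystem at some point. A minimal sketch, assuming Windows (os.startfile does not exist elsewhere) and assuming the PDF bytes come from something like pdf_writer.write(buffer) on an io.BytesIO. The helper names write_temp_pdf and print_pdf_bytes are made up for illustration:

```python
import os
import tempfile


def write_temp_pdf(pdf_bytes):
    """Write PDF bytes to a temporary file and return its path."""
    # delete=False keeps the file after close, so the print handler can open it.
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        tmp.write(pdf_bytes)
        return tmp.name


def print_pdf_bytes(pdf_bytes):
    """Send in-memory PDF bytes to the default printer (Windows only)."""
    path = write_temp_pdf(pdf_bytes)
    os.startfile(path, 'print')  # returns as soon as the handler is launched
    return path  # the caller removes the file once the job has been spooled
```

With the objects from the question, this would be called as print_pdf_bytes(buffer.getvalue()) after pdf_writer.write(buffer). The temporary file cannot be deleted right away because printing happens asynchronously.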
I believe this is my first StackOverflow question, so please be nice.
I am OCRing a repository of PDFs (~1GB in total) ranging from 50-200 pages each and found that suddenly all of the 100GB of remaining hard drive space on my MacBook Pro was gone. Based on a previous post, it seems that ImageMagick is the culprit, as shown here.
I found that these files are called 'magick-*' and are stored in /private/var/tmp. For only 23 PDFs it had created 3576 files totaling 181GB.
How can I delete these files immediately within the code after they are no longer needed? Thank you in advance for any suggestions to remedy this issue.
Here is the code:
import io, os
import json
import unicodedata
from PIL import Image as PI
import pyocr
import pyocr.builders
from wand.image import Image
from tqdm import tqdm
# Where you want to save the PDFs
destination_folder = 'contract_data/Contracts_Backlog/'
pdfs = [unicodedata.normalize('NFKC',f.decode('utf8')) for f in os.listdir(destination_folder) if f.lower().endswith('.pdf')]
txt_files = [unicodedata.normalize('NFKC',f.decode('utf8')) for f in os.listdir(destination_folder) if f.lower().endswith('.txt')]
### Perform OCR on PDFs
def ocr_pdf_to_text(filename):
    tool = pyocr.get_available_tools()[0]
    lang = 'spa'
    req_image = []
    final_text = []
    image_pdf = Image(filename=filename, resolution=300)
    image_jpeg = image_pdf.convert('jpeg')
    for img in image_jpeg.sequence:
        img_page = Image(image=img)
        req_image.append(img_page.make_blob('jpeg'))
    for img in req_image:
        txt = tool.image_to_string(
            PI.open(io.BytesIO(img)),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        final_text.append(txt)
    return final_text
for filename in tqdm(pdfs):
    txt_file = filename[:-3] +'txt'
    txt_filename = destination_folder + txt_file
    if not txt_file in txt_files:
        print 'Converting ' + filename
        try:
            ocr_txt = ocr_pdf_to_text(destination_folder + filename)
            with open(txt_filename,'w') as f:
                for i in range(len(ocr_txt)):
                    f.write(json.dumps({i:ocr_txt[i].encode('utf8')}))
                    f.write('\n')
            f.close()
        except:
            print "Could not OCR " + filename
A hacky way of dealing with this was to add an os.remove() statement within the main loop to remove the tmp files after creation.
tempdir = '/private/var/tmp/'
files = os.listdir(tempdir)
for file in files:
    if "magick" in file:
        os.remove(os.path.join(tempdir,file))
Image should be used as a context manager, because Wand has to determine when to dispose of resources such as temporary files and in-memory buffers. A with block tells Wand the boundaries: when these Image objects are still needed and when they no longer are.
See also the official docs.
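Applied to the question's ocr_pdf_to_text, that advice could look like the sketch below. It uses only the Wand calls already present in the question. The function name pdf_to_jpeg_blobs is invented here, and whether this alone keeps /private/var/tmp from filling up has not been verified.

```python
def pdf_to_jpeg_blobs(filename, resolution=300):
    """Render each PDF page to a JPEG blob, closing Wand images as we go."""
    # Imported lazily so that defining the function does not require Wand.
    from wand.image import Image

    blobs = []
    with Image(filename=filename, resolution=resolution) as image_pdf:
        with image_pdf.convert('jpeg') as image_jpeg:
            for img in image_jpeg.sequence:
                # Each per-page image is released at the end of its with block.
                with Image(image=img) as img_page:
                    blobs.append(img_page.make_blob('jpeg'))
    return blobs
```

The returned list can replace req_image in the OCR loop, since each entry is the same kind of JPEG blob that make_blob produced before.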
I am having trouble creating and writing to a text file in Python. I am running Python 3.5.1 and have the following code to try and create and write to a file:
from os import *
custom_path = "MyDirectory/"
if not path.exists(custom_path):
    mkdir(custom_path)
text_path = custom_path + "MyTextFile.txt"
text_file = open(text_path, "w")
text_file.write("my text")
But I get a TypeError saying an integer is required (got type str) at the line text_file = open(text_path, "w").
I don't know what I'm doing wrong as my code is just about identical to that of several tutorial sites showing how to create and write to files.
Also, does the above code create the text file if it doesn't exist, and if not how do I create it?
Please don't import everything from os module:
from os import path, mkdir
custom_path = "MyDirectory/"
if not path.exists(custom_path):
    mkdir(custom_path)
text_path = custom_path + "MyTextFile.txt"
text_file = open(text_path, 'w')
text_file.write("my text")
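This is because the os module also has an "open" function (os.open), and "from os import *" makes it shadow the built-in "open". os.open expects integer flags as its second argument, hence the TypeError.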
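To answer the second part of the question: yes, mode "w" creates the file if it does not exist. A short sketch of the same task with a plain import os, so the built-in open is never shadowed. write_text is a hypothetical helper name:

```python
import os


def write_text(directory, name, text):
    """Create the directory if needed, then write text to directory/name."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    text_path = os.path.join(directory, name)
    # Built-in open(); mode "w" creates the file or truncates an existing one.
    with open(text_path, "w") as text_file:
        text_file.write(text)
    return text_path
```

For example, write_text("MyDirectory", "MyTextFile.txt", "my text") does what the question's code attempts. The with block also closes the file, which the original never does.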
When I print a PDF from any of my source PDFs, the file size drops and the text boxes present in the form are removed. In short, printing flattens the file.
This is the behavior I want to achieve.
The following code creates a PDF using another PDF as a source (the one I want to flatten), but it writes the form's text boxes as well.
Can I get a PDF without the text boxes, i.e. flatten it? Just like Adobe does when I print a PDF as a PDF.
My other code looks something like this minus some things:
import os
import StringIO
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
directory = os.path.join(os.getcwd(), "source") # dir we are interested in
fif = [f for f in os.listdir(directory) if f[-3:] == 'pdf'] # get the PDFs
for i in fif:
    packet = StringIO.StringIO()
    can = canvas.Canvas(packet, pagesize=letter)
    can.rotate(-90)
    can.save()
    packet.seek(0)
    new_pdf = PdfFileReader(packet)
    fname = os.path.join('source', i)
    existing_pdf = PdfFileReader(file(fname, "rb"))
    output = PdfFileWriter()
    nump = existing_pdf.getNumPages()
    page = existing_pdf.getPage(0)
    for l in range(nump):
        output.addPage(existing_pdf.getPage(l))
    page.mergePage(new_pdf.getPage(0))
    outputStream = file("out-"+i, "wb")
    output.write(outputStream)
    outputStream.close()
    print fname + " written as", i
Summing up: I have a pdf, I add a text box to it, covering up info and adding new info, and then I print a pdf from that pdf. The text box becomes not editable or moveable any longer. I wanted to automate that process but everything I tried still allowed that text box to be editable.
If installing an OS package is an option, then you could use pdftk with its python wrapper pypdftk like this:
import pypdftk
pypdftk.fill_form('filled.pdf', out_file='flattened.pdf', flatten=True)
You would also need to install the pdftk package, which on Ubuntu could be done like this:
sudo apt-get install pdftk
The pypdftk library can be downloaded from PyPI:
pip install pypdftk
Update: pdftk was briefly removed from Ubuntu in version 18.04, but it seems it is back since 20.04.
A simple, if roundabout, way is to convert the PDF to images and then put those images into a new PDF.
You'll need pdf2image and PIL, like so:
from pdf2image import convert_from_path
from PIL import Image
images = convert_from_path('temp.pdf')
im1 = images[0]
images.pop(0)
pdf1_filename = "flattened.pdf"
im1.save(pdf1_filename, "PDF" ,resolution=100.0, save_all=True, append_images=images)
Edit:
I created a library to do this called fillpdf
pip install fillpdf
from fillpdf import fillpdfs
fillpdfs.flatten_pdf('input.pdf', 'newflat.pdf')
Per the Adobe docs, you can set bit position 1 of an editable form field's flags (/Ff) to make the field read-only. I provided a full solution here, but it uses Django:
https://stackoverflow.com/a/55301804/8382028
Adobe Docs (page 552):
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
Use PyPDF2 to fill the fields, then loop through the annotations to change the bit position:
from io import BytesIO
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, NumberObject
# open the pdf
input_stream = open("YourPDF.pdf", "rb")
reader = PdfFileReader(input_stream, strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
    reader.trailer["/Root"]["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )
writer = PdfFileWriter()
writer.set_need_appearances_writer()
if "/AcroForm" in writer._root_object:
    # Acro form is form field, set needs appearances to fix printing issues
    writer._root_object["/AcroForm"].update(
        {NameObject("/NeedAppearances"): BooleanObject(True)}
    )
data_dict = dict() # this is a dict of your form values
writer.addPage(reader.getPage(0))
page = writer.getPage(0)
# update form fields
writer.updatePageFormFieldValues(page, data_dict)
for j in range(0, len(page["/Annots"])):
    writer_annot = page["/Annots"][j].getObject()
    for field in data_dict:
        if writer_annot.get("/T") == field:
            # make ReadOnly:
            writer_annot.update({NameObject("/Ff"): NumberObject(1)})
output_stream = BytesIO()
writer.write(output_stream)
# output_stream is your flattened PDF
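One caveat about NumberObject(1): /Ff is a bit field, so assigning 1 sets ReadOnly but also clears any other flags the field already had. A sketch of the arithmetic in plain Python, assuming the flag positions from the PDF reference's field-flag tables (ReadOnly at bit position 1, Multiline at 13). with_read_only is a made-up helper:

```python
READ_ONLY = 1 << 0   # bit position 1 in the PDF reference's numbering
REQUIRED = 1 << 1    # bit position 2
NO_EXPORT = 1 << 2   # bit position 3
MULTILINE = 1 << 12  # bit position 13 (text fields)


def with_read_only(field_flags):
    """Return field_flags with the ReadOnly bit set, other bits untouched."""
    return field_flags | READ_ONLY


# A multiline text field keeps its Multiline flag when made read-only.
assert with_read_only(MULTILINE) == MULTILINE | READ_ONLY
```

In the loop above, this would presumably mean reading the current value with int(writer_annot.get("/Ff", 0)) and writing back NumberObject(with_read_only(...)) instead of a bare NumberObject(1).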
A solution that works on Windows as well, converts multi-page PDFs and flattens the checkbox values too. For some reason @ViaTech's code did not work on my PC (Windows 7, Python 3.8).
I followed @ViaTech's suggestions and made extensive use of @hchillon's code from this post.
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, TextStringObject, NumberObject
def set_need_appearances_writer(writer):
    try:
        catalog = writer._root_object
        # get the AcroForm tree and add "/NeedAppearances attribute
        if "/AcroForm" not in catalog:
            writer._root_object.update({
                NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
        need_appearances = NameObject("/NeedAppearances")
        writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
        return writer
    except Exception as e:
        print('set_need_appearances_writer() catch : ', repr(e))
        return writer
class PdfFileFiller(object):
    def __init__(self, infile):
        self.pdf = PdfFileReader(open(infile, "rb"), strict=False)
        if "/AcroForm" in self.pdf.trailer["/Root"]:
            self.pdf.trailer["/Root"]["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})
    # newvals and newchecks have keys have to be filled. '' is not accepted
    def update_form_values(self, outfile, newvals=None, newchecks=None):
        self.pdf2 = MyPdfFileWriter()
        trailer = self.pdf.trailer['/Root'].get('/AcroForm', None)
        if trailer:
            self.pdf2._root_object.update({
                NameObject('/AcroForm'): trailer})
        set_need_appearances_writer(self.pdf2)
        if "/AcroForm" in self.pdf2._root_object:
            self.pdf2._root_object["/AcroForm"].update(
                {NameObject("/NeedAppearances"): BooleanObject(True)})
        for i in range(self.pdf.getNumPages()):
            self.pdf2.addPage(self.pdf.getPage(i))
            self.pdf2.updatePageFormFieldValues(self.pdf2.getPage(i), newvals)
            for j in range(0, len(self.pdf.getPage(i)['/Annots'])):
                writer_annot = self.pdf.getPage(i)['/Annots'][j].getObject()
                for field in newvals:
                    writer_annot.update({NameObject("/Ff"): NumberObject(1)})
            self.pdf2.updatePageFormCheckboxValues(self.pdf2.getPage(i), newchecks)
        with open(outfile, 'wb') as out:
            self.pdf2.write(out)
class MyPdfFileWriter(PdfFileWriter):
    def __init__(self):
        super().__init__()
    def updatePageFormCheckboxValues(self, page, fields):
        for j in range(0, len(page['/Annots'])):
            writer_annot = page['/Annots'][j].getObject()
            for field in fields:
                writer_annot.update({NameObject("/Ff"): NumberObject(1)})
origin = 'input.pdf'  # placeholder: put the input pdf path here
destination = 'output.pdf'  # placeholder: put the output pdf path here, even if the file does not exist yet
newchecks = {} # A dict with all checkbox values that need to be changed
newvals = {'':''} # A dict with all entry values that need to be changed
# newvals dict has to be equal to {'':''} in case that no changes are needed
c = PdfFileFiller(origin)
c.update_form_values(outfile=destination, newvals=newvals, newchecks=newchecks)
print('PDF has been created\n')
I had trouble flattening a form that I had entered content into using pdfrw (How to Populate Fillable PDF's with Python) and found that I had to add an additional step using generate_fdf (pdftk flatten loses fillable field data).
os.system('pdftk '+outtemp+' generate_fdf output '+outfdf)
os.system('pdftk '+outtemp+' fill_form '+outfdf+' output '+outpdf)
I came to this solution because I was able to flatten a file just fine using Ghostscript's pdf2ps followed by ps2pdf on my Mac, but the output had low resolution when I ran the same steps on an Amazon Linux instance. I couldn't figure out why, so I moved to the pdftk solution.
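As a side note, os.system with string concatenation breaks on paths that contain spaces. A sketch of the same two steps with subprocess and argument lists. The function names are invented, and the trailing flatten (a pdftk output option) is an assumption, since the commands as posted do not include it:

```python
import subprocess


def pdftk_commands(filled_pdf, fdf_path, flat_pdf):
    """Build the two pdftk invocations as argument lists (no shell quoting)."""
    generate = ['pdftk', filled_pdf, 'generate_fdf', 'output', fdf_path]
    # 'flatten' is an assumption added here; the posted commands omit it.
    fill = ['pdftk', filled_pdf, 'fill_form', fdf_path,
            'output', flat_pdf, 'flatten']
    return generate, fill


def flatten_with_pdftk(filled_pdf, fdf_path, flat_pdf):
    """Run both steps, raising CalledProcessError if pdftk fails."""
    for command in pdftk_commands(filled_pdf, fdf_path, flat_pdf):
        subprocess.run(command, check=True)
```

With the variables from the snippet above, the call would be flatten_with_pdftk(outtemp, outfdf, outpdf).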
So I'm a beginning programmer, and python is my first language. I'm trying to write a script that will open a random PDF from a directory and select a random page from that PDF to read. When I run my script I get the error code IO ERROR: [Errno 2] and then displays the title of the selected PDF. How can I fix this? I am using the pyPdf module. Are there any other problems in the code you can see?
import os, random, pyPdf
from pyPdf import PdfFileReader
b = random.choice(os.listdir("/home/illtic/PDF"))
pdf_toread = pyPdf.PdfFileReader(open(b, 'r'))
last_page = pdf_toread.getNumPages() - 1
page_one = pdf_toread.getPage(random.randint(0, last_page))
print " %d " % page_one
What value does b have? I am pretty sure that it is just the filename without the path, because os.listdir returns bare names. Try adding the path in front of the filename and it should be OK.
pdf_toread = pyPdf.PdfFileReader(open('/home/illtic/PDF/' + b, 'r'))
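A slightly more defensive sketch of the same idea, using only the standard library. random_pdf_path is a hypothetical helper, and the filter on the .pdf extension is an assumption about what the directory contains:

```python
import os
import random


def random_pdf_path(directory):
    """Return the full path of a randomly chosen PDF in directory."""
    names = [n for n in os.listdir(directory) if n.lower().endswith('.pdf')]
    if not names:
        raise ValueError('no PDF files in ' + directory)
    # os.listdir returns bare file names, so join them back onto the directory.
    return os.path.join(directory, random.choice(names))
```

The result can be passed to open(path, 'rb'). Binary mode is the safer choice for PDF files.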