Add custom metadata to a PDF using Python

I want to add custom metadata to a PDF file. This can be done with the PyPDF2 or pdfrw library. I followed the question "Change metadata of pdf file with pypdf2", and that solution works fine when there is no space in the attribute name.
In my case, the metadata looks like this:
meta = {'Page description' : 'description',
        'create time' : '123455' }
When I try to add the above metadata like this:
from PyPDF2 import PdfFileReader, PdfFileWriter
reader = PdfFileReader(filename)
writer = PdfFileWriter()
writer.appendPagesFromReader(reader)
writer.addMetadata(meta)
with open(filename, 'wb') as fout:
    writer.write(fout)
and then look at the custom properties in the PDF's document properties, nothing is displayed.
The same code works fine if the attribute names contain no spaces.

I do not know about PyPDF2, but for pdfrw you should use PdfName('Page description') as your dictionary key. This will properly encode the space.
Disclaimer: I am the primary pdfrw author.
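For anyone wondering what "encode the space" means here: in the PDF file format, a name may not contain a literal space, so it is written as #20 (a # followed by the two-digit hex code). Below is a rough, stdlib-only sketch of that escaping rule. The helper encode_pdf_name is made up for illustration and is not part of pdfrw or PyPDF2; it only shows what the key looks like once it is written to the file.

```python
def encode_pdf_name(name: str) -> str:
    """Return `name` as a PDF name token, escaping unsafe bytes as #XX.

    Hypothetical helper for illustration only (not a pdfrw/PyPDF2 API).
    """
    # Whitespace and delimiter characters may not appear literally in a name;
    # '#' is the escape character itself, so it gets escaped too.
    unsafe = set(b"\x00\t\n\x0c\r ()<>[]{}/%#")
    out = ["/"]
    for byte in name.encode("utf-8"):
        if byte in unsafe or byte < 0x21 or byte > 0x7E:
            out.append("#%02X" % byte)
        else:
            out.append(chr(byte))
    return "".join(out)


print(encode_pdf_name("Page description"))  # /Page#20description
print(encode_pdf_name("create time"))       # /create#20time
```

A key that is written with the raw space would be read back as a name ending at the space, which would be consistent with the custom property not showing up.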

Related

Remove OCG layers from PDF with Python

Is there a way to delete an OCG layer from a PDF within Python?
I normally work with pymupdf but couldn't find the functionality there. Is there any other library with this functionality?
Disclaimer: I am the author of borb, the library mentioned in this answer.
borb will turn any input PDF into a JSON-like data structure. If you know what to delete in the content-tree, you can simply remove that item from the dictionary as you would in a normal dictionary object.
Reading the Document is easy:
from borb.pdf import PDF
with open("input.pdf", "rb") as in_file_handle:
    document = PDF.loads(in_file_handle)
You need document["XRef"]["Trailer"]["Root"]["OCGs"] which will be a List of layers. Remove whatever element(s) you want.
If you then store the PDF, the layer will be gone.
with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, document)
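To make the "remove it as you would from a normal dictionary" step a bit more concrete, here is a small sketch that uses plain Python dicts and lists as a stand-in for the parsed document. This is not borb's real object model: the nesting simply mirrors the path given above, the layer names are invented, and the "Name" key is an assumption based on how OCG dictionaries are described in the PDF format.

```python
# Plain dicts/lists standing in for the parsed document. This is NOT borb's
# real object model, and the layer names are made up for illustration.
document = {
    "XRef": {"Trailer": {"Root": {"OCGs": [
        {"Type": "OCG", "Name": "Watermark"},
        {"Type": "OCG", "Name": "Annotations"},
    ]}}}
}


def remove_layers(doc, names_to_drop):
    """Delete every layer whose Name is in names_to_drop; return how many were removed."""
    layers = doc["XRef"]["Trailer"]["Root"]["OCGs"]
    kept = [layer for layer in layers if layer.get("Name") not in names_to_drop]
    removed = len(layers) - len(kept)
    layers[:] = kept  # mutate in place so the document sees the change
    return removed


print(remove_layers(document, {"Watermark"}))  # 1
```

With a real file you would inspect the list first to see what each layer entry actually looks like before deciding what to drop.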

CSV file with some empty rows in Python with Tkintertable

I fill in the table with some data and then export it to a CSV file. OK so far. But when I open the CSV file, I notice empty rows between the rows I filled in.
How can I solve this? Run the code, fill in the table, export, and look at the file to see what I mean.
from tkintertable import TableCanvas, TableModel
from tkinter import *
root=Tk()
t_frame=Frame(root)
t_frame.pack(fill='both', expand=True)
table = TableCanvas(t_frame, cellwidth=60, thefont=('Arial',12),rowheight=18, rowheaderwidth=30,
                    rowselectedcolor='yellow', editable=True)
table.show()
root.mainloop()
Look at the empty rows in the CSV file.
If you open the CSV file in Notepad or another basic text editor, you will see that the data is saved with a blank line between each row of data. Because Excel treats every line of a CSV as a row, it will always import those blank lines as empty rows.
The only things you can really do are to remove the blank lines from the CSV manually, build a macro to clean it up after the fact, or edit the tkintertable library where the issue occurs.
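If you go the clean-up-after-the-fact route, you don't necessarily need an Excel macro; a few lines of Python with the csv module can do it. This is only a sketch: drop_blank_rows is a name I made up, and the demo runs on in-memory text rather than on a real exported file.

```python
import csv
import io


def drop_blank_rows(src, dst):
    """Copy CSV rows from src to dst, skipping rows that have no non-empty cell."""
    writer = csv.writer(dst, lineterminator="\n")
    kept = 0
    for row in csv.reader(src):
        if any(cell.strip() for cell in row):
            writer.writerow(row)
            kept += 1
    return kept


# In-memory demo; the text mimics an export with a blank line after every row.
exported = io.StringIO("a,b\n\n1,2\n\n3,4\n\n")
cleaned = io.StringIO()
print(drop_blank_rows(exported, cleaned))  # 3
print(repr(cleaned.getvalue()))            # 'a,b\n1,2\n3,4\n'
```

With real files, open both the input and the output with newline='' as the csv module documentation advises.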
The Usage page of the tkintertable GitHub repository does not appear to have any details on exporting to CSV that would fix this issue. It seems to be a problem with how this library writes data.
Even if you add a Button that runs a command to export, you get the same problem.
def export_to_csv():
    table.exportTable()
Button(root, text='test CSV', command=export_to_csv).pack()
UPDATE:
After some digging I realized that tkintertable uses the csv library to write the file, so I searched for the same problem with the csv library and found this post (csv.write skipping lines when writing to csv). With that post in hand, I dug into the tkintertable library to find where the writing occurs, and here is what I found.
The blank lines between rows appear to be a result of how the csv library writes data. If you dig in, you will find that the tkintertable method that writes your table to CSV is called ExportTableData, and it lives in the TableExporter class. That class is as follows:
class TableExporter:
    def __init__(self):
        """Provides export utility methods for the Table and Table Model classes"""
        return
    def ExportTableData(self, table, sep=None):
        """Export table data to a comma separated file"""
        parent=table.parentframe
        filename = filedialog.asksaveasfilename(parent=parent,defaultextension='.csv',
                                                filetypes=[("CSV files","*.csv")] )
        if not filename:
            return
        if sep == None:
            sep = ','
        writer = csv.writer(open(filename, "w"), delimiter=sep)
        model=table.getModel()
        recs = model.getAllCells()
        #take column labels as field names
        colnames = model.columnNames
        collabels = model.columnlabels
        row=[]
        for c in colnames:
            row.append(collabels[c])
        writer.writerow(row)
        for row in recs.keys():
            print(row)
            writer.writerow(recs[row])
        return
If we update this line:
writer = csv.writer(open(filename, "w"), delimiter=sep)
to include lineterminator='\n', you will find that the problem goes away:
writer = csv.writer(open(filename, "w"), delimiter=sep, lineterminator = '\n')
Results:
Keep in mind that editing libraries is probably not a great idea, but for this situation I think it is your only option.
To get to this class, you will need to open the Python file called Tables_IO.py in the tkintertable library.
If you are using PyCharm you can navigate through the library files with CTRL+Left Click.
First Ctrl+Click on the import name . Then you will Ctrl+Click on the import name . Then you will search for a class method called exportTable and inside that method you will Ctrl+Click . This will take you to the above mentioned class that you can edit to solve the problem.
Please take care not to edit anything else as you can very easily break your library and will need to reinstall it if that happens.
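For the curious, here is a small stdlib-only sketch of why the blank rows would appear and why changing the terminator helps. By default csv.writer ends each row with '\r\n'; a file opened in text mode on Windows additionally turns every '\n' into '\r\n', so the row ends up terminated by '\r\r\n'. The helper csv_bytes is my own name, and the TextIOWrapper with newline='\r\n' is just a way to imitate Windows text mode on any platform.

```python
import csv
import io


def csv_bytes(file_newline, **writer_kwargs):
    """Write one row through a text layer using `file_newline`; return the raw bytes."""
    raw = io.BytesIO()
    # newline="\r\n" imitates a Windows text-mode file; newline="" disables translation.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=file_newline)
    csv.writer(text, **writer_kwargs).writerow(["a", "b"])
    text.flush()
    return raw.getvalue()


print(csv_bytes("\r\n"))                       # b'a,b\r\r\n'  (the extra \r shows up as a blank row)
print(csv_bytes("\r\n", lineterminator="\n"))  # b'a,b\r\n'    (the fix described above)
print(csv_bytes(""))                           # b'a,b\r\n'    (open(..., newline=''))
```

If that diagnosis applies to your setup, changing the call to open(filename, "w", newline='') should be another way to get the same result, and it is the form the csv module documentation recommends.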

How to create a PDF from a binary string?

A request has been made to the server using Python's requests module:
requests.get('myserver/pdf', headers=headers)
It returned a status-200 response that contains the PDF binary data in response.content.
Question
How does one create a PDF file from the response.content?
You can open a new file in binary write mode and write the response content straight into it, like this:
import requests
# Example path. This file has not been created yet, but we
# will use this as the location and name of the PDF in question.
path_to_create_pdf_with_name_of_pdf = r'C:/User/Oleg/MyDownloadablePdf.pdf'
# Anything you used before making the request goes here. Since you did not
# provide code, I do not know what you used.
# ...
# Pass the headers by keyword; the second positional argument of requests.get is params.
response = requests.get('myserver/pdf', headers=headers)
# Opening the path in 'wb' mode creates the (empty) file; then write the binary data to it.
with open(path_to_create_pdf_with_name_of_pdf, 'wb') as f:
    f.write(response.content)
An earlier version of this answer first created an empty PDF with reportlab's canvas.Canvas. That step is not needed: canvas.Canvas is for drawing a new PDF from scratch and only writes a file when you call its save() method, whereas open() creates the file by itself. You do need to provide the path, including the name you want the PDF to have.
Opening the file as wb (write binary) lets you write the content of the response to it unchanged, and the with block closes the file for you, so no explicit f.close() is needed.
When choosing the path, make sure that the name is not the name of an existing file, because opening a file in 'wb' mode overwrites whatever is already there. If you are doing this in a loop, for example, you will need to give the path a new name on each iteration so that you get a new PDF each time. For a one-off download you do not run that risk, as long as the name is not that of another file.
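If you do end up saving several responses in a loop, one possible way to get a fresh name each time is to probe for the next free file name before writing. This is just a sketch: next_free_path and save_bytes are names I made up, the output goes to a temporary directory, and the byte strings are stand-ins for response.content.

```python
import tempfile
from pathlib import Path


def next_free_path(directory: Path, stem: str, suffix: str = ".pdf") -> Path:
    """Return directory/stem.pdf, or stem_1.pdf, stem_2.pdf, ... if that name is taken."""
    candidate = directory / f"{stem}{suffix}"
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def save_bytes(content: bytes, directory: Path, stem: str) -> Path:
    """Write content to the next unused file name and return the path used."""
    path = next_free_path(directory, stem)
    with open(path, "wb") as f:
        f.write(content)
    return path


out_dir = Path(tempfile.mkdtemp())
payloads = [b"%PDF-1.4 first", b"%PDF-1.4 second"]  # stand-ins for response.content
saved = [save_bytes(data, out_dir, "report") for data in payloads]
print([p.name for p in saved])  # ['report.pdf', 'report_1.pdf']
```

Note that the check and the write are two separate steps, so this is fine for a single script but not a guarantee if several processes write to the same folder at once.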

How can I create a word (.docx) document if not found using python and write in it?

How can I create a word (.docx) document if not found using python and write in it?
I certainly cannot do either of the following:
file = open(file_name, 'r')
file = open(file_name, 'w')
or, to create or append if found:
f = open(file_name, 'a+')
Also I cannot find any related info in python-docx documentation at:
https://python-docx.readthedocs.io/en/latest/
NOTE:
I need to create an automated report via python with text and pie charts, graphs etc.
Probably the safest way to open a new file for writing is using 'xb' mode. 'x' will raise a FileExistsError if the file is already there, so an existing file is never truncated by accident. 'b' is necessary because a Word document is fundamentally a binary file: it's a zip archive with XML and other files inside it. A zip archive gets corrupted if its bytes are passed through a character encoding.
Document.save accepts streams, so you can pass in a file object opened like that to save your document.
Your work-flow could be something like this:
import docx
doc = docx.Document(...)
...
# Make your document
...
with open('outfile.docx', 'xb') as f:
    doc.save(f)
It's a good idea to use with blocks instead of raw open to ensure the file gets closed properly even in case of an error.
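To see what 'xb' buys you, here is a small stdlib-only sketch. It writes plain bytes rather than a real Word document, the path lives in a temporary directory, and write_new is a helper name I made up; the point is only that the second attempt fails instead of overwriting.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outfile.docx")


def write_new(path, data):
    """Create path and write data; return False instead of overwriting an existing file."""
    try:
        with open(path, "xb") as f:
            f.write(data)
        return True
    except FileExistsError:
        return False


print(write_new(path, b"first"))   # True
print(write_new(path, b"second"))  # False
with open(path, "rb") as f:
    print(f.read())                # b'first'
```

In the workflow above, the f.write(data) call would be replaced by doc.save(f).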
In the same way that you can't simply write to a Word file directly, you can't append to it either. The way to "append" is to open the file, load the Document object, and then write it back, overwriting the original content. Since the word file is a zip archive, it's very likely that appended text won't even be at the end of the XML file it's in, much less the whole docx file:
doc = docx.Document('file_to_append.docx')
...
# Modify the contents of doc
...
doc.save('file_to_append.docx')
Keep in mind that the python-docx library may not support loading some elements, which may end up being permanently discarded when you save the file this way.
Looks like I found an answer:
The important point here was to create a new file, if not found, or
otherwise edit the already present file.
import os
from docx import Document
# Use the same path for the existence check and for saving
file_name = 'file_name.docx'
# Check whether the file is already present and create it if not
if not os.path.isfile(file_name):
    # Create a blank document
    document = Document()
    # Save the blank document
    document.save(file_name)
#------------editing the file_name.docx now------------------------
# Open the existing document
document = Document(file_name)
# Edit it
document.add_heading("hello world" , 0)
# Save the document at the end
document.save(file_name)
Further edits/suggestions are welcome.

Edit text in PDF with python

I have a pdf file and I need to edit some text/values in the pdf. For example, In the pdf files that I have BIRTHDAY DD/MM/YYYY is always N/A. I want to change it to whatever value I desire and then save it as a new document. Overwriting existing document is also alright.
Here is what I have tried so far:
from PyPDF2 import PdfReader, PdfWriter
reader = PdfReader("abc.pdf")
page = reader.pages[0]
writer = PdfWriter()
writer.add_page(reader.pages[0])
pdf_doc = writer.update_page_form_field_values(
    reader.pages[0], {"BIRTHDAY DD/MM/YYYY": "123"}
)
with open("new_abc1.pdf", "wb") as fh:
    writer.write(fh)
But this update_page_form_field_values() doesn't change the desired value, maybe because this is not a form field?
Screenshot of pdf showing the value to be changed:
Any clues?
I'm the current maintainer of pypdf and PyPDF2. (Please use pypdf; PyPDF2 is deprecated.)
It is not possible to change text with pypdf at the moment.
Changing form contents is a different story. However, we have several issues with form fields: https://github.com/py-pdf/pypdf/labels/workflow-forms
For form fields, update_page_form_field_values is the correct function to use.
