Windows Automatic naming from info in PDF file itself - python

I am trying to find a way to rename scanned PDFs, which are automatically given names like "397009900", to a certain string found inside the PDF itself. In my case it is a drawing name that I want to extract from the PDF and use as the file name, e.g. "ISO-4024-4301".
Is there a way to automatically rename a PDF file with information from inside of it?
Thanks very much.

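This can be done with Python.
import PyPDF2
# A raw string keeps the backslash in the Windows path literal.
with open(r'path_to_file\Test doc.pdf', 'rb') as p:
    # These are the PyPDF2 1.x/2.x names; PyPDF2 3.x renamed them to
    # PdfReader, reader.pages[0] and extract_text().
    pdfReader = PyPDF2.PdfFileReader(p)
    pageObj = pdfReader.getPage(0)
    info = pageObj.extractText()
    print(info)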
You can choose which page to extract from. Pages are numbered from 0, so change the 0 in the following call to the index of the page you want:
pageObj = pdfReader.getPage(0)
The extracted text is stored in the variable info; you can then apply any string operation to pick out the text you want to use as the new name.
import os
os.rename(r'old_file_path_and_name_with_extension',r'new_file_path_and_name_with_extension')
With the os module, you can easily rename the files!
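To connect the two steps, here is a minimal sketch of picking a drawing name out of the extracted text and renaming the file after it. The regular expression is an assumption based only on the example "ISO-4024-4301", and the function names are hypothetical; adjust both to your real drawings. Note also that a scanned PDF without a text layer will yield empty text, in which case OCR would be needed first.

```python
import re
from pathlib import Path

# Hypothetical pattern: matches names shaped like "ISO-4024-4301".
# Adjust it to whatever your drawing names really look like.
DRAWING_NAME = re.compile(r"\b[A-Z]{2,5}-\d{4}-\d{4}\b")


def find_drawing_name(text):
    """Return the first drawing name found in the extracted text, or None."""
    match = DRAWING_NAME.search(text)
    return match.group(0) if match else None


def rename_to_drawing_name(pdf_path, text):
    """Rename pdf_path after the drawing name in text; return the new path.

    The file is left untouched (and None is returned) if no name is found
    or if a file with the target name already exists.
    """
    pdf_path = Path(pdf_path)
    name = find_drawing_name(text)
    if name is None:
        return None
    target = pdf_path.with_name(name + pdf_path.suffix)
    if target.exists():
        return None
    pdf_path.rename(target)
    return target
```

Here text would be the info string from the extraction step, and pdf_path the file it was read from (rename only after the file has been closed).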

Related

Extract First Page of All PDF Documents in a Library

I am new to PDF Handling in Python. I have a document library which contains a large volume of PDF Documents. I am trying to extract the First Page of each document. I have produced the below code.
My initial for loop "for entry in entries" returns the name of all documents in the library. I verify this by successfully printing all document names in the library.
I am using pdfReader.getPage to specify the page number of each document and the extractText function to extract the text from that page. However, when I run the entire script, I get an error stating that one of the documents cannot be found, even though the document does exist in the library. This is shown in the screenshot from the library below, and is also confirmed by the fact that its name is printed in the list of documents in the repository.
I believe the issue is with how extractText is iterating through all the documents, but I am unclear on how to resolve it. Would anyone have any suggestions?
import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
# get the file names in the directory
directory = 'Fund Docs'
entries = os.listdir(directory)
for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()
You need to specify the full path:
pdfFileObj = open(directory + '/' + entry, 'rb')
This will open the file at Fund Docs/FILE_NAME.pdf. If you pass only entry, Python looks for the file in the current working directory, where it does not exist. By adding the folder at the start, you're telling it to find the entry inside that folder.
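The same fix can be written with os.path.join, which inserts the right separator for the platform. The sketch below is only an illustration; the function name pdf_paths is hypothetical and the filter on the .pdf extension is an added assumption about what the folder contains.

```python
import os


def pdf_paths(directory):
    """Return the full path of every PDF directly inside directory."""
    paths = []
    for entry in sorted(os.listdir(directory)):
        if entry.lower().endswith(".pdf"):
            # os.listdir returns bare names, so prepend the folder.
            paths.append(os.path.join(directory, entry))
    return paths
```

Each returned path can then be passed to open(path, 'rb') in place of the bare entry.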

How do I add different IPTC keywords to multiple images?

I have a folder containing thousands of images and each image needs a unique list of keywords added to it. I also have a table with fields showing the file path and associated list of desired keywords for each image. For example, one record might need the tags, "ORASH (a survey site code), Crew 1, Transect A Upstream, Site Layout". While the next record might need the tags, "ORWLW, Crew 2, Amphibian, Pacific Giant Salamander".
How do I iterate over each image to add the IPTC keywords to them? I'm using python 3 and the iptcinfo3 module but am willing to try other modules that may work.
Here's where I'm at now:
import os
import pandas as pd
from iptcinfo3 import IPTCInfo
srcdir = r'E:\photos'
files = os.listdir(srcdir)
# Create a dataframe from the table containing filepaths and associated keywords.
df = pd.read_excel(r'E:\photo_info.xlsx')
# Create a dictionary with the filename as the key and the tags as the value.
references = dict(df.set_index('basename')['tags'])
for file in files:
    # Get the full filepath for each image.
    filepath = os.path.join(srcdir, file)
    # Create an object for a file that may not have IPTC data (ignore the 'Marker scan...' notification).
    info = IPTCInfo(filepath, force=True)
At this point, I imagined I'd use info['keywords'] = ... in conjunction with the 'references' dictionary to plug the keywords into the correct files. Then info.save_as(filepath). I'm just not experienced enough to know how to make this work or even if it's a reasonable way of doing it. Any help would be appreciated!
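One way the lookup half of that idea could work is sketched below. It assumes (this is not stated in the question) that each value in the 'tags' column is a single comma-separated string; the helper names split_tags and keywords_for are hypothetical.

```python
def split_tags(tags):
    """Split a comma-separated tag string into a clean list of keywords."""
    return [tag.strip() for tag in tags.split(",") if tag.strip()]


def keywords_for(filename, references):
    """Look up the keyword list for one file; empty list if it has no entry."""
    return split_tags(references.get(filename, ""))
```

Inside the loop, the list returned by keywords_for(file, references) is what would be assigned to info['keywords'] before saving. Whether iptcinfo3 expects str or bytes items there may depend on the installed version, so check its documentation.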
I saved the table with the filenames and keywords as a .csv file where the fields and records looked like this (though the text in the 'Subject' field didn't include the quotes):
SourceFile, Artist, Subject
E:\photos\0048.JPG, MARY GRAY, "YEAR2022, REQUIRED, GPS UNIT WITH TIME"
Because I use Jupyter Lab for other portions of this workflow, I ran this code there:
import os
# A raw string keeps the backslashes in the Windows paths literal.
os.system(r'cmd d: & exiftool -overwrite_original -sep ", " -csv="E:\photos\metadata.csv" E:\photos')
This opens the Windows command prompt, changes the path to the D: drive (where the exiftool.exe file was saved), invokes exiftool, sets it to overwrite the original image file rather than create a copy, defines the keyword separator in the .csv file, reads the .csv file that has the list of filenames and associated keywords, then runs it on the E:\photos directory.
Worked like a charm on about 4,000 photos!
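As an alternative to building one shell string, the same exiftool options can be passed as a list through subprocess, which sidesteps shell quoting. This is a sketch under assumptions: the function names are hypothetical, and exiftool is assumed to be on PATH unless a full path to the executable is passed in.

```python
import subprocess


def build_exiftool_command(csv_path, photo_dir, exiftool="exiftool"):
    """Build the exiftool argument list used in the answer above."""
    return [
        exiftool,
        "-overwrite_original",   # modify images in place, no "_original" copies
        "-sep", ", ",            # separator between keywords in the CSV
        "-csv=" + csv_path,      # table of SourceFile paths and tag values
        photo_dir,               # directory of images to update
    ]


def run_exiftool(csv_path, photo_dir, exiftool="exiftool"):
    """Run exiftool and raise CalledProcessError if it reports failure."""
    subprocess.run(build_exiftool_command(csv_path, photo_dir, exiftool), check=True)
```

For the setup described above, the call would be run_exiftool(r"E:\photos\metadata.csv", r"E:\photos"), with the exiftool argument pointing at the executable on the D: drive if it is not on PATH.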

Why am I not able to open more than one file with pdf2image in Python?

I'm trying to extract text from a pdf, so first I have to convert it to image. I can do it, but just with one pdf with a specific name. If I add another pdf to the folder, or change the name of the pdf I already have, I get this error:
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'LoremIpsun.pdf': No error.
This is the part of the code I'm having trouble with:
from pdf2image import convert_from_path
import os
def pdf_a_txt(route):
    target = route
    for root, dirnames, files in os.walk(target):
        for x in files:
            if x.endswith('.pdf'):
                pages = convert_from_path(x, 500, poppler_path='C:\\Users\\User\\Desktop\\poppler-22.04.0\\Library\\bin')
                image_counter = 1
                for page in pages:
                    filename = "page_"+str(image_counter)+".jpg"
                    page.save(root+'\\'+ filename, 'JPEG')
                    image_counter = image_counter + 1
pdf_a_txt('C:\\Users\\User\\Desktop\\Test\\Input')
I'm using a PDF named "LoremIpsum.pdf". If I put another PDF inside the Input folder, the script only opens LoremIpsum. When it finishes converting that one and tries to open the other one, I get the error above. And if I rename "LoremIpsum.pdf" to something different, like "LoremIpsun.pdf", it can't be opened either. I know it is pretty simple code, but I can't work out why it only works with that specific name.
Any help would be appreciated. Thanks!
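One possible explanation, which the thread itself does not confirm: os.walk yields bare file names, so convert_from_path(x, ...) resolves x against the current working directory rather than the Input folder. That would fit the error message, which quotes only 'LoremIpsun.pdf' with no folder, and it would work for one specific name if a copy of that file happens to sit next to the script. A minimal sketch of building full paths instead (iter_pdf_paths is a hypothetical helper name):

```python
import os


def iter_pdf_paths(target):
    """Yield the full path of every .pdf file under target."""
    for root, _dirnames, files in os.walk(target):
        for name in files:
            if name.lower().endswith(".pdf"):
                # os.walk yields bare file names; join them with root.
                yield os.path.join(root, name)
```

Each yielded path would then be passed to convert_from_path in place of x.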

PDF manipulation with Python

I need to remove the last page of a PDF file. I have multiple PDF files in the same directory. So far I have the following code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import os
def changefile (file):
    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()
    for i in range (numpages -1):
        p = infile.getPage(i)
        output.addPage(p)
    with open(file, 'wb') as f:
        output.write(f)
for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)
My script worked while testing it. I need this script for my work. Each day I need to download several e-invoices from our main external supplier. The last page always mentions the conditions of sales and is useless. Unfortunately our supplier left a signature, causing my script not to work properly.
When I am trying to run it on the invoices I receive the following error:
line 1901, in read
raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location
I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix. I can't install QPDF at work, where Windows is used.
I suppose this error is triggered by the signature left by our supplier on each PDF file.
Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...
Automation would be nice, because I put multiple invoices in the same directory every day.
Thanks.
The problem is probably a corrupt PDF file. QPDF is indeed able to work around that. That's why I would recommend using the pikepdf library instead of PyPDF2 in this case. Pikepdf is based on QPDF. First install pikepdf (for example using pip or your package manager) and then try this code:
import pikepdf
import os
def changefile (file):
    print("Processing {0}".format(file))
    pdf = pikepdf.Pdf.open(file)
    lastPageNum = len(pdf.pages)
    pdf.pages.remove(p = lastPageNum)
    pdf.save(file + '.tmp')
    pdf.close()
    os.unlink(file)
    os.rename(file + '.tmp', file)
for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.lower().endswith(".pdf"):
        changefile(file)
Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
Let me know if that works for you.

Search all docx files with python-docx in a directory (batch)

I have a bunch of Word docx files that have the same embedded Excel table. I am trying to extract the same cells from several files.
I figured out how to hard code to one file:
from docx import Document
document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
print(Project)
But how do I batch this? I tried some variations on listdir, but they are not working for me and I am too green to get there on my own.
How you loop over all of the files will really depend on your project deliverables. Are all of the files in a single folder? Are there more than just .docx files?
To address all of the issues, we'll assume that there are subdirectories, and other files mingled with your .docx files. For this, we'll use os.walk() and os.path.splitext()
import os
from docx import Document
# First, we'll create an empty list to hold the path to all of your docx files
document_list = []
# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all its subfolders) using os.walk(). You could alternatively use os.listdir()
# to get a list of files. It would be recommended, and simpler, if all files are
# in the same folder. Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        # it to our list
        if os.path.splitext(os.path.join(path, name))[1] == ".docx":
            document_list.append(os.path.join(path, name))
# Now create a loop that goes over each file path in document_list, replacing your
# hard-coded path with the variable.
for document_path in document_list:
    document = Document(document_path) # Change the document being loaded each loop
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    project = paragraph.text
    print(project)
For additional reading, here is the documentation on os.listdir().
Also, it would be best to put your code into a function which is re-usable, but that's also a challenge for yourself!
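For reference, the os.listdir() variant mentioned in the comments could look roughly like the sketch below. The function name list_docx is hypothetical, and skipping names that start with "~$" (the lock files Word creates while a document is open) is an added assumption about what may be in the folder.

```python
import os


def list_docx(folder):
    """Return full paths of the .docx files directly inside folder."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        # Keep only Word documents and skip Word's "~$" lock files.
        if os.path.splitext(name)[1].lower() == ".docx"
        and not name.startswith("~$")
    ]
```

The returned list can stand in for document_list in the loop above when everything lives in a single folder.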
Assuming that the code above gets you the data you need, all you need to do is read the files from the disk and process them.
First let's define a function that does what you were already doing, then we'll loop over all the documents in the directory and process them with that function.
Edit the following untested code to suit your needs.
# We'll use os.walk to list the files in the directory.
# We're going to make some simplifying assumptions:
# 1) all the docx files are in the same directory
# 2) you want to store the paragraph text in a list.
import os
from docx import Document
DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'
def get_para(document):
    table = document.tables[0]
    Project_cell = table.rows[2].cells[2]
    paragraph = Project_cell.paragraphs[0]
    Project = paragraph.text
    return Project
if __name__ == "__main__":
    paragraphs = []
    # The first item os.walk yields describes DOCS itself: (path, subdirs, files).
    _, _, filenames = next(os.walk(DOCS))
    for filename in filenames:
        if not filename.endswith(".docx"):
            continue  # skip anything that is not a Word document
        file_name = os.path.join(DOCS, filename)
        document = Document(file_name)
        para = get_para(document)
        paragraphs.append(para)
    print(paragraphs)
