Extract First Page of All PDF Documents in a Library - python

I am new to PDF Handling in Python. I have a document library which contains a large volume of PDF Documents. I am trying to extract the First Page of each document. I have produced the below code.
My initial for loop "for entry in entries" returns the name of all documents in the library. I verify this by successfully printing all document names in the library.
I am using pdfReader.getPage to specify the page number of each document and the extractText function to extract the text from the page. However, when I run the entire script, I get an error stating that one of the documents cannot be located. The document does exist in the library: it is shown in the screenshot from the library, and it also appears in the printed list of documents in the repository.
I believe the issue is with how extractText is iterating through all the documents, but I am unclear on how to resolve it. Would anyone have any suggestions?
import os
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader

# get the file names in the directory
directory = 'Fund Docs'
entries = os.listdir(directory)
for entry in entries:
    print(entry)
    # create a PDF reader object
    pdfFileObj = open(entry, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    print(pdfReader.numPages)
    # creating a page object
    pageObj = pdfReader.getPage(0)
    # extracting text from page
    print(pageObj.extractText())
    # closing the pdf file object
    pdfFileObj.close()

You need to specify the full path:
pdfFileObj = open(directory + '/' + entry, 'rb')
This will open the file at Fund Docs/FILE_NAME.pdf. If you pass only entry, Python looks for the file in the current working directory, where it does not exist. By adding the folder at the start, you're telling it to look for the entry inside that folder.
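A slightly more portable way to build the path is os.path.join, which uses the right separator for the platform. The sketch below only shows the path handling and leaves the PyPDF2 calls out; the helper name pdf_paths is hypothetical and not part of the original code.

```python
import os


def pdf_paths(directory):
    """Return the full path of every PDF directly inside `directory`."""
    paths = []
    for entry in sorted(os.listdir(directory)):
        # os.listdir() returns bare file names, so join them with the folder
        if entry.lower().endswith(".pdf"):
            paths.append(os.path.join(directory, entry))
    return paths
```

Each returned path can then be passed to open(path, 'rb') in place of entry.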

Related

Why am I unable to open more than one file with pdf2image in Python?

I'm trying to extract text from a PDF, so first I have to convert it to an image. I can do that, but only with one PDF with a specific name. If I add another PDF to the folder, or rename the PDF I already have, I get this error:
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'LoremIpsun.pdf': No error.
This is the part of the code I'm having trouble with:
from pdf2image import convert_from_path
import os

def pdf_a_txt(route):
    target = route
    for root, dirnames, files in os.walk(target):
        for x in files:
            if x.endswith('.pdf'):
                pages = convert_from_path(x, 500, poppler_path='C:\\Users\\User\\Desktop\\poppler-22.04.0\\Library\\bin')
                image_counter = 1
                for page in pages:
                    filename = "page_"+str(image_counter)+".jpg"
                    page.save(root+'\\'+ filename, 'JPEG')
                    image_counter = image_counter + 1

pdf_a_txt('C:\\Users\\User\\Desktop\\Test\\Input')
I'm using a PDF named "LoremIpsum.pdf". If I put another PDF inside the Input folder, only LoremIpsum is opened. When it finishes converting that one and tries to open the other one, I get the error above. And if I rename "LoremIpsum.pdf" to something different, like "LoremIpsun.pdf", it can't be opened either. I know it's pretty simple code, but I can't work out why it only works with that specific name.
Any help would be appreciated. Thanks!
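One likely cause, offered here as an assumption rather than a confirmed diagnosis: os.walk yields bare file names in files, so convert_from_path(x, ...) looks for x in the current working directory instead of in root. That would explain why only a name that happens to exist in the working directory can be opened. A minimal sketch of the path handling follows; the helper name walk_pdf_paths is hypothetical and the pdf2image call is left out.

```python
import os


def walk_pdf_paths(target):
    """Yield the full path of every .pdf file under `target`."""
    for root, _dirnames, files in os.walk(target):
        for name in files:
            if name.lower().endswith(".pdf"):
                # join the folder being walked with the bare file name
                yield os.path.join(root, name)
```

Passing each yielded path to convert_from_path instead of x should let every PDF in the tree be found, assuming the diagnosis above is right.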

PDF manipulation with Python

I need to remove the last page of a PDF file. I have multiple PDF files in the same directory. So far I have the following code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import os

def changefile (file):
    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()
    for i in range (numpages -1):
        p = infile.getPage(i)
        output.addPage(p)
    with open(file, 'wb') as f:
        output.write(f)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)
My script worked while testing it. I need this script for my work. Each day I need to download several e-invoices from our main external supplier. The last page always mentions the conditions of sales and is useless. Unfortunately our supplier left a signature, causing my script not to work properly.
When I am trying to run it on the invoices I receive the following error:
line 1901, in read
raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location
I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix. I can't install QPDF at work, where Windows is used.
I suppose this error is triggered by the signature left by our supplier on each PDF file.
Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...
Automation would be nice, because I daily put multiple invoices in the same directory.
Thanks.
The problem is probably a corrupt PDF file. QPDF is indeed able to work around that. That's why I would recommend using the pikepdf library instead of PyPDF2 in this case. Pikepdf is based on QPDF. First install pikepdf (for example using pip or your package manager) and then try this code:
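import pikepdf
import os

directory = "C:\\Users\\Conan\\PycharmProjects\\untitled"

def changefile(file):
    print("Processing {0}".format(file))
    pdf = pikepdf.Pdf.open(file)
    lastPageNum = len(pdf.pages)
    # remove(p=...) takes a 1-based page number, so this drops the last page
    pdf.pages.remove(p=lastPageNum)
    # save to a temporary file, then swap it in for the original
    pdf.save(file + '.tmp')
    pdf.close()
    os.unlink(file)
    os.rename(file + '.tmp', file)

for file in os.listdir(directory):
    if file.lower().endswith(".pdf"):
        # os.listdir() returns bare names, so build the full path
        changefile(os.path.join(directory, file))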
Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
Let me know if that works for you.
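The unlink-then-rename step in the code above can be collapsed into a single os.replace call, which overwrites an existing destination on both Windows and POSIX. The sketch below is independent of pikepdf; save_via_temp and the write_to callback are hypothetical names used only for illustration.

```python
import os


def save_via_temp(path, write_to):
    """Write new content to `path + '.tmp'`, then swap it over `path`.

    `write_to` is any callable that takes a file path and writes the new
    content there. It should also release any handle it holds on `path`
    (for example by closing the source PDF) before it returns.
    """
    tmp_path = path + ".tmp"
    try:
        write_to(tmp_path)
        # os.replace overwrites an existing destination on all platforms
        os.replace(tmp_path, path)
    finally:
        # clean up the temp file if writing or replacing failed
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

If writing fails part-way, the original file is left untouched and the temporary file is removed.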

Windows Automatic naming from info in PDF file itself

I am trying to find a way to rename scanned PDFs, which are automatically given names like "397009900", to a certain string found inside the PDF itself. In my case it is a drawing name that I want to extract from the PDF and use as the file name, i.e. "ISO-4024-4301".
Is there a way to automatically rename a PDF file with information from inside of it?
Thanks very much.
This can be done with python.
import PyPDF2

# raw string, so the backslash in the Windows path is not treated as an escape
with open(r'path_to_file\Test doc.pdf', 'rb') as p:
    pdfReader = PyPDF2.PdfFileReader(p)
    pageObj = pdfReader.getPage(0)
    info = pageObj.extractText()
    print(info)
You can choose which page to extract from by changing the page number. Pages are numbered from 0, so replace the 0 with the index of the page you want:
pageObj = pdfReader.getPage(0)
The extracted text is stored in the variable info; you can then pick out whichever part of it you want to use as the new file name.
import os
os.rename(r'old_file_path_and_name_with_extension',r'new_file_path_and_name_with_extension')
With the os module, you can easily rename the files!
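Picking the drawing name out of the extracted text can be done with a regular expression. The sketch below assumes names follow the shape of the single example in the question (ISO-4024-4301, that is "ISO" followed by two four-digit groups); the pattern and both helper names are assumptions to adapt. Note that text extraction only returns something if the scanned PDF carries a text layer; a purely image-based scan would need OCR first.

```python
import os
import re

# Assumed pattern, based only on the example "ISO-4024-4301"
DRAWING_NAME = re.compile(r"ISO-\d{4}-\d{4}")


def find_drawing_name(text):
    """Return the first drawing name found in `text`, or None."""
    match = DRAWING_NAME.search(text)
    return match.group(0) if match else None


def rename_to_drawing_name(pdf_path, text):
    """Rename `pdf_path` after the drawing name found in `text`.

    Returns the new path, or None if no name was found (file untouched).
    """
    name = find_drawing_name(text)
    if name is None:
        return None
    new_path = os.path.join(os.path.dirname(pdf_path), name + ".pdf")
    os.rename(pdf_path, new_path)
    return new_path
```

Here text would be the info string extracted from the first page, and pdf_path the file it came from. The file must be closed before it is renamed on Windows.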

Get XML file name from loaded XML files using Python

My Python code reads XML files stored at a location, parses them with the lxml library, and loads them into a Python list as shown below:
import os
from lxml import etree

XMLFILEList = []
FilePath = 'C:\\plugin\\TestPlugin\\'
XMLFilePath = os.listdir(FilePath)
for XMLFILE in XMLFilePath:
    if XMLFILE.endswith('.xml'):
        XMLFILEList.append(etree.parse(XMLFILE))
print(XMLFILEList)
Output:
[<lxml.etree._ElementTree object at 0x000001CCEEE0C748>, <lxml.etree._ElementTree object at 0x000001CCEEE0C7C8>]
Currently, I see objects of XML files.
Can anyone please help me retrieve the original filenames of the XML files? For example, if my HelloWorld.xml file is loaded into XMLFILEList, I should be able to retrieve "HelloWorld.xml".
You have a one-to-one correspondence between XMLFilePath and XMLFILEList: the first holds the names of the files you loaded, the second holds their contents. Just use that, applying your if statement:
mydict = {}
for XMLFILE in XMLFilePath:
    if XMLFILE.endswith('.xml'):
        mydict[XMLFILE] = etree.parse(XMLFILE)
Your dict will now have the file names as keys and the parsed files as values.
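A self-contained version of the same idea is sketched below. To keep it runnable without extra packages it swaps lxml for the standard library's xml.etree.ElementTree, and it joins the folder with each file name so the script does not depend on the current working directory; the function name parse_xml_files is hypothetical.

```python
import os
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def parse_xml_files(folder):
    """Map each .xml file name in `folder` to its parsed tree."""
    trees = {}
    for name in os.listdir(folder):
        if name.endswith(".xml"):
            # the key is the original file name, the value the parsed tree
            trees[name] = ET.parse(os.path.join(folder, name))
    return trees
```

Iterating over the returned dict (or calling list() on it) gives back the original file names.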

Search all docx files with python-docx in a directory (batch)

I have a bunch of Word docx files that have the same embedded Excel table. I am trying to extract the same cells from several files.
I figured out how to hard code to one file:
from docx import Document
document = Document(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx\006-087-003.docx")
table = document.tables[0]
Project_cell = table.rows[2].cells[2]
paragraph = Project_cell.paragraphs[0]
Project = paragraph.text
print(Project)
But how do I batch this? I tried some variations on listdir, but they are not working for me and I am too green to get there on my own.
How you loop over all of the files will really depend on your project deliverables. Are all of the files in a single folder? Are there more than just .docx files?
To address all of the issues, we'll assume that there are subdirectories, and other files mingled with your .docx files. For this, we'll use os.walk() and os.path.splitext()
import os
from docx import Document

# First, we'll create an empty list to hold the paths to all of your docx files
document_list = []

# Now, we loop through every file in the folder "G:\GIS\DESIGN\ROW\ROW_Files\Docx"
# (and all its subfolders) using os.walk(). You could alternatively use os.listdir()
# to get a list of files. That would be recommended, and simpler, if all files are
# in the same folder. Consider that change a small challenge for developing your skills!
for path, subdirs, files in os.walk(r"G:\GIS\DESIGN\ROW\ROW_Files\Docx"):
    for name in files:
        # For each file we find, we need to ensure it is a .docx file before adding
        # it to our list
        if os.path.splitext(os.path.join(path, name))[1] == ".docx":
            document_list.append(os.path.join(path, name))

# Now create a loop that goes over each file path in document_list, replacing your
# hard-coded path with the variable.
for document_path in document_list:
    document = Document(document_path)  # Change the document being loaded each loop
    table = document.tables[0]
    project_cell = table.rows[2].cells[2]
    paragraph = project_cell.paragraphs[0]
    project = paragraph.text
    print(project)
For additional reading, here is the documentation on os.listdir().
Also, it would be best to put your code into a function which is re-usable, but that's also a challenge for yourself!
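For the single-folder case mentioned in the comments above, the file listing could be wrapped in a small reusable function. This is only a sketch using pathlib from the standard library; the name docx_in_folder is hypothetical, and the python-docx processing is left out.

```python
from pathlib import Path


def docx_in_folder(folder):
    """Return the .docx files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".docx"
    )
```

Each returned Path can be converted with str() and passed to Document() in place of the hard-coded path.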
Assuming that the code above gets you the data you need, all you need to do is read the files from the disk and process them.
First let's define a function that does what you were already doing, then we'll loop over all the documents in the directory and process them with that function.
Edit the following untested code to suit your needs.
# we'll use os.walk to list the files in the directory
# we're going to make some simplifying assumptions:
# 1) all the docx files are in the same directory
# 2) you want to store the paragraph text in a list.
import os
from docx import Document

DOCS = r'G:\GIS\DESIGN\ROW\ROW_Files\Docx'

def get_para(document):
    table = document.tables[0]
    Project_cell = table.rows[2].cells[2]
    paragraph = Project_cell.paragraphs[0]
    Project = paragraph.text
    return Project

if __name__ == "__main__":
    paragraphs = []
    # the first item from os.walk is (DOCS, subdirectories, file names)
    _, _, filenames = next(os.walk(DOCS))
    for filename in filenames:
        file_name = os.path.join(DOCS, filename)
        document = Document(file_name)
        para = get_para(document)
        paragraphs.append(para)
    print(paragraphs)
