PDF manipulation with Python

I need to remove the last page of a pdf file. I have multiple pdf files in the same directory. So far I have the following code:
from PyPDF2 import PdfFileWriter, PdfFileReader
import os

def changefile(file):
    infile = PdfFileReader(file, "rb")
    output = PdfFileWriter()
    numpages = infile.getNumPages()
    for i in range(numpages - 1):
        p = infile.getPage(i)
        output.addPage(p)
    with open(file, 'wb') as f:
        output.write(f)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.endswith(".pdf") or file.endswith(".PDF"):
        changefile(file)
My script worked while testing it. I need this script for my work: each day I have to download several e-invoices from our main external supplier. The last page always contains the conditions of sale and is useless. Unfortunately our supplier leaves a signature on each PDF, which causes my script not to work properly.
When I am trying to run it on the invoices I receive the following error:
line 1901, in read
raise utils.PdfReadError("Could not find xref table at specified location")
PyPDF2.utils.PdfReadError: Could not find xref table at specified location
I was able to fix it on my Linux laptop by running qpdf invoice.pdf invoice-fix. I can't install QPDF at work, where Windows is used.
I suppose this error is triggered by the signature left by our supplier on each PDF file.
Does anyone know how to fix this error? I am looking for an efficient method to fix the issue with a broken PDF file and its signature. There must be something better than opening each PDF file with Adobe and removing the signature manually ...
Automation would be nice, because I put multiple invoices in the same directory every day.
Thanks.

The problem is probably a corrupt PDF file. QPDF is indeed able to work around that. That's why I would recommend using the pikepdf library instead of PyPDF2 in this case. Pikepdf is based on QPDF. First install pikepdf (for example using pip or your package manager) and then try this code:
import pikepdf
import os

def changefile(file):
    print("Processing {0}".format(file))
    pdf = pikepdf.Pdf.open(file)
    lastPageNum = len(pdf.pages)
    pdf.pages.remove(p=lastPageNum)   # p is 1-based, so this removes the last page
    pdf.save(file + '.tmp')           # save to a temp file first; by default pikepdf won't overwrite its own input
    pdf.close()
    os.unlink(file)
    os.rename(file + '.tmp', file)

for file in os.listdir("C:\\Users\\Conan\\PycharmProjects\\untitled"):
    if file.lower().endswith(".pdf"):
        changefile(file)
Link to pikepdf docs: https://pikepdf.readthedocs.io/en/latest/
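As a side note, newer pikepdf versions can overwrite their input file directly, which would make the temporary-file dance unnecessary. This is only a sketch under the assumption that your installed version supports the allow_overwriting_input option (check the docs linked above):
import pikepdf

def changefile(file):
    # Assumes a pikepdf version that accepts allow_overwriting_input
    with pikepdf.Pdf.open(file, allow_overwriting_input=True) as pdf:
        del pdf.pages[-1]   # drop the last page
        pdf.save(file)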
Let me know if that works for you.

Related

Running an ipynb script on many files at once/an entire directory?

I will be the first to tell you that my Python skills are beginner at best, so please forgive my ignorance here.
By way of background, I have created a Python script in Anaconda Jupyter Notebooks that reads a single PDF from a folder, C:\Users\...\PDFs, extracts the text of said PDF, and then through some splicing puts the text of interest into a CSV file that it creates.
The problem is that I want to execute this script on hundreds of PDFs (the ipynb script itself works just fine when executed on individual PDFs, I just don't want to keep manually changing the file name in the Notebook/Python script). Using pdfreader, my script starts with the following:
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
fd = open(r'C:\Users\...\PDFs\[pdf name].pdf', 'rb')
viewer = SimplePDFViewer(fd)
doc = PDFDocument(fd)
This is where I get stuck - I cannot figure out how to run this on/import all PDFs in the folder. I have seen some people use a variable file name with an asterisk, e.g. C:\Users\...\PDFs\*.pdf, however I can't get that to work. It seems like it might be possible to save my ipynb as a py file and then somehow run it in Anaconda Prompt, however I have struggled to get this method to work as well. I am unfamiliar with bat files, but those too seem potentially promising.
Does anyone know of a way to run this script on many PDFs in a single directory at once? I have scrounged around a ton, but for the life of me cannot figure this out. Any help would be greatly appreciated! :)
You can use the glob module to gather all of the file names, then loop through them.
import pdfreader
from pdfreader import PDFDocument, SimplePDFViewer
from glob import glob

pdf_files = glob(r'C:\Users\...\PDFs\*.pdf')

for path in pdf_files:
    fd = open(path, 'rb')
    viewer = SimplePDFViewer(fd)
    doc = PDFDocument(fd)
    ...
    fd.close()
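If you prefer the standard library's pathlib to glob, the same loop can be written as below; this is just a sketch, and the folder path (with its elided ... segment) still has to be filled in with the real location:
from pathlib import Path
from pdfreader import PDFDocument, SimplePDFViewer

pdf_dir = Path(r'C:\Users\...\PDFs')  # fill in the real folder

for path in pdf_dir.glob('*.pdf'):
    with open(path, 'rb') as fd:
        viewer = SimplePDFViewer(fd)
        doc = PDFDocument(fd)
        ...  # the existing text-extraction and CSV code goes here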

Unable to open saved Excel file using urllib.request.urlretrieve (sample link mentioned)

Currently, I'm using Flask with Python 3.
For sample purposes, here is a dropbox link
In order to fetch the file and save it, I'm doing the following.
urllib.request.urlretrieve("https://www.dropbox.com/s/w1h6vw2st3wvtfb/Sample_List.xlsx?dl=0", "Sample_List.xlsx")
The file is saved successfully to my project's root directory, however there is a problem. When I try to open the file, I get this error.
What am I doing wrong over here?
Also, is there a way to get the file name and extension from the URL itself? For example, filename = Sample_List and extension = xlsx, something like that.
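Two things are worth checking here, both offered as assumptions rather than a verified fix: a Dropbox share link ending in dl=0 normally serves an HTML preview page rather than the file itself, so the saved .xlsx is not a real workbook; switching to dl=1 requests the file directly. The file name and extension can then be taken from the URL path with the standard library, for example:
import os
import urllib.request
from urllib.parse import urlparse

# dl=1 asks Dropbox for the file itself instead of the HTML preview page
url = "https://www.dropbox.com/s/w1h6vw2st3wvtfb/Sample_List.xlsx?dl=1"

path = urlparse(url).path                      # '/s/w1h6vw2st3wvtfb/Sample_List.xlsx'
file_name = os.path.basename(path)             # 'Sample_List.xlsx'
name, extension = os.path.splitext(file_name)  # ('Sample_List', '.xlsx')

urllib.request.urlretrieve(url, file_name)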

Is it possible to download just part of a ZIP file using python zipfile library

I was wondering, is there any way by which I can download only a part of a .rar or .zip file without downloading the whole file? There is a zip file containing files A, B, C and D. I only need A. Can I somehow use the zipfile module so that I can only download one file?
I am trying the code below:
r = c.get(file)
z = ZipFile.ZipFile(BytesIO(r.content))
for file1 in z.namelist():
    if 'time' not in file1:
        print("hi")
        z.extractall(file1, download_path + filename)
This code downloads the whole zip file and only extracts the specific one. Can I somehow download only the file I need?
There is a similar question here, but it only shows a command-line approach on Linux. That question doesn't address how it can be done using Python libraries.
The question @Juggernaut mentioned in a comment is actually very helpful, as it points you in the direction of the solution.
You need to create a replacement for BytesIO that returns the necessary information to ZipFile. You will need to get the length of the file, and then get whatever sections ZipFile asks for.
How large are those files? Is it really worth the trouble?
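To make that idea concrete, here is a rough sketch of such a replacement: a read-only, seekable file-like object that fetches byte ranges with requests, so ZipFile only pulls down the parts it actually reads. The class name and URL are made up for illustration, the server must honour Range requests, and each small read becomes a separate HTTP request, so the remotezip answer below is usually the more practical route:
import io
import requests
from zipfile import ZipFile

class HttpRangeFile(io.RawIOBase):
    """Read-only, seekable file-like object backed by HTTP Range requests."""

    def __init__(self, url):
        self.url = url
        # Total size from a HEAD request; assumes the server reports Content-Length
        head = requests.head(url, allow_redirects=True)
        self.length = int(head.headers["Content-Length"])
        self.pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.length + offset
        return self.pos

    def read(self, size=-1):
        if size < 0:
            size = self.length - self.pos
        if size == 0 or self.pos >= self.length:
            return b""
        end = min(self.pos + size, self.length) - 1
        resp = requests.get(self.url, headers={"Range": "bytes={}-{}".format(self.pos, end)})
        data = resp.content
        self.pos += len(data)
        return data

z = ZipFile(HttpRangeFile("https://example.com/archive.zip"))
for file1 in z.namelist():
    if 'time' not in file1:
        z.extract(file1, "downloads")  # extract only the members you need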
Use remotezip: https://github.com/gtsystem/python-remotezip. You can install it using pip:
pip install remotezip
Usage example:
from remotezip import RemoteZip

with RemoteZip("https://path/to/zip/file.zip") as zip_file:
    for file in zip_file.namelist():
        if 'time' not in file:
            print("hi")
            zip_file.extract(file, path="/path/to/extract")
Note that to use this approach, the web server from which you receive the file needs to support the Range header.
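If you are not sure whether the server honours ranges, a quick probe (a sketch; the URL is the placeholder from the example above) is to request a single byte and inspect the response:
import requests

url = "https://path/to/zip/file.zip"  # placeholder
resp = requests.get(url, headers={"Range": "bytes=0-0"}, stream=True)

# 206 Partial Content means Range is supported; 200 means the server
# ignored the header and would have sent the whole file.
print(resp.status_code, resp.headers.get("Content-Range"))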

Python: Save Excel File As-Is To Folder

I'm downloading Excel files from a website using beautifulsoup4.
I only need to download the files. I don't need to rename them, just download them to a folder, relative to where the code is.
The function takes in a BeautifulSoup object, searches for <a> tags, and then makes a request for each link.
def save_excel_files(sfile):
    print("starting")
    for link in sfile.find_all("a"):
        candidate_link = link.get("href")
        if (candidate_link is not None
                and "Flat.File" in candidate_link):
            xfile = requests.get(candidate_link)
            if xfile:
                ### I just don't know what to do...
I've tried using os.path; with open("xtest", "wb") as f: and many other variations. I've been at this for two evenings and am totally stuck.
The first issue is that I can't even get the files to download and save anywhere. xfile resolves to <Response [200]>, so that part is working; I'm just having a hard time coding the actual download and save.
Something like this should work:
xfile = requests.get(candidate_link)
file_name = candidate_link.split('/')[-1]

if xfile:
    with open(file_name, "wb") as f:
        f.write(xfile.content)
Tested with the following link I found randomly on Google:
candidate_link = "http://berkeleycollege.edu/browser_check/samples/excel.xls"
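Since the goal is to save the files into a folder relative to the script, one option is to build the target path from the script's own location. This is a sketch that reuses candidate_link from the loop in the question, and the folder name excel_downloads is just an assumption:
import os
import requests

# Folder next to this script; created on the first run if it does not exist
target_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "excel_downloads")
os.makedirs(target_dir, exist_ok=True)

xfile = requests.get(candidate_link)              # candidate_link comes from the find_all loop above
file_name = candidate_link.split('/')[-1]
with open(os.path.join(target_dir, file_name), "wb") as f:
    f.write(xfile.content)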

corrupt zip download urllib2

I am trying to download zip files from measuredhs.com using the following code:
url = 'https://dhsprogram.com/customcf/legacy/data/download_dataset.cfm?Filename=BFBR62DT.ZIP&Tp=1&Ctry_Code=BF'
request = urllib2.urlopen(url)
output = open("install.zip", "w")
output.write(request.read())
output.close()
However the downloaded file does not open. I get a message saying the compressed zip folder is invalid.
To access the download link, one needs to log in, which I have done. If I click on the link, or even paste it into a browser, it automatically downloads the file.
Thanks
Try writing to the local file in binary mode.
with open('install.zip', 'wb') as output:
    output.write(request.read())
Also, comparing the md5/sha1 hash of the downloaded file will let you know if the downloaded file has been corrupted.
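For the hash comparison, hashlib from the standard library can compute the digest of the saved file; this is a small sketch, and the value you compare against has to come from the data provider:
import hashlib

def file_md5(path, chunk_size=8192):
    """Compute the MD5 digest of a file, reading it in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

print(file_md5('install.zip'))  # compare against the published checksum, if one is available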
