PyPDF2 - Unable to Get Past a Large Corrupted File - python

I am working on checking for corrupted PDFs in a file system. In the test I am running, there are almost 200k PDFs. Smaller corrupted files alert correctly, but I ran into a large 15 MB file that's corrupted, and the code just hangs indefinitely. I've tried setting strict to False with no luck. It seems like the initial opening is the problem. Rather than using threads with a timeout (which I have tried in the past, to little success), I'm hoping there's an alternative.
import PyPDF2, os
from time import gmtime, strftime

path = raw_input("Enter folder path of PDF files:")
t = open(r'c:\pdf_check\log.txt', 'w')
count = 1
for dirpath, dnames, fnames in os.walk(path):
    for file in fnames:
        print count
        count = count + 1
        if file.endswith(".pdf"):
            file = os.path.join(dirpath, file)
            try:
                # hangs here on the large corrupted file, even with strict=False
                PyPDF2.PdfFileReader(open(file, 'rb'), strict=False,
                                     warndest=open(r'c:\test\warning.txt', 'a'))
            except PyPDF2.utils.PdfReadError:
                curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                t.write(curdate + " - " + file + " - fail" + "\n")
            else:
                pass
                #curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                #t.write(curdate + " - " + file + " - pass" + "\n")
t.close()

It looks like there is an issue with PyPDF2 itself. I wasn't able to get it past this file; however, I switched to pdfrw, which did not stall at this point and ran through all couple hundred thousand docs without issue.
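For reference, here is a minimal sketch of the same check rebuilt around pdfrw, assuming the same walk-and-log structure as the original script; pdfrw.errors.PdfParseError is the exception pdfrw raises when it cannot parse a document:

import os
from time import gmtime, strftime

import pdfrw

path = raw_input("Enter folder path of PDF files:")
log = open(r'c:\pdf_check\log.txt', 'w')
for dirpath, dnames, fnames in os.walk(path):
    for fname in fnames:
        if not fname.lower().endswith(".pdf"):
            continue
        fpath = os.path.join(dirpath, fname)
        try:
            pdfrw.PdfReader(fpath)  # parses the whole document up front
        except pdfrw.errors.PdfParseError:
            curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
            log.write(curdate + " - " + fpath + " - fail" + "\n")
log.close()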

Related

How to ignore corrupted files?

How to loop through a directory in Python and open wave files that are good whilst ignoring bad (corrupted) ones?
I want to open various wave files from a directory. However, some of these files may be corrupted and some may not be to specification. In particular, there will be files in that directory which, when I try to open them, raise the error:
wave.Error: file does not start with RIFF id
I want to ignore those files. I want to catch the error and continue with the loop. How can this be done?
My code:
for file_path in files:
    sig = 0
    file = str(file_path)
    sig, wave_params = DataGenerator.open_wave(file)
    if sig == 0:
        print('WARNING: Could not open wave file during data creation: ' + file)
        continue
    if wave_params[0] != 1:
        print("WARNING: Wrong NUMBER OF CHANNELS in " + file)
        txt.write("WARNING: Wrong NUMBER OF CHANNELS in " + file + "\n")
        continue
    if wave_params[1] != 2:
        print("WARNING: Wrong SAMPLE WIDTH in " + file)
        txt.write("WARNING: Wrong SAMPLE WIDTH in " + file + "\n")
        continue
    if wave_params[2] != RATE:
        print("WARNING: Wrong FRAME RATE in " + file)
        txt.write("WARNING: Wrong FRAME RATE in " + file + "\n")
        continue
    if wave_params[3] != SAMPLES:
        print("WARNING: Wrong NUMBER OF SAMPLES in " + file)
        txt.write("WARNING: Wrong NUMBER OF SAMPLES in " + file + "\n")
        continue
    if wave_params[4] != 'NONE':
        print("WARNING: Wrong comptype: " + file)
        txt.write("WARNING: Wrong comptype: " + file + "\n")
        continue
    if wave_params[5] != 'not compressed':
        print("WARNING: File appears to be compressed " + file)
        txt.write("WARNING: File appears to be compressed " + file + "\n")
        continue
    if bit_depth != (wave_params[2] * (2**4) * wave_params[1]):
        print("WARNING: Wrong bit depth in " + file)
        txt.write("WARNING: Wrong bit depth in " + file + "\n")
        continue
    if isinstance(sig, int):
        print("WARNING: No signal in " + file)
        txt.write("WARNING: No signal in " + file + "\n")
        continue
My code for opening the wave file:
def open_wave(sound_file):
    """
    Open wave file
    Links:
    https://stackoverflow.com/questions/16778878/python-write-a-wav-file-into-numpy-float-array
    https://stackoverflow.com/questions/2060628/reading-wav-files-in-python
    """
    if Path(sound_file).is_file():
        sig = 0
        with wave.open(sound_file, 'rb') as f:
            n_channels = f.getnchannels()
            samp_width = f.getsampwidth()
            frame_rate = f.getframerate()
            num_frames = f.getnframes()
            wav_params = f.getparams()
            snd = f.readframes(num_frames)
        audio_as_np_int16 = np.frombuffer(snd, dtype=np.int16)
        sig = audio_as_np_int16.astype(np.float32)
        return sig, wav_params
    else:
        print('ERROR: File ' + sound_file + ' does not exist. BAD.')
        print("Problem with opening wave file")
        exit(1)
The lines that would scale the output of the wave file correctly are missing on purpose.
I am interested in how to catch the error mentioned above. A tip on how to open wave files defensively would be nice, too. That is, how can I simply ignore wave files that throw errors?
Just wrap your function in a try/except block:
for file_path in files:
    sig = 0
    file = str(file_path)
    try:  # attempt to use `open_wave`
        sig, wave_params = DataGenerator.open_wave(file)
    except wave.Error as ex:
        print(f"caught Exception reading '{file}': {repr(ex)}")
        continue  # next file_path
    # opportunity to catch other or more generic Exceptions
    ...  # rest of loop
You could make use of a try/except block, where you 'try' accessing the file and catch a potential exception. There you could just pass.
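A minimal sketch of that variant, assuming the same DataGenerator.open_wave helper and loop as above; because sig is initialised to 0, the existing if sig == 0 check then skips the unreadable file:

    try:
        sig, wave_params = DataGenerator.open_wave(file)
    except wave.Error:
        pass  # ignore files that fail to parse; sig stays 0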

Optimize the performance of retrieving file sizes with pysftp

I have a requirement to get the file details for certain locations (on the local file system and on SFTP), and to get the file size for some locations on SFTP, which can be achieved using the shared code.
def getFileDetails(location: str):
    filenames: list = []
    if location.find(":") != -1:
        for file in glob.glob(location):
            filenames.append(getFileNameFromFilePath(file))
    else:
        with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
            remote_files = [x.filename for x in sorted(sftp.listdir_attr(location), key=lambda f: f.st_mtime)]
            if location == LOCATION_SFTP_A:
                for filename in remote_files:
                    filenames.append(filename)
                    sftp_archive_d_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
            elif location == LOCATION_SFTP_B:
                for filename in remote_files:
                    filenames.append(filename)
                    sftp_archive_e_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
            else:
                for filename in remote_files:
                    filenames.append(filename)
        sftp.close()
    return filenames
There are more than 10,000 files in LOCATION_SFTP_A and LOCATION_SFTP_B, and for each file I need the size. To get it I am using:

sftp_archive_d_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
sftp_archive_e_size_mapping[filename] = sftp.stat(location + "/" + filename).st_size
# Time taken: 5 min+

sftp_archive_d_size_mapping[filename] = 1  # sftp.stat(location + "/" + filename).st_size
sftp_archive_e_size_mapping[filename] = 1  # sftp.stat(location + "/" + filename).st_size
# Time taken: 20-30 s

If I comment out sftp.stat(location + "/" + filename).st_size and assign a static value, the entire script takes only 20-30 seconds to run. I am looking for a way to optimize the time while still getting the file sizes.
The Connection.listdir_attr already gives you the file size in SFTPAttributes.st_size.
There's no need to call Connection.stat for each file to get the size (again).
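For illustration, a sketch of the listing loop reworked to reuse the attributes from the single listdir_attr call (the mapping and connection names follow the question's code):

with pysftp.Connection(host=myHostname, username=myUsername, password=myPassword) as sftp:
    # one directory listing already carries filename, mtime and size
    for attr in sorted(sftp.listdir_attr(location), key=lambda f: f.st_mtime):
        filenames.append(attr.filename)
        sftp_archive_d_size_mapping[attr.filename] = attr.st_size  # no per-file stat round trip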
See also:
With pysftp or Paramiko, how can I get a directory listing complete with attributes?
How to fetch sizes of all SFTP files in a directory through Paramiko

Frozen Python File Can't access "Save_File"

After packaging my program, I decided to test it out to make sure it worked. A few things happened, but the main issue is with the Save_File.
I use a Save_File.py for static save data. However, the frozen Python file can't do anything with this file: it can't write to it or read from it. Writing says the save was successful, but on load everything resets to zero again.
Is it normal for any .py file to do this?
Is it an issue in pyinstaller?
Bad freeze process?
Or is there some other reason that the frozen file can't write, read, or interact with files not already inside it? (Save_File was frozen inside and doesn't work, but removing it causes errors, similar to if it never existed.)
So the exe can't see outside of itself or change anything within itself...
Edit: Added the most basic version of the save file; basically, it gets deleted and rewritten a lot.
def save():
    with open("Save_file.py", "a") as file:
        file.write("healthy = " + str(healthy) + "\n")
        file.write("infected = " + str(infected) + "\n")
        file.write("zombies = " + str(zombies) + "\n")
        file.write("dead = " + str(dead) + "\n")
        file.write("cure = " + str(cure) + "\n")
        file.write("week = " + str(week) + "\n")
        file.write("infectivity = " + str(infectivity) + "\n")
        file.write("infectivity_limit = " + str(infectivity_limit) + "\n")
        file.write("severity = " + str(severity) + "\n")
        file.write("severity_limit = " + str(severity_limit) + "\n")
        file.write("lethality = " + str(lethality) + "\n")
        file.write("lethality_limit = " + str(lethality_limit) + "\n")
        file.write("weekly_infections = " + str(weekly_infections) + "\n")
        file.write("dna_points = " + str(dna_points) + "\n")
        file.write("burst = " + str(burst) + "\n")
        file.write("burst_price = " + str(burst_price) + "\n")
        file.write("necrosis = " + str(necrosis) + "\n")
        file.write("necrosis_price = " + str(necrosis_price) + "\n")
        file.write("water = " + str(water) + "\n")
        file.write("water_price = " + str(water_price) + "\n")
        file.write("air = " + str(air) + "\n")
        file.write("blood = " + str(blood) + "\n")
        file.write("saliva = " + str(saliva) + "\n")
        file.write("zombify = " + str(zombify) + "\n")
        file.write("rise = " + str(rise) + "\n")
        file.write("limit = int(" + str(healthy) + " + " + str(infected) + " + " + str(dead) + " + " + str(zombies) + ")\n")
        file.write("old = int(1)\n")
    Clear.clear()
    WordCore.word_corex("SAVING |", "Save completed successfully")
    time.sleep(2)
    Clear.clear()
    player_menu()
It's probably because the frozen version of the file (somewhere in a .zip archive) is the one being loaded, never the one you're writing. That's why it works when the files aren't frozen.
It's bad practice to:
- have a zillion global variables to hold your persistent data
- generate code in a Python file just to evaluate it back again (that's _self-modifying code_)
If you were using C or C++, would you generate some code to store your data and then compile it into your new executable? Would you declare 300 globals? I don't think so.
You'd be better off with the JSON data format and a dictionary for your variables; that works frozen or not frozen.
Your dictionary would look like:
variables = {"healthy": True, "zombies": 345}  # and so on
Access your variables:
if variables["healthy"]:  # do something
Then the save function:
import json

def save():
    with open("data.txt", "w") as file:
        json.dump(variables, file, indent=3)
which creates a text file with data like this:
{
   "healthy": true,
   "zombies": 345
}
And the load function (declaring variables as global so the assignment rebinds the module-level dict instead of creating a local one):
def load():
    global variables
    with open("data.txt", "r") as file:
        variables = json.load(file)
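One extra wrinkle for the frozen case: relative paths resolve against the current working directory, not the executable. A sketch of anchoring the save file next to the executable, assuming a PyInstaller build (PyInstaller sets sys.frozen on bundled apps; SAVE_PATH is an illustrative name):

import os
import sys

# Under PyInstaller, __file__ points inside the bundle, so derive a
# writable location from sys.executable when running frozen.
if getattr(sys, 'frozen', False):
    base_dir = os.path.dirname(sys.executable)
else:
    base_dir = os.path.dirname(os.path.abspath(__file__))

SAVE_PATH = os.path.join(base_dir, "data.txt")  # hypothetical save-file location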

File modification times reverting after change with os.utime in python

I've come across a problem in Python 2.7.1 (running on Mac OS X 10.7.5) with the os.utime command.
I'm trying to develop a script that downloads files from an FTP server that match certain criteria, but if a file already exists in a local dir then I want to compare file modification times: if they don't match, I download the new copy. To achieve this I get the FTP file modification time, convert it to a timestamp, and then use os.utime to change the access and modification dates of the downloaded files to match the FTP server ones.
My problem is that as soon as I return from the subroutine where I change the access and modification times, they revert to the original ones! I don't have anything running in the background, and I also tested the script on a Linux server with the same results.
If you run the code below twice, the debug output will show the problem: the files on the FTP server didn't change, but the timestamps don't match the local ones that were supposedly changed.
Thanks in advance for any help with this problem.
import ftplib
import os
from datetime import datetime

def DownloadAndSetTimestamp(local_file, fi, nt):
    lf = open(local_file, 'wb')
    f.retrbinary("RETR " + fi, lf.write, 8*1024)
    lf.close
    print fi + " downloaded!"
    print "-> mtime before change : " + str(os.stat(local_file).st_mtime)
    print "-> atime before change : " + str(os.stat(local_file).st_atime)
    print "-> FTP value : " + str(int(nt))
    # set the modification time the same as server for future comparison
    os.utime(local_file, (int(nt), int(nt)))
    print "-> mtime after change : " + str(os.stat(local_file).st_mtime)
    print "-> atime after change : " + str(os.stat(local_file).st_atime)

print "Connecting to ftp.ncbi.nih.gov..."
f = ftplib.FTP('ftp.ncbi.nih.gov')
f.login()
f.cwd('/genomes/Bacteria/')
listing = []
dirs = f.nlst()
print "Connected and Dir list retrieved."
target_bug = "Streptococcus_pseudopneumoniae"
print "Searching for :" + target_bug
ct = 0
Target_dir = "test/"
for item in dirs:
    if item.find(target_bug) > -1:
        print item
        # create the dir
        if not os.path.isdir(os.path.join(Target_dir, item)):
            print "Dir not found. Creating it..."
            os.makedirs(os.path.join(Target_dir, item))
        # Get the gbk
        # 1) change the dir
        f.cwd(item)
        # 2) get *.gbk files in dir
        files = f.nlst('*.gbk')
        for fi in files:
            print "----------------------------------------------"
            local_file = os.path.join(Target_dir, item, fi)
            if os.path.isfile(local_file):
                print "################"
                print "File " + local_file + " already exists."
                # get remote modification time
                mt = f.sendcmd('MDTM ' + fi)
                # converting to timestamp
                nt = datetime.strptime(mt[4:], "%Y%m%d%H%M%S").strftime("%s")
                #print "mtime FTP :" + str(int(mt[4:]))
                #print "FTP M timestamp : " + str(nt)
                #print "Local M timestamp : " + str(os.stat(local_file).st_mtime)
                #print "Local A timestamp : " + str(os.stat(local_file).st_atime)
                if int(nt) == int(os.stat(local_file).st_mtime):
                    print fi + " not modified. Download skipped"
                else:
                    print "New version of " + fi
                    ct += 1
                    DownloadAndSetTimestamp(local_file, fi, nt)
                    print "NV Local M timestamp : " + str(os.stat(local_file).st_mtime)
                    print "NV Local A timestamp : " + str(os.stat(local_file).st_atime)
                print "################"
            else:
                print "################"
                print "New file: " + fi
                ct += 1
                mt = f.sendcmd('MDTM ' + fi)
                # converting to timestamp
                nt = datetime.strptime(mt[4:], "%Y%m%d%H%M%S").strftime("%s")
                DownloadAndSetTimestamp(local_file, fi, nt)
                print "################"
        f.cwd('..')
f.quit()
print "# of " + target_bug + " new files found and downloaded: " + str(ct)
You're missing the parentheses in lf.close; it should be lf.close().
Without the parentheses, you're effectively not closing the file. Instead, the file is closed a bit later by the garbage collector after your call to os.utime. Since closing a file flushes the outstanding IO buffer contents, modification time will be updated as a side effect, clobbering the value you previously set.
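To make the ordering explicit, here is a minimal sketch of the helper with the close guaranteed to happen before the timestamp is set (same names as the question's code):

def DownloadAndSetTimestamp(local_file, fi, nt):
    # the with block flushes and closes the file before os.utime runs,
    # so the deferred close can no longer clobber the timestamp
    with open(local_file, 'wb') as lf:
        f.retrbinary("RETR " + fi, lf.write, 8*1024)
    os.utime(local_file, (int(nt), int(nt)))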

I am using your ftputil within a python script

I am using your ftputil within a Python script to get the last modification/creation date of files in a directory. I am having a few problems and wondered if you could help.
host.stat_cache.resize(200000)
recursive = host.walk(directory, topdown=True, onerror=None)
for root, dirs, files in recursive:
    for name in files:
        #mctime = host.stat(name).mtime
        print name
The above outputs a listing of all files in the directory
host.stat_cache.resize(200000)
recursive = host.walk(directory, topdown=True, onerror=None)
for root, dirs, files in recursive:
    for name in files:
        if host.path.isfile("name"):
            mtime1 = host.stat("name")
            mtime2 = host.stat("name").mtime
            #if crtime < now - 30 * 86400:
            #print name + " Was Created " + " " + crtime + " " + mtime
            print name + " Was Created " + " " + " " + mtime1 + " " + mtime2
The above produces no output.
You've put name in quotes. So Python will always be checking for the literal filename "name", which presumably doesn't exist. You mean:
if host.path.isfile(name):
    mtime1 = host.stat(name)
    mtime2 = host.stat(name).mtime
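One further assumption worth flagging beyond the quoting fix: host.walk yields bare filenames, so joining each name with root keeps the stat lookups from resolving against the wrong directory. A sketch, using st_mtime, the os.stat-style attribute that ftputil's stat result exposes:

host.stat_cache.resize(200000)
for root, dirs, files in host.walk(directory, topdown=True, onerror=None):
    for name in files:
        full_path = host.path.join(root, name)  # bare names need the root prefix
        if host.path.isfile(full_path):
            print name + " was modified at " + str(host.stat(full_path).st_mtime)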
