How to ignore corrupted files? - Python

How to loop through a directory in Python and open wave files that are good whilst ignoring bad (corrupted) ones?
I want to open various wave files from a directory. However, some of these files may be corrupted and some may not conform to the specification. In particular, there will be files in that directory which, when opened, raise the error:
wave.Error: file does not start with RIFF id
I want to ignore those files. I want to catch the error and continue with the loop. How can this be done?
My code:
for file_path in files:
    sig = 0
    file = str(file_path)
    sig, wave_params = DataGenerator.open_wave(file)
    if sig == 0:
        print('WARNING: Could not open wave file during data creation: ' + file)
        continue
    if wave_params[0] != 1:
        print("WARNING: Wrong NUMBER OF CHANNELS in " + file)
        txt.write("WARNING: Wrong NUMBER OF CHANNELS in " + file + "\n")
        continue
    if wave_params[1] != 2:
        print("WARNING: Wrong SAMPLE WIDTH in " + file)
        txt.write("WARNING: Wrong SAMPLE WIDTH in " + file + "\n")
        continue
    if wave_params[2] != RATE:
        print("WARNING: Wrong FRAME RATE in " + file)
        txt.write("WARNING: Wrong FRAME RATE in " + file + "\n")
        continue
    if wave_params[3] != SAMPLES:
        print("WARNING: Wrong NUMBER OF SAMPLES in " + file)
        txt.write("WARNING: Wrong NUMBER OF SAMPLES in " + file + "\n")
        continue
    if wave_params[4] != 'NONE':
        print("WARNING: Wrong comptype: " + file)
        txt.write("WARNING: Wrong comptype: " + file + "\n")
        continue
    if wave_params[5] != 'not compressed':
        print("WARNING: File appears to be compressed " + file)
        txt.write("WARNING: File appears to be compressed " + file + "\n")
        continue
    if bit_depth != (wave_params[2] * (2**4) * wave_params[1]):
        print("WARNING: Wrong bit depth in " + file)
        txt.write("WARNING: Wrong bit depth in " + file + "\n")
        continue
    if isinstance(sig, int):
        print("WARNING: No signal in " + file)
        txt.write("WARNING: No signal in " + file + "\n")
        continue
My code for opening the wave file:
def open_wave(sound_file):
    """
    Open wave file.
    Links:
    https://stackoverflow.com/questions/16778878/python-write-a-wav-file-into-numpy-float-array
    https://stackoverflow.com/questions/2060628/reading-wav-files-in-python
    """
    if Path(sound_file).is_file():
        sig = 0
        with wave.open(sound_file, 'rb') as f:
            n_channels = f.getnchannels()
            samp_width = f.getsampwidth()
            frame_rate = f.getframerate()
            num_frames = f.getnframes()
            wav_params = f.getparams()
            snd = f.readframes(num_frames)
        audio_as_np_int16 = np.frombuffer(snd, dtype=np.int16)
        sig = audio_as_np_int16.astype(np.float32)
        return sig, wav_params
    else:
        print('ERROR: File ' + sound_file + ' does not exist. BAD.')
        print("Problem with opening wave file")
        exit(1)
The lines that would scale the output of the wave file correctly have been omitted on purpose.
I am interested in how to catch the error mentioned above. A tip on how to open wave files defensively would be nice, too: how can I simply ignore wave files that throw errors?

Just wrap your function call in a try/except block:
for file_path in files:
    sig = 0
    file = str(file_path)
    try:  # attempt to use `open_wave`
        sig, wave_params = DataGenerator.open_wave(file)
    except wave.Error as ex:
        print(f"caught Exception reading '{file}': {repr(ex)}")
        continue  # next file_path
    # opportunity to catch other or more generic Exceptions
    ...  # rest of loop
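If some of the bad files are truncated or empty rather than merely mis-tagged, the chunk reader underneath the wave module can raise EOFError before the RIFF check is ever reached. Whether that applies to your data is an assumption, but a variant that catches both might look like:
try:  # as above, but also treat truncated/empty files as bad
    sig, wave_params = DataGenerator.open_wave(file)
except (wave.Error, EOFError) as ex:  # EOFError: truncated or empty file (assumption)
    print(f"skipping unreadable file '{file}': {repr(ex)}")
    continue  # next file_path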

You could make use of a try/except block, where you 'try' accessing the file and catch a potential exception. In the except clause you could simply 'pass'.
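A minimal sketch of that approach, reusing the question's DataGenerator.open_wave helper (note that silently skipping with continue/pass means bad files leave no log entry):
import wave

for file_path in files:
    try:
        sig, wave_params = DataGenerator.open_wave(str(file_path))
    except wave.Error:
        continue  # corrupted or non-RIFF file: ignore it and move on
    # ... process sig and wave_params as before ...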

Related

Exception error handling in Python spits out error after specifying not to

I'm trying to stop this code from giving me an error about a file I created called beloved.txt. I used FileNotFoundError to say not to give me the error and to print that the file was not found, but instead it prints both the message and the error message. How can I fix it?
def count_words(Filenames):
    with open(Filenames) as fill_object:
        contentInFill = fill_object.read()
    words = contentInFill.rsplit()
    word_length = len(words)
    print("The file " + Filename + " has " + str(word_length) + " words.")
    try:
        Filenames = open("beloved.txt", mode="rb")
        data = Filenames.read()
        return data
    except FileNotFoundError as err:
        print("Cant find the file name")

Filenames = ["anna.txt", "gatsby.txt", "don_quixote.txt", "beloved.txt", "mockingbird.txt"]
for Filename in Filenames:
    count_words(Filename)
A few tips:
Don't capitalize variables other than class names.
Use different variable names when referring to different things. For example, don't write Filenames = open("beloved.txt", mode="rb") when you already have a global version of that variable and a local version of that variable, and are now reassigning it to mean something different again. This behavior will lead to headaches...
The main problem with the script, though, is trying to open a file outside your try statement. You can just move your code to be within the try:! I also don't understand except FileNotFoundError as err: when you don't use err; you should rewrite that to except FileNotFoundError: in this case :)
def count_words(file):
    try:
        with open(file) as fill_object:
            contentInFill = fill_object.read()
        words = contentInFill.rsplit()
        word_length = len(words)
        print("The file " + file + " has " + str(word_length) + " words.")
        with open("beloved.txt", mode="rb") as other_file:
            data = other_file.read()
        return data
    except FileNotFoundError:
        print("Cant find the file name")

filenames = ["anna.txt", "gatsby.txt", "don_quixote.txt", "beloved.txt", "mockingbird.txt"]
for filename in filenames:
    count_words(filename)
I also do not understand why you have your function return data when data is read from the same file regardless of the file you pass to the function. You will get the same result returned in all cases...
The "with open(Filenames) as fill_objec:" sentence will throw you the exception.
So you at least must enclose that sentence in the try part. In your code you first get that len in words, and then you check for the specific file beloved.txt. This doubled code lets you to the duplicated mensajes. Suggestion:
def count_words(Filenames):
    try:
        with open(Filenames) as fill_object:
            contentInFill = fill_object.read()
        words = contentInFill.rsplit()
        word_length = len(words)
        print("The file " + Filenames + " has " + str(word_length) + " words.")
    except FileNotFoundError as err:
        print("Cant find the file name")

PyPDF2 - Unable to Get Past a Large Corrupted File

I am working on checking for corrupted PDFs in a file system. In the test I am running, there are almost 200k PDFs. Smaller corrupted files seem to alert correctly, but I ran into a large 15 MB file that's corrupted, and the code just hangs indefinitely. I've tried setting strict to False with no luck. It seems like the initial opening is the problem. Rather than using threads with a timeout (which I have tried in the past with little success), I'm hoping there's an alternative.
import PyPDF2, os
from time import gmtime, strftime

path = raw_input("Enter folder path of PDF files:")
t = open(r'c:\pdf_check\log.txt', 'w')
count = 1
for dirpath, dnames, fnames in os.walk(path):
    for file in fnames:
        print count
        count = count + 1
        if file.endswith(".pdf"):
            file = os.path.join(dirpath, file)
            try:
                PyPDF2.PdfFileReader(file, 'rb', warndest=r"c:\test\warning.txt")
            except PyPDF2.utils.PdfReadError:
                curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                t.write(str(curdate) + " " + "-" + " " + file + " " + "-" + " " + "fail" + "\n")
            else:
                pass
                #curdate = strftime("%Y-%m-%d %H:%M:%S", gmtime())
                #t.write(str(curdate) + " " + "-" + " " + file + " " + "-" + " " + "pass" + "\n")
t.close()
It looks like there is an issue with PyPDF2. I wasn't able to get it to work; however, I used pdfrw instead, and it did not stop at this point and ran through all couple hundred thousand docs without issue.
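For reference, a minimal sketch of such a pdfrw-based check (this assumes pdfrw's PdfReader raises PdfParseError on files it cannot parse; exact error behavior may vary by version):
import os
from pdfrw import PdfReader
from pdfrw.errors import PdfParseError

def log_corrupt_pdfs(root, logpath):
    # walk the tree, try to parse each PDF, and log the ones that fail
    with open(logpath, 'w') as log:
        for dirpath, dnames, fnames in os.walk(root):
            for name in fnames:
                if not name.lower().endswith('.pdf'):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    PdfReader(path)  # parse the document structure
                except PdfParseError as err:
                    log.write("%s - fail - %s\n" % (path, err))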

Creating multipage TIFs in FreeImagePy, memory not being freed

I have created a program to merge TIFs into multipage TIFs with a rather old version of FreeImagePy from 2009. It actually works pretty well, but I have one little hitch: as best I can tell, it's not freeing up memory and eventually crashes. Can anyone tell me what I am missing?
Using Python 2.7.
import urllib
import FreeImagePy as FIPY
import os.path
import time

# set source files
sourceFile = open('C:\\temp\\AGetStatePlats\\PlatsGenesee.csv', 'r')
# logfile currently gets totally replaced at each run
logFile = open('C:\\temp\\AGetStatePlats\\PlatTifMerge.txt', 'w')
sourcePath = "c:\\temp\\subdownload2\\"
destPath = 'C:\\temp\\subtifMerge\\'

for row in sourceFile:
    # prepare filenames
    subPath = row.split(',', 1)[0]
    platID = subPath.split('/', 1)[1]
    curDocument = sourcePath + platID + '01.tif'
    curPage = 0
    # check if base file is out there
    if not os.path.isfile(destPath + platID + '.tif'):
        outImage = FIPY.Image()
        outImage.new(multiBitmap = destPath + platID + '.tif')
        print (destPath + platID + '.tif')
        for n in range(1, 100):
            # build n
            nPad = "{:0>2}".format(n)
            if os.path.isfile(sourcePath + platID + nPad + '.tif'):
                # delay in case file system is not keeping up; may not be needed
                time.sleep(1.0 / 4.0)
                outImage.appendPage(sourcePath + platID + nPad + '.tif')
                curPage = curPage + 1
            else:
                outImage.deletePage(0)
                outImage.close()
                del outImage
                logFile.write(sourcePath + platID + '.tif' + '\r')
                logFile.write(platID + " PageCount = " + str(curPage) + '\r')
                break
    else:
        logFile.write(platID + " already exists" + '\r')

# cleanup
sourceFile.close()
logFile.close()
print("Run Complete!")
Here are the error messages:
C:\temp\subtifMerge\05848.tif
('Error returned. ', 'TIFF', 'Warning: parsing error. Image may be incomplete or contain invalid data !')
Traceback (most recent call last):
  File "C:\temp\AGetStatePlats\PlatTIFcombine.py", line 43, in <module>
    outImage.appendPage(sourcePath + platID + nPad + '.tif')
  File "C:\Python27\ArcGIS10.4\lib\site-packages\FreeImagePy\FreeImagePy.py", line 2048, in appendPage
    bitmap = self.genericLoader(fileName)
  File "C:\Python27\ArcGIS10.4\lib\site-packages\FreeImagePy\FreeImagePy.py", line 1494, in genericLoader
    dib = self.Load(fif, fileName, flag);
  File "C:\Python27\ArcGIS10.4\lib\site-packages\FreeImagePy\FreeImagePy.py", line 188, in Load
    return self.__lib.Load(typ, fileName, flags)
WindowsError: exception: priviledged instruction

Python tarfile gzipped file bigger than sum of source files

I have a Python routine which archives file recordings into a GZipped tarball. The output file appears to be far larger than the source files, and I cannot work out why. As an example of the scale of the issue, 6GB of call recordings are generating an archive of 10GB.
There appear to be no errors in the script and the output .gz file is readable and appears OK apart from the huge size.
Excerpt from my script as follows:
# construct tar filename and open file
client_fileid = client_id + "_" + dt.datetime.now().strftime("%Y%m%d_%H%M%S")
tarname = tar_path + "/" + client_fileid + ".tar.gz"
print "Opening tar file %s " % (tarname), "\n"
try:
    tar = tarfile.open(tarname, "w:gz")
except:
    print "Error opening tar file: %s" % sys.exc_info()[0]

sql = """SELECT number, er.id, e.id, flow, filename, filesize, unread, er.cr_date, callerid,
length, callid, info, party FROM extension_recording er, extension e, client c
WHERE er.extension_id = e.id AND e.client_id = c.id AND c.parent_client_id = %s
AND DATE(er.cr_date) BETWEEN '%s' AND '%s'""" % (client_id, start_date, end_date)
rows = cur.execute(sql)
recordings = cur.fetchall()
if rows == 0: sys.exit("No recordings for selected date range - exiting")

for recording in recordings:  # loop through recordings cursor
    try:
        ext_len = len(str(recording[0]))
        # add preceding zeroes if the ext no starts with 0 or 00
        if ext_len == 2: extension_no = "0" + str(recording[0])
        elif ext_len == 1: extension_no = "00" + str(recording[0])
        else: extension_no = str(recording[0])
        filename = recording[4]
        extended_no = client_id + "*%s" % (extension_no)
        sourcedir = recording_path + "/" + extended_no
        tardir = extended_no + "/" + filename
        complete_name = sourcedir + "/" + filename
        tar.add(complete_name, arcname=tardir)  # add to tar archive
    except:
        print "Error '%s' writing to tar file %s" % (sys.exc_info()[1], csvfullfilename)

I have this code. It is supposed to classify the segment files with the ratings in sys.argv[5], but it keeps raising an error

I have the following code that uses FFmpeg. It takes in five argv values: filename (the video), segment size, start time, end time, and rating. It's supposed to let me classify segments many times with my ratings ("PG", "G", "M18", ...), but there's this error:
File "C:\c.py", line 92, in <module>
    os.rename(filename + str(x), filename + str(x) + classification)
WindowsError: [Error 2] The system cannot find the file specified.
I've tried editing it many times, but this error still persists. Does anyone have any idea what this error could mean and how to solve it?
import sys
import subprocess
import os

# change hh:mm:ss to seconds:
def getSeconds(sec):
    l = sec.split(':')
    return int(l[0]) * 3600 + int(l[1]) * 60 + float(l[2])

def get_total_time(filename):
    proc = subprocess.Popen(["ffmpeg", "-i", filename], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    lines = proc.communicate()[1]
    target = [line for line in lines.split('\n') if 'Duration:' in line][0]
    time = target.split('Duration: ')[-1].split(',', 1)[0]
    return time

# check command line arguments
if len(sys.argv) < 5:
    print "Error: not enough arguments"
    sys.exit()

# save filename to file_name
file_name = sys.argv[1]
if not file_name.endswith('mpg'):
    print 'Error! File extension not supported'
    sys.exit()

# save a size of chunk in chunk
segsize = int(sys.argv[2])
chunk = (segsize * 1024)
# get time of starting censorship in seconds
start_censorship = getSeconds(sys.argv[3])
# get time of ending censorship in seconds
end_censorship = getSeconds(sys.argv[4])
classification = sys.argv[5]
if classification not in ['P', 'PG', 'G', 'NC16', 'M18', 'R21']:
    print "Error: invalid classification"
    sys.exit()

# initialize variable for extension
file_ext = ''
# if extension exists then save it into file_ext
if '.' in file_name:
    # split file_name on two parts from right
    file_ext = sys.argv[1].split('.')[-1]
    # file_name without extension
    filename = '.'.join(file_name.split('.')[:-1])

# total_time of file in seconds
total_time = getSeconds(get_total_time(file_name))
print total_time

# open file
in_file = open(file_name, 'rb')
# read first chunks
s = in_file.read(chunk)
in_file.seek(0, 2)
file_size = in_file.tell()
chunks = (file_size / chunk) + 1
chunk_time = total_time / file_size * chunk
# close input file
in_file.close()

# loop for each chunk
for x in range(0, chunks):
    # starting time of current chunk
    t1 = chunk_time * x
    # ending time of current chunk
    t2 = chunk_time * (x + 1)
    if t2 < start_censorship or t1 > end_censorship:
        pass
    else:
        if os.path.exists(filename + str(x) + 'x'):
            os.rename(filename + str(x) + 'x', filename + str(x))
        os.rename(filename + str(x), filename + str(x) + classification)
    # read next bytes
You're not checking whether filename + str(x) exists before you try to rename it. Either check first, or put it in a try block and catch OSError (which WindowsError subclasses). Either that, or you're missing an indent on the second rename.
In that last for loop it looks like you are possibly renaming multiple files -- where do the original files come from?
Or, asked another way: for every chunk that is not censored you are renaming a file -- is that really what you want?
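A minimal sketch of the try/except variant suggested above, guarding just the renames from the loop body (Python 2, to match the question's code; the printed message is illustrative only):
try:
    if os.path.exists(filename + str(x) + 'x'):
        os.rename(filename + str(x) + 'x', filename + str(x))
    os.rename(filename + str(x), filename + str(x) + classification)
except OSError as err:  # WindowsError is a subclass of OSError
    print "skipping chunk %d: %s" % (x, err)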