I recently ran into a performance problem with the retention of mp4 files. I have a kind of recorder that saves 1-minute-long mp4 files from multiple RTSP streams. The files are stored on an external drive in a file tree like this:
./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4
Apart from the video files, there are many other files on this drive. They are not considered (unless they have the mp4 extension) because they take up much less space.
The retention logic is as follows. Every minute, the Python script responsible for recording checks how full the external drive is. If usage is above 80%, it scans the whole drive for .mp4 files. When the scan is done, it sorts the list of files by modification time (st_mtime) and deletes the N oldest files, where N is the number of cameras.
The part of the code responsible for file retention is shown below.
total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        oldest_files = sorted(
            (
                os.path.join(dirname, filename)
                for dirname, dirnames, filenames in os.walk('/home')
                for filename in filenames
                if filename.endswith(".mp4")
            ),
            key=lambda fn: os.stat(fn).st_mtime,
        )[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file in oldest_files:
            os.remove(oldest_file)
            logging.info("%s removed", oldest_file)
    except ValueError as e:
        # no files to delete
        pass
(/home is the external drive's mount point.)
The problem is that this mechanism worked like a charm when I used a 256 or 512 GB SSD. Now I need more space (more cameras and longer retention), and building the file list takes a long time on the larger SSDs (2 to 5 TB now, maybe 8 TB in the future). The scan takes much longer than 1 minute, which could be worked around by running it less often and extending the list of files to delete. The real problem is that the process itself puts a heavy load on the CPU (through I/O operations). The performance drop is visible across the whole system: other applications, such as some simple computer vision algorithms, run slower, and the CPU load can even cause a kernel panic.
The hardware I work on is the Nvidia Jetson Nano and Xavier NX. Both devices show the performance problem described above.
My question is whether you know of any algorithms or out-of-the-box software for file retention that would work in the case I described. Or is there a way to rewrite my code to make it more reliable and performant?
EDIT:
I was able to reduce the impact of os.walk() by limiting the space to check. Now I only scan /home/recordings and /home/recognition/, which also makes the directory tree for the recursive scan shallower. At the same time, I added checking of .jpg files, so I now look for both .mp4 and .jpg. The result is much better with this implementation.
However, I need further optimization. I prepared some test cases and ran them on a 1 TB drive that is 80% full (mostly media files). The profiler results for each case are attached below.
@time_measure
def method6():
    paths = [
        "/home/recordings",
        "/home/recognition",
        "/home/recognition/marked_frames",
    ]
    files = []
    for path in paths:
        files.extend((
            os.path.join(dirname, filename)
            for dirname, dirnames, filenames in os.walk(path)
            for filename in filenames
            if (filename.endswith(".mp4") or filename.endswith(".jpg")) and not os.path.islink(os.path.join(dirname, filename))
        ))
    oldest_files = sorted(
        files,
        key=lambda fn: os.stat(fn).st_mtime,
    )
    print(oldest_files[:5])

@time_measure
def method7():
    ext = [".mp4", ".jpg"]
    paths = [
        "/home/recordings/*/*/*",
        "/home/recognition/*",
        "/home/recognition/marked_frames/*",
    ]
    files = []
    for path in paths:
        files.extend((file for file in glob(path) if not os.path.islink(file) and (file.endswith(".mp4") or file.endswith(".jpg"))))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    print(oldest_files[:5])
The original implementation takes ~100 s on the same data set.
EDIT2:
Comparison of @norok2's proposals
I compared them with method6 and method7 from above. I ran the tests several times with similar results.
Testing method7
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 24.73726773262024 s
_________________________
Testing find_oldest
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 34.355509757995605 s
_________________________
Testing find_oldest_cython
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 25.81963086128235 s
Profiler results (images not preserved): method7 (glob()), iglob(), Cython
You could get an extra few percent speed-up on top of your method7() with the following:
import os
import glob


def find_oldest(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = [
        filename
        for path in paths
        for filename in glob.iglob(path)
        if any(filename.endswith(ext) for ext in exts) and not os.path.islink(filename)]
    mtime_idxs = sorted(
        (os.stat(fn).st_mtime, i)
        for i, fn in enumerate(result))
    return [result[mtime_idxs[i][1]] for i in range(k)]
The main improvements are:
use iglob instead of glob -- while it may be of comparable speed, it takes significantly less memory, which may help on low-end machines
str.endswith() is done before the allegedly more expensive os.path.islink(), which helps reduce the number of such calls thanks to short-circuiting
an intermediate list with all the mtimes is produced to minimize the os.stat() calls
This can be sped up even further with Cython:
%%cython --cplus -c-O3 -c-march=native -a
import os
import glob


cpdef find_oldest_cy(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = []
    for path in paths:
        for filename in glob.iglob(path):
            good_ext = False
            for ext in exts:
                if filename.endswith(ext):
                    good_ext = True
                    break
            if good_ext and not os.path.islink(filename):
                result.append(filename)
    mtime_idxs = []
    for i, fn in enumerate(result):
        mtime_idxs.append((os.stat(fn).st_mtime, i))
    mtime_idxs.sort()
    return [result[mtime_idxs[i][1]] for i in range(k)]
My tests on the following files:
def gen_files(n, exts=("mp4", "jpg", "txt"), filename="somefile", content="content"):
    for i in range(n):
        ext = exts[i % len(exts)]
        with open(f"{filename}{i}.{ext}", "w") as f:
            f.write(content)


gen_files(10_000)
produce the following:
funcs = find_oldest_OP, find_oldest, find_oldest_cy

timings = []
base = funcs[0]()
for func in funcs:
    res = func()
    is_good = base == res
    timed = %timeit -r 8 -n 4 -q -o func()
    timing = timed.best * 1e3
    timings.append(timing if is_good else None)
    print(f"{func.__name__:>24} {is_good} {timing:10.3f} ms")
# find_oldest_OP True 81.074 ms
# find_oldest True 70.994 ms
# find_oldest_cy True 64.335 ms
find_oldest_OP is the following, based on method7() from OP:
def find_oldest_OP(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    files = []
    for path in paths:
        files.extend(
            (file for file in glob.glob(path)
             if not os.path.islink(file) and any(file.endswith(ext) for ext in exts)))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    return oldest_files[:k]
The Cython version seems to point to a roughly 20% reduction in execution time compared to find_oldest_OP (64.3 ms vs. 81.1 ms in the timings above).
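Since only the k oldest files are needed, a full sort is not strictly required. As a further variation (not benchmarked here, so any gain is an assumption), the sketch below walks the tree with os.scandir() and picks the k smallest mtimes with heapq.nsmallest(). Note that it takes root directories rather than glob patterns, and that the names iter_media_files and find_oldest_heap are made up for this illustration.

```python
import heapq
import os


def iter_media_files(root, exts=(".mp4", ".jpg")):
    """Yield (mtime, path) for regular, non-symlink files under root."""
    exts = tuple(exts)  # str.endswith() accepts a tuple of suffixes
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.endswith(exts) and entry.is_file(follow_symlinks=False):
                    # Cheap suffix test first; stat() only for matching files.
                    yield entry.stat(follow_symlinks=False).st_mtime, entry.path


def find_oldest_heap(roots, exts=(".mp4", ".jpg"), k=5):
    """Return up to k paths with the smallest mtime, oldest first."""
    candidates = (item for root in roots for item in iter_media_files(root, exts))
    return [path for _, path in heapq.nsmallest(k, candidates)]
```

A call such as find_oldest_heap(["/home/recordings", "/home/recognition"], k=5) would correspond to the directories scanned in the question. Because the candidates are consumed lazily, the full file list is never held in memory at once.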
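You could use the subprocess module to list all the mp4 files directly, without having to loop through all the files in the directory in Python. Note that dir /b /s is a Windows command; on Linux the equivalent listing comes from find /home -name '*.mp4'.
import os
import subprocess as sb
files = sb.getoutput(r"dir /b /s .\home\*.mp4").splitlines()
oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)[:len(camera_devices)]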
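Since the question is about Linux devices, here is a hedged sketch of the same idea using GNU find, which can print each file's modification time next to its path so that no separate os.stat() calls are needed in Python. It assumes GNU find is installed (the -printf option is a GNU extension) and that no path contains a newline; parse_find_output and oldest_via_find are hypothetical helper names.

```python
import subprocess


def parse_find_output(text):
    """Turn lines of '<mtime> <path>' into a list of (mtime, path) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first space only, so paths containing spaces survive.
        mtime_str, path = line.split(" ", 1)
        entries.append((float(mtime_str), path))
    return entries


def oldest_via_find(root, k):
    """Ask GNU find for mtimes and paths of *.mp4 files, return the k oldest paths."""
    # %T@ is the mtime in seconds since the epoch, %p is the path.
    cmd = ["find", root, "-type", "f", "-name", "*.mp4", "-printf", "%T@ %p\\n"]
    result = subprocess.run(
        cmd, stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return [path for _, path in sorted(parse_find_output(result.stdout))[:k]]
```

Whether this is actually lighter on the system than the pure-Python scan would have to be measured on the target hardware.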
A quick optimization would be to skip checking the file's timestamp and trust the filename instead.
import re
from datetime import datetime

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    files = []
    for dirname, dirnames, filenames in os.walk('/home/recordings'):
        for filename in filenames:
            name = os.path.join(dirname, filename)
            # Take the timestamp from ".../YYYY-MM-DD/HH-MM..." instead of calling os.stat()
            match = re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)
            if match is None:
                continue  # not a recording path
            try:
                files.append((name, datetime.strptime(match[0], "%Y-%m-%d/%H-%M")))
            except ValueError:
                continue  # digits that do not form a valid date
    files.sort(key=lambda e: e[1])  # list.sort() sorts in place and returns None
    oldest_files = [name for name, _ in files[:len(camera_devices)]]
    logging.info("Removing %s", oldest_files)
    for oldest_file in oldest_files:
        os.remove(oldest_file)
        # logging.info("%s removed", oldest_file)
    logging.info("Removed")
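Trusting the names can be taken one step further. With the layout from the question, ./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4, sorted directory and file names are already in chronological order, so the oldest clip of each camera can be found by listing only a few directories instead of walking the whole tree. The sketch below assumes that every entry follows that layout and that deleting one oldest clip per camera is an acceptable stand-in for deleting the N globally oldest files; oldest_per_camera is a hypothetical name.

```python
import os


def oldest_per_camera(recordings_root):
    """Return the oldest .mp4 path (by name) for each camera directory."""
    oldest = []
    for camera in sorted(os.listdir(recordings_root)):
        camera_dir = os.path.join(recordings_root, camera)
        if not os.path.isdir(camera_dir):
            continue
        # Date directories are named YYYY-MM-DD, so string order is chronological.
        for day in sorted(os.listdir(camera_dir)):
            day_dir = os.path.join(camera_dir, day)
            if not os.path.isdir(day_dir):
                continue
            clips = sorted(f for f in os.listdir(day_dir) if f.endswith(".mp4"))
            if clips:
                oldest.append(os.path.join(day_dir, clips[0]))
                break  # the earliest non-empty day is enough for this camera
    return oldest
```

Date directories without any .mp4 file are simply skipped here; a caller could also remove them with os.rmdir() once they are empty so that later scans stay short.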
Below is part of my code, in which I am trying to iterate over PE files. I keep getting the same error:
[Errno 2] No such file or directory: '//FlickLearningWizard.exe'
I tried using os.path.join(filepath), but it does not do anything since I have already built the path. I also got rid of the '/', but that did not help much. Here is my code:
B = 65521
T = {}
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
    samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = directories+"/"+file
        fileByteSequence = readFile(filePath)
        fileNgrams = byteSequenceToNgrams(filePath,N)
        hashFileNgramsIntoDictionary(fileNgrams,T)
K1 = 1000
import heapq
K1_most_common_Ngrams_Using_Hash_Grams = heapq.nlargest(K1, T)
And here is my complete error message:
FileNotFoundError Traceback (most recent call last)
<ipython-input-63-eb8b9254ac6d> in <module>
6 for file in samples:
7 filePath = directories+"/"+file
----> 8 fileByteSequence = readFile(filePath)
9 fileNgrams = byteSequenceToNgrams(filePath,N)
10 hashFileNgramsIntoDictionary(fileNgrams,T)
<ipython-input-3-4bdd47640108> in readFile(filePath)
1 def readFile(filePath):
----> 2 with open(filePath, "rb") as binary_file:
3 data = binary_file.read()
4 return data
5 def byteSequenceToNgrams(byteSequence, n):
A sample of the files I am trying to iterate through, which are in datasetPath:
['FlickLearningWizard.exe', 'autochk.exe', 'cmd.exe', 'BitLockerWizard.exe', 'iexplore.exe', 'AxInstUI.exe', 'fvenotify.exe', 'DismHost.exe', 'GameBarPresenceWriter.exe', 'consent.exe', 'fax_390392029_072514.exe', 'Win32.AgentTesla.exe', '{71257279-042b-371d-a1d3-fbf8d2fadffa}.exe', 'imecfmui.exe', 'HxCalendarAppImm.exe', 'CExecSvc.exe', 'bootim.exe', 'dumped.exe', 'FXSSVC.exe', 'drvinst.exe', 'DW20.exe', 'appidtel.exe', 'baaupdate.exe', 'AuthHost.exe', 'last.exe', 'BitLockerToGo.exe', 'EhStorAuthn.exe', 'IMTCLNWZ.EXE', 'drvcfg.exe', 'makecab.exe', 'licensingdiag.exe', 'ldp.exe', 'win33.exe', 'forfiles.exe', 'DWWIN.EXE', 'comp.exe', 'coredpussvr.exe', 'AddSuggestedFoldersToLibraryDialog.exe', 'InetMgr6.exe', '3_4.exe', 'CIDiag.exe', 'win32.exe', 'LanguageComponentsInstallerComHandler.exe', 'sample.exe', 'Win32.SofacyCarberp.exe', 'EASPolicyManagerBrokerHost.exe', '131.exe', 'AddInUtil.exe', 'fixmapi.exe', 'cmdl32.exe', 'chkntfs.exe', 'instnm.exe', 'ImagingDevices.exe', 'BitLockerWizardElev.exe', 'bdechangepin.exe', 'logman.exe', '.DS_Store', 'bootcfg.exe', 'DsmUserTask.exe', 'find.exe', 'LogCollector.exe', 'HxTsr.exe', 'lpq.exe', 'ctfmon.exe', 'AppInstaller.exe', 'hvsimgr.exe', 'Vcffipzmnipbxzdl.exe', 'lpremove.exe', 'hdwwiz.exe', 'CastSrv.exe', 'gpresult.exe', 'hvix64.exe', 'HvsiSettingsWorker.exe', 'fodhelper.exe', '21.exe', 'InspectVhdDialog6.2.exe', '798_abroad.exe', 'doskey.exe', 'AuditShD.exe', 'alg.exe', 'certutil.exe', 'bitsadmin.exe', 'help.exe', 'fsquirt.exe', 'PDFXCview.exe', 'inetinfo.exe', 'Win32.Wannacry.exe', 'dcdiag.exe', 'LsaIso.exe', 'lpr.exe', 'dtdump.exe', 'FileHistory.exe', 'LockApp.exe', 'AppVShNotify.exe', 'DeviceProperties.exe', 'ilasm.exe', 'CheckNetIsolation.exe', 'FilePicker.exe', 'choice.exe', 'ComSvcConfig.exe', 'Calculator.exe', 'CredDialogHost.exe', 'logagent.exe', 'InspectVhdDialog6.3.exe', 'junction.exe', 'findstr.exe', 'ktmutil.exe', 'csvde.exe', 'esentutl.exe', 'Win32.GravityRAT.exe', 'bootsect.exe', 'BdeUISrv.exe', 'ChtIME.exe', 
'ARP.EXE', 'dsdbutil.exe', 'iisreset.exe', '1003.exe', 'getmac.exe', 'dllhost.exe', 'BOTBINARY.EXE', 'cscript.exe', 'dnscacheugc.exe', 'aspnet_regbrowsers.exe', 'hvax64.exe', 'CredentialUIBroker.exe', 'dpnsvr.exe', 'ApplyTrustOffline.exe', 'LxRun.exe', 'credwiz.exe', '1002.exe', 'FileExplorer.exe', 'BackgroundTransferHost.exe', 'convert.exe', 'AppVClient.exe', 'evntcmd.exe', 'attrib.exe', 'ClipUp.exe', 'DmNotificationBroker.exe', 'dcomcnfg.exe', 'dvdplay.exe', 'Dism.exe', 'AtBroker.exe', 'invoice_2318362983713_823931342io.pdf.exe', 'DataSvcUtil.exe', 'bdeunlock.exe', 'DeviceCensus.exe', 'dstokenclean.exe', 'AndroRat Binder_Patched.exe', 'iediagcmd.exe', 'comrepl.exe', 'dispdiag.exe', 'FlashUtil_ActiveX.exe', 'cliconfg.exe', 'aitstatic.exe', 'gpupdate.exe', 'GetHelp.exe', 'charmap.exe', 'aspnet_regsql.exe', 'IMEWDBLD.EXE', 'AppVStreamingUX.exe', 'dwm.exe', 'Ransomware.Unnamed_0.exe', 'csc.exe', 'bridgeunattend.exe', 'icacls.exe', 'dialer.exe', 'BdeHdCfg.exe', 'fontdrvhost.exe', '027cc450ef5f8c5f653329641ec1fed9.exe', 'LocationNotificationWindows.exe', 'dpapimig.exe', 'BitLockerDeviceEncryption.exe', 'ftp.exe', 'Eap3Host.exe', 'dfsvc.exe', 'LogonUI.exe', 'Fake Intel (1).exe', 'chglogon.exe', 'fhmanagew.exe', 'changepk.exe', 'aspnetca.exe', 'IMEPADSV.EXE', 'browserexport.exe', 'bcdboot.exe', 'aspnet_wp.exe', 'FXSCOVER.exe', 'dllhst3g.exe', 'CertEnrollCtrl.exe', 'EduPrintProv.exe', 'ielowutil.exe', 'ADSchemaAnalyzer.exe', 'cygrunsrv.exe', 'HxAccounts.exe', 'diskperf.exe', 'certreq.exe', 'bcdedit.exe', 'efsui.exe', 'klist.exe', 'raffle.exe', 'cacls.exe', 'hvc.exe', 'cmmon32.exe', 'BioIso.exe', 'AssignedAccessLockApp.exe', 'DmOmaCpMo.exe', 'AppLaunch.exe', 'AddInProcess.exe', 'dasHost.exe', 'dmcertinst.exe', 'IMJPSET.EXE', 'cmbins.exe', 'LicenseManagerShellext.exe', 'diskpart.exe', 'iscsicpl.exe', 'chown.exe', 'Magnify.exe', 'aapt.exe', 'false.exe', 'BioEnrollmentHost.exe', 'hvsirdpclient.exe', 'c2wtshost.exe', 'dplaysvr.exe', 'ChsIME.exe', 'fsavailux.exe', 
'Win32.WannaPeace.exe', 'CasPol.exe', 'icsunattend.exe', 'fveprompt.exe', 'expand.exe', 'chgusr.exe', 'hvsirpcd.exe', 'MiniConfigBuilder.exe', 'FirstLogonAnim.exe', 'EDPCleanup.exe', 'ksetup.exe', 'AppVDllSurrogate.exe', 'InstallUtil.exe', 'immersivetpmvscmgrsvr.exe', 'cmdkey.exe', 'appcmd.exe', 'Build.exe', 'hostr.exe', 'CloudStorageWizard.exe', 'DWTRIG20.EXE', 'file_4571518150a8181b403df4ae7ad54ce8b16ded0c.exe', 'FsIso.exe', 'chmod.exe', 'imjpuexc.exe', 'CHXSmartScreen.exe', 'iissetup.exe', '7ZipSetup.exe', 'svchost.exe', 'ldifde.exe', 'logoff.exe', 'DiskSnapshot.exe', 'fontview.exe', 'LaunchWinApp.exe', 'GamePanel.exe', 'yfoye_dump.exe', 'ls.exe', 'HOSTNAME.EXE', 'at.exe', 'InetMgr.exe', 'FaceFodUninstaller.exe', 'InputPersonalization.exe', 'AppVNice.exe', 'ImeBroker.exe', 'CameraSettingsUIHost.exe', 'Defrag.exe', 'lpksetup.exe', 'djoin.exe', 'irftp.exe', 'DTUHandler.exe', 'LockScreenContentServer.exe', 'dsamain.exe', 'lpkinstall.exe', 'DataStoreCacheDumpTool.exe', 'dmclient.exe', 'dump1.exe', 'Cain.exe', 'AddInProcess32.exe', 'appidcertstorecheck.exe', 'IMJPUEX.EXE', 'HxOutlook.exe', 'FlashPlayerApp.exe', 'diskraid.exe', 'bthudtask.exe', 'explorer.exe', 'CompMgmtLauncher.exe', 'malware.exe', 'njRAT.exe', 'CompatTelRunner.exe', 'evntwin.exe', 'Dxpserver.exe', 'HelpPane.exe', 'cvtres.exe', 'dxdiag.exe', 'hvsievaluator.exe', 'signed.exe', 'csrss.exe', 'InstallBC201401.exe', 'audiodg.exe', 'dsregcmd.exe', 'ApproveChildRequest.exe', 'iisrstas.exe', 'chkdsk.exe', 'lodctr.exe', 'aspnet_state.exe', 'DiagnosticsHub.StandardCollector.Service.exe', 'chgport.exe', 'cleanmgr.exe', 'GameBar.exe', 'AgentService.exe', 'InfDefaultInstall.exe', 'IMESEARCH.EXE', 'Fondue.exe', 'iexpress.exe', 'backgroundTaskHost.exe', 'dfrgui.exe', 'cofire.exe', 'BrowserCore.exe', 'clip.exe', 'appidpolicyconverter.exe', 'ed01ebfbc9eb5bbea545af4d01bf5f1071661840480439c6e5babe8e080e41aa.exe', 'cipher.exe', 'DeviceEject.exe', 'cerber.exe', '5a765351046fea1490d20f25.exe', 
'CloudExperienceHostBroker.exe', 'FXSUNATD.exe', 'GenValObj.exe', 'lsass.exe', 'ddodiag.exe', 'cmstp.exe', 'wirelesskeyview.exe', 'edpnotify.exe', 'CameraBarcodeScannerPreview.exe', 'bfsvc.exe', 'eventcreate.exe', 'driverquery.exe', 'CCG.exe', 'ConfigSecurityPolicy.exe', 'ieUnatt.exe', 'eshell.exe', 'ipconfig.exe', 'jsc.exe', 'gpscript.exe', 'LaunchTM.exe', 'cttunesvr.exe', 'curl.exe', 'cttune.exe', 'DevicePairingWizard.exe', 'ByteCodeGenerator.exe', 'IEChooser.exe', 'LockAppHost.exe', 'DataExchangeHost.exe', 'dxgiadaptercache.exe', 'dsacls.exe', 'Locator.exe', 'DpiScaling.exe', 'DisplaySwitch.exe', 'autoconv.exe', 'IMJPDCT.EXE', 'ieinstal.exe', 'colorcpl.exe', 'auditpol.exe', 'dccw.exe', 'DeviceEnroller.exe', 'UpdateCheck.exe', 'LicensingUI.exe', 'ExtExport.exe', 'easinvoker.exe', 'ApplySettingsTemplateCatalog.exe', 'eventvwr.exe', 'browser_broker.exe', 'extrac32.exe', 'EaseOfAccessDialog.exe', 'label.exe', 'change.exe', 'IMCCPHR.exe', 'audit.exe', 'aspnet_compiler.exe', 'aspnet_regiis.exe', 'desktopimgdownldr.exe', 'dmcfghost.exe', 'ComputerDefaults.exe', 'control.exe', 'DeviceCredentialDeployment.exe', 'compact.exe', 'InspectVhdDialog.exe', 'EdmGen.exe', 'cmak.exe', 'AppHostRegistrationVerifier.exe', 'DataUsageLiveTileTask.exe', 'hcsdiag.exe', 'gchrome.exe', 'adamuninstall.exe', 'CloudNotifications.exe', 'dusmtask.exe', 'fc.exe', 'hh.exe', 'eudcedit.exe', 'iscsicli.exe', 'DFDWiz.exe', 'isoburn.exe', 'IMTCPROP.exe', 'CapturePicker.exe', 'abba_-_happy_new_year_zaycev_net.exe', 'finger.exe', 'ApplicationFrameHost.exe', 'calc.exe', 'counter.exe', 'editrights.exe', 'fltMC.exe', 'convertvhd.exe', 'LegacyNetUXHost.exe', 'grpconv.exe', 'ie4uinit.exe', 'dsmgmt.exe', 'fsutil.exe', 'AppResolverUX.exe', 'BootExpCfg.exe', 'conhost.exe', 'bash.exe', 'IcsEntitlementHost.exe']
Can anyone help please?
(Edited in reaction to question updates; probably scroll down to the end.)
This probably contains more than one bug:
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
    samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = directories+"/"+file
        fileByteSequence = readFile(filePath)
Without knowledge of the precise data types here, it's hard to know exactly how to fix this. But certainly, if datasetPath is a list of paths, os.path.join(datasetPath, f) will not produce what you hope and expect.
Assuming datasetPath contains something like [r'A:\', r'c:\windows\kill me now'], a more or less logical rewrite could look something like
for dir in datasetPath:
    samples = []
    for f in os.listdir(dir):
        p = os.path.join(dir, f)
        if isfile(p):
            samples.append(p)
    for filePath in samples:
        fileByteSequence = readFile(filePath)
Notice how we produce the full path just once, and then keep that. Notice how we use the loop variable dir inside the loop, not the list of paths we are looping over.
Actually, I'm guessing datasetPath is really a string, but then the for loop makes no sense (you end up looping over the characters in the string one by one).
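That guess would also explain the exact path in the error message. A minimal sketch, assuming datasetPath is a string that starts with "/" (the value "/data" is made up; the real one is not shown in the question):

```python
def build_paths(dataset_path, filename):
    """Reproduce the question's path construction for a given datasetPath value."""
    return [directory + "/" + filename for directory in dataset_path]


# If datasetPath is a string, the loop variable is a single character each time.
as_string = build_paths("/data", "FlickLearningWizard.exe")
print(as_string[0])   # //FlickLearningWizard.exe

# If datasetPath is a list of directories, each item is a whole path.
as_list = build_paths(["/data"], "FlickLearningWizard.exe")
print(as_list[0])     # /data/FlickLearningWizard.exe
```

In the string case the first "directory" is just "/", so the code tries to open '//FlickLearningWizard.exe' in the filesystem root, which matches the reported FileNotFoundError.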
If you merely want to check which of these files exist in the current directory, you are massively overcomplicating things.
for filePath in os.listdir('.'):
    if filePath in datasetPath:
        fileByteSequence = readFile(filePath)
Whether you loop over the files on the disk and check which ones are on your list or vice versa is not a crucial design detail; I have preferred the former on the theory that you want to minimize the number of disk accesses (in the ideal case you get all file names from the disk with a single system call).
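One small detail if the list of wanted names is long: a membership test on a list is linear, so converting the names to a set first keeps each lookup cheap. A sketch under the assumption that datasetPath is a list of bare file names; files_present is a hypothetical helper:

```python
import os


def files_present(wanted_names, directory="."):
    """Return the wanted names that actually exist in directory, sorted."""
    wanted = set(wanted_names)  # constant-time membership tests on average
    # One directory listing, then pure in-memory lookups.
    return sorted(name for name in os.listdir(directory) if name in wanted)
```

The result can then be passed to readFile() one name at a time, joined with the directory if it is not the current one.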
I got it working finally.
For some reason it did not run properly on the Macintosh machine, so I ran the same code on Linux and Windows, where it ran successfully.
The problem was with this line specifically:
samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
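For reference, here is a hedged variant of that line that keeps the full path of each sample and skips hidden entries such as the '.DS_Store' file that shows up in the listing above. It assumes datasetPath is a single directory given as a string; list_samples is a hypothetical name, and whether hidden files were the actual cause on macOS is not established here.

```python
import os


def list_samples(dataset_dir):
    """Return full paths of regular, non-hidden files directly inside dataset_dir."""
    samples = []
    with os.scandir(dataset_dir) as entries:
        for entry in entries:
            # Skip directories and dot-files such as .DS_Store.
            if entry.is_file() and not entry.name.startswith("."):
                samples.append(entry.path)
    return sorted(samples)
```

Each returned path can be handed straight to readFile(), so no further string concatenation with "/" is needed.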