Store output from for loop in an array - python

I am still at the very beginning of my Python journey, so this question might be basic for more advanced programmers.
I would like to analyse a bunch of .wav files that are all stored in the same directory, so I created a list of all the filenames so I could then get each file's audio signal and sample rate.
dirPath = r"path_to_directory"
global files
files = [f for f in os.listdir(dirPath) if os.path.isfile(os.path.join(dirPath, f)) and f.endswith(".wav")]
for file_name in files:
    path_to_file = dirPath + "\\" + file_name
    audio_signal, sample_rate = sf.read(path_to_file)
with sf being the soundfile library.
audio_signal is an array, sample_rate is a number.
Now I would like to be able to store both audio_signal and sample_rate together with the corresponding file_name, so I can access them later. How do I do this?
I tried
files = [f for f in os.listdir(dirPath) if os.path.isfile(os.path.join(dirPath, f)) and f.endswith(".wav")], [], []
for file_name in files[0]:
    path_to_file = dirPath + "\\" + file_name
    audio_signal, sample_rate = sf.read(path_to_file)
    files[1].append(audio_signal)
    files[2].append(sample_rate)
which seems to work, but is there a more elegant way? I feel like this treats audio_signal, sample_rate and file_name as individual values rather than as values that belong together.

The data structure you are looking for is an associative array, which associates a key with a value – here the key is the file name and the value is a tuple consisting of the audio signal and sample rate.
An implementation of associative arrays is built into Python as the dictionary type.
You can learn about dictionaries here:
https://docs.python.org/3/tutorial/datastructures.html#dictionaries
The application in your code would look like:
result = {}
for file_name in files:
    # as before:
    # ...
    audio_signal, sample_rate = ...
    # new:
    result[file_name] = audio_signal, sample_rate
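Putting the dictionary approach together end to end (a minimal sketch: `fake_read` stands in for `sf.read`, and the throwaway temp directory mimics `dirPath`):

```python
import os
import tempfile

def fake_read(path):
    """Stand-in for sf.read(); returns (audio_signal, sample_rate)."""
    return [0.0, 0.1, 0.2], 44100

# Build a throwaway directory with a couple of files to mimic dirPath.
dirPath = tempfile.mkdtemp()
for name in ("first.wav", "second.wav", "notes.txt"):
    open(os.path.join(dirPath, name), "w").close()

result = {}
for file_name in sorted(os.listdir(dirPath)):
    if file_name.endswith(".wav"):
        audio_signal, sample_rate = fake_read(os.path.join(dirPath, file_name))
        result[file_name] = (audio_signal, sample_rate)

print(sorted(result))  # ['first.wav', 'second.wav']
```

Each entry can then be retrieved by name later, e.g. `audio_signal, sample_rate = result["first.wav"]`.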


File retention mechanism in a large data storage

Recently I faced a performance problem with mp4 file retention. I have a kind of recorder which saves 1-minute-long mp4 files from multiple RTSP streams. Those files are stored on an external drive in a file tree like this:
./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4
Apart from the video files, there are many other files on this drive which are not considered (unless they have the mp4 extension), as they take up much less space.
The retention works as follows. Every minute, the Python script responsible for recording checks the external drive's fill level. If the level is above 80%, it scans the whole drive for .mp4 files. When scanning is done, it sorts the list of files by creation date and deletes a number of the oldest files equal to the number of cameras.
The part of the code, which is responsible for files retention, is shown below.
total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        oldest_files = sorted(
            (
                os.path.join(dirname, filename)
                for dirname, dirnames, filenames in os.walk('/home')
                for filename in filenames
                if filename.endswith(".mp4")
            ),
            key=lambda fn: os.stat(fn).st_mtime,
        )[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file in oldest_files:
            os.remove(oldest_file)
            logging.info("%s removed", oldest_file)
    except ValueError as e:
        # no files to delete
        pass
(/home is external drive mount point)
The problem is that this mechanism used to work like a charm when I used a 256 or 512 GB SSD. Now I need more space (more cameras and longer storage time), and it takes a lot of time to build the file list on a larger SSD (2 to 5 TB now, maybe 8 TB in the future). The scanning process takes much longer than 1 min, which could be mitigated by running it less often and extending the "to delete" file list. The real problem is that the process itself uses a lot of CPU (through I/O ops). The performance drop is visible in the whole system: other applications, like some simple computer vision algorithms, work slower, and the CPU load can even cause a kernel panic.
The HW I work on is Nvidia Jetson Nano and Xavier NX. Both devices have problem with performance as I described above.
The question is whether you know some algorithms or off-the-shelf software for file retention that would work for the case I described. Or maybe there is a way to rewrite my code to make it more reliable and performant?
EDIT:
I was able to lower the os.walk() impact by limiting the space to check. Now I just scan /home/recordings and /home/recognition/, which also lowers the directory tree depth (for the recursive scan). At the same time, I've added .jpg file checking, so now I look for both .mp4 and .jpg. The result is much better with this implementation.
However, I need further optimization. I prepared some test cases, and tested them on 1 TB drive which is 80% filled (media files mostly). I attached profiler results per case below.
@time_measure
def method6():
    paths = [
        "/home/recordings",
        "/home/recognition",
        "/home/recognition/marked_frames",
    ]
    files = []
    for path in paths:
        files.extend((
            os.path.join(dirname, filename)
            for dirname, dirnames, filenames in os.walk(path)
            for filename in filenames
            if (filename.endswith(".mp4") or filename.endswith(".jpg")) and not os.path.islink(os.path.join(dirname, filename))
        ))
    oldest_files = sorted(
        files,
        key=lambda fn: os.stat(fn).st_mtime,
    )
    print(oldest_files[:5])
@time_measure
def method7():
    ext = [".mp4", ".jpg"]
    paths = [
        "/home/recordings/*/*/*",
        "/home/recognition/*",
        "/home/recognition/marked_frames/*",
    ]
    files = []
    for path in paths:
        files.extend((file for file in glob(path) if not os.path.islink(file) and (file.endswith(".mp4") or file.endswith(".jpg"))))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    print(oldest_files[:5])
The original implementation on the same data set took ~100 s.
EDIT2
Comparison of @norok2's proposals
I compared them with method6 and method7 from above. I tried several times with similar results.
Testing method7
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 24.73726773262024 s
_________________________
Testing find_oldest
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 34.355509757995605 s
_________________________
Testing find_oldest_cython
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 25.81963086128235 s
The variants profiled were: method7 (glob()), iglob(), and Cython.
You could get an extra few percent speed-up on top of your method7() with the following:
import os
import glob

def find_oldest(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = [
        filename
        for path in paths
        for filename in glob.iglob(path)
        if any(filename.endswith(ext) for ext in exts) and not os.path.islink(filename)]
    mtime_idxs = sorted(
        (os.stat(fn).st_mtime, i)
        for i, fn in enumerate(result))
    return [result[mtime_idxs[i][1]] for i in range(k)]
The main improvements are:
use iglob() instead of glob() -- while it may be of comparable speed, it takes significantly less memory, which may help on low-end machines
str.endswith() is checked before the allegedly more expensive os.path.islink(), which helps reduce the number of such calls thanks to short-circuiting
an intermediate list with all the mtimes is produced to minimize the os.stat() calls
This can be sped up even further with Cython:
%%cython --cplus -c=-O3 -c=-march=native -a
import os
import glob

cpdef find_oldest_cy(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = []
    for path in paths:
        for filename in glob.iglob(path):
            good_ext = False
            for ext in exts:
                if filename.endswith(ext):
                    good_ext = True
                    break
            if good_ext and not os.path.islink(filename):
                result.append(filename)
    mtime_idxs = []
    for i, fn in enumerate(result):
        mtime_idxs.append((os.stat(fn).st_mtime, i))
    mtime_idxs.sort()
    return [result[mtime_idxs[i][1]] for i in range(k)]
My tests on the following files:
def gen_files(n, exts=("mp4", "jpg", "txt"), filename="somefile", content="content"):
    for i in range(n):
        ext = exts[i % len(exts)]
        with open(f"{filename}{i}.{ext}", "w") as f:
            f.write(content)

gen_files(10_000)
produces the following:
funcs = find_oldest_OP, find_oldest, find_oldest_cy
timings = []
base = funcs[0]()
for func in funcs:
    res = func()
    is_good = base == res
    timed = %timeit -r 8 -n 4 -q -o func()
    timing = timed.best * 1e3
    timings.append(timing if is_good else None)
    print(f"{func.__name__:>24} {is_good} {timing:10.3f} ms")
# find_oldest_OP True 81.074 ms
# find_oldest True 70.994 ms
# find_oldest_cy True 64.335 ms
find_oldest_OP is the following, based on method7() from OP:
def find_oldest_OP(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    files = []
    for path in paths:
        files.extend(
            (file for file in glob.glob(path)
             if not os.path.islink(file) and any(file.endswith(ext) for ext in exts)))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    return oldest_files[:k]
The Cython version seems to point to a ~25% reduction in execution time.
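One further tweak (my suggestion, not part of the answer above): since only the k oldest files are needed, heapq.nsmallest avoids sorting the whole list, running in O(n log k) instead of O(n log n):

```python
import heapq
import os

def k_oldest(filenames, k=5):
    """Return the k files with the smallest mtime without sorting the whole list."""
    return heapq.nsmallest(k, filenames, key=lambda fn: os.stat(fn).st_mtime)
```

In practice the os.stat() calls usually dominate, so the win is modest, but when `filenames` is a generator it also avoids materializing and sorting the full list in memory.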
You could use the subprocess module to list all the mp4 files directly, without having to loop through all the files in the directory.
import os
import subprocess as sb

# note: `dir /b /s` is Windows-only; on Linux something like `find /home -name '*.mp4'` would be needed
output = sb.getoutput(r"dir /b /s .\home\*.mp4")
oldest_files = sorted(output.split("\n"), key=lambda fn: os.stat(fn).st_mtime)[:len(camera_devices)]
A quick optimization would be not to bother checking file creation time and to trust the filename instead.
total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        files = []
        for dirname, dirnames, filenames in os.walk('/home/recordings'):
            for filename in filenames:
                files.append((
                    name := os.path.join(dirname, filename),
                    datetime.strptime(
                        re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0],
                        "%Y-%m-%d/%H-%M"
                    )))
        oldest_files = sorted(files, key=lambda e: e[1])[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file, _ in oldest_files:
            os.remove(oldest_file)
            # logging.info("%s removed", oldest_file)
        logging.info("Removed")
    except ValueError:
        # no files to delete
        pass
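Another direction worth sketching (my addition, not from any answer above) is os.scandir(), which yields DirEntry objects: their type checks usually need no extra system call, and the stat() result is cached on the entry, so a later sort by mtime does not re-stat every file:

```python
import os

def iter_media(root, exts=(".mp4", ".jpg")):
    """Recursively yield (path, mtime) for matching files using os.scandir."""
    stack = [root]
    while stack:
        path = stack.pop()
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.endswith(exts) and entry.is_file(follow_symlinks=False):
                    # entry.stat() is cached on the DirEntry after the first call
                    yield entry.path, entry.stat(follow_symlinks=False).st_mtime

# oldest first, e.g.:
# oldest = sorted(iter_media("/home/recordings"), key=lambda t: t[1])[:10]
```

Whether this beats os.walk() here depends on the filesystem, but it is the documented way to avoid the extra per-file stat() round trips.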

Iterating over PE files

Below is part of my code in which I am trying to iterate over PE files. I keep getting the same error:
[Errno 2] No such file or directory: '//FlickLearningWizard.exe'
I tried using os.path.join(filepath), but it does not do anything since I have already built the path. I got rid of the '/' but it did not help much. Here is my code:
B = 65521
T = {}
for directories in datasetPath: # directories iterating over my datasetPath which contains the list of my PE files
    samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = directories + "/" + file
        fileByteSequence = readFile(filePath)
        fileNgrams = byteSequenceToNgrams(filePath, N)
        hashFileNgramsIntoDictionary(fileNgrams, T)
K1 = 1000
import heapq
K1_most_common_Ngrams_Using_Hash_Grams = heapq.nlargest(K1, T)
And here is my complete error message:
FileNotFoundError Traceback (most recent call last)
<ipython-input-63-eb8b9254ac6d> in <module>
6 for file in samples:
7 filePath = directories+"/"+file
----> 8 fileByteSequence = readFile(filePath)
9 fileNgrams = byteSequenceToNgrams(filePath,N)
10 hashFileNgramsIntoDictionary(fileNgrams,T)
<ipython-input-3-4bdd47640108> in readFile(filePath)
1 def readFile(filePath):
----> 2 with open(filePath, "rb") as binary_file:
3 data = binary_file.read()
4 return data
5 def byteSequenceToNgrams(byteSequence, n):
A sample of the files I am trying to iterate through, which are in the datasetPath:
['FlickLearningWizard.exe', 'autochk.exe', 'cmd.exe', 'BitLockerWizard.exe', 'iexplore.exe', 'AxInstUI.exe', 'fvenotify.exe', 'DismHost.exe', 'GameBarPresenceWriter.exe', 'consent.exe', 'fax_390392029_072514.exe', 'Win32.AgentTesla.exe', '{71257279-042b-371d-a1d3-fbf8d2fadffa}.exe', 'imecfmui.exe', 'HxCalendarAppImm.exe', 'CExecSvc.exe', 'bootim.exe', 'dumped.exe', 'FXSSVC.exe', 'drvinst.exe', 'DW20.exe', 'appidtel.exe', 'baaupdate.exe', 'AuthHost.exe', 'last.exe', 'BitLockerToGo.exe', 'EhStorAuthn.exe', 'IMTCLNWZ.EXE', 'drvcfg.exe', 'makecab.exe', 'licensingdiag.exe', 'ldp.exe', 'win33.exe', 'forfiles.exe', 'DWWIN.EXE', 'comp.exe', 'coredpussvr.exe', 'AddSuggestedFoldersToLibraryDialog.exe', 'InetMgr6.exe', '3_4.exe', 'CIDiag.exe', 'win32.exe', 'LanguageComponentsInstallerComHandler.exe', 'sample.exe', 'Win32.SofacyCarberp.exe', 'EASPolicyManagerBrokerHost.exe', '131.exe', 'AddInUtil.exe', 'fixmapi.exe', 'cmdl32.exe', 'chkntfs.exe', 'instnm.exe', 'ImagingDevices.exe', 'BitLockerWizardElev.exe', 'bdechangepin.exe', 'logman.exe', '.DS_Store', 'bootcfg.exe', 'DsmUserTask.exe', 'find.exe', 'LogCollector.exe', 'HxTsr.exe', 'lpq.exe', 'ctfmon.exe', 'AppInstaller.exe', 'hvsimgr.exe', 'Vcffipzmnipbxzdl.exe', 'lpremove.exe', 'hdwwiz.exe', 'CastSrv.exe', 'gpresult.exe', 'hvix64.exe', 'HvsiSettingsWorker.exe', 'fodhelper.exe', '21.exe', 'InspectVhdDialog6.2.exe', '798_abroad.exe', 'doskey.exe', 'AuditShD.exe', 'alg.exe', 'certutil.exe', 'bitsadmin.exe', 'help.exe', 'fsquirt.exe', 'PDFXCview.exe', 'inetinfo.exe', 'Win32.Wannacry.exe', 'dcdiag.exe', 'LsaIso.exe', 'lpr.exe', 'dtdump.exe', 'FileHistory.exe', 'LockApp.exe', 'AppVShNotify.exe', 'DeviceProperties.exe', 'ilasm.exe', 'CheckNetIsolation.exe', 'FilePicker.exe', 'choice.exe', 'ComSvcConfig.exe', 'Calculator.exe', 'CredDialogHost.exe', 'logagent.exe', 'InspectVhdDialog6.3.exe', 'junction.exe', 'findstr.exe', 'ktmutil.exe', 'csvde.exe', 'esentutl.exe', 'Win32.GravityRAT.exe', 'bootsect.exe', 'BdeUISrv.exe', 'ChtIME.exe', 
'ARP.EXE', 'dsdbutil.exe', 'iisreset.exe', '1003.exe', 'getmac.exe', 'dllhost.exe', 'BOTBINARY.EXE', 'cscript.exe', 'dnscacheugc.exe', 'aspnet_regbrowsers.exe', 'hvax64.exe', 'CredentialUIBroker.exe', 'dpnsvr.exe', 'ApplyTrustOffline.exe', 'LxRun.exe', 'credwiz.exe', '1002.exe', 'FileExplorer.exe', 'BackgroundTransferHost.exe', 'convert.exe', 'AppVClient.exe', 'evntcmd.exe', 'attrib.exe', 'ClipUp.exe', 'DmNotificationBroker.exe', 'dcomcnfg.exe', 'dvdplay.exe', 'Dism.exe', 'AtBroker.exe', 'invoice_2318362983713_823931342io.pdf.exe', 'DataSvcUtil.exe', 'bdeunlock.exe', 'DeviceCensus.exe', 'dstokenclean.exe', 'AndroRat Binder_Patched.exe', 'iediagcmd.exe', 'comrepl.exe', 'dispdiag.exe', 'FlashUtil_ActiveX.exe', 'cliconfg.exe', 'aitstatic.exe', 'gpupdate.exe', 'GetHelp.exe', 'charmap.exe', 'aspnet_regsql.exe', 'IMEWDBLD.EXE', 'AppVStreamingUX.exe', 'dwm.exe', 'Ransomware.Unnamed_0.exe', 'csc.exe', 'bridgeunattend.exe', 'icacls.exe', 'dialer.exe', 'BdeHdCfg.exe', 'fontdrvhost.exe', '027cc450ef5f8c5f653329641ec1fed9.exe', 'LocationNotificationWindows.exe', 'dpapimig.exe', 'BitLockerDeviceEncryption.exe', 'ftp.exe', 'Eap3Host.exe', 'dfsvc.exe', 'LogonUI.exe', 'Fake Intel (1).exe', 'chglogon.exe', 'fhmanagew.exe', 'changepk.exe', 'aspnetca.exe', 'IMEPADSV.EXE', 'browserexport.exe', 'bcdboot.exe', 'aspnet_wp.exe', 'FXSCOVER.exe', 'dllhst3g.exe', 'CertEnrollCtrl.exe', 'EduPrintProv.exe', 'ielowutil.exe', 'ADSchemaAnalyzer.exe', 'cygrunsrv.exe', 'HxAccounts.exe', 'diskperf.exe', 'certreq.exe', 'bcdedit.exe', 'efsui.exe', 'klist.exe', 'raffle.exe', 'cacls.exe', 'hvc.exe', 'cmmon32.exe', 'BioIso.exe', 'AssignedAccessLockApp.exe', 'DmOmaCpMo.exe', 'AppLaunch.exe', 'AddInProcess.exe', 'dasHost.exe', 'dmcertinst.exe', 'IMJPSET.EXE', 'cmbins.exe', 'LicenseManagerShellext.exe', 'diskpart.exe', 'iscsicpl.exe', 'chown.exe', 'Magnify.exe', 'aapt.exe', 'false.exe', 'BioEnrollmentHost.exe', 'hvsirdpclient.exe', 'c2wtshost.exe', 'dplaysvr.exe', 'ChsIME.exe', 'fsavailux.exe', 
'Win32.WannaPeace.exe', 'CasPol.exe', 'icsunattend.exe', 'fveprompt.exe', 'expand.exe', 'chgusr.exe', 'hvsirpcd.exe', 'MiniConfigBuilder.exe', 'FirstLogonAnim.exe', 'EDPCleanup.exe', 'ksetup.exe', 'AppVDllSurrogate.exe', 'InstallUtil.exe', 'immersivetpmvscmgrsvr.exe', 'cmdkey.exe', 'appcmd.exe', 'Build.exe', 'hostr.exe', 'CloudStorageWizard.exe', 'DWTRIG20.EXE', 'file_4571518150a8181b403df4ae7ad54ce8b16ded0c.exe', 'FsIso.exe', 'chmod.exe', 'imjpuexc.exe', 'CHXSmartScreen.exe', 'iissetup.exe', '7ZipSetup.exe', 'svchost.exe', 'ldifde.exe', 'logoff.exe', 'DiskSnapshot.exe', 'fontview.exe', 'LaunchWinApp.exe', 'GamePanel.exe', 'yfoye_dump.exe', 'ls.exe', 'HOSTNAME.EXE', 'at.exe', 'InetMgr.exe', 'FaceFodUninstaller.exe', 'InputPersonalization.exe', 'AppVNice.exe', 'ImeBroker.exe', 'CameraSettingsUIHost.exe', 'Defrag.exe', 'lpksetup.exe', 'djoin.exe', 'irftp.exe', 'DTUHandler.exe', 'LockScreenContentServer.exe', 'dsamain.exe', 'lpkinstall.exe', 'DataStoreCacheDumpTool.exe', 'dmclient.exe', 'dump1.exe', 'Cain.exe', 'AddInProcess32.exe', 'appidcertstorecheck.exe', 'IMJPUEX.EXE', 'HxOutlook.exe', 'FlashPlayerApp.exe', 'diskraid.exe', 'bthudtask.exe', 'explorer.exe', 'CompMgmtLauncher.exe', 'malware.exe', 'njRAT.exe', 'CompatTelRunner.exe', 'evntwin.exe', 'Dxpserver.exe', 'HelpPane.exe', 'cvtres.exe', 'dxdiag.exe', 'hvsievaluator.exe', 'signed.exe', 'csrss.exe', 'InstallBC201401.exe', 'audiodg.exe', 'dsregcmd.exe', 'ApproveChildRequest.exe', 'iisrstas.exe', 'chkdsk.exe', 'lodctr.exe', 'aspnet_state.exe', 'DiagnosticsHub.StandardCollector.Service.exe', 'chgport.exe', 'cleanmgr.exe', 'GameBar.exe', 'AgentService.exe', 'InfDefaultInstall.exe', 'IMESEARCH.EXE', 'Fondue.exe', 'iexpress.exe', 'backgroundTaskHost.exe', 'dfrgui.exe', 'cofire.exe', 'BrowserCore.exe', 'clip.exe', 'appidpolicyconverter.exe', 'ed01ebfbc9eb5bbea545af4d01bf5f1071661840480439c6e5babe8e080e41aa.exe', 'cipher.exe', 'DeviceEject.exe', 'cerber.exe', '5a765351046fea1490d20f25.exe', 
'CloudExperienceHostBroker.exe', 'FXSUNATD.exe', 'GenValObj.exe', 'lsass.exe', 'ddodiag.exe', 'cmstp.exe', 'wirelesskeyview.exe', 'edpnotify.exe', 'CameraBarcodeScannerPreview.exe', 'bfsvc.exe', 'eventcreate.exe', 'driverquery.exe', 'CCG.exe', 'ConfigSecurityPolicy.exe', 'ieUnatt.exe', 'eshell.exe', 'ipconfig.exe', 'jsc.exe', 'gpscript.exe', 'LaunchTM.exe', 'cttunesvr.exe', 'curl.exe', 'cttune.exe', 'DevicePairingWizard.exe', 'ByteCodeGenerator.exe', 'IEChooser.exe', 'LockAppHost.exe', 'DataExchangeHost.exe', 'dxgiadaptercache.exe', 'dsacls.exe', 'Locator.exe', 'DpiScaling.exe', 'DisplaySwitch.exe', 'autoconv.exe', 'IMJPDCT.EXE', 'ieinstal.exe', 'colorcpl.exe', 'auditpol.exe', 'dccw.exe', 'DeviceEnroller.exe', 'UpdateCheck.exe', 'LicensingUI.exe', 'ExtExport.exe', 'easinvoker.exe', 'ApplySettingsTemplateCatalog.exe', 'eventvwr.exe', 'browser_broker.exe', 'extrac32.exe', 'EaseOfAccessDialog.exe', 'label.exe', 'change.exe', 'IMCCPHR.exe', 'audit.exe', 'aspnet_compiler.exe', 'aspnet_regiis.exe', 'desktopimgdownldr.exe', 'dmcfghost.exe', 'ComputerDefaults.exe', 'control.exe', 'DeviceCredentialDeployment.exe', 'compact.exe', 'InspectVhdDialog.exe', 'EdmGen.exe', 'cmak.exe', 'AppHostRegistrationVerifier.exe', 'DataUsageLiveTileTask.exe', 'hcsdiag.exe', 'gchrome.exe', 'adamuninstall.exe', 'CloudNotifications.exe', 'dusmtask.exe', 'fc.exe', 'hh.exe', 'eudcedit.exe', 'iscsicli.exe', 'DFDWiz.exe', 'isoburn.exe', 'IMTCPROP.exe', 'CapturePicker.exe', 'abba_-_happy_new_year_zaycev_net.exe', 'finger.exe', 'ApplicationFrameHost.exe', 'calc.exe', 'counter.exe', 'editrights.exe', 'fltMC.exe', 'convertvhd.exe', 'LegacyNetUXHost.exe', 'grpconv.exe', 'ie4uinit.exe', 'dsmgmt.exe', 'fsutil.exe', 'AppResolverUX.exe', 'BootExpCfg.exe', 'conhost.exe', 'bash.exe', 'IcsEntitlementHost.exe']
Can anyone help please?
(Edited in reaction to question updates; probably scroll down to the end.)
This probably contains more than one bug:
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
    samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = directories + "/" + file
        fileByteSequence = readFile(filePath)
Without knowledge of the precise data types here, it's hard to know exactly how to fix this. But certainly, if datasetPath is a list of paths, os.path.join(datasetPath, f) will not produce what you hope and expect.
Assuming datasetPath contains something like ['A:\\', r'c:\windows\kill me now'], a more or less logical rewrite could look something like
for dir in datasetPath:
    samples = []
    for f in os.listdir(dir):
        p = os.path.join(dir, f)
        if isfile(p):
            samples.append(p)
    for filePath in samples:
        fileByteSequence = readFile(filePath)
Notice how we produce the full path just once, and then keep that. Notice how we use the loop variable dir inside the loop, not the list of paths we are looping over.
Actually, I'm guessing datasetPath is a string, but then the for loop makes no sense (you end up looping over the characters of the string one by one).
If you merely want to check which of these files exist in the current directory, you are massively overcomplicating things.
for filePath in os.listdir('.'):
    if filePath in datasetPath:
        fileByteSequence = readFile(filePath)
Whether you loop over the files on the disk and check which ones are on your list, or vice versa, is not a crucial design detail; I have preferred the former on the theory that you want to minimize the number of disk accesses (in the ideal case you get all file names from the disk with a single system call).
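If datasetPath really is a flat list of file names, converting it to a set makes the membership test O(1) instead of a linear scan (a small sketch of mine: readFile is copied from the question, the directory and names are made up):

```python
import os
import tempfile

def readFile(filePath):
    """readFile as defined in the question."""
    with open(filePath, "rb") as binary_file:
        return binary_file.read()

datasetPath = ["a.exe", "b.exe"]  # hypothetical list of names
wanted = set(datasetPath)         # O(1) membership instead of O(n) list scans

# Fake working directory with one matching and one non-matching file.
workdir = tempfile.mkdtemp()
for name in ("a.exe", "c.exe"):
    with open(os.path.join(workdir, name), "wb") as f:
        f.write(b"MZ")

read_names = []
for filePath in os.listdir(workdir):
    if filePath in wanted:
        fileByteSequence = readFile(os.path.join(workdir, filePath))
        read_names.append(filePath)

print(read_names)  # ['a.exe']
```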
I got it working finally.
For some reason it did not run properly on the Macintosh machine, so I ran the same code on Linux and Windows, where it worked successfully.
The problem was with this line specifically:
samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]

noise reduction on multiple wav file in python

I have used the code from
https://github.com/davidpraise45/Audio-Signal-Processing
to make a function and run it on an entire folder which contains around 100 wav files, but I am unable to get any output and can't understand what the problem is.
def noise_reduction(dirName):
    types = ('*.wav', '*.aif', '*.aiff', '*.mp3', '*.au', '*.ogg')
    wav_file_list = []
    for files in types:
        wav_file_list.extend(glob.glob(os.path.join(dirName, files)))
    wav_file_list = sorted(wav_file_list)
    wav_file_list2 = []
    for i, wavFile in enumerate(wav_file_list):
        # samples = get_samples(wavFile,)
        (Frequency, samples) = read(wavFile)
        FourierTransformation = sp.fft(samples)  # Calculating the fourier transformation of the signal
        scale = sp.linspace(0, Frequency, len(samples))
        b, a = signal.butter(5, 9800/(Frequency/2), btype='highpass')  # ButterWorth filter 4350
        filteredSignal = signal.lfilter(b, a, samples)
        c, d = signal.butter(5, 200/(Frequency/4), btype='lowpass')  # ButterWorth low-pass filter
        newFilteredSignal = signal.lfilter(c, d, filteredSignal)  # Applying the filter to the signal
        write(New,wavFile, Frequency, newFilteredSignal)

noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad\\sad_1.wav")
scipy.io.wavfile.read only supports the WAV format. It cannot read aif, aiff, mp3, au, or ogg files.
You are passing four arguments to the scipy.io.wavfile.write function, which only takes three. New,wavFile should most likely be os.path.join(os.path.dirname(wavFile), "New"+os.path.basename(wavFile)). This creates a file with the New prefix in the same directory as the original. If you want to create them in the current directory instead, use "New"+os.path.basename(wavFile).
You are passing a filename, not the name of a directory to your function:
noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad\\sad_1.wav")
should likely be:
noise_reduction("C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad")
This causes the glob pattern to end up being C:\\Users\\adity\\Desktop\\capstone\\hindi_dia_2\\sad\\sad_1.wav\\*.wav. This pattern has no matches unless sad_1.wav were a directory containing files that end with .wav.
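The output-path fix from the answer above can be factored into a small helper (the helper name is mine, just to illustrate the os.path calls involved):

```python
import os

def new_name(wavFile):
    """Build the 'New'-prefixed output path in the same directory as the input."""
    return os.path.join(os.path.dirname(wavFile), "New" + os.path.basename(wavFile))

# The corrected write call in the loop would then be:
# write(new_name(wavFile), Frequency, newFilteredSignal)
print(new_name(os.path.join("sad", "sad_1.wav")))  # on POSIX: sad/Newsad_1.wav
```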

sort images based on a cluster correspondances list

I have the following working code to sort images according to a cluster list which is a list of tuples: (image_id, cluster_id).
One image can only be in one and only one cluster (there is never the same image in two clusters for example).
I wonder if there is a way to shorten the "for+for+if+if" loops at the end of the code, since for each file name I must check every pair in the cluster list, which makes it a little redundant.
import os
import re
import shutil

srcdir = '/home/username/pictures/' #
if not os.path.isdir(srcdir):
    print("Error, %s is not a valid directory!" % srcdir)
    return None
pts_cls # is the list of pairs (image_id, cluster_id)
filelist = [(srcdir+fn) for fn in os.listdir(srcdir) if
            re.search(r'\.jpg$', fn, re.IGNORECASE)]
filelist.sort(key=lambda var: [int(x) if x.isdigit() else
                               x for x in re.findall(r'[^0-9]|[0-9]+', var)])
for f in filelist:
    fbname = os.path.splitext(os.path.basename(f))[0]
    for e, cls in enumerate(pts_cls): # for each (img_id, clst_id) pair
        if str(cls[0]) == fbname: # check if image_id corresponds to file basename on disk
            if cls[1] == -1: # if cluster_id is -1 (-> noise)
                outdir = srcdir + 'cluster_' + 'Noise' + '/'
            else:
                outdir = srcdir + 'cluster_' + str(cls[1]) + '/'
            if not os.path.isdir(outdir):
                os.makedirs(outdir)
            dstf = outdir + os.path.basename(f)
            if os.path.isfile(dstf) == False:
                shutil.copy2(f, dstf)
Of course, as I am pretty new to Python, any other well explained improvements are welcome!
I think you're complicating this far more than needed. Since your image names are unique (there can only be one image_id), you can safely convert pts_cls into a dict and have fast lookups on the spot instead of looping through the list of pairs each and every time. You are also using regex where it's not needed, and you're packing your paths only to unpack them later.
Also, your code would break if an image from your source directory happened not to be in pts_cls, as its outdir would never be set (or worse, its outdir would be the one left over from the previous loop iteration).
I'd streamline it like:
import os
import shutil

src_dir = "/home/username/pictures/"
if not os.path.isdir(src_dir):
    print("Error, %s is not a valid directory!" % src_dir)
    exit(1)  # return is expected only from functions

pts_cls = []  # the list of pairs (image_id, cluster_id), load from wherever...

# convert your pts_cls into a dict - since there cannot be any images in multiple clusters
# the base image name is perfectly OK to use as a key for blazingly fast lookups later
cluster_map = dict(pts_cls)

# get only `.jpg` files; store base name and file name, no need for a full path at this time
files = [(fn[:-4], fn) for fn in os.listdir(src_dir) if fn.lower()[-4:] == ".jpg"]
# no need for sorting based on your code

for name, file_name in files:  # loop through all files
    if name in cluster_map:  # proceed with the file only if in pts_cls
        cls = cluster_map[name]  # get our cluster value
        # get our `cluster_<cluster_id>` or `cluster_Noise` (if cluster == -1) target path
        target_dir = os.path.join(src_dir, "cluster_" + str(cls if cls != -1 else "Noise"))
        target_file = os.path.join(target_dir, file_name)  # get the final target path
        if not os.path.exists(target_file):  # if the target file doesn't exist
            if not os.path.isdir(target_dir):  # make sure our target path exists
                os.makedirs(target_dir, exist_ok=True)  # create the full path if it doesn't
            shutil.copy(os.path.join(src_dir, file_name), target_file)  # copy
UPDATE - If you have multiple 'special' folders for certain cluster IDs (like Noise is for -1), you can create a map like cluster_targets = {-1: "Noise"} where the keys are your cluster IDs and their values are, obviously, the special names. Then you can replace the target_dir generation with: target_dir = os.path.join(src_dir, "cluster_" + str(cluster_targets.get(cls, cls)))
UPDATE #2 - Since your image_id values appear to be integers while filenames are strings, I'd suggest building your cluster_map dict by converting the image_id parts to strings. That way you'd be comparing like with like, without the danger of a type mismatch:
cluster_map = {str(k): v for k, v in pts_cls}
If you're sure that none of the *.jpg files in your src_dir will have a non-integer name, you can instead convert the filename to an integer to begin with in the files list generation - just replace fn[:-4] with int(fn[:-4]). But I wouldn't advise that, as, again, you never know how your files might be named.
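A toy run of the dict-based lookup (made-up IDs, combining both updates above) shows the mapping in action:

```python
pts_cls = [(12, 0), (7, -1), (33, 2)]          # (image_id, cluster_id) pairs
cluster_map = {str(k): v for k, v in pts_cls}  # string keys to match file basenames
cluster_targets = {-1: "Noise"}                # special folder names per cluster id

def target_folder(basename):
    """Resolve the destination folder name for an image's base name."""
    cls = cluster_map[basename]
    return "cluster_" + str(cluster_targets.get(cls, cls))

print(target_folder("12"))  # cluster_0
print(target_folder("7"))   # cluster_Noise
```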

How can I sort given a specific order that I provide

I am trying to sort files in a directory given their extension, but provided an order that I give first. Let's say I want the extension order to be
ext_list = ['.bb', '.cc', '.dd', '.aa']
The only way that I can think of would be to go through every single file
and place them in a list every time a specific extension is encountered.
for subdir, dirs, files in os.walk(directory):
    if file.endswith('.bb') --> append file
    then go to the end of the directory
    then loop again
    if file.endswith('.cc') --> append file
    and so on...
return sorted_extension_list
and then finally
for file in sorted_extension_list:
    print file
Here is another way of doing it:
files = []
for _, _, fs in os.walk('directory'):
    files.extend(fs)  # extend, not append: fs is a list of names
sorted(files, key=lambda x: ext_list.index(os.path.splitext(x)[1]))
['buzz.bb', 'foo.cc', 'fizz.aa']
Edit: My output doesn't have dd since I didn't make a file for it in my local test dir. It will work regardless.
You can use sorted() with a custom key:
import os

my_custom_keymap = {".aa": 2, ".bb": 3, ".cc": 0, ".dd": 1}

def mycompare(f):
    return my_custom_keymap[os.path.splitext(f)[1]]

files = ["alfred.bb", "butters.dd", "charlie.cc", "derkins.aa"]
print(sorted(files, key=mycompare))
The above uses the mycompare function as a custom sort key. In this case, it extracts the extension and then looks up the extension in the my_custom_keymap dictionary.
A very similar way (but closer to your example) could use a list as the map:
import os

my_custom_keymap = [".cc", ".dd", ".aa", ".bb"]

def mycompare(f):
    return my_custom_keymap.index(os.path.splitext(f)[1])

files = ["alfred.bb", "butters.dd", "charlie.cc", "derkins.aa"]
print(sorted(files, key=mycompare))
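One caveat with both variants (my addition, not part of the answer above): an extension missing from the map raises KeyError or ValueError. Using dict.get with a large default instead sends unknown extensions to the end of the ordering:

```python
import os

my_custom_keymap = {".aa": 2, ".bb": 3, ".cc": 0, ".dd": 1}

def mycompare(f):
    # unknown extensions sort last instead of raising
    return my_custom_keymap.get(os.path.splitext(f)[1], len(my_custom_keymap))

files = ["alfred.bb", "butters.dd", "charlie.cc", "mystery.zz", "derkins.aa"]
print(sorted(files, key=mycompare))
# ['charlie.cc', 'butters.dd', 'derkins.aa', 'alfred.bb', 'mystery.zz']
```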
import os

# List you should get once:
file_list_name = ['mickey.aa', 'mickey.dd', 'mickey_B.cc', 'mickey.bb']
ext_list = ['.bb', '.cc', '.dd', '.aa']
order_list = []
for ext in ext_list:
    for file in file_list_name:
        extension = os.path.splitext(file)[1]
        if extension == ext:
            order_list.append(file)
order_list is what you are looking for. Otherwise you can use sorted() with a key argument. Just look for it on SO!
Using sorted with a custom key is probably best, but here's another method where you store the filenames in lists based on their extension. Then put those lists together based on your custom order.
import collections
import os
import pprint

def get_sorted_list_of_files(dirname, extensions):
    extension_map = collections.defaultdict(list)
    for _, _, files in os.walk(dirname):
        for filename in files:
            extension = os.path.splitext(filename)[1]
            extension_map[extension].append(filename)
    pprint.pprint(extension_map)
    sorted_list = []
    for extension in extensions:
        sorted_list.extend(extension_map[extension])
    pprint.pprint(extensions)
    pprint.pprint(sorted_list)
    return sorted_list
