Iterating over PE files - python

Below is part of my code in which I am trying to iterate over PE files. I keep getting the same error:
[Errno 2] No such file or directory: '//FlickLearningWizard.exe'
I tried using os.path.join(filePath), but it does not change anything since I have already built the path. I also got rid of the '/', but that did not help much. Here is my code:
B = 65521
T = {}
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
    samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = directories+"/"+file
        fileByteSequence = readFile(filePath)
        fileNgrams = byteSequenceToNgrams(filePath,N)
        hashFileNgramsIntoDictionary(fileNgrams,T)
K1 = 1000
import heapq
K1_most_common_Ngrams_Using_Hash_Grams = heapq.nlargest(K1, T)
And here is my complete error message:
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-63-eb8b9254ac6d> in <module>
      6 for file in samples:
      7     filePath = directories+"/"+file
----> 8     fileByteSequence = readFile(filePath)
      9     fileNgrams = byteSequenceToNgrams(filePath,N)
     10     hashFileNgramsIntoDictionary(fileNgrams,T)

<ipython-input-3-4bdd47640108> in readFile(filePath)
      1 def readFile(filePath):
----> 2     with open(filePath, "rb") as binary_file:
      3         data = binary_file.read()
      4         return data
      5 def byteSequenceToNgrams(byteSequence, n):
A sample of the files I am trying to iterate through, which are in the datasetPath:
['FlickLearningWizard.exe', 'autochk.exe', 'cmd.exe', 'BitLockerWizard.exe', 'iexplore.exe', 'AxInstUI.exe', 'fvenotify.exe', 'DismHost.exe', 'GameBarPresenceWriter.exe', 'consent.exe', 'fax_390392029_072514.exe', 'Win32.AgentTesla.exe', '{71257279-042b-371d-a1d3-fbf8d2fadffa}.exe', 'imecfmui.exe', 'HxCalendarAppImm.exe', 'CExecSvc.exe', 'bootim.exe', 'dumped.exe', 'FXSSVC.exe', 'drvinst.exe', 'DW20.exe', 'appidtel.exe', 'baaupdate.exe', 'AuthHost.exe', 'last.exe', 'BitLockerToGo.exe', 'EhStorAuthn.exe', 'IMTCLNWZ.EXE', 'drvcfg.exe', 'makecab.exe', 'licensingdiag.exe', 'ldp.exe', 'win33.exe', 'forfiles.exe', 'DWWIN.EXE', 'comp.exe', 'coredpussvr.exe', 'AddSuggestedFoldersToLibraryDialog.exe', 'InetMgr6.exe', '3_4.exe', 'CIDiag.exe', 'win32.exe', 'LanguageComponentsInstallerComHandler.exe', 'sample.exe', 'Win32.SofacyCarberp.exe', 'EASPolicyManagerBrokerHost.exe', '131.exe', 'AddInUtil.exe', 'fixmapi.exe', 'cmdl32.exe', 'chkntfs.exe', 'instnm.exe', 'ImagingDevices.exe', 'BitLockerWizardElev.exe', 'bdechangepin.exe', 'logman.exe', '.DS_Store', 'bootcfg.exe', 'DsmUserTask.exe', 'find.exe', 'LogCollector.exe', 'HxTsr.exe', 'lpq.exe', 'ctfmon.exe', 'AppInstaller.exe', 'hvsimgr.exe', 'Vcffipzmnipbxzdl.exe', 'lpremove.exe', 'hdwwiz.exe', 'CastSrv.exe', 'gpresult.exe', 'hvix64.exe', 'HvsiSettingsWorker.exe', 'fodhelper.exe', '21.exe', 'InspectVhdDialog6.2.exe', '798_abroad.exe', 'doskey.exe', 'AuditShD.exe', 'alg.exe', 'certutil.exe', 'bitsadmin.exe', 'help.exe', 'fsquirt.exe', 'PDFXCview.exe', 'inetinfo.exe', 'Win32.Wannacry.exe', 'dcdiag.exe', 'LsaIso.exe', 'lpr.exe', 'dtdump.exe', 'FileHistory.exe', 'LockApp.exe', 'AppVShNotify.exe', 'DeviceProperties.exe', 'ilasm.exe', 'CheckNetIsolation.exe', 'FilePicker.exe', 'choice.exe', 'ComSvcConfig.exe', 'Calculator.exe', 'CredDialogHost.exe', 'logagent.exe', 'InspectVhdDialog6.3.exe', 'junction.exe', 'findstr.exe', 'ktmutil.exe', 'csvde.exe', 'esentutl.exe', 'Win32.GravityRAT.exe', 'bootsect.exe', 'BdeUISrv.exe', 'ChtIME.exe', 'ARP.EXE', 'dsdbutil.exe', 'iisreset.exe', '1003.exe', 'getmac.exe', 'dllhost.exe', 'BOTBINARY.EXE', 'cscript.exe', 'dnscacheugc.exe', 'aspnet_regbrowsers.exe', 'hvax64.exe', 'CredentialUIBroker.exe', 'dpnsvr.exe', 'ApplyTrustOffline.exe', 'LxRun.exe', 'credwiz.exe', '1002.exe', 'FileExplorer.exe', 'BackgroundTransferHost.exe', 'convert.exe', 'AppVClient.exe', 'evntcmd.exe', 'attrib.exe', 'ClipUp.exe', 'DmNotificationBroker.exe', 'dcomcnfg.exe', 'dvdplay.exe', 'Dism.exe', 'AtBroker.exe', 'invoice_2318362983713_823931342io.pdf.exe', 'DataSvcUtil.exe', 'bdeunlock.exe', 'DeviceCensus.exe', 'dstokenclean.exe', 'AndroRat Binder_Patched.exe', 'iediagcmd.exe', 'comrepl.exe', 'dispdiag.exe', 'FlashUtil_ActiveX.exe', 'cliconfg.exe', 'aitstatic.exe', 'gpupdate.exe', 'GetHelp.exe', 'charmap.exe', 'aspnet_regsql.exe', 'IMEWDBLD.EXE', 'AppVStreamingUX.exe', 'dwm.exe', 'Ransomware.Unnamed_0.exe', 'csc.exe', 'bridgeunattend.exe', 'icacls.exe', 'dialer.exe', 'BdeHdCfg.exe', 'fontdrvhost.exe', '027cc450ef5f8c5f653329641ec1fed9.exe', 'LocationNotificationWindows.exe', 'dpapimig.exe', 'BitLockerDeviceEncryption.exe', 'ftp.exe', 'Eap3Host.exe', 'dfsvc.exe', 'LogonUI.exe', 'Fake Intel (1).exe', 'chglogon.exe', 'fhmanagew.exe', 'changepk.exe', 'aspnetca.exe', 'IMEPADSV.EXE', 'browserexport.exe', 'bcdboot.exe', 'aspnet_wp.exe', 'FXSCOVER.exe', 'dllhst3g.exe', 'CertEnrollCtrl.exe', 'EduPrintProv.exe', 'ielowutil.exe', 'ADSchemaAnalyzer.exe', 'cygrunsrv.exe', 'HxAccounts.exe', 'diskperf.exe', 'certreq.exe', 'bcdedit.exe', 'efsui.exe', 'klist.exe', 
'raffle.exe', 'cacls.exe', 'hvc.exe', 'cmmon32.exe', 'BioIso.exe', 'AssignedAccessLockApp.exe', 'DmOmaCpMo.exe', 'AppLaunch.exe', 'AddInProcess.exe', 'dasHost.exe', 'dmcertinst.exe', 'IMJPSET.EXE', 'cmbins.exe', 'LicenseManagerShellext.exe', 'diskpart.exe', 'iscsicpl.exe', 'chown.exe', 'Magnify.exe', 'aapt.exe', 'false.exe', 'BioEnrollmentHost.exe', 'hvsirdpclient.exe', 'c2wtshost.exe', 'dplaysvr.exe', 'ChsIME.exe', 'fsavailux.exe', 'Win32.WannaPeace.exe', 'CasPol.exe', 'icsunattend.exe', 'fveprompt.exe', 'expand.exe', 'chgusr.exe', 'hvsirpcd.exe', 'MiniConfigBuilder.exe', 'FirstLogonAnim.exe', 'EDPCleanup.exe', 'ksetup.exe', 'AppVDllSurrogate.exe', 'InstallUtil.exe', 'immersivetpmvscmgrsvr.exe', 'cmdkey.exe', 'appcmd.exe', 'Build.exe', 'hostr.exe', 'CloudStorageWizard.exe', 'DWTRIG20.EXE', 'file_4571518150a8181b403df4ae7ad54ce8b16ded0c.exe', 'FsIso.exe', 'chmod.exe', 'imjpuexc.exe', 'CHXSmartScreen.exe', 'iissetup.exe', '7ZipSetup.exe', 'svchost.exe', 'ldifde.exe', 'logoff.exe', 'DiskSnapshot.exe', 'fontview.exe', 'LaunchWinApp.exe', 'GamePanel.exe', 'yfoye_dump.exe', 'ls.exe', 'HOSTNAME.EXE', 'at.exe', 'InetMgr.exe', 'FaceFodUninstaller.exe', 'InputPersonalization.exe', 'AppVNice.exe', 'ImeBroker.exe', 'CameraSettingsUIHost.exe', 'Defrag.exe', 'lpksetup.exe', 'djoin.exe', 'irftp.exe', 'DTUHandler.exe', 'LockScreenContentServer.exe', 'dsamain.exe', 'lpkinstall.exe', 'DataStoreCacheDumpTool.exe', 'dmclient.exe', 'dump1.exe', 'Cain.exe', 'AddInProcess32.exe', 'appidcertstorecheck.exe', 'IMJPUEX.EXE', 'HxOutlook.exe', 'FlashPlayerApp.exe', 'diskraid.exe', 'bthudtask.exe', 'explorer.exe', 'CompMgmtLauncher.exe', 'malware.exe', 'njRAT.exe', 'CompatTelRunner.exe', 'evntwin.exe', 'Dxpserver.exe', 'HelpPane.exe', 'cvtres.exe', 'dxdiag.exe', 'hvsievaluator.exe', 'signed.exe', 'csrss.exe', 'InstallBC201401.exe', 'audiodg.exe', 'dsregcmd.exe', 'ApproveChildRequest.exe', 'iisrstas.exe', 'chkdsk.exe', 'lodctr.exe', 'aspnet_state.exe', 'DiagnosticsHub.StandardCollector.Service.exe', 'chgport.exe', 'cleanmgr.exe', 'GameBar.exe', 'AgentService.exe', 'InfDefaultInstall.exe', 'IMESEARCH.EXE', 'Fondue.exe', 'iexpress.exe', 'backgroundTaskHost.exe', 'dfrgui.exe', 'cofire.exe', 'BrowserCore.exe', 'clip.exe', 'appidpolicyconverter.exe', 'ed01ebfbc9eb5bbea545af4d01bf5f1071661840480439c6e5babe8e080e41aa.exe', 'cipher.exe', 'DeviceEject.exe', 'cerber.exe', '5a765351046fea1490d20f25.exe', 'CloudExperienceHostBroker.exe', 'FXSUNATD.exe', 'GenValObj.exe', 'lsass.exe', 'ddodiag.exe', 'cmstp.exe', 'wirelesskeyview.exe', 'edpnotify.exe', 'CameraBarcodeScannerPreview.exe', 'bfsvc.exe', 'eventcreate.exe', 'driverquery.exe', 'CCG.exe', 'ConfigSecurityPolicy.exe', 'ieUnatt.exe', 'eshell.exe', 'ipconfig.exe', 'jsc.exe', 'gpscript.exe', 'LaunchTM.exe', 'cttunesvr.exe', 'curl.exe', 'cttune.exe', 'DevicePairingWizard.exe', 'ByteCodeGenerator.exe', 'IEChooser.exe', 'LockAppHost.exe', 'DataExchangeHost.exe', 'dxgiadaptercache.exe', 'dsacls.exe', 'Locator.exe', 'DpiScaling.exe', 'DisplaySwitch.exe', 'autoconv.exe', 'IMJPDCT.EXE', 'ieinstal.exe', 'colorcpl.exe', 'auditpol.exe', 'dccw.exe', 'DeviceEnroller.exe', 'UpdateCheck.exe', 'LicensingUI.exe', 'ExtExport.exe', 'easinvoker.exe', 'ApplySettingsTemplateCatalog.exe', 'eventvwr.exe', 'browser_broker.exe', 'extrac32.exe', 'EaseOfAccessDialog.exe', 'label.exe', 'change.exe', 'IMCCPHR.exe', 'audit.exe', 'aspnet_compiler.exe', 'aspnet_regiis.exe', 'desktopimgdownldr.exe', 'dmcfghost.exe', 'ComputerDefaults.exe', 'control.exe', 'DeviceCredentialDeployment.exe', 'compact.exe', 
'InspectVhdDialog.exe', 'EdmGen.exe', 'cmak.exe', 'AppHostRegistrationVerifier.exe', 'DataUsageLiveTileTask.exe', 'hcsdiag.exe', 'gchrome.exe', 'adamuninstall.exe', 'CloudNotifications.exe', 'dusmtask.exe', 'fc.exe', 'hh.exe', 'eudcedit.exe', 'iscsicli.exe', 'DFDWiz.exe', 'isoburn.exe', 'IMTCPROP.exe', 'CapturePicker.exe', 'abba_-_happy_new_year_zaycev_net.exe', 'finger.exe', 'ApplicationFrameHost.exe', 'calc.exe', 'counter.exe', 'editrights.exe', 'fltMC.exe', 'convertvhd.exe', 'LegacyNetUXHost.exe', 'grpconv.exe', 'ie4uinit.exe', 'dsmgmt.exe', 'fsutil.exe', 'AppResolverUX.exe', 'BootExpCfg.exe', 'conhost.exe', 'bash.exe', 'IcsEntitlementHost.exe']
Can anyone help please?

(Edited in reaction to question updates; probably scroll down to the end.)
This probably contains more than one bug:
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
    samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
    for file in samples:
        filePath = directories+"/"+file
        fileByteSequence = readFile(filePath)
Without knowledge of the precise data types here, it's hard to know exactly how to fix this. But certainly, if datasetPath is a list of paths, os.path.join(datasetPath, f) will not produce what you hope and expect.
Assuming datasetPath contains something like ['A:\\', r'c:\windows\kill me now'], a more or less logical rewrite could look something like
for dir in datasetPath:
    samples = []
    for f in os.listdir(dir):
        p = os.path.join(dir, f)
        if isfile(p):
            samples.append(p)
    for filePath in samples:
        fileByteSequence = readFile(filePath)
Notice how we produce the full path just once, and then keep that. Notice how we use the loop variable dir inside the loop, not the list of paths we are looping over.
Actually I'm guessing datasetPath is actually a string, but then the for loop makes no sense (you end up looping over the characters in the string one by one).
If you merely want to check which of these files exist in the current directory, you are massively overcomplicating things.
for filePath in os.listdir('.'):
    if filePath in datasetPath:
        fileByteSequence = readFile(filePath)
Whether you loop over the files on the disk and check which ones are on your list, or vice versa, is not a crucial design detail; I have preferred the former on the theory that you want to minimize the number of disk accesses (in the ideal case you get all file names from the disk with a single system call).
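If datasetPath is a long list, converting it to a set first makes each membership test constant-time; the same loop, as a small sketch:
wanted = set(datasetPath)  # membership test in a set is O(1), unlike a list
for filePath in os.listdir('.'):
    if filePath in wanted:
        fileByteSequence = readFile(filePath)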

I got it working finally.
For some reason it did not run properly on the Macintosh machine, so I used the same code on Linux and Windows and it ran successfully.
The problem was with this line specifically:
samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
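For what it's worth, the sample listing above includes '.DS_Store', a hidden metadata file that macOS Finder drops into every folder it touches; that entry appears in the listing on the Mac but not on Linux or Windows. A defensive version of that line that skips hidden entries, as a minimal sketch assuming datasetPath is a single directory path string:
import os
from os.path import isfile, join

# Skip hidden entries such as .DS_Store (created by macOS Finder),
# keeping only regular files.
samples = [f for f in os.listdir(datasetPath)
           if isfile(join(datasetPath, f)) and not f.startswith(".")]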

Related

File retention mechanism in a large data storage

Recently I faced a performance problem with mp4 file retention. I have a kind of recorder which saves 1-minute-long mp4 files from multiple RTSP streams. Those files are stored on an external drive in a file tree like this:
./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4
Apart from video files, there are many other files on this drive which are not considered (unless they have the mp4 extension), as they take much less space.
The file retention works as follows. Every minute, the Python script that is responsible for recording checks the external drive's fill level. If the level is above 80%, it scans the whole drive looking for .mp4 files. When scanning is done, it sorts the list of files by creation date and deletes a number of the oldest files equal to the number of cameras.
The part of the code, which is responsible for files retention, is shown below.
total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        oldest_files = sorted(
            (
                os.path.join(dirname, filename)
                for dirname, dirnames, filenames in os.walk('/home')
                for filename in filenames
                if filename.endswith(".mp4")
            ),
            key=lambda fn: os.stat(fn).st_mtime,
        )[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file in oldest_files:
            os.remove(oldest_file)
            logging.info("%s removed", oldest_file)
    except ValueError as e:
        # no files to delete
        pass
(/home is external drive mount point)
The problem is that this mechanism used to work like a charm when I used a 256 or 512 GB SSD. Now I need larger space (more cameras and longer storage time), and it takes a lot of time to create the file list on a larger SSD (from 2 to 5 TB now and maybe 8 TB in the future). The scanning process takes a lot more than 1 min, which could be resolved by performing it more rarely and extending the "to delete" file list. The real problem is that the process itself uses a lot of CPU load (through I/O ops). The performance drop is visible in the whole system. Other applications, like some simple computer vision algorithms, work slower, and the CPU load can even cause a kernel panic.
The HW I work on is Nvidia Jetson Nano and Xavier NX. Both devices have problems with performance as I described above.
The question is whether you know some algorithms or out-of-the-box software for file retention that would work for the case I described. Or maybe there is a way to rewrite my code to make it more reliable and performant?
EDIT:
I was able to lower the os.walk() impact by limiting the space to check. Now I just scan /home/recordings and /home/recognition/, which also shrinks the directory tree (for the recursive scan). At the same time, I've added .jpg file checking, so now I look for both .mp4 and .jpg. The result is much better in this implementation.
However, I need further optimization. I prepared some test cases and tested them on a 1 TB drive which is 80% filled (media files mostly). I attached profiler results per case below.
@time_measure
def method6():
    paths = [
        "/home/recordings",
        "/home/recognition",
        "/home/recognition/marked_frames",
    ]
    files = []
    for path in paths:
        files.extend((
            os.path.join(dirname, filename)
            for dirname, dirnames, filenames in os.walk(path)
            for filename in filenames
            if (filename.endswith(".mp4") or filename.endswith(".jpg")) and not os.path.islink(os.path.join(dirname, filename))
        ))
    oldest_files = sorted(
        files,
        key=lambda fn: os.stat(fn).st_mtime,
    )
    print(oldest_files[:5])
@time_measure
def method7():
    ext = [".mp4", ".jpg"]
    paths = [
        "/home/recordings/*/*/*",
        "/home/recognition/*",
        "/home/recognition/marked_frames/*",
    ]
    files = []
    for path in paths:
        files.extend((file for file in glob(path) if not os.path.islink(file) and (file.endswith(".mp4") or file.endswith(".jpg"))))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    print(oldest_files[:5])
The original implementation on the same data set took ~100 s.
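os.walk() already uses scandir() internally, but calling os.scandir() directly lets the DirEntry objects carry their stat information through to the sort step, avoiding a second path-based os.stat() per file. A sketch along the question's paths and extensions, offered as another candidate to benchmark:
import os

def iter_media(path, exts=(".mp4", ".jpg")):
    # Recursively yield (mtime, path); DirEntry.stat() can often reuse
    # information fetched while listing the directory, so the sort key
    # needs no separate path-based os.stat() call.
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_dir(follow_symlinks=False):
                yield from iter_media(entry.path, exts)
            elif entry.name.endswith(exts) and not entry.is_symlink():
                yield entry.stat().st_mtime, entry.path

# marked_frames is inside /home/recognition, so recursion already covers it
paths = ["/home/recordings", "/home/recognition"]
oldest = sorted(t for p in paths for t in iter_media(p))[:5]
print([path for _mtime, path in oldest])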
EDIT2
Comparison of @norok2's proposals
I compared them with method6 and method7 from above. I tried several times with similar results.
Testing method7
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 24.73726773262024 s
_________________________
Testing find_oldest
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 34.355509757995605 s
_________________________
Testing find_oldest_cython
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 25.81963086128235 s
In summary:
method7 (glob()): ~24.7 s
find_oldest (iglob()): ~34.4 s
find_oldest_cython (Cython): ~25.8 s
You could get an extra few percent speed-up on top of your method7() with the following:
import os
import glob

def find_oldest(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = [
        filename
        for path in paths
        for filename in glob.iglob(path)
        if any(filename.endswith(ext) for ext in exts) and not os.path.islink(filename)]
    mtime_idxs = sorted(
        (os.stat(fn).st_mtime, i)
        for i, fn in enumerate(result))
    return [result[mtime_idxs[i][1]] for i in range(k)]
The main improvements are:
use iglob() instead of glob() -- while it may be of comparable speed, it takes significantly less memory, which may help on low-end machines
str.endswith() is done before the allegedly more expensive os.path.islink(), which helps reduce the number of such calls due to short-circuiting
an intermediate list with all the mtimes is produced to minimize the os.stat() calls
This can be sped up even further with Cython:
%%cython --cplus -c-O3 -c-march=native -a
import os
import glob

cpdef find_oldest_cy(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    result = []
    for path in paths:
        for filename in glob.iglob(path):
            good_ext = False
            for ext in exts:
                if filename.endswith(ext):
                    good_ext = True
                    break
            if good_ext and not os.path.islink(filename):
                result.append(filename)
    mtime_idxs = []
    for i, fn in enumerate(result):
        mtime_idxs.append((os.stat(fn).st_mtime, i))
    mtime_idxs.sort()
    return [result[mtime_idxs[i][1]] for i in range(k)]
My tests on the following files:
def gen_files(n, exts=("mp4", "jpg", "txt"), filename="somefile", content="content"):
    for i in range(n):
        ext = exts[i % len(exts)]
        with open(f"{filename}{i}.{ext}", "w") as f:
            f.write(content)

gen_files(10_000)
produces the following:
funcs = find_oldest_OP, find_oldest, find_oldest_cy
timings = []
base = funcs[0]()
for func in funcs:
    res = func()
    is_good = base == res
    timed = %timeit -r 8 -n 4 -q -o func()
    timing = timed.best * 1e3
    timings.append(timing if is_good else None)
    print(f"{func.__name__:>24} {is_good} {timing:10.3f} ms")
#           find_oldest_OP True     81.074 ms
#              find_oldest True     70.994 ms
#           find_oldest_cy True     64.335 ms
find_oldest_OP is the following, based on method7() from OP:
def find_oldest_OP(paths=("*",), exts=(".mp4", ".jpg"), k=5):
    files = []
    for path in paths:
        files.extend(
            (file for file in glob.glob(path)
             if not os.path.islink(file) and any(file.endswith(ext) for ext in exts)))
    oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
    return oldest_files[:k]
The Cython version seems to point to a ~25% reduction in execution time.
You could use the subprocess module to list all the mp4 files directly, without having to loop through all the files in the directory.
import os
import subprocess as sb

# `dir /b /s` lists files recursively on Windows (bare format, full paths)
file_list = sb.getoutput(r"dir /b /s .\home\*.mp4").split("\n")
oldest_files = sorted(file_list, key=lambda fn: os.stat(fn).st_mtime)[:len(camera_devices)]
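Note that dir is a Windows shell command; on the Linux-based Jetson boards from the question, a sketch with GNU find can even emit the mtimes itself, so Python never has to stat the files (assumes GNU findutils):
import subprocess as sb

# GNU find prints "epoch-mtime path" per file, so Python needs no os.stat();
# lexicographic sort works while epoch seconds keep the same digit count.
out = sb.run(
    ["find", "/home/recordings", "-type", "f", "-name", "*.mp4",
     "-printf", "%T@ %p\n"],
    capture_output=True, text=True, check=True,
).stdout
entries = sorted(line.split(" ", 1) for line in out.splitlines())
oldest_files = [path for _mtime, path in entries[:len(camera_devices)]]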
A quick optimization would be not to bother checking file creation time and to trust the filename instead.
import os
import re
import shutil
import logging
from datetime import datetime

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        files = []
        for dirname, dirnames, filenames in os.walk('/home/recordings'):
            for filename in filenames:
                # parse the timestamp out of the .../{YYYY-MM-DD}/{HH-MM}.mp4 path
                files.append((
                    name := os.path.join(dirname, filename),
                    datetime.strptime(
                        re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0],
                        "%Y-%m-%d/%H-%M"
                    )))
        files.sort(key=lambda e: e[1])  # list.sort() sorts in place and returns None
        oldest_files = files[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file, _timestamp in oldest_files:
            os.remove(oldest_file)
        logging.info("Removed")
    except (ValueError, TypeError):
        # no files to delete / path without a timestamp
        pass
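Since only len(camera_devices) entries are needed, heapq.nsmallest can pick them in one pass instead of sorting the whole list; a small variation on the snippet above:
import heapq

# k smallest by parsed timestamp, without a full sort
oldest_files = heapq.nsmallest(len(camera_devices), files, key=lambda e: e[1])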

How to define a function that would move files to given directories?

I need to define a function that will copy files from one directory to TRAINING and TESTING directories, given the split size (so for example if split size = 0.9, 90% of the files go to TRAINING). The conditions are that a file can't be of size 0 and that the files should be randomized. This is what I've managed to think of, but it doesn't work.
def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE):
    file = os.listdir(SOURCE)
    file = os.path.getsize(SOURCE) > 0
    if file:
        copyfile(random.sample(SOURCE, SPLIT_SIZE), TRAINING),
        copyfile(random.sample(SOURCE, 1-SPLIT_SIZE), TESTING)
When I try to run this function an error comes up:
TypeError: can't multiply sequence by non-int of type 'float'
Can you please tell me how to define a function that will take the following arguments (this is what it must look like since it's part of an assignment in a course)?:
def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE)
Thank you, Joanna
PS: I'm reposting my previous question changing its content as suggested by one of the users, so that it's more clear.
The TypeError comes from random.sample: its second argument must be an integer count (in CPython it builds [None] * k, which is exactly what fails with a float). Here's one such method:
def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE):
    # join with SOURCE, otherwise getsize looks in the current directory
    source_files = [f for f in os.listdir(SOURCE)
                    if os.path.getsize(os.path.join(SOURCE, f)) > 0]
    random.shuffle(source_files)
    total = len(source_files)
    to_training = source_files[0:int(total * SPLIT_SIZE)]
    to_test = source_files[int(total * SPLIT_SIZE):]
    for f in to_training:
        # copyfile needs a full destination file path, not just a directory
        copyfile(os.path.join(SOURCE, f), os.path.join(TRAINING, f))
    for f in to_test:
        copyfile(os.path.join(SOURCE, f), os.path.join(TESTING, f))
    assert len(source_files) == len(to_training) + len(to_test)
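A quick usage sketch (the directory names are hypothetical; the imports are what the function needs):
import os
import random
from shutil import copyfile

# hypothetical paths for illustration
split_data("cats-v-dogs/source", "cats-v-dogs/training", "cats-v-dogs/testing", 0.9)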

How to modify iteration list?

The following is a scenario of traversing a dir structure.
"Build complete dir tree with files but if files in single dir are similar in name list only single entity"
Example tree (let's assume they are not sorted):
- rootDir
  - dirA
      fileA_01
      fileA_03
      fileA_05
      fileA_06
      fileA_04
      fileA_02
      fileA_...
      fileAB
      fileAC
  - dirB
      fileBA
      fileBB
      fileBC
Expected output:
- rootDir
  - dirA
      fileA_01 - fileA_06 ...
      fileAB
      fileAC
  - dirB
      fileBA
      fileBB
      fileBC
So I already have a simple def findSimilarNames that for fileA_01 (or any fileA_*) will return the list [fileA_01...fileA_06].
Now I'm inside os.walk and I'm looping over files, so every file is checked against similar filenames; e.g. for fileA_03 I've got the rest of them [fileA_01 - fileA_06], and now I want to modify the list that I iterate over to just skip the items from findSimilarNames, without needing another loop or ifs inside.
I searched here and people suggest avoiding modification of the list you iterate over, but by doing so I could skip iterating over every file.
Pseudo code:
for root, dirs, files in os.walk(path):
    for file in files:
        similarList = findSimilarNames(file)
        # OVERWRITE ITERATION LIST SOMEHOW
        files = set(files) - set(similarList)
        # DEAL WITH ELEMENT
What I'm trying to avoid is the approach below: checking each file even though it may already have been found by findSimilarNames.
for root, dirs, files in os.walk(path):
    filteredbysimilar = files[:]
    for file in files:
        similar = findSimilarNames(file)
        filteredbysimilar = list(set(filteredbysimilar) - set(similar))
    # --
    for filteredFile in filteredbysimilar:
        # DEAL WITH ELEMENT
        # OVERWRITE ITERATION LIST SOMEHOW
You can get this effect by using a while-loop style iteration. Since you want to do set subtraction to remove the similar groups anyway, the natural approach is to start with a set of all the filenames, and repeatedly remove groups until nothing is left. Thus:
unprocessed = set(files)
while unprocessed:
    f = unprocessed.pop()  # removes and returns an arbitrary element
    group = findSimilarNames(f)
    unprocessed -= group  # it is not an error that `f` has already been removed.
    doSomethingWith(group)  # i.e., "DEAL WITH ELEMENT" :)
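A minimal sketch of how this slots into the os.walk loop from the question; the set(...) wrapper is an assumption for the case where findSimilarNames returns a list rather than a set (the -= subtraction needs a set on the right):
import os

for root, dirs, files in os.walk(path):
    unprocessed = set(files)
    while unprocessed:
        f = unprocessed.pop()
        group = set(findSimilarNames(f))  # wrap in set() if a list is returned
        unprocessed -= group
        doSomethingWith(group)            # "DEAL WITH ELEMENT"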
How about building up a list of files that aren't similar?
unsimilar = set()
for f in files:
    if len(findSimilarNames(f).intersection(unsimilar)) == 0:
        unsimilar.add(f)
This assumes findSimilarNames yields a set.

Open two files pairwise out of many - python

Hey guys I'm a rookie in python and need some help.
My problem is that I have a folder full of text files (with lists in them), where two belong to each other and need to be read and compared.
Folder with many files: File1_in.xlo, File1_out.xlo, File2_in.xlo, File2_out.xlo, ...
--> so File1_in.xlo and File1_out.xlo belong together and need to be compared.
I can already append the lists of the 'in-Files' (or 'out-Files') and then compare them, but since there are many files the lists become really long (thousands and thousands of entries), so the idea is to compare the files, or rather the lists, pairwise.
My first try looks like:
import os

for filename in sorted(os.listdir('path')):
    if filename.endswith('in.xlo'):
        with open(os.path.join('path', filename)) as inn:
            lines = inn.readlines()
            for x in lines:
                temperatureIn = x.split()[4]
    if filename.endswith('out.xlo'):
        with open(os.path.join('path', filename)) as outt:
            lines = outt.readlines()
            for x in lines:
                temperatureOut = x.split()[4]  # column index 4 in the list
So the problem is, as you can see, that the temperatureIn values are always overwritten before I can compare them with the temperatureOut values. I think/hope there must be a way to open both files at once to compare the list entries.
I hope you can understand my problem and someone can help me.
Thanks
Use zip to access in-Files and out-Files in pairs
files = sorted(os.listdir('path'))
in_files = [fname for fname in files if fname.endswith('in.xlo')]
out_files = [fname for fname in files if fname.endswith('out.xlo')]

for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, open(os.path.join('path', out_file)) as outt:
        # Do whatever you want
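For instance, filling in the body with the temperature comparison from the question (the column index 4 and the mismatch handling are illustrative assumptions):
import os

for in_file, out_file in zip(in_files, out_files):
    with open(os.path.join('path', in_file)) as inn, open(os.path.join('path', out_file)) as outt:
        for line_in, line_out in zip(inn, outt):
            temperatureIn = line_in.split()[4]   # column index 4, as in the question
            temperatureOut = line_out.split()[4]
            if temperatureIn != temperatureOut:
                print(f"{in_file} / {out_file}: {temperatureIn} != {temperatureOut}")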
Add them to a list created just before your for loop, as:
temps_in = []
for x in lines:
    temperatureIn = x.split()[4]
    temps_in.append(temperatureIn)
Do the same thing for the temperatures out, then compare your two lists.

removing file names from a list python

I have all the filenames of a directory in a list named files, and I want to filter it so only the files with the .php extension remain.
for x in files:
    if x.find(".php") == -1:
        files.remove(x)
But this seems to skip filenames. What can I do about this?
How about a simple list comprehension?
files = [f for f in files if f.endswith('.php')]
Or if you prefer a generator as a result:
files = (f for f in files if f.endswith('.php'))
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php']
>>> [f for f in files if f.endswith('.php')]
['a.php', 'd.php']
Most of the answers provided give list / generator comprehensions, which are probably the way you want to go 90% of the time, especially if you don't want to modify the original list.
However, for those situations where (say for size reasons) you want to modify the original list in place, I generally use the following snippet:
idx = 0
while idx < len(files):
    if files[idx].find(".php") == -1:
        del files[idx]
    else:
        idx += 1
As to why your original code wasn't working: it's changing the list as you iterate over it. The "for x in files" is implicitly creating an iterator, just like if you'd done "for x in iter(files)", and deleting elements from the list confuses the iterator about what position it is at. For such situations, I generally use the above code, or if it happens a lot in a project, factor it out into a function, e.g.:
def filter_in_place(func, target):
    # keep elements for which func is true; delete the rest, in place
    idx = 0
    while idx < len(target):
        if func(target[idx]):
            idx += 1
        else:
            del target[idx]
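Usage with the question's kind of list, keeping only .php names (a small sketch):
files = ['a.php', 'b.txt', 'c.html', 'd.php']
filter_in_place(lambda f: f.endswith('.php'), files)  # mutates files in place
print(files)  # ['a.php', 'd.php']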
Just stumbled across this old question. Many solutions here will do the job, but they ignore the case where a filename could be just ".php". I suspect that the question was about how to filter PHP scripts, and ".php" alone may not be a PHP script. The solution that I propose is as follows:
>>> import os.path
>>> files = ['a.php', 'b.txt', 'c.html', 'd.php', '.php']
>>> [f for f in files if os.path.splitext(f)[1] == ".php"]
['a.php', 'd.php']
