File retention mechanism for large data storage - Python

Recently I faced a performance problem with MP4 file retention. I have a kind of recorder which saves 1-minute-long MP4 files from multiple RTSP streams. Those files are stored on an external drive in a file tree like this:
./recordings/{camera_name}/{YYYY-MM-DD}/{HH-MM}.mp4
Apart from the video files, there are many other files on this drive which are not considered (unless they have the .mp4 extension), as they take up much less space.
The file retention works as follows. Every minute, the Python script that is responsible for recording checks the external drive usage level. If the level is above 80%, it scans the whole drive looking for .mp4 files. When the scan is done, it sorts the list of files by creation date and deletes as many of the oldest files as there are cameras.
The part of the code responsible for file retention is shown below.
total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
logging.info("SSD usage %s. Looking for the oldest files", used_percent)
try:
oldest_files = sorted(
(
os.path.join(dirname, filename)
for dirname, dirnames, filenames in os.walk('/home')
for filename in filenames
if filename.endswith(".mp4")
),
key=lambda fn: os.stat(fn).st_mtime,
)[:len(camera_devices)]
logging.info("Removing %s", oldest_files)
for oldest_file in oldest_files:
os.remove(oldest_file)
logging.info("%s removed", oldest_file)
except ValueError as e:
# no files to delete
pass
(/home is the external drive mount point)
The problem is that this mechanism used to work like a charm when I used a 256 or 512 GB SSD. Now I need more space (more cameras and longer storage time), and it takes a lot of time to build the file list on a larger SSD (2 to 5 TB now, and maybe 8 TB in the future). The scanning process takes a lot more than 1 min, which could be mitigated by running it less often and extending the list of files to delete. The real problem is that the process itself produces a lot of CPU load (through I/O ops). The performance drop is visible across the whole system: other applications, like some simple computer vision algorithms, run slower, and the load can even cause a kernel panic.
The hardware I work on is the Nvidia Jetson Nano and Xavier NX. Both devices have the performance problem I described above.
The question is whether you know of any algorithms or out-of-the-box software for file retention that would work for the case I described. Or maybe there is a way to rewrite my code to make it more reliable and performant?
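Would something like an os.scandir-based walk combined with heapq.nsmallest (so the whole file list never has to be built and sorted) be the right direction? A rough, untested sketch of what I mean:
import os
import heapq

def oldest_media_files(root, exts=(".mp4",), k=1):
    """Walk root with os.scandir and keep only the k oldest matching files."""
    def walk(path):
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    yield from walk(entry.path)
                elif entry.is_file(follow_symlinks=False) and entry.name.endswith(exts):
                    yield entry
    # heapq.nsmallest keeps at most k candidates instead of sorting the whole list
    return [e.path for e in heapq.nsmallest(
        k, walk(root), key=lambda e: e.stat(follow_symlinks=False).st_mtime)]

# e.g. oldest_media_files("/home/recordings", (".mp4", ".jpg"), k=len(camera_devices))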
EDIT:
I was able to lower the os.walk() impact by limiting the space to check. Now I just scan /home/recordings and /home/recognition/, which also makes the directory tree shallower (for the recursive scan). At the same time, I've added .jpg file checking, so now I look for both .mp4 and .jpg. The result is much better with this implementation.
However, I need further optimization. I prepared some test cases and ran them on a 1 TB drive which is 80% full (mostly media files). I attached the profiler results per case below.
@time_measure
def method6():
paths = [
"/home/recordings",
"/home/recognition",
"/home/recognition/marked_frames",
]
files = []
for path in paths:
files.extend((
os.path.join(dirname, filename)
for dirname, dirnames, filenames in os.walk(path)
for filename in filenames
if (filename.endswith(".mp4") or filename.endswith(".jpg")) and not os.path.islink(os.path.join(dirname, filename))
))
oldest_files = sorted(
files,
key=lambda fn: os.stat(fn).st_mtime,
)
print(oldest_files[:5])
@time_measure
def method7():
ext = [".mp4", ".jpg"]
paths = [
"/home/recordings/*/*/*",
"/home/recognition/*",
"/home/recognition/marked_frames/*",
]
files = []
for path in paths:
files.extend((file for file in glob(path) if not os.path.islink(file) and (file.endswith(".mp4") or file.endswith(".jpg"))))
oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
print(oldest_files[:5])
The original implementation on the same data set lasted ~100 s.
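time_measure is a simple timing decorator, roughly along these lines:
import time
from functools import wraps

def time_measure(func):
    """Print how long the wrapped function took."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"Took {time.time() - start} s")
        return result
    return wrapper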
EDIT2
Comparison of @norok2's proposals
I compared them with method6 and method7 from above. I tried several times with similar results.
Testing method7
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 24.73726773262024 s
_________________________
Testing find_oldest
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 34.355509757995605 s
_________________________
Testing find_oldest_cython
['/home/recordings/35e68df5-44b1-5010-8d12-74b892c60136/2022-06-24/17-36-18.jpg', '/home/recordings/db33186d-3607-5055-85dd-7e5e3c46faba/2021-11-22/11-27-30.jpg', '/home/recordings/acce21a2-763d-56fe-980d-a85af1744b7a/2021-11-22/11-27-30.jpg', '/home/recordings/b97eb889-e050-5c82-8034-f52ae2d99c37/2021-11-22/11-28-23.jpg', '/home/recordings/01ae845c-b743-5b64-86f6-7f1db79b73ae/2021-11-22/11-28-23.jpg']
Took 25.81963086128235 s
Summary: method7 (glob()) ~24.7 s, iglob() ~34.4 s, Cython ~25.8 s.

You could get an extra few percent speed-up on top of your method7() with the following:
import os
import glob
def find_oldest(paths=("*",), exts=(".mp4", ".jpg"), k=5):
result = [
filename
for path in paths
for filename in glob.iglob(path)
if any(filename.endswith(ext) for ext in exts) and not os.path.islink(filename)]
mtime_idxs = sorted(
(os.stat(fn).st_mtime, i)
for i, fn in enumerate(result))
return [result[mtime_idxs[i][1]] for i in range(k)]
The main improvements are:
use iglob instead of glob -- while it may be of comparable speed, it takes significantly less memory, which may help on low-end machines
str.endswith() is done before the allegedly more expensive os.path.islink(), which helps reduce the number of such calls due to short-circuiting
an intermediate list with all the mtimes is produced to minimize the os.stat() calls
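For example, with glob patterns like the ones from your method7():
oldest = find_oldest(
    paths=(
        "/home/recordings/*/*/*",
        "/home/recognition/*",
        "/home/recognition/marked_frames/*",
    ),
    exts=(".mp4", ".jpg"),
    k=5,
)
print(oldest)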
This can be sped up even further with Cython:
%%cython --cplus -c-O3 -c-march=native -a
import os
import glob
cpdef find_oldest_cy(paths=("*",), exts=(".mp4", ".jpg"), k=5):
result = []
for path in paths:
for filename in glob.iglob(path):
good_ext = False
for ext in exts:
if filename.endswith(ext):
good_ext = True
break
if good_ext and not os.path.islink(filename):
result.append(filename)
mtime_idxs = []
for i, fn in enumerate(result):
mtime_idxs.append((os.stat(fn).st_mtime, i))
mtime_idxs.sort()
return [result[mtime_idxs[i][1]] for i in range(k)]
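The %%cython cell magic assumes an IPython/Jupyter session; outside a notebook, the same function can be built from a .pyx file with a minimal setup.py along these lines:
# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules=cythonize(
        "find_oldest_cy.pyx",  # the cpdef function above, saved as a .pyx module
        compiler_directives={"language_level": "3"},
        # the -O3 / -march=native flags from the cell magic would go into an
        # Extension's extra_compile_args if you want them here too
    ),
)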
My tests on the following files:
def gen_files(n, exts=("mp4", "jpg", "txt"), filename="somefile", content="content"):
for i in range(n):
ext = exts[i % len(exts)]
with open(f"{filename}{i}.{ext}", "w") as f:
f.write(content)
gen_files(10_000)
produces the following:
funcs = find_oldest_OP, find_oldest, find_oldest_cy
timings = []
base = funcs[0]()
for func in funcs:
res = func()
is_good = base == res
timed = %timeit -r 8 -n 4 -q -o func()
timing = timed.best * 1e3
timings.append(timing if is_good else None)
print(f"{func.__name__:>24} {is_good} {timing:10.3f} ms")
# find_oldest_OP True 81.074 ms
# find_oldest True 70.994 ms
# find_oldest_cy True 64.335 ms
find_oldest_OP is the following, based on method7() from OP:
def find_oldest_OP(paths=("*",), exts=(".mp4", ".jpg"), k=5):
files = []
for path in paths:
files.extend(
(file for file in glob.glob(path)
if not os.path.islink(file) and any(file.endswith(ext) for ext in exts)))
oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)
return oldest_files[:k]
The Cython version seems to point to a ~25% reduction in execution time.

You could use the subprocess module to list all the mp4 files directly, without having to loop through all the files in the directory.
import subprocess as sb
files = sb.getoutput(r"dir /b /s .\home\*.mp4").split("\n")
oldest_files = sorted(files, key=lambda fn: os.stat(fn).st_mtime)[:len(camera_devices)]
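Note that dir /b /s is a Windows command; since /home in the question is a Linux mount point, the same idea with GNU find lets the file system tools do both the listing and the mtime lookup in one pass, for example:
import subprocess as sb

# find prints "mtime path" per file; sort -n puts the oldest first
cmd = "find /home -name '*.mp4' -printf '%T@ %p\\n' | sort -n"
lines = sb.getoutput(cmd).splitlines()
oldest_files = [line.split(" ", 1)[1]
                for line in lines[:len(camera_devices)]]  # camera_devices as in the question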

A quick optimization would be to not bother checking the file creation time at all and to trust the filename instead.
import re
from datetime import datetime

total, used, free = shutil.disk_usage("/home")
used_percent = int(used / total * 100)
if used_percent > 80:
    logging.info("SSD usage %s. Looking for the oldest files", used_percent)
    try:
        files = []
        for dirname, dirnames, filenames in os.walk('/home/recordings'):
            for filename in filenames:
                files.append((
                    name := os.path.join(dirname, filename),
                    datetime.strptime(
                        re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0],
                        "%Y-%m-%d/%H-%M"
                    )
                ))
        # sort in place (list.sort returns None, so it cannot be chained)
        files.sort(key=lambda e: e[1])
        oldest_files = files[:len(camera_devices)]
        logging.info("Removing %s", oldest_files)
        for oldest_file, _ in oldest_files:
            os.remove(oldest_file)
        logging.info("Removed")
    except (ValueError, TypeError):
        # no files to delete, or a filename didn't match the expected pattern
        pass
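For example, on a path shaped like the ones in your tree, the parsing step works like this:
import re
from datetime import datetime

name = "/home/recordings/camera1/2022-06-24/17-36.mp4"
stamp = re.search(r'\d{4}-\d{2}-\d{2}/\d{2}-\d{2}', name)[0]  # '2022-06-24/17-36'
print(datetime.strptime(stamp, "%Y-%m-%d/%H-%M"))             # 2022-06-24 17:36:00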

Related

Is there a better way to do this? Counting Files, and directories via for loop vs map

Folks,
I'm trying to optimize this to help speed up the process...
What I am doing is creating a dictionary of scandir entries...
e.g.
fs_data = {}
for item in Path(fqpn).iterdir():
# snipped out a bunch of normalization code
fs_data[item.name.title().strip()] = item
{'file1': <file1 scandisk data>, etc}
and then later using a function to gather the count of files, and directories in the data.
Now I suspect that the new code, using map, could be optimized to be faster than the old code. I suspect the slowdown is having to run the list comprehension twice, once for files and once for directories.
But I can't think of a way to optimize it to only have to run once.
Can anyone suggest a way to sum the files, and directories at the same time in the new version? (I could fall back to the old code, if necessary)
But I might be over optimizing at this point?
Any feedback would be welcome.
def new_fs_counts(fs_entries) -> (int, int):
"""
Quickly count the files vs directories in a list of scandir entries
Used primary by sync_database_disk to count a path's files & directories
Parameters
----------
fs_entries (list) - list of scandir entries
Returns
-------
tuple - (# of files, # of dirs)
"""
def counter(fs_entry):
return (fs_entry.is_file(), not fs_entry.is_file())
mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
return (files, dirs)
vs
def old_fs_counts(fs_entries) -> (int, int):
"""
Quickly count the files vs directories in a list of scandir entries
Used primary by sync_database_disk to count a path's files & directories
Parameters
----------
fs_entries (list) - list of scandir entries
Returns
-------
tuple - (# of files, # of dirs)
"""
files = 0
dirs = 0
for fs_item in fs_entries:
is_file = fs_entries[fs_item].is_file()
files += is_file
dirs += not is_file
return (files, dirs)
map is fast here if you map the is_file function directly:
files = sum(map(os.DirEntry.is_file, fs_entries.values()))
dirs = len(fs_entries) - files
(Something with filter might be even faster, at least if most entries aren't files. Or filter with is_dir if that works for you and most entries aren't directories. Or itertools.filterfalse with is_file. Or using itertools.compress. Also, counting True with list.count or operator.countOf instead of summing bools might be faster. But all of these ideas take more code (and some also memory). I'd prefer my above way.)
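For example, two of those variants could look like this (same fs_entries dict of scandir entries as in the question):
import os
from itertools import filterfalse
from operator import countOf

# operator.countOf counts the True values without building an intermediate list
files = countOf(map(os.DirEntry.is_file, fs_entries.values()), True)
dirs = len(fs_entries) - files

# itertools.filterfalse keeps only the entries that are NOT files, i.e. the directories
dirs_alt = sum(1 for _ in filterfalse(os.DirEntry.is_file, fs_entries.values()))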
Okay, map is definitely not the right answer here.
This morning I got up and created a test using timeit...
and it was a bit of a splash of reality to the face.
Without optimizations, new vs old, the new map code was roughly 2x the time.
New : 0.023185124970041215
old : 0.011841499945148826
I really ended up falling for a bit of click bait, and thought that rewriting with MAP would gain some better efficiency.
For the sake of completeness.
from timeit import timeit
import os
new = '''
def counter(fs_entry):
files = fs_entry.is_file()
return (files, not files)
mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
#dirs = len(fs_entries)-files
'''
#dirs = sum(dirs for _, dirs in mapdata)
old = '''
files = 0
dirs = 0
for fs_item in fs_entries:
is_file = fs_entries[fs_item].is_file()
files += is_file
dirs += not is_file
'''
fs_location = '/Volumes/4TB_Drive/gallery/albums/collection1'
fs_data = {}
for item in os.scandir(fs_location):
fs_data[item.name] = item
print("New : ", timeit(stmt=new, number=1000, globals={'fs_entries':fs_data}))
print("old : ", timeit(stmt=old, number=1000, globals={'fs_entries':fs_data}))
And while I was able to close the gap with some optimizations... (Thank you Lee for your suggestion)
New : 0.10864979098550975
old : 0.08246175001841038
It is clear that the for loop solution is easier to read, faster, and just simpler.
The speed difference between new and old doesn't seem to be map specifically.
The duplicate sum statement added 0.021, and the biggest slowdown was from the second fs_entry.is_file() call, which added 0.06 to the timings...

How to get the list of csv files in a directory sorted by creation date in Python

I need to get the list of ".csv" files in a directory, sorted by creation date.
I use this function:
from os import listdir
from os.path import isfile, join, getctime
def get_sort_files(path, file_extension):
list_of_files = filter(lambda x: isfile(join(path, x)),listdir(path))
list_of_files = sorted(list_of_files, key=lambda x: getctime(join(path, x)))
list_of_files = [file for file in list_of_files if file.endswith(file_extension)] # keep only csv files
return list_of_files
It works fine when I use it in directories that contain a small number of csv files (e.g. 500), but it's very slow when I use it in directories that contain 50000 csv files: it takes about 50 seconds to return.
How can I modify it? Or can I use a better alternative function?
EDIT1:
The bottleneck is the sorted function, so I must find an alternative to sort the files by creation date without using it
EDIT2:
I only need the oldest file (the first if sorted by creation date), so maybe I don't need to sort all the files. Can I just pick the oldest one?
You should start by only examining the creation time on relevant files. You can do this by using glob() to return the files of interest.
Build a list of 2-tuples - i.e., (creation time, file name)
A sort of that list will implicitly be performed on the first item in each tuple (the creation date).
Then you can return a list of files in the required order.
from glob import glob
from os.path import join, getctime
def get_sort_files(path, extension):
list_of_files = []
for file in glob(join(path,f'*{extension}')):
list_of_files.append((getctime(file), file))
return [file for _, file in sorted(list_of_files)]
print(get_sort_files('some directory', 'csv'))
Edit:
I created a directory with 50,000 dummy CSV files and timed the code shown in this answer. It took 0.24s
Edit 2:
OP only wants oldest file. In which case:
def get_oldest_file(path, extension):
ctime = float('inf')
old_file = None
for file in glob(join(path,f'*{extension}')):
if (ctime_ := getctime(file)) < ctime:
ctime = ctime_
old_file = file
return old_file
You could try using os.scandir:
from os import scandir
def get_sort_files(path, file_extension):
"""Return the oldest file in path with correct file extension"""
list_of_files = [(d.stat().st_ctime, d.path) for d in scandir(path) if d.is_file() and d.path.endswith(file_extension)]
return min(list_of_files)[1]  # just the path of the oldest file
os.scandir seems to use fewer calls to stat(). See this post for details.
I could see much better performance on a sample folder with 5000 csv files.
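If you still want the full sorted list (like the original get_sort_files), the same scandir scan can feed sorted() instead of min(), for example:
from os import scandir

def get_sorted_files(path, file_extension):
    """Return all matching files in path, oldest first (sorted by ctime)."""
    entries = [(d.stat().st_ctime, d.path) for d in scandir(path)
               if d.is_file() and d.path.endswith(file_extension)]
    return [p for _, p in sorted(entries)]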
You could try the following code:
def get_sort_files(path, file_extension):
list_of_files = [file for file in listdir(path) if isfile(join(path, file)) and file.endswith(file_extension)]
list_of_files.sort(key=lambda x: getctime(join(path, x)))
return list_of_files
This version could have better performance, especially on big folders. It uses a list comprehension to ignore irrelevant files right from the start, and it sorts the list in place.
This way, this code uses only one list. In your code, you create multiple lists in memory and the data has to be copied each time:
listdir(path) returns the initial list of filenames
sorted(...) returns a filtered and sorted copy of the initial list
The list comprehension before the return statement creates another new list
You can try this method:
def get_sort_files(path, extention):
# Relative path generator
sort_paths = (join(path, i)
for i in listdir(path) if i.endswith(extention))
sort_paths = sorted(sort_paths, key=getctime)
return sort_paths
# Include the . char to be explicit
>>> get_sort_files("dir", ".csv")
['dir/new.csv', 'dir/test.csv']
However, all file names are returned as relative paths, e.g. folder/file.csv. A slightly less efficient workaround would be to use a lambda key again:
def get_sort_files(path, extention):
# File name generator
sort_paths = (i for i in listdir(path) if i.endswith(extention))
sort_paths = sorted(sort_paths, key=lambda x: getctime(join(path, x)))
return sort_paths
>>> get_sort_files("dir", ".csv")
['new.csv', 'test.csv']
Edit for avoiding sorted():
Using min():
This is the fastest method of all listed in this answer
def get_sort_files(path, extention):
# Relative path generator
sort_paths = (join(path, i) for i in listdir(path) if i.endswith(extention))
return min(sort_paths, key=getctime)
Manually:
def get_sort_files(path, extention):
# Relative path generator
sort_paths = [join(path, i) for i in listdir(path) if i.endswith(extention)]
oldest = (getctime(sort_paths[0]), sort_paths[0])
for i in sort_paths[1:]:
t = getctime(i)
if t < oldest[0]:
oldest = (t, i)
return oldest[1]

Is it possible to do dbutils io asynchronously?

I've written some code (based on https://stackoverflow.com/a/40199652/529618) that writes partitioned data to blob, and for the most part it's quite quick. The slowest part is that the one CSV file per partition that I have Spark generate is named in a user-unfriendly way, so I do a simple rename operation to clean them up (and delete some excess files). This takes much longer than writing the data in the first place.
# Organize the data into a folders matching the specified partitions, with a single CSV per partition
from datetime import datetime
def one_file_per_partition(df, path, partitions, sort_within_partitions, VERBOSE = False):
extension = ".csv.gz" # TODO: Support multiple extention
start = datetime.now()
df.repartition(*partitions).sortWithinPartitions(*sort_within_partitions) \
.write.partitionBy(*partitions).option("header", "true").option("compression", "gzip").mode("overwrite").csv(path)
log(f"Wrote {get_df_name(df)} data partitioned by {partitions} and sorted by {sort_within_partitions} to:" +
f"\n {path}\n Time taken: {(datetime.now() - start).total_seconds():,.2f} seconds")
# Recursively traverse all partition subdirectories and rename + move the CSV to their root
# TODO: This is very slow, it should be parallelizable
def traverse(root, remaining_partitions):
if VERBOSE: log(f"Traversing partitions by {remaining_partitions[0]} within folder: {root}")
for folder in list_subfolders(root):
subdirectory = os.path.join(root, folder)
if(len(remaining_partitions) > 1):
traverse(subdirectory, remaining_partitions[1:])
else:
destination = os.path.join(root, folder[len(f"{remaining_partitions[0]}="):]) + extension
if VERBOSE: log(f"Moving file\nFrom:{subdirectory}\n To:{destination}")
spark_output_to_single_file(subdirectory, destination, VERBOSE)
log(f"Cleaning up spark output directories...")
start = datetime.now()
traverse(path, partitions)
log(f"Moving output files to their destination took {(datetime.now() - start).total_seconds():,.2f} seconds")
# Convert a single-file spark output folder into a single file at the specified location, and clean up superfluous artifacts
def spark_output_to_single_file(output_folder, destination_path, VERBOSE = False):
output_files = [x for x in dbutils.fs.ls(output_folder) if x.name.startswith("part-")]
if(len(output_files) == 0):
raise FileNotFoundError(f"Could not find any output files (prefixed with 'part-') in the specified spark output folder: {output_folder}")
if(len(output_files) > 1):
raise ValueError(f"The specified spark folder has more than 1 output file in the specified spark output folder: {output_folder}\n" +
f"We found {len(output_files)}: {[x.name for x in output_files]}\n" +
f"This function should only be used for single-file spark outputs.")
dbutils.fs.mv(output_files[0].path, destination_path)
# Clean up all the other spark output generated to our temp folder
dbutils.fs.rm(output_folder, recurse=True)
if VERBOSE: log(f"Successfully wrote {destination_path}")
Here is a sample output:
2022-04-22 20:36:45.313963 Wrote df_test data partitioned by ['Granularity', 'PORTINFOID'] and sorted by ['Rank'] to: /mnt/.../all_data_by_rank
Time taken: 19.31 seconds
2022-04-22 20:36:45.314020 Cleaning up spark output directories...
2022-04-22 20:37:42.583850 Moving output files to their destination took 57.27 seconds
I believe the reason is that I'm processing the folders sequentially, and if I could simply do it in parallel, it would go much quicker.
The problem is that all I/O on Databricks is done with "dbutils", which abstracts away the mounted blob container and makes this sort of thing very easy. I just can't find any information about doing async I/O with this utility, though.
Does anyone know how I could attempt to parallelize this activity?
The solution wound up being to abandon dbutils, which does not support parallelism in any way, and instead use os operations, which do:
import os
from datetime import datetime
from pyspark.sql.types import StringType
# Recursively traverse all partition subdirectories and rename + move the outputs to their root
# NOTE: The code to do this sequentially is much simpler, but very slow. The complexity arises from parallelising the file operations
def spark_output_to_single_file_per_partition(root, partitions, output_extension, VERBOSE = False):
if VERBOSE: log(f"Cleaning up spark output directories...")
start = datetime.now()
# Helper to recursively collect information from all partitions and flatten it into a single list
def traverse_partitions(root, partitions, fn_collect_info, currentPartition = None):
results = [fn_collect_info(root, currentPartition)]
return results if len(partitions) == 0 else results + \
[result for subdir in [traverse_partitions(os.path.join(root, folder), partitions[1:], fn_collect_info, partitions[0]) for folder in list_subfolders(root)] for result in subdir]
# Get the path of files to rename or delete. Note: We must convert to OS paths because we cannot parallelize use of dbutils
def find_files_to_rename_and_delete(folder, partition):
files = [x.name for x in dbutils.fs.ls(folder)]
renames = [x for x in files if x[0:5] == "part-"]
deletes = [f"/dbfs{folder}/{x}" for x in files if x[0:1] == "_"]
if len(renames) > 0 and partition is None: raise Exception(f"Found {len(files)} partition file(s) in the root location: {folder}. Have files already been moved?")
elif len(renames) > 1: raise Exception(f"Expected at most one partition file, but found {len(files)} in location: {folder}")
elif len(renames) == 1: deletes.append(f"/dbfs{folder}/") # The leaf-folders (containing partitions) should be deleted after the file is moved
return (deletes, None if len(renames) == 0 else (f"/dbfs{folder}/{renames[0]}", f"/dbfs{folder.replace(partition + '=', '')}{output_extension}"))
# Scan the file system to find all files and folders that need to be moved and deleted
if VERBOSE: log(f"Collecting a list of files that need to be renamed and deleted...")
actions = traverse_partitions(root, partitions, find_files_to_rename_and_delete)
# Rename all files in parallel using spark executors
renames = [rename for (deletes, rename) in actions if rename is not None]
if VERBOSE: log(f"Renaming {len(renames)} partition files...")
spark.createDataFrame(renames, ['from', 'to']).foreach(lambda r: os.rename(r[0], r[1]))
# Delete unwanted spark temp files and empty folders
deletes = [path for (deletes, rename) in actions for path in deletes]
delete_files = [d for d in deletes if d[-1] != "/"]
delete_folders = [d for d in deletes if d[-1] == "/"]
if VERBOSE: log(f"Deleting {len(delete_files)} spark outputs...")
spark.createDataFrame(delete_files, StringType()).foreach(lambda r: os.remove(r[0]))
if VERBOSE: log(f"Deleting {len(delete_folders)} empty folders...")
spark.createDataFrame(delete_folders, StringType()).foreach(lambda r: os.rmdir(r[0]))
log(f"Moving output files to their destination and cleaning spark artifacts took {(datetime.now() - start).total_seconds():,.2f} seconds")
This lets you generate partitioned data, with user-friendly names, and clean up all the spark temp files (_started..., _committed..., _SUCCESS) generated in the process.
Usage:
# Organize the data into a folders matching the specified partitions, with a single CSV per partition
def dataframe_to_csv_gz_per_partition(df, path, partitions, sort_within_partitions, rename_spark_outputs = True, VERBOSE = False):
start = datetime.now()
# Write the actual data to disk using spark
df.repartition(*partitions).sortWithinPartitions(*sort_within_partitions) \
.write.partitionBy(*partitions).option("header", "true").option("compression", "gzip").mode("overwrite").csv(path)
log(f"Wrote {get_df_name(df)} data partitioned by {partitions} and sorted by {sort_within_partitions} to:" +
f"\n {path}\n Time taken: {(datetime.now() - start).total_seconds():,.2f} seconds")
# Rename outputs and clean up
spark_output_to_single_file_per_partition(path, partitions, ".csv.gz", VERBOSE)
For what it's worth, I also tried parallelizing with Pool, but the results were not as good. I haven't attempted importing and using any libraries that can do async I/O; I imagine that would perform the best.
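A thread-based sketch of that idea (untested here) could reuse the (deletes, rename) pairs collected above with concurrent.futures, since os.rename and os.remove are I/O-bound and release the GIL:
import os
from concurrent.futures import ThreadPoolExecutor

def apply_actions_in_parallel(actions, max_workers=32):
    """Apply the collected rename/delete actions using a thread pool on the driver."""
    renames = [ren for (_, ren) in actions if ren is not None]
    delete_paths = [p for (dels, _) in actions for p in dels]
    delete_files = [d for d in delete_paths if not d.endswith("/")]
    delete_folders = [d for d in delete_paths if d.endswith("/")]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # renames first, then leftover files, then the (now empty) leaf folders
        list(pool.map(lambda r: os.rename(r[0], r[1]), renames))
        list(pool.map(os.remove, delete_files))
        list(pool.map(os.rmdir, delete_folders))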

Iterating over PE files

Below is part of my code in which I am trying to iterate over PE files. I am still getting the same error which is:
[Errno 2] No such file or directory: '//FlickLearningWizard.exe'
I tried using os.path.join(filePath) but it does not do anything since I have already built the path. I got rid of '/' but it did not help much. Here is my code:
B = 65521
T = {}
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
for file in samples:
filePath = directories+"/"+file
fileByteSequence = readFile(filePath)
fileNgrams = byteSequenceToNgrams(filePath,N)
hashFileNgramsIntoDictionary(fileNgrams,T)
K1 = 1000
import heapq
K1_most_common_Ngrams_Using_Hash_Grams = heapq.nlargest(K1, T)
And here is my complete error message:
FileNotFoundError Traceback (most recent call last)
<ipython-input-63-eb8b9254ac6d> in <module>
6 for file in samples:
7 filePath = directories+"/"+file
----> 8 fileByteSequence = readFile(filePath)
9 fileNgrams = byteSequenceToNgrams(filePath,N)
10 hashFileNgramsIntoDictionary(fileNgrams,T)
<ipython-input-3-4bdd47640108> in readFile(filePath)
1 def readFile(filePath):
----> 2 with open(filePath, "rb") as binary_file:
3 data = binary_file.read()
4 return data
5 def byteSequenceToNgrams(byteSequence, n):
A sample of the files I am trying to iterate through, which are in the datasetPath:
['FlickLearningWizard.exe', 'autochk.exe', 'cmd.exe', 'BitLockerWizard.exe', 'iexplore.exe', 'AxInstUI.exe', 'fvenotify.exe', 'DismHost.exe', 'GameBarPresenceWriter.exe', 'consent.exe', 'fax_390392029_072514.exe', 'Win32.AgentTesla.exe', '{71257279-042b-371d-a1d3-fbf8d2fadffa}.exe', 'imecfmui.exe', 'HxCalendarAppImm.exe', 'CExecSvc.exe', 'bootim.exe', 'dumped.exe', 'FXSSVC.exe', 'drvinst.exe', 'DW20.exe', 'appidtel.exe', 'baaupdate.exe', 'AuthHost.exe', 'last.exe', 'BitLockerToGo.exe', 'EhStorAuthn.exe', 'IMTCLNWZ.EXE', 'drvcfg.exe', 'makecab.exe', 'licensingdiag.exe', 'ldp.exe', 'win33.exe', 'forfiles.exe', 'DWWIN.EXE', 'comp.exe', 'coredpussvr.exe', 'AddSuggestedFoldersToLibraryDialog.exe', 'InetMgr6.exe', '3_4.exe', 'CIDiag.exe', 'win32.exe', 'LanguageComponentsInstallerComHandler.exe', 'sample.exe', 'Win32.SofacyCarberp.exe', 'EASPolicyManagerBrokerHost.exe', '131.exe', 'AddInUtil.exe', 'fixmapi.exe', 'cmdl32.exe', 'chkntfs.exe', 'instnm.exe', 'ImagingDevices.exe', 'BitLockerWizardElev.exe', 'bdechangepin.exe', 'logman.exe', '.DS_Store', 'bootcfg.exe', 'DsmUserTask.exe', 'find.exe', 'LogCollector.exe', 'HxTsr.exe', 'lpq.exe', 'ctfmon.exe', 'AppInstaller.exe', 'hvsimgr.exe', 'Vcffipzmnipbxzdl.exe', 'lpremove.exe', 'hdwwiz.exe', 'CastSrv.exe', 'gpresult.exe', 'hvix64.exe', 'HvsiSettingsWorker.exe', 'fodhelper.exe', '21.exe', 'InspectVhdDialog6.2.exe', '798_abroad.exe', 'doskey.exe', 'AuditShD.exe', 'alg.exe', 'certutil.exe', 'bitsadmin.exe', 'help.exe', 'fsquirt.exe', 'PDFXCview.exe', 'inetinfo.exe', 'Win32.Wannacry.exe', 'dcdiag.exe', 'LsaIso.exe', 'lpr.exe', 'dtdump.exe', 'FileHistory.exe', 'LockApp.exe', 'AppVShNotify.exe', 'DeviceProperties.exe', 'ilasm.exe', 'CheckNetIsolation.exe', 'FilePicker.exe', 'choice.exe', 'ComSvcConfig.exe', 'Calculator.exe', 'CredDialogHost.exe', 'logagent.exe', 'InspectVhdDialog6.3.exe', 'junction.exe', 'findstr.exe', 'ktmutil.exe', 'csvde.exe', 'esentutl.exe', 'Win32.GravityRAT.exe', 'bootsect.exe', 'BdeUISrv.exe', 'ChtIME.exe', 'ARP.EXE', 'dsdbutil.exe', 'iisreset.exe', '1003.exe', 'getmac.exe', 'dllhost.exe', 'BOTBINARY.EXE', 'cscript.exe', 'dnscacheugc.exe', 'aspnet_regbrowsers.exe', 'hvax64.exe', 'CredentialUIBroker.exe', 'dpnsvr.exe', 'ApplyTrustOffline.exe', 'LxRun.exe', 'credwiz.exe', '1002.exe', 'FileExplorer.exe', 'BackgroundTransferHost.exe', 'convert.exe', 'AppVClient.exe', 'evntcmd.exe', 'attrib.exe', 'ClipUp.exe', 'DmNotificationBroker.exe', 'dcomcnfg.exe', 'dvdplay.exe', 'Dism.exe', 'AtBroker.exe', 'invoice_2318362983713_823931342io.pdf.exe', 'DataSvcUtil.exe', 'bdeunlock.exe', 'DeviceCensus.exe', 'dstokenclean.exe', 'AndroRat Binder_Patched.exe', 'iediagcmd.exe', 'comrepl.exe', 'dispdiag.exe', 'FlashUtil_ActiveX.exe', 'cliconfg.exe', 'aitstatic.exe', 'gpupdate.exe', 'GetHelp.exe', 'charmap.exe', 'aspnet_regsql.exe', 'IMEWDBLD.EXE', 'AppVStreamingUX.exe', 'dwm.exe', 'Ransomware.Unnamed_0.exe', 'csc.exe', 'bridgeunattend.exe', 'icacls.exe', 'dialer.exe', 'BdeHdCfg.exe', 'fontdrvhost.exe', '027cc450ef5f8c5f653329641ec1fed9.exe', 'LocationNotificationWindows.exe', 'dpapimig.exe', 'BitLockerDeviceEncryption.exe', 'ftp.exe', 'Eap3Host.exe', 'dfsvc.exe', 'LogonUI.exe', 'Fake Intel (1).exe', 'chglogon.exe', 'fhmanagew.exe', 'changepk.exe', 'aspnetca.exe', 'IMEPADSV.EXE', 'browserexport.exe', 'bcdboot.exe', 'aspnet_wp.exe', 'FXSCOVER.exe', 'dllhst3g.exe', 'CertEnrollCtrl.exe', 'EduPrintProv.exe', 'ielowutil.exe', 'ADSchemaAnalyzer.exe', 'cygrunsrv.exe', 'HxAccounts.exe', 'diskperf.exe', 'certreq.exe', 'bcdedit.exe', 'efsui.exe', 'klist.exe', 
'raffle.exe', 'cacls.exe', 'hvc.exe', 'cmmon32.exe', 'BioIso.exe', 'AssignedAccessLockApp.exe', 'DmOmaCpMo.exe', 'AppLaunch.exe', 'AddInProcess.exe', 'dasHost.exe', 'dmcertinst.exe', 'IMJPSET.EXE', 'cmbins.exe', 'LicenseManagerShellext.exe', 'diskpart.exe', 'iscsicpl.exe', 'chown.exe', 'Magnify.exe', 'aapt.exe', 'false.exe', 'BioEnrollmentHost.exe', 'hvsirdpclient.exe', 'c2wtshost.exe', 'dplaysvr.exe', 'ChsIME.exe', 'fsavailux.exe', 'Win32.WannaPeace.exe', 'CasPol.exe', 'icsunattend.exe', 'fveprompt.exe', 'expand.exe', 'chgusr.exe', 'hvsirpcd.exe', 'MiniConfigBuilder.exe', 'FirstLogonAnim.exe', 'EDPCleanup.exe', 'ksetup.exe', 'AppVDllSurrogate.exe', 'InstallUtil.exe', 'immersivetpmvscmgrsvr.exe', 'cmdkey.exe', 'appcmd.exe', 'Build.exe', 'hostr.exe', 'CloudStorageWizard.exe', 'DWTRIG20.EXE', 'file_4571518150a8181b403df4ae7ad54ce8b16ded0c.exe', 'FsIso.exe', 'chmod.exe', 'imjpuexc.exe', 'CHXSmartScreen.exe', 'iissetup.exe', '7ZipSetup.exe', 'svchost.exe', 'ldifde.exe', 'logoff.exe', 'DiskSnapshot.exe', 'fontview.exe', 'LaunchWinApp.exe', 'GamePanel.exe', 'yfoye_dump.exe', 'ls.exe', 'HOSTNAME.EXE', 'at.exe', 'InetMgr.exe', 'FaceFodUninstaller.exe', 'InputPersonalization.exe', 'AppVNice.exe', 'ImeBroker.exe', 'CameraSettingsUIHost.exe', 'Defrag.exe', 'lpksetup.exe', 'djoin.exe', 'irftp.exe', 'DTUHandler.exe', 'LockScreenContentServer.exe', 'dsamain.exe', 'lpkinstall.exe', 'DataStoreCacheDumpTool.exe', 'dmclient.exe', 'dump1.exe', 'Cain.exe', 'AddInProcess32.exe', 'appidcertstorecheck.exe', 'IMJPUEX.EXE', 'HxOutlook.exe', 'FlashPlayerApp.exe', 'diskraid.exe', 'bthudtask.exe', 'explorer.exe', 'CompMgmtLauncher.exe', 'malware.exe', 'njRAT.exe', 'CompatTelRunner.exe', 'evntwin.exe', 'Dxpserver.exe', 'HelpPane.exe', 'cvtres.exe', 'dxdiag.exe', 'hvsievaluator.exe', 'signed.exe', 'csrss.exe', 'InstallBC201401.exe', 'audiodg.exe', 'dsregcmd.exe', 'ApproveChildRequest.exe', 'iisrstas.exe', 'chkdsk.exe', 'lodctr.exe', 'aspnet_state.exe', 'DiagnosticsHub.StandardCollector.Service.exe', 'chgport.exe', 'cleanmgr.exe', 'GameBar.exe', 'AgentService.exe', 'InfDefaultInstall.exe', 'IMESEARCH.EXE', 'Fondue.exe', 'iexpress.exe', 'backgroundTaskHost.exe', 'dfrgui.exe', 'cofire.exe', 'BrowserCore.exe', 'clip.exe', 'appidpolicyconverter.exe', 'ed01ebfbc9eb5bbea545af4d01bf5f1071661840480439c6e5babe8e080e41aa.exe', 'cipher.exe', 'DeviceEject.exe', 'cerber.exe', '5a765351046fea1490d20f25.exe', 'CloudExperienceHostBroker.exe', 'FXSUNATD.exe', 'GenValObj.exe', 'lsass.exe', 'ddodiag.exe', 'cmstp.exe', 'wirelesskeyview.exe', 'edpnotify.exe', 'CameraBarcodeScannerPreview.exe', 'bfsvc.exe', 'eventcreate.exe', 'driverquery.exe', 'CCG.exe', 'ConfigSecurityPolicy.exe', 'ieUnatt.exe', 'eshell.exe', 'ipconfig.exe', 'jsc.exe', 'gpscript.exe', 'LaunchTM.exe', 'cttunesvr.exe', 'curl.exe', 'cttune.exe', 'DevicePairingWizard.exe', 'ByteCodeGenerator.exe', 'IEChooser.exe', 'LockAppHost.exe', 'DataExchangeHost.exe', 'dxgiadaptercache.exe', 'dsacls.exe', 'Locator.exe', 'DpiScaling.exe', 'DisplaySwitch.exe', 'autoconv.exe', 'IMJPDCT.EXE', 'ieinstal.exe', 'colorcpl.exe', 'auditpol.exe', 'dccw.exe', 'DeviceEnroller.exe', 'UpdateCheck.exe', 'LicensingUI.exe', 'ExtExport.exe', 'easinvoker.exe', 'ApplySettingsTemplateCatalog.exe', 'eventvwr.exe', 'browser_broker.exe', 'extrac32.exe', 'EaseOfAccessDialog.exe', 'label.exe', 'change.exe', 'IMCCPHR.exe', 'audit.exe', 'aspnet_compiler.exe', 'aspnet_regiis.exe', 'desktopimgdownldr.exe', 'dmcfghost.exe', 'ComputerDefaults.exe', 'control.exe', 'DeviceCredentialDeployment.exe', 'compact.exe', 
'InspectVhdDialog.exe', 'EdmGen.exe', 'cmak.exe', 'AppHostRegistrationVerifier.exe', 'DataUsageLiveTileTask.exe', 'hcsdiag.exe', 'gchrome.exe', 'adamuninstall.exe', 'CloudNotifications.exe', 'dusmtask.exe', 'fc.exe', 'hh.exe', 'eudcedit.exe', 'iscsicli.exe', 'DFDWiz.exe', 'isoburn.exe', 'IMTCPROP.exe', 'CapturePicker.exe', 'abba_-_happy_new_year_zaycev_net.exe', 'finger.exe', 'ApplicationFrameHost.exe', 'calc.exe', 'counter.exe', 'editrights.exe', 'fltMC.exe', 'convertvhd.exe', 'LegacyNetUXHost.exe', 'grpconv.exe', 'ie4uinit.exe', 'dsmgmt.exe', 'fsutil.exe', 'AppResolverUX.exe', 'BootExpCfg.exe', 'conhost.exe', 'bash.exe', 'IcsEntitlementHost.exe']
Can anyone help please?
(Edited in reaction to question updates; probably scroll down to the end.)
This probably contains more than one bug:
for directories in datasetPath: # directories iterating over my datasetPath which contains list of my pe files
samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]
for file in samples:
filePath = directories+"/"+file
fileByteSequence = readFile(filePath)
Without knowledge of the precise data types here, it's hard to know exactly how to fix this. But certainly, if datasetPath is a list of paths, os.path.join(datasetPath, f) will not produce what you hope and expect.
Assuming datasetPath contains something like ['A:\\', r'c:\windows\kill me now'], a more or less logical rewrite could look something like
for dir in datasetPath:
samples = []
for f in os.listdir(dir):
p = os.path.join(dir, f)
if isfile(p):
samples.append(p)
for filePath in samples:
fileByteSequence = readFile(filePath)
Notice how we produce the full path just once, and then keep that. Notice how we use the loop variable dir inside the loop, not the list of paths we are looping over.
Actually I'm guessing datasetPath is actually a string, but then the for loop makes no sense (you end up looping over the characters in the string one by one).
If you merely want to check which of these files exist in the current directory, you are massively overcomplicating things.
for filePath in os.listdir('.'):
if filePath in datasetPath:
fileByteSequence = readFile(filePath)
Whether you loop over the files on the disk and check which ones are on your list, or vice versa, is not a crucial design detail; I have preferred the former on the theory that you want to minimize the number of disk accesses (in the ideal case you get all file names from the disk with a single system call).
I got it working finally.
For some reason it did not run properly on the Macintosh machine, so I used the same code on Linux and Windows and it ran successfully.
The problem was with this line specifically:
samples = [f for f in os.listdir(datasetPath) if isfile(join(datasetPath, f))]

Python: Continuously check size of files being added to list, stop at size, zip list, continue

I am trying to loop through a directory, check the size of each file, and add the files to a list until they reach a certain size (2040 MB). At that point, I want to put the list into a zip archive, and then continue looping through the next set of files in the directory and continue to do the same thing. The other constraint is that files with the same name but different extension need to be added together into the zip, and can't be separated. I hope that makes sense.
The issue I am having is that my code basically ignores the size constraint that I have added, and just zips up all the files in the directory anyway.
I suspect there is some logic issue, but I am failing to see it. Any help would be appreciated. Here is my code:
import os,os.path, zipfile
from time import *
#### Function to create zip file ####
# Add the files from the list to the zip archive
def zipFunction(zipList):
# Specify zip archive output location and file name
zipName = "D:\Documents\ziptest1.zip"
# Create the zip file object
zipA = zipfile.ZipFile(zipName, "w", allowZip64=True)
# Go through the list and add files to the zip archive
for w in zipList:
# Create the arcname parameter for the .write method. Otherwise the zip file
# mirrors the directory structure within the zip archive (annoying).
arcname = w[len(root)+1:]
# Write the files to a zip
zipA.write(w, arcname, zipfile.ZIP_DEFLATED)
# Close the zip process
zipA.close()
return
#################################################
#################################################
sTime = clock()
# Set the size counter
totalSize = 0
# Create an empty list for adding files to count MB and make zip file
zipList = []
tifList = []
xmlList = []
# Specify the directory to look at
searchDirectory = "Y:\test"
# Create a counter to check number of files
count = 0
# Set the root, directory, and file name
for root,direc,f in os.walk(searchDirectory):
#Go through the files in directory
for name in f:
# Set the os.path file root and name
full = os.path.join(root,name)
# Split the file name from the file extension
n, ext = os.path.splitext(name)
# Get size of each file in directory, size is obtained in BYTES
fileSize = os.path.getsize(full)
# Add up the total sizes for all the files in the directory
totalSize += fileSize
# Convert from bytes to megabytes
# 1 kilobyte = 1,024 bytes
# 1 megabyte = 1,048,576 bytes
# 1 gigabyte = 1,073,741,824 bytes
megabytes = float(totalSize)/float(1048576)
if ext == ".tif": # should be everything that is not equal to XML (could be TIF, PDF, etc.) need to fix this later
tifList.append(n)#, fileSize/1048576])
tifSorted = sorted(tifList)
elif ext == ".xml":
xmlList.append(n)#, fileSize/1048576])
xmlSorted = sorted(xmlList)
if full.endswith(".xml") or full.endswith(".tif"):
zipList.append(full)
count +=1
if megabytes == 2040 and len(tifList) == len(xmlList):
zipFunction(zipList)
else:
continue
eTime = clock()
elapsedTime = eTime - sTime
print "Run time is %s seconds"%(elapsedTime)
The only thing I can think of is that there is never an instance where my variable megabytes==2040 exactly. I can't figure out how to make the code stop at that point otherwise though; I wonder if using a range would work? I also tried:
if megabytes < 2040:
zipList.append(full)
continue
elif megabytes == 2040:
zipFunction(zipList)
Your main problem is that you need to reset your file size tally when you archive the current list of files. Eg
if megabytes >= 2040:
zipFunction(zipList)
totalSize = 0
BTW, you don't need
else:
continue
there, since it's the end of the loop.
As for the constraint that you need to keep files together that have the same main file name but different extensions, the only fool-proof way to do that is to sort the file names before processing them.
If you want to guarantee that the total file size in each archive is under the limit you need to test the size before you add the file(s) to the list. Eg,
if (totalSize + fileSize) // 1048576 > 2040:
    zipFunction(zipList)
    totalSize = 0
totalSize += fileSize
That logic will need to be modified slightly to handle keeping a group of files together: you'll need to add the file sizes of all the files in the group into a sub-total, and then see if adding that sub-total to totalSize takes it over the limit.
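A rough sketch of that grouping logic, reusing searchDirectory and zipFunction from your code (note that zipFunction as written always writes the same zip name, so it would need to take a varying output path in practice):
import os
from collections import defaultdict

LIMIT_BYTES = 2040 * 1048576  # 2040 MB

groups = defaultdict(list)  # base name -> all full paths sharing that base name
for root, direc, f in os.walk(searchDirectory):
    for name in f:
        base, ext = os.path.splitext(name)
        groups[base].append(os.path.join(root, name))

zipList, totalSize = [], 0
for base in sorted(groups):
    group = groups[base]
    groupSize = sum(os.path.getsize(p) for p in group)
    if zipList and totalSize + groupSize > LIMIT_BYTES:
        zipFunction(zipList)          # archive what fits, then start a new batch
        zipList, totalSize = [], 0
    zipList.extend(group)             # same-named files always stay together
    totalSize += groupSize
if zipList:
    zipFunction(zipList)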
