I have a little script that does a few simple tasks. Running Python 3.7.
One of the tasks has to merge some files together which can be a little time consuming.
It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.
Instead of waiting for it to finish one directory before moving on to the next, then waiting again, and so on...
I'd like to utilize the available cores/threads to have the script merge the PDFs in multiple directories at once, which should shave some time off.
I've got something like this:
if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    for directory in multi_directories:
        merge_pdfs(directory)
My merge PDF function looks like this:
def merge_pdfs(directory):
    root_dir = os.path.dirname(os.path.abspath(__file__))
    merged_dir_location = os.path.join(root_dir, 'merged')
    dir_title = directory.rsplit('/', 1)[-1]
    file_list = [file for file in os.listdir(directory)]
    merger = PdfFileMerger()
    for pdf in file_list:
        file_to_open = os.path.join(directory, pdf)
        merger.append(open(file_to_open, 'rb'))
    file_to_save = os.path.join(
        merged_dir_location,
        dir_title + "-merged.pdf"
    )
    with open(file_to_save, "wb") as fout:
        merger.write(fout)
    return True
This works great - but merge_pdfs runs slow in some instances where there are a high number of PDFs in the directory.
Essentially - I want to be able to loop through multi_directories and create a new thread or process for each directory and merge the PDFs at the same time.
I've looked at asyncio, multithreading and a wealth of little snippets here and there but can't seem to get it to work.
You can do something like:
from multiprocessing import Pool

n_processes = 2

...

if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    pool = Pool(n_processes)
    pool.map(merge_pdfs, multi_directories)
It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is the HDD, because reading several files in parallel from one physical HDD is usually slower than reading them consecutively. Try it with different values of n_processes.
BTW, to make a list from an iterable, use list(): file_list = list(os.listdir(directory)). And since listdir() already returns a list, you can just write file_list = os.listdir(directory).
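For reference, here is a minimal sketch of how the pieces could fit together, assuming merge_pdfs and multi_directories are defined as in the question; the if __name__ == "__main__": guard matters on Windows, where multiprocessing spawns fresh interpreter processes:
import os
from multiprocessing import Pool

n_processes = 2  # tune this; more workers help a CPU-bound merge, but can hurt if the disk is the bottleneck

if __name__ == "__main__":
    if multi_directories:
        os.makedirs('merged', exist_ok=True)   # create the output dir only if it is missing
        with Pool(n_processes) as pool:        # the with-block shuts the worker processes down afterwards
            pool.map(merge_pdfs, multi_directories)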
Related
I have some code that is great for doing small numbers of mp4s, but at the 100th one I start to run out of ram. I know you can sequentially write CSV files, I am just not sure how to do that for mp4s. Here is the code I have:
This solution works:
from moviepy.editor import *
import os
from natsort import natsorted

L = []

for root, dirs, files in os.walk("/path/to/the/files"):
    # files.sort()
    files = natsorted(files)
    for file in files:
        if os.path.splitext(file)[1] == '.mp4':
            filePath = os.path.join(root, file)
            video = VideoFileClip(filePath)
            L.append(video)

final_clip = concatenate_videoclips(L)
final_clip.to_videofile("output.mp4", fps=24, remove_temp=False)
The code above is what I tried. I expected a smooth result, and it worked perfectly on a test batch, but it could not handle the main batch.
You appear to be appending the contents of a large number of video files to a list, yet you report that available RAM is much less than the total size of those files. So don't accumulate the result in memory. Follow one of these approaches:
keep an open file descriptor
with open("combined_video.mp4", "wb") as fout:
for file in files:
...
video = ...
fout.write(video)
Or perhaps it is fout.write(video.data) or video.write_segment(fout) -- I don't know about the video I/O library you're using. The point is that the somewhat large video object is re-assigned each time, so it does not grow without bound, unlike your list L.
append to existing file
We can nest in the other order, if that's more convenient.
for file in files:
    with open("combined_video.mp4", "ab") as fout:
        ...
        video = ...
        fout.write(video)
Here we're doing binary append. Repeated open / close is slightly less efficient, but it has the advantage of letting you do a run with four input files, then Python exits, then later do a run with a pair of new files, and you'll still find the expected half dozen files in the combined output.
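Just to make the pattern concrete, here is a minimal sketch of the byte-level append idea, copying each source file in fixed-size chunks so memory use stays bounded. The chunk size and paths are illustrative, and it assumes your format or downstream tooling can handle simple concatenation, which is not a given for mp4 containers:
import os

CHUNK_SIZE = 1024 * 1024  # copy 1 MiB at a time so memory use stays flat

def append_file(src_path, fout):
    """Copy one source file onto an already-open output file, chunk by chunk."""
    with open(src_path, "rb") as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)

with open("combined_video.mp4", "wb") as fout:
    for name in sorted(os.listdir("/path/to/the/files")):
        if name.endswith(".mp4"):
            append_file(os.path.join("/path/to/the/files", name), fout)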
I am interested in speeding up my file read times by implementing multiprocessing, but I am having trouble getting data back from each process. The order does matter when all the data is put together and I am using Python 3.9.
# read files from file list in the given indices
def read_files(files, folder_path):
    raw_data = []
    # loops through all tif files in the given folder and parses the data.
    for file in files:
        if file[-3:] == "tif":
            curr_frame = Image.open(os.path.join(folder_path, file))
            raw_data.append(np.array(curr_frame))
    return np.asarray(raw_data).astype(np.float64)

def run_processes(folder_path=None):
    if folder_path is None:
        global PATH
        folder_path = PATH
    files = os.listdir(folder_path)
    start = time.time()
    processes = []
    num_files_per = int(len(files) / os.cpu_count())
    for i in range(os.cpu_count()):
        processes.append(Process(target=read_files, args=(files[(i*num_files_per):((i+1)*num_files_per)], folder_path)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    end = time.time()
    print(f"Multi: {end - start}")
Any help is much appreciated!
To potentially increase the speed, generate a list of file paths, and write a worker function that takes a single path as its argument and returns its data.
If you use that worker with a multiprocessing.Pool, it will take care of the details of returning the data for you.
Keep in mind that you are trading the time to read a file for the overhead of returning the data to the parent process.
It is not a given that this is a net improvement.
And then there is the issue of file reads themselves. Since these files are presumably on the same device, you could run into the maximum throughput of the device here.
In general, if the processing you have to do on the images only depends on a single image, it could be worth it to do that processing in the worker, because that would speed things up.
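For what it's worth, a minimal sketch of that Pool-based approach might look like the following; the read_one worker and the folder path are illustrative, and Pool.map already returns results in the same order as the input list, which covers the ordering requirement:
import os
import numpy as np
from multiprocessing import Pool
from PIL import Image

def read_one(path):
    """Worker: open a single tif and return it as a float64 array."""
    with Image.open(path) as frame:
        return np.array(frame, dtype=np.float64)

def read_folder(folder_path, n_workers=None):
    paths = [os.path.join(folder_path, f)
             for f in sorted(os.listdir(folder_path))
             if f.endswith(".tif")]
    # Pool.map preserves input order, so the stacked array matches the file order.
    with Pool(n_workers) as pool:
        frames = pool.map(read_one, paths)
    return np.asarray(frames)

if __name__ == "__main__":
    data = read_folder("/path/to/tifs")  # hypothetical path
    print(data.shape)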
I'm in the process of developing a data column check, but I'm having a tough time figuring out how to properly loop through a list of files. I have a folder with a list of csv files. I need to check if each file maintains a certain structure. I'm not worried about checking the structure of each file, I'm more worried about how to properly pull each individual file from the dir, dataframe it, and then move on to the next file. Any help would be much appreciated.
def files(path):
    files = os.listdir(path)
    len_files = len(files)
    cnt = 0
    while cnt < len_files:
        print(files)
        for file in os.listdir(path):
            if os.path.isfile(os.path.join(path, file)):
                with open(path + file, 'r') as f:
                    return data_validate(f)

def data_validate(file):
    # Validation check code will eventually go here...
    print(pd.read_csv(file))

def run():
    files("folder/subfolder/")
Which version of python do you use?
I use Pathlib and python3.6+ to do a lot of file processing with pandas. I find Pathlib easy to use, though you still have to dip back into os for a couple of functions they haven't implemented yet. A plus is that Path objects can be passed into the os functions without modification - so I like the flexibility.
This is a function I used to recursively go through an arbitrary directory structure that I have modified to look more like what you're trying to achieve above, returning a list of DataFrames.
If your directory is always going to be flat, you can simplify this even more.
from pathlib import Path

def files(directory):
    top_dir = Path(directory)
    validated_files = list()
    for item in top_dir.iterdir():
        if item.is_file():
            validated_files.append(data_validate(item))
        elif item.is_dir():
            validated_files.append(files(item))
    return validated_files
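For the flat-directory case mentioned above, a simplified sketch (still assuming your data_validate function) could just glob the csv files directly:
from pathlib import Path

def files(directory):
    # No recursion needed when every csv sits in one folder.
    return [data_validate(item) for item in sorted(Path(directory).glob("*.csv"))]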
I have a link to a folder which has enormous number of files that I want to download. I started downloading it single file at a time, however it's taking a very long time. Is there a way to spawn some multi-threaded processes to download maybe a batch of files simultaneously. Probably like process1 downloads the first 20 files in the folder, process2 downloads the next 20 simultaneously and so on.
Right now, I'm doing as follows:
import urllib, os
os.chdir('/directory/to/save/the/file/to')
url = 'http://urltosite/folderthathasfiles'
urllib.urlretrieve(url)
You can define a function that takes the link and a list of filenames, loops through the list, and downloads the files; then create a thread for each list and have it target that function. For example:
import os
import threading
import urllib

def download_files(url, filenames):
    for filename in filenames:
        urllib.urlretrieve(os.path.join(url, filename))

# then create the lists and threads
url = 'test.url'
files = [[file1, file2, file3....], [file21, file22, file23...]...]
for lst in files:
    threading.Thread(target=download_files, args=(url, lst)).start()
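If you start from a flat list of filenames, a small sketch like this (with a hypothetical filenames list and a batch size of 20, reusing the download_files function above) shows one way to split the work into batches and wait for all the downloads to finish:
import threading

def chunk(seq, size):
    """Split a list into consecutive batches of the given size."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

url = 'http://urltosite/folderthathasfiles'  # from the question
filenames = [...]  # hypothetical: the names of the files in the folder

threads = []
for batch in chunk(filenames, 20):
    t = threading.Thread(target=download_files, args=(url, batch))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait until every batch has finished downloading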
I am executing the python code that follows.
I am running it on a folder ("articles") which has a couple hundred subfolders and 240,226 files in all.
I am timing the execution. At first the times were pretty stable but went non-linear after 100,000 files. Now the times (I am timing at 10,000 file intervals) can go non-linear after 30,000 or so (or not).
I have the Task Manager open and correlate the slow-downs to 99% disk usage by python.exe. I have done gc.collect(), dels, etc., and turned off Windows indexing. I have re-started Windows and emptied the trash (I have a few hundred GBs free). Nothing helps; the disk usage seems to be getting more erratic if anything.
Sorry for the long post - Thanks for the help
def get_filenames():
    for (dirpath, dirnames, filenames) in os.walk("articles/"):
        dirs.extend(dirnames)
    for dir in dirs:
        path = "articles" + "\\" + dir
        nxml_files.extend(glob.glob(path + "/*.nxml"))
    return nxml_files

def extract_text_from_files(nxml_files):
    for nxml_file in nxml_files:
        fast_parse(nxml_file)

def fast_parse(infile):
    file = open(infile, "r")
    filetext = file.read()
    tag_breaks = filetext.split('><')
    paragraphs = [tag_break.strip('p>').strip('</') for tag_break in tag_breaks if tag_break.startswith('p>')]

def run_files():
    nxml_files = get_filenames()
    extract_text_from_files(nxml_files)

if __name__ == "__main__":
    run_files()
There are some things that could be optimized.
First, if you open files, close them as well. A with open(...) as name: block will do that easily. BTW, in Python 2 file is a bad choice for a variable name; it is the name of a built-in.
You can remove one disk read by doing string comparisons instead of the glob.
And last but not least: os.walk yields its results lazily, so don't buffer them into a list; process everything inside one loop. This will save a lot of memory.
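Putting those three points together, a minimal sketch of the restructured loop might look like this; it keeps the asker's 'p>' parsing, and the plain string check standing in for glob is an assumption about which files matter:
import os

def extract_paragraphs(root="articles/"):
    # Walk lazily and process each file as it is found: nothing is buffered in a list.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".nxml"):   # plain string check instead of glob
                continue
            path = os.path.join(dirpath, name)
            with open(path, "r") as f:       # the with-block closes the file for us
                filetext = f.read()
            tag_breaks = filetext.split('><')
            paragraphs = [tb.strip('p>').strip('</')
                          for tb in tag_breaks if tb.startswith('p>')]
            # ... do something with paragraphs here ...

if __name__ == "__main__":
    extract_paragraphs()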
That is what I can advise from the code. For more details on what is causing the I/O you should use profiling. See https://docs.python.org/2/library/profile.html for details.