I have a link to a folder that has an enormous number of files I want to download. I started downloading them one file at a time, but it's taking a very long time. Is there a way to spawn several multi-threaded processes to download a batch of files simultaneously? For example, process 1 downloads the first 20 files in the folder, process 2 downloads the next 20 at the same time, and so on.
Right now, I'm doing it as follows:
import urllib, os
os.chdir('/directory/to/save/the/file/to')
url = 'http://urltosite/folderthathasfiles'
urllib.urlretrieve(url)
You can define a function that takes the link and a list of filenames, loops through the list, and downloads each file; then create a thread for each list and have it target that function. For example:
import os
import threading
import urllib

def download_files(url, filenames):
    # download each file in the list, saving it under the same name locally
    for filename in filenames:
        urllib.urlretrieve(os.path.join(url, filename), filename)

# then create the lists and threads
url = 'test.url'
files = [[file1, file2, file3....], [file21, file22, file23...]...]
for lst in files:
    threading.Thread(target=download_files, args=(url, lst)).start()
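If you would rather not slice the file list into batches yourself, a thread pool can spread the downloads across a fixed number of workers. Here is a minimal sketch, assuming Python 3 (where urlretrieve lives in urllib.request) and a hypothetical base URL and list of filenames:

import os
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve  # Python 3 location of urlretrieve

url = 'http://urltosite/folderthathasfiles'  # hypothetical base URL
filenames = ['file1', 'file2', 'file3']      # hypothetical file names

os.chdir('/directory/to/save/the/file/to')

def download(filename):
    # fetch one file and save it under the same name in the current directory
    urlretrieve(url + '/' + filename, filename)

# up to 20 downloads run concurrently; map blocks until all of them finish
with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(download, filenames)

Because the work is I/O-bound, threads are usually enough here; separate processes would not buy you much.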
I have some code that is great for doing small numbers of mp4s, but at the 100th one I start to run out of RAM. I know you can sequentially write CSV files; I am just not sure how to do that for mp4s. Here is the code I have, which works on a small batch:
from moviepy.editor import *
import os
from natsort import natsorted

L = []

for root, dirs, files in os.walk("/path/to/the/files"):
    # files.sort()
    files = natsorted(files)
    for file in files:
        if os.path.splitext(file)[1] == '.mp4':
            filePath = os.path.join(root, file)
            video = VideoFileClip(filePath)
            L.append(video)

final_clip = concatenate_videoclips(L)
final_clip.to_videofile("output.mp4", fps=24, remove_temp=False)
The code above is what I tried. It worked perfectly on a test batch, so I expected it to go smoothly, but it could not handle the main batch.
You appear to be appending the contents of a large number of video files to a list, yet you report that the available RAM is much less than the total size of those files. So don't accumulate the result in memory. Follow one of these approaches:
keep an open file descriptor
with open("combined_video.mp4", "wb") as fout:
for file in files:
...
video = ...
fout.write(video)
Or perhaps it is fout.write(video.data) or video.write_segment(fout); I don't know the details of the video I/O library you're using. The point is that the somewhat large video object is re-assigned each time, so it does not grow without bound, unlike your list L.
append to existing file
We can nest in the other order, if that's more convenient.
for file in files:
    with open("combined_video.mp4", "ab") as fout:
        ...
        video = ...
        fout.write(video)
Here we're doing a binary append. Repeated open / close is slightly less efficient, but it has the advantage of letting you do a run with four input files, then Python exits, then later you do a run with a pair of new files, and you'll still find the expected half a dozen files in the combined output.
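If you stick with moviepy specifically, here's a hedged sketch of the same idea: concatenate in small batches, write each batch out to disk, and close the clips as soon as they've been written, so only a handful of readers are open at any time. It assumes ffmpeg is available and that re-encoding intermediate files is acceptable; the helper name and batch size are mine, not from your code.

from moviepy.editor import VideoFileClip, concatenate_videoclips

def concat_in_batches(file_paths, batch_size=20):
    # hypothetical helper: keeps memory bounded by never holding more
    # than batch_size clips open at once
    part_files = []
    for i in range(0, len(file_paths), batch_size):
        clips = [VideoFileClip(p) for p in file_paths[i:i + batch_size]]
        part = concatenate_videoclips(clips)
        part_name = "part_{}.mp4".format(i // batch_size)
        part.write_videofile(part_name, fps=24)
        part_files.append(part_name)
        for clip in clips:
            clip.close()  # release the underlying readers
    # second pass over the much shorter list of batch outputs
    final = concatenate_videoclips([VideoFileClip(p) for p in part_files])
    final.write_videofile("output.mp4", fps=24)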
I have about 30 .wav files in a folder C:\Users\Maheswar.reddy\Desktop\NLP\wav_folder. I am trying to write code to read all the .wav files in the folder, but I couldn't do it. How can I read all the files at once, given the name of the folder?
I was able to read a single file by giving its path; now I want to read all the files at once.
It's unclear what you mean by "read all the files at once". Here's an example using a pathlib glob that you can extend to process the files sequentially:
from pathlib import Path
base_path = Path(r"C:\Users\Maheswar.reddy\Desktop\NLP\wav_folder")
for wav_file_path in base_path.glob("*.wav"):
    print(f"WAV File: {wav_file_path}")
    # do something, e.g. with open(wav_file_path) as wav_file:
If you want to process all of the files concurrently, you'll need to look at threading or multiprocessing.
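For example, here's a minimal sketch that hands each file to a thread pool; process_wav is a hypothetical placeholder for whatever per-file work you need to do:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

base_path = Path(r"C:\Users\Maheswar.reddy\Desktop\NLP\wav_folder")

def process_wav(wav_file_path):
    # hypothetical per-file work; replace with your actual reading/processing
    return wav_file_path.stat().st_size

# up to 4 files are processed concurrently; results come back in input order
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_wav, base_path.glob("*.wav")))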
I'm developing a mechanism that sends archives to an FTP server. It uses a PriorityQueue. The process of sending files is as follows:
Scan the ReadyToSend folder for .tar archives and put their names into the listOfArchives list.
For each archive in listOfArchives, determine the sending priority and put the archive's name with the corresponding priority into the SendingQueue.
Using shutil, move the archive to another directory so it isn't added to listOfArchives again in the next scan.
Is there any way to avoid an archive being added to listOfArchives (and thus to the SendingQueue) a second time, without moving the archive to another directory right after it has been put on the SendingQueue?
Here is some sample code:
def send_queue_builder(self):
    while True:
        listOfArchives = [f for f in os.listdir(self.readyToSendDir) if f.endswith('.tar')]
        for archive in listOfArchives:
            # code that determines the sending priority
            self.sendQueue.put((priority, archive))
            shutil.move(archive, newDirectory)
        time.sleep(20)

def ftp_sender(self):
    print('SENDER: Begin sending...')
    while True:
        if not self.sendQueue.empty():
            # get() returns the entry with the highest priority (lowest number) first
            priority, archive = self.sendQueue.get()
            # send archive using subprocess.call
        else:
            time.sleep(20)

a = threading.Thread(target=self.send_queue_builder)
b = threading.Thread(target=self.ftp_sender)
a.start()
b.start()
a.join()
b.join()
I'd be really thankful for any hint bringing me closer to the solution.
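One hedged sketch of an alternative, purely as a hint: instead of moving the file right away, the builder could remember which archives it has already queued in an in-memory set and skip them on later scans. The set name and the placeholder priority below are my own additions, not from your code:

import os
import time

def send_queue_builder(self):
    queued = set()  # hypothetical: names of archives already placed on the queue
    while True:
        for archive in os.listdir(self.readyToSendDir):
            if not archive.endswith('.tar') or archive in queued:
                continue
            priority = 1  # placeholder for the real priority logic
            self.sendQueue.put((priority, archive))
            queued.add(archive)
        time.sleep(20)

The archive can then be moved (or its entry dropped from the set) later, for example after ftp_sender has actually transferred it.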
I have a little script that does a few simple tasks. Running Python 3.7.
One of the tasks has to merge some files together which can be a little time consuming.
It loops through multiple directories, then each directory gets passed to the function. The function just loops through the files and merges them.
Instead of waiting for it to finish one directory before moving on to the next one, and so on, I'd like to utilize the horsepower/cores/threads to have the script merge the PDFs in multiple directories at once, which should shave some time.
I've got something like this:
if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    for directory in multi_directories:
        merge_pdfs(directory)
My merge PDF function looks like this:
def merge_pdfs(directory):
    root_dir = os.path.dirname(os.path.abspath(__file__))
    merged_dir_location = os.path.join(root_dir, 'merged')
    dir_title = directory.rsplit('/', 1)[-1]
    file_list = [file for file in os.listdir(directory)]

    merger = PdfFileMerger()
    for pdf in file_list:
        file_to_open = os.path.join(directory, pdf)
        merger.append(open(file_to_open, 'rb'))

    file_to_save = os.path.join(
        merged_dir_location,
        dir_title + "-merged.pdf"
    )
    with open(file_to_save, "wb") as fout:
        merger.write(fout)

    return True
This works great, but merge_pdfs runs slowly in some instances where there is a high number of PDFs in the directory.
Essentially, I want to be able to loop through multi_directories and create a new thread or process for each directory, merging the PDFs at the same time.
I've looked at asyncio, multithreading and a wealth of little snippets here and there but can't seem to get it to work.
You can do something like:
from multiprocessing import Pool

n_processes = 2

...

if multi_directories:
    if os.path.isdir('merged'):
        pass
    else:
        os.makedirs('merged')
    pool = Pool(n_processes)
    pool.map(merge_pdfs, multi_directories)
It should help if the bottleneck is CPU usage. But it may make things even worse if the bottleneck is the HDD, because reading several files in parallel from one physical HDD is usually slower than reading them consecutively. Try it with different values of n_processes.
BTW, to make a list from an iterable, use list(): file_list = list(os.listdir(directory)). And since listdir() already returns a list, you can just write file_list = os.listdir(directory).
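An equivalent sketch with concurrent.futures, in case you prefer that API; this is an alternative I'm adding, not part of the original answer, and it assumes merge_pdfs and multi_directories are defined as in the question:

import os
from concurrent.futures import ProcessPoolExecutor

if multi_directories:
    os.makedirs('merged', exist_ok=True)  # replaces the isdir/makedirs check
    # one task per directory, at most two worker processes running at a time
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = list(executor.map(merge_pdfs, multi_directories))

On Windows, keep this (and the Pool version above) under an if __name__ == '__main__': guard so the worker processes can be spawned cleanly.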
I have several .mat files (matlab) that I want to process with PySpark. But I'm not sure how to do it in parallel. Here's the basic single-threaded setup that I wish to parallelize. The code will generate a list of lists, where each inner list has arbitrary length:
filenames = ['1.mat','2.mat',...]
output_lists = [None]*len(filenames) # will be a list of lists
for i, filename in enumerate(filenames):
    output_lists[i] = analyze(filename)  # analyze is some function that returns a list
Any individual output_lists[i] can fit in memory, but the entire output_lists object cannot. I would like output_lists to be an rdd.
Any ideas? I am also open to using a combination of pyspark and the multiprocessing module. Thanks!
Put the files in a POSIX-compliant file system that can be accessed from every worker (NFS, MapR filesystem, Databricks filesystem, Ceph).
Convert the paths so they reflect the files' locations in that file system.
Parallelize the names:
rdd = sc.parallelize(filenames)
Map with your analysis function:
result = rdd.map(analyze)
Do whatever you want with the results.
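For example, since each analyze(filename) returns a list, flatMap gives you a single RDD of all the elements rather than an RDD of lists (assuming analyze itself can be pickled and run on the workers); the output path below is a placeholder:

rdd = sc.parallelize(filenames)
per_file = rdd.map(analyze)            # each element is one file's list
flat = rdd.flatMap(analyze)            # or: all elements from all lists in one RDD
flat.saveAsTextFile('path/to/output')  # hypothetical output location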
The other answer looks elegant, but I didn't want to install a new file system. I opted for analyzing the files in parallel with the joblib module, writing the results to .txt files, and opening the .txt files with Spark.
from joblib import Parallel, delayed

def analyze(filename):
    # write results to a text file named filename + '.txt'
    return

filenames = ['1.mat', '2.mat', ...]
Parallel(n_jobs=8)(delayed(analyze)(filename) for filename in filenames)
Then I use PySpark to read all the .txt files into one RDD:
data = sc.textFile('path/*.txt')
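If each line is, say, comma-separated numbers (an assumption on my part; adapt it to whatever format analyze actually writes), you can parse the lines back into lists on the Spark side:

parsed = data.map(lambda line: [float(x) for x in line.split(',')])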