I'm pretty new to Python (my native language is C#, and I find Python quite... messy) and I'm trying to process a large dataset. The work is mostly CPU-bound, so I'm using joblib's Parallel. My function looks like this:
def processData(datasegment):
    # read previous data from disk
    # format dataset to match previous data, and combine  <-- takes 80-90% of the time
    # write data back to disk
If I run my app such that each segment of data is processed sequentially, it runs fine but takes a couple of hours. During this time 19 of my 20 logical processors are idle, so I want to parallelize this. It's also cold outside, and this 10900k can help heat my house :p
If I try to just parallelize across the entire dataset, running 20 jobs, I run out of memory. So I've broken the dataset up into batches and am trying to do batched parallelization, using code like this:
def processBatch(batch):
    results = Parallel(n_jobs=20)(delayed(processData)(segment) for segment in batch)
    return results

def processData(datasegment):
    # read previous data from disk
    # format [dataset] to match previous data, and combine
    # write data back to disk
import math

batch_size = 500
steps = math.ceil(len(dataset) / batch_size)

results = []
for i in range(steps):
    start = i * batch_size
    end = min(start + batch_size, len(dataset))
    batch = dataset[start:end]
    results += processBatch(batch)
This works fine for small datasets, but when I run the full thing, something goes wrong in the multiprocessing partway through and it ends up running things sequentially. I still have 20 Python processes, but only 1 of them is doing work.
Can someone help me understand what's going on?
Thanks!
Context
While training a neural network, I noticed that the time spent per batch increased when I increased the size of my dataset (without changing the batch size). The important part: I need to fetch 20 .npy files per data point, and this number doesn't depend on the dataset size.
Problem
Training goes from 2s/iteration to 10s/iteration...
There is no apparent reason why training would take longer. However, I managed to track down the bottleneck. It seems to have to do with the loading of the .npy files.
To reproduce this behavior, here's a small script you can run to generate 10,000 dummy .npy files:
import os
import random
import numpy as np

def path(i):
    return os.sep.join(('../datasets/test', str(i)))

def create_dummy_files(N=10000):
    for i in range(N):
        x = np.random.random((100, 100))
        np.save(path(random.getrandbits(128)), x)
Then you can run the following two scripts and compare them yourself:
The first script where 20 .npy files are randomly selected and loaded:
L = os.listdir('../datasets/test')
S = random.sample(L, 20)

for s in S:
    np.load(path(s))  # <- timed this
The second version, where 20 'sequential' .npy files are selected and loaded:
L = os.listdir('../datasets/test')
i = 100
S = L[i: i + 20]

for s in S:
    np.load(path(s))  # <- timed this
I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice). I wrapped the np.load(path(s)) line with time.time() calls. I'm not timing the sampling, only the loading. Here are the results:
Random loads: times roughly stay between 0.1 s and 0.4 s, with an average of 0.25 s.
Non-random loads: times roughly stay between 0.010 s and 0.014 s, with an average of 0.01 s.
I assume those times are related to the CPU's activity while the scripts run, but that doesn't explain this gap. Why are these two results so different? Does it have something to do with the way files are indexed?
Edit: I printed S in the random-sample script, copied the resulting list of 20 filenames, then ran the script again with S defined literally as that list. The time it took is comparable to the 'sequential' script, so it's not about the files being non-sequential in the filesystem or anything like that. It seems the random sampling somehow gets counted by the timer, yet the timing is defined as:
t = time.time()
np.load(path(s))
print(time.time() - t)
I also tried wrapping np.load (and only np.load) with cProfile: same result.
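(For reference, wrapping a single call with cProfile looks roughly like this - just a sketch of the approach, not the exact code:)

import cProfile

profiler = cProfile.Profile()
profiler.enable()
np.load(path(s))  # profile only the load itself
profiler.disable()
profiler.print_stats(sort='cumulative')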
I did say:
I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice)
But as tevemadar mentioned
i should be randomized
I completely messed up the operation of selecting different files in the second version. My code was timing the scripts 100 times like so:
for i in trange(100):
    if rand:
        S = random.sample(L, 20)
    else:
        S = L[i: i + 20]  # <- every loop only 1 new file is added to the selection;
                          #    19 files will already have been cached by the previous fetch
For the second script, it should instead be S = L[100*i : 100*i + 20]!
And yes, when timing, the results are comparable.
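For reference, a self-contained version of the corrected benchmark could look like the sketch below (the time_loads helper is just for illustration; it times only the np.load calls and uses non-overlapping slices for the 'sequential' case):

import os
import random
import time

import numpy as np

data_dir = '../datasets/test'
L = os.listdir(data_dir)

def time_loads(S):
    # Average time of the np.load calls only; filename selection is not timed
    times = []
    for s in S:
        t = time.time()
        np.load(os.sep.join((data_dir, s)))
        times.append(time.time() - t)
    return sum(times) / len(times)

random_avgs, sequential_avgs = [], []
for i in range(100):
    random_avgs.append(time_loads(random.sample(L, 20)))
    # Non-overlapping slice, so no file benefits from the previous iteration's cache
    sequential_avgs.append(time_loads(L[100 * i : 100 * i + 20]))

print(sum(random_avgs) / len(random_avgs))
print(sum(sequential_avgs) / len(sequential_avgs))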
My script takes two movie files as an input, and writes a 2x1 array movie output (stereoscopic Side-by-Side Half-Width). The input video clips are of equal resolution (1280x720), frame rate (60), number of frames (23,899), format (mp4)...
When the write_videofile function starts processing, it provides a very reasonable estimated time of completion (~20 min). As it works through the frames, the process gets slower and slower (as indicated by the progress bar and the estimated completion time). In my case, the input movie clips are about 6 min long. After three minutes of processing, it indicates it will take over 3 hours to complete. After half an hour of processing, it indicates it will take over 24 hours to complete.
I have tried the 'threads' option of the write_videofile function, but it did not help.
Any idea? Thanks for the help.
---- Script ----
from moviepy.editor import VideoFileClip, clips_array

movie_L = 'movie_L.mp4'
movie_R = 'movie_R.mp4'
output_movie = 'new_movie.mp4'

clip_L = VideoFileClip(movie_L)
(width_L, height_L) = clip_L.size
clip_L = clip_L.resize((width_L / 2, height_L))

clip_R = VideoFileClip(movie_R)
(width_R, height_R) = clip_R.size
clip_R = clip_R.resize((width_R / 2, height_R))

print("*** Make an array of the two movies side by side")
arrayClip = clips_array([[clip_L, clip_R]])

print("*** Write the video file")
arrayClip.write_videofile(output_movie, threads=4, audio=False)
I realize that this is old, but for anyone still having this issue: be sure to add
progress_bar = False to your call, e.g.
arrayClip.write_videofile(output_movie, threads=4, audio=False, progress_bar=False)
Having the progress bar print out into IDLE each time it updates takes up a ton of memory, slowing down your program until it stops completely.
I have also had problems with slow rendering. I find that it helps a lot to use multithreading and also to set the bitrate.
This is my configuration:
videoclip.write_videofile("fractal.mp4", fps=20, threads=16, logger=None,
                          codec="mpeg4", preset="slow",
                          ffmpeg_params=['-b:v', '10000k'])
This works very well even with preset set to slow. That setting gives better quality for the same number of bits, and if that is not a concern you could set it to medium or fast to gain some more speed.
My pyROOT analysis code is using huge amounts of memory. I have reduced the problem to the example code below:
from ROOT import TChain, TH1D

# Load file, chain
chain = TChain("someChain")
inFile = "someFile.root"
chain.Add(inFile)
nentries = chain.GetEntries()

# Declare histograms
h_nTracks = TH1D("h_nTracks", "h_nTracks", 16, -0.5, 15.5)
h_E = TH1D("h_E", "h_E", 100, -0.1, 6.0)
h_p = TH1D("h_p", "h_p", 100, -0.1, 6.0)
h_ECLEnergy = TH1D("h_ECLEnergy", "h_ECLEnergy", 100, -0.1, 14.0)

# Loop over entries
for jentry in range(nentries):
    # Load entry
    entry = chain.GetEntry(jentry)

    # Define variables
    cands = chain.__ncandidates__
    nTracks = chain.nTracks
    E = chain.useCMSFrame__boE__bc
    p = chain.useCMSFrame__bop__bc
    ECLEnergy = chain.useCMSFrame__boECLEnergy__bc

    # Fill histos
    h_nTracks.Fill(nTracks)
    h_ECLEnergy.Fill(ECLEnergy)
    for cand in range(cands):
        h_E.Fill(E[cand])
        h_p.Fill(p[cand])
where someFile.root is a root file with 700,000 entries and multiple particle candidates per entry.
When I run this script it uses ~600 MB of memory. If I remove the line
h_p.Fill(p[cand])
it uses ~400 MB.
If I also remove the line
h_E.Fill(E[cand])
it uses ~150 MB.
If I also remove the lines
h_nTracks.Fill(nTracks)
h_ECLEnergy.Fill(ECLEnergy)
there is no further reduction in memory usage.
It seems that for every extra histogram that I fill of the form
h_variable.Fill(variable[cand])
(i.e. histograms that are filled once per candidate per entry, as opposed to histograms that are just filled once per entry) I use an extra ~200 MB of memory. This becomes a serious problem when I have 10 or more histograms because I am using GBs of memory and I am exceeding the limits of my computing system. Does anybody have a solution?
Update: I think this is a python3 problem.
If I take the script in my original post (above) and run it using python2 the memory usage is ~200 MB, compared to ~600 MB with python3. Even if I try to replicate Problem 2 by using the long variable names, the job still only uses ~200 MB of memory with python2, compared to ~1.3 GB with python3.
During my Googling I came across a few other accounts of people encountering memory leaks when using pyROOT with python3. It seems this is still an issue as of Python 3.6.2 and ROOT 6.08/06, and that for the moment you must use python2 if you want to use pyROOT.
So, using python2 appears to be my "solution" for now, but it's not ideal. If anybody has any further information or suggestions I'd be grateful to hear from you!
I'm glad you figured out Python3 was the problem. But if you (or anyone) continues to have memory usage issues when working with histograms in the future, here are some potential solutions I hope you'll find helpful!
THnSparse
Use THnSparse. THnSparse is an efficient multidimensional histogram that shows its strengths when only a small fraction of the total bins are filled. You can read more about it in the ROOT documentation.
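A minimal pyROOT sketch of how that could look (the two axes and their ranges here are made up for illustration; adapt them to your own variables):

import ROOT
from array import array

# Hypothetical 2-D sparse histogram over (E, p): only bins that actually get
# filled allocate memory, unlike a dense TH2
nbins = array('i', [100, 100])
xmin = array('d', [-0.1, -0.1])
xmax = array('d', [6.0, 6.0])
h_sparse = ROOT.THnSparseD("h_E_p", "E vs p", 2, nbins, xmin, xmax)

# Fill once per candidate with the (E, p) pair
h_sparse.Fill(array('d', [1.2, 0.8]))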
TTree
TTrees are data structures in ROOT that are, quite frankly, glorified tables. However, they are highly optimized. A TTree is composed of branches and leaves that hold the data, and through ROOT they can be accessed speedily and efficiently. If you put your data into a TTree first, and then read it into a histogram, I guarantee you will find lower memory usage and shorter run times.
Here is some example TTree code.
import ROOT

root_file_path = "../hadd_www.root"
muon_ps = ROOT.TFile(root_file_path)
muon_ps_tree = muon_ps.Get("WWWNtuple")
muon_ps_branches = muon_ps_tree.GetListOfBranches()

canv = ROOT.TCanvas()
num_of_events = 5000
ttvhist = ROOT.TH1F('Statistics2',
                    'Jet eta for ttV (aqua) vs WWW (white); Pseudorapidity',
                    100, -3, 3)

i = 0
muon_ps_tree.GetEntry(i)
print(len(muon_ps_tree.jet_eta))
# sys.exit()

while muon_ps_tree.GetEntry(i):
    if i > num_of_events:
        break
    for k in range(0, len(muon_ps_tree.jet_eta) - 1):
        ttvhist.Fill(float(muon_ps_tree.jet_eta[k]), 1)
    i += 1

ttvhist.Write()
ttvhist.Draw("hist")
ttvhist.SetFillColor(70)
And here's a resource where you can learn about how fantastic TTrees are:
TTree ROOT documentation
For more reading, here is a discussion on speeding up ROOT histogram building on the CERN help forum:
Memory-conservative histograms for usage in DQ monitoring
Best of luck with your data analysis, and happy coding!
I have a simple set of code that runs Clustal Omega (a protein multiple sequence alignment program) from Python:
from Bio.Align.Applications import ClustalOmegaCommandline

segments = range(1, 9)
segments.reverse()

for segment in segments:
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment

    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()
I've done some informal tests timing how long my multiple sequence alignments (MSAs) take. On average, each one takes 4 hours, so running all 8 one after another took me 32 hours in total. Hence my original intent in running it as a for loop: I could let it run and not worry about it.
However, I did yet another informal test - I took the output from the printed cline, and copied-and-pasted it into 8 separate terminal windows spread across two computers, and ran the MSAs that way. On average, each one took about 8 hours or so... but because they were all running at the same time, it took me only 8 hours to get the results.
In some ways, I've discovered parallel processing! :D
But I'm now faced with the dilemma of how to get it running in Python. I've tried looking at the following SO posts, but I still cannot seem to wrap my head around how the multiprocessing module works.
List of posts:
How do I parallelize a simple Python loop?
Perform a for-loop in parallel in Python 3.2
Parallel loop in python
how to parallelize big for loops in python
Would anybody be kind enough to share how they would parallelize this loop? Many loops I do look similar to this loop, in which I perform some action on a file and write to another file, without ever needing to aggregate the results in memory. The specific difference I am facing is the need to do file I/O, rather than aggregate results from parallel runs of the loop.
Possibly the Joblib library is what you are looking for.
Let me give you an example of its use:
import time
from joblib import Parallel, delayed

def long_function():
    time.sleep(1)

REPETITIONS = 4
Parallel(n_jobs=REPETITIONS)(
    delayed(long_function)() for _ in range(REPETITIONS))
This code runs in 1 second, instead of 4 seconds.
Adapting your code looks like this (sorry, I can't test if this is correct):
from joblib import Parallel, delayed
from Bio.Align.Applications import ClustalOmegaCommandline

def run(segment):
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()

if __name__ == "__main__":
    segments = range(1, 9)
    segments.reverse()
    Parallel(n_jobs=len(segments))(
        delayed(run)(segment) for segment in segments)
Instead of for segment in segments, write def f(segment) and then use multiprocessing.Pool().map(f, segments)
Figuring out how to put this in context is left as an exercise to the reader.
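For concreteness, a minimal sketch of that approach (untested; it reuses the run(segment) function defined in the joblib answer above):

from multiprocessing import Pool

if __name__ == "__main__":
    segments = range(1, 9)

    # One worker process per segment; map() blocks until every alignment has finished
    pool = Pool(processes=len(segments))
    pool.map(run, segments)
    pool.close()
    pool.join()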
I'm looping a histogram operation over HDF5 files of ~800 MB each (all the same size).
The result of each histogram is stored in a text file of roughly 5 columns x 30 lines.
t0 = time.time()
for f in filelist:
    d = h5py.File(f, 'r')
    result = make_histogram(d['X'].value)
    ascii_write(result)
    print time.time() - t0
    d.close()
One pass through the loop normally seems to take ~6-7 seconds for each file.
However, at some point a single pass through the loop takes significantly longer.
This point seems to occur at random if I run the script multiple times with a different file processed first.
I noticed in my system monitor that, at this point, the CPU is in "disk sleep".
How can I fix this?
It seems to be related to this question, but I could not find a definitive answer.