Parallelize for-loop in Python

I have a simple set of code that runs Clustal Omega (a protein multiple sequence alignment program) from Python:
from Bio.Align.Applications import ClustalOmegaCommandline
segments = range(1, 9)
segments.reverse()
for segment in segments:
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()
I've done some informal tests timing how long my multiple sequence alignments (MSAs) take. On average, each one takes 4 hours, so running all 8 one after another took me 32 hours in total. That was my original intent in running it as a for loop: I could let it run and not worry about it.
However, I did yet another informal test: I took the output from the printed cline and copied and pasted it into 8 separate terminal windows spread across two computers, and ran the MSAs that way. On average, each one took about 8 hours or so... but because they were all running at the same time, it took only 8 hours to get all the results.
In some ways, I've discovered parallel processing! :D
But I'm now faced with the dilemma of how to get it running in Python. I've tried looking at the following SO posts, but I still cannot seem to wrap my head around how the multiprocessing module works.
List of posts:
How do I parallelize a simple Python loop?
Perform a for-loop in parallel in Python 3.2
Parallel loop in python
how to parallelize big for loops in python
Would anybody be kind enough to share how they would parallelize this loop? Many loops I do look similar to this loop, in which I perform some action on a file and write to another file, without ever needing to aggregate the results in memory. The specific difference I am facing is the need to do file I/O, rather than aggregate results from parallel runs of the loop.

Possibly the Joblib library is what you are looking for.
Let me give you an example of its use:
import time
from joblib import Parallel, delayed

def long_function():
    time.sleep(1)

REPETITIONS = 4
Parallel(n_jobs=REPETITIONS)(
    delayed(long_function)() for _ in range(REPETITIONS))
This code runs in 1 second, instead of 4 seconds.
Adapting your code looks like this (sorry, I can't test if this is correct):
from joblib import Parallel, delayed
from Bio.Align.Applications import ClustalOmegaCommandline
def run(segment):
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    print cline
    cline()

if __name__ == "__main__":
    segments = range(1, 9)
    segments.reverse()
    Parallel(n_jobs=len(segments))(
        delayed(run)(segment) for segment in segments)

Instead of for segment in segments, write def f(segment) and then use multiprocessing.Pool().map(f, segments)
Figuring out how to put this in context is left as an exercise to the reader.
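For illustration, a minimal sketch of that multiprocessing version (untested, and reusing the run() helper from the joblib answer above) could look like this:

import multiprocessing
from Bio.Align.Applications import ClustalOmegaCommandline

def run(segment):
    # build and run the Clustal Omega command line for one segment,
    # exactly as in the original loop body
    in_file = '1.0 - Split FASTA Files/Segment %d.fasta' % segment
    out_file = '1.1 - Aligned FASTA Files/Segment %d Aligned.fasta' % segment
    distmat = '1.1 - Distmats/Segment %d Distmat.fasta' % segment
    cline = ClustalOmegaCommandline(infile=in_file,
                                    outfile=out_file,
                                    distmat_out=distmat,
                                    distmat_full=True,
                                    verbose=True,
                                    force=True)
    cline()

if __name__ == "__main__":
    segments = range(1, 9)
    pool = multiprocessing.Pool(processes=len(segments))
    pool.map(run, segments)  # blocks until all 8 alignments have finished
    pool.close()
    pool.join()

The __main__ guard matters here: multiprocessing starts worker processes that re-import the module, so the Pool must only be created in the main process.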

Related

Python odd behavior with joblib Parallel

I'm pretty new to python (my native language is C#, and I find python quite... messy) and I'm trying to process a large dataset. The work is mostly CPU bound, so I'm using joblib Parallel. My function is like so:
def processData(datasegment):
    # read previous data from disk
    # format dataset to match previous data, and combine  <-- takes 80-90% of the time
    # write data back to disk
    ...
If I run my app such that each segment of data is processed sequentially, it runs fine but takes a couple of hours. During this time 19 of my 20 logical processors are idle, so I want to parallelize this. It's also cold outside, and this 10900k can help heat my house :p
If I try to just parallelize the entire dataset, running 20 jobs, I run out of memory. So I've broken the dataset up into batches and am trying to do batched parallelization, using code like this:
import math
from joblib import Parallel, delayed

def processBatch(batch):
    results = Parallel(n_jobs=20)(delayed(processData)(segment) for segment in batch)
    return results

def processData(datasegment):
    # read previous data from disk
    # format [dataset] to match previous data, and combine
    # write data back to disk
    ...

batch_size = 500
steps = math.ceil(len(dataset) / batch_size)
results = []
for i in range(steps):
    start = i * batch_size
    end = min(start + batch_size, len(dataset))
    batch = dataset[start:end]
    results += processBatch(batch)
This works fine for small datasets, but when I try to run the full thing something goes wrong in the multiprocessing part way through, and it ends up running things sequentially. I still have 20 python processes, but only 1 of them is doing work.
Can someone help me understand what's going on?
Thanks!
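One pattern that may be worth trying (a sketch only, not a verified fix for this particular stall) is to skip the manual outer batching and instead hand joblib a single generator over the whole dataset while capping pre_dispatch, so only a limited number of tasks are queued in memory at any time:

from joblib import Parallel, delayed

# Sketch: joblib consumes the generator lazily; pre_dispatch limits how many
# tasks are dispatched ahead of the workers, which bounds memory use.
# `dataset` and `processData` are the objects from the question above.
results = Parallel(n_jobs=20, pre_dispatch='2 * n_jobs', verbose=5)(
    delayed(processData)(segment) for segment in dataset)

Whether that also avoids the fall-back to a single busy worker would need testing on the real data.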

How can I time prints consistently

I'm training my Python abilities by making a bunch of generally useless code, and today I was attempting to print Bad Apple in the console using ASCII art, as one does. I did everything just fine until I had to time the prints so that they end in 3 minutes and 52 seconds while maintaining a consistent framerate. I tried just adding a time.sleep() in between prints, hoping it would all just magically work, but obviously it didn't.
I customized a version of this Git repo https://github.com/aypro-droid/image-to-ascii to transform frames into ASCII art, and used https://pypi.org/project/opencv-python/ to transform the video into frames.
here is my code:
import time

frames = {}
# saving each .txt frame in a dict
for i in range(6955):
    f = open("Frames-to-Ascii/output/output{0}.txt".format(i), "rt")
    frames['t{0}'.format(i)] = f.read()
    f.close()

# start "trigger"
ini = input('start(type anything): ')
start = time.time()

# printing the 6955 frames from the dict
for x in range(6955):
    print(frames['t{0}'.format(x)])
    # my attempt at timing
    time.sleep(0.015)

end = time.time()
# calculating how much time the prints took overall; it should be about
# 211.2 seconds, evenly distributed across all the "frames"
print(end - start)
I'm attempting to time the prints exactly to the video so I can later use this somewhere else. Any tips?
What I understand is that you need to print the frames at a given constant rate?
If yes, then you need to measure the time spent printing and then sleep for the delay minus that time. Something like:
for x in range(6955):
    start = time.time()
    print("hips")
    end = time.time()
    time.sleep(0.5 - (end - start))
Thus each pass through the loop will take (approximately) 0.5 s to run. (Change the value according to your needs.)
Of course, if a single print takes more time than the delay, you need to find another strategy: for example, stepping over the next frame, etc.
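To make that concrete, here is a small sketch (under the assumption that occasionally skipping a frame is acceptable) that paces the prints against an absolute schedule, so slow prints don't accumulate into drift and time.sleep() is never given a negative value:

import time

FRAME_COUNT = 6955
TOTAL_SECONDS = 3 * 60 + 52              # 232 s for the whole video
FRAME_DURATION = TOTAL_SECONDS / FRAME_COUNT

start = time.time()
for x in range(FRAME_COUNT):
    target = start + x * FRAME_DURATION  # when this frame should appear
    delay = target - time.time()
    if delay > 0:
        time.sleep(delay)                # early: wait for this frame's slot
    elif delay < -FRAME_DURATION:
        continue                         # more than a frame late: skip it
    print(frames['t{0}'.format(x)])      # frames dict from the question

Because the target time is computed from the loop index rather than from the previous frame, the whole run still finishes close to the intended 3 minutes 52 seconds.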

Randomly selected files take longer to load with numpy.load than sequential ones

Context
While training a neural network, I realized the time spent per batch increased when I increased the size of my dataset (without changing the batch size). The important part is that I need to fetch 20 .npy files per data point, and this number doesn't depend on the dataset size.
Problem
Training goes from 2s/iteration to 10s/iteration...
There is no apparent reason why training would take longer. However, I managed to track down the bottleneck. It seems to have to do with the loading of the .npy files.
To reproduce this behavior, here's a small script you can run to generate 10,000 dummy .npy files:
import os
import random
import numpy as np

def path(i):
    return os.sep.join(('../datasets/test', str(i)))

def create_dummy_files(N=10000):
    for i in range(N):
        x = np.random.random((100, 100))
        np.save(path(random.getrandbits(128)), x)
Then you can run the following two scripts and compare them yourself:
The first script where 20 .npy files are randomly selected and loaded:
L = os.listdir('../datasets/test')
S = random.sample(L, 20)
for s in S:
    np.load(path(s))  # <- timed this
The second version, where 20 'sequential' .npy files are selected and loaded:
L = os.listdir('../datasets/test')
i = 100
S = L[i: i + 20]
for s in S:
    np.load(path(s))  # <- timed this
I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice). I wrapped the np.load(path(s)) line with time.time() calls. I'm not timing the sampling, only the loading. Here are the results:
Random loads: times roughly stay between 0.1 s and 0.4 s, with an average of 0.25 s.
Non-random loads: times roughly stay between 0.010 s and 0.014 s, with an average of 0.01 s.
I'm assuming those times are related to the CPU's activity when the scripts are loaded. However, that doesn't explain this gap. Why are these two results so different? Does it have something to do with the way files are indexed?
Edit: I printed S in the random-sample script, copied the list of 20 filenames, then ran it again with S defined literally as that list. The time it took is comparable to the 'sequential' script. This means it's not related to the files not being sequential in the filesystem or anything like that. It seemed the random sampling was being counted by the timer, yet the timing is defined as:
t = time.time()
np.load(path(s))
print(time.time() - t)
I also tried wrapping np.load (exclusively) with cProfile: same result.
I did say:
I tested both scripts and ran them 100 times each (in the 2nd script I used the iteration count as the value for i so the same files are not loaded twice)
But as tevemadar mentioned
i should be randomized
I completely messed up the operation of selecting different files in the second version. My code was timing the scripts 100 times like so:
for i in trange(100):
    if rand:
        S = random.sample(L, 20)
    else:
        S = L[i: i + 20]  # <- every loop only 1 new file is added to the selection;
                          #    19 files will already have been cached by the previous fetch
For the second script, it should rather be S = L[100*i : 100*i + 20]!
And yes, when timing, the results are comparable.
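For completeness, a corrected version of the benchmark loop (a sketch; trange comes from tqdm, and L, path, and rand are as defined above) that selects 20 fresh, previously untouched files per iteration in both modes:

import time
import random
import numpy as np
from tqdm import trange

for i in trange(100):
    if rand:
        S = random.sample(L, 20)
    else:
        S = L[100 * i : 100 * i + 20]  # 20 files that were not loaded before
    for s in S:
        t = time.time()
        np.load(path(s))               # only the load itself is timed
        print(time.time() - t)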

Python file I/O loop mystery (CPU goes to disk sleep)

I'm looping a histogram operation over HDF5 files of ~800 MB each (all the same size).
The result of each histogram is stored in a text file of roughly 5 columns x 30 lines.
t0 = time.time()
for f in filelist:
    d = h5py.File(f, 'r')
    result = make_histogram(d['X'].value)
    ascii_write(result)
    print time.time() - t0
    d.close()
One pass through the loop normally takes ~6-7 seconds per file.
However, at some point it takes significantly longer to get through one iteration.
This slowdown seems to start at a random point if I run the script multiple times with different files first.
I noticed in my system monitor that, at this point, the CPU is in "disk sleep".
How can I fix this?
It seems to be related to this question, but I could not find a definitive answer.

How to organize a Python GIS-project with multiple analysis steps?

I just started to use ArcPy to analyse geo-data with ArcGIS. The analysis has different steps, which are to be executed one after the other.
Here is some pseudo-code:
import arcpy
# create a masking variable
mask1 = "mask.shp"
# create a list of raster files
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
# step 1 (e.g. clipping of each raster to study extent)
for index, item in enumerate(files_to_process):
    raster_i = "temp/ras_tem_" + str(index) + ".tif"
    arcpy.Clip_management(item, '#', raster_i, mask1)
# step 2 (e.g. change projection of raster files)
...
# step 3 (e.g. calculate some statistics for each raster)
...
etc.
This code works amazingly well so far. However, the raster files are big and some steps take quite long to execute (5-60 minutes). Therefore, I would like to execute those steps only if the input raster data changes. From the GIS-workflow point of view, this shouldn't be a problem, because each step saves a physical result on the hard disk which is then used as input by the next step.
I guess if I want to temporarily disable e.g. step 1, I could simply put a # in front of every line of this step. However, in the real analysis, each step might have a lot of lines of code, and I would therefore prefer to outsource the code of each step into a separate file (e.g. "step1.py", "step2.py",...), and then execute each file.
I experimented with execfile("step1.py"), but received the error NameError: global name 'files_to_process' is not defined. It seems that the variables defined in the main script are not automatically passed to scripts called by execfile.
I also tried this, but I received the same error as above.
I'm a total Python newbie (as you might have figured out by the misuse of any Python-related expressions), and I would be very thankful for any advice on how to organize such a GIS project.
I think what you want to do is build each step into a function. These functions can be stored in the same script file or in their own module that gets loaded with the import statement (just like arcpy). The pseudo code would be something like this:
# file 1: steps.py

def step1(input_files):
    # step 1 code goes here
    print 'step 1 complete'
    return

def step2(input_files):
    # step 2 code goes here
    print 'step 2 complete'
    return output  # optionally return a derivative here

# ...and so on
Then in a second file in the same directory, you can import and call the functions passing the rasters as your inputs.
#file 2: analyze.py
import steps
files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]
steps.step1(files_to_process)
#steps.step2(files_to_process) # uncomment this when you're ready for step 2
Now you can selectively call different steps of your code, and it only requires commenting out or excluding one line instead of a whole chunk of code. Hopefully I understood your question correctly.
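Building on that, if you later want a step to run only when its input data has actually changed (as mentioned in the question), one simple pattern, sketched here with a hypothetical is_up_to_date helper and the clipped-raster path from the question, is to compare file modification times before calling the step:

# file 2: analyze.py (extended sketch)
import os
import steps

def is_up_to_date(out_path, in_paths):
    # re-run a step only if its output is missing or older than any input
    if not os.path.exists(out_path):
        return False
    out_time = os.path.getmtime(out_path)
    return all(os.path.getmtime(p) <= out_time for p in in_paths)

files_to_process = ["raster1.tif", "raster2.tif", "raster3.tif"]

if not is_up_to_date("temp/ras_tem_0.tif", files_to_process):
    steps.step1(files_to_process)

Make-style build tools do essentially the same mtime comparison, so this approach scales reasonably well as more steps are added.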
