I am running the code below in Python. It is an example of how to use multiprocessing. However, it does not print the result (line 16); only the dataset (line 9) is printed, and the kernel keeps running. I expected the result to appear almost immediately, since the code spreads the work across several CPU cores via multiprocessing. Can someone tell me what the issue is?
from multiprocessing import Pool

def square(x):
    # calculate the square of the value of x
    return x*x

if __name__ == '__main__':
    # Define the dataset
    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

    # Output the dataset
    print('Dataset: ' + str(dataset))  # Line 9

    # Run this with a pool of 5 agents having a chunksize of 3 until finished
    agents = 5
    chunksize = 3
    with Pool(processes=agents) as pool:
        result = pool.map(square, dataset, chunksize)

    # Output the result
    print('Result: ' + str(result))  # Line 16
You are running under Windows. Look at the console output where you started up Jupyter. You will see 5 instances (you have 5 processes in your pool) of:
AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
You need to put your worker function in a file, for instance workers.py, in the same directory as your Jupyter code:

workers.py

def square(x):
    # calculate the square of the value of x
    return x*x

Then remove the above function from your cell and instead add:

import workers

and then:

with Pool(processes=agents) as pool:
    result = pool.map(workers.square, dataset, chunksize)
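Putting it together, the notebook cell would then look something like this (a sketch, assuming workers.py sits in the notebook's working directory):

import workers  # contains square()
from multiprocessing import Pool

if __name__ == '__main__':
    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    print('Dataset: ' + str(dataset))

    agents = 5
    chunksize = 3
    with Pool(processes=agents) as pool:
        # The workers import workers.py by name, so they can find square()
        result = pool.map(workers.square, dataset, chunksize)

    print('Result: ' + str(result))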
Note
See my comment(s) to your post concerning the chunksize argument you are passing to pool.map. The default chunksize for the map method when None is calculated more or less as follows based on the iterable argument:
if not hasattr(iterable, '__len__'):
    iterable = list(iterable)

if chunksize is None:
    chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
    if extra:
        chunksize += 1
So I misspoke in one detail: regardless of whether you specify a chunksize or not, map will convert your iterable to a list if it needs to in order to get its length. But essentially the default chunksize used is the ceiling of the number of jobs being submitted divided by 4 times the pool size.
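For example, plugging in the 14-element dataset and 5-process pool from the question, the default chunksize works out to 1 (a quick check, not part of the original code):

from math import ceil

dataset_len = 14  # len(dataset) in the question
pool_size = 5     # processes=agents

# ceiling of 14 / (5 * 4) = ceiling of 0.7 = 1
default_chunksize = ceil(dataset_len / (pool_size * 4))
print(default_chunksize)  # -> 1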
There is a known bug in older versions of Python on Windows 10: https://bugs.python.org/issue35797
It occurs when using multiprocessing through venv.
The bugfix was released in Python 3.7.3.
Related
I am trying to run a job on an HPC using multiprocessing. Each process has a peak memory usage of ~44GB. The job class I can use allows 1-16 nodes to be used, each with 32 CPUs and a memory of 124GB. Therefore if I want to run the code as quickly as possible (and within the max walltime limit) I should be able to run 2 CPUs on each node up to a maximum of 32 across all 16 nodes. However, when I specify mp.Pool(32) the job quickly exceeds the memory limit, I assume because more than two CPUs were used on a node.
My natural instinct was to specify a maximum of 2 CPUs in the PBS script I launch my Python script from, but this configuration is not permitted on the system. I would really appreciate any insight; I have been scratching my head on this one for most of today, and have faced and worked around similar problems in the past without addressing the fundamentals at play here.
Simplified versions of both scripts below:
#!/bin/sh
#PBS -l select=16:ncpus=32:mem=124gb
#PBS -l walltime=24:00:00
module load anaconda3/personal
source activate py_env
python directory/script.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import numpy as np
import pandas as pd
import multiprocessing as mp

def df_function(df, arr1, arr2):
    df['col3'] = some_algorithm(df, arr1, arr2)
    return df

def parallelize_dataframe(df, func, num_cores):
    df_split = np.array_split(df, num_cores)
    with mp.Pool(num_cores, maxtasksperchild=10 ** 3) as pool:
        df = pd.concat(pool.map(func, df_split))
    return df

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = pd.read_csv(direc + file)
    a = np.load(direc + a_file)
    b = np.load(direc + b_file)

    # Globally defining function with keyword defaults
    global f
    def f(df):
        return df_function(df, arr1=a, arr2=b)

    num_cores = 32  # i.e. 2 per node if evenly distributed.

    # Running the function as a multiprocess:
    df = parallelize_dataframe(df, f, num_cores)

    # Saving:
    df.to_csv(direc + 'outfile.csv', index=False)

if __name__ == '__main__':
    main()
To run your job as-is, you could simply request ncpus=32 and then in your python script set num_cores = 2. Obviously this has you paying for 32 cores and then leaving 30 of them idle, which is wasteful.
The real problem here is that your current algorithm is memory-bound, not CPU-bound. You should be going to great lengths to read only chunks of your files into memory, operating on the chunks, and then writing the result chunks to disk to be organized later.
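A rough sketch of that chunked pattern with plain pandas, before reaching for Dask (some_algorithm is the placeholder from your script, and the chunk size is an illustrative knob):

import numpy as np
import pandas as pd

direc = '/home/dir1/dir2/'
a = np.load(direc + 'array_a.npy')
b = np.load(direc + 'array_b.npy')

# Stream the CSV in row chunks so only one chunk is in memory at a time,
# appending each processed chunk to the output file as it is finished.
first = True
for chunk in pd.read_csv(direc + 'input_data.csv', chunksize=100000):
    chunk['col3'] = some_algorithm(chunk, a, b)  # some_algorithm as in your script
    chunk.to_csv(direc + 'outfile.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False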
Fortunately Dask is built to do exactly this kind of thing. As a first step, you can take out the parallelize_dataframe function and directly load and map your some_algorithm with a dask.dataframe and dask.array:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import dask.dataframe as dd
import dask.array as da

def main():
    # Loading input data
    direc = '/home/dir1/dir2/'
    file = 'input_data.csv'
    a_file = 'array_a.npy'
    b_file = 'array_b.npy'
    df = dd.read_csv(direc + file, blocksize=25e6)
    a_and_b = da.from_np_stack(direc)

    df['col3'] = df.apply(some_algorithm, args=(a_and_b,))

    # dask is lazy, this is the only line that does any work
    # Saving:
    df.to_csv(
        direc + 'outfile.csv',
        index=False,
        compute_kwargs={"scheduler": "threads"},  # also "processes", but try threads first
    )

if __name__ == '__main__':
    main()
That will require some tweaks to some_algorithm, and to_csv and from_np_stack work a bit differently, but you will be able to reasonably run this thing just on your own laptop and it will scale to your cluster hardware. You can level up from here by using the distributed scheduler or even deploy it directly to your cluster with dask-jobqueue.
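If you do go the dask-jobqueue route, the setup might look roughly like this (a sketch: the queue name is a placeholder, and the sizing values simply mirror the PBS script above):

from dask.distributed import Client
from dask_jobqueue import PBSCluster

# Each Dask worker job asks for 2 cores and one node's worth of memory.
cluster = PBSCluster(cores=2, memory='124GB',
                     walltime='24:00:00', queue='your_queue')
cluster.scale(jobs=16)  # one PBS job per node
client = Client(cluster)

# Dask collections computed after this point run on the cluster workers
# by default (drop the explicit scheduler kwarg in to_csv).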
I am trying to understand parallelization in Python better. I'm having trouble implementing a parallelized Fibonacci algorithm.
I've been reading the docs on multiprocessing and threading, but haven't had luck.
# TODO Figure out how threads work
# TODO Do a Fibonacci counter

import concurrent.futures

def fib(pos, _tpe):
    """Return the Fibonacci number at position."""
    if pos < 2:
        return pos
    x = fib(pos - 1, None)
    y = fib(pos - 2, None)
    return x + y

def fibp(pos, tpe):
    """Return the Fibonacci number at position."""
    if pos < 2:
        return pos
    x = tpe.submit(fib, (pos - 1), tpe).result()
    y = tpe.submit(fib, (pos - 2), tpe).result()
    return x + y

if __name__ == '__main__':
    import sys
    with concurrent.futures.ThreadPoolExecutor() as cftpe:
        fun = fibp if len(sys.argv) == 3 else fib
        position = int(sys.argv[1])
        print(fun(position, cftpe))
        print(fun.__name__)
Command line output:
$ time python test_fib_parallel.py 35
9227465
fib
real 0m3.778s
user 0m3.746s
sys 0m0.017s
$ time python test_fib_parallel.py 35 dummy-var
9227465
fibp
real 0m3.776s
user 0m3.749s
sys 0m0.018s
I thought the timing should be significantly different, but it is not.
UPDATE
Okay, so the tips helped and I got it to work (much faster than the original version). Ignoring the toy code in the original post, I'm putting a snippet of the actual problem I was trying to solve, which was to grow a decision tree (for homework). This was a CPU bound problem. It was solved using a ProcessPoolExecutor and a Manager.
def induce_tree(data, splitter, out=stdout, print_bool=True):
    """Greedily induce a decision tree from the given training data."""
    from multiprocessing import Manager
    from concurrent.futures import ProcessPoolExecutor

    output_root = Node(depth=1)
    input_root = Node(depth=1, data=data)
    proxy_stack = Manager().list([(input_root, output_root)])
    pool_exec = ProcessPoolExecutor()

    while proxy_stack:
        # while proxy_stack:
        #     pass proxy stack to function
        #     return data tree
        current = proxy_stack.pop()
        if current[0] is not ():
            pool_exec.submit(grow_tree, proxy_stack, current, splitter)

    finish_output_tree(input_root, output_root)
    return output_root
UPDATE AGAIN
This is still buggy because I'm not managing the splitter correctly between processes. I'm going to try something else.
LAST UPDATE
I was trying to parallelize at too high of a level in my code. I had to drill down to the part in my splitter object where the bottleneck was. The parallel version is about twice as fast. (The error is high because this is just a test sample without enough data to actually train a good tree.)
The code ended up looking like this
inputs = [(data, idx, depth) for idx in range(data.shape[1] - 1)]
if self._splittable(data, depth):
    with ProcessPoolExecutor() as executor:
        for output in executor.map(_best_split_help, inputs):
            # ... ...
SEQUENTIAL
$ time python main.py data/spambase/spam50.data
Test Fold: 1, Error: 0.166667
Test Fold: 2, Error: 0.583333
Test Fold: 3, Error: 0.583333
Test Fold: 4, Error: 0.230769
Average training error: 0.391026
real 0m29.178s
user 0m28.924s
sys 0m0.106s
PARALLEL
$ time python main.py data/spambase/spam50.data
Test Fold: 1, Error: 0.166667
Test Fold: 2, Error: 0.583333
Test Fold: 3, Error: 0.583333
Test Fold: 4, Error: 0.384615
Average training error: 0.429487
real 0m14.419s
user 0m50.748s
sys 0m1.396s
I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import multiprocessing as mp
import pandas as pd
from glob import glob

def processor(fi):
    df = pd.read_table(fi)
    ...do some processing to the df....
    filename = fi.split('/')[-1][:-4]
    df.to_pickle('{}.pkl'.format(filename))

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    pool = mp.Pool(8)
    for _ in pool.imap_unordered(processor, files):
        pass
Now, this actually works totally fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (but doesn't throw any errors).
Suggestions?
You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs, but return immediately". If you don't wait and there is nothing after that call, your program will exit and all subprocesses will be killed immediately, before they complete - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep

def my_func(val):
    print('Executing %s' % val)
    sleep(0.5)
    print('Done %s' % val)

pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, using plain map would also do the trick, and then you don't have to wait at all: it is not "asynchronous", so it blocks until all jobs are complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)
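And if you would rather keep the imap_unordered from your question but lose the explicit loop, one option is to drain the iterator in a single expression (a sketch reusing processor and files from your code):

import collections
import multiprocessing as mp

if __name__ == '__main__':
    pool = mp.Pool(8)
    # deque(maxlen=0) consumes the iterator and discards the (None) results,
    # so there is no visible loop body.
    collections.deque(pool.imap_unordered(processor, files), maxlen=0)
    pool.close()
    pool.join()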
I am working with DEAP.
I am evaluating a population (currently 50 individuals) against a large dataset (400,000 columns of 200 floating-point values each).
I have successfully tested the algorithm without any multiprocessing. Execution time is about 40s/generation.
I want to work with larger populations and more generations, so I try to speed up by using multiprocessing.
I guess that my question is more related to multiprocessing than to DEAP.
This question is not directly related to sharing memory/variables between processes. The main issue is how to minimise disk access.
I have started to work with Python multiprocessing module.
The code looks like this
toolbox = base.Toolbox()

creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

PICKLE_SEED = 'D:\\Application Data\\Dev\\20150925173629ClustersFrame.pkl'
PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2)
    toolbox.register("map", pool.map)
    data = pd.read_pickle(PICKLE_DATA).values
And then, a little bit further:
def main():
    NGEN = 10
    CXPB = 0.5
    MUTPB = 0.2

    population = toolbox.population_guess()
    fitnesses = list(toolbox.map(toolbox.evaluate, population))
    print(sorted(fitnesses, reverse=True))
    for ind, fit in zip(population, fitnesses):
        ind.fitness.values = fit

    # Begin the evolution
    for g in range(NGEN):
The evaluation function uses the global "data" variable.
and, finally:
if __name__ == "__main__":
start = datetime.now()
main()
pool.close()
stop = datetime.now()
delta = stop-start
print (delta.seconds)
So: the main processing loop and the pool definition are guarded by if __name__ == "__main__":.
It somehow works. Execution times are:
1 process: 398 s
2 processes: 270 s
3 processes: 272 s
4 processes: 511 s
Multiprocessing does not dramatically improve the execution time, and can even harm it.
The 4 process (lack of) performance can be explained by memory constraints. My system is basically paging instead of processing.
I guess that the other measurements can be explained by the loading of data.
My questions:
1) I understand that the file will be read and unpickled each time the module is started as a separate process. Is this correct? Does this mean it will be read each time one of the functions it contains is called by map?
2) I have tried to move the unpickling under the if __name__ == "__main__": guard, but then I get an error message saying that "data" is not defined when I call the evaluation function. Could you explain how I can read the file once, and then only pass the array to the processes?
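One common pattern for "read the file once per worker" is a Pool initializer; here is a minimal sketch of that idea (evaluate is only a stand-in for the real DEAP evaluation function):

import multiprocessing
import pandas as pd

PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'

data = None  # filled in once per worker process by the initializer

def init_worker(pickle_path):
    # Runs exactly once in each worker process, so the pickle is read
    # once per process rather than once per evaluated individual.
    global data
    data = pd.read_pickle(pickle_path).values

def evaluate(individual):
    # Stand-in evaluation that reads the module-level 'data'
    # populated by the initializer.
    return (float(len(data)),)

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2,
                                initializer=init_worker,
                                initargs=(PICKLE_DATA,))
    # With DEAP you would then register: toolbox.register("map", pool.map)
    print(pool.map(evaluate, [[0], [1]]))
    pool.close()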
I'm trying to perform some costly scientific calculations with Python. I have to read a bunch of data stored in CSV files and process them. Since each process takes a long time and I have 8 processors to use, I was trying to use the Pool class from multiprocessing.
This is how I structured the multiprocessing call:
pool = Pool()
vector_components = []
for sample in range(samples):
    vector_field_x_i = vector_field_samples_x[sample]
    vector_field_y_i = vector_field_samples_y[sample]

    vector_component = pool.apply_async(vector_field_decomposer, args=(x_dim, y_dim, x_steps, y_steps,
                                                                       vector_field_x_i, vector_field_y_i))
    vector_components.append(vector_component)

pool.close()
pool.join()

vector_components = map(lambda k: k.get(), vector_components)

for vector_component in vector_components:
    CsvH.write_vector_field(vector_component, '../CSV/RotationalFree/rotational_free_x_' + str(sample) + '.csv')
I was running a data set of 500 samples of size equal to 100 (x_dim) by 100 (y_dim).
Until then everything worked fine.
Then I receive a data set of 500 samples of 400 x 400.
When running it, I get an error when calling the get.
I also tried to run a single sample of 400 x 400 and got the same error.
Traceback (most recent call last):
File "__init__.py", line 33, in <module>
VfD.samples_vector_field_decomposer(samples, x_dim, y_dim, x_steps, y_steps, vector_field_samples_x, vector_field_samples_y)
File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in samples_vector_field_decomposer
vector_components = map(lambda k: k.get(), vector_components)
File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in <lambda>
vector_components = map(lambda k: k.get(), vector_components)
File "/export/home/pceccon/.pyenv/versions/2.7.5/lib/python2.7/multiprocessing/pool.py", line 554, in get
raise self._value
MemoryError
What should I do?
Thank you in advance.
Right now you're keeping several lists in memory - vector_field_x, vector_field_y, vector_components, and then a separate copy of vector_components during the map call (which is when you actually run out of memory). You can avoid needing either copy of the vector_components list by using pool.imap, instead of pool.apply_async along with a manually created list. imap returns an iterator instead of a complete list, so you never have all the results in memory.
Normally, pool.map breaks the iterable passed to it into chunks and sends those chunks to the child processes, rather than sending one element at a time. This helps improve performance. Because imap uses an iterator instead of a list, it doesn't know the complete size of the iterable you're passing to it. Without knowing the size of the iterable, it doesn't know how big to make each chunk, so it defaults to a chunksize of 1, which will work, but may not perform all that well. To avoid this, you can provide it with a good chunksize argument, since you know the iterable is samples elements long. It may not make much difference for your 500-element list, but it's worth experimenting with.
Here's some sample code that demonstrates all this:
import multiprocessing
from functools import partial

def vector_field_decomposer(x_dim, y_dim, x_steps, y_steps, vector_fields):
    vector_field_x_i = vector_fields[0]
    vector_field_y_i = vector_fields[1]
    # Do whatever is normally done here.

if __name__ == "__main__":
    num_workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_workers)

    # Calculate a good chunksize (based on implementation of pool.map)
    chunksize, extra = divmod(samples, num_workers * 4)
    if extra:
        chunksize += 1

    # Use partial so many arguments can be passed to vector_field_decomposer
    func = partial(vector_field_decomposer, x_dim, y_dim, x_steps, y_steps)

    # We use a generator expression as an iterable, so we don't create a full list.
    results = pool.imap(func,
                        ((vector_field_samples_x[s], vector_field_samples_y[s]) for s in xrange(samples)),
                        chunksize=chunksize)

    for sample, vector_component in enumerate(results):
        CsvH.write_vector_field(vector_component,
                                '../CSV/RotationalFree/rotational_free_x_' + str(sample) + '.csv')

    pool.close()
    pool.join()
This should allow you to avoid the MemoryError issues, but if not, you could try running imap over smaller chunks of your total sample, and just do multiple passes. I don't think you'll have any issues though, because you're not building any additional lists, other than the vector_field_* lists you start with.
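If you do still bump into memory limits, that multi-pass idea might look roughly like this, reusing pool, func and chunksize from the code above and running in place of the single imap call (batch_size is an illustrative knob to tune to your memory budget):

batch_size = 100  # illustrative value

for start in xrange(0, samples, batch_size):
    stop = min(start + batch_size, samples)
    # Only one batch of inputs is generated and in flight at a time.
    batch = ((vector_field_samples_x[s], vector_field_samples_y[s])
             for s in xrange(start, stop))
    for sample, vector_component in enumerate(pool.imap(func, batch, chunksize=chunksize), start):
        CsvH.write_vector_field(vector_component,
                                '../CSV/RotationalFree/rotational_free_x_' + str(sample) + '.csv')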