Update: it's working after updating my Spyder to 5.0.5. Thanks everyone!
I am trying to speed up a loop using multiprocessing. The code below aims to generate 10000 random vectors.
My idea is to split the task across 5 processes and collect the output in result. However, result is an empty list when I run the code.
But, if I remove result = add_one(result) in the randomize_data function, the code runs perfectly. So, the error must be coming from using functions from other modules (Testing.test) inside multiprocessing.
Here is the add_one function from Testing.test:
def add_one(x):
    return x+1
How can I use functions from other modules inside a process? Thank you.
import multiprocessing
import numpy as np
import pandas as pd

def randomize_data(mean, cov, n_init, proc_num, return_dict):
    result = pd.DataFrame()
    for _ in range(n_init):
        temp = np.random.multivariate_normal(mean, cov)
        result = result.append(pd.Series(temp), ignore_index=True)
    result = add_one(result)
    return_dict[proc_num] = result

if __name__ == "__main__":
    from Testing.test import add_one
    mean = np.arange(0, 1, 0.1)
    cov = np.identity(len(mean))
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=randomize_data, args=(mean, cov, 2000, i, return_dict, ))
        jobs.append(p)
        p.start()
    for proc in jobs:
        proc.join()
    result = return_dict.values()
The issue here is pretty obvious:
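You imported add_one inside the if __name__ == "__main__": block, so the import only runs in the main process. On platforms that use the spawn start method (the default on Windows), each child process re-imports your module without executing that block, so the reference to add_one never exists in the workers and randomize_data fails there.
Move this import statement to the top of your file with the other imports, and your code should work.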
import multiprocessing
import numpy as np
import pandas as pd
from Testing.test import add_one
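To make the point concrete, here is a minimal sketch of the same pattern. It is not the asker's code: add_one is defined inline as a stand-in for the Testing.test import, and worker and run are hypothetical names. It also checks each child's exit code, which is one way to notice a worker that died silently (for example from a NameError) instead of just getting an empty result.

```python
import multiprocessing


def add_one(x):
    # Stand-in for `from Testing.test import add_one` (assumption: any
    # module-level import or definition behaves the same way here).
    return x + 1


def worker(proc_num, return_dict):
    # add_one resolves because it exists at module level in every process.
    return_dict[proc_num] = add_one(proc_num)


def run(n_procs=3):
    with multiprocessing.Manager() as manager:
        return_dict = manager.dict()
        jobs = [
            multiprocessing.Process(target=worker, args=(i, return_dict))
            for i in range(n_procs)
        ]
        for p in jobs:
            p.start()
        for p in jobs:
            p.join()
            # A non-zero exit code means the child died, e.g. from a NameError.
            if p.exitcode != 0:
                raise RuntimeError(f"worker exited with code {p.exitcode}")
        # Copy into a plain dict before the manager shuts down.
        return return_dict.copy()


if __name__ == "__main__":
    print(sorted(run().items()))
```

If add_one were only imported under the main guard, a spawned worker would fail inside worker and the exit-code check would raise instead of returning an empty dict.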
Related
I have been using parfor in MATLAB to run parallel for loops for quite some time. I need to do something similar in Python but I cannot find any simple solution. This is my code:
t = list(range(1,3,1))
G = list(range(0,3,2))
results = pandas.DataFrame(columns = ['tau', 'p_value','G','t_i'],index=range(0,len(G)*len(t)))
counter = 0
for iteration_G in list(range(0,len(G))):
    for iteration_t in list(range(0,len(t))):
        matrix_1,matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        results['tau'][counter] = tau
        results['p_value'][counter] = p_value
        results['G'][counter] = G[iteration_G]
        results['t_i'][counter] = t[iteration_t]
        counter = counter + 1
I would like to use the parfor equivalent in the first loop.
I'm not familiar with parfor, but you can use the joblib package to run functions in parallel.
In this simple example there's a function that prints its argument and we use Parallel to execute it multiple times in parallel with a for-loop
import multiprocessing
from joblib import Parallel, delayed

# function that you want to run in parallel
def foo(i):
    print(i)

# define the number of cores (this is how many processes will run)
num_cores = multiprocessing.cpu_count()

# execute the function in parallel - `return_list` is a list of the results of the function
# in this case it will just be a list of None's
return_list = Parallel(n_jobs=num_cores)(delayed(foo)(i) for i in range(20))
If this doesn't work for what you want to do, you can try numba. It might be a bit more difficult to set up, but in theory you can just add @njit(parallel=True) as a decorator to your function and numba will try to parallelise it for you.
I found a solution using parfor. It is still a bit more complicated than MATLAB's parfor but it's pretty close to what I am used to.
t = list(range(1,16,1))
G = list(range(0,62,2))
for iteration_t in list(range(0,len(t))):
    @parfor(list(range(0,len(G))))
    def fun(iteration_G):
        result = pandas.DataFrame(columns = ['tau', 'p_value'],index=range(0,1))
        matrix_1,matrix_2 = bunch of code
        tau, p_value = scipy.stats.kendalltau(matrix_1, matrix_2)
        result['tau'] = tau
        result['p_value'] = p_value
        fun = numpy.array([tau,p_value])
        return fun
I am trying to use multiprocessing library to speed up CSV reading from files. I've done so using Pool and now I'm trying to do it with Process(). However when running the code, it's giving me the following error:
AttributeError: 'tuple' object has no attribute 'join'
Can someone tell me what's wrong? I don't understand the error.
import glob
import pandas as pd
from multiprocessing import Process
import matplotlib.pyplot as plt
import os

location = "/home/data/csv/"
uber_data = []

def read_csv(filename):
    return uber_data.append(pd.read_csv(filename))

def data_wrangling(uber_data):
    uber_data['Date/Time'] = pd.to_datetime(uber_data['Date/Time'], format="%m/%d/%Y %H:%M:%S")
    uber_data['Dia Setmana'] = uber_data['Date/Time'].dt.weekday_name
    uber_data['Num dia'] = uber_data['Date/Time'].dt.dayofweek
    return uber_data

def plotting(uber_data):
    weekdays = uber_data.pivot_table(index=['Num dia','Dia Setmana'], values='Base', aggfunc='count')
    weekdays.plot(kind='bar', figsize=(8,6))
    plt.ylabel('Total Journeys')
    plt.title('Journey on Week Day')

def main():
    processes = []
    files = list(glob.glob(os.path.join(location,'*.csv*')))
    for i in files:
        p = Process(target=read_csv, args=[i])
        processes.append(p)
        p.start()
    for process in enumerate(processes):
        process.join()
    #combined_df = pd.concat(df_list, ignore_index=True)
    #dades_mod = data_wrangling(combined_df)
    #plotting(dades_mod)

main()
Thank you.
I'm not 100% sure how Process works in this context, but what you have written here:
for process in enumerate(processes):
    process.join()
raises this error because each item that enumerate yields is a tuple, not a Process.
Iterating over enumerate(iterable) produces (index, item) tuples, where the first element is a counter and the second is the original item, so process here is a tuple and has no join method.
Try this for a start:
for i, process in enumerate(processes): # assign the counter to the variable i, and grab the process which is the second element of the tuple
    process.join()
Or this:
for process in processes:
    process.join()
For more on enumerate see the builtin documentation here: https://docs.python.org/3/library/functions.html#enumerate
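As a quick illustration of the point above, this small sketch shows what enumerate actually yields. The strings are just stand-ins for the Process objects in the question.

```python
# Stand-ins for Process objects; any list behaves the same way.
processes = ["p0", "p1", "p2"]

# Each item produced by enumerate is an (index, element) tuple.
pairs = list(enumerate(processes))
first = pairs[0]
print(type(first).__name__, first)

# A tuple has no join method, which is exactly the reported AttributeError.
print(hasattr(first, "join"))

# Unpacking the tuple in the loop header gives back the original element.
unpacked = [proc for _, proc in enumerate(processes)]
print(unpacked == processes)
```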
I'm attempting to get python multiprocessing working to speed up a code I've written. The code looks like this:
from multiprocessing import Array, Pool
import numpy as np

#setting up shared memory array
global misfit
misfit = Array('d', np.empty((dim1,dim2,dim3,dim4)).flat)

#looping through some values
for i in xrange(0,1):
    #setting up pool
    pool = Pool()
    p = [pool.apply_async(self.testfunc,args=(somevals,j)) for j in xrange(0,1)]
    pool.close()
    pool.join()
Where self.testfunc looks like:
def testfunc(self,somevals,j):
    #some calculations
    for k in xrange(0,1):
        #some calculations
        for mn in xrange(0,1):
            #some more calculations
            #save results
            result = i*j*k*mn # example
            misfit[i*j*k*mn] = result
My problem is that when I run this none of the values are saved in the shared Array, and it remains empty. I understand this could be to do with the global variable, but in a simpler program that uses this exact setup, the values are saved to the array. The array is quite large in the full program as well (4561920000 values). Also if I call this function outside of the Pool, it works and the values are saved.
So my question is: what am I doing wrong here? Am I sending the shared Array incorrectly?
EDIT: Figured I'd add in the code that works:
from multiprocessing import Array, Pool
from numpy import empty, sin
from time import time
import numpy as np

def initarr():
    a = Array('d', empty((5, 50, 80)).flat)
    return a

def testfunc(i, j, k):
    count = (i*50*80) + (j*80) + k
    x = sin(k)
    a[count] = x
    y = np.fft.fft(np.exp(2j*np.pi*np.arange(50000)/50000))

def process(i):
    start = time()
    pool = Pool()
    for j in xrange(0, 50):
        p = [pool.apply_async(testfunc, args=(i, j, k)) for k in xrange(0, 80)]
    pool.close()
    pool.join()
    print time() - start

global a
a = initarr()
for i in xrange(0, 5):
    process(i)
Ok so with the help of someone from our IT department, I finally have a version of this that works, so for anybody in the future viewing this question, I'll post a solution. I haven't really used stack overflow much so sorry if it's bad etiquette to answer my own question.
We got this working using an initializer function, but we had to make sure the initializer function was in the same file (module) as the function being run by the Pool. So in one module (misc) we had:
**misc.py**
def testfunc(self,somevals,j):
    #some calculations
    for k in xrange(0,len(krange)):
        #some calculations
        for mn in xrange(0,len(mnrange)):
            #some more calculations
            #save results
            # parentheses let the expression continue across lines
            loc = ((i*len(jrange)*len(krange)*len(mnrange))+
                   (j*len(krange)*len(mnrange))+(k*len(mnrange))+mn)
            result = i*j*k*mn # example
            misfit[loc] = result

def initpool(a):
    global misfit
    misfit = a
And in the main file we have:
**main.py**
from multiprocessing import Array, Pool
from misc import initpool, testfunc
import numpy as np

#setting up shared memory array
misfit = Array('d', np.empty((dim1,dim2,dim3,dim4)).flat)

#looping through some values
for i in xrange(0,len(irange)):
    #setting up pool
    pool = Pool(initializer=initpool,initargs=(misfit,),processes=20)
    p = [pool.apply_async(testfunc,args=(somevals,j)) for j in xrange(0,len(jrange))]
    pool.close()
    pool.join()

print(misfit[0])
Note that when we initially set up the Array, it must be named the same as the variable you set in initpool, at least from when I tested it.
This probably isn't the best way to do it but it works and hopefully some other people might find a use for it!
This is my first time trying to use multiprocessing in Python. I'm trying to parallelize my function fun over my dataframe df by row. The callback function is just to append results to an empty list that I'll sort through later.
Is this the correct way to use apply_async? Thanks so much.
import multiprocessing as mp

function_results = []
async_results = []
p = mp.Pool() # by default should use number of processors
for row in df.iterrows():
    r = p.apply_async(fun, (row,), callback=function_results.extend)
    async_results.append(r)
for r in async_results:
    r.wait()
p.close()
p.join()
It looks like using map or imap_unordered (depending on whether you need your results to be ordered or not) would better suit your needs:
import multiprocessing as mp
#prepare stuff
if __name__=="__main__":
    p = mp.Pool()
    function_results = list(p.imap_unordered(fun,df.iterrows())) #unordered
    #function_results = p.map(fun,df.iterrows()) #ordered
    p.close()
    p.join()
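Here is a small self-contained sketch of the difference between the two calls. The square function and the run helper are hypothetical stand-ins for fun and the dataframe rows.

```python
import multiprocessing as mp


def square(x):
    # Stand-in for the real per-row function.
    return x * x


def run(values):
    with mp.Pool(processes=2) as pool:
        # map returns results in the same order as the inputs.
        ordered = pool.map(square, values)
        # imap_unordered yields results as they finish, in any order.
        unordered = list(pool.imap_unordered(square, values))
    return ordered, unordered


if __name__ == "__main__":
    ordered, unordered = run(range(10))
    print(ordered)
    # Same values either way; only the ordering guarantee differs.
    print(sorted(unordered) == ordered)
```

With either call the return values come back directly, so no callback or shared list is needed to collect them.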
I have a code which reads data from multiple files named 001.txt, 002.txt, ... , 411.txt. I would like to read the data from each file, plot them, and save as 001.jpg, 002.jpg, ... , 411.jpg.
I can do this by looping through the files, but I would like to use the multiprocessing module to speed things up.
However, when I use the code below, the computer hangs: I can't click on anything, although the mouse still moves and the sound continues. I then have to power down the computer.
I'm obviously misusing the multiprocessing module with matplotlib. I have used something very similar to the code below to actually generate the data and save it to text files with no problems. What am I missing?
import multiprocessing

def do_plot(number):
    fig = figure(number)
    a, b = random.sample(range(1,9999),1000), random.sample(range(1,9999),1000)
    # generate random data
    scatter(a, b)
    savefig("%03d" % (number,) + ".jpg")
    print "Done ", number
    close()

for i in (0, 1, 2, 3):
    jobs = []
    # for j in chunk:
    p = multiprocessing.Process(target = do_plot, args = (i,))
    jobs.append(p)
    p.start()
    p.join()
The most important thing in using multiprocessing is to run the main code of the module only for the main process. This can be achieved by testing if __name__ == '__main__' as shown below:
import matplotlib.pyplot as plt
import numpy.random as random
from multiprocessing import Pool

def do_plot(number):
    fig = plt.figure(number)
    a = random.sample(1000)
    b = random.sample(1000)
    # generate random data
    plt.scatter(a, b)
    plt.savefig("%03d.jpg" % (number,))
    plt.close()
    print("Done ", number)

if __name__ == '__main__':
    pool = Pool()
    pool.map(do_plot, range(4))
Note also that I replaced the creation of separate processes with a process pool, which scales better to many pictures since it only uses as many processes as you have cores available.