I'd like to parallelize a function that returns a flattened list of values (called "keys") in a dict, but I don't understand how to obtain the final result. I have tried:
import multiprocessing
import pandas as pd

def toParallel(ht, token):
    keys = []
    words = token[token['hashtag'] == ht]['word']
    for w in words:
        keys.append(checkString(w))
    y = {ht: keys}
    return y
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
token = pd.read_csv('/path', sep=",", header = None, encoding='utf-8')
token.columns = ['word', 'hashtag', 'count']
hashtag = pd.DataFrame(token.groupby(by='hashtag', as_index=False).count()['hashtag'])
result = pd.DataFrame(index = hashtag['hashtag'], columns = range(0, 21))
result = result.fillna(0)
final_result = []
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
where the toParallel function should return a dict with the hashtag as key and a list of keys as value (the keys are ints). But if I try to print final_result, I obtain only:
<bound method ApplyResult.get of <multiprocessing.pool.ApplyResult object at 0x10c4fa950>>
How can I do it?
final_result = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
You can either use Pool.apply() and get the result right away (in which case you do not really need multiprocessing, the function is mostly there for completeness), or use Pool.apply_async() followed by a call to get() on the ApplyResult objects it returns. Pool.apply_async() is asynchronous, so it hands back result handles rather than the values themselves.
Something like this:
workers = [pool.apply_async(toParallel, args=(ht,token,)) for ht in hashtag['hashtag']]
final_result = [worker.get() for worker in workers]
Alternatively, you can also use Pool.map() which will do all this for you.
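For example, a rough sketch of the Pool.map route for this case (reusing the question's toParallel, token, pool, and hashtag names, and assuming toParallel returns its dict as above) could look like:

from functools import partial

# bind the token DataFrame once so the pool only has to pass each hashtag
worker = partial(toParallel, token=token)
final_result = pool.map(worker, hashtag['hashtag'])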
Either way, I recommend you read the documentation carefully.
Addendum: when answering this question I presumed the OP is using a Unix-like operating system such as Linux or macOS. If you are using Windows, you must not forget to safeguard your parent/worker processes with if __name__ == '__main__'. This is because Windows lacks fork(), so the child process starts at the beginning of the file rather than at the point of forking as on Unix, and you must use that if condition to guide it. See here.
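A rough sketch of that guard applied to the code above (again reusing the question's names) would be:

import multiprocessing

if __name__ == '__main__':
    # everything that creates the pool and submits work stays under the guard,
    # so a spawned child re-importing this module does not re-run it
    pool = multiprocessing.Pool()
    workers = [pool.apply_async(toParallel, args=(ht, token)) for ht in hashtag['hashtag']]
    final_result = [w.get() for w in workers]
    pool.close()
    pool.join()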
ps: this is unnecessary:
num_cores = multiprocessing.cpu_count()
pool = multiprocessing.Pool(num_cores)
If you call multiprocessing.Pool() without arguments (or with None), it already creates a pool of workers sized to your CPU count.
Related
The result array is displayed as empty after trying to append values to it.
I have even declared result as global inside the function.
Any suggestions?
try this
import multiprocessing

res = []
inputData = [a, b, c, d]

def function(data):
    values = [some_Number_1, some_Number_2]
    return values

def parallel_run(function, inputData):
    cpu_no = 4
    if len(inputData) < cpu_no:
        cpu_no = len(inputData)
    p = multiprocessing.Pool(cpu_no)
    global resultsAr
    resultsAr = p.map(function, inputData, chunksize=1)
    p.close()
    p.join()
    print('res = ', res)
This happens since you're misunderstanding the basic point of multiprocessing: the child process spawned by multiprocessing.Process is separate from the parent process, and thus any modifications to data (including global variables) in the child process(es) are not propagated into the parent.
You will need to use multiprocessing-specific data types (queues and pipes), or the higher-level APIs provided by e.g. multiprocessing.Pool, to get data out of the child process(es).
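For illustration, a minimal queue-based sketch (with a hypothetical square_to_queue helper) could look like this:

from multiprocessing import Process, Queue

def square_to_queue(v, q):
    # send the result back to the parent explicitly through the queue
    q.put(v * v)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=square_to_queue, args=(3, q))
    p.start()
    result = q.get()  # read the result before joining the child
    p.join()
    print(result)  # 9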
For your application, the high-level recipe would be:

import multiprocessing

def square(v):
    return v * v

def main():
    arr = [1, 2, 3, 4, 5]
    with multiprocessing.Pool() as p:
        squared = p.map(square, arr)
    print(squared)

if __name__ == '__main__':
    main()
However, you'll likely find that this is massively slower than not using multiprocessing at all, due to the overhead involved in such a small task.
Welcome to StackOverflow, Suyash!
The problem is that multiprocessing.Process is, as its name says, a separate process. You can imagine it almost as if you were running your script again from the terminal, with very little connection to the parent script.
Therefore, it has its own copy of the result array, which it modifies and prints.
The result array in the "main" process is unmodified.
To convince yourself of this, try printing id(res) both in __main__ and in square(). You may well see that they are different; in any case, the two lists are independent objects.
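A small sketch of that experiment, with a hypothetical res list and square function mirroring the question, might look like this:

import multiprocessing

res = []

def square(v):
    res.append(v * v)
    print('in child :', res)   # [9]: the child's own copy was modified

if __name__ == '__main__':
    p = multiprocessing.Process(target=square, args=(3,))
    p.start()
    p.join()
    print('in parent:', res)   # []: the parent's copy is untouched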
This is the code I want to make parallel
dataset = {}
for index, Id in enumerate(MarketIds['Market Id']):
    dataset[index] = GetAllBidPrice(Id)
Assuming you don't care about the order in which keys are inserted into the dictionary, a good option here would probably be the imap_unordered method of a multiprocessing.pool.Pool object. Here's an example using all processor cores; note that the worker has to be a module-level function, because a Pool cannot pickle a lambda:
from multiprocessing.pool import Pool

def fetch(args):
    idx, market_id = args          # unpack the (index, id) pair from enumerate
    return idx, GetAllBidPrice(market_id)

if __name__ == '__main__':
    with Pool(None) as p:          # None means: use all available cores
        dataset = dict(p.imap_unordered(fetch, enumerate(MarketIds['Market Id'])))
I'm doing a lot of calculations and writing the results to a file. Using multiprocessing, I'm trying to parallelise the calculations.
The problem is that all the workers are writing to the same single output file. I'm quite new to multiprocessing and wondering how I could make it work.
A very simplified version of the code is given below:
from multiprocessing import Pool
fout_ = open('test.txt', 'w')

def f(x):
    fout_.write(str(x) + "\n")

if __name__ == '__main__':
    p = Pool(5)
    p.map(f, [1, 2, 3])
The result I want would be a file with:
1
2
3
However now I get an empty file. Any suggestions?
I greatly appreciate any help :)!
You shouldn't be letting all the workers/processes write to a single file. They can all read from one file (which may cause slowdowns due to workers waiting for one of them to finish reading), but writing to the same file will cause conflicts and potential corruption.
As said in the comments, write to separate files instead and then combine them into one in a single process. This small program illustrates it, based on the program in your post:
from multiprocessing import Pool

def f(args):
    ''' Perform computation and write
    to a separate file for each '''
    x = args[0]
    fname = args[1]
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

def fcombine(orig, dest):
    ''' Combine files with names in
    orig into one file named dest '''
    with open(dest, 'w') as fout:
        for o in orig:
            with open(o, 'r') as fin:
                for line in fin:
                    fout.write(line)

if __name__ == '__main__':
    # Each sublist is a combination of arguments:
    # a number and a temporary output file name
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    args = list(zip(x, names))

    p = Pool(3)
    p.map(f, args)
    p.close()
    p.join()

    fcombine(names, 'final.txt')
It runs f for each argument combination, which in this case is a value of x and a temporary file name. It passes each combination as a single tuple, since pool.map does not accept more than one argument per call. There are other ways around this, especially on newer Python versions (see the Pool.starmap sketch below).
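A minimal Pool.starmap sketch (Python 3.3+), using the same toy arguments, might look like this:

from multiprocessing import Pool

def f(x, fname):
    # starmap unpacks each (x, fname) tuple into two separate arguments
    with open(fname, 'w') as fout:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    x = range(1, 4)
    names = ['temp_' + str(y) + '.txt' for y in x]
    with Pool(3) as p:
        p.starmap(f, zip(x, names))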
For each argument combination and pool member it creates a separate file to which it writes the output. In practice your output will be longer; you can simply add another function that computes it and call it from f. Also, there is no need to use Pool(5) for 3 arguments (though I assume only three workers were active anyway).
The reasons for calling close() and join() are explained well in this post. It turns out (in the comments to the linked post) that map is blocking, so here you don't need them for the original reason (waiting until all workers finish before writing the combined output from a single process). I would still use them in case other parallel features are added later.
In the last step, fcombine gathers all the temporary files and copies them into one. It's a bit deeply nested; if you for instance decide to remove the temporary files after copying, you may want to pull the body of the with open(dest, ...) block or the inner for loop into a separate function, for readability and functionality.
multiprocessing.Pool spawns separate processes, and writing to a common file from each process without a lock can cause data loss.
As you said, you are trying to parallelise the calculation; multiprocessing.Pool can be used to parallelize the computation.
Below is a solution that does the parallel computation and writes the result to a file. Hope it helps:
from multiprocessing import Pool
# library for timing
import datetime

# file in which you want to write
fout = open('test.txt', 'w')

# function for your calculations; written to be time consuming
def calc(x):
    x = x ** 2
    total = 0
    for i in range(0, 1000000):
        total += i
    return x

# function to write to the txt file; it takes the list of items to write
def f(res):
    global fout
    for x in res:
        fout.write(str(x) + "\n")

if __name__ == '__main__':
    qs = datetime.datetime.now()
    arr = [1, 2, 3, 4, 5, 6, 7]
    p = Pool(5)
    res = p.map(calc, arr)
    # write the calculated list to the file
    f(res)
    qe = datetime.datetime.now()
    print((qe - qs).total_seconds() * 1000)

    # to compare the improvement from multiprocessing, the iterative solution
    qs = datetime.datetime.now()
    for item in arr:
        x = calc(item)
        fout.write(str(x) + "\n")
    qe = datetime.datetime.now()
    print((qe - qs).total_seconds() * 1000)

    fout.close()
This is my first time trying to use multiprocessing in Python. I'm trying to parallelize my function fun over my dataframe df by row. The callback function is just to append results to an empty list that I'll sort through later.
Is this the correct way to use apply_async? Thanks so much.
import multiprocessing as mp

function_results = []
async_results = []

p = mp.Pool()  # by default uses the number of processors
for row in df.iterrows():
    r = p.apply_async(fun, (row,), callback=function_results.extend)
    async_results.append(r)

for r in async_results:
    r.wait()

p.close()
p.join()
It looks like using map or imap_unordered (depending on whether you need your results to be ordered or not) would better suit your needs:
import multiprocessing as mp

# prepare stuff

if __name__ == "__main__":
    p = mp.Pool()
    function_results = list(p.imap_unordered(fun, df.iterrows()))  # unordered
    # function_results = p.map(fun, df.iterrows())  # ordered
    p.close()
My problem is to execute something like:
multicore_apply(serie, func)
to run serie.apply(func) across several cores.
So I tried to create a function that does it. Here is the function used to run the apply method in a process:
def adaptator(func, queue):
    serie = queue.get().apply(func)
    queue.put(serie)
and the process management:

from multiprocessing import Process, Queue
import pandas as pd

def parallel_apply(ncores, func, serie):
    series = [serie[i::ncores] for i in range(ncores)]
    queues = [Queue() for i in range(ncores)]
    for _serie, queue in zip(series, queues):
        queue.put(_serie)
    result = []
    jobs = []
    for i in range(ncores):
        jobs.append(Process(target=adaptator, args=(func, queues[i])))
    for job in jobs:
        job.start()
    for queue, job in zip(queues, jobs):
        job.join()
        result.append(queue.get())
    return pd.concat(result, axis=0).sort_index()
I know the i::ncores slicing is not optimized, but that is actually not the problem:
if the input length is greater than 30000, the processes never stop...
Is that a misunderstanding of Queue()?
I don't want to use multiprocessing.Pool.map: the func to apply is a method of a very complex and fairly large class, so shared memory makes it just too slow. That is why I want to pass it through a queue once this process problem is solved.
Thank you for your advice.
Maybe this will help: you can use the multiprocessing lib.
Your multicore_apply(serie, func) should look like:
from multiprocessing import Pool

pool = Pool()
result = pool.map(func, serie)
pool.terminate()
You can specify the number of processes to create like this: pool = Pool(6); by default it equals the number of cores on the machine.
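Put together, a minimal multicore_apply sketch along those lines (assuming func is a picklable, module-level function) might be:

from multiprocessing import Pool
import pandas as pd

def multicore_apply(serie, func, processes=None):
    # map func over the elements of the Series, then rebuild it with the original index
    with Pool(processes) as pool:
        values = pool.map(func, serie)
    return pd.Series(values, index=serie.index)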
After many nights of intense searching, I solved the problem with the help of a post on the Python development website about the maximum size of an object in a queue: the problem was there. I also used another post on StackOverflow, found here:
I then wrote the following program, but it is not as efficient as expected for large objects. I will make the same thing available for every axis.
Note that this version allows using a complex class as the function argument, which I cannot do with pool.map:
from multiprocessing import Process, Manager
import numpy as np
import pandas as pd

def adaptator(series, results, ns, i):
    serie = series[i]
    func = ns.func
    result = serie.apply(func)
    results[i] = result

def parallel_apply(ncores, func, serie):
    series = np.array_split(serie, ncores, axis=0)
    M = Manager()
    s_series = M.list()
    s_series.extend(series)
    results = M.list()
    results.extend([None] * ncores)
    ns = M.Namespace()
    ns.func = func
    jobs = []
    for i in range(ncores):
        jobs.append(Process(target=adaptator, args=(s_series, results, ns, i)))
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    print(results)
So if you pass large objects through queues, IPython freezes.