I am trying to apply function on a large pd.dataframe on pyspark. My code was post below which uses multiprocessing.Pool but is not as fast as expected. It cost the same time as df.apply(f,axis=1).
There shall be some mistakes I didn't notice. I spend my day but find nothing out. That's why I finally come here for help.
def f(series):
# do something
return series
if __name__=='__main__':
#load(df)
output=pd.DataFrame()
pool = multiprocessing.Pool(8)
for name in df.colunms:
res=pool.apply_async(f,(df[name],),callback=logging.info("f with "+name))
output[name]=res.get()
pool.close()
pool.join()
After #Andriy Ivaneyko answers, I also tried this:
if __name__=='__main__':
#load(df)
output=pd.DataFrame()
res={}
pool = multiprocessing.Pool(8)
for name in df.colunms:
res[name]=pool.apply_async(f,(df[name],),callback=logging.info("f with "+name))
for name,val in res.items():
output[name]=val.get()
pool.close()
pool.join()
I change the number of cores from 4 to 8 to 16, however the function consumes almost the same time.
The get() method blocks until the function is completed, that's the reason for not getting performance benefit. Create list of the ApplyResult objects ( returned by apply_async) and perform get when you finish iteration over df.colunms
# ... Code before
apply_results = []
for name in df.colunms:
res=pool.apply_async(f,(df[name],),callback=logging.info("f with "+name))
apply_results[name] = res
for name, res in apply_results.items():
output[name]=res.get()
# ... Code after
Related
I have a script that loops over a pandas dataframe and outputs GIS data to a geopackage based on some searches and geometry manipulation. It works when I use a for loop but with over 4k records it takes a while. Since I have it built as it's own function that returns what I need based on a row iteration I tried to run it with multiprocessing with:
import pandas as pd, bwe_mapping
from multiprocessing import Pool
#Sample dataframe
bwes = [['id', 7216],['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92], ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwedf = pd.read_csv(bwes)
geopackage = "datalocation\geopackage.gpkg"
tracklayer = "tracks"
if __name__=='__main__':
def task(item):
bwe_mapping.map_bwe(item, geopackage, tracklayer)
pool = Pool()
for index, row in bwedf.iterrows():
task(row)
with Pool() as pool:
for results in pool.imap_unordered(task, bwedf.iterrows()):
print(results)
When I run this my Task manager populates with 16 new python tasks but no sign that anything is being done. Would it be better to use numpy.array.split() to break up my pandas df into 4 or 8 smaller ones and run the for index, row in bwedf.iterrows(): for each dataframe on it's own processor?
No one process needs to be done in any order; as long as I can store the outputs, which are geopanda dataframes, into a list to concatenate into geopackage layers at the end.
Should I have put the for loop in the function and just passed it the whole dataframe and gis data to search?
if you are running on windows/macOS then it's going to use spawn to create the workers, which means that any child MUST find the function it is going to execute when it imports your main script.
your code has the function definition inside your if __name__=='__main__': so the children don't have access to it.
simply moving the function def to before if __name__=='__main__': will make it work.
what is happening is that each child is crashing when it tries to run a function because it never saw its definition.
minimal code to reproduce the problem:
from multiprocessing import Pool
if __name__ == '__main__':
def task(item):
print(item)
return item
pool = Pool()
with Pool() as pool:
for results in pool.imap_unordered(task, range(10)):
print(results)
and the solution is to move the function definition to before the if __name__=='__main__': line.
Edit: now to iterate on rows in a dataframe, this simple example demonstrates how to do it, note that iterrows returns an index and a row, which is why it is unpacked.
import os
import pandas as pd
from multiprocessing import Pool
import time
# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92],
['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwef = pd.DataFrame(bwes)
def task(item):
time.sleep(1)
index, row = item
# print(os.getpid(), tuple(row))
return str(os.getpid()) + " " + str(tuple(row))
if __name__ == '__main__':
with Pool() as pool:
for results in pool.imap_unordered(task, bwef.iterrows()):
print(results)
the time.sleep(1) is only there because there is only a small amount of work and one worker might grab it all, so i am forcing every worker to wait for the others, you should remove it, the result is as follows:
13228 ('id', 7216)
11376 ('item_id', 3277841)
15580 ('Date', '2019-01-04T00:00:00.000Z')
10712 ('start_lat', -56.92)
11376 ('End_lat', -59.87)
13228 ('start_lon', 45.87)
10712 ('End_lon', 44.67)
it seems like your "example" dataframe is transposed, but you just have to construct the dataframe correctly, i'd recommend you first run the code serially with iterrows, before running it across multiple cores.
obviously sending data to the workers and back from them takes time, so make sure each worker is doing a lot of computational work and not just sending it back to the parent process.
Result Array is displayed as empty after trying to append values into it.
I have even declared result as global inside function.
Any suggestions?
Error Image
try this
res= []
inputData = [a,b,c,d]
def function(data):
values = [some_Number_1, some_Number_2]
return values
def parallel_run(function, inputData):
cpu_no = 4
if len(inputData) < cpu_no:
cpu_no = len(inputData)
p = multiprocessing.Pool(cpu_no)
global resultsAr
resultsAr = p.map(function, inputData, chunksize=1)
p.close()
p.join()
print ('res = ', res)
This happens since you're misunderstanding the basic point of multiprocessing: the child process spawned by multiprocessing.Process is separate from the parent process, and thus any modifications to data (including global variables) in the child process(es) are not propagated into the parent.
You will need to use multiprocessing-specific data types (queues and pipes), or the higher-level APIs provided by e.g. multiprocessing.Pool, to get data out of the child process(es).
For your application, the high-level recipe would be
def square(v):
return v * v
def main():
arr = [1, 2, 3, 4, 5]
with multiprocessing.Pool() as p:
squared = p.map(square, arr)
print(squared)
– however you'll likely find that this is massively slower than not using multiprocessing due to the overheads involved in such a small task.
Welcome to StackOverflow, Suyash !
The problem is that multiprocessing.Process is, as its name says, a separate process. You can imagine it almost as if you're running your script again from the terminal, with very little connection to the mother script.
Therefore, it has its own copy of the result array, which it modifies and prints.
The result in the "main" process is unmodified.
To convince yourself of this, try to print id(res) in both __main__ and in square(). You'll see they are different.
Is there a way to reduce memory consumption when working with Python's pool.map?
To give a short example: worker() does some heavy lifting and returns a larger array...
def worker():
# cpu time intensive tasks
return large_array
...and a Pool maps over some large sequence:
with mp.Pool(mp.cpu_count()) as p:
result = p.map(worker, large_sequence)
Considering this setup, obviously, result will allocate a large portion of the system's memory. However, the final operation on the result is:
final_result = np.sum(result, axis=0)
Thus, NumPy effectively does nothing else than reducing with a sum operation on the iterable:
final_result = reduce(lambda x, y: x + y, result)
This, of course, would make it possible to consume results of pool.map as they come in and garbage-collecting them after reducing to eliminate the need of storing all the values first.
I could write some mp.queue now where results go into and then write some queue-consuming worker that sums up the results but this would (1) require significantly more lines of code and (2) feel like a (potentially slower) hack-around to me rather than clean code.
Is there a way to reduce results returned by a mp.Pool operation directly as they come in?
The iterator mappers imap and imap_unordered seem to do the trick:
#!/usr/bin/env python3
import multiprocessing
import numpy as np
def worker( a ):
# cpu time intensive tasks
large_array = np.ones((20,30))+a
return large_array
if __name__ == '__main__':
arraysum = np.zeros((20,30))
large_sequence = range(20)
num_cpus = multiprocessing.cpu_count()
with multiprocessing.Pool( processes=num_cpus ) as p:
for large_array in p.imap_unordered( worker, large_sequence ):
arraysum += large_array
My problem is to execute something like :
multicore_apply(serie, func)
to run
So I tried to create a function doing it :
function used to run the apply method in a process :
def adaptator(func, queue) :
serie = queue.get().apply(func)
queue.put(serie)
the process management :
def parallel_apply(ncores, func, serie) :
series = [serie[i::ncores] for i in range(ncores)]
queues =[Queue() for i in range(ncores)]
for _serie, queue in zip(series, queues) :
queue.put(_serie)
result = []
jobs = []
for i in range(ncores) :
jobs.append(process(target = adaptator, args = (func, queues[i])))
for job in jobs :
job.start()
for queue, job in zip(queues, jobs) :
job.join()
result.append(queue.get())
return pd.concat(result, axis = 0).sort_index()
I know the i::ncores is not optimized but actually it's not the problem :
if the input len is greater than 30000 the processes never stop...
Is that a misunderstanding of Queue()?
I don't want to use multiprocessing.map : the func to apply is a method from a class very complex and with a pretty big size, so shared memory make it just too slow. Here I want to pass it in a queue when the problem of process will be solved.
Thank you for your advices
May be that will helps - you can use multiprocessing lib.
Your multicore_apply(serie, func) should look like:
from multiprocessing import Pool
pool = Pool()
pool.map(func, series)
pool.terminate()
You can specify count of process to be created like this pool = Pool(6), by default it equals to count of cores on the machine.
After many nights of intense search, I solved the problem with a post on the python development website about the max size of an object in a queue : the problem was here. I used another post on stackoverflow found here :
then I done the following program, but not as efficient as expected for large objects. I will do the same available for every axis.
Note this version allows to use complex class as function argument, that I cannot do with pool.map
def adaptator(series, results, ns, i) :
serie = series[i]
func = ns.func
result = serie.apply(func)
results[i] = result
def parallel_apply(ncores, func, serie) :
series = pd.np.array_split(serie, ncores, axis = 0)
M = Manager()
s_series = M.list()
s_series.extend(series)
results = M.list()
results.extend([None]*ncores)
ns = M.Namespace()
ns.func = func
jobs = []
for i in range(ncores) :
jobs.append(process(target = adaptator, args = (s_series, results, ns, i)))
for job in jobs :
job.start()
for job in jobs :
job.join()
print(results)
So if you put large objects between queues, Ipython freezes
I have a dataset df of trader transactions.
I have 2 levels of for loops as follows:
smartTrader =[]
for asset in range(len(Assets)):
df = df[df['Assets'] == asset]
# I have some more calculations here
for trader in range(len(df['TraderID'])):
# I have some calculations here, If trader is successful, I add his ID
# to the list as follows
smartTrader.append(df['TraderID'][trader])
# some more calculations here which are related to the first for loop.
I would like to parallelise the calculations for each asset in Assets, and I also want to parallelise the calculations for each trader for every asset. After ALL these calculations are done, I want to do additional analysis based on the list of smartTrader.
This is my first attempt at parallel processing, so please be patient with me, and I appreciate your help.
If you use pathos, which provides a fork of multiprocessing, you can easily nest parallel maps. pathos is built for easily testing combinations of nested parallel maps -- which are direct translations of nested for loops.
It provides a selection of maps that are blocking, non-blocking, iterative, asynchronous, serial, parallel, and distributed.
>>> from pathos.pools import ProcessPool, ThreadPool
>>> amap = ProcessPool().amap
>>> tmap = ThreadPool().map
>>> from math import sin, cos
>>> print amap(tmap, [sin,cos], [range(10),range(10)]).get()
[[0.0, 0.8414709848078965, 0.9092974268256817, 0.1411200080598672, -0.7568024953079282, -0.9589242746631385, -0.27941549819892586, 0.6569865987187891, 0.9893582466233818, 0.4121184852417566], [1.0, 0.5403023058681398, -0.4161468365471424, -0.9899924966004454, -0.6536436208636119, 0.2836621854632263, 0.9601702866503661, 0.7539022543433046, -0.14550003380861354, -0.9111302618846769]]
Here this example uses a processing pool and a thread pool, where the thread map call is blocking, while the processing map call is asynchronous (note the get at the end of the last line).
Get pathos here: https://github.com/uqfoundation
or with:
$ pip install git+https://github.com/uqfoundation/pathos.git#master
Nested parallelism can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
Assume you want to parallelize the following nested program
def inner_calculation(asset, trader):
return trader
def outer_calculation(asset):
return asset, [inner_calculation(asset, trader) for trader in range(5)]
inner_results = []
outer_results = []
for asset in range(10):
outer_result, inner_result = outer_calculation(asset)
outer_results.append(outer_result)
inner_results.append(inner_result)
# Then you can filter inner_results to get the final output.
Bellow is the Ray code parallelizing the above code:
Use the #ray.remote decorator for each function that we want to execute concurrently in its own process. A remote function returns a future (i.e., an identifier to the result) rather than the result itself.
When invoking a remote function f() the remote modifier, i.e., f.remote()
Use the ids_to_vals() helper function to convert a nested list of ids to values.
Note the program structure is identical. You only need to add remote and then convert the futures (ids) returned by the remote functions to values using the ids_to_vals() helper function.
import ray
ray.init()
# Define inner calculation as a remote function.
#ray.remote
def inner_calculation(asset, trader):
return trader
# Define outer calculation to be executed as a remote function.
#ray.remote(num_return_vals = 2)
def outer_calculation(asset):
return asset, [inner_calculation.remote(asset, trader) for trader in range(5)]
# Helper to convert a nested list of object ids to a nested list of corresponding objects.
def ids_to_vals(ids):
if isinstance(ids, ray.ObjectID):
ids = ray.get(ids)
if isinstance(ids, ray.ObjectID):
return ids_to_vals(ids)
if isinstance(ids, list):
results = []
for id in ids:
results.append(ids_to_vals(id))
return results
return ids
outer_result_ids = []
inner_result_ids = []
for asset in range(10):
outer_result_id, inner_result_id = outer_calculation.remote(asset)
outer_result_ids.append(outer_result_id)
inner_result_ids.append(inner_result_id)
outer_results = ids_to_vals(outer_result_ids)
inner_results = ids_to_vals(inner_result_ids)
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Probably threading, from standard python library, is most convenient approach:
import threading
def worker(id):
#Do you calculations here
return
threads = []
for asset in range(len(Assets)):
df = df[df['Assets'] == asset]
for trader in range(len(df['TraderID'])):
t = threading.Thread(target=worker, args=(trader,))
threads.append(t)
t.start()
#add semaphore here if you need synchronize results for all traders.
Instead of using for, use map:
import functools
smartTrader =[]
m=map( calculations_as_a_function,
[df[df['Assets'] == asset] \
for asset in range(len(Assets))])
functools.reduce(smartTradder.append, m)
From then on, you can try different parallel map implementations s.a. multiprocessing's, or stackless'