I have a script that loops over a pandas dataframe and outputs GIS data to a geopackage based on some searches and geometry manipulation. It works when I use a for loop but with over 4k records it takes a while. Since I have it built as it's own function that returns what I need based on a row iteration I tried to run it with multiprocessing with:
import pandas as pd, bwe_mapping
from multiprocessing import Pool
#Sample dataframe
bwes = [['id', 7216],['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92], ['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwedf = pd.read_csv(bwes)
geopackage = "datalocation\geopackage.gpkg"
tracklayer = "tracks"
if __name__=='__main__':
def task(item):
bwe_mapping.map_bwe(item, geopackage, tracklayer)
pool = Pool()
for index, row in bwedf.iterrows():
task(row)
with Pool() as pool:
for results in pool.imap_unordered(task, bwedf.iterrows()):
print(results)
When I run this my Task manager populates with 16 new python tasks but no sign that anything is being done. Would it be better to use numpy.array.split() to break up my pandas df into 4 or 8 smaller ones and run the for index, row in bwedf.iterrows(): for each dataframe on it's own processor?
No one process needs to be done in any order; as long as I can store the outputs, which are geopanda dataframes, into a list to concatenate into geopackage layers at the end.
Should I have put the for loop in the function and just passed it the whole dataframe and gis data to search?
if you are running on windows/macOS then it's going to use spawn to create the workers, which means that any child MUST find the function it is going to execute when it imports your main script.
your code has the function definition inside your if __name__=='__main__': so the children don't have access to it.
simply moving the function def to before if __name__=='__main__': will make it work.
what is happening is that each child is crashing when it tries to run a function because it never saw its definition.
minimal code to reproduce the problem:
from multiprocessing import Pool
if __name__ == '__main__':
def task(item):
print(item)
return item
pool = Pool()
with Pool() as pool:
for results in pool.imap_unordered(task, range(10)):
print(results)
and the solution is to move the function definition to before the if __name__=='__main__': line.
Edit: now to iterate on rows in a dataframe, this simple example demonstrates how to do it, note that iterrows returns an index and a row, which is why it is unpacked.
import os
import pandas as pd
from multiprocessing import Pool
import time
# Sample dataframe
bwes = [['id', 7216], ['item_id', 3277841], ['Date', '2019-01-04T00:00:00.000Z'], ['start_lat', -56.92],
['start_lon', 45.87], ['End_lat', -59.87], ['End_lon', 44.67]]
bwef = pd.DataFrame(bwes)
def task(item):
time.sleep(1)
index, row = item
# print(os.getpid(), tuple(row))
return str(os.getpid()) + " " + str(tuple(row))
if __name__ == '__main__':
with Pool() as pool:
for results in pool.imap_unordered(task, bwef.iterrows()):
print(results)
the time.sleep(1) is only there because there is only a small amount of work and one worker might grab it all, so i am forcing every worker to wait for the others, you should remove it, the result is as follows:
13228 ('id', 7216)
11376 ('item_id', 3277841)
15580 ('Date', '2019-01-04T00:00:00.000Z')
10712 ('start_lat', -56.92)
11376 ('End_lat', -59.87)
13228 ('start_lon', 45.87)
10712 ('End_lon', 44.67)
it seems like your "example" dataframe is transposed, but you just have to construct the dataframe correctly, i'd recommend you first run the code serially with iterrows, before running it across multiple cores.
obviously sending data to the workers and back from them takes time, so make sure each worker is doing a lot of computational work and not just sending it back to the parent process.
Related
I have two functions that I want to run concurrently to check performance, now a days I'm running one after another and it's taking quite some time.
Here it's how I'm running
import pandas as pd
import threading
df = pd.read_csv('data/Detalhado_full.csv', sep=',', dtype={'maquina':str})
def gerar_graph_36():
df_ordered = df.query(f'maquina=="3.6"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
oee = df_ordered['oee'].iloc[-1:].iloc[0]
return oee
def gerar_graph_31():
df_ordered = df.query(f'maquina=="3.1"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
oee = df_ordered['oee'].iloc[-1:].iloc[0]
return oee
oee_36 = gerar_graph_36()
oee_31 = gerar_graph_31()
print(oee_36, oee_31)
I tried to apply threading using this statement but it's not returning the variable, instead it's printing None value
print(oee_31, oee_36) -> Expecting: 106.3 99.7 // Returning None None
oee_31 = threading.Thread(target=gerar_graph_31, args=()).start()
oee_36 = threading.Thread(target=gerar_graph_36, args=()).start()
print(oee_31, oee_36)
For checking purpose, If I use the command below, returns 3 as expected
print(threading.active_count())
I need the return oee value from the function, something like 103.8.
Thanks in advance!!
Ordinarily creatign a new thread and starting it is not like calling a function which returns a variable: the Thread.start() call just "starts the code of the other thread", and returns imediatelly.
To colect results in the other threads you have to comunicate the computed results to the main thread using some data structure. An ordinary list or dictionary could do, or one could use a queue.Queue.
If you want to have something more like a function call and be able to not modify the gerar_graph() functions, you could use the concurrent.futures module instead of threading: that is higher level code that will wrap your calls in a "future" object, and you will be able to check when each future is done and fetch the value returned by the function.
Otherwise, simply have a top-level variable containign a list, wait for your threads to finish up running (they stop when the function called by "target" returns), and collect the results:
import pandas as pd
import threading
df = pd.read_csv('data/Detalhado_full.csv', sep=',', dtype={'maquina':str})
results = []
def gerar_graph_36():
df_ordered = df.query(f'maquina=="3.6"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
oee = df_ordered['oee'].iloc[-1:].iloc[0]
results.append(oee)
def gerar_graph_31():
df_ordered = df.query(f'maquina=="3.1"')[['data', 'dia_semana', 'oee', 'ptg_ruins', 'prod_real_kg', 'prod_teorica_kg']].sort_values(by='data')
oee = df_ordered['oee'].iloc[-1:].iloc[0]
results.append(oee)
# We need to keep a reference to the threads themselves
# so that we can call both ".start()" (which always returns None)
# and ".join()" on them.
oee_31 = threading.Thread(target=gerar_graph_31); oee_31.start()
oee_36 = threading.Thread(target=gerar_graph_36); oee_36.start()
oee_31.join() # will block and return only when the task is done, but oee_36 will be running concurrently
oee_36.join()
print(results)
If you need more than 2 threads, (like all 36...), I strongly suggest using concurrent.futures: you can limit the number of workers to a number comparable to the logical CPUs you have. And, of course, manage your tasks and calls in a list or dictionary, instead of having a separate variable name for each.
I am trying to apply function on a large pd.dataframe on pyspark. My code was post below which uses multiprocessing.Pool but is not as fast as expected. It cost the same time as df.apply(f,axis=1).
There shall be some mistakes I didn't notice. I spend my day but find nothing out. That's why I finally come here for help.
def f(series):
# do something
return series
if __name__=='__main__':
#load(df)
output=pd.DataFrame()
pool = multiprocessing.Pool(8)
for name in df.colunms:
res=pool.apply_async(f,(df[name],),callback=logging.info("f with "+name))
output[name]=res.get()
pool.close()
pool.join()
After #Andriy Ivaneyko answers, I also tried this:
if __name__=='__main__':
#load(df)
output=pd.DataFrame()
res={}
pool = multiprocessing.Pool(8)
for name in df.colunms:
res[name]=pool.apply_async(f,(df[name],),callback=logging.info("f with "+name))
for name,val in res.items():
output[name]=val.get()
pool.close()
pool.join()
I change the number of cores from 4 to 8 to 16, however the function consumes almost the same time.
The get() method blocks until the function is completed, that's the reason for not getting performance benefit. Create list of the ApplyResult objects ( returned by apply_async) and perform get when you finish iteration over df.colunms
# ... Code before
apply_results = []
for name in df.colunms:
res=pool.apply_async(f,(df[name],),callback=logging.info("f with "+name))
apply_results[name] = res
for name, res in apply_results.items():
output[name]=res.get()
# ... Code after
I have a python function that requires the user enter an array of data; at which point the function works on the data and produces two arrays which are returned to the main program. In this question I am including a greatly simplified example from which I hope I can solicit some advice or help. I have created a function titled "Test_Function" that requires the programmer supply an array of data titled "Array", which in this case has a length of 5000. The function works on the data and produces two sets of arrays titled "Result1" and "Result2" which are returned to the user in the main program as the variables "Res1" and "Res2". I would like to thread the function so that the function "Test_Function" so that one thread will work on half of the input array and the other thread will work on the other half and then combine them back together in the main program for both output arrays "Result1" and "Result2"/"Res1" and "Res2". I described a scenario where I would produce two threads, but I would like to make it generic enough so that it could run a user defined number of threads. How do I do this with the thread functionality?
import numpy as np
def Test_Function(Array):
Result1 = Array*np.pi*(1-Array)
Result2 = Array+478.5 + (1/Array)
return(np.array(Result1,dtype=float), np.array(Result2,dtype=float))
#---------------------------------------------------------------------------
if __name__ == "__main__":
Dependent_Array = np.linspace(1.0,5000.0,num=5000)
Res1, Res2 = Test_Function(Dependent_Array)
# eof
You could use a ThreadPool to which assign async tasks:
import numpy as np
from multiprocessing.pool import ThreadPool
def sub1(Array):
return Array * np.pi * (1-Array)
def sub2(Array):
return Array + 478.5 + (1/Array)
def Test_Function(Array):
pool = ThreadPool(processes=2)
res1 = pool.apply_async(sub1, (Array,))
res2 = pool.apply_async(sub2, (Array,))
return (np.array(res1.get(),dtype=float), np.array(res2.get(),dtype=float))
if __name__ == "__main__":
Dependent_Array = np.linspace(1.0,5000.0,num=5000)
Res1, Res2 = Test_Function(Dependent_Array)
This module is not very well documented, but, basically, it creates a pool of workers to which you can assign subroutines to execute. The method get() of the result object created by apply_async will return the result only when the corresponding thread has finished its operations.
I have a dataset df of trader transactions.
I have 2 levels of for loops as follows:
smartTrader =[]
for asset in range(len(Assets)):
df = df[df['Assets'] == asset]
# I have some more calculations here
for trader in range(len(df['TraderID'])):
# I have some calculations here, If trader is successful, I add his ID
# to the list as follows
smartTrader.append(df['TraderID'][trader])
# some more calculations here which are related to the first for loop.
I would like to parallelise the calculations for each asset in Assets, and I also want to parallelise the calculations for each trader for every asset. After ALL these calculations are done, I want to do additional analysis based on the list of smartTrader.
This is my first attempt at parallel processing, so please be patient with me, and I appreciate your help.
If you use pathos, which provides a fork of multiprocessing, you can easily nest parallel maps. pathos is built for easily testing combinations of nested parallel maps -- which are direct translations of nested for loops.
It provides a selection of maps that are blocking, non-blocking, iterative, asynchronous, serial, parallel, and distributed.
>>> from pathos.pools import ProcessPool, ThreadPool
>>> amap = ProcessPool().amap
>>> tmap = ThreadPool().map
>>> from math import sin, cos
>>> print amap(tmap, [sin,cos], [range(10),range(10)]).get()
[[0.0, 0.8414709848078965, 0.9092974268256817, 0.1411200080598672, -0.7568024953079282, -0.9589242746631385, -0.27941549819892586, 0.6569865987187891, 0.9893582466233818, 0.4121184852417566], [1.0, 0.5403023058681398, -0.4161468365471424, -0.9899924966004454, -0.6536436208636119, 0.2836621854632263, 0.9601702866503661, 0.7539022543433046, -0.14550003380861354, -0.9111302618846769]]
Here this example uses a processing pool and a thread pool, where the thread map call is blocking, while the processing map call is asynchronous (note the get at the end of the last line).
Get pathos here: https://github.com/uqfoundation
or with:
$ pip install git+https://github.com/uqfoundation/pathos.git#master
Nested parallelism can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
Assume you want to parallelize the following nested program
def inner_calculation(asset, trader):
return trader
def outer_calculation(asset):
return asset, [inner_calculation(asset, trader) for trader in range(5)]
inner_results = []
outer_results = []
for asset in range(10):
outer_result, inner_result = outer_calculation(asset)
outer_results.append(outer_result)
inner_results.append(inner_result)
# Then you can filter inner_results to get the final output.
Bellow is the Ray code parallelizing the above code:
Use the #ray.remote decorator for each function that we want to execute concurrently in its own process. A remote function returns a future (i.e., an identifier to the result) rather than the result itself.
When invoking a remote function f() the remote modifier, i.e., f.remote()
Use the ids_to_vals() helper function to convert a nested list of ids to values.
Note the program structure is identical. You only need to add remote and then convert the futures (ids) returned by the remote functions to values using the ids_to_vals() helper function.
import ray
ray.init()
# Define inner calculation as a remote function.
#ray.remote
def inner_calculation(asset, trader):
return trader
# Define outer calculation to be executed as a remote function.
#ray.remote(num_return_vals = 2)
def outer_calculation(asset):
return asset, [inner_calculation.remote(asset, trader) for trader in range(5)]
# Helper to convert a nested list of object ids to a nested list of corresponding objects.
def ids_to_vals(ids):
if isinstance(ids, ray.ObjectID):
ids = ray.get(ids)
if isinstance(ids, ray.ObjectID):
return ids_to_vals(ids)
if isinstance(ids, list):
results = []
for id in ids:
results.append(ids_to_vals(id))
return results
return ids
outer_result_ids = []
inner_result_ids = []
for asset in range(10):
outer_result_id, inner_result_id = outer_calculation.remote(asset)
outer_result_ids.append(outer_result_id)
inner_result_ids.append(inner_result_id)
outer_results = ids_to_vals(outer_result_ids)
inner_results = ids_to_vals(inner_result_ids)
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Probably threading, from standard python library, is most convenient approach:
import threading
def worker(id):
#Do you calculations here
return
threads = []
for asset in range(len(Assets)):
df = df[df['Assets'] == asset]
for trader in range(len(df['TraderID'])):
t = threading.Thread(target=worker, args=(trader,))
threads.append(t)
t.start()
#add semaphore here if you need synchronize results for all traders.
Instead of using for, use map:
import functools
smartTrader =[]
m=map( calculations_as_a_function,
[df[df['Assets'] == asset] \
for asset in range(len(Assets))])
functools.reduce(smartTradder.append, m)
From then on, you can try different parallel map implementations s.a. multiprocessing's, or stackless'
I have a python process (2.7) that takes a key, does a bunch of calculations and returns a list of results. Here is a very simplified version.
I am using multiprocessing to create threads so this can be processed faster. However, my production data has several million rows and each loop takes progressively longer to complete. The last time I ran this each loop took over 6 minutes to complete while at the start it takes a second or less. I think this is because all the threads are adding results into resultset and that continues to grow until it contains all the records.
Is it possible to use multiprocessing to stream the results of each thread (a list) into a csv or batch resultset so it writes to the csv after a set number of rows?
Any other suggestions for speeding up or optimizing the approach would be appreciated.
import numpy as np
import pandas as pd
import csv
import os
import multiprocessing
from multiprocessing import Pool
global keys
keys = [1,2,3,4,5,6,7,8,9,10,11,12]
def key_loop(key):
test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
test_list = test_df.ix[0].tolist()
return test_list
if __name__ == "__main__":
try:
pool = Pool(processes=8)
resultset = pool.imap(key_loop,(key for key in keys) )
loaddata = []
for sublist in resultset:
loaddata.append(sublist)
with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
writer = csv.writer(file)
for listitem in loaddata:
writer.writerow(listitem)
file.close
print "finished load"
except:
print 'There was a problem multithreading the key Pool'
raise
Here is an answer consolidating the suggestions Eevee and I made
import numpy as np
import pandas as pd
import csv
from multiprocessing import Pool
keys = [1,2,3,4,5,6,7,8,9,10,11,12]
def key_loop(key):
test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
test_list = test_df.ix[0].tolist()
return test_list
if __name__ == "__main__":
try:
pool = Pool(processes=8)
resultset = pool.imap(key_loop, keys, chunksize=200)
with open("C:\\Users\\mp_streaming_test.csv", 'w') as file:
writer = csv.writer(file)
for listitem in resultset:
writer.writerow(listitem)
print "finished load"
except:
print 'There was a problem multithreading the key Pool'
raise
Again, the changes here are
Iterate over resultset directly, rather than needlessly copying it to a list first.
Feed the keys list directly to pool.imap instead of creating a generator comprehension out of it.
Providing a larger chunksize to imap than the default of 1. The larger chunksize reduces the cost of the inter-process communication required to pass the values inside keys to the sub-processes in your pool, which can give big performance boosts when keys is very large (as it is in your case). You should experiment with different values for chunksize (try something considerably larger than 200, like 5000, etc.) and see how it affects performance. I'm making a wild guess with 200, though it should definitely do better than 1.
The following very simple code collects many worker's data into a single CSV file. A worker takes a key and returns a list of rows. The parent processes several keys at a time, using several workers. When each key is done, the parent writes output rows, in order, to a CSV file.
Be careful about order. If each worker writes to the CSV file directly, they'll be out of order or will stomp on each others. Having each worker write to its own CSV file will be fast, but will require merging all the data files together afterward.
source
import csv, multiprocessing, sys
def worker(key):
return [ [key, 0], [key+1, 1] ]
pool = multiprocessing.Pool() # default 1 proc per CPU
writer = csv.writer(sys.stdout)
for resultset in pool.imap(worker, [1,2,3,4]):
for row in resultset:
writer.writerow(row)
output
1,0
2,1
2,0
3,1
3,0
4,1
4,0
5,1
My bet would be that dealing with the large structure at once using appending is what makes it slow. What I usually do is that I open up as many files as cores and use modulo to write to each file immediately such that the streams don't cause trouble compared to if you'd direct them all into the same file (write errors), and also not trying to store huge data. Probably not the best solution, but really quite easy. In the end you just merge back the results.
Define at start of the run:
num_cores = 8
file_sep = ","
outFiles = [open('out' + str(x) + ".csv", "a") for x in range(num_cores)]
Then in the key_loop function:
def key_loop(key):
test_df = pd.DataFrame(np.random.randn(1,4), columns=['a','b','c','d'])
test_list = test_df.ix[0].tolist()
outFiles[key % num_cores].write(file_sep.join([str(x) for x in test_list])
+ "\n")
Afterwards, don't forget to close: [x.close() for x in outFiles]
Improvements:
Iterate over blocks like mentioned in the comments. Writing/processing 1 line at a time is going to be much slower than writing blocks.
Handling errors (closing of files)
IMPORTANT: I'm not sure of the meaning of the "keys" variable, but the numbers there will not allow modulo to ensure you have each process write to each individual stream (12 keys, modulo 8 will make 2 processes write to the same file)