I built a little function that gathers some data using a 3rd-party API. Call it MyFunc(Symbol, Field); it returns some info based on the symbol given.
The idea was to fill a Pandas df with the returned value using something like:
df['MyNewField'] = df.apply(lambda x: MyFunc(x, 'FieldName'))
All this works, BUT each query takes around 100 ms to run. That seems fast until you realize you may have 30,000 or more to do (3,000 symbols with 10 fields each, for starters).
I was wondering if there is a way to run this concurrently, since each request is independent. I am not looking for multiprocessing libraries as such, but rather a way to issue multiple queries to the 3rd party at the same time to reduce the time taken to gather all the data. (I suppose this will also change the structure used to store the received data; I do not mind dropping apply and the dataframe at first and instead saving the data as it arrives in a text file or dictionary-type structure.)
NOTE: While I wish I could change MyFunc to request multiple symbols/fields at once, this cannot be done in all cases (meaning some fields do not allow it and a single request is the only way to go). This is why I am looking at concurrent execution and not at changing MyFunc.
Thanks!
There are many libraries to parallelize a pandas dataframe. However, I prefer the native multiprocessing Pool to do the same. I also use tqdm along with it to track progress.
import numpy as np
import pandas as pd
from multiprocessing import cpu_count, Pool

cores = 4  # Number of CPU cores on your system
partitions = cores  # Define as many partitions as you want

def partition(data, num_partitions):
    partition_len = int(len(data) / num_partitions)
    partitions = []
    num_rows = 0
    for i in range(num_partitions - 1):
        partition = data.iloc[i * partition_len:i * partition_len + partition_len]
        num_rows = num_rows + partition_len
        partitions.append(partition)
    partitions.append(data.iloc[num_rows:len(data)])
    return partitions

def parallelize(data, func):
    data_split = partition(data, partitions)
    pool = Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

df['MyNewField'] = parallelize(df['FieldName'], MyFunc)
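Since the original use case is I/O bound (each call mostly waits on the remote API rather than the CPU), a thread pool is also worth considering. Below is a minimal sketch, assuming MyFunc and a 'Symbol' column as described in the question; the column name and worker count are placeholders of mine:

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def fetch(symbol):
    # Each MyFunc call is an independent network request, so many of
    # them can be in flight at the same time.
    return {'Symbol': symbol, 'MyNewField': MyFunc(symbol, 'FieldName')}

symbols = df['Symbol'].tolist()  # assumes the symbols live in a 'Symbol' column
with ThreadPoolExecutor(max_workers=20) as executor:
    rows = list(executor.map(fetch, symbols))

result = pd.DataFrame(rows)  # can then be merged back onto df on 'Symbol'

Because the work is waiting on the network rather than computing, threads sidestep the cost of copying the dataframe to worker processes.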
I need to rewrite a simple for loop with MPI because each step is time consuming. Let's say I have a list containing several np.arrays and I want to apply some computation to each array. For example:
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

dat = [np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)]  # real data would be much larger

result = []
for item in dat:
    result.append(myFun(item))
Instead of using the simple for loop above, I want to use MPI to run the 'for loop' part of the above code in parallel on 24 different nodes. I also want the order of items in the result list to follow the same order as the dat list.
Note: the data is read from another file, which can be treated as 'fixed' for each processor.
I haven't used MPI before, so I have been stuck on this for a while.
For simplicity, let us assume that the master process (the process with rank = 0) is the one that reads the entire file from disk into memory. This problem can be solved knowing only the following MPI routines: Get_size(), Get_rank(), scatter, and gather.
The Get_size():
Returns the number of processes in the communicator. It will return
the same number to every process.
The Get_rank():
Determines the rank of the calling process in the communicator.
In MPI, each process is assigned a rank that varies from 0 to N - 1, where N is the total number of processes running.
The scatter:
MPI_Scatter involves a designated root process sending data to all
processes in a communicator. The primary difference between MPI_Bcast
and MPI_Scatter is small but important. MPI_Bcast sends the same piece
of data to all processes while MPI_Scatter sends chunks of an array to
different processes.
and the gather:
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading
elements from one process to many processes, MPI_Gather takes elements
from many processes and gathers them to one single process.
Obviously, you should first follow a tutorial and read the MPI documentation to understand its parallel programming model and its routines. Otherwise, you will find it very hard to understand how it all works. That being said, your code could look like the following:
from mpi4py import MPI

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

comm = MPI.COMM_WORLD
rank = comm.Get_rank()  # get your process ID

data = None  # init the data
if rank == 0:  # The master is the only process that reads the file
    data = ...  # something read from file

# Divide the data among processes
data = comm.scatter(data, root=0)

result = []
for item in data:
    result.append(myFun(item))

# Send the results back to the master process
newData = comm.gather(result, root=0)
In this way, each process will work (in parallel) on only a certain chunk of the data. After finishing its work, each process sends its results back to the master process (i.e., comm.gather(result, root=0)). This is just a toy example; it is now up to you to improve it according to your testing environment and code.
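One detail the toy example leaves out: comm.scatter expects the root to provide exactly one chunk per rank, so rank 0 has to split whatever it read into comm.Get_size() pieces before scattering. A possible sketch of that split (np.array_split is my choice here, not part of the original answer):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    dat = [np.random.rand(3, 2) for _ in range(100)]  # stand-in for the data read from file
    data = np.array_split(dat, comm.Get_size())       # exactly one sub-array per rank
else:
    data = None
data = comm.scatter(data, root=0)                     # each rank now holds only its own chunk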
You could either go the low-level MPI way as shown in the answer by @dreamcrash, or you could go for a more Pythonic solution that uses an executor pool very similar to the one provided by the standard Python multiprocessing module.
First, you need to turn your code into a more functional-style one by noticing that you are actually doing a map operation, which applies myFun to each element of dat:
import numpy as np

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

dat = [
    np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
]  # real data would be much larger

result = map(myFun, dat)
map here runs sequentially in one Python interpreter process.
To run that map in parallel with the multiprocessing module, you only need to instantiate a Pool object and then call its map() method in place of the Python map() function:
import numpy as np
from multiprocessing import Pool

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

if __name__ == '__main__':
    dat = [
        np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
    ]  # real data would be much larger

    with Pool() as pool:
        result = pool.map(myFun, dat)
Here, Pool() creates a new executor pool with as many interpreter processes as there are logical CPUs as seen by the OS. Calling the map() method of the pool runs the mapping in parallel by sending items to the different processes in the pool and waiting for completion. Since the worker processes import the Python script as a module, it is important to have the code that was previously at the top level moved under the if __name__ == '__main__': conditional so it doesn't run in the workers too.
Using multiprocessing.Pool() is very convenient because it requires only a slight change of the original code and the module handles for you all the work scheduling and the required data movement to and from the worker processes. The problem with multiprocessing is that it only works on a single host. Fortunately, mpi4py provides a similar interface through the mpi4py.futures.MPIPoolExecutor class:
import numpy as np
from mpi4py.futures import MPIPoolExecutor

def myFun(x):
    return x + 2  # simple example, the real one would be complicated

if __name__ == '__main__':
    dat = [
        np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2), np.random.rand(3,2)
    ]  # real data would be much larger

    with MPIPoolExecutor() as pool:
        result = pool.map(myFun, dat)
Like with the Pool object from the multiprocessing module, the MPI pool executor handles for you all the work scheduling and data movement.
There are two ways to run the MPI program. The first one starts the script as an MPI singleton and then uses the MPI process control facility to spawn a child MPI job with all the pool workers:
mpiexec -n 1 python program.py
You also need to specify the MPI universe size (the total number of MPI ranks in both the main and all child jobs). The specific way of doing so differs between the implementations, so you need to consult your implementation's manual.
The second option is to launch directly the desired number of MPI ranks and have them execute the mpi4py.futures module itself with the script name as argument:
mpiexec -n 24 python -m mpi4py.futures program.py
Keep in mind that no matter which way you launch the script, one MPI rank will be reserved for the controller and will not run mapping tasks. Since you are aiming to run on 24 hosts, you should have plenty of CPU cores and can probably afford to have one reserved. Or you could instruct MPI to oversubscribe the first host with one more rank.
One thing to note with both multiprocessing.Pool and mpi4py.futures.MPIPoolExecutor is that the map() method guarantees the order of the items in the output array, but it doesn't guarantee the order in which the different items are evaluated. This shouldn't be a problem in most cases.
A word of advice. If your data is actually chunks read from a file, you may be tempted to do something like this:
if __name__ == '__main__':
    data = read_chunks()
    with MPIPoolExecutor() as p:
        result = p.map(myFun, data)
Don't do that. Instead, if possible, e.g., if enabled by the presence of a shared (and hopefully parallel) filesystem, delegate the reading to the workers:
NUM_CHUNKS = 100

def myFun(chunk_num):
    # You may need to pass the value of NUM_CHUNKS to read_chunk()
    # for it to be able to seek to the right position in the file
    data = read_chunk(NUM_CHUNKS, chunk_num)
    return ...

if __name__ == '__main__':
    chunk_nums = range(NUM_CHUNKS)  # 100 chunks
    with MPIPoolExecutor() as p:
        result = p.map(myFun, chunk_nums)
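For completeness, a hypothetical read_chunk for the simple case of a binary file of fixed-size records could look like the sketch below; the name, signature, file path, and record size are all assumptions, not part of the original answer:

import os

def read_chunk(num_chunks, chunk_num, path='data.bin', record_size=8):
    # Each worker seeks straight to its own slice instead of reading the whole file.
    n_records = os.path.getsize(path) // record_size
    per_chunk = (n_records + num_chunks - 1) // num_chunks  # ceiling division
    start = chunk_num * per_chunk
    count = max(0, min(per_chunk, n_records - start))
    with open(path, 'rb') as f:
        f.seek(start * record_size)
        return f.read(count * record_size)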
In my case I have several files in S3 and a custom function that reads each one of them and processes it using all threads. To simplify the example, I just generate a dataframe df and assume that my function is tsfresh.extract_features, which uses multiprocessing.
Generate Data
import pandas as pd
from tsfresh import extract_features
from tsfresh.examples.robot_execution_failures import download_robot_execution_failures, \
    load_robot_execution_failures

download_robot_execution_failures()
ts, y = load_robot_execution_failures()

df = []
for i in range(5):
    tts = ts.copy()
    tts["id"] += 88 * i
    df.append(tts)

df = pd.concat(df, ignore_index=True)
Function
def fun(df, n_jobs):
    extracted_features = extract_features(df,
                                          column_id="id",
                                          column_sort="time",
                                          n_jobs=n_jobs)
Cluster
import dask
from dask.distributed import Client, progress
from dask import compute, delayed
from dask_cloudprovider import FargateCluster

my_vpc = ...  # your vpc
my_subnets = ...  # your subnets
cpu = 2
ram = 4

cluster = FargateCluster(n_workers=1,
                         image='rpanai/feats-worker:2020-08-24',
                         vpc=my_vpc,
                         subnets=my_subnets,
                         worker_cpu=int(cpu * 1024),
                         worker_mem=int(ram * 1024),
                         cloudwatch_logs_group="my_log_group",
                         task_role_policies=['arn:aws:iam::aws:policy/AmazonS3FullAccess'],
                         scheduler_timeout='20 minutes')

cluster.adapt(minimum=1, maximum=4)
client = Client(cluster)
client
Using all worker threads (FAIL)
to_process = [delayed(fun)(df, cpu) for i in range(10)]
out = compute(to_process)
AssertionError: daemonic processes are not allowed to have children
Using only one thread (OK)
In this case it works fine but I'm wasting resources.
to_process = [delayed(fun)(df, 0) for i in range(10)]
out = compute(to_process)
Question
I know that for this particular function I could eventually write a custom distributor using multithreading and a few other tricks, but I'd like to distribute a job where every worker can take advantage of all its resources without my having to worry too much.
Update
The function was just an example; in reality it does some cleaning before the actual feature extraction and afterwards saves the result to S3.
def fun(filename, bucket_name, filename_out, n_jobs):
    df = pd.read_parquet(f"s3://{bucket_name}/{filename}")
    # do some cleaning
    extracted_features = extract_features(df,
                                          column_id="id",
                                          column_sort="time",
                                          n_jobs=n_jobs)
    extracted_features.to_parquet(f"s3://{bucket_name}/{filename_out}")
I can help answering your specific question for tsfresh, but if tsfresh was just a simple toy example that might not be what you want.
For tsfresh, you would typically not mix tsfresh's multiprocessing with dask, but let dask do all the handling. This means you start with a single dask.DataFrame (in your test case, you could just convert the pandas dataframe into a dask one; for your real use case you can read directly from S3, see the docs), and then distribute the feature extraction over the dask dataframe (the nice thing about the feature extraction is that it works independently on every time series, so we can generate a single job for every time series).
The current version of tsfresh (0.16.0) has a small helper function that will do this for you: see here.
In the next version, it might even be possible to just run extract_features on the dask dataframe directly.
I am not sure if this helps to solve your more general question. In my opinion, in most cases you do not want to mix dask's distribution and "local" multicore calculation, but rather let dask handle everything, because on a dask cluster you might not even know how many cores you will have on each of the machines (or you might only get a single one per job).
This means if your job can be distributed N times and each of them will start M sub-jobs, you just give "N x M" jobs to dask and let it figure out the rest (including data locality).
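As a rough illustration of "just give everything to dask" (the partition count and S3 path below are placeholders of mine, not from the answer):

import dask.dataframe as dd

# For the test case: convert the in-memory pandas dataframe into a dask one.
ddf = dd.from_pandas(df, npartitions=8)

# For the real use case: read the parquet files straight from S3 (requires s3fs),
# so every worker loads only its own partitions.
ddf = dd.read_parquet("s3://my-bucket/my-data/*.parquet")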
I have two large Pandas Dataframes (1GB+) with data that needs to be processed by multiple workers.
I'm able to perform the operations without issues in a toy example with much smaller Dataframes (DFs).
Below is my reproducible example.
I've tried several routes:
Using chunk: I am unable to take advantage of it. The DFs need to be sliced into specific pieces on an index before each piece is fed to the workers, and chunk can only slice them into pieces of arbitrary length.
Using starmap: This is what you see in the code below. Pre-slicing the DFs on the indexes and storing the pieces in an iterable. The pieces can be passed as small frames (or dicts) to the worker processes. This solution is not feasible at the sizes of my real DFs - the iterable never finishes being created. I've tried and failed using a generator/yield for starmap. I would appreciate an example of a workaround if this is an option.
Using imap: The entire input DFs end up going to each of the worker processes. I was able to use generators/yield through an intermediate function that would slice the DFs and make them available for each worker without having a huge iterable. But the process took longer than if I'd used a for loop; the overhead of passing the data to the workers was the bottleneck (see the sketch after the example code below).
I am ready to conclude that multiprocessing cannot be applied to a large data table that needs to be sliced before being sent to workers.
import random
import numpy as np
import pandas as pd
import multiprocessing

def df_slicer(df1, df2):
    while len(df1) > 0:
        value_to_slice_df_by = df1.iloc[0]['Number']  # Will use first found number as the index for this process.
        a = df1[df1['Number'] == value_to_slice_df_by].copy()
        b = df2[df2['Number'] == value_to_slice_df_by].copy()
        print('len(df1): {}, len(df2): {}'.format(len(a), len(b)))
        return [a, b]

if __name__ == '__main__':
    # These are large and will be pulled in from Pandas pickles.
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])

    iterable = [[df1.loc[df1['Number'] == i], df2.loc[df2['Number'] == i]] for i in list(np.unique(df1['Number']))]

    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)
    for res in pool.starmap(df_slicer, [[i[0], i[1]] for i in iterable]):
        result = res
        pass
    print('Done')
    pool.close(); pool.join()
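One possible workaround for the "overhead of passing the data to the workers" problem is to ship each dataframe to every worker only once, via a Pool initializer, and then send just the small group keys as tasks. This is a sketch of mine, not part of the question:

import multiprocessing
import numpy as np
import pandas as pd

_df1 = None
_df2 = None

def init_worker(df1, df2):
    # Runs once per worker process; each frame is transferred to each worker
    # a single time at start-up instead of once per task.
    global _df1, _df2
    _df1, _df2 = df1, df2

def work_on_group(number):
    a = _df1[_df1['Number'] == number]
    b = _df2[_df2['Number'] == number]
    return len(a), len(b)  # stand-in for the real per-slice processing

if __name__ == '__main__':
    df1 = pd.DataFrame(np.random.randint(3000, size=(100000, 2)), columns=['Number', 'Info_df1'])
    df2 = pd.DataFrame(np.random.randint(5000, size=(100000, 2)), columns=['Number', 'Info_df2'])
    keys = np.unique(df1['Number'])
    with multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1,
                              initializer=init_worker, initargs=(df1, df2)) as pool:
        for res in pool.imap_unordered(work_on_group, keys, chunksize=100):
            pass
    print('Done')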
I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high frequency I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
functionList = [func1,func2,func3,func4]
outputList = [func1out,func2out,func3out,func4out]
argsList = [func1args,func2args,func3args,func4args]
# number of threads
n = 3
functionSplit = np.array_split(np.array(functionList),n)
outputSplit = np.array_split(np.array(outputList),n)
argSplit = np.array_split(np.array(argsList),n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to the outputList and create a master dict of the outputs from each function. I will then loop through the output dict and create a dataframe with column ID numbers according to the information in each column (I already have this part worked out, I just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of the threading.Thread class and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call functions in a list according to their corresponding arguments!
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library as shown below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    for i in range(len(fnList)):
        # store each result under its output name
        d[outList[i]] = fnList[i](*argList[i])

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns, args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
You can use a server process to share the dictionary between these processes. To interact with the server process you need a Manager; you can then create a dictionary in the server process with manager.dict(). Once all processes have joined back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Do check out concurrent.futures and try the ProcessPoolExecutor for maintaining a pool of processes. You can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can take a function and an iterable and then process them in chunks in parallel to compute faster: the iterable is broken into separate chunks, these chunks are passed to the function in separate processes, and the results are then put back together.
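For example, here is a minimal concurrent.futures sketch in the spirit of the question; the stand-in functions and the assumption that each argsList entry is a tuple of positional arguments are mine:

from concurrent.futures import ProcessPoolExecutor

# Hypothetical stand-ins for the question's computationally expensive functions
def func1(a, b): return a + b
def func2(a, b): return a * b

functionList = [func1, func2]
argsList = [(1, 2), (3, 4)]            # one tuple of positional args per function
outputList = ['func1out', 'func2out']  # output names, as in the question
n = 3                                  # number of workers

if __name__ == '__main__':
    results = {}
    with ProcessPoolExecutor(max_workers=n) as executor:
        futures = {name: executor.submit(fn, *args)
                   for fn, args, name in zip(functionList, argsList, outputList)}
        for name, future in futures.items():
            results[name] = future.result()
    print(results)  # {'func1out': 3, 'func2out': 12}

Each result lands in a dict keyed by its output name, which can then be turned into the dataframe described in the question.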
I am using Python multiprocessing, more precisely
from multiprocessing import Pool
p = Pool(15)
args = [(df, config1), (df, config2), ...] #list of args - df is the same object in each tuple
res = p.map_async(func, args) #func is some arbitrary function
p.close()
p.join()
This approach has a huge memory consumption; eating up pretty much all my RAM (at which point it gets extremely slow, hence making the multiprocessing pretty useless). I assume the problem is that df is a huge object (a large pandas dataframe) and it gets copied for each process. I have tried using multiprocessing.Value to share the dataframe without copying
shared_df = multiprocessing.Value(pandas.DataFrame, df)
args = [(shared_df, config1), (shared_df, config2), ...]
(as suggested in Python multiprocessing shared memory), but that gives me TypeError: this type has no size (same as Sharing a complex object between Python processes?, to which I unfortunately don't understand the answer).
I am using multiprocessing for the first time and maybe my understanding is not (yet) good enough. Is multiprocessing.Value actually even the right thing to use in this case? I have seen other suggestions (e.g. queue) but am by now a bit confused. What options are there to share memory, and which one would be best in this case?
The first argument to Value is typecode_or_type. That is defined as:
typecode_or_type determines the type of the returned object: it is
either a ctypes type or a one character typecode of the kind used by
the array module. *args is passed on to the constructor for the type.
Emphasis mine. So, you simply cannot put a pandas dataframe in a Value, it has to be a ctypes type.
You could instead use a multiprocessing.Manager to serve your singleton dataframe instance to all of your processes. There are a few different ways to end up in the same place - probably the easiest is to just plop your dataframe into the manager's Namespace.
from multiprocessing import Manager
mgr = Manager()
ns = mgr.Namespace()
ns.df = my_dataframe
# now just give your processes access to ns, i.e. most simply
# p = Process(target=worker, args=(ns, work_unit))
Now your dataframe instance is accessible to any process that gets passed a reference to the Manager. Or just pass a reference to the Namespace, it's cleaner.
One thing I didn't/won't cover is events and signaling - if your processes need to wait for others to finish executing, you'll need to add that in. Here is a page with some Event examples which also cover with a bit more detail how to use the manager's Namespace.
(note that none of this addresses whether multiprocessing is going to result in tangible performance benefits, this is just giving you the tools to explore that question)
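To make the worker side concrete, here is a minimal runnable sketch of the Namespace approach; the worker function and the toy dataframe are mine, not part of the answer:

import pandas as pd
from multiprocessing import Manager, Process

def worker(ns, work_unit):
    # Every attribute access on ns goes through the manager process,
    # so grab the dataframe into a local variable once.
    df = ns.df
    print(work_unit, len(df))

if __name__ == '__main__':
    my_dataframe = pd.DataFrame({'a': range(10)})  # stand-in for the real dataframe
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = my_dataframe
    p = Process(target=worker, args=(ns, 'work unit 0'))
    p.start()
    p.join()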
You can use Array instead of Value for storing your dataframe.
The solution below converts a pandas dataframe to an object that stores its data in shared memory:
import numpy as np
import pandas as pd
import multiprocessing as mp
import ctypes
# the original dataframe is df; store the column/dtype pairs
df_dtypes_dict = dict(list(zip(df.columns, df.dtypes)))
# declare a shared Array with data from df
mparr = mp.Array(ctypes.c_double, df.values.reshape(-1))
# create a new df based on the shared array
df_shared = pd.DataFrame(np.frombuffer(mparr.get_obj()).reshape(df.shape),
columns=df.columns).astype(df_dtypes_dict)
If you now share df_shared across processes, no additional copies will be made. For your case:
pool = mp.Pool(15)

def fun(config):
    # df_shared is global to the script
    df_shared.apply(config)  # whatever compute you do with df/config

config_list = [config1, config2]

res = pool.map_async(fun, config_list)
pool.close()
pool.join()
This is also particularly useful if you use pandarallel, for example:
# this will not explode in memory
from pandarallel import pandarallel
pandarallel.initialize()
df_shared.parallel_apply(your_fun, axis=1)
Note: with this solution you end up with two dataframes (df and df_shared), which consume twice the memory and take a long time to initialise. It might be possible to read the data directly into shared memory.
Python 3.6 (at least) supports storing a pandas DataFrame as a multiprocessing.Value. See below for a working example:
import ctypes
import pandas as pd
from multiprocessing import Value
df = pd.DataFrame({'a': range(0,9),
'b': range(10,19),
'c': range(100,109)})
k = Value(ctypes.py_object)
k.value = df
print(k.value)
You can share a pandas dataframe between processes without any memory overhead by creating a data_handler child process. This process receives calls from the other children with specific data requests (i.e. a row, a specific cell, a slice, etc.) from your very large dataframe object. Only the data_handler process keeps your dataframe in memory, unlike a Manager such as Namespace, which causes the dataframe to be copied to all child processes. See below for a working example. This can be converted to use a pool.
Need a progress bar for this? See my answer here: https://stackoverflow.com/a/55305714/11186769
import time
import queue
import numpy as np
import pandas as pd
import multiprocessing
from random import randint

#==========================================================
# DATA HANDLER
#==========================================================

def data_handler( queue_c, queue_r, queue_d, n_processes ):

    # Create a big dataframe
    big_df = pd.DataFrame(np.random.randint(
        0,100,size=(100, 4)), columns=list('ABCD'))

    # Handle data requests
    finished = 0
    while finished < n_processes:

        try:
            # Get the index we sent in
            idx = queue_c.get(False)

        except queue.Empty:
            continue

        else:
            if idx == 'finished':
                finished += 1
            else:
                try:
                    # Use the big_df here!
                    B_data = big_df.loc[ idx, 'B' ]

                    # Send back some data
                    queue_r.put(B_data)
                except:
                    pass

    # big_df may need to be deleted at the end.
    #import gc; del big_df; gc.collect()
#==========================================================
# PROCESS DATA
#==========================================================

def process_data( queue_c, queue_r, queue_d):

    data = []

    # Save computer memory with a generator
    generator = ( randint(0,x) for x in range(100) )

    for g in generator:
        """
        Lets make a request by sending
        in the index of the data we want.
        Keep in mind you may receive another
        child processes return call, which is
        fine if order isnt important.
        """

        #print(g)

        # Send an index value
        queue_c.put(g)

        # Handle the return call
        while True:
            try:
                return_call = queue_r.get(False)
            except queue.Empty:
                continue
            else:
                data.append(return_call)
                break

    queue_c.put('finished')
    queue_d.put(data)
#==========================================================
# START MULTIPROCESSING
#==========================================================

def multiprocess( n_processes ):

    combined = []
    processes = []

    # Create queues
    queue_data = multiprocessing.Queue()
    queue_call = multiprocessing.Queue()
    queue_receive = multiprocessing.Queue()

    for process in range(n_processes):

        if process == 0:

            # Load your data_handler once here
            p = multiprocessing.Process(target = data_handler,
                args=(queue_call, queue_receive, queue_data, n_processes))
            processes.append(p)
            p.start()

        p = multiprocessing.Process(target = process_data,
            args=(queue_call, queue_receive, queue_data))
        processes.append(p)
        p.start()

    for i in range(n_processes):
        data_list = queue_data.get()
        combined += data_list

    for p in processes:
        p.join()

    # Your B values
    print(combined)

if __name__ == "__main__":

    multiprocess( n_processes = 4 )
I was pretty surprised that joblib's Parallel (since 1.0.1 at least) supports sharing pandas dataframes with multiprocess workers out of the box already. At least with the 'loky' backend.
One thing I figured out experimentally: parameters you pass to the function should not contain any large dict. If they do, turn the dict into a Series or Dataframe.
Each worker certainly uses some additional memory, but much less than the size of your supposedly 'big' dataframe residing in the main process, and the computation begins right away in all workers. Otherwise, joblib starts all your requested workers, but they hang idle while objects are copied into each one sequentially, which takes a long time. I can provide a code sample if someone needs it. I have only tested dataframe processing in read-only mode. The feature is not mentioned in the docs, but it works for Pandas.
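A minimal sketch of that kind of usage (the function, dataframe, and configs below are placeholders of mine):

import pandas as pd
from joblib import Parallel, delayed

def work(df, config):
    # stand-in for the real per-config computation on the shared dataframe
    return df['value'].sum() * config

if __name__ == '__main__':
    big_df = pd.DataFrame({'value': range(1_000_000)})
    configs = [1, 2, 3, 4]
    # loky is joblib's default process-based backend
    results = Parallel(n_jobs=4, backend='loky')(
        delayed(work)(big_df, c) for c in configs)
    print(results)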