Multiprocessing redis instance - thread_lock error - Python

I have a large dataframe (about a million rows) from which I create several smaller dataframes by filtering on a particular column. Now I want to insert the data into Redis, where I will be performing some heavy calculations. I'm trying to create a Redis instance and parallelize the insertion of data into the Redis database via multiprocessing. I'm new to multiprocessing and get an error while inserting the data. I'm not sure whether this can be done, or why I'm getting the thread_lock error. Can anyone explain, and how can I proceed towards a solution?
Here is the code:
import redis
from multiprocessing import Process, Queue

def insertintoRedis(df, q, r):
    for row in df.values:
        r.hset(row[-5], row[0], row[-4])
    return

r = redis.StrictRedis(host='localhost', port=6379, db=0)
q = Queue()
p1 = Process(target=insertintoRedis, args=(df1, q, r))
p2 = Process(target=insertintoRedis, args=(df2, q, r))
p1.start()
p2.start()
p1.join()
p2.join()
I get this error at p1.start():
TypeError: cannot pickle '_thread.lock' object

I suspect the problem is that r, the instance of redis.StrictRedis, cannot be pickled and thus cannot be passed as an argument to insertintoRedis, and that each process must create its own instance of redis.StrictRedis. If you will be invoking insertintoRedis many times (unfortunately, your code is overly simplified), it might be better to use a process pool, where you only have to create one instance of redis.StrictRedis per pool process, which can then be reused repeatedly. This would also make it possible to get return values back from your worker function insertintoRedis (if the reason for passing a multiprocessing.Queue instance to this function is to collect results, you would no longer need it -- but again, that cannot be deduced from the code you posted).
Here is the general idea:
import redis
from functools import partial
from multiprocessing import Pool, Queue

def init_pool():
    global r
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

# note that the order of the arguments has been changed:
def insertintoRedis(q, df):
    for row in df.values:
        r.hset(row[-5], row[0], row[-4])
    return  # or return some_value

# right now limit the pool size to 2 since we only have two tasks:
q = Queue()
pool = Pool(2, initializer=init_pool)
# if you do not actually need q for results, drop it and just call
# pool.map(insertintoRedis, [df1, df2]) and use the returned list instead
return_values = pool.map(partial(insertintoRedis, q), [df1, df2])
Perhaps multiprocessing.Pool.imap would be the more appropriate method to use if you have a very large collection of dataframes and wish to avoid building the large list required by map; you can pass it a generator function or expression instead.
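A minimal sketch of that imap variant, assuming the filtered dataframes can be produced lazily (pd.read_csv('data.csv') and 'some_column' are illustrative placeholders, and the queue is dropped since imap already returns the results):

import redis
import pandas as pd
from multiprocessing import Pool

def init_pool():
    # each pool process creates its own connection; it is never pickled
    global r
    r = redis.StrictRedis(host='localhost', port=6379, db=0)

def insertintoRedis(df):
    for row in df.values:
        r.hset(row[-5], row[0], row[-4])
    return len(df)

def dataframes(big_df):
    # generator yielding the filtered dataframes one at a time
    for _, group in big_df.groupby('some_column'):
        yield group

if __name__ == '__main__':
    big_df = pd.read_csv('data.csv')   # illustrative source of the large dataframe
    with Pool(4, initializer=init_pool) as pool:
        for n in pool.imap(insertintoRedis, dataframes(big_df)):
            print(n, 'rows inserted')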

Related

Python: Pass Readonly Shelve to subprocesses

As discussed here: Python: Multiprocessing on Windows -> Shared Readonly Memory, I have a heavily parallelized task.
Multiple workers do some work and in the end need to access some keys of a dictionary which contains several million key:value pairs. Which keys will be accessed is only known within the worker after some further action that also involves some file processing (the example below is simplified for demonstration purposes).
Before, my solution was to keep this big dictionary in memory, pass it once into shared memory and let the individual workers access it. But it consumes a lot of RAM... So I wanted to use shelve (because the values of that dictionary are again dicts or lists).
So a simplified example of what I tried was:
import shelve
import multiprocessing

def shelveWorker(tupArgs):
    id, DB = tupArgs
    return DB[id]

if __name__ == '__main__':
    DB = shelve.open('file.db', flag='r', protocol=2)
    joblist = []
    for id in range(10000):
        joblist.append((str(id), DB))

    p = multiprocessing.Pool()
    for returnValue in p.imap_unordered(shelveWorker, joblist):
        # do something with returnValue
        pass
    p.close()
    p.join()
Unfortunately I get:
"TypeError: can't pickle DB objects"
But IMHO it does not make sense to open the shelve itself (DB = shelve.open('file.db', flag='r', protocol=2)) within each worker, because that would slow things down (I have several thousand workers).
How to go about it?
Thanks!
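One way to avoid pickling the shelve while still opening it only once per worker process (rather than once per job) is a pool initializer, the same pattern as in the Redis answer above. A minimal sketch (worker_init is an illustrative name):

import shelve
from multiprocessing import Pool

def worker_init():
    # open the shelve once per worker process; only the string id
    # has to be pickled for each job
    global DB
    DB = shelve.open('file.db', flag='r', protocol=2)

def shelveWorker(id):
    return DB[id]

if __name__ == '__main__':
    with Pool(initializer=worker_init) as p:
        for returnValue in p.imap_unordered(shelveWorker, (str(i) for i in range(10000))):
            # do something with returnValue
            pass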

Python multi function multithreading with threading.Thread? (variable number of threads)

I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high frequency, I need to cut that computation time down to one minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
import numpy as np

functionList = [func1, func2, func3, func4]
outputList = [func1out, func2out, func3out, func4out]
argsList = [func1args, func2args, func3args, func4args]

# number of threads
n = 3

functionSplit = np.array_split(np.array(functionList), n)
outputSplit = np.array_split(np.array(outputList), n)
argSplit = np.array_split(np.array(argsList), n)
Now I'd like to start "n" separate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to the outputList and create a master dict of the outputs from each function. I will then loop through the output dict and create a dataframe with column ID numbers according to the information in each column (I already have this part worked out, I just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of the threading.Thread class and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call functions in a list according to their corresponding arguments!
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library as shown below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    for i in range(len(fnList)):
        # store each function's result in the shared dict under its output name;
        # this assumes each entry of argList is a tuple of arguments
        d[outList[i]] = fnList[i](*argList[i])

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns,
            args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
You can use a server process to share the dictionary between these processes. To interact with the server process you need a Manager. You can then create a dictionary in the server process with manager.dict(). Once all processes have joined back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Check out concurrent.futures and try its ProcessPoolExecutor for maintaining a pool of processes. You can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can take a function and an iterable and process them in chunks in parallel to compute faster. The iterable is broken into separate chunks, these chunks are passed to the function in separate processes, and the results are then put back together.
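For example, a minimal sketch using concurrent.futures.ProcessPoolExecutor with the question's functionList/argsList/outputList (assuming each entry of argsList is a tuple of arguments, and that the functions are defined at module level so they can be pickled):

from concurrent.futures import ProcessPoolExecutor, as_completed

def compute_all(functionList, argsList, outputList, max_workers=3):
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # submit every function with its own arguments, remembering its output name
        futures = {
            executor.submit(fn, *args): name
            for fn, args, name in zip(functionList, argsList, outputList)
        }
        # collect results as they finish, keyed by output name
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

The executor decides how many worker processes to keep busy, so you can tune max_workers instead of splitting the function list yourself.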

Does a queue make sense for multiprocessing even when you only expect a single result back?

I want to execute a function in another process and get a single result back (either true or false). I know the common way of getting results back from multiprocessing is using a queue, but does it make sense if I only expect a single result back?
p = Process(target=my_function_that_returns_boolean, args=(self, args))
p.start()
p.join()
# success = p.somehow_get_the_result_back
If you prefer to use Process rather than Pool, the documentation tells us that there are two ways to exchange objects between processes.
The first is Queue which you have already seen.
The second is Pipe, which the documentation provides an example for. I have slightly modified the example to show your case of returning a boolean.
from multiprocessing import Process, Pipe

def Foo(conn):
    # Do the necessary processing here
    # ....
    # Instead of "return True", we send True through the pipe
    # return True
    conn.send(True)

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=Foo, args=(child_conn,))
    p.start()
    print(parent_conn.recv())
    p.join()
Queues are used to synchronize access to shared resources in a parallel environment. Common scenarios are when many workers consume tasks from a shared pool or when one execution line creates tasks and another consumes them.
If I understand correctly, that isn't an issue here, so there is no need to use a queue. The only synchronization mechanism you need is one that tells one process that the other is done, and that is achieved by join().
Unless there is a real problem just keep things as simple as possible.
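A minimal sketch of that simplest approach, using a multiprocessing.Value as the single shared flag (the worker signature and args placeholder here are illustrative, not from the original question):

import ctypes
from multiprocessing import Process, Value

def my_function_that_returns_boolean(flag, args):
    # ... do the work, then record the single result in the shared flag
    flag.value = True

if __name__ == '__main__':
    args = None                      # illustrative placeholder
    success_flag = Value(ctypes.c_bool, False)
    p = Process(target=my_function_that_returns_boolean, args=(success_flag, args))
    p.start()
    p.join()
    success = bool(success_flag.value)
    print(success)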
You can use a Pool, which returns an AsyncResult object:
from multiprocessing import Pool

pool = Pool(processes=1)
result = pool.apply_async(my_function_that_returns_boolean, args=(self, args))
success = result.get(timeout=None)

multiprocessing in python - sharing large object (e.g. pandas dataframe) between multiple processes

I am using Python multiprocessing, more precisely
from multiprocessing import Pool
p = Pool(15)
args = [(df, config1), (df, config2), ...] #list of args - df is the same object in each tuple
res = p.map_async(func, args) #func is some arbitrary function
p.close()
p.join()
This approach has huge memory consumption: it eats up pretty much all my RAM (at which point it gets extremely slow, making the multiprocessing pretty useless). I assume the problem is that df is a huge object (a large pandas dataframe) and it gets copied for each process. I have tried using multiprocessing.Value to share the dataframe without copying:
shared_df = multiprocessing.Value(pandas.DataFrame, df)
args = [(shared_df, config1), (shared_df, config2), ...]
(as suggested in Python multiprocessing shared memory), but that gives me TypeError: this type has no size (the same as in Sharing a complex object between Python processes?, whose answer I unfortunately don't understand).
I am using multiprocessing for the first time and maybe my understanding is not (yet) good enough. Is multiprocessing.Value actually the right thing to use in this case? I have seen other suggestions (e.g. queue) but am by now a bit confused. What options are there to share memory, and which one would be best in this case?
The first argument to Value is typecode_or_type. That is defined as:
typecode_or_type determines the type of the returned object: it is
either a ctypes type or a one character typecode of the kind used by
the array module. *args is passed on to the constructor for the type.
Emphasis mine. So you simply cannot put a pandas dataframe in a Value; it has to be a ctypes type.
You could instead use a multiprocessing.Manager to serve your singleton dataframe instance to all of your processes. There are a few different ways to end up in the same place - probably the easiest is to just plop your dataframe into the manager's Namespace.
from multiprocessing import Manager
mgr = Manager()
ns = mgr.Namespace()
ns.df = my_dataframe
# now just give your processes access to ns, i.e. most simply
# p = Process(target=worker, args=(ns, work_unit))
Now your dataframe instance is accessible to any process that gets passed a reference to the Manager. Or just pass a reference to the Namespace; it's cleaner.
One thing I didn't/won't cover is events and signaling - if your processes need to wait for others to finish executing, you'll need to add that in. Here is a page with some Event examples which also covers in a bit more detail how to use the manager's Namespace.
(note that none of this addresses whether multiprocessing is going to result in tangible performance benefits, this is just giving you the tools to explore that question)
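For completeness, a minimal sketch of a worker using the Namespace (worker and the work-unit string are just the placeholder names from the comment above; my_dataframe is assumed to exist):

from multiprocessing import Manager, Process

def worker(ns, work_unit):
    # ns.df retrieves the dataframe held by the manager process
    df = ns.df
    print(work_unit, len(df))

if __name__ == '__main__':
    mgr = Manager()
    ns = mgr.Namespace()
    ns.df = my_dataframe
    p = Process(target=worker, args=(ns, 'some work unit'))
    p.start()
    p.join()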
You can use Array instead of Value for storing your dataframe.
The solution below converts a pandas dataframe to an object that stores its data in shared memory:
import numpy as np
import pandas as pd
import multiprocessing as mp
import ctypes
# the original dataframe is df; store the column/dtype pairs
df_dtypes_dict = dict(list(zip(df.columns, df.dtypes)))

# declare a shared Array with data from df
mparr = mp.Array(ctypes.c_double, df.values.reshape(-1))

# create a new df based on the shared array
df_shared = pd.DataFrame(np.frombuffer(mparr.get_obj()).reshape(df.shape),
                         columns=df.columns).astype(df_dtypes_dict)
If you now share df_shared across processes, no additional copies will be made. For your case:
def fun(config):
    # df_shared is global to the script
    df_shared.apply(config)   # whatever computation you do with df/config

pool = mp.Pool(15)
config_list = [config1, config2]
res = pool.map_async(fun, config_list)
pool.close()
pool.join()
This is also particularly useful if you use pandarallel, for example:
# this will not explode in memory
from pandarallel import pandarallel
pandarallel.initialize()
df_shared.parallel_apply(your_fun, axis=1)
Note: with this solution you end up with two dataframes (df and df_shared), which consume twice the memory and take a long time to initialise. It might be possible to read the data directly into shared memory.
At least Python 3.6 supports storing a pandas DataFrame as a multiprocessing.Value. See below for a working example:
import ctypes
import pandas as pd
from multiprocessing import Value

df = pd.DataFrame({'a': range(0, 9),
                   'b': range(10, 19),
                   'c': range(100, 109)})

k = Value(ctypes.py_object)
k.value = df

print(k.value)
You can share a pandas dataframe between processes without any memory overhead by creating a data_handler child process. This process receives calls from the other children with specific data requests (i.e. a row, a specific cell, a slice, etc.) from your very large dataframe object. Only the data_handler process keeps your dataframe in memory, unlike a Manager such as Namespace, which causes the dataframe to be copied to all child processes. See below for a working example. This can be converted to a pool.
Need a progress bar for this? See my answer here: https://stackoverflow.com/a/55305714/11186769
import time
import queue
import numpy as np
import pandas as pd
import multiprocessing
from random import randint

#==========================================================
# DATA HANDLER
#==========================================================

def data_handler(queue_c, queue_r, queue_d, n_processes):

    # Create a big dataframe
    big_df = pd.DataFrame(np.random.randint(
        0, 100, size=(100, 4)), columns=list('ABCD'))

    # Handle data requests
    finished = 0
    while finished < n_processes:

        try:
            # Get the index we sent in
            idx = queue_c.get(False)

        except queue.Empty:
            continue

        else:
            if idx == 'finished':
                finished += 1
            else:
                try:
                    # Use the big_df here!
                    B_data = big_df.loc[idx, 'B']

                    # Send back some data
                    queue_r.put(B_data)
                except:
                    pass

    # big_df may need to be deleted at the end.
    #import gc; del big_df; gc.collect()

#==========================================================
# PROCESS DATA
#==========================================================

def process_data(queue_c, queue_r, queue_d):

    data = []

    # Save computer memory with a generator
    generator = (randint(0, x) for x in range(100))

    for g in generator:

        """
        Let's make a request by sending
        in the index of the data we want.
        Keep in mind you may receive another
        child process's return call, which is
        fine if order isn't important.
        """

        #print(g)

        # Send an index value
        queue_c.put(g)

        # Handle the return call
        while True:
            try:
                return_call = queue_r.get(False)
            except queue.Empty:
                continue
            else:
                data.append(return_call)
                break

    queue_c.put('finished')
    queue_d.put(data)

#==========================================================
# START MULTIPROCESSING
#==========================================================

def multiprocess(n_processes):

    combined = []
    processes = []

    # Create queues
    queue_data = multiprocessing.Queue()
    queue_call = multiprocessing.Queue()
    queue_receive = multiprocessing.Queue()

    for process in range(n_processes):

        if process == 0:

            # Load your data_handler once here
            p = multiprocessing.Process(target=data_handler,
                    args=(queue_call, queue_receive, queue_data, n_processes))
            processes.append(p)
            p.start()

        p = multiprocessing.Process(target=process_data,
                args=(queue_call, queue_receive, queue_data))
        processes.append(p)
        p.start()

    for i in range(n_processes):
        data_list = queue_data.get()
        combined += data_list

    for p in processes:
        p.join()

    # Your B values
    print(combined)

if __name__ == "__main__":
    multiprocess(n_processes=4)
I was pretty surprised that joblib's Parallel (since 1.0.1 at least) already supports sharing pandas dataframes with multiprocessing workers out of the box, at least with the 'loky' backend.
One thing I figured out experimentally: the parameters you pass to the function should not contain any large dict; if they do, turn the dict into a Series or DataFrame.
Each worker does use some additional memory, but much less than the size of your supposedly 'big' dataframe residing in the main process. And the computation begins right away in all workers. Otherwise, joblib starts all your requested workers, but they hang idle while objects are copied into each one sequentially, which takes a long time. I can provide a code sample if someone needs it. I have only tested dataframe processing in read-only mode. The feature is not mentioned in the docs, but it works for pandas.
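A minimal sketch of the pattern described (the dataframe, function and configs here are purely illustrative; only the Parallel/delayed usage with the 'loky' backend is the point):

import pandas as pd
from joblib import Parallel, delayed

def process_chunk(df, config):
    # read-only work on the shared dataframe
    return df['a'].sum() * config

if __name__ == '__main__':
    df = pd.DataFrame({'a': range(1_000_000)})
    configs = [1, 2, 3, 4]
    # the same df is handed to every task; loky workers start computing immediately
    results = Parallel(n_jobs=4, backend='loky')(
        delayed(process_chunk)(df, c) for c in configs
    )
    print(results)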

Python multiprocessing is taking much longer than single processing

I am performing some large computations on 3 different numpy 2D arrays sequentially. The arrays are huge, 25000x25000 each. Each computation takes significant time, so I decided to run 3 of them in parallel on 3 CPU cores on the server. I am following the standard multiprocessing guidelines and creating 2 processes plus a worker function. Two computations run through the 2 processes and the third one runs locally without a separate process. I am passing the huge arrays as arguments of the processes like:
p1 = Process(target = Worker, args = (queue1, array1, ...)) # Some other params also going
p2 = Process(target = Worker, args = (queue2, array2, ...)) # Some other params also going
The Worker function sends back two numpy vectors (1D arrays) in a list appended to the queue, like:
queue.put([v1, v2])
I am not using multiprocessing.pool
But surprisingly I am not getting a speedup; it is actually running 3 times slower. Is passing large arrays taking time? I am unable to figure out what is going on. Should I use shared memory objects instead of passing arrays?
I shall be thankful if anybody can help.
Thank you.
My problem appears to be resolved. I was using a Django module from inside which I was calling multiprocessing.pool.map_async. My worker function was a method of the class itself; that was the problem. Multiprocessing cannot call a method of the same class inside another process because subprocesses do not share memory, so inside the subprocess there is no live instance of the class. That is probably why it was not getting called, as far as I understood. I removed the function from the class and put it in the same file, just above the class definition. That worked, and I also got a moderate speedup. And one more thing: for people facing the same problem, please do not read large arrays and pass them between processes. Pickling and unpickling take a lot of time, and you won't get a speedup but rather a slowdown. Try to read the arrays inside the subprocess itself.
And if possible, please use numpy.memmap arrays; they are quite fast.
Here is an example using np.memmap and Pool. You can define the number of processes and the number of workers. In this case you don't have control over the queue, which can be achieved using multiprocessing.Queue:
from multiprocessing import Pool
import numpy as np

def mysum(array_file_name, col1, col2, shape):
    a = np.memmap(array_file_name, shape=shape, mode='r+')
    a[:, col1:col2] = np.random.random((shape[0], col2 - col1))
    ans = a[:, col1:col2].sum()
    del a
    return ans

if __name__ == '__main__':
    nop = 1000  # number_of_processes
    now = 3     # number of workers
    p = Pool(now)

    array_file_name = 'test.array'
    shape = (250000, 250000)
    a = np.memmap(array_file_name, shape=shape, mode='w+')
    del a

    cols = [[shape[1] * i // nop, shape[1] * (i + 1) // nop] for i in range(nop)]

    results = []
    for c1, c2 in cols:
        r = p.apply_async(mysum, args=(array_file_name, c1, c2, shape))
        results.append(r)

    p.close()
    p.join()

    final_result = sum([r.get() for r in results])
    print(final_result)
You can achieve better performance using shared-memory parallel processing, when possible. See this related question:
Shared-memory objects in python multiprocessing
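For reference, a minimal sketch of that shared-memory approach with numpy, using the same multiprocessing.Array/np.frombuffer technique as the dataframe answer earlier on this page (the array size and the worker are illustrative):

import ctypes
import numpy as np
from multiprocessing import Array, Process

def worker(shared_arr, shape, col):
    # re-wrap the shared buffer as a numpy array; no data is copied
    a = np.frombuffer(shared_arr.get_obj()).reshape(shape)
    print(col, a[:, col].sum())

if __name__ == '__main__':
    shape = (1000, 1000)
    shared_arr = Array(ctypes.c_double, shape[0] * shape[1])
    # fill the shared buffer once in the parent
    np.frombuffer(shared_arr.get_obj()).reshape(shape)[:] = np.random.random(shape)

    procs = [Process(target=worker, args=(shared_arr, shape, c)) for c in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()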
