I am trying to manipulate the lists inside the dictionary clean_txt from another function, but it's not working and I end up with empty lists inside the dict.
My understanding is that both lists and dicts are mutable objects, so what is the problem here?
def process_questions(i, question_list, questions, question_list_name):
    ''' Transform questions and display progress '''
    print('processing {}: process {}'.format(question_list_name, i))
    for question in questions:
        question_list.append(text_to_wordlist(str(question)))

#timeit
def multi(n_cores, tq, qln):
    procs = []
    clean_txt = {}
    for i in range(n_cores):
        clean_txt[i] = []

    for index in range(n_cores):
        tq_indexed = tq[index*len(tq)//n_cores:(index+1)*len(tq)//n_cores]
        proc = Process(target=process_questions, args=(index, clean_txt[index], tq_indexed, qln, ))
        procs.append(proc)
        proc.start()

    for proc in procs:
        proc.join()

    print('{} records processed from {}'.format(sum([len(x) for x in clean_txt.values()]), qln))
    print('-'*100)
You are using processes, not threads.
When a process is created, the memory of your program is copied and each process works on its own copy, so it is NOT shared.
Here's a question that can help you understand: Multiprocessing vs Threading Python
If you want to share memory between processes, you should look into semaphores or use threads instead. There are also other ways to share data, such as queues or a database.
You are appending to clean_txt[index] from a different process. clean_txt[index] belongs to the main Python process that created it. Since a process can't access or modify another process's memory, you can't append to it. (Not really. See the edit below.)
You will need to create shared memory.
You can use Manager to create shared memory, something like this:
from multiprocessing import Manager
manager = Manager()
...
clean_txt[i] = manager.list()
Now you can append to this list in the other process.
Edit -
My explanation about clean_txt wasn't clear. Thanks to @Maresh.
When a new process is created, the whole memory is copied, so modifying the list in the new process won't affect the copy in the main process. That is why you need shared memory.
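To make the Manager suggestion concrete, here is a minimal sketch of the question's multi() function using one managed list per worker. It assumes Python 3, and the lower-case-and-split step is only a stand-in for text_to_wordlist(), which is not shown in the question.

```python
from multiprocessing import Manager, Process


def process_questions(question_list, questions):
    # Stand-in for text_to_wordlist(): lower-case and split each question.
    for question in questions:
        question_list.append(str(question).lower().split())


def multi(n_cores, tq):
    with Manager() as manager:
        # One managed list per worker; the dict itself stays in the parent.
        clean_txt = {i: manager.list() for i in range(n_cores)}
        procs = []
        for index in range(n_cores):
            chunk = tq[index * len(tq) // n_cores:(index + 1) * len(tq) // n_cores]
            proc = Process(target=process_questions, args=(clean_txt[index], chunk))
            procs.append(proc)
            proc.start()
        for proc in procs:
            proc.join()
        # Copy the proxies into plain lists before the manager shuts down.
        return {i: list(lst) for i, lst in clean_txt.items()}


if __name__ == "__main__":
    result = multi(2, ["What is Python", "How do processes work", "Why fork", "Is it shared"])
    print(sum(len(x) for x in result.values()))
```

Every append() on a managed list is a round trip to the manager process, so for large inputs it may be cheaper to have each worker build a local list and hand it back once.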
Related
Scenario
I have a long list in which each element is a string, and I want to apply the same operation to every element. Given the length of the list, I want to use multiprocessing so that each process handles only a portion of the elements, to save time.
In the code below:
I create a multiprocessing.Manager so that ls can be shared between the 2 child processes.
Then I create a multiprocessing.Pool, expecting two processes to be created. One process runs func() on ls[0:100000], and the other runs func() on ls[100000:200000].
Every result of my_operation() on ls[index] is written in place.
Example code (using 2 processes):
from multiprocessing import Pool, Manager

# len(ls) = 200000
ls = ['a', 'b', ...]

# Each element of the scopes below is the scope to be processed by the process.
# (I want to use 2 processes in this example)
scopes = [ (0, 100000), (100000, 200000) ]

m = Manager()
ls = m.list(ls) # create a shared list for child processes

def func(range_start, range_end):
    for index in range(range_start, range_end):
        ls[index] = my_operation(ls[index])

def my_operation(str):
    return str

with Pool(2) as pool:
    pool.starmap_async(func, scopes)
    pool.close()
    pool.join()
Questions
Is it a good idea to read and write on the same list when using multiprocessing to deal with different scopes of this list?
How can I improve my code? (Change Manager to something else? Less shared state, and how to achieve that? ...)
Thank you!
I am using a MacBook, so multiprocessing will use the fork system call instead of spawning a new process. Also, I am using Python (with multiprocessing or Dask).
I have a very big pandas dataframe. I need many parallel subprocesses to each work with a portion of this one big dataframe. Let's say I have 100 partitions of this table that need to be worked on in parallel. I want to avoid making 100 copies of this big dataframe, as that would overwhelm memory. So my current approach is to partition it, save each partition to disk, and have each process read in the portion it is responsible for. But this read/write is very expensive for me, and I would like to avoid it.
But if I make this dataframe a global variable, then due to copy-on-write (COW) behavior, each process will be able to read from it without making an actual physical copy (as long as it does not modify it). Now the question I have is, if I make this one global dataframe and name it:
global my_global_df
my_global_df = one_big_df
and then in one of the subprocess I do:
a_portion_of_global_df_readonly = my_global_df.iloc[0:10]
a_portion_of_global_df_copied = a_portion_of_global_df_readonly.reset_index(drop=True)
# reset index will make a copy of the a_portion_of_global_df_readonly
# do something with a_portion_of_global_df_copied
If I do the above, will I have created a copy of the entire my_global_df or just a copy of a_portion_of_global_df_readonly, and thereby, by extension, avoided making 100 copies of one_big_df?
One additional, more general question: why do people have to deal with Pickle serialization and/or reading and writing to disk to transfer data across multiple processes when (assuming they are using UNIX) setting the data as a global variable makes it available to all child processes so easily? Is there any danger in using COW as a means to make data available to subprocesses in general?
[Reproducible code from the thread below]
from multiprocessing import Process, Pool
import contextlib

import numpy as np  # needed for np.asarray below
import pandas as pd

def my_function(elem):
    return id(elem)

num_proc = 4
num_iter = 10
df = pd.DataFrame(np.asarray([1]))
print(id(df))

with contextlib.closing(Pool(processes=num_proc)) as p:
    procs = [p.apply_async(my_function, args=(df, )) for elem in range(num_iter)]
    results = [proc.get() for proc in procs]
    p.close()
    p.join()

print(results)
Summarizing the comments: on a forking system such as Mac or Linux, a child process has a copy-on-write (COW) view of the parent address space, including any DataFrames that it may hold. It is safe to use and modify the dataframe in child processes without changing the data in the parent or in sibling child processes.
That means that it is unnecessary to serialize the dataframe to pass it to the child. All you need is the reference to the dataframe. For a Process, you can just pass the reference directly:
p = multiprocessing.Process(target=worker_fctn, args=(my_dataframe,))
p.start()
p.join()
If you use a Queue or another tool such as a Pool then the data will likely be serialized. You can use a global variable known to the worker but not actually passed to the worker to get around that problem.
What remains is the return data. It is in the child only and still needs to be serialized to be returned to the parent.
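The two points above (inherit the input through a global, serialize only the result) can be sketched as follows. This is an illustration under assumptions, not the asker's code: BIG_DATA is a plain list standing in for the large dataframe, worker() and run() are hypothetical names, and the "fork" start method is requested explicitly, so the sketch will not run on Windows.

```python
import multiprocessing as mp

# Stand-in for the large read-only DataFrame (hypothetical data).
BIG_DATA = list(range(1_000_000))


def worker(start, stop, out_queue):
    # BIG_DATA is inherited through fork; it is never passed as an argument.
    out_queue.put((start, sum(BIG_DATA[start:stop])))


def run(n_parts=4):
    ctx = mp.get_context("fork")  # assumption: a forking platform (not Windows)
    out_queue = ctx.Queue()
    step = len(BIG_DATA) // n_parts
    procs = [
        ctx.Process(target=worker, args=(i * step, (i + 1) * step, out_queue))
        for i in range(n_parts)
    ]
    for p in procs:
        p.start()
    # Only the small (start, partial_sum) tuples are pickled on the way back.
    results = dict(out_queue.get() for _ in procs)
    for p in procs:
        p.join()
    return sum(results.values())


if __name__ == "__main__":
    print(run())
```

One caveat: in CPython, merely reading an object updates its reference count, so some of the inherited pages may still be copied even if the child never modifies the data.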
I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = [] # About 200mb array (10m rows)
# a list of uniquekeys for comparing inside worker processes to a set of flatfiles
I need more threads as it is going very slow, doing the comparison with one process (10 minutes per file).
I have another set of flat-files that I compare the CSV file to, to see if unique keys exist. This seems like a map reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
    # Do some parallel comparisons "if not in " type stuff.
    # generate an array of
    # lines of text like : "this item_x was not detected in CSV list (from current_flatfile)"
    if current_item not in list_of_unique_keys_from_csv_file:
        all_lines_this_worker_generated.append(sometext + current_item)
    return all_lines_this_worker_generated

def main():
    all_results = []
    pool = Pool(processes=6)
    partitioned_flat_files = [] # divide files from glob by 6
    results = pool.starmap(worker_process, partitioned_flat_files, {{{{i wanna pass in my read-only parameter}}}})
    pool.close()
    pool.join()
    all_results.extend(results )
    resulting_file.write(all_results)
I am using both a linux and a windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main Question: Do I need some sort of Pipe or Queue? I can't seem to find good examples of how to pass a big read-only string array around, with a copy for each worker process.
You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform, so don't worry about that.
Every process, including a sub-process, has its own resources. That means that no matter how you pass the parameters to it, it will keep a copy of the original instead of sharing it. In this simple case, when you pass the parameters from the main process to the sub-processes, Pool automatically makes a copy of your variables. Because the sub-processes only have copies of the original, modifications are not shared. That doesn't matter in this case, as your variables are read-only.
But be careful with your code: you need to wrap the parameters into an iterable collection, for example:
from multiprocessing import Pool

def add(a, b):
    return a + b

pool = Pool()
results = pool.starmap(add, [(1, 2), (3, 4)])
print(results)
# [3, 7]
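Applied to the question, the same pattern might look like the sketch below: each partition is paired with the same read-only collection of keys, so every task tuple carries both arguments. The names find_missing() and run() are hypothetical, and using a frozenset for the keys is an assumption made here to keep the membership test fast. It is not something the question specifies.

```python
from multiprocessing import Pool


def find_missing(items, known_keys):
    # known_keys is a per-worker copy; it is only read, never modified.
    return [item for item in items if item not in known_keys]


def run(partitions, known_keys, processes=2):
    # Pair every partition with the same read-only argument.
    tasks = [(part, known_keys) for part in partitions]
    with Pool(processes=processes) as pool:
        per_partition = pool.starmap(find_missing, tasks)
    # starmap returns one list per task, so flatten them.
    return [item for part in per_partition for item in part]


if __name__ == "__main__":
    keys = frozenset(["a", "b", "c"])
    print(run([["a", "x"], ["b", "y", "c"]], keys))
```

The keys are pickled and sent along with the tasks, so using a few large partitions rather than many small ones keeps that copying down.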
I'm using multiprocessing to create a sub-process for my application.
I also share a dictionary between the process and the sub-process.
Example of my code:
Main process:
from multiprocessing import Process, Manager
manager = Manager()
shared_dict = manager.dict()
p = Process(target=mysubprocess, args=(shared_dict,))
p.start()
p.join()
print shared_dict
my sub-process:
def mysubprocess(shared_dict):
    shared_dict['list_item'] = list()
    shared_dict['list_item'].append('test')
    print shared_dict
In both cases the printed value is:
{'list_item': []}
What could be the problem?
Thanks
Manager.dict will give you a dict where direct changes will be propagated between the processes, but it doesn't detect if you change objects contained in the dict (like the list stored under "list_item"). See the note at the bottom of the SyncManager documentation:
Note: Modifications to mutable values or items in dict and list proxies will not be propagated through the manager, because the proxy has no way of knowing when its values or items are modified. To modify such an item, you can re-assign the modified object to the container proxy.
So in your example the list gets synced when you set it in the dict, but the append doesn't trigger another sync.
You can get around that by re-assigning the key in the dict:
from multiprocessing import Process, Manager

def mysubprocess(shared_dict):
    item = shared_dict['list_item'] = list()
    item.append('test')
    shared_dict['list_item'] = item
    print 'subprocess:', shared_dict

manager = Manager()
shared_dict = manager.dict()
p = Process(target=mysubprocess, args=(shared_dict,))
p.start()
p.join()
print 'main process:', shared_dict
But that might get inefficient if the list is going to grow long - the whole list will be serialised and sent to the manager process for each append. A better way in that case would be to make the shared list directly with SyncManager.list (although you'll still have the same problem if the elements of the list are mutable - you need to reset them in the list to send them between the processes).
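A minimal sketch of that last suggestion follows. It assumes Python 3.6 or later, where manager proxies can be nested inside one another, and it uses Python 3 syntax rather than the Python 2 print statements above. The run() wrapper is a hypothetical name added for the example.

```python
from multiprocessing import Manager, Process


def mysubprocess(shared_dict):
    # The value is itself a list proxy, so append() goes through the manager.
    shared_dict["list_item"].append("test")


def run():
    with Manager() as manager:
        shared_dict = manager.dict()
        shared_dict["list_item"] = manager.list()
        p = Process(target=mysubprocess, args=(shared_dict,))
        p.start()
        p.join()
        # Copy out of the proxy before the manager shuts down.
        return list(shared_dict["list_item"])


if __name__ == "__main__":
    print(run())
```

Here only the appended element is sent to the manager, not the whole list, which is the efficiency gain the paragraph above describes.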
I am aware of multiprocessing.Manager() and how it can be used to create shared objects, in particular queues which can be shared between workers. There is this question, this question, this question and even one of my own questions.
However, I need to define a great many queues, each of which is linking a specific pair of processes. Say that each pair of processes and its linking queue is identified by the variable key.
I want to use a dictionary to access my queues when I need to put and get data. I cannot make this work. I've tried a number of things. With multiprocessing imported as mp:
Defining a dict like for key in all_keys: DICT[key] = mp.Queue in a config file which is imported by the multiprocessing module (call it multi.py) does not return errors, but the queue DICT[key] is not shared between the processes, each one seems to have their own copy of the queue and thus no communication happens.
If I try to define the DICT at the beginning of the main multiprocessing function that defines the processes and starts them, like
DICT = mp.Manager().dict()
for key in all_keys:
DICT[key] = mp.Queue()
I get the error
RuntimeError: Queue objects should only be shared between processes through inheritance
Changing to
DICT = mp.Manager().dict()
for key in all_keys:
DICT[key] = mp.Manager().Queue()
only makes everything worse. Trying similar definitions at the head of multi.py rather than inside the main function returns similar errors.
There must be a way to share many queues between processes without explicitly naming each one in the code. Any ideas?
Edit
Here is a basic schema of the program:
1- load the first module, which defines some variables, imports multi, launches multi.main(), and loads another module which starts a cascade of module loads and code execution. Meanwhile...
2- multi.main looks like this:
def main():
    manager = mp.Manager()
    pool = mp.Pool()
    DICT2 = manager.dict()

    for key in all_keys:
        DICT2[key] = manager.Queue()
        proc_1 = pool.apply_async(targ1,(DICT1[key],) ) #DICT1 is defined in the config file
        proc_2 = pool.apply_async(targ2,(DICT2[key], otherargs,))
Rather than use pool and manager, I was also launching processes with the following:
mp.Process(target=targ1, args=(DICT[key],))
3 - The function targ1 takes input data that is coming in (sorted by key) from the main process. It is meant to pass the result to DICT[key] so targ2 can do its work. This is the part that is not working. There are an arbitrary number of targ1s, targ2s, etc. and therefore an arbitrary number of queues.
4 - The results of some of these processes will be sent to a bunch of different arrays / pandas dataframes which are also indexed by key, and which I would like to be accessible from arbitrary processes, even ones launched in a different module. I have yet to write this part and it might be a different question. (I mention it here because the answer to 3 above might also solve 4 nicely.)
It sounds like your issues started when you tried to share a multiprocessing.Queue() by passing it as an argument. You can get around this by creating a managed queue instead:
import multiprocessing
manager = multiprocessing.Manager()
passable_queue = manager.Queue()
When you use a manager to create it, you are storing and passing around a proxy to the queue, rather than the queue itself, so even when the object you pass to your worker processes is a copy, it will still point at the same underlying data structure: your queue. It's very similar (in concept) to pointers in C/C++. If you create your queues this way, you will be able to pass them when you launch a worker process.
Since you can pass queues around now, you no longer need your dictionary to be managed. Keep a normal dictionary in main that will store all the mappings, and only give your worker processes the queues they need, so they won't need access to any mappings.
I've written an example of this here. It looks like you are passing objects between your workers, so that's what's done here. Imagine we have two stages of processing, and the data both starts and ends in the control of main. Look at how we can create the queues that connect the workers like a pipeline; by giving them only the queues they need, there's no need for them to know about any mappings:
import multiprocessing as mp

def stage1(q_in, q_out):
    q_out.put(q_in.get()+"Stage 1 did some work.\n")
    return

def stage2(q_in, q_out):
    q_out.put(q_in.get()+"Stage 2 did some work.\n")
    return

def main():
    pool = mp.Pool()
    manager = mp.Manager()

    # create managed queues
    q_main_to_s1 = manager.Queue()
    q_s1_to_s2 = manager.Queue()
    q_s2_to_main = manager.Queue()

    # launch workers, passing them the queues they need
    results_s1 = pool.apply_async(stage1, (q_main_to_s1, q_s1_to_s2))
    results_s2 = pool.apply_async(stage2, (q_s1_to_s2, q_s2_to_main))

    # Send a message into the pipeline
    q_main_to_s1.put("Main started the job.\n")

    # Wait for work to complete
    print(q_s2_to_main.get()+"Main finished the job.")

    pool.close()
    pool.join()
    return

if __name__ == "__main__":
    main()
The code produces this output:
Main started the job.
Stage 1 did some work.
Stage 2 did some work.
Main finished the job.
I didn't include an example of storing the queues or AsyncResult objects in dictionaries, because I still don't quite understand how your program is supposed to work. But now that you can pass your queues freely, you can build your dictionary to store the queue/process mappings as needed.
In fact, if you really do build a pipeline between multiple workers, you don't even need to keep a reference to the "inter-worker" queues in main. Create the queues, pass them to your workers, then only retain references to queues that main will use. I would definitely recommend trying to let old queues be garbage collected as quickly as possible if you really do have "an arbitrary number" of queues.
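For completeness, here is one way the dictionary-of-queues idea could look. It is a sketch under assumptions, since the real program structure is unknown: worker(), run(), inbox and outbox are hypothetical names, and each key simply gets one worker that reads a message and writes a reply.

```python
import multiprocessing as mp


def worker(key, q_in, q_out):
    # Each worker sees only its own pair of queues, never the dictionaries.
    q_out.put("{}: {}".format(key, q_in.get().upper()))


def run(all_keys):
    with mp.Manager() as manager, mp.Pool(processes=2) as pool:
        # Plain dictionaries in the parent map each key to its managed queues.
        inbox = {key: manager.Queue() for key in all_keys}
        outbox = {key: manager.Queue() for key in all_keys}
        for key in all_keys:
            inbox[key].put("hello from main")
        pending = [
            pool.apply_async(worker, (key, inbox[key], outbox[key]))
            for key in all_keys
        ]
        for result in pending:
            result.get()  # re-raises any exception raised in the worker
        return {key: outbox[key].get() for key in all_keys}


if __name__ == "__main__":
    print(run(["a", "b"]))
```

The dictionaries themselves are ordinary dicts that never leave the parent. Only the queue proxies are pickled and sent to the pool workers.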