Merge using threads not working in Python

I have to merge two lists, and each time I have to fill the lists before merging them. What I did is this:
def repeated_fill_buffer(self):
    """
    repeat the operation until reaching the end of file
    """
    # clear buffers from last data
    self.block = [[] for file in self.files]
    filling_buffer_thread = threading.Thread(self.fill_buffer())
    filling_buffer_thread.start()
    # create inverted index thread
    create_inverted_index_thread = threading.Thread(self.create_inverted_index())
    create_inverted_index_thread.start()
    # check if buffers are not empty to merge and start the thread
    if any(self.block):
        self.block = [[] for file in self.files]
    filling_buffer_thread.join()
    create_inverted_index_thread.join()
But what happens is that filling_buffer_thread and create_inverted_index_thread are only called once and never run again. When I debugged the code I saw that filling_buffer_thread had stopped.
I don't know if I'm explaining my question well, but what I want is to be able to call the same threads multiple times and run them.

If an operation is CPU-bound, threads are of no use because of the Python GIL, which prevents more than one bytecode instruction from executing at a time. Use the multiprocessing module instead, since every process has its own GIL.
Number crunching, or any operation that depends on the CPU for its completion, is CPU-bound. Threads are useful for I/O-bound operations (like database calls or network calls).
To summarize your error: your filling_buffer_thread got blocked by create_inverted_index_thread.
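As a rough, self-contained illustration of that point (not based on the code above; the function and numbers are made up), the same CPU-bound loop run in four threads typically takes about as long as running it serially, while four processes can actually use several cores:
import threading
import time
from multiprocessing import Process

def cpu_bound(n=5_000_000):
    # pure Python number crunching, no I/O, so the GIL serializes threads
    total = 0
    for i in range(n):
        total += i * i
    return total

def run(label, kind):
    start = time.perf_counter()
    jobs = [kind(target=cpu_bound) for _ in range(4)]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    print('{}: {:.2f}s'.format(label, time.perf_counter() - start))

if __name__ == '__main__':
    run('threads', threading.Thread)
    run('processes', Process)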

Related

Why doesn't multithreading speed up my program?

I have a big text file that needs to be processed. I first read all the text into a list and then use ThreadPoolExecutor to start multiple threads to process it. The two functions called in process_text() are not listed here: is_channel and get_relations().
I am on a Mac, and my observations show that it doesn't really speed up the processing (a CPU with 8 cores, only 15% CPU used). If there is a performance bottleneck in either is_channel or get_relations, then multithreading won't help much. Is that the reason for no performance gain? Should I use multiprocessing instead of multithreading to speed things up?
import itertools
from concurrent.futures import ThreadPoolExecutor

def process_file(file_name):
    all_lines = []
    with open(file_name, 'r', encoding='utf8') as f:
        for index, line in enumerate(f):
            line = line.strip()
            all_lines.append(line)
    # Classify text
    all_results = []
    with ThreadPoolExecutor(max_workers=10) as executor:
        for index, result in enumerate(executor.map(process_text, all_lines, itertools.repeat(channel))):
            all_results.append(result)
    for index, entities_relations_list in enumerate(all_results):
        ...  # print out results

def process_text(text, channel):
    global channel_text
    global non_channel_text
    is_right_channel = is_channel(text, channel)
    entities = ()
    relations = None
    entities_relations_list = set()
    entities_relations_list.add((entities, relations))
    if is_right_channel:
        channel_text += 1
        entities_relations_list = get_relations(text, channel)
        return (text, entities_relations_list, is_right_channel)
    non_channel_text += 1
    return (text, entities_relations_list, is_right_channel)
The first thing to do is find out how much time it takes to:
Read the file into memory (T1)
Do all the processing (T2)
Print the results (T3)
The third point (printing), if you are really doing it, can slow things down. It's fine as long as you are not printing to the terminal and are just piping the output to a file or somewhere else.
Based on timings, we'll get to know:
T1 >> T2 => IO bound
T2 >> T1 => CPU bound
T1 and T2 are close => Neither.
By x >> y I mean that x is significantly greater than y.
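A quick way to collect T1 and T2 is to wrap each phase with time.perf_counter(); a rough sketch, reusing the file_name, process_text and channel names from the question:
import time

t0 = time.perf_counter()
with open(file_name, 'r', encoding='utf8') as f:
    all_lines = [line.strip() for line in f]
t1 = time.perf_counter()

all_results = [process_text(text, channel) for text in all_lines]
t2 = time.perf_counter()

# printing/writing the results would be timed the same way to get T3
print('T1 (read) = {:.2f}s, T2 (process) = {:.2f}s'.format(t1 - t0, t2 - t1))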
Based on the above and the file size, you can try a few approaches:
Threading based
Even this can be done in two ways; which one works faster can be found out by benchmarking and looking at the timings again.
Approach-1 (T1 >> T2 or even when T1 and T2 are similar)
Run the code to read the file itself in a thread and let it push the lines to a queue instead of the list.
This thread inserts a None at the end when it is done reading from the file. This is important for telling the workers that they can stop.
Now run the processing workers and pass them the queue
The workers keep reading from the queue in a loop and processing the lines. Similar to the reader thread, these workers put their results in a results queue.
Once a thread encounters a None, it stops the loop and re-inserts the None into the queue (so that other threads can stop themselves).
The printing part can again be done in a thread.
The above is an example of a single producer with multiple consumer threads; a minimal sketch follows below.
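A minimal sketch of Approach-1; the process_text, channel and file_name names are taken from the question, and the queue size and worker count are arbitrary:
import queue
import threading

NUM_WORKERS = 4

def reader(path, lines_q):
    # producer: push lines, then a None as the stop signal
    with open(path, 'r', encoding='utf8') as f:
        for line in f:
            lines_q.put(line.strip())
    lines_q.put(None)

def worker(lines_q, results_q):
    while True:
        line = lines_q.get()
        if line is None:
            lines_q.put(None)  # re-insert so the other workers can also stop
            break
        results_q.put(process_text(line, channel))

lines_q = queue.Queue(maxsize=10_000)
results_q = queue.Queue()

threads = [threading.Thread(target=reader, args=(file_name, lines_q))]
threads += [threading.Thread(target=worker, args=(lines_q, results_q))
            for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results_q now holds one result per line; a printer thread could drain it
# while the workers are still running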
Approach-2 (This is just another way of doing what is being already done by the code snippet in the question)
Read the entire file into a list.
Divide the list into index ranges based on no. of threads.
Example: if the file has 100 lines in total and we use 10 threads
then 0-9, 10-19, .... 90-99 are the index ranges
Pass the complete list and these index ranges to the threads so that each one processes its own set. Since you are not modifying the original list, this works.
This approach can give better results than running the worker for each individual line; a sketch follows below.
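A sketch of Approach-2, again reusing process_text, channel and all_lines from the question; the thread count is arbitrary:
from concurrent.futures import ThreadPoolExecutor

NUM_THREADS = 10

def process_range(lines, start, stop):
    # each worker only reads its own slice of the shared list; nothing is mutated
    return [process_text(text, channel) for text in lines[start:stop]]

chunk = max(1, (len(all_lines) + NUM_THREADS - 1) // NUM_THREADS)
ranges = [(i, min(i + chunk, len(all_lines)))
          for i in range(0, len(all_lines), chunk)]

with ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
    futures = [executor.submit(process_range, all_lines, start, stop)
               for start, stop in ranges]
    all_results = [result for fut in futures for result in fut.result()]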
Multiprocessing based
(CPU bound)
Split the file into multiple files before processing.
Run a new process for each file.
Each process gets the path of the file it should read and process.
This requires an additional step of combining all results/files at the end.
The process-creation part can be done from within Python using the multiprocessing module, or from a driver script (e.g. a shell script) that spawns a Python process for each file; a sketch of the multiprocessing version follows below.
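A sketch of the multiprocessing variant, assuming the big file has already been split into smaller files (the part_*.txt names are hypothetical) and that process_text and channel come from the question:
from multiprocessing import Pool

def process_one_file(path):
    with open(path, 'r', encoding='utf8') as f:
        return [process_text(line.strip(), channel) for line in f]

if __name__ == '__main__':
    parts = ['part_00.txt', 'part_01.txt', 'part_02.txt', 'part_03.txt']
    with Pool(processes=len(parts)) as pool:
        per_file_results = pool.map(process_one_file, parts)
    # combine the per-file result lists at the end
    all_results = [result for chunk in per_file_results for result in chunk]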
Just by looking at the code, it seems to be CPU bound. Hence, I would prefer multiprocessing for doing that. I have used both approaches in practice.
Multiprocessing: when processing huge text files (GBs) stored on disk (like what you are doing).
Threading (Approach-1): when reading from multiple databases. As that is more IO bound than CPU (I used multiple producer and multiple consumer threads).

Taking advantage of the fork system call to avoid reading/writing or serializing altogether?

I am using a MacBook, so multiprocessing will use the fork system call instead of spawning a new process. I am using Python (with multiprocessing or Dask).
I have a very big pandas dataframe. I need many parallel subprocesses to each work with a portion of this one big dataframe. Let's say I have 100 partitions of this table that need to be worked on in parallel. I want to avoid making 100 copies of this big dataframe, as that would overwhelm memory. So the current approach I am taking is to partition it, save each partition to disk, and have each process read in the portion it is responsible for. But this read/write is very expensive for me, and I would like to avoid it.
But if I make this dataframe a global variable, then due to COW behavior each process will be able to read from it without making an actual physical copy (as long as it does not modify it). Now the question I have is: if I make this one global dataframe and name it:
global my_global_df
my_global_df = one_big_df
and then in one of the subprocesses I do:
a_portion_of_global_df_readonly = my_global_df.iloc[0:10]
a_portion_of_global_df_copied = a_portion_of_global_df_readonly.reset_index(drop=True)
# reset index will make a copy of the a_portion_of_global_df_readonly
do something with a_portion_of_global_df_copied
If I do the above, will I have created a copy of the entire my_global_df or just a copy of the a_portion_of_global_df_readonly, and thereby, in extension, avoided making copies of 100 one_big_df?
One additional, more general question: why do people have to deal with pickle serialization and/or reading/writing to disk to transfer data across multiple processes, when (assuming UNIX) setting the data as a global variable makes it available to all child processes so easily? Is there any danger in using COW as a means of making data available to subprocesses in general?
[Reproducible code from the thread below]
from multiprocessing import Process, Pool
import contextlib
import numpy as np
import pandas as pd

def my_function(elem):
    return id(elem)

num_proc = 4
num_iter = 10
df = pd.DataFrame(np.asarray([1]))
print(id(df))

with contextlib.closing(Pool(processes=num_proc)) as p:
    procs = [p.apply_async(my_function, args=(df, )) for elem in range(num_iter)]
    results = [proc.get() for proc in procs]
    p.close()
    p.join()

print(results)
Summarizing the comments: on a forking system such as Mac or Linux, a child process has a copy-on-write (COW) view of the parent's address space, including any DataFrames it may hold. It is safe to use and modify the dataframe in child processes without changing the data in the parent or in sibling child processes.
That means it is unnecessary to serialize the dataframe to pass it to the child. All you need is a reference to the dataframe. For a Process, you can just pass the reference directly:
p = multiprocessing.Process(target=worker_fctn, args=(my_dataframe,))
p.start()
p.join()
If you use a Queue or another tool such as a Pool, then the data will likely be serialized. You can get around that by using a global variable that is known to the worker but not actually passed to it, as in the sketch below.
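A minimal sketch of that global-variable trick with a Pool, assuming the fork start method discussed above; the dataframe is built before the pool is created and is only read in the workers, so it is never pickled, and only the small index arguments and results cross the process boundary (the column name and partition size are made up):
import multiprocessing as mp
import numpy as np
import pandas as pd

one_big_df = pd.DataFrame({'x': np.arange(1_000_000)})  # created before the fork

def work_on_partition(bounds):
    start, stop = bounds
    # reads the parent's dataframe through copy-on-write; only the small
    # aggregated result is serialized back to the parent
    part = one_big_df.iloc[start:stop]
    return part['x'].sum()

if __name__ == '__main__':
    bounds = [(i, i + 10_000) for i in range(0, len(one_big_df), 10_000)]
    with mp.Pool(processes=4) as pool:
        partial_sums = pool.map(work_on_partition, bounds)
    print(sum(partial_sums))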
What remains is the return data. It is in the child only and still needs to be serialized to be returned to the parent.

Is modifying a pre-allocated Python list thread-safe?

I have a function that dispatches calls to several Redis shards and stores the result in a pre-allocated Python list.
Basically, the code goes as follow:
def myfunc(calls):
    results = [None] * len(calls)
    for index, (connection, call) in enumerate(calls.iteritems()):
        results[index] = call(connection)
    return results
Obviously, as of now this calls the various Redis shards in sequence. I intend to use a thread pool and to make those calls happen in parallel as they can take quite some time.
My question is: given that the results list is preallocated and that each call has a dedicated slot, do I need to lock it when storing the results from each thread, or is there any guarantee that it will work without locking?
I will obviously profile the result in the end but I wouldn't want to lock if I don't really need to.
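For reference, a minimal sketch of the threaded version being considered, written with concurrent.futures and dict.items() (assumptions beyond the original Python 2 snippet); each call writes only to its own pre-allocated slot, and the sketch does not by itself settle the locking question:
from concurrent.futures import ThreadPoolExecutor

def myfunc_threaded(calls, max_workers=8):
    results = [None] * len(calls)

    def run_one(index, connection, call):
        # each thread writes only to its own dedicated index
        results[index] = call(connection)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for index, (connection, call) in enumerate(calls.items()):
            executor.submit(run_one, index, connection, call)
    return results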

python multiprocessing - how to act on interim results

I'm using pandas to calculate statistics etc. on a lot of data, but it ends up running for hours, and I get new data frequently. I've already tried to optimize it, but I'd like to make it faster, so I'm trying to use multiple processes. The problem I'm having is that I need to perform some interim work with the results as they come in, and the examples I've seen for multiprocessing.Process and Pool all wait for everything to finish before working with the results.
This is the heavily trimmed code I'm using now. The piece I want to put into separate processes is generateAnalytics().
for counter, symbol in enumerate(queuelist):  # queuelist
    if needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
        log.info('Shutting down analyticsRunner thread')
        break
    dfDay = generateAnalytics(symbol)  # slow running function (15s+)
    astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
    dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
    log.info('Processed {}/{} securities in queue.'.format(counter + 1, len(queuelist)))
    # do some stuff to update progress GUI
I can't figure out how to make the last lines work with the results while the processing is still ongoing and would appreciate suggestions.
I'm considering running it all in a Pool and having the processes add the results to a Queue (instead of returning them), and then have a while loop sit in the main process pulling off the queue as the results come in - would that be a reasonable way to do it? Something like:
mpqueue = multiprocessing.Queue()
pool = multiprocessing.Pool()
pool.map(generateAnalytics, [queuelist, mpqueue])

while not needQueueLoad:  # set by another thread that's monitoring for new data (in the form of a new file that arrives a couple times a day)
    while not mpqueue.empty():
        dfDay = mpqueue.get()
        astore[analyticsTable(symbol)] = dfDay  # astore is a pandas store (HDF5). analyticsTable() returns the name of the appropriate table, which gets overwritten
        dfLatest.loc[symbol] = dfDay.iloc[-1]  # update with the latest results (dfLatest is the latest results for each symbol, which is loaded as a global at startup and periodically saved back to the store in another thread)
        log.info('Processed {}/{} securities in queue.'.format(counter + 1, len(queuelist)))
        # do some stuff to update GUI that shows progress
    sleep(0.1)
    # do some bookkeeping to see if queue has finished
pool.join()
Using a Queue looks like a reasonable way to do it, with two remarks.
Since it looks from the code that you're using a GUI, checking for results is probably better done in a timeout function or idle function rather than in a while-loop. Using a while-loop to check for results would block the GUI's event loop.
If the worker processes need to return a lot of data to the main process via the Queue, this will add significant overhead. You might want to consider using shared memory or even an intermediate file.
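A rough sketch of the queue-based version, reusing the generateAnalytics, analyticsTable, astore, dfLatest and queuelist names from the question. Note that a plain multiprocessing.Queue cannot be passed as an argument to Pool workers, so a Manager queue is used instead, and the drain step is written as a function that a GUI timer or idle handler could call:
import queue
from multiprocessing import Manager, Pool

def worker(args):
    symbol, results_q = args
    results_q.put((symbol, generateAnalytics(symbol)))  # publish as soon as it is done

def drain(results_q):
    # meant to be called from a GUI timeout/idle handler, not a blocking loop
    while True:
        try:
            symbol, dfDay = results_q.get_nowait()
        except queue.Empty:
            return
        astore[analyticsTable(symbol)] = dfDay
        dfLatest.loc[symbol] = dfDay.iloc[-1]

if __name__ == '__main__':
    manager = Manager()
    results_q = manager.Queue()
    with Pool() as pool:
        pending = pool.map_async(worker, [(symbol, results_q) for symbol in queuelist])
        while not pending.ready():
            drain(results_q)   # act on interim results while the workers keep running
            pending.wait(0.1)
        drain(results_q)       # pick up anything that arrived last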

Does pool.map() from multiprocessing lock process to CPU core automatically?

I've submitted several questions over the last few days trying to understand how to use the multiprocessing Python library properly.
The method I'm currently using is to split a task over a number of processes equal to the number of available CPU cores on the machine, as follows:
import multiprocessing
from multiprocessing import Pool
from contextlib import closing

def myFunction(row):
    ...  # row function

with closing(Pool(processes=multiprocessing.cpu_count())) as pool:
    pool.map(myFunction, rowList)
However, when the map part is reached, the program actually seems to slow down rather than speed up. One of my functions, for example, moves through only 60 records (the first function) and prints a result at the end of each record. The printing seems to slow down to an eventual stop and not do much! I am wondering whether the program is loading the next function into memory asynchronously or whether there's something wrong with my methodology.
So I am wondering: are the child processes automatically 'locked' to a CPU core by pool.map(), or do I need to do something extra?
EDIT:
So the program does not actually stop, it just begins to print the values very slowly.
Here is an example of myFunction in very simplified terms (row comes from a list object):
def myFunction(row):
    d = string
    j = 0
    for item in object:
        d += row[j]
        j = j + 1
    d += row[x] + string
    d += row[y] + string
    print row[z]
    return
As I said, the above function is for a very small list; however, the function that follows it deals with a much, much larger list.
The problem is that you don't appear to be doing enough work in each call to the worker function. All you seem to be doing is pasting together the list of strings that is passed as the argument. But that is pretty much exactly the work the multiprocessing module has to do in the parent process to pass the list of strings to the worker process in the first place: it pickles them and writes them to a pipe, which the child process then reads, unpickles, and passes as the argument to myFunction.
Since the parent process has to do at least as much work to pass the argument as the worker process needs to do to process it, you gain no benefit from using the multiprocessing module in this case.
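If the real worker does substantially more work per row, one way to reduce the fixed pickling and pipe overhead is to hand each worker a batch of rows instead of a single row, for example via the chunksize argument of pool.map (the 500 here is arbitrary, and myFunction/rowList reuse the question's names):
import multiprocessing
from contextlib import closing

if __name__ == '__main__':
    with closing(multiprocessing.Pool(processes=multiprocessing.cpu_count())) as pool:
        # each task sent to a worker now carries 500 rows, so the cost of
        # pickling and writing to the pipe is paid far less often
        results = pool.map(myFunction, rowList, chunksize=500)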
