I have multiple data files that I process using the Python pandas library. Each file is processed one by one, and only one logical processor is used when I look at Task Manager (it sits at ~95%, while the rest are within 5%).
Is there a way to process the data files simultaneously?
If so, is there a way to utilize the other logical processors to do that?
(Edits are welcome)
If your file names are in a list, you could use this code:
from multiprocessing import Process

def YourCode(filename, otherdata):
    # Do your stuff for one file here
    ...

if __name__ == '__main__':
    # Post-process files in parallel
    ListOfFilenames = ['file1', 'file2', ..., 'file1000']
    Processors = 20  # number of processors you want to use
    # Divide the list of files into parts of 'Processors' files each
    Parts = [ListOfFilenames[i:i + Processors] for i in range(0, len(ListOfFilenames), Processors)]
    for part in Parts:
        ListOfProcesses = []
        # Start one process per file in this part...
        for f in part:
            p = Process(target=YourCode, args=(f, otherdata))
            p.start()
            ListOfProcesses.append(p)
        # ...then wait for them all to finish before starting the next part,
        # so at most 'Processors' processes run at the same time
        for p in ListOfProcesses:
            p.join()
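If you would rather not manage the chunking and the Process objects yourself, a multiprocessing.Pool does that bookkeeping for you. Here is a minimal sketch under the same assumptions as above (YourCode takes a filename plus some other data, and the file list is a placeholder):
from functools import partial
from multiprocessing import Pool

def YourCode(filename, otherdata):
    # Do your stuff for one file here
    ...

if __name__ == '__main__':
    ListOfFilenames = ['file1', 'file2', 'file3']  # your real list of files
    otherdata = None                               # whatever extra data YourCode needs
    # A pool of 20 workers: each worker picks up the next filename as soon as it
    # is free, so at most 20 files are processed at the same time.
    with Pool(processes=20) as pool:
        pool.map(partial(YourCode, otherdata=otherdata), ListOfFilenames)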
You can process the different files in different threads or in different processes.
The good thing about Python is that its standard library provides tools for you to do this:
from multiprocessing import Process

def process_panda(filename):
    # this function will be started in a different process
    process_panda_import()
    write_results()

if __name__ == '__main__':
    p1 = Process(target=process_panda, args=('file1',))
    # start process 1
    p1.start()
    p2 = Process(target=process_panda, args=('file2',))
    # start process 2
    p2.start()
    # wait until process 2 has finished
    p2.join()
    # wait until process 1 has finished
    p1.join()
The program will start 2 child processes, which can be used to process your files.
Of course you can do something similar with threads.
You can find the documentation here:
https://docs.python.org/2/library/multiprocessing.html
and here:
https://pymotw.com/2/threading/
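For completeness, here is a minimal threading sketch of the same idea (process_panda is the same placeholder function as above); keep in mind that for CPU-bound pandas work the GIL usually makes processes the better choice:
from threading import Thread

def process_panda(filename):
    # placeholder: load the file, process it, write the results
    ...

if __name__ == '__main__':
    t1 = Thread(target=process_panda, args=('file1',))
    t2 = Thread(target=process_panda, args=('file2',))
    t1.start()
    t2.start()
    # wait until both threads have finished
    t1.join()
    t2.join()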
So I have two webscrapers that collect data from two different sources. I am running them both simultaneously to collect a specific piece of data (e.g. covid numbers).
When one of the functions finds data I want to use that data without waiting for the other one to finish.
So far I have tried the multiprocessing Pool module and returning the results with get(), but by definition I have to wait for both get() calls to finish before I can continue with my code. My goal is to have the code as simple and as short as possible.
My webscraper functions can be run with arguments and return a result if found. It is also possible to modify them.
Here is the code I have so far, which waits for both get() calls to finish:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2
from twitter import post_tweet

if __name__ == '__main__':
    with Pool(processes=2) as pool:
        r1 = pool.apply_async(main_1, ('www.website1.com', 'June'))
        r2 = pool.apply_async(main_2, ())
        data = r1.get()
        data2 = r2.get()
        post_tweet("New data is {}".format(data))
        post_tweet("New data is {}".format(data2))
From here I have seen that threading might be a better option, since web scraping involves a lot of waiting and only a little parsing, but I am not sure how I would implement this.
I think the solution is fairly easy, but I have been searching and trying different things all day without much success, so I think I will just ask here. (I only started programming 2 months ago.)
As always, there are many ways to accomplish this task.
You have already mentioned using a Queue:
from multiprocessing import Process, Queue
from scraper1 import main_1
from scraper2 import main_2

def simple_worker(target, args, ret_q):
    # mp.Queue has its own mutex, so we don't need to worry about concurrent read/write
    ret_q.put(target(*args))

if __name__ == "__main__":
    q = Queue()
    p1 = Process(target=simple_worker, args=(main_1, ('www.website1.com', 'June'), q))
    p2 = Process(target=simple_worker, args=(main_2, ('www.website2.com', 'July'), q))
    p1.start()
    p2.start()
    first_result = q.get()
    do_stuff(first_result)
    # Don't forget to get() the second result before you quit. It's not a good idea to
    # leave things in a Queue and just assume it will be properly cleaned up at exit.
    second_result = q.get()
    p1.join()
    p2.join()
You could also still use a Pool by using imap_unordered and just taking the first result:
from multiprocessing import Pool
from scraper1 import main_1
from scraper2 import main_2

def simple_worker2(args):
    target, arglist = args  # unpack args
    return target(*arglist)

if __name__ == "__main__":
    tasks = ((main_1, ('www.website1.com', 'June')),
             (main_2, ('www.website2.com', 'July')))
    # The Pool context manager handles worker cleanup (your target function may,
    # however, be interrupted at any point if the pool exits before a task is complete)
    with Pool() as p:
        for result in p.imap_unordered(simple_worker2, tasks, chunksize=1):
            do_stuff(result)
            break  # don't bother with further results
I've seen people use queues in such cases: create one and pass it to both parsers so that they put their results in the queue instead of returning them. Then do a blocking pop on the queue to retrieve the first available result.
"I have seen that threading might be a better option"
Almost true, but not quite. I'd say that asyncio and async-based libraries are much better than both threading and multiprocessing when we're talking about code with a lot of blocking I/O. If it's applicable in your case, I'd recommend rewriting both your parsers in async.
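For example, here is a minimal asyncio sketch of that idea; the coroutine names and bodies are placeholders, not the asker's actual scrapers:
import asyncio

async def scrape_1():
    # placeholder: async rewrite of main_1 using an async HTTP client
    await asyncio.sleep(1)
    return "data from source 1"

async def scrape_2():
    # placeholder: async rewrite of main_2
    await asyncio.sleep(2)
    return "data from source 2"

async def main():
    tasks = [asyncio.create_task(scrape_1()), asyncio.create_task(scrape_2())]
    # Act on whichever scraper finishes first, without waiting for the other
    for finished in asyncio.as_completed(tasks):
        first_result = await finished
        print("New data is {}".format(first_result))
        break
    # Still await the remaining task so it is not cancelled mid-request
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    asyncio.run(main())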
I have the following situation:
1. Datasets are generated by an external device, at varying intervals (between 0.1 s and 90 s). The code sleeps between acquisitions.
2. Each dataset needs to be post-processed (which is CPU-bound, single-threaded and requires 10 s to 20 s). Post-processing should not block (1).
3. Acquisition and post-processing should work asynchronously, and whenever one dataset is done I want to update a pyplot graph in a Jupyter notebook (currently using ipython widgets) with the data from the post-processing. The plotting should also not block (1).
Doing (1) and (2) serially is easy: I acquire all datasets, storing them in a list, then process each item, then display.
I don't know how to set this up in a parallel way, or how to start. Do I use callback functions? Do callbacks work across processes? How do I set up the correct number of processes (acquisition in one, processing and plotting in the rest, one per core)? Can all processes modify the same list of all datasets? Is there a better data structure to use? Can it be done in Python?
This is a general outline of the processes you need and how to put them together, along the lines of (more or less) what I described in my comment. There are other approaches, but I think this is the easiest to understand. There are also more "industrial strength" products that implement message queueing, but with even steeper learning curves.
from multiprocessing import Process, Queue, cpu_count

def acquirer_process(post_process_queue):
    while True:
        # Get the next file and put it on the post-processing queue:
        info_about_file_just_acquired = acquire_next_file()
        post_process_queue.put(info_about_file_just_acquired)

def post_process_process(post_process_queue, plotting_queue):
    while True:
        info_about_file_just_acquired = post_process_queue.get()
        # Post-process this file:
        info_about_post_processed_file = post_process(info_about_file_just_acquired)
        plotting_queue.put(info_about_post_processed_file)

def plotting_process(plotting_queue):
    while True:
        # Get plotting info for the next post-processed file:
        info_about_post_processed_file = plotting_queue.get()
        # Plot it:
        plot(info_about_post_processed_file)

def main():
    """
    The main program.
    """
    n_processors = cpu_count()
    # We need one acquirer process.
    # We need one plotting process, since the assumption is
    # that only a single process (thread) can be plotting at a time.
    # That leaves n_processors - 2 free to post-process acquired files in parallel:
    post_process_queue = Queue()
    plotting_queue = Queue()
    processes = []
    # All the processes that follow are "daemon" processes and will automatically
    # terminate when the main process terminates:
    processes.append(Process(target=acquirer_process, args=(post_process_queue,), daemon=True))
    processes.append(Process(target=plotting_process, args=(plotting_queue,), daemon=True))
    for _ in range(n_processors - 2):
        processes.append(Process(target=post_process_process, args=(post_process_queue, plotting_queue), daemon=True))
    # Start the processes:
    for process in processes:
        process.start()
    # Pause the main process:
    input('Hit enter to terminate:')

# Required for Windows:
if __name__ == '__main__':
    main()
I am looking for some good example code of multiprocessing in Python that would take in a large array (broken into different sections of the same main array) to speed up the processing of the subsequent output file. I have noticed that there are things like Lock() to make sure results come back in a certain order, but not a good example of how to get the resulting arrays back out when the jobs are run, so I can output a single CSV file in the correct time-series order.
Below is what I have been working with so far with the queue. How can one assign the results of q1.get() or the others to recombine later? It just spins when I try assigning it with temp = q1.get()... Good examples of splitting out an array, sending it to multiple processes, and then recombining the results of the function(s) called would be appreciated. I am using Python 3.7 and Windows 10.
import time
import multiprocessing
from multiprocessing import Process, Queue

def f1(q, testArray):
    testArray2 = [[41, None, 'help'], [42, None, 'help'], [43, None, 'help']]
    testArray = testArray + testArray2
    q.put(testArray)

def f2(q, testArray):
    #testArray.append([43, None, 'goodbye'])
    testArray = testArray + ([44, None, 'goodbye'])
    q.put(testArray)
    return testArray

if __name__ == '__main__':
    print("Number of cpu : ", multiprocessing.cpu_count())
    testArray1 = [1]
    testArray2 = [2]
    q1 = Queue()
    q2 = Queue()
    p1 = multiprocessing.Process(target=f1, args=(q1, testArray1,))
    p2 = multiprocessing.Process(target=f2, args=(q2, testArray2,))
    p1.start()
    p2.start()
    print(q1.get())  # prints whatever you set in the function above
    print(q2.get())  # prints whatever you set in the function above
    print(testArray1)
    print(testArray2)
    p1.join()
    p2.join()
I believe you only need one queue for all of your processes. The queue is designed for inter-process communication.
For the ordering, you can pass in a process id and sort on that after the results are collected. Or you can try to use a multiprocessing Pool, as furas suggests.
That sounds like a better approach. Worker pools in general allocate a pool of workers up front and then run a set of jobs on that pool. This is more efficient because the processes/threads are set up once and reused for jobs. With your current implementation, a process is created per job/function, which is costly depending on how much data you're crunching.
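Here is a minimal sketch of the Pool approach for the ordering problem; process_chunk and the array contents are placeholders, but the key point is that Pool.map returns results in the same order as its inputs, so the recombined output stays in time-series order:
from multiprocessing import Pool

def process_chunk(chunk):
    # placeholder: do the real per-section work here and return processed rows
    return [[value, None, 'processed'] for value in chunk]

if __name__ == '__main__':
    big_array = list(range(100))  # stand-in for your large array
    n_workers = 4
    chunk_size = len(big_array) // n_workers
    chunks = [big_array[i:i + chunk_size] for i in range(0, len(big_array), chunk_size)]
    with Pool(processes=n_workers) as pool:
        # map() blocks until every chunk is done and preserves the input order
        results = pool.map(process_chunk, chunks)
    # Flatten the per-chunk results back into one list, already in order
    recombined = [row for chunk_result in results for row in chunk_result]
    print(len(recombined))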
I am a beginner when it comes to python threading and multiprocessing so please bear with me.
I want to make a system that consists of three Python scripts. The first one creates some data and continuously sends this data to the second script. The second script takes the data and saves it to a file until the file exceeds a defined memory limit. When that happens, the third script sends the data to an external device and gets rid of this "cache". I need all of this to happen concurrently. The pseudocode below sums up what I am trying to do.
def main_1():
    data = [1, 2, 3]
    send_to_second_script(data)

def main_2():
    rec_data = receive_from_first_script()
    save_to_file(rec_data)
    if file > limit:
        signal_third_script()

def main_3():
    if signal is true:
        send_data_to_external_device()
        remove_data_from_disk()
I understand that I can use queues to make this happen but I am not sure how.
Also, to do this I have so far tried a different approach where I created one Python script and used threading to spawn a thread for each part of the process. Is this correct, or is using queues better?
Firstly, for Python you need to be really aware of what benefits multithreading/multiprocessing give you. IMO you should be considering multiprocessing instead of multithreading: threading in Python is not actually concurrent due to the GIL, and there are many explanations out there on which one to use. The easiest way to choose is to see whether your program is I/O-bound or CPU-bound. Anyway, on to the Queue, which is a simple way to work with multiple processes in Python.
Using your pseudocode as an example, here is how you would use a Queue:
import multiprocessing

def main_1(output_queue):
    test = 0
    while test <= 10:  # simple limit to not run forever
        data = [1, 2, 3]
        print("Process 1: Sending data")
        output_queue.put(data)  # puts data in the queue (FIFO)
        test += 1
    output_queue.put("EXIT")  # triggers the exit clause

def main_2(input_queue, output_queue):
    file = 0  # dummy pseudo variables
    limit = 1
    while True:
        rec_data = input_queue.get()  # get the latest data from the queue; blocks if empty
        if rec_data == "EXIT":  # the exit clause is a way to cleanly shut down your processes
            output_queue.put("EXIT")
            print("Process 2: exiting")
            break
        print("Process 2: saving to file:", rec_data, "count = ", file)
        file += 1
        #save_to_file(rec_data)
        if file > limit:
            file = 0
            output_queue.put(True)

def main_3(input_queue):
    while True:
        signal = input_queue.get()
        if signal is True:
            print("Process 3: Data sent and removed")
            #send_data_to_external_device()
            #remove_data_from_disk()
        elif signal == "EXIT":
            print("Process 3: Exiting")
            break

if __name__ == '__main__':
    q1 = multiprocessing.Queue()  # initializing the queues and the processes
    q2 = multiprocessing.Queue()
    p1 = multiprocessing.Process(target=main_1, args=(q1,))
    p2 = multiprocessing.Process(target=main_2, args=(q1, q2,))
    p3 = multiprocessing.Process(target=main_3, args=(q2,))
    p = [p1, p2, p3]
    for i in p:  # start all processes
        i.start()
    for i in p:  # ensure all processes are finished
        i.join()
The prints may be a little off because I did not bother to lock stdout, but using a queue ensures that data moves from one process to another.
EDIT: Do be aware that you should also have a look at multiprocessing locks to ensure that your file is "thread-safe" when performing the move/delete. The pseudocode above only demonstrates how to use a queue.
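A minimal sketch of that, assuming a shared multiprocessing.Lock is passed to every process that touches the cache file (the file path and the commented helper are placeholders):
import multiprocessing

def writer(lock, path):
    # process 2 would hold the lock while appending to the cache file
    with lock:
        with open(path, "a") as f:
            f.write("new data\n")

def mover(lock, path):
    # process 3 holds the same lock while sending/clearing the cache,
    # so the two processes never touch the file at the same time
    with lock:
        # send_data_to_external_device(path)  # placeholder
        open(path, "w").close()  # clear the cached data

if __name__ == '__main__':
    lock = multiprocessing.Lock()
    path = "cache.txt"
    p2 = multiprocessing.Process(target=writer, args=(lock, path))
    p3 = multiprocessing.Process(target=mover, args=(lock, path))
    p2.start()
    p3.start()
    p2.join()
    p3.join()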
I have a program that needs to create several graphs, with each one often taking hours. Therefore I want to run these simultaneously on different cores, but cannot seem to get these processes to run with the multiprocessing module. Here is my code:
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=full_graph)
        jobs.append(p)
        p.start()
        p.join()
(full_graph() has been defined earlier in the program, and is simply a function that runs a collection of other functions)
The function normally outputs some graphs and saves the data to a .txt file. All data is saved to the same 2 text files. However, calling the function using the above code gives no console output, nor any output to the text files. All that happens is a pause of a few seconds, and then the program exits.
I am using the Spyder IDE with WinPython 3.6.3
Without a simple full_graph sample, nobody can tell you what's happening. But your code is inherently wrong.
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=full_graph)
        jobs.append(p)
        p.start()
        p.join()  # <- This blocks until p is done
See the comment after p.join(). If your processes really take hours to complete, you would run one process for hours, then the 2nd, then the 3rd: serially, and using a single core.
From the docs: https://docs.python.org/3/library/multiprocessing.html
Process.join: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join
If the optional argument timeout is None (the default), the method blocks until the process whose join() method is called terminates. If timeout is a positive number, it blocks at most timeout seconds. Note that the method returns None if its process terminates or if the method times out. Check the process’s exitcode to determine if it terminated.
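If you want to keep using Process objects rather than a Pool, a minimal fix (sketch) is to start all the processes first and only join them afterwards:
if __name__ == '__main__':
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=full_graph)
        jobs.append(p)
        p.start()  # all five processes are now running in parallel
    for p in jobs:
        p.join()   # wait for every process to finish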
If each process does something different, you should then also have some args for full_graph (hint: might that be the missing factor?).
You probably want to use an interface like map from Pool:
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
And do (from the docs again):
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
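Applied to your case, that would mean giving full_graph a parameter and calling something like p.map(full_graph, range(5)), so each worker builds a different graph; exactly what that parameter should be depends on your code, so treat this as a sketch.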