I'm working on the human genome, which consists of about 3.2 billion characters, and I have a list of objects that need to be searched within this data. Something like this:
result_final = []
objects = ['obj1', 'obj2', 'obj3', ...]

def function(obj):
    result_1 = search_in_genome(obj)
    return result_1

for item in objects:
    result_2 = function(item)
    result_final.append(result_2)
Each object's search within the data takes nearly 30 seconds, and I have a few thousand objects. I noticed that while doing this serially, just 7% of the CPU and 5% of the RAM is being used. From what I've read, to reduce the computation time I should do parallel computation using queuing, threading, or multiprocessing, but these seem complicated for non-experts. Could anybody show me how to write the Python code to run 10 simultaneous searches, and is it possible to make Python use the maximum available CPU and RAM for multiprocessing? (I'm using Python 3.3 on Windows 7, with 64 GB RAM and a 3.5 GHz Core i7 CPU.)
You can use the multiprocessing module for this:
from multiprocessing import Pool

objects = ['obj1', 'obj2', 'obj3', ...]

def function(obj):
    result_1 = search_in_genome(obj)
    return result_1

if __name__ == "__main__":
    pool = Pool()
    result_final = pool.map(function, objects)
This will allow you to scale the work across all available CPUs on your machine, because processes aren't affected by the GIL. You wouldn't want to run many more worker processes than there are CPUs available; beyond that point you actually start slowing things down, because the CPUs have to constantly switch between processes, which carries a performance penalty.
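If you specifically want to cap it at 10 simultaneous searches, as asked, here is a minimal sketch (assuming search_in_genome is your existing search function and objects is your real list of a few thousand items):

from multiprocessing import Pool

def function(obj):
    return search_in_genome(obj)  # your existing search function

if __name__ == "__main__":
    objects = ['obj1', 'obj2', 'obj3']   # your real list of objects
    with Pool(processes=10) as pool:     # at most 10 searches run at once
        result_final = pool.map(function, objects, chunksize=1)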
OK, I'm not sure I fully understand your question, but I would do the following (note that there may be a better solution, because I'm not an expert with the Queue object):
If you want to multithread your searches:
import threading

class myThread(threading.Thread):
    def __init__(self, obj):
        threading.Thread.__init__(self)
        self.result = None
        self.obj = obj

    # Function that is called when you start your thread
    def run(self):
        # Execute your function here
        self.result = search_in_genome(self.obj)

if __name__ == '__main__':
    result_final = []
    objects = ['obj1', 'obj2', 'obj3', ...]
    # List of threads
    listThread = []
    # Number of threads we expect to finish
    allThread = len(objects)
    allThreadDone = 0
    for item in objects:
        # Create one thread
        thread = myThread(item)
        # Launch that thread
        thread.start()
        # Store it in the list
        listThread.append(thread)
    while True:
        for thread in listThread:
            # Count the threads that are finished
            if thread.result is not None:
                allThreadDone += 1
        # If all threads are finished, then stop
        if allThreadDone == allThread:
            break
        # Else reset the counter and check again
        else:
            allThreadDone = 0
If someone can check and validate this code, that would be great. (Sorry for my English, by the way.)
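As a side note, a simpler way to wait for all of these threads (a minimal sketch, assuming the same myThread class and objects list as above) is to join each one and then read its result attribute:

listThread = [myThread(item) for item in objects]
for thread in listThread:
    thread.start()
for thread in listThread:
    thread.join()  # blocks until this thread has finished
result_final = [thread.result for thread in listThread]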
Related
I have a function readFiles that I need to call 8.5 million times (essentially stress-testing a logger to ensure the log rotates correctly). I don't care about the output/result of the function, only that I run it N times as quickly as possible.
My current solution is this:
from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)

    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times, and I need to wait for all the reads to finish. Based on my mental math, this spawns ~60 threads per second, which means it will take ~40 hours to finish. Ideally, this would finish within 1-8 hours.
Is this possible? Is the number of iterations simply too high for this to be done in a reasonable span of time?
Oddly enough, when I wrote a test script, I was able to generate a thread about every ~0.0005 seconds, which should equate to ~2000 threads per second, but this is not the case here.
I considered iterating 8,500,000 / 10 times and spawning a thread that then runs the readFile function 10 times, which should decrease the amount of time by ~90%, but it caused some issues with blocking resources, and I think passing a lock around would be a bit complicated insofar as keeping the function usable by methods that don't incorporate threading.
Any tips?
Based on #blarg's comment, and scripts I've used with multiprocessing, the following can be considered.
It simply reads the same file repeatedly, based on the size of the list. Here I'm looking at 1M reads.
With 1 core it takes around 50 seconds; with 8 cores it's down to around 22 seconds. This is on a Windows PC, but I use these scripts on Linux EC2 (AWS) instances as well.
Just put this in a Python file and run it:
import os
import time
from multiprocessing import Pool

def readfile(fn):
    # open the file (and make sure the handle is closed again)
    with open(fn, "r"):
        pass

def _multiprocess(mylist, num_proc):
    with Pool(num_proc) as pool:
        r = pool.starmap(readfile, zip(mylist))
        pool.close()
        pool.join()
    return r

if __name__ == "__main__":
    __spec__ = None  # workaround for multiprocessing under some IDEs (e.g. Spyder)

    # use the system cpus or change explicitly
    num_proc = os.cpu_count()
    num_proc = 1  # override here to time the single-core run

    start = time.time()
    # Here you'll want 8.5M entries, but first test that it works with a smaller
    # number. Note that this approach is slow for a low number of reads: 8 cores
    # is slower than 1 core until you reach a certain point, after which
    # multiprocessing is worth it.
    mylist = ["test.txt"] * 1000000
    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time() - start)
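Since every worker here takes a single filename, the starmap/zip pair could also be a plain pool.map; a minimal equivalent sketch of _multiprocess under that assumption:

def _multiprocess(mylist, num_proc):
    # pool.map passes each element of mylist to readfile as a single argument
    with Pool(num_proc) as pool:
        return pool.map(readfile, mylist)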
I think you should reconsider using subprocess here. If you just want to execute the ls command, I think it's better to use os.system, since it reduces the resource consumption of each call compared to your current approach.
Also, put a small delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption:
from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls " + filename)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)

    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1)  # delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
I have a similar and simple computation task with three different parameters, so I took the chance to test how much time I can save by using multithreading.
Here is my code:
import threading
import time
from Crypto.Hash import MD2
def calc_func(text):
t1 = time.time()
h = MD2.new()
total = 10000000
old_text =text
for n in range(total):
h.update(text)
text = h.hexdigest()
print(f"thread done: old_text={old_text} new_text={text}, time={time.time()-t1}sec")
def do_3threads():
t0 = time.time()
texts = ["abcd", "abcde", "abcdef"]
ths = []
for text in texts:
th = threading.Thread(target=calc_func, args=(text,))
th.start()
ths.append(th)
for th in ths:
th.join()
print(f"main done: {time.time()-t0}sec")
def do_single():
texts = ["abcd", "abcde", "abcdef"]
for text in texts:
calc_func(text)
if __name__ == "__main__":
print("=== 3 threads ===")
do_3threads()
print("=== 1 thread ===")
do_single()
The result is astonishing: each thread takes roughly 4x the time it takes when single-threaded:
=== 3 threads ===
thread done: old_text=abcdef new_text=e8f636b1893f12abe956dc019294e923, time=25.460321187973022sec
thread done: old_text=abcd new_text=0d6cae713809c923475ea50dbfbb2c13, time=25.47859835624695sec
thread done: old_text=abcde new_text=cd028131bc5e161671a1c91c62e80f6a, time=25.4807870388031sec
main done: 25.481309175491333sec
=== 1 thread ===
thread done: old_text=abcd new_text=0d6cae713809c923475ea50dbfbb2c13, time=6.393985033035278sec
thread done: old_text=abcde new_text=cd028131bc5e161671a1c91c62e80f6a, time=6.5472939014434814sec
thread done: old_text=abcdef new_text=e8f636b1893f12abe956dc019294e923, time=6.483690977096558sec
This is totally not what I expected. This task is obviously CPU-intensive, so I expected that, with multithreading, each thread would take around 6.5 seconds and the whole process would finish slightly over that; instead it actually took ~25.5 seconds, even worse than single-threaded mode, which takes ~20 seconds in total.
The environment is Python 3.7.7 on macOS 10.15.5; the CPU is an 8-core Intel i9 with 16 GB of memory.
Can someone explain that to me? Any input is appreciated.
This task is obviously a CPU intensive task
Multithreading is not the proper tool for CPU-bound tasks, but rather for things like network requests. This is because of the Global Interpreter Lock (GIL): within one Python process, only one thread can execute Python bytecode at a time, so the threads end up sharing a single core's worth of work no matter how many cores the machine has.
Multiprocessing is what you are looking for, as it allows you to spawn multiple processes on, potentially, multiple cores.
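For illustration, a minimal sketch of the same experiment with multiprocessing.Pool (assuming the Crypto.Hash MD2 module from the question is installed; the update argument is encoded to bytes here, which recent versions of the library require). Each worker is a separate process with its own GIL, so the three hashes can actually run in parallel:

import time
from multiprocessing import Pool

from Crypto.Hash import MD2

def calc_func(text):
    # same loop as the question's calc_func
    t1 = time.time()
    h = MD2.new()
    for _ in range(10000000):
        h.update(text.encode())  # encode to bytes for the hash update
        text = h.hexdigest()
    print(f"worker done: new_text={text}, time={time.time()-t1}sec")

if __name__ == "__main__":
    t0 = time.time()
    with Pool(processes=3) as pool:
        pool.map(calc_func, ["abcd", "abcde", "abcdef"])
    print(f"main done: {time.time()-t0}sec")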
I have a script that executes a certain function with multithreading. Now, I'd like to have only as many threads running in parallel as there are CPU cores.
The current code (1:), using threading.Thread, creates 1000 threads and runs them all simultaneously.
I want to turn this into something that runs only a fixed number of threads at the same time (e.g., 8) and puts the rest into a queue until an executing thread/CPU core is free.
1:
import threading
nSim = 1000
def simulation(i):
print(str(threading.current_thread().getName()) + ': '+ str(i))
if __name__ == '__main__':
threads = [threading.Thread(target=simulation,args=(i,)) for i in range(nSim)]
for t in threads:
t.start()
for t in threads:
t.join()
Q1: Is code 2: doing what I described (multithreading with a maximum number of threads running simultaneously)? Is it correct? (I think so, but I'm not 100% sure.)
Q2: Now the code initiates 1000 threads at the same time and executes them on 8 threads. Is there a way to only initiate a new thread when an executing thread/CPU core is free, so that I don't have 990 thread calls waiting from the beginning to be executed?
Q3: Is there a way to track which CPU core executed which thread? Just to prove that the code is doing what it should do.
2:
import threading
import multiprocessing
print(multiprocessing.cpu_count())
from concurrent.futures import ThreadPoolExecutor
nSim = 1000
def simulation(i):
print(str(threading.current_thread().getName()) + ': '+ str(i))
if __name__ == '__main__':
with ThreadPoolExecutor(max_workers=8) as executor:
for i in range (nSim):
res = executor.submit(simulation, i)
print(res.result())
A1: In order to limit the number of threads which can simultaneously access some resource, you can use threading.Semaphore. Actually, 1000 threads will not give you a tremendous speed boost; the recommended number of threads per process is mp.cpu_count()*1, or mp.cpu_count()*2 in some articles. Also note that threads are good for IO operations in Python, but not for computing, due to the GIL.
A2: Why do you need so many threads if you want to run only 8 of them simultaneously? Create just 8 threads and then supply them with tasks when the tasks are ready; to do so you need to use queue.Queue(), which is thread-safe (see the sketch after the code below). But in your concrete example you can do just the following to run your test 250 times per thread, using a while loop inside the simulation function; by the way, you do not need a Semaphore in that case.
A3. When we are talking about multithreading, you have one process with multiple threads.
import threading
import time
import multiprocessing as mp

def simulation(i, _s):
    # _s is a threading.Semaphore()
    with _s:
        print(str(threading.current_thread().getName()) + ': ' + str(i))
        time.sleep(3)

if __name__ == '__main__':
    print("Cores number: {}".format(mp.cpu_count()))
    # recommended number of threads is mp.cpu_count()*1, or mp.cpu_count()*2 in some articles
    nSim = 25
    s = threading.Semaphore(4)  # max number of threads which can work simultaneously with the resource is 4
    threads = [threading.Thread(target=simulation, args=(i, s, )) for i in range(nSim)]
    for t in threads:
        t.start()

    # just to prove that all threads are active at the start and then their number decreases as the work is done
    for i in range(6):
        print("Active threads number {}".format(threading.active_count()))
        time.sleep(3)
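For A2, a minimal sketch of the queue.Queue approach (same simulation workload as in the question, with a hypothetical worker function): 8 worker threads pull task numbers from a thread-safe queue until it is empty, so at most 8 tasks are in flight and nothing is waiting as an already-started thread.

import threading
import queue

nSim = 1000
NUM_WORKERS = 8

def simulation(i):
    print(str(threading.current_thread().getName()) + ': ' + str(i))

def worker(q):
    while True:
        try:
            i = q.get_nowait()  # grab the next task number, if any
        except queue.Empty:
            return              # queue drained, this worker is done
        simulation(i)

if __name__ == '__main__':
    q = queue.Queue()
    for i in range(nSim):
        q.put(i)
    workers = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()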
A1: No. Your code submits a task, receives a Future in res, and then calls result(), which waits for the result. Only after the previous task is done is a new task given to a thread, so only one of the worker threads is really working at a time.
Take a look at ThreadPool.map (actually Pool.map) instead of submit to distribute tasks among the workers.
A2: Only 8 threads (the number of workers) are used here at most. If using map the input data of the 1000 tasks may be stored (needs memory) but no additional threads are created.
A3: Not that I know of. A thread is not bound to a core, it may switch between them fast.
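A minimal sketch of the pattern A1 describes, assuming the simulation function from the question: submit all tasks first, and only then wait on the results, so the 8 workers actually run concurrently. (executor.map does the same thing in one call.)

import threading
from concurrent.futures import ThreadPoolExecutor

nSim = 1000

def simulation(i):
    print(str(threading.current_thread().getName()) + ': ' + str(i))

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as executor:
        # submit everything first; each submit returns a Future immediately
        futures = [executor.submit(simulation, i) for i in range(nSim)]
        # now wait for all of them (this also re-raises any worker exceptions)
        for f in futures:
            f.result()
        # equivalently: list(executor.map(simulation, range(nSim)))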
I have code that makes unique combinations of elements. There are 6 types, and there are about 100 of each. So there are 100^6 combinations. Each combination has to be calculated, checked for relevance and then either be discarded or saved.
The relevant bit of the code looks like this:
def modconffactory():
for transmitter in totaltransmitterdict.values():
for reciever in totalrecieverdict.values():
for processor in totalprocessordict.values():
for holoarray in totalholoarraydict.values():
for databus in totaldatabusdict.values():
for multiplexer in totalmultiplexerdict.values():
newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
data_I_need = dosomethingwith(newconfiguration)
saveforlateruse_if_useful(data_I_need)
Now this takes a long time, and that is fine, but I realize this process (making the configurations and then the calculations for later use) is only using 1 of my 8 processor cores at a time.
I've been reading up about multithreading and multiprocessing, but I only see examples of different processes, not how to multithread one process. In my code I call two functions: 'dosomethingwith()' and 'saveforlateruse_if_useful()'. I could make those into separate processes and have those run concurrently to the for-loops, right?
But what about the for-loops themselves? Can I speed up that one process? Because that is where the time consumption is. (<-- This is my main question)
Is there a cheat? For instance, compiling to C so that the OS multithreads automatically?
I only see examples of different processes, not how to multithread one process
There is multithreading in Python, but it is very ineffective for this because of the GIL (Global Interpreter Lock). So if you want to use all of your processor cores, if you want concurrency, you have no other choice than to use multiple processes, which can be done with the multiprocessing module (well, you could also use another language without such problems).
Approximate example of multiprocessing usage for your case:
import multiprocessing
WORKERS_NUMBER = 8
def modconffactoryProcess(generator, step, offset, conn):
"""
Function to be invoked by every worker process.
generator: iterable object, the very top one of all you are iterating over,
in your case, totaltransmitterdict.values()
We are passing a whole iterable object to every worker, they all will iterate
over it. To ensure they will not waste time by doing the same things
concurrently, we will assume this: each worker will process only each stepTH
item, starting with offsetTH one. step must be equal to the WORKERS_NUMBER,
and offset must be a unique number for each worker, varying from 0 to
WORKERS_NUMBER - 1
conn: a multiprocessing.Connection object, allowing the worker to communicate
with the main process
"""
for i, transmitter in enumerate(generator):
if i % step == offset:
for reciever in totalrecieverdict.values():
for processor in totalprocessordict.values():
for holoarray in totalholoarraydict.values():
for databus in totaldatabusdict.values():
for multiplexer in totalmultiplexerdict.values():
newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
data_I_need = dosomethingwith(newconfiguration)
saveforlateruse_if_useful(data_I_need)
conn.send('done')
def modconffactory():
"""
Function to launch all the worker processes and wait until they all complete
their tasks
"""
processes = []
generator = totaltransmitterdict.values()
for i in range(WORKERS_NUMBER):
conn, childConn = multiprocessing.Pipe()
process = multiprocessing.Process(target=modconffactoryProcess, args=(generator, WORKERS_NUMBER, i, childConn))
process.start()
processes.append((process, conn))
# Here we have created, started and saved to a list all the worker processes
working = True
finishedProcessesNumber = 0
try:
while working:
for process, conn in processes:
if conn.poll(): # Check if any messages have arrived from a worker
message = conn.recv()
if message == 'done':
finishedProcessesNumber += 1
if finishedProcessesNumber == WORKERS_NUMBER:
working = False
except KeyboardInterrupt:
print('Aborted')
You can adjust WORKERS_NUMBER to your needs.
Same with multiprocessing.Pool:
import multiprocessing
WORKERS_NUMBER = 8
def modconffactoryProcess(transmitter):
for reciever in totalrecieverdict.values():
for processor in totalprocessordict.values():
for holoarray in totalholoarraydict.values():
for databus in totaldatabusdict.values():
for multiplexer in totalmultiplexerdict.values():
newconfiguration = [transmitter, reciever, processor, holoarray, databus, multiplexer]
data_I_need = dosomethingwith(newconfiguration)
saveforlateruse_if_useful(data_I_need)
def modconffactory():
pool = multiprocessing.Pool(WORKERS_NUMBER)
pool.map(modconffactoryProcess, totaltransmitterdict.values())
You probably would like to use .map_async instead of .map
Both snippets do the same, but I would say in the first one you have more control over the program.
I suppose the second one is the easiest, though :)
But the first one should give you the idea of what is happening in the second one
multiprocessing docs: https://docs.python.org/3/library/multiprocessing.html
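For reference, a minimal sketch of the .map_async variant mentioned above, reusing the names from the second snippet (WORKERS_NUMBER, modconffactoryProcess, totaltransmitterdict); map_async returns an AsyncResult immediately, so the main process is free to do other work while the workers run:

import multiprocessing

def modconffactory():
    pool = multiprocessing.Pool(WORKERS_NUMBER)
    async_result = pool.map_async(modconffactoryProcess, totaltransmitterdict.values())
    # ... the main process can do other things here ...
    async_result.wait()  # block until all workers have finished
    pool.close()
    pool.join()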
You can run your function this way:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(5)
print(p.map(f, [1, 2, 3]))
https://docs.python.org/2/library/multiprocessing.html#using-a-pool-of-workers
Good day!
I'm trying to learn multithreading features in python and I wrote the following code:
import time, argparse, threading, sys, subprocess, os
def item_fun(items, indices, lock):
for index in indices:
items[index] = items[index]*items[index]*items[index]
def map(items, cores):
count = len(items)
cpi = count/cores
threads = []
lock = threading.Lock()
for core in range(cores):
thread = threading.Thread(target=item_fun, args=(items, range(core*cpi, core*cpi + cpi), lock))
threads.append(thread)
thread.start()
item_fun(items, range((core+1)*cpi, count), lock)
for thread in threads:
thread.join()
parser = argparse.ArgumentParser(description='cube', usage='%(prog)s [options] -n')
parser.add_argument('-n', action='store', help='number', dest='n', default='1000000', metavar='')
parser.add_argument('-mp', action='store_true', help='multi thread', dest='mp', default='True')
args = parser.parse_args()

NUMBER_OF_ITEMS = int(args.n)
items = range(NUMBER_OF_ITEMS)
# print 'items before:'
# print items

mp = args.mp
if mp is True:
    NUMBER_OF_PROCESSORS = int(os.getenv("NUMBER_OF_PROCESSORS"))
    start = time.time()
    map(items, NUMBER_OF_PROCESSORS)
    end = time.time()
else:
    start = time.time()
    item_fun(items, range(NUMBER_OF_ITEMS), None)
    end = time.time()

# print 'items after:'
# print items
print 'time elapsed: ', (end - start)
When I use the mp argument it works slower: on my machine with 4 CPUs it takes about 0.5 seconds to compute the result, while if I use a single thread it takes about 0.3 seconds.
Am I doing something wrong?
I know there's Pool.map() etc., but it spawns subprocesses, not threads, and it works faster as far as I know; still, I'd like to write my own thread pool.
Python has no true multithreading, due to an implementation detail called the "GIL". Only one thread actually runs Python code at a time, and Python switches between the threads. (Third-party implementations of Python, such as Jython, can actually run threads in parallel.)
Exactly why your program is slower in the multithreaded version depends on the details, but when coding for Python one needs to be aware of the GIL, so one does not assume that CPU-bound loads are processed more efficiently by adding threads to the program.
Other things to be aware of are for instance multiprocessing and numpy for solving CPU bound loads, and PyEv (minimal) and Tornado (huge kitchen sink) for solving I/O bound loads.
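For the CPU-bound part, a minimal sketch of the cubing task from the question using multiprocessing.Pool instead of threads (Python 3 here, unlike the Python 2 code above); each worker is a separate process, so the GIL is not an issue. Note that for an operation this cheap, the inter-process overhead can easily outweigh the gain, so large chunks are used:

import os
import time
from multiprocessing import Pool

def cube(x):
    return x * x * x

if __name__ == '__main__':
    items = range(1000000)
    start = time.time()
    with Pool(os.cpu_count()) as pool:
        # a large chunksize keeps the IPC overhead per item low
        results = pool.map(cube, items, chunksize=10000)
    print('time elapsed:', time.time() - start)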
You'll only see an increase in throughput with threads in Python if you have threads which are IO bound. If what you're doing is CPU bound then you won't see any throughput increase.
Turning on the thread support in Python (by starting another thread) also seems to make some things slower so you may find that overall performance still suffers.
This is all CPython, of course; other Python implementations have different behaviour.
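To illustrate the IO-bound case with a small sketch (using a hypothetical fake_io helper): ten simulated one-second IO waits finish in roughly one second with threads, because a thread that is blocked on IO (or sleeping) releases the GIL.

import time
from threading import Thread

def fake_io(_):
    time.sleep(1)  # stands in for a blocking IO call; the GIL is released while waiting

if __name__ == '__main__':
    start = time.time()
    threads = [Thread(target=fake_io, args=(i,)) for i in range(10)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('elapsed:', time.time() - start)  # roughly 1 second, not 10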