Fastest way to call a function millions of times in Python

I have a function readFile that I need to call 8.5 million times (essentially stress-testing a logger to ensure the log rotates correctly). I don't care about the output/result of the function, only that I run it N times as quickly as possible.
My current solution is this:
from threading import Thread
import subprocess

def readFile(filename):
    args = ["/usr/bin/ls", filename]
    subprocess.run(args)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    # Wait for all the reads to finish
    while len(threads):
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)
readFile has been simplified, but the concept is the same. I need to run readFile 8.5 million times, and I need to wait for all the reads to finish. Based on my mental math, this spawns ~60 threads per second, which means it will take ~40 hours to finish. Ideally, this would finish within 1-8 hours.
Is this possible? Is the number of iterations simply too high for this to be done in a reasonable span of time?
Oddly enough, when I wrote a test script, I was able to generate a thread about every ~0.0005 seconds, which should equate to ~2000 threads per second, but this is not the case here.
I considered iterating 8,500,000 / 10 times and spawning a thread which then runs the readFile function 10 times, which should decrease the time by ~90%, but it caused some issues with blocking resources, and I think passing a lock around would make it complicated to keep the function usable by callers that don't use threading.
Any tips?
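For reference, a minimal sketch of the batching idea described above, using concurrent.futures.ThreadPoolExecutor with an assumed chunk size and worker count (both would need tuning), so that only a bounded number of threads ever exists:

from concurrent.futures import ThreadPoolExecutor
import subprocess

def readFile(filename):
    subprocess.run(["/usr/bin/ls", filename])

def readFileBatch(filename, batch_size):
    # Each submitted task performs a whole batch of reads, so far fewer
    # thread hand-offs are needed overall.
    for _ in range(batch_size):
        readFile(filename)

def main():
    filename = "test.log"
    total, batch_size = 8_500_000, 100   # batch size is an assumption; tune it
    with ThreadPoolExecutor(max_workers=32) as executor:  # worker count is an assumption
        futures = [executor.submit(readFileBatch, filename, batch_size)
                   for _ in range(total // batch_size)]
        for f in futures:
            f.result()  # wait for all the reads to finish

if __name__ == "__main__":
    main()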

Based on #blarg's comment, and scripts I've used with multiprocessing, the following can be considered.
It simply reads the same file repeatedly, based on the size of the list. Here I'm looking at 1M reads.
With 1 core it takes around 50 seconds. With 8 cores it's down to around 22 seconds. This is on a Windows PC, but I use these scripts on Linux EC2 (AWS) instances as well.
Just put this in a Python file and run it:
import os
import time
from multiprocessing import Pool

def readfile(fn):
    with open(fn, "r") as f:
        f.read()

def _multiprocess(mylist, num_proc):
    with Pool(num_proc) as pool:
        r = pool.map(readfile, mylist)
        pool.close()
        pool.join()
    return r

if __name__ == "__main__":
    __spec__ = None  # workaround for running under some IDEs
    # use the system CPUs or set explicitly
    num_proc = os.cpu_count()
    num_proc = 1  # override for the single-core test
    start = time.time()
    # Here you'll want 8.5M, but test first that it works with a smaller number.
    # Note this is slow with a low number of reads, meaning 8 cores is slower
    # than 1 core until you reach a certain point; then multiprocessing is worth it.
    mylist = ["test.txt"] * 1000000
    rs = _multiprocess(mylist, num_proc=num_proc)
    print('total seconds,', time.time() - start)

I think you should reconsider using subprocess here; if you just want to execute the ls command, it's better to use os.system, since it will reduce resource consumption.
You should also add a small delay with time.sleep() while waiting for the threads to finish, to reduce resource consumption:
from threading import Thread
import os
import time

def readFile(filename):
    os.system("/usr/bin/ls " + filename)

def main():
    filename = "test.log"
    threads = set()
    for i in range(8500000):
        thread = Thread(target=readFile, args=(filename,))
        thread.start()
        threads.add(thread)
    # Wait for all the reads to finish
    while len(threads):
        time.sleep(0.1)  # delay to reduce resource consumption while waiting
        # Avoid changing size of set while iterating
        for thread in threads.copy():
            if not thread.is_alive():
                threads.remove(thread)

Related

Multithreading inside Multiprocessing in Python

I am using the concurrent.futures module to do multiprocessing and multithreading. I am running it on an 8-core machine with 16GB RAM and an Intel i7 8th Gen processor. I tried this on Python 3.7.2 and also on Python 3.8.2.
import concurrent.futures
import time

# takes a list and multiplies each elem by 2
def double_value(x):
    y = []
    for elem in x:
        y.append(2 * elem)
    return y

# multiplies an elem by 2
def double_single_value(x):
    return 2 * x

# define a
import numpy as np
a = np.arange(100000000).reshape(100, 1000000)

# function to run multiple threads and multiply each elem by 2
def get_double_value(x):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        results = executor.map(double_single_value, x)
    return list(results)
The code shown below ran in 115 seconds. This uses only multiprocessing. CPU utilization for this piece of code is 100%.
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(double_value, a)
print(time.time() - t)
The code below took more than 9 minutes, consumed all the RAM of the system, and then the system killed all the processes. CPU utilization during this piece of code also did not reach 100% (~85%).
t = time.time()
with concurrent.futures.ProcessPoolExecutor() as executor:
    my_results = executor.map(get_double_value, a)
print(time.time() - t)
I really want to understand:
1) Why is the code that first splits the work into multiple processes and then runs multi-threading inside them not faster than the code that uses only multiprocessing?
(I have gone through many posts that describe multiprocessing and multi-threading, and the crux I got is that multi-threading is for I/O-bound processes and multiprocessing is for CPU-bound processes.)
2) Is there any better way of doing multi-threading inside multiprocessing for maximum utilization of the allotted cores (or CPUs)?
3) Why did that last piece of code consume all the RAM? Was it due to multi-threading?
You can mix concurrency with parallelism.
Why? You can have your valid reasons. Imagine a bunch of requests you have to make while processing their responses (e.g., converting XML to JSON) as fast as possible.
I did some tests and here are the results.
In each test, I combine different approaches to produce 16000 prints (I have 8 cores and 16 threads).
Parallelism with multiprocessing, concurrency with asyncio
The fastest, 1.1152372360229492 sec.
import asyncio
import multiprocessing
import os
import psutil
import threading
import time

async def print_info(value):
    await asyncio.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

async def await_async_logic(values):
    await asyncio.gather(
        *(
            print_info(value)
            for value in values
        )
    )

def run_async_logic(values):
    asyncio.run(await_async_logic(values))

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            run_async_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with asyncio I can spam tasks as much as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (I tested it and it took me 2.0210490226745605 sec).
Parallelism with multiprocessing, concurrency with threading
An alternative option, 1.6983509063720703 sec.
import multiprocessing
import os
import psutil
import threading
import time

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    threads = []
    for value in values:
        threads.append(threading.Thread(target=print_info, args=(value,)))
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def multiprocessing_executor():
    start = time.time()
    with multiprocessing.Pool() as multiprocessing_pool:
        multiprocessing_pool.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method I can NOT spam as many tasks as I want. If I change the value from 1000 to 10000 I get RuntimeError: can't start new thread.
I also want to say that I am impressed because I thought that this method would be better in every aspect compared to asyncio, but quite the opposite.
Parallelism and concurrency with concurrent.futures
Extremely slow, 50.08251595497131 sec.
import os
import psutil
import threading
import time
from concurrent.futures import thread, process

def print_info(value):
    time.sleep(1)
    print(
        f"THREAD: {threading.get_ident()}",
        f"PROCESS: {os.getpid()}",
        f"CORE_ID: {psutil.Process().cpu_num()}",
        f"VALUE: {value}",
    )

def multithreading_logic(values):
    with thread.ThreadPoolExecutor() as multithreading_executor:
        multithreading_executor.map(
            print_info,
            values,
        )

def multiprocessing_executor():
    start = time.time()
    with process.ProcessPoolExecutor() as multiprocessing_executor:
        multiprocessing_executor.map(
            multithreading_logic,
            (range(1000 * x, 1000 * (x + 1)) for x in range(os.cpu_count())),
        )
    end = time.time()
    print(end - start)

multiprocessing_executor()
Very important note: with this method, as with asyncio, I can spam as many tasks as I want. For example, I can change the value from 1000 to 10000 to generate 160000 prints and there is no problem (except for the time).
Extra notes
For these notes, I modified the tests so that they only make 1600 prints (replacing the 1000 value with 100 in each test).
When I remove the parallelism from asyncio, the execution takes me 16.090194702148438 sec.
In addition, if I replace the await asyncio.sleep(1) with time.sleep(1), it takes 160.1889989376068 sec.
Removing the parallelism from the multithreading option, the execution takes me 16.24941658973694 sec.
Right now I am impressed. Multithreading without multiprocessing gives me good performance, very similar to asyncio.
Removing parallelism from the third option, execution takes me 80.15227723121643 sec.
As you say: "I have gone through many posts that describe multiprocessing and multi-threading, and the crux I got is that multi-threading is for I/O-bound processes and multiprocessing is for CPU-bound processes."
You need to figure out whether your program is I/O-bound or CPU-bound, then apply the correct method to solve your problem. Applying various methods at random, or all together at the same time, usually only makes things worse.
Using threading in pure Python for CPU-bound problems is a bad approach, regardless of whether you also use multiprocessing. Try to redesign your app to use only multiprocessing, or use third-party libraries such as Dask.
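For illustration, a minimal sketch of that multiprocessing-only redesign, keeping the question's double_value but dropping the inner thread pool (the array is shrunk here purely so the example runs quickly):

import concurrent.futures
import numpy as np

def double_value(x):
    # CPU-bound work runs in worker processes only; no threads involved.
    return [2 * elem for elem in x]

if __name__ == "__main__":
    # A smaller array than the question's, just for illustration.
    a = np.arange(1_000_000).reshape(100, 10_000)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        my_results = list(executor.map(double_value, a))
    print(len(my_results))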
I believe you figured it out, but I wanted to answer anyway. Obviously, your function double_single_value is CPU-bound; it has nothing to do with I/O. For CPU-bound tasks, using multiple threads makes things worse than using a single thread, because the GIL does not let you actually run on multiple threads, so you effectively end up on a single thread anyway. Also, a thread may be switched out before finishing its task, and when it resumes its state has to be loaded back onto the CPU, which makes this even slower.
Based on your code, most of it is dealing with computations (calculations), so it is best to use multiprocessing to solve your problem, since it is CPU-bound and NOT I/O-bound (I/O-bound means things like sending requests to websites and waiting for a response from the server, or writing to and reading from disk). This is true for Python as far as I know: the Python GIL (Global Interpreter Lock) will make your code run slowly because it is a mutex (a lock) that allows only one thread to take control of the Python interpreter at a time, meaning you won't achieve parallelism, only concurrency. Threading is fine for I/O-bound tasks, where it outcompetes multiprocessing in execution time, but for your case I would encourage you to use multiprocessing, because each Python process gets its own interpreter and memory space, so the GIL won't be a problem for you.
I am not so sure about integrating multithreading with multiprocessing, but what I do know is that it can cause inconsistency in the processed results, since you will need more boilerplate code for data synchronization if you want the processes to communicate (IPC). Threads are also somewhat unpredictable (thus inconsistent at times), since they are controlled by the OS and can be swapped out at any time (pre-emptive scheduling) to time-share the kernel-level threads. I don't want to stop you from writing that code, but be really sure of what you are doing. You never know, you might propose a solution to it one day.

Turning multithreading code with unlimited threads into multithreading code with max number of simultaneously running threads

I have a script that executes a certain function with multi-threading. Now it is of interest to have only as many threads running in parallel as there are CPU cores.
The current code (1:) uses threading.Thread to create 1000 threads and runs them all simultaneously.
I want to turn this into something that runs only a fixed number of threads at the same time (e.g., 8) and puts the rest into a queue until an executing thread/CPU core is free for use.
1:
import threading

nSim = 1000

def simulation(i):
    print(str(threading.current_thread().getName()) + ': ' + str(i))

if __name__ == '__main__':
    threads = [threading.Thread(target=simulation, args=(i,)) for i in range(nSim)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
Q1: Does code 2: do what I described (multithreading with a max number of threads running simultaneously)? Is it correct? (I think so, but I'm not 100% sure.)
Q2: Right now the code initiates 1000 tasks at the same time and executes them on 8 threads. Is there a way to only initiate a new task when an executing thread/CPU core is free, so that I don't have 990 task calls waiting from the beginning to be executed?
Q3: Is there a way to track which CPU core executed which thread? Just to prove that the code is doing what it should do.
2:
import threading
import multiprocessing
print(multiprocessing.cpu_count())
from concurrent.futures import ThreadPoolExecutor

nSim = 1000

def simulation(i):
    print(str(threading.current_thread().getName()) + ': ' + str(i))

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as executor:
        for i in range(nSim):
            res = executor.submit(simulation, i)
            print(res.result())
A1: In order to limit the number of threads which can simultaneously access some resource, you can use threading.Semaphore. Actually, 1000 threads will not give you a tremendous speed boost; the recommended number of threads per process is mp.cpu_count()*1 or mp.cpu_count()*2 in some articles. Also note that threads are good for I/O operations in Python, but not for computing, due to the GIL.
A2: Why do you need so many threads if you want to run only 8 of them simultaneously? Create just 8 threads and then supply them with tasks as the tasks become ready; to do so you need to use queue.Queue(), which is thread-safe (see the queue-based sketch after the code below). But in your concrete example you could just run your test 250 times per thread using a while loop inside the simulation function, and in that case, by the way, you would not need a Semaphore.
A3: When we are talking about multithreading, you have one process with multiple threads.
import threading
import time
import multiprocessing as mp

def simulation(i, _s):
    # _s is a threading.Semaphore()
    with _s:
        print(str(threading.current_thread().getName()) + ': ' + str(i))
        time.sleep(3)

if __name__ == '__main__':
    print("Cores number: {}".format(mp.cpu_count()))
    # recommended number of threads is mp.cpu_count()*1 or mp.cpu_count()*2 in some articles
    nSim = 25
    s = threading.Semaphore(4)  # max number of threads which can work simultaneously with the resource is 4
    threads = [threading.Thread(target=simulation, args=(i, s, )) for i in range(nSim)]
    for t in threads:
        t.start()
    # just to prove that all threads are active at the start and then their number decreases when the work is done
    for i in range(6):
        print("Active threads number {}".format(threading.active_count()))
        time.sleep(3)
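For reference, a minimal sketch of the queue.Queue approach mentioned in A2: a fixed pool of worker threads pulls task indices from a thread-safe queue, and one None sentinel per worker shuts the pool down (the worker and task counts here are just example values):

import queue
import threading

def simulation(i):
    print(f"{threading.current_thread().name}: {i}")

def worker(task_queue):
    # Each worker pulls task indices until it sees the None sentinel.
    while True:
        i = task_queue.get()
        if i is None:
            break
        simulation(i)

if __name__ == '__main__':
    n_workers, n_sim = 8, 1000  # example values
    q = queue.Queue()
    threads = [threading.Thread(target=worker, args=(q,)) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for i in range(n_sim):
        q.put(i)
    for _ in range(n_workers):
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()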
A1: No. Your code submits a task, receives a Future in res, and then calls result(), which waits for the result. Only after the previous task is done is a new task given to a thread, so only one of the worker threads is really working at a time.
Take a look at ThreadPool.map (actually Pool.map) instead of submit to distribute tasks among the workers; see the sketch after A3 below.
A2: Only 8 threads (the number of workers) are used here at most. If using map, the input data of the 1000 tasks may be stored (which needs memory), but no additional threads are created.
A3: Not that I know of. A thread is not bound to a core; it may switch between them quickly.
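A minimal sketch of the map-based version referred to above, here using ThreadPoolExecutor.map (which distributes tasks the same way as ThreadPool.map for this example), so that code 2: keeps all 8 workers busy instead of waiting on each result before submitting the next task:

import threading
from concurrent.futures import ThreadPoolExecutor

nSim = 1000

def simulation(i):
    print(str(threading.current_thread().getName()) + ': ' + str(i))

if __name__ == '__main__':
    with ThreadPoolExecutor(max_workers=8) as executor:
        # map submits all tasks up front, so the 8 workers run concurrently;
        # results are consumed afterwards instead of blocking per submission.
        list(executor.map(simulation, range(nSim)))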

Python: How can I stop Threading/Multiprocessing from using 100% of my CPU?

I have code that reads data from 7 devices every second for an infinite amount of time. Each loop, a thread is created which starts 7 processes. After each process is done, the program waits 1 second and starts again. Here is a snippet of the code:
def all_thread(): #function that handles the threading
    thread = threading.Thread(target=all_process) #prepares a thread for the devices
    thread.start() #starts a thread for the devices

def all_process(): #function that prepares and runs processes
    processes = [] #empty list for the processes to be stored
    while len(gas_list) > 0: #this gas list holds the connection information for my devices
        for sen in gas_list: #for each sen (sensor) in the gas list
            proc = multiprocessing.Process(target=main_reader, args=(sen, q)) #declaring a process variable that sends the gas object, value and queue information to the reading function
            processes.append(proc) #adding the process to the processes list
            proc.start() #start the process
        for sen in processes: #for each sensor in the processes list
            sen.join() #wait for all the processes to complete before starting again
        time.sleep(1) #wait one second
However, this uses 100% of my CPU. Is this by design of threading and multiprocessing or just bad coding? Is there a way I can limit the CPU usage? Thanks!
Update:
The comments were asking about the main_reader() function, so I will put it into the question. All it does is read each device, take all the data, and append it to a list. Then the list is put into a queue to be displayed in the tkinter GUI.
def main_reader(data, q): #this function reads the device which takes less than a second
    output_list = get_registry(data) #this function takes the device information, reads the registry and returns a list of data
    q.put(output_list) #put the output list into the queue
As you state in the comments, your main_reader takes only a fraction of a second to run, which means process creation overhead might cause your problem.
Here is an example with multiprocessing.Pool. This creates a pool of workers and submits your tasks to them. Processes are started only once and never shut down or joined if this is meant to be an infinite loop. If you want to shut your pool down, you can do so by joining and closing it (see documentation for that).
from multiprocessing import Pool, Manager
from time import sleep
import threading
from random import random

gas_list = [1,2,3,4,5,6,7,8,9,10]

def main_reader(sen, rqu):
    output = "%d/%f" % (sen, random())
    rqu.put(output)

def all_processes(rq):
    p = Pool(len(gas_list) + 1)
    while True:
        for sen in gas_list:
            p.apply_async(main_reader, args=(sen, rq))
        sleep(1)

m = Manager()
q = m.Queue()
t = threading.Thread(target=all_processes, args=(q,))
t.daemon = True
t.start()

while True:
    r = q.get()
    print r
If this does not help, you need to start digging deeper. I would first increase the sleep in your infinite loop to 10 seconds or even longer. This would allow you to monitor the behaviour of your program. If CPU peaks for a moment and then settles down for 10 seconds or so, you know the problem is in your main_reader. If it is still 100%, your problem must be elsewhere.
Is it possible your problem is not in this part of your program at all? You seem to launch this all in a thread, which indicates your main program is doing something else. Can it be this something else that peaks the CPU?

Python multiprocessing run time per process increases with number of processes

I have a pool of workers which perform the same identical task, and I send each a distinct clone of the same data object. Then, I measure the run time separately for each process inside the worker function.
With one process, run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this increase is even more pronounced.
There are no other cpu-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
from time import time  # needed for the timing below

def worker_fn(data):
    t1 = time()
    data.process()
    print time() - t1
    return data.results

def main(n, num_procs=3):
    from multiprocessing import Pool
    from cPickle import dumps, loads
    pool = Pool(processes=num_procs)
    data = MyClass()
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
Edit: Although I can't post the entire code for MyClass(), I can tell you that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
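If OpenBLAS's internal threading is indeed the cause, a common mitigation (an assumption here, not something confirmed in the question) is to cap the BLAS thread count before numpy is imported in the workers, for example:

import os

# Setting these before numpy is imported caps the threads each BLAS backend
# may spawn inside every worker process (the variable names cover the common
# backends; which one applies depends on how numpy was built).
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # imported only after the limits are in place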

Multiprocessing in python

I am writing a Python script (in Python 2.7) wherein I need to generate around 500,000 uniform random numbers within a range. I need to do this 4 times, perform some calculations on them and write out the 4 files.
At the moment I am doing: (this is just part of my for loop, not the entire code)
random_RA = []
for i in xrange(500000):
    random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA

random_dec = []
for i in xrange(500000):
    random_dec.append(np.random.uniform(min(data_dec_1),max(data_dec_1))) # FINAL RANDOM 'dec'
to generate the random numbers within the range. I am running Ubuntu 14.04, and when I run the program I also open my system monitor to see how the 8 CPUs I have are working. I notice that when the program is running, only 1 of the 8 CPUs seems to work at 100%. So the entire program takes around 45 minutes to complete.
I noticed that it is possible to use all the CPUs to my advantage using the multiprocessing module.
I would like to know if this is enough in my example:
random_RA = []
for i in xrange(500000):
    multiprocessing.Process()
    random_RA.append(np.random.uniform(6.061,6.505)) # FINAL RANDOM RA
i.e., would adding just the line multiprocessing.Process() be enough?
If you use multiprocessing, you should avoid shared state (like your random_RA list) as much as possible.
Instead, try to use a Pool and its map method:
import numpy as np  # as in the question
from multiprocessing import Pool, cpu_count

def generate_random_ra(x):
    return np.random.uniform(6.061, 6.505)

def generate_random_dec(x):
    return np.random.uniform(min(data_dec_1), max(data_dec_1))

pool = Pool(cpu_count())
random_RA = pool.map(generate_random_ra, xrange(500000))
random_dec = pool.map(generate_random_dec, xrange(500000))
To get you started:
import multiprocessing
import random

def worker(i):
    random.uniform(1, 100000)
    print i, 'done'

if __name__ == "__main__":
    for i in range(4):
        t = multiprocessing.Process(target=worker, args=(i,))
        t.start()
    print 'All the processes have been started.'
You must gate the t = multiprocessing.Process(...) with __name__ == "__main__", because each worker imports this program (module) again to find out what it has to do. If the gating didn't happen, it would spawn more and more processes.
Just for completeness: generating 500000 random numbers is not going to take you 45 minutes, so I assume there are some intensive calculations going on here; you may want to look at them closely.
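As a point of comparison, a minimal sketch of generating the same numbers with a single vectorized numpy call, which removes the Python-level loop entirely (data_dec_1 is assumed to be the array from the question):

import numpy as np

# One vectorized call replaces the 500000-iteration Python loop.
random_RA = np.random.uniform(6.061, 6.505, size=500000)

# data_dec_1 comes from the question; its bounds are computed once up front.
# random_dec = np.random.uniform(min(data_dec_1), max(data_dec_1), size=500000)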
