RuntimeError: Can't start new thread, threading library - python

In this code I am testing multithreading to speed up my code. If I have a large number of tasks in the queue, I get RuntimeError: can't start new thread. For example, range(0,100) works, but range(0,1000) doesn't. I am using threading.Semaphore(4), and it works correctly: I have tested that only 4 threads are processed at a time. I know why I am getting this error: even though I am using threading.Semaphore, all the threads are still started at the beginning and just paused until it is their turn to run, and starting 1000 threads at the same time is too much for the PC to handle. Is there any way to fix this problem? (Also, yes, I know about the GIL.)
def thread_test():
    threads = []
    for t in self.tasks:
        t = threading.Thread(target=utils.compareV2.run_compare_temp, args=t)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

for x in range(0,100):
    self.tasks.append(("arg1","arg2"))
thread_test()

Instead of starting 1000 threads and then only letting 4 do any work at a time, start 4 threads and let them all do work.
What are the extra 996 threads buying you?
They're using memory and putting pressure on the system's scheduler.
The reason you get the RuntimeError is probably that you've run out of memory for the call stacks for all of those threads. The default limit varies by platform but it's probably something close to 8MiB. A thousand of those and you're up around 8GiB... just for stack space.
You can reduce the amount of memory available for the call stack of each thread with threading.stack_size(...). This will let you fit more threads on your system (but be sure you don't set it below the amount of stack space you actually need or you'll crash your process).
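A minimal sketch of the stack-size adjustment described above. The 256 KiB figure and the work function are assumptions for illustration only; the right value depends on how deep your own call chains go.

```python
import threading

# Hypothetical per-thread stack size. It must be 0 or at least 32 KiB, and some
# platforms also require a multiple of the page size (commonly 4 KiB).
STACK_BYTES = 256 * 1024

results = []

def work(n):
    # Shallow call chain, so a small stack is sufficient here.
    results.append(n * 2)

# Applies to threads created after this call; returns the previous setting.
previous = threading.stack_size(STACK_BYTES)
try:
    t = threading.Thread(target=work, args=(21,))
    t.start()
    t.join()
finally:
    threading.stack_size(previous)  # restore the earlier setting

print(results)
```

On platforms that do not allow changing the stack size, threading.stack_size raises RuntimeError, so this is not a portable fix on its own.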
Most likely, what you actually want is a number of processes close to the number of physical cores on your host or a threading system that's more light-weight than what the threading module gives you.
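One way to "start 4 threads and let them all do work" is a fixed-size thread pool. The sketch below uses concurrent.futures.ThreadPoolExecutor from the standard library; run_compare is a hypothetical stand-in for utils.compareV2.run_compare_temp, which is not shown in the question.

```python
from concurrent.futures import ThreadPoolExecutor

def run_compare(arg1, arg2):
    # Hypothetical stand-in for utils.compareV2.run_compare_temp
    return "{}-{}".format(arg1, arg2)

tasks = [("arg1", "arg2")] * 1000

# Only 4 OS threads ever exist; the 1000 tasks wait in the executor's queue.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(run_compare, *task) for task in tasks]
    results = [f.result() for f in futures]

print(len(results))
```

With this shape the semaphore is no longer needed, because the pool size itself caps how many tasks run at once.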

Related

Python multiprocessing: dealing with 2000 processes

Following is my multi processing code. regressTuple has around 2000 items. So, the following code creates around 2000 parallel processes. My Dell xps 15 laptop crashes when this is run.
Can't python multi processing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
Is there an API call in python to get the possible hardware process count?
How can I refactor the code to use an input variable for the parallel thread count (hard coded) and loop through threading several times till completion? That way, after a few experiments, I will be able to find the optimal thread count.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Hereby my code:
regressTuple = [(x,) for x in regressList]
processes = []
for i in range(len(regressList)):
    processes.append(Process(target=runRegressWriteStatus,args=regressTuple[i]))
for process in processes:
    process.start()
for process in processes:
    process.join()
There are multiple things that we need to keep in mind:
The number of processes you can spin up is not limited by the number of cores on your system but by the ulimit for your user id, which controls the total number of processes that can be launched by your user id.
The number of cores determines how many of those launched processes can actually run in parallel at one time.
Your system may be crashing because the target function these processes run is heavy and resource intensive, so the system cannot cope when many of them run simultaneously, or because the nprocs limit has been exhausted and the kernel can no longer spin up new processes.
That being said, it is not a good idea to spawn as many as 2000 processes, even on a 16-core Intel Skylake machine. Creating a new process is not a lightweight task: generating the pid, allocating memory, setting up the address space, scheduling the process, context switching and managing its entire life cycle all happen in the background. So it is a heavy operation for the kernel to create a new process.
Unfortunately, I guess what you are trying to do is a CPU-bound task and hence limited by the hardware you have on the machine. Spinning up more processes than you have cores is not going to help at all, but creating a process pool might. So basically you want to create a pool with as many processes as you have cores and then pass the input to the pool. Something like this:
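import multiprocessing

def target_func(data):
    # process the input data
    pass

if __name__ == "__main__":  # required on platforms that spawn worker processes
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as po:
        res = po.map(target_func, regressTuple)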
Can't python multi processing library handle the queue according to hardware availability and run the program without crashing in minimal time? Am I not doing this correctly?
I don't think it's python's responsibility to manage the queue length. When people reach out for multiprocessing they tend to want efficiency, adding system performance tests to the run queue would be an overhead.
Is there an API call in python to get the possible hardware process count?
If there were, would it know ahead of time how much memory your task will need?
How can I refactor the code to use an input variable to get the parallel thread count(hard coded) and loop through threading several times till completion - In this way, after few experiments, I will be able to get the optimal thread count.
As balderman pointed out, a pool is a good way forward with this.
What is the best way to run this code in minimal time without crashing? (I cannot use multi-threading in my implementation.)
Use a pool, or take the available system memory, divide by ~3MB and see how many tasks you can run at once.
This is probably more of a sysadmin task to balance the bottlenecks against the queue length, but generally, if your tasks are IO bound, then there isn't much point in having a long task queue if all the tasks are waiting at the same T-junction to turn into the road. The tasks will then fight with each other for the next block of IO.
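To experiment with different worker counts, as the question asks, the pool size can be passed in as a parameter. This is only a sketch: run_regress is a hypothetical stand-in for runRegressWriteStatus, and pick_worker_count and run_all are names invented for this example.

```python
import multiprocessing
import os

def run_regress(item):
    # Hypothetical stand-in for runRegressWriteStatus
    return item * item

def pick_worker_count(requested=None):
    # Fall back to the logical CPU count when no number is given.
    if requested is None:
        return os.cpu_count() or 1
    return max(1, requested)

def run_all(items, workers=None):
    # The pool keeps a fixed number of processes alive and feeds them the items,
    # so 2000 items never turn into 2000 simultaneous processes.
    with multiprocessing.Pool(processes=pick_worker_count(workers)) as pool:
        return pool.map(run_regress, items)

print(pick_worker_count())
```

Call run_all(regressList, workers=4) from under an if __name__ == "__main__": guard, then vary the workers value between runs and time each run to find the count that suits your machine. Note that os.cpu_count() reports logical CPUs and says nothing about how much memory each task needs.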

Maximum limit on number of threads in python

I am using Threading module in python. How to know how many max threads I can have on my system?
There doesn't seem to be a hard-coded or configurable MAX value that I've ever found, but there is definitely a limit. Run the following program:
import threading
import time

def mythread():
    time.sleep(1000)

def main():
    threads = 0     # thread counter
    y = 1000000     # a MILLION of 'em!
    for i in range(y):
        try:
            x = threading.Thread(target=mythread, daemon=True)
            x.start()       # start each thread
            threads += 1    # count only threads that actually started
        except RuntimeError:    # too many threads raises a RuntimeError
            break
    print("{} threads created.\n".format(threads))

if __name__ == "__main__":
    main()
I suppose I should mention that this is using Python 3.
The first function, mythread(), is the function which will be executed as a thread. All it does is sleep for 1000 seconds then terminate.
The main() function is a for-loop which tries to start one million threads. The daemon property is set to True simply so that we don't have to clean up all the threads manually.
If a thread cannot be created Python throws a RuntimeError. We catch that to break out of the for-loop and display the number of threads which were successfully created.
Because daemon is set True, all threads terminate when the program ends.
If you run it a few times in a row you're likely to see that a different number of threads will be created each time. On the machine from which I'm posting this reply, I had a minimum 18,835 during one run, and a maximum of 18,863 during another run. And the more you fiddle with the code, as in, the more code you add to this in order to experiment or find more information, you'll find the fewer threads can/will be created.
So, how do you apply this to the real world?
Well, a server may need the ability to start a triple-digit number of threads, but in most other cases you should re-evaluate your game plan if you think you're going to be generating a large number of threads.
One thing you need to consider if you're using Python: if you're using a standard distribution of Python, your system will only execute one Python thread at a time, including the main thread of your program, so adding more threads to your program or more cores to your system doesn't really get you anything when using the threading module in Python. You can research all of the pedantic details and ultracrepidarian opinions regarding the GIL / Global Interpreter Lock for more info on that.
What that means is that cpu-bound (computationally-intensive) code doesn't benefit greatly from factoring it into threads.
I/O-bound (waiting for file read/write, network read, or user I/O) code, however, benefits greatly from multithreading! So, start a thread for each network connection to your Python-based server.
Threads can also be great for triggering/throwing/raising signals at set periods, or simply to block out the processing sections of your code more logically.
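The I/O-bound case can be illustrated with a small timing sketch. It assumes time.sleep is a fair stand-in for a blocking network or disk call (both release the GIL while waiting); fake_io and the two runner functions are names invented for this example.

```python
import threading
import time

def fake_io(delay):
    # time.sleep stands in for a blocking network or disk call.
    time.sleep(delay)

def run_sequential(count, delay):
    start = time.perf_counter()
    for _ in range(count):
        fake_io(delay)
    return time.perf_counter() - start

def run_threaded(count, delay):
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_io, args=(delay,)) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(5, 0.1)
threaded = run_threaded(5, 0.1)
print("sequential: {:.2f}s, threaded: {:.2f}s".format(sequential, threaded))
```

The threaded run finishes in roughly the time of one wait because the waits overlap. Replacing fake_io with a pure-Python computation would largely remove that advantage on a standard CPython build.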

Why are threads spread between CPUs?

I am trying to get my head around threading vs. CPU usage. There are plenty of discussions about threading vs. multiprocessing (a good overview being this answer) so I decided to test this out by launching a maximum number of threads on my 8 CPU laptop running Windows 10, Python 3.4.
My assumption was that all the threads would be bound to a single CPU.
EDIT: it turns out that it was not a good assumption. I now understand that for multithreaded code, only one piece of python code can run at once (no matter where/on which core). This is different for multiprocessing code (where processes are independent and run indeed independently).
While I read about these differences, it is one answer which actually clarified this point.
I think it also explains the CPU view below: that it is an average view of many threads spread out on many CPUs, but only one of them running at one given time (which "averages" to all of them running all the time).
It is not a duplicate of the linked question (which addresses the opposite problem, i.e. all threads on one core) and I will leave it hanging in case someone has a similar question one day and is hopefully helped by my enlightenment.
The code
import threading
import time

def calc():
    time.sleep(5)
    while True:
        a = 2356^36     # note: ^ is bitwise XOR in Python, not exponentiation

n = 0
while True:
    try:
        n += 1
        t = threading.Thread(target=calc)
        t.start()
    except RuntimeError:
        print("max threads: {n}".format(n=n))
        break
    else:
        print('.')
time.sleep(100000)
Led to 889 threads being started.
The load on the CPUs was however distributed (and surprisingly low for a pure CPU calculation, the laptop is otherwise idle with an empty load when not running my script):
Why is it so? Are the threads constantly moved as a pack between CPUs and what I see is just an average (the reality being that at a given moment all threads are on one CPU)? Or are they indeed distributed?
As of today it is still the case that 'one thread holds the GIL'. So one thread is running at a time.
The threads are managed at the operating-system level. In Python 2, the running thread released the GIL every 100 'ticks' (roughly, interpreter instructions) and reset the tick counter. Since Python 3.2, which covers the 3.4 used here, the release is time-based instead: a running thread is asked to give up the GIL after a switch interval of 5 ms by default.
Because the threads in this example do continuous calculations, that limit is reached constantly, so the GIL is released again and again and a 'battle' between threads to acquire it starts each time.
So, my assumption is that your operating system has a higher than expected load because of (too) fast thread switching plus almost continuous releasing and acquiring of the GIL. The OS spends more time on switching than on doing any useful calculation.
As you mention yourself, for using more than one core at a time, it's better to look at multiprocessing modules (joblib/Parallel).
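On Python 3.2 and later the GIL hand-off is governed by a time-based switch interval rather than a tick count, and the current value can be inspected from the sys module. A small sketch; the 0.01 value is only an illustration, not a recommendation.

```python
import sys

# Interval, in seconds, after which a running thread is asked to release the GIL.
interval = sys.getswitchinterval()
print("switch interval: {} s".format(interval))

# It can be tuned with sys.setswitchinterval.
sys.setswitchinterval(0.01)
print("new switch interval: {} s".format(sys.getswitchinterval()))

sys.setswitchinterval(interval)  # restore the original value
```

Changing the interval only affects how often threads take turns; it does not let more than one thread run Python bytecode at once.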
Interesting read:
http://www.dabeaz.com/python/UnderstandingGIL.pdf
Um. The point of multithreading is to make sure the work gets spread out. A really easy cheat is to use as many threads as you have CPU cores. The point is they are all independent so they can actually run at the same time. If they were on the same core only one thread at a time could really run at all. They'd pass that core back and forth for processing at the OS level.
Your assumption is wrong and bizarre. What would ever lead you to think they should run on the same CPU and consequently go at 1/8th speed? As the only reason to thread them is typically to get the whole batch to go faster than a single core alone.
In fact, what the hell do you think writing parallel code is for if not to run independently on several cores at the same time? Like this would be pointless and hard to do, let's make complex fetching, branching, and forking routines to accomplish things slower than one core just plugging away at the data?

Persistent Processes Post Python Pool

I have a Python program that takes around 10 minutes to execute. So I use Pool from multiprocessing to speed things up:
from multiprocessing import Pool
p = Pool(processes = 6) # I have an 8 thread processor
results = p.map( function, argument_list ) # distributes work over 6 processes!
It runs much quicker, just from that. God bless Python! And so I thought that would be it.
However I've noticed that each time I do this, the processes and their considerably sized state remain, even when p has gone out of scope; effectively, I've created a memory leak. The processes show up in my System Monitor application as Python processes, which use no CPU at this point, but considerable memory to maintain their state.
Pool has functions close, terminate, and join, and I'd assume one of these will kill the processes. Does anyone know which is the best way to tell my pool p that I am finished with it?
Thanks a lot for your help!
From the Python docs, it looks like you need to do:
p.close()
p.join()
after the map() to indicate that the workers should terminate and then wait for them to do so.

A programming strategy to bypass the os thread limit?

The scenario: We have a python script that checks thousands of proxies simultaneously.
The program uses threads, 1 per proxy, to speed the process. When it reaches the 1007th thread, the script crashes because of the thread limit.
My solution is: A global variable that gets incremented when a thread spawns and decrements when a thread finishes. The function which spawns the threads monitors the variable so that the limit is not reached.
What will your solution be, friends?
Thanks for the answers.
You want to do non-blocking I/O with the select module.
There are a couple of different specific techniques. select.select should work for every major platform. There are other variations that are more efficient (and could matter if you are checking tens of thousands of connections simultaneously) but you will then need to write the code for your specific platform.
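A minimal sketch of the select.select call mentioned above. To stay self-contained it uses socket.socketpair() as a stand-in for real proxy connections, which is an assumption made purely for the demonstration.

```python
import select
import socket

# A connected pair of sockets stands in for real proxy connections.
reader, writer = socket.socketpair()
reader.setblocking(False)

writer.sendall(b"ping")

# Wait up to 1 second for any socket in the first list to become readable.
readable, _, _ = select.select([reader], [], [], 1.0)
received = [sock.recv(4096) for sock in readable]
print(received)

reader.close()
writer.close()
```

In a real checker the first list would hold all open proxy sockets, and a single thread would loop over whichever ones select reports as ready.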
I've run into this situation before. Just make a pool of Tasks, and spawn a fixed number of threads that run an endless loop which grabs a Task from the pool, run it, and repeat. Essentially you're implementing your own thread abstraction and using the OS threads to implement it.
This does have drawbacks, the major one being that if your Tasks block for long periods of time they can prevent the execution of other Tasks. But it does let you create an unbounded number of Tasks, limited only by memory.
Does Python have any sort of asynchronous IO functionality? That would be the preferred answer IMO - spawning an extra thread for each outbound connection isn't as neat as having a single thread which is effectively event-driven.
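Modern Python does have this in the standard library as asyncio. The sketch below shows the single-threaded, event-driven shape with a cap on concurrent checks; check_proxy is hypothetical and uses asyncio.sleep in place of a real network round trip, and the proxy names are made up.

```python
import asyncio

async def check_proxy(proxy, limit):
    # Hypothetical check: asyncio.sleep stands in for a real network round trip.
    async with limit:
        await asyncio.sleep(0.01)
        return proxy, True

async def check_all(proxies, max_in_flight=100):
    limit = asyncio.Semaphore(max_in_flight)  # cap concurrent checks
    return await asyncio.gather(*(check_proxy(p, limit) for p in proxies))

proxies = ["proxy-{}".format(i) for i in range(1000)]
results = asyncio.run(check_all(proxies))
print(len(results))
```

All 1000 checks run on one OS thread, so the thread limit never comes into play; the semaphore only bounds how many connections are open at once.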
Using different processes, and pipes to transfer data. Using threads in python is pretty lame. From what I heard, they don't actually run in parallel, even if you have a multi-core processor... But maybe it was fixed in python3.
My solution is: A global variable that gets incremented when a thread spawns and decrements when a thread finishes. The function which spawns the threads monitors the variable so that the limit is not reached.
The standard way is to have each thread get next tasks in a loop instead of dying after processing just one. This way you don't have to keep track of the number of threads, since you just fire a fixed number of them. As a bonus, you save on thread creation/destruction.
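A sketch of that fixed-worker pattern, using queue.Queue from the standard library, which does its own locking. The doubling "task" and the None sentinel are assumptions for illustration; a real checker would put proxy addresses on the queue.

```python
import queue
import threading

NUM_WORKERS = 4
tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()
        if item is None:            # sentinel: no more work for this worker
            break
        with results_lock:
            results.append(item * 2)    # hypothetical task: double the number

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for n in range(1000):
    tasks.put(n)
for _ in threads:
    tasks.put(None)                 # one sentinel per worker

for t in threads:
    t.join()

print(len(results))
```

Only NUM_WORKERS threads are ever created, however many tasks are queued, so the OS thread limit is never approached.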
A counting semaphore should do the trick.
from socket import socket
from threading import Semaphore, Thread

maxthreads = 1000
threads_sem = Semaphore(maxthreads)

class MyThread(Thread):
    def __init__(self, conn, addr):
        Thread.__init__(self)
        self.conn = conn
        self.addr = addr

    def run(self):
        try:
            # use this thread's own connection, not the global one
            read = self.conn.recv(4096)
            if read == b'go away\n':
                global running
                running = False
            self.conn.close()
        finally:
            threads_sem.release()   # free a slot even if recv fails

sock = socket()
sock.bind(('0.0.0.0', 2323))
sock.listen(1)
running = True
while running:
    conn, addr = sock.accept()
    threads_sem.acquire()           # blocks once maxthreads are active
    MyThread(conn, addr).start()
Make sure your threads get destroyed properly after they've been used or use a threadpool, although per what I see they're not that effective in Python
see here:
http://code.activestate.com/recipes/203871/
Using the select module or a similar library would most probably be a more efficient solution, but that would require bigger architectural changes.
If you just want to limit the number of threads, a global counter should be fine, as long as you access it in a thread safe way.
Be careful to minimize the default thread stack size. At least on Linux, the default limit puts severe restrictions on the number of created threads. Linux allocates a chunk of the process virtual address space to the stack (usually 10MB). 300 threads x 10MB stack allocation = 3GB of virtual address space dedicated to stack, and on a 32 bit system you have a 3GB limit. You can probably get away with much less.
Twisted is a perfect fit for this problem. See http://twistedmatrix.com/documents/current/core/howto/clients.html for a tutorial on writing a client.
If you don't mind using alternate Python implementations, Stackless has light-weight (non-native) threads. The only company I know doing much with it though is CCP--they use it for tasklets in their game on both the client and server. You still need to do async I/O with Stackless because if a thread blocks, the process blocks.
As mentioned in another thread, why do you spawn off a new thread for each single operation? This is a classical producer - consumer problem, isn't it? Depending a bit on how you look at it, the proxy checkers might be consumers or producers.
Anyway, the solution is to make a "queue" of "tasks" to process, and make the threads in a loop check if there are any more tasks to perform in the queue, and if there isn't, wait a predefined interval, and check again.
You should protect your queue with some locking mechanism, e.g. semaphores, to prevent race conditions.
It's really not that difficult. But it requires a bit of thinking getting it right. Good luck!
