Why do we need locks for threads, if we have GIL? - python

I believe it is a stupid question but I still can't find it. Actually it's better to separate it into two questions:
1) Am I right that we could have a lot of threads but because of GIL in one moment only one thread is executing?
2) If so, why do we still need locks? We use locks to avoid the case when two threads are trying to read/write some shared object, because of GIL twi threads can't be executed in one moment, can they?

GIL protects the Python interals. That means:
you don't have to worry about something in the interpreter going wrong because of multithreading
most things do not really run in parallel, because python code is executed sequentially due to GIL
But GIL does not protect your own code. For example, if you have this code:
self.some_number += 1
That is going to read value of self.some_number, calculate some_number+1 and then write it back to self.some_number.
If you do that in two threads, the operations (read, add, write) of one thread and the other may be mixed, so that the result is wrong.
This could be the order of execution:
thread1 reads self.some_number (0)
thread2 reads self.some_number (0)
thread1 calculates some_number+1 (1)
thread2 calculates some_number+1 (1)
thread1 writes 1 to self.some_number
thread2 writes 1 to self.some_number
You use locks to enforce this order of execution:
thread1 reads self.some_number (0)
thread1 calculates some_number+1 (1)
thread1 writes 1 to self.some_number
thread2 reads self.some_number (1)
thread2 calculates some_number+1 (2)
thread2 writes 2 to self.some_number
EDIT: Let's complete this answer with some code which shows the explained behaviour:
import threading
import time
total = 0
lock = threading.Lock()
def increment_n_times(n):
global total
for i in range(n):
total += 1
def safe_increment_n_times(n):
global total
for i in range(n):
lock.acquire()
total += 1
lock.release()
def increment_in_x_threads(x, func, n):
threads = [threading.Thread(target=func, args=(n,)) for i in range(x)]
global total
total = 0
begin = time.time()
for thread in threads:
thread.start()
for thread in threads:
thread.join()
print('finished in {}s.\ntotal: {}\nexpected: {}\ndifference: {} ({} %)'
.format(time.time()-begin, total, n*x, n*x-total, 100-total/n/x*100))
There are two functions which implement increment. One uses locks and the other does not.
Function increment_in_x_threads implements parallel execution of the incrementing function in many threads.
Now running this with a big enough number of threads makes it almost certain that an error will occur:
print('unsafe:')
increment_in_x_threads(70, increment_n_times, 100000)
print('\nwith locks:')
increment_in_x_threads(70, safe_increment_n_times, 100000)
In my case, it printed:
unsafe:
finished in 0.9840562343597412s.
total: 4654584
expected: 7000000
difference: 2345416 (33.505942857142855 %)
with locks:
finished in 20.564176082611084s.
total: 7000000
expected: 7000000
difference: 0 (0.0 %)
So without locks, there were many errors (33% of increments failed). On the other hand, with locks it was 20 times slower.
Of course, both numbers are blown up because I used 70 threads, but this shows the general idea.

At any moment, yes, only one thread is executing Python code (other threads may be executing some IO, NumPy, whatever). That is mostly true. However, this is trivially true on any single-processor system, and yet people still need locks on single-processor systems.
Take a look at the following code:
queue = []
def do_work():
while queue:
item = queue.pop(0)
process(item)
With one thread, everything is fine. With two threads, you might get an exception from queue.pop() because the other thread called queue.pop() on the last item first. So you would need to handle that somehow. Using a lock is a simple solution. You can also use a proper concurrent queue (like in the queue module)--but if you look inside the queue module, you'll find that the Queue object has a threading.Lock() inside it. So either way you are using locks.
It is a common newbie mistake to write multithreaded code without the necessary locks. You look at code and think, "this will work just fine" and then find out many hours later that something truly bizarre has happened because threads weren't synchronized properly.
Or in short, there are many places in a multithreaded program where you need to prevent another thread from modifying a structure until you're done applying some changes. This allows you to maintain the invariants on your data, and if you can't maintain invariants, then it's basically impossible to write code that is correct.
Or put in the shortest way possible, "You don't need locks if you don't care if your code is correct."

the GIL prevents simultaneous execution of multiple threads, but not in all situations.
The GIL is temporarily released during I/O operations executed by threads. That means, multiple threads can run at the same time. That's one reason you still need locks.
I don't know where I found this reference.... in a video or something - hard to look it up, but you can investigate further yourself
UPDATE:
The few thumbs down I got signal to me that people think memory is not a good enough reference, and google not a good enough database. While I'd disagree with that, let me provide one of the first URLs I looked up (and checked!), so the people who disliked my answer can live happily from how on:
https://wiki.python.org/moin/GlobalInterpreterLock

the GIL does not protect you from modification of the internal states of the objects that you are accessing concurrently from different threads, meaning that you can still mess things up if you don't take measures.
So, despite the fact that two threads may not be running at the same exact time, they can still be trying to manipulate the internal state of an object (one at a time, intermittently), and if you don't prevent that from happening (with some locking mechanism) your code could/will eventually fail.
Regards.

Related

Maximum limit on number of threads in python

I am using Threading module in python. How to know how many max threads I can have on my system?
I am using Threading module in python. How to know how many max
threads I can have on my system?
There doesn't seem to be a hard-coded or configurable MAX value that I've ever found, but there is definitely a limit. Run the following program:
import threading
import time
def mythread():
time.sleep(1000)
def main():
threads = 0 #thread counter
y = 1000000 #a MILLION of 'em!
for i in range(y):
try:
x = threading.Thread(target=mythread, daemon=True)
threads += 1 #thread counter
x.start() #start each thread
except RuntimeError: #too many throws a RuntimeError
break
print("{} threads created.\n".format(threads))
if __name__ == "__main__":
main()
I suppose I should mention that this is using Python 3.
The first function, mythread(), is the function which will be executed as a thread. All it does is sleep for 1000 seconds then terminate.
The main() function is a for-loop which tries to start one million threads. The daemon property is set to True simply so that we don't have to clean up all the threads manually.
If a thread cannot be created Python throws a RuntimeError. We catch that to break out of the for-loop and display the number of threads which were successfully created.
Because daemon is set True, all threads terminate when the program ends.
If you run it a few times in a row you're likely to see that a different number of threads will be created each time. On the machine from which I'm posting this reply, I had a minimum 18,835 during one run, and a maximum of 18,863 during another run. And the more you fiddle with the code, as in, the more code you add to this in order to experiment or find more information, you'll find the fewer threads can/will be created.
So, how to apply this to real world.
Well, a server may need the ability to start a triple-digit number of threads, but in most other cases you should re-evaluate your game plan if you think you're going to be generating a large number of threads.
One thing you need to consider if you're using Python: if you're using a standard distribution of Python, your system will only execute one Python thread at a time, including the main thread of your program, so adding more threads to your program or more cores to your system doesn't really get you anything when using the threading module in Python. You can research all of the pedantic details and ultracrepidarian opinions regarding the GIL / Global Interpreter Lock for more info on that.
What that means is that cpu-bound (computationally-intensive) code doesn't benefit greatly from factoring it into threads.
I/O-bound (waiting for file read/write, network read, or user I/O) code, however, benefits greatly from multithreading! So, start a thread for each network connection to your Python-based server.
Threads can also be great for triggering/throwing/raising signals at set periods, or simply to block out the processing sections of your code more logically.

what's the best practice for accessing shared data with python multi-threading

In python multi-threading, there are some atomic types that can be accessed
by multiple threads without protection(list, dict, etc). There are also some types need protected by lock.
My question is:
where can I find official document that list all atomic types, I can google some answers, but they are not "official" and out of date.
some book suggest that we should protect all shared data with lock, because atomic type may because non-atomic, we shouldn't rely on it. Is this correct?
because lock surely have overhead, is this overhead negligible even with big program?
Locks are used for making an operation atomic. This means only one thread can access some resource. Using many locks causes your application lose the benefit of threading, as only one thread can access the resource.
If you think about it, it doesn't make much sense. It will make your program slower, because of the python needs to manage and context switch between the threads.
When using threads, you should look for minimizing the number of locks as much as possible. Try use local variables whenever possible. Make your function do some work, and return a value instead of updating an existing one.
Then you can create a Queue and collect the results.
Besides locks, there are Semaphores. These are basically Locks, with a limited number of threads can use it:
A semaphore manages an internal counter which is decremented by each acquire() call and incremented by each release() call. The counter can never go below zero; when acquire() finds that it is zero, it blocks, waiting until some other thread calls release().
Python has a good documentation for threading module.
Here is a small example of a dummy function tested using single thread vs 3 threads. Pay attention to the impact Lock makes on the running time:
threads (no locks) duration: 1.0949997901
threads (with locks) duration: 3.1289999485
single thread duration: 3.09899997711
def work():
x = 0
for i in range(100):
x += i
lock.acquire()
print 'acquried lock, do some calculations'
time.sleep(1)
print x
lock.release()
print 'lock released'
I think you are looking for this link.
From above link :
An operation acting on shared memory is atomic if it completes in a
single step relative to other threads. When an atomic store is
performed on a shared variable, no other thread can observe the
modification half-complete. When an atomic load is performed on a
shared variable, it reads the entire value as it appeared at a single
moment in time. Non-atomic loads and stores do not make those
guarantees.
Any manipulation on list won't be atomic operation, so extra care need to be taken to make it thread safe using Lock, Event, Condition or Semaphores etc.
For example, you can check this answer which explains how list are thread safe.

Multithreaded parsing is slower then sequential

I am parsing 4 large XML files through threads and somehow the multithreaded code is slower then the sequential code?
Here is my multithreaded code:
def parse():
thread_list = []
for file_name in cve_file:
t = CVEParser(file_name)
t.start()
thread_list.append(t)
for t in thread_list:
t.join()
result = t.result
for res in result:
print res
PersistenceService.insert_data_from_file(res[0], res[1])
os.remove(res[0])
and thats the "faster" code:
def parse:
thread_list = []
for file_name in cve_file:
t = CVEParser(file_name)
t.start()
t.join()
thread_list.append(t)
for t in thread_list:
result = t.result
for res in result:
print res
PersistenceService.insert_data_from_file(res[0], res[1])
os.remove(res[0])
The sequential code is faster by 10 whole minutes, how is this possible?
Python uses the GIL (Global Interpreter Lock) to ensure only one thread executes Python code at a time. This is done to prevent data races and for some other reasons. That, however, means that multithreading in the default CPython will barely give you any code speedup (if it won't slow it down, as it did in your case).
To efficiently parallelize your workload, look into Python's multiprocessing module, which instead launches separate processes that are not affected by each other's GIL
Here's a SO question on that topic
Where did you read that multi-threading or even multi-processing should be always faster that sequential? That is simply wrong. Which one of the 3 modes is faster highly depends on the problem to solve, and where the bottleneck is.
if the algo needs plenty of memory, or if processing multiple parralel operation requires locking, sequential processing is often the best bet
if the bottleneck is IO, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for io completion during that time and you will get a much better throughput - even if the really faster way is normally polling io with select when possible
only if the bottleneck is CPU processing - which IMHO is not the most common use case - parallelization over different cores is the winner. In Python that means multi-processing (*). That mainly concerns heavy computations
In your use case, there is one other potential cause: you wait for the threads in sequence in the join part. That means that if thread2 ends much before thread0, you will only process it after thread0 has ended which is subobtimal.
This kind of code is often more efficient because it allows processing as soon as one thread has finished:
active_list = thread_list[:]
while len(active_list) > 0:
for t in active_list:
if not t.is_active():
t.join()
active_list.remove[t]
# process t results
...
time.sleep(0.1)
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well knows example for that is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Thechnically this means releasing the Global Interpreter Lock.
If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read one at a time, and will have to physically move the read/write head back and forth between them many many times to service different reading threads. This takes a lot longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

threading.Lock() performance issues

I have multiple threads:
dispQ = Queue.Queue()
stop_thr_event = threading.Event()
def worker (stop_event):
while not stop_event.wait(0):
try:
job = dispQ.get(timeout=1)
job.waitcount -= 1
dispQ.task_done()
except Queue.Empty, msg:
continue
# create job objects and put into dispQ here
for j in range(NUM_OF_JOBS):
j = Job()
dispQ.put(j)
# NUM_OF_THREADS could be 10-20 ish
running_threads = []
for t in range(NUM_OF_THREADS):
t1 = threading.Thread( target=worker, args=(stop_thr_event,) )
t1.daemon = True
t1.start()
running_threads.append(t1)
stop_thr_event.set()
for t in running_threads:
t.join()
The code above was giving me some very strange behavior.
I've ended up finding out that it was due to decrementing waitcount with out a lock
I 've added an attribute to Job class self.thr_lock = threading.Lock()
Then I've changed it to
with job.thr_lock:
job.waitcount -= 1
This seems to fix the strange behavior but it looks like it has degraded in performance.
Is this expected? is there way to optimize locking?
Would it be better to have one global lock rather than one lock per job object?
About the only way to "optimize" threading would be to break the processing down in blocks or chunks of work that can be performed at the same time. This mostly means doing input or output (I/O) because that is the only time the interpreter will release the Global Interpreter Lock, aka the GIL.
In actuality there is often no gain or even a net slow-down when threading is added due to the overhead of using it unless the above condition is met.
It would probably be worse if you used a single global lock for all the shared resources because it would make parts of the program wait when they really didn't need to do so since it wouldn't distinguish what resource was needed so unnecessary waiting would occur.
You might find the PyCon 2015 talk David Beasley gave titled Python Concurrency From the Ground Up of interest. It covers threads, event loops, and coroutines.
It's hard to answer your question based on your code. Locks do have some inherent cost, nothing is free, but normally it is quite small. If your jobs are very small, you might want to consider "chunking" them, that way you have many fewer acquire/release calls relative to the amount of work being done by each thread.
A related but separate issue is one of threads blocking each other. You might notice large performance issues if many threads are waiting on the same lock(s). Here your threads are sitting idle waiting on each other. In some cases this cannot be avoided because there is a shared resource which is a performance bottlenecking. In other cases you can re-organize your code to avoid this performance penalty.
There are some things in your example code that make me thing that it might be very different from actual application. First, your example code doesn't share job objects between threads. If you're not sharing job objects you shouldn't need locks on them. Second, as written your example code might not empty the queue before finishing. It will exit as soon as you hit stop_thr_event.set() leaving any remaining jobs in queue, is this by design?

Is this multi-threaded function asynchronous

I'm afraid I'm still a bit confused (despite checking other threads) whether:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
My initial guess is no to both and that proper asynchronous code should be able to run in one thread - however it can be improved by adding threads for example like so:
So I constructed this toy example:
from threading import *
from queue import Queue
import time
def do_something_with_io_lag(in_work):
out = in_work
# Imagine we do some work that involves sending
# something over the internet and processing the output
# once it arrives
time.sleep(0.5) # simulate IO lag
print("Hello, bee number: ",
str(current_thread().name).replace("Thread-",""))
class WorkerBee(Thread):
def __init__(self, q):
Thread.__init__(self)
self.q = q
def run(self):
while True:
# Get some work from the queue
work_todo = self.q.get()
# This function will simiulate I/O lag
do_something_with_io_lag(work_todo)
# Remove task from the queue
self.q.task_done()
if __name__ == '__main__':
def time_me(nmbr):
number_of_worker_bees = nmbr
worktodo = ['some input for work'] * 50
# Create a queue
q = Queue()
# Fill with work
[q.put(onework) for onework in worktodo]
# Launch processes
for _ in range(number_of_worker_bees):
t = WorkerBee(q)
t.start()
# Block until queue is empty
q.join()
# Run this code in serial mode (just one worker)
%time time_me(nmbr=1)
# Wall time: 25 s
# Basically 50 requests * 0.5 seconds IO lag
# For me everything gets processed by bee number: 59
# Run this code using multi-tasking (launch 50 workers)
%time time_me(nmbr=50)
# Wall time: 507 ms
# Basically the 0.5 second IO lag + 0.07 seconds it took to launch them
# Now everything gets processed by different bees
Is it asynchronous?
To me this code does not seem asynchronous because it is Figure 3 in my example diagram. The I/O call blocks the thread (although we don't feel it because they are blocked in parallel).
However, if this is the case I am confused why requests-futures is considered asynchronous since it is a wrapper around ThreadPoolExecutor:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
future_to_url = {executor.submit(load_url, url, 10): url for url in get_urls()}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
Can this function on just one thread?
Especially when compared to asyncio, which means it can run single-threaded
There are only two ways to have a program on a single processor do
“more than one thing at a time.” Multi-threaded programming is the
simplest and most popular way to do it, but there is another very
different technique, that lets you have nearly all the advantages of
multi-threading, without actually using multiple threads. It’s really
only practical if your program is largely I/O bound. If your program
is processor bound, then pre-emptive scheduled threads are probably
what you really need. Network servers are rarely processor bound,
however.
First of all, one note: concurrent.futures.Future is not the same as asyncio.Future. Basically it's just an abstraction - an object, that allows you to refer to job result (or exception, which is also a result) in your program after you assigned a job, but before it is completed. It's similar to assigning common function's result to some variable.
Multithreading: Regarding your example, when using multiple threads you can say that your code is "asynchronous" as several operations are performed in different threads at the same time without waiting for each other to complete, and you can see it in the timing results. And you're right, your function due to sleep is blocking, it blocks the worker thread for the specified amount of time, but when you use several threads those threads are blocked in parallel. So if you would have one job with sleep and the other one without and run multiple threads, the one without sleep would perform calculations while the other would sleep. When you use single thread, the jobs are performed in in a serial manner one after the other, so when one job sleeps the other jobs wait for it, actually they just don't exist until it's their turn. All this is pretty much proven by your time tests. The thing happened with print has to do with "thread safety", i.e. print uses standard output, which is a single shared resource. So when your multiple threads tried to print at the same time the switching happened inside and you got your strange output. (This also show "asynchronicity" of your multithreaded example.) To prevent such errors there are locking mechanisms, e.g. locks, semaphores, etc.
Asyncio: To better understand the purpose note the "IO" part, it's not 'async computation', but 'async input/output'. When talking about asyncio you usually don't think about threads at first. Asyncio is about event loop and generators (coroutines). The event loop is the arbiter, that governs the execution of coroutines (and their callbacks), that were registered to the loop. Coroutines are implemented as generators, i.e. functions that allow to perform some actions iteratively, saving state at each iteration and 'returning', and on the next call continuing with the saved state. So basically the event loop is while True: loop, that calls all coroutines/generators, assigned to it, one after another, and they provide result or no-result on each such call - this provides possibility for "asynchronicity". (A simplification, as there's scheduling mechanisms, that optimize this behavior.) The event loop in this situation can run in single thread and if coroutines are non-blocking it will give you true "asynchronicity", but if they are blocking then it's basically a linear execution.
You can achieve the same thing with explicit multithreading, but threads are costly - they require memory to be assigned, switching them takes time, etc. On the other hand asyncio API allows you to abstract from actual implementation and just consider your jobs to be performed asynchronously. It's implementation may be different, it includes calling the OS API and the OS decides what to do, e.g. DMA, additional threads, some specific microcontroller use, etc. The thing is it works well for IO due to lower level mechanisms, hardware stuff. On the other hand, performing computation will require explicit breaking of computation algorithm into pieces to use as asyncio coroutine, so a separate thread might be a better decision, as you can launch the whole computation as one there. (I'm not talking about algorithms that are special to parallel computing). But asyncio event loop might be explicitly set to use separate threads for coroutines, so this will be asyncio with multithreading.
Regarding your example, if you'll implement your function with sleep as asyncio coroutine, shedule and run 50 of them single threaded, you'll get time similar to the first time test, i.e. around 25s, as it is blocking. If you will change it to something like yield from [asyncio.sleep][3](0.5) (which is a coroutine itself), shedule and run 50 of them single threaded, it will be called asynchronously. So while one coroutine will sleep the other will be started, and so on. The jobs will complete in time similar to your second multithreaded test, i.e. close to 0.5s. If you will add print here you'll get good output as it will be used by single thread in serial manner, but the output might be in different order then the order of coroutine assignment to the loop, as coroutines could be run in different order. If you will use multiple threads, then the result will obviously be close to the last one anyway.
Simplification: The difference in multythreading and asyncio is in blocking/non-blocking, so basicly blocking multithreading will somewhat come close to non-blocking asyncio, but there're a lot of differences.
Multithreading for computations (i.e. CPU bound code)
Asyncio for input/output (i.e. I/O bound code)
Regarding your original statement:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
I hope that I was able to show, that:
asynchronous code might be both single threaded and multi-threaded
all multi-threaded functions could be called "asynchronous"
I think the main confusion comes from the meaning of asynchronous. From the Free Online Dictionary of Computing, "A process [...] whose execution can proceed independently" is asynchronous. Now, apply that to what your bees do:
Retrieve an item from the queue. Only one at a time can do that, while the order in which they get an item is undefined. I wouldn't call that asynchronous.
Sleep. Each bee does so independently of all others, i.e. the sleep duration runs on all, otherwise the time wouldn't go down with multiple bees. I'd call that asynchronous.
Call print(). While the calls are independent, at some point the data is funneled into the same output target, and at that point a sequence is enforced. I wouldn't call that asynchronous. Note however that the two arguments to print() and also the trailing newline are handled independently, which is why they can be interleaved.
Lastly, the call to q.join(). Here of course the calling thread is blocked until the queue is empty, so some kind of synchronization is enforced and wanted. I don't see why this "seems to break" for you.

Categories