threading.Lock() performance issues

I have multiple threads:
import Queue
import threading
dispQ = Queue.Queue()
stop_thr_event = threading.Event()
def worker(stop_event):
    while not stop_event.wait(0):
        try:
            job = dispQ.get(timeout=1)
            job.waitcount -= 1
            dispQ.task_done()
        except Queue.Empty:
            continue
# create job objects and put into dispQ here
for j in range(NUM_OF_JOBS):
    j = Job()
    dispQ.put(j)
# NUM_OF_THREADS could be 10-20 ish
running_threads = []
for t in range(NUM_OF_THREADS):
    t1 = threading.Thread(target=worker, args=(stop_thr_event,))
    t1.daemon = True
    t1.start()
    running_threads.append(t1)
stop_thr_event.set()
for t in running_threads:
    t.join()
The code above was giving me some very strange behavior.
I eventually found out that it was caused by decrementing waitcount without a lock.
I added an attribute to the Job class, self.thr_lock = threading.Lock(),
and then changed the decrement to:
with job.thr_lock:
    job.waitcount -= 1
This seems to fix the strange behavior, but performance appears to have degraded.
Is this expected? Is there a way to optimize locking?
Would it be better to have one global lock rather than one lock per job object?

About the only way to "optimize" threading is to break the processing down into blocks or chunks of work that can be performed at the same time. In practice this mostly means input or output (I/O), because that is when the interpreter releases the Global Interpreter Lock (GIL) and lets other threads run.
Unless that condition is met, adding threads often yields no gain, or even a net slowdown, because of the overhead of using them.
A single global lock for all the shared resources would probably be worse. It cannot distinguish which resource is needed, so parts of the program would wait when they really didn't need to.
You might find the PyCon 2015 talk by David Beazley, Python Concurrency From the Ground Up, of interest. It covers threads, event loops, and coroutines.

It's hard to answer your question based on your code. Locks do have some inherent cost, nothing is free, but normally it is quite small. If your jobs are very small, you might want to consider "chunking" them, that way you have many fewer acquire/release calls relative to the amount of work being done by each thread.
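As a rough illustration of the chunking idea, here is a minimal Python 3 sketch. The names process_chunk, chunked and the shared totals dictionary are hypothetical, and the summation merely stands in for real per-job work. The only point is that each worker takes the lock once per chunk instead of once per item.

```python
import threading

lock = threading.Lock()
totals = {"count": 0}          # hypothetical shared state


def process_chunk(chunk):
    """Do the per-item work without the lock, then publish once."""
    local = 0
    for item in chunk:
        local += item          # stand-in for the real per-job work
    with lock:                 # one acquire/release per chunk, not per item
        totals["count"] += local


def chunked(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


items = list(range(1000))
threads = [threading.Thread(target=process_chunk, args=(chunk,))
           for chunk in chunked(items, 100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(totals["count"])
```

With 1000 items and a chunk size of 100, the lock is taken 10 times rather than 1000 times.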
A related but separate issue is threads blocking each other. You might notice large performance problems if many threads are waiting on the same lock(s), because those threads sit idle waiting on each other. In some cases this cannot be avoided because a shared resource is the performance bottleneck. In other cases you can reorganize your code to avoid the penalty.
There are some things in your example code that make me think it might be very different from your actual application. First, your example code doesn't share job objects between threads. If you're not sharing job objects, you shouldn't need locks on them. Second, as written, your example code might not empty the queue before finishing. It will exit as soon as you hit stop_thr_event.set(), leaving any remaining jobs in the queue. Is this by design?
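If leaving jobs in the queue is not intended, one possible fix is to wait for the queue to drain before setting the stop event. The following is a sketch in Python 3 (so the module is queue rather than Queue); work_q, processed and the worker count are names made up for the illustration.

```python
import queue
import threading

work_q = queue.Queue()
stop_event = threading.Event()
processed = []                      # collected results, for the demo only


def worker():
    # Poll with a short timeout so the stop event is noticed promptly.
    while not stop_event.is_set():
        try:
            job = work_q.get(timeout=0.1)
        except queue.Empty:
            continue
        processed.append(job)       # stand-in for the real job handling
        work_q.task_done()


for n in range(50):
    work_q.put(n)

threads = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
for t in threads:
    t.start()

work_q.join()        # blocks until task_done() was called for every job
stop_event.set()     # only now tell the workers to exit
for t in threads:
    t.join()
```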

Related

Correctness of modified consumer/producer

I am creating a Sound class to play notes and would like feedback on the correctness and conciseness of my design. This class differs from the typical consumer/producer in two ways:
The consumer should respond to events, such as to shut down the thread, or otherwise continue forever. The typical consumer/producer exits when the queue is empty. For example, a thread waiting in queue.get cannot handle additional notifications.
Each set of notes submitted by the producer should overwrite any unprocessed notes remaining on the queue.
Originally I had the consumer process one note at a time using the queue module. I found continually acquiring and releasing the lock without any competition to be inefficient, and as previously noted, queue.get prevents waiting on additional events. So instead of building upon that, I rewrote it into:
import threading
queue = []
condition = threading.Condition()
interrupt = threading.Event()
stop = threading.Event()
def producer():
    while some_condition:
        ns = get_notes() # [(float,float)]
        with condition:
            queue.clear()
            queue.append(ns)
            interrupt.set()
            condition.notify()
    with condition:
        stop.set()
        condition.notify()
    thread.join()
def consumer():
    while not stop.is_set():
        with condition:
            while not (queue or stop.is_set()):
                condition.wait()
            if stop.is_set():
                break
            interrupt.clear()
            ns = queue.pop()
        ss = gen_samples(ns) # iterator/fast
        for b in grouper(ss, size/2):
            if interrupt.is_set() or stop.is_set():
                break
            stream.write(b)
thread = threading.Thread(target=consumer)
thread.start()
producer()
My questions are as follows:
Is this thread-safe? I want to specifically point out my use of is_set without locks or synchronization (in the for-loop).
Can the events be replaced with boolean variables? I believe so as conflicting writes in both threads (data race) are guarded by the condition variable. There is a race condition between setting and checking events but I do not believe it affects program flow.
Is there a more efficient approach/algorithm utilizing different synchronization primitives from the threading module?
edit: Found and fixed a possible deadlock described in Why does Python threading.Condition() notify() require a lock?

Analyzing thread-safety in Python can take into account the Global Interpreter Lock (GIL): no two threads will execute Python code simultaneously. Assignments to variables or object fields are effectively atomic (there are no half-assigned variables) and changes propagate effectively immediately to other threads.
This means that your use of Event.is_set() is already equivalent to using plain booleans. An event is a bool guarded by a Condition. The is_set() method checks the boolean directly. The set() method acquires the Condition, sets the boolean, and notifies all waiting threads. The wait() method waits until the set() method is invoked. The clear() method acquires the Condition and unsets the boolean. Since you never wait() for any Event, and setting the boolean is atomic, the Condition in the Event is effectively unused.
This might get rid of a couple of locks, but isn't really a huge efficiency win. A Condition is still an abstraction over a lock, but the built-in Queue type uses locks directly. Thus, I would assume that the built-in queue is no less performant than your solution, even for a single consumer.
Your main issue with the built-in queue is that “continually acquiring and releasing the lock without any competition [is] inefficient”. This is wrong on two counts:
Due to Python's GIL, there is little competition in either case.
Acquiring uncontested locks is very efficient.
So while your solution is probably sufficiently correct (I can see no opportunity for deadlock) it is unlikely to be particularly efficient. (There are just some small mistakes, like using stop instead of stop.is_set() and some syntax errors.)
If you are seeing poor performance with Python threads that's probably because of CPython, not because of the Queue type. I already mentioned that only one thread can run at a time due to the GIL. If multiple threads want to run, they must be scheduled by the operating system to do so and acquire the GIL. Each thread will wait for 5ms before asking the running thread to give up the GIL (in a manner quite similar to your interrupt flag). And then the thread can do useful work like acquiring a lock for a critical section that must not be interrupted by other threads.
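The 5 ms figure is CPython's default "switch interval". As a small sketch, assuming CPython 3.2 or later, it can be inspected with sys.getswitchinterval() (and changed with sys.setswitchinterval(), which is rarely worth doing):

```python
import sys

# CPython's default is 0.005 seconds (5 ms) unless something changed it.
interval = sys.getswitchinterval()
print(f"switch interval: {interval * 1000:.1f} ms")
```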
Possibly, the solution could be to avoid CPython's threads.
If you have multiple CPU-bound tasks, you must use multiple processes. CPython's threads will not run in parallel. However, communication between processes is more expensive.
Consider whether you can combine the producer+consumer directly, possibly using features such as generators.
For an easier time juggling multiple tasks in the same thread, consider using async/await. Event loops are provided by the asyncio module. This is just as fast as Python's threads, with the caveat that tasks don't pre-empt (interrupt) each other. But this can be an advantage: since a task can only be suspended at an await, you don't need most locks and it is easier to reason about the correctness of the code. The downside is that async/await might have even higher latency than using threads.
Python has a concept of “executors” that make it easy and efficient to run tasks in separate threads (for I/O-bound tasks) or separate processes (for CPU-bound tasks).
For communicating between multiple processes, use the types from the multiprocessing module (e.g. Queue, Connection, or Value).
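To make the executor suggestion concrete, here is a minimal sketch using concurrent.futures.ThreadPoolExecutor. The fetch function is hypothetical; time.sleep stands in for network or disk latency.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch(n):
    """Hypothetical I/O-bound task; sleep stands in for network latency."""
    time.sleep(0.1)
    return n * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map() returns results in input order, whatever order tasks finish in.
    results = list(pool.map(fetch, range(8)))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Because the eight tasks sleep concurrently, the elapsed time should be close to a single 0.1 s delay rather than 0.8 s. For CPU-bound functions, ProcessPoolExecutor offers the same interface but runs the work in separate processes.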

Multithreaded parsing is slower than sequential

I am parsing 4 large XML files through threads and somehow the multithreaded code is slower than the sequential code?
Here is my multithreaded code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        thread_list.append(t)
    for t in thread_list:
        t.join()
        result = t.result
        for res in result:
            print(res)
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
And that's the "faster" code:
def parse():
    thread_list = []
    for file_name in cve_file:
        t = CVEParser(file_name)
        t.start()
        t.join()
        thread_list.append(t)
    for t in thread_list:
        result = t.result
        for res in result:
            print(res)
            PersistenceService.insert_data_from_file(res[0], res[1])
            os.remove(res[0])
The sequential code is faster by 10 whole minutes, how is this possible?

Python uses the GIL (Global Interpreter Lock) to ensure only one thread executes Python code at a time. This is done to prevent data races and for some other reasons. That, however, means that multithreading in the default CPython will barely speed up CPU-bound code, and may even slow it down, as it did in your case.
To efficiently parallelize your workload, look into Python's multiprocessing module, which instead launches separate processes that are not affected by each other's GIL.
Here's a SO question on that topic
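A minimal sketch of the multiprocessing approach follows. The function count_primes is a made-up CPU-bound stand-in for parsing one file, and run_parallel is a hypothetical helper name. multiprocessing.Pool and its map() method are real; the rest is illustration only.

```python
from multiprocessing import Pool


def count_primes(limit):
    """CPU-bound stand-in for parsing one large file (hypothetical)."""
    return sum(
        1 for n in range(2, limit)
        if all(n % d for d in range(2, int(n ** 0.5) + 1))
    )


def run_parallel(limits, workers=4):
    """Run count_primes for each limit in a separate worker process."""
    with Pool(processes=workers) as pool:
        return pool.map(count_primes, limits)


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    print(run_parallel([20_000, 20_000, 20_000, 20_000]))
```

Arguments and results are pickled to cross the process boundary, so this pays off only when the work per task outweighs that overhead.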

Where did you read that multi-threading or even multi-processing should always be faster than sequential? That is simply wrong. Which one of the 3 modes is faster highly depends on the problem to solve, and where the bottleneck is.
if the algorithm needs plenty of memory, or if processing multiple parallel operations requires locking, sequential processing is often the best bet
if the bottleneck is IO, Python multithreading is the way to go: even if only one thread can be active at a time, the others will be waiting for IO completion during that time and you will get a much better throughput - even if the really faster way is normally polling IO with select when possible
only if the bottleneck is CPU processing - which IMHO is not the most common use case - parallelization over different cores is the winner. In Python that means multi-processing (*). That mainly concerns heavy computations
In your use case, there is one other potential cause: you wait for the threads in sequence in the join part. That means that if thread2 ends much before thread0, you will only process it after thread0 has ended, which is suboptimal.
This kind of code is often more efficient because it allows processing as soon as one thread has finished:
active_list = thread_list[:]
while len(active_list) > 0:
    for t in active_list[:]:  # iterate over a copy, since we remove items
        if not t.is_alive():
            t.join()
            active_list.remove(t)
            # process t results
            ...
    time.sleep(0.1)
(*) Some libraries specialized in heavy or parallel computation can allow Python threads to run simultaneously. A well-known example is numpy: complex operations using numpy and executed in multiple threads can actually run simultaneously on different cores. Technically this means releasing the Global Interpreter Lock.

If you're reading these files from a spinning disk, then trying to read 4 at once can really slow down the process.
The disk can only really read one at a time, and will have to physically move the read/write head back and forth between them many many times to service different reading threads. This takes a lot longer than actually reading the data, and you will have to wait for it.
If you're using an SSD, on the other hand, then you won't have this problem. You'll probably still be limited by I/O speed, but the 4-thread case should take about the same amount of time as the single-thread case.

Why do we need locks for threads, if we have GIL?

I believe it is a stupid question, but I still can't find an answer. Actually, it's better to separate it into two questions:
1) Am I right that we could have a lot of threads, but because of the GIL only one thread is executing at any one moment?
2) If so, why do we still need locks? We use locks to avoid the case when two threads are trying to read/write some shared object, but because of the GIL two threads can't be executed at the same moment, can they?

The GIL protects the Python internals. That means:
you don't have to worry about something in the interpreter going wrong because of multithreading
most things do not really run in parallel, because python code is executed sequentially due to GIL
But GIL does not protect your own code. For example, if you have this code:
self.some_number += 1
That is going to read value of self.some_number, calculate some_number+1 and then write it back to self.some_number.
If you do that in two threads, the operations (read, add, write) of one thread and the other may be mixed, so that the result is wrong.
This could be the order of execution:
thread1 reads self.some_number (0)
thread2 reads self.some_number (0)
thread1 calculates some_number+1 (1)
thread2 calculates some_number+1 (1)
thread1 writes 1 to self.some_number
thread2 writes 1 to self.some_number
You use locks to enforce this order of execution:
thread1 reads self.some_number (0)
thread1 calculates some_number+1 (1)
thread1 writes 1 to self.some_number
thread2 reads self.some_number (1)
thread2 calculates some_number+1 (2)
thread2 writes 2 to self.some_number
EDIT: Let's complete this answer with some code which shows the explained behaviour:
import threading
import time
total = 0
lock = threading.Lock()
def increment_n_times(n):
    global total
    for i in range(n):
        total += 1
def safe_increment_n_times(n):
    global total
    for i in range(n):
        lock.acquire()
        total += 1
        lock.release()
def increment_in_x_threads(x, func, n):
    threads = [threading.Thread(target=func, args=(n,)) for i in range(x)]
    global total
    total = 0
    begin = time.time()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print('finished in {}s.\ntotal: {}\nexpected: {}\ndifference: {} ({} %)'
          .format(time.time()-begin, total, n*x, n*x-total, 100-total/n/x*100))
There are two functions which implement increment. One uses locks and the other does not.
Function increment_in_x_threads implements parallel execution of the incrementing function in many threads.
Now running this with a big enough number of threads makes it almost certain that an error will occur:
print('unsafe:')
increment_in_x_threads(70, increment_n_times, 100000)
print('\nwith locks:')
increment_in_x_threads(70, safe_increment_n_times, 100000)
In my case, it printed:
unsafe:
finished in 0.9840562343597412s.
total: 4654584
expected: 7000000
difference: 2345416 (33.505942857142855 %)
with locks:
finished in 20.564176082611084s.
total: 7000000
expected: 7000000
difference: 0 (0.0 %)
So without locks, there were many errors (33% of increments failed). On the other hand, with locks it was 20 times slower.
Of course, both numbers are blown up because I used 70 threads, but this shows the general idea.

At any moment, yes, only one thread is executing Python code (other threads may be executing some IO, NumPy, whatever). That is mostly true. However, this is trivially true on any single-processor system, and yet people still need locks on single-processor systems.
Take a look at the following code:
queue = []
def do_work():
    while queue:
        item = queue.pop(0)
        process(item)
With one thread, everything is fine. With two threads, you might get an exception from queue.pop() because the other thread called queue.pop() on the last item first. So you would need to handle that somehow. Using a lock is a simple solution. You can also use a proper concurrent queue (like in the queue module)--but if you look inside the queue module, you'll find that the Queue object has a threading.Lock() inside it. So either way you are using locks.
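As a sketch of the "proper concurrent queue" option in Python 3: get_nowait() combines the emptiness check and the pop into one step, so a thread that loses the race gets queue.Empty instead of an unexpected error. The squaring is a placeholder for process(item), and done_lock is a name chosen for this example.

```python
import queue
import threading

work = queue.Queue()
for n in range(100):
    work.put(n)

done = []
done_lock = threading.Lock()


def do_work():
    while True:
        try:
            item = work.get_nowait()   # check and pop in one atomic step
        except queue.Empty:
            return                     # another thread took the last item
        with done_lock:
            done.append(item * item)   # stand-in for process(item)


threads = [threading.Thread(target=do_work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(done))
```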
It is a common newbie mistake to write multithreaded code without the necessary locks. You look at code and think, "this will work just fine" and then find out many hours later that something truly bizarre has happened because threads weren't synchronized properly.
Or in short, there are many places in a multithreaded program where you need to prevent another thread from modifying a structure until you're done applying some changes. This allows you to maintain the invariants on your data, and if you can't maintain invariants, then it's basically impossible to write code that is correct.
Or put in the shortest way possible, "You don't need locks if you don't care if your code is correct."

The GIL prevents simultaneous execution of multiple threads, but not in all situations.
The GIL is temporarily released during I/O operations executed by threads. That means another thread can run while the first one waits on I/O, so their work on shared data can overlap. That's one reason you still need locks.
I don't know where I found this reference.... in a video or something - hard to look it up, but you can investigate further yourself
UPDATE:
The few thumbs down I got signal to me that people think memory is not a good enough reference, and Google not a good enough database. While I'd disagree with that, let me provide one of the first URLs I looked up (and checked!), so the people who disliked my answer can live happily from now on:
https://wiki.python.org/moin/GlobalInterpreterLock

The GIL does not protect you from modification of the internal state of objects that you are accessing concurrently from different threads, meaning that you can still mess things up if you don't take measures.
So, despite the fact that two threads may not be running at the same exact time, they can still be trying to manipulate the internal state of an object (one at a time, intermittently), and if you don't prevent that from happening (with some locking mechanism) your code could/will eventually fail.
Regards.

Is this multi-threaded function asynchronous

I'm afraid I'm still a bit confused (despite checking other threads) whether:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
My initial guess is no to both and that proper asynchronous code should be able to run in one thread - however it can be improved by adding threads for example like so:
So I constructed this toy example:
from threading import *
from queue import Queue
import time
def do_something_with_io_lag(in_work):
    out = in_work
    # Imagine we do some work that involves sending
    # something over the internet and processing the output
    # once it arrives
    time.sleep(0.5) # simulate IO lag
    print("Hello, bee number: ",
          str(current_thread().name).replace("Thread-",""))
class WorkerBee(Thread):
    def __init__(self, q):
        Thread.__init__(self)
        self.q = q
    def run(self):
        while True:
            # Get some work from the queue
            work_todo = self.q.get()
            # This function will simulate I/O lag
            do_something_with_io_lag(work_todo)
            # Remove task from the queue
            self.q.task_done()
if __name__ == '__main__':
    def time_me(nmbr):
        number_of_worker_bees = nmbr
        worktodo = ['some input for work'] * 50
        # Create a queue
        q = Queue()
        # Fill with work
        [q.put(onework) for onework in worktodo]
        # Launch worker threads
        for _ in range(number_of_worker_bees):
            t = WorkerBee(q)
            t.start()
        # Block until queue is empty
        q.join()
    # Run this code in serial mode (just one worker)
    %time time_me(nmbr=1)
    # Wall time: 25 s
    # Basically 50 requests * 0.5 seconds IO lag
    # For me everything gets processed by bee number: 59
    # Run this code using multi-tasking (launch 50 workers)
    %time time_me(nmbr=50)
    # Wall time: 507 ms
    # Basically the 0.5 second IO lag + 0.07 seconds it took to launch them
    # Now everything gets processed by different bees
Is it asynchronous?
To me this code does not seem asynchronous because it is Figure 3 in my example diagram. The I/O call blocks the thread (although we don't feel it because they are blocked in parallel).
However, if this is the case I am confused why requests-futures is considered asynchronous since it is a wrapper around ThreadPoolExecutor:
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    future_to_url = {executor.submit(load_url, url, 10): url for url in get_urls()}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
Can this function run on just one thread?
Especially when compared to asyncio, which can run single-threaded:
There are only two ways to have a program on a single processor do
“more than one thing at a time.” Multi-threaded programming is the
simplest and most popular way to do it, but there is another very
different technique, that lets you have nearly all the advantages of
multi-threading, without actually using multiple threads. It’s really
only practical if your program is largely I/O bound. If your program
is processor bound, then pre-emptive scheduled threads are probably
what you really need. Network servers are rarely processor bound,
however.

First of all, one note: concurrent.futures.Future is not the same as asyncio.Future. Basically it's just an abstraction - an object, that allows you to refer to job result (or exception, which is also a result) in your program after you assigned a job, but before it is completed. It's similar to assigning common function's result to some variable.
Multithreading: Regarding your example, when using multiple threads you can say that your code is "asynchronous", as several operations are performed in different threads at the same time without waiting for each other to complete, and you can see it in the timing results. And you're right, your function is blocking because of sleep: it blocks the worker thread for the specified amount of time, but when you use several threads those threads are blocked in parallel. So if you had one job with sleep and another one without and ran multiple threads, the one without sleep would perform calculations while the other would sleep. When you use a single thread, the jobs are performed in a serial manner one after the other, so when one job sleeps the other jobs wait for it; actually they just don't exist until it's their turn. All this is pretty much proven by your time tests. What happened with print has to do with "thread safety", i.e. print uses standard output, which is a single shared resource. So when your multiple threads tried to print at the same time, the switching happened inside and you got your strange output. (This also shows the "asynchronicity" of your multithreaded example.) To prevent such errors there are locking mechanisms, e.g. locks, semaphores, etc.
Asyncio: To better understand the purpose note the "IO" part, it's not 'async computation', but 'async input/output'. When talking about asyncio you usually don't think about threads at first. Asyncio is about event loop and generators (coroutines). The event loop is the arbiter, that governs the execution of coroutines (and their callbacks), that were registered to the loop. Coroutines are implemented as generators, i.e. functions that allow to perform some actions iteratively, saving state at each iteration and 'returning', and on the next call continuing with the saved state. So basically the event loop is while True: loop, that calls all coroutines/generators, assigned to it, one after another, and they provide result or no-result on each such call - this provides possibility for "asynchronicity". (A simplification, as there's scheduling mechanisms, that optimize this behavior.) The event loop in this situation can run in single thread and if coroutines are non-blocking it will give you true "asynchronicity", but if they are blocking then it's basically a linear execution.
You can achieve the same thing with explicit multithreading, but threads are costly - they require memory to be assigned, switching them takes time, etc. On the other hand, the asyncio API allows you to abstract from the actual implementation and just consider your jobs to be performed asynchronously. Its implementation may differ: it includes calling the OS API, and the OS decides what to do, e.g. DMA, additional threads, some specific microcontroller use, etc. The thing is, it works well for IO due to lower-level mechanisms, hardware stuff. On the other hand, performing computation will require explicitly breaking the computation algorithm into pieces to use as asyncio coroutines, so a separate thread might be a better decision, as you can launch the whole computation as one there. (I'm not talking about algorithms that are special to parallel computing.) But the asyncio event loop might be explicitly set to use separate threads for coroutines, so this will be asyncio with multithreading.
Regarding your example, if you implement your function with sleep as an asyncio coroutine, then schedule and run 50 of them single-threaded, you'll get a time similar to the first time test, i.e. around 25s, as it is blocking. If you change it to something like yield from asyncio.sleep(0.5) (which is a coroutine itself), then schedule and run 50 of them single-threaded, it will be called asynchronously. So while one coroutine sleeps the other will be started, and so on. The jobs will complete in a time similar to your second multithreaded test, i.e. close to 0.5s. If you add print here you'll get good output, as it will be used by a single thread in a serial manner, but the output might be in a different order than the order of coroutine assignment to the loop, as coroutines could be run in a different order. If you use multiple threads, then the result will obviously be close to the last one anyway.
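The non-blocking variant can be sketched as follows. This uses the newer async/await syntax and asyncio.run (Python 3.7+) in place of yield from; the coroutine name bee is invented for the example.

```python
import asyncio
import time


async def bee(number):
    await asyncio.sleep(0.5)      # non-blocking: yields to the event loop
    return number


async def main():
    # Schedule 50 coroutines on one thread and wait for all of them.
    return await asyncio.gather(*(bee(n) for n in range(50)))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(len(results), f"{elapsed:.2f}s")
```

All 50 sleeps overlap on a single thread, so the total should be close to 0.5 s. Replacing asyncio.sleep with the blocking time.sleep would serialize them again, giving roughly 25 s.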
Simplification: The difference between multithreading and asyncio is in blocking/non-blocking, so basically blocking multithreading will come somewhat close to non-blocking asyncio, but there are a lot of differences.
Multithreading for computations (i.e. CPU bound code)
Asyncio for input/output (i.e. I/O bound code)
Regarding your original statement:
all asynchronous code is multi-threaded
all multi-threaded functions are asynchronous
I hope that I was able to show, that:
asynchronous code might be both single threaded and multi-threaded
all multi-threaded functions could be called "asynchronous"

I think the main confusion comes from the meaning of asynchronous. From the Free Online Dictionary of Computing, "A process [...] whose execution can proceed independently" is asynchronous. Now, apply that to what your bees do:
Retrieve an item from the queue. Only one at a time can do that, while the order in which they get an item is undefined. I wouldn't call that asynchronous.
Sleep. Each bee does so independently of all others, i.e. the sleep duration runs on all, otherwise the time wouldn't go down with multiple bees. I'd call that asynchronous.
Call print(). While the calls are independent, at some point the data is funneled into the same output target, and at that point a sequence is enforced. I wouldn't call that asynchronous. Note however that the two arguments to print() and also the trailing newline are handled independently, which is why they can be interleaved.
Lastly, the call to q.join(). Here of course the calling thread is blocked until the queue is empty, so some kind of synchronization is enforced and wanted. I don't see why this "seems to break" for you.

How Do I Queue My Python Locks?

Is there a way to make Python locks queued? I have been assuming thus far in my code that threading.Lock operates on a queue. It looks like it just gives the lock to a random locker. This is bad for me, because the program (a game) I'm working on is highly dependent on getting messages in the right order. Are there queued locks in Python? If so, how much will I lose on processing time?

I wholly agree with the comments claiming that you're probably thinking about this in an unfruitful way. Locks provide serialization, and aren't at all intended to provide ordering. The bog-standard, easy, and reliable way to enforce an order is to use a Queue.Queue.
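A minimal sketch of the queue-based approach, written for Python 3 (where the module is named queue). The STOP sentinel and the message strings are made up for the example. A single consumer reads from a FIFO queue, so messages are handled in exactly the order they were put.

```python
import queue
import threading

messages = queue.Queue()      # FIFO: get() returns items in put() order
handled = []
STOP = object()               # hypothetical sentinel marking end of input


def consumer():
    while True:
        msg = messages.get()
        if msg is STOP:
            break
        handled.append(msg)   # only this thread touches `handled`


t = threading.Thread(target=consumer)
t.start()
for n in range(10):
    messages.put(f"msg-{n}")
messages.put(STOP)
t.join()
print(handled)
```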
CPython leaves it up to the operating system to decide in which order locks are acquired. On most systems, that will appear to be more-or-less "random". That cannot be changed.
That said, I'll show a way to implement a "FIFO lock". It's neither hard nor easy - somewhere in between - and you shouldn't use it ;-) I'm afraid only you can answer your "how much will I lose on processing time?" question - we have no idea how heavily you use locks, or how much lock contention your application provokes. You can get a rough feel by studying this code, though.
import threading, collections
class QLock:
    def __init__(self):
        self.lock = threading.Lock()
        self.waiters = collections.deque()
        self.count = 0
    def acquire(self):
        self.lock.acquire()
        if self.count:
            new_lock = threading.Lock()
            new_lock.acquire()
            self.waiters.append(new_lock)
            self.lock.release()
            new_lock.acquire()
            self.lock.acquire()
        self.count += 1
        self.lock.release()
    def release(self):
        with self.lock:
            if not self.count:
                raise ValueError("lock not acquired")
            self.count -= 1
            if self.waiters:
                self.waiters.popleft().release()
    def locked(self):
        return self.count > 0
Here's a little test driver, which can be changed in the obvious way to use either this QLock or a threading.Lock:
def work(name):
    qlock.acquire()
    acqorder.append(name)
from time import sleep
if 0:
    qlock = threading.Lock()
else:
    qlock = QLock()
qlock.acquire()
acqorder = []
ts = []
for name in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    t = threading.Thread(target=work, args=(name,))
    t.start()
    ts.append(t)
    sleep(0.1) # probably enough time for .acquire() to run
for t in ts:
    while not qlock.locked():
        sleep(0) # yield time slice
    qlock.release()
for t in ts:
    t.join()
assert qlock.locked()
qlock.release()
assert not qlock.locked()
print("".join(acqorder))
On my box just now, 3 runs using threading.Lock produced this output:
BACDEFGHIJKLMNOPQRSTUVWXYZ
ABCDEFGHIJKLMNOPQRSUVWXYZT
ABCEDFGHIJKLMNOPQRSTUVWXYZ
So it's certainly not random, but neither is it wholly predictable. Running it with the QLock instead, the output should always be:
ABCDEFGHIJKLMNOPQRSTUVWXYZ

I stumbled upon this post because I had a similar requirement. Or at least I thought so.
My fear was that if the locks didn't release in FIFO order, thread starvation would be likely to happen, and that would be terrible for my software.
After reading a bit, I dismissed my fears and realized what everyone was saying: if you want this, you're doing it wrong. Also, I was convinced that you can rely on the OS to do its job and not let your thread starve.
To get to that point, I did a bit of digging to understand better how locks work under Linux. I started by taking a look at the pthreads (POSIX Threads) glibc source code and specifications, because I was working in C++ on Linux. I don't know if Python uses pthreads under the hood, but I'm going to assume it probably does.
I didn't find any specification, on the multiple pthreads references around, relating to the order of unlocks.
What I found is: locks in pthreads on Linux are implemented using a kernel feature called futex.
http://man7.org/linux/man-pages/man2/futex.2.html
http://man7.org/linux/man-pages/man7/futex.7.html
A link on the references of the first of these pages leads you to this PDF:
https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf
It explains a bit about unlocking strategies, and about how futexes work and are implemented in the Linux kernel, and a lot lot more.
And there I found what I wanted. It explains that futexes are implemented in the kernel in such a way that unlocks are mostly done in FIFO order (to increase fairness). However, that is not guaranteed, and it is possible that one thread might jump the line once in a while. They allow this so as not to complicate the code too much, and to keep the good performance they achieved rather than lose it to extreme measures for enforcing FIFO order.
So basically, what you have is:
The POSIX standard doesn't impose any requirement as to the order of locking and unlocking of mutexes. Any implementation is free to do as they want, so if you rely on this order, your code won't be portable (not even between different versions of the same platform).
The Linux implementation of the pthreads library relies on a feature/technique called futex to implement mutexes, and it mostly tries to do FIFO-style unlocking of mutexes, but it is not guaranteed that it will be done in that order.

Yes you could create a FIFO queue using a list of the thread IDs:
FIFO = [5,79,3,2,78,1,9...]
You would try to acquire the lock and if you can't, then push the attempting thread's ID (FIFO.insert(0,threadID)) onto the front of the queue and each time you release the lock, make sure that if a thread wants to acquire the lock it must have the thread ID at the end of the queue (threadID == FIFO[-1]). If the thread does have the thread ID at the end of the queue, then let it acquire the lock and then pop it off (FIFO.pop()). Repeat as necessary.
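One way the idea above could be sketched in Python 3 is shown below. This is an illustration, not a tested production lock. The class name FifoLock is hypothetical, and it uses a collections.deque with append/popleft (oldest waiter at the left) in place of list.insert(0, ...) and pop(), which gives the same first-in, first-out order. A threading.Condition guards the waiter list.

```python
import collections
import threading
import time


class FifoLock:
    """Sketch of a lock that serves waiters in arrival order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._waiting = collections.deque()   # thread idents, oldest first
        self._held = False

    def acquire(self):
        me = threading.get_ident()
        with self._cond:
            self._waiting.append(me)
            # Wait until the lock is free and this thread is at the head.
            while self._held or self._waiting[0] != me:
                self._cond.wait()
            self._waiting.popleft()
            self._held = True

    def release(self):
        with self._cond:
            self._held = False
            self._cond.notify_all()   # wake waiters; only the head proceeds


lock = FifoLock()
order = []


def work(name):
    lock.acquire()
    order.append(name)
    lock.release()


lock.acquire()                        # hold the lock so the workers queue up
threads = []
for name in "ABCDE":
    t = threading.Thread(target=work, args=(name,))
    t.start()
    threads.append(t)
    time.sleep(0.05)                  # give each thread time to enqueue
lock.release()
for t in threads:
    t.join()
print("".join(order))
```

The demo relies on the short sleep to make the threads arrive in alphabetical order. Like a plain threading.Lock, this sketch is not re-entrant: a thread that calls acquire() twice without releasing will block itself.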
