Why is this python code not thread-safe? - python

I handed this in as part of a school assignment, and the person who marked it mentioned that this section was not thread-safe.
The assignment was to create a multithreaded socket server in python that accepted a number and returned the Fibonacci value of that number. My approach was to memoize the calculations by sharing a dictionary between each of the threads.
Here is the code (with error handling and such removed for brevity):
from socketserver import ThreadingMixIn, TCPServer, BaseRequestHandler

class FibonacciThreadedTCPServer(ThreadingMixIn, TCPServer):
    def __init__(self, server_address):
        TCPServer.__init__(self, server_address, FibonacciThreadedTCPRequestHandler, bind_and_activate=True)
        # this dictionary will be shared between all Request handlers
        self.fib_dict = {0: 0, 1: 1, 2: 1}

class FibonacciThreadedTCPRequestHandler(BaseRequestHandler):
    def handle(self):
        data = self.request.recv(1024).strip()
        num = int(data)
        result = self.calc_fib(self.server.fib_dict, num)
        ret = bytes(str(result) + '\n', 'ascii')
        self.request.sendall(ret)

    @staticmethod
    def calc_fib(fib_dict, n):
        """
        Calculates the fibonacci value of n using a shared lookup table and a linear calculation.
        """
        length = len(fib_dict)
        while length <= n:
            fib_dict[length] = fib_dict[length - 1] + fib_dict[length - 2]
            length = len(fib_dict)
        return fib_dict[n]
I understand that there are reads and writes occurring simultaneously in the calc_fib method, and normally that means the code is not thread-safe. However, in this instance I think it's possible to prove that the code will always produce predictable results.
Is the fact that reads and writes can occur simultaneously enough for it not to be considered thread-safe? Or is something considered thread-safe as long as it always returns a reliable result?
Why I think this code will always produce reliable results:
A read will never occur on any given index in the dictionary until a write has occurred there.
Any subsequent writes to any given index will contain the same number as the previous writes, so regardless of when the read/write sequence occurs it will always receive the same data.
I have tested this by adding random sleeps between each operation and making requests with several hundred threads simultaneously, and the right answer was always returned during my tests.
Any thoughts or criticism would be appreciated. Thanks.

In this particular case, the GIL should keep your code safe, because:
- CPython built-in data structures are protected from actual damage (as opposed to merely incorrect behavior) by the GIL (dict in particular needs this guarantee, because class instances and non-local scope usually use dicts for attribute/name lookup, and without the GIL, simply reading the values would be fraught with danger)
- You're updating a cached value of the length and then using it for the next set of operations instead of rechecking the length during the mutation; this can lead to repeated work (multiple threads see the old length and compute the new value repeatedly), but since the key is always set to the same value, it doesn't matter if they each set it independently
- You never delete from your cache (if you did, that cached length would bite you)
So in CPython, this should be fine. I can't make any guarantees for other Python interpreters though; without the GIL, if they implement their dict without internal locking, it's wholly possible a rehash operation triggered by a write in one thread could cause another thread to read from a dict in an inconsistent/unusable state.
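To make the "repeated work, same value" point concrete, here is a small self-contained sketch (my own illustration, not the assignment code) that records every write so you can see duplicate work happen while the result stays correct:
import threading

fib_dict = {0: 0, 1: 1, 2: 1}
writes = []  # (thread name, index) for every write, purely for illustration

def calc_fib(n):
    length = len(fib_dict)
    while length <= n:
        fib_dict[length] = fib_dict[length - 1] + fib_dict[length - 2]
        writes.append((threading.current_thread().name, length))
        length = len(fib_dict)
    return fib_dict[n]

threads = [threading.Thread(target=calc_fib, args=(300,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(fib_dict[300])   # always the correct value in CPython
print(len(writes))     # may exceed 298: some indices were written more than once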

First of all, why do you think that dictionaries are thread-safe? I did a quick search of the Python 3 documentation (I'm a Python novice myself), and I cannot find any reassurance that two unsynchronized threads can safely update the same dictionary without corrupting its internals and maybe crashing the program.
I've been writing multi-threaded code in other languages since 1980-something, and I've learned never to trust that something is thread safe just because it acts that way when I test it. I want to see documentation that it's supposed to be thread safe. Otherwise, I throw a mutex around it.
Second of all, you are assuming that fib_dict[length - 1] and fib_dict[length - 2] will be valid. My experience with other programming languages says not to assume it. In other programming languages, (e.g., Java), when threads share data without synchronization, it is possible for one thread to see variable updates happen in a different order from the order in which some other thread performed them. E.g., it is theoretically possible for a Java thread that accesses a Map with no synchronization to see the size() of the Map increase before it sees the new values actually appear in the map. I will assume that something similar could happen in Python until somebody shows me official documentation that says otherwise.
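In that spirit (a mutex around anything not documented thread-safe), a minimal sketch of the question's calculation with a lock around it might look like this; I've written it as a plain module-level function rather than the original class layout, and the lock is my own addition:
import threading

fib_lock = threading.Lock()          # one lock shared by all threads
fib_dict = {0: 0, 1: 1, 2: 1}        # shared memo table

def calc_fib(n):
    """Same linear fill as the question's method, serialized by a mutex."""
    with fib_lock:
        length = len(fib_dict)
        while length <= n:
            fib_dict[length] = fib_dict[length - 1] + fib_dict[length - 2]
            length += 1
        return fib_dict[n]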


What happens when a variable is read and AT THE SAME TIME another thread MODIFIES IT?

If the read happens BEFORE or AFTER the update, the behavior is obvious; it is not difficult to see. But what if it happens at the same time? Assume there is nothing else (no synchronization, just a simple variable read and assignment).
from threading import Thread

value = 0

def thread1():
    global value
    value = 5

def thread2():
    global value
    # What if I want to read this variable and AT THE SAME TIME another
    # thread wants to update it?
    print(value)
    # What if I want to update this variable and AT THE SAME TIME another
    # thread wants to update it too?
    # value = 10

if __name__ == "__main__":
    thread1 = Thread(target=thread1)
    thread2 = Thread(target=thread2)
    thread1.start()
    thread2.start()
Computer hardware and computer languages are intentionally designed so that we never have to answer the question "what if X and Y happen at the same time?"* In order to be able to make sense of any computer program—in order to predict how it will behave—we have to be able to assume that the outcome of the program will be the same as if all of the memory writes and all of the memory reads performed by the program's threads happened one-by-one in some order.
Which of the astronomical number of possible orderings are allowed and which are not allowed is defined by a consistency model. There are several well known models that may be enforced by the hardware and software, but understanding how they are different from each other and which ones are appropriate in which situations is a deeper subject than I am prepared to talk about.
My point is, though: no single read or write to a computer's memory ever happens "at the same time" as another. Reading or writing a Python variable practically always is a single, primitive memory operation. In any language, within any single process, if one thread performs a primitive memory operation X and another thread performs a primitive memory operation Y, then the end result will either be the same as if X happened before Y, or the same as if Y happened before X. There is no other possibility.
* When we're talking about sequences of operations, that's a whole different story. We sometimes say things like, "method calls A and B happened at the same time," but the formal way to say that is to say that the method calls were overlapped. Two sequences of operations are overlapped if there is any moment in time when both of them have started, but neither one of them has finished.
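To illustrate that footnote: single reads and writes are never simultaneous, but an overlapped read-modify-write sequence can still lose updates. A sketch (my own example, not from the question):
import threading

counter = 0

def bump():
    global counter
    for _ in range(100000):
        counter += 1   # a read, an add, and a write: three steps, not one

threads = [threading.Thread(target=bump) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every individual read and write still happened one-by-one, but two threads
# can both read the same old value and both write back old + 1, losing updates.
print(counter)   # may be less than 400000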

Python dict.get() Lock

When I use the dictionary.get() function, does it lock the whole dictionary? I am developing a multiprocess and multithreaded program. The dictionary is used as a state table to keep track of data. I have to impose a size limit on the dictionary, so whenever the limit is hit, I have to do garbage collection on the table based on the timestamp. The current implementation delays the add operation while garbage collection iterates through the whole table.
I will have two or more threads, one just to add data and one just to do garbage collection. Performance is critical in my program for handling streaming data. My program receives streaming data, and whenever it receives a message, it has to look it up in the state table, then add the record if it doesn't exist, or copy certain information and send it along the pipe.
I have thought of using multiprocessing to do the search and add operations concurrently, but if I used processes, I would have to make a copy of the state table for each process, and in that case the performance overhead for synchronization is too high. I also read that multiprocessing.Manager().dict() locks access for each CRUD operation. I could not spare the overhead, so my current approach is threading.
So my question is: while one thread is doing .get() or del dict['key'] operations on the table, will the other insertion thread be blocked from accessing it?
Note: I have read through most of SO's Python dictionary-related posts, but I cannot seem to find the answer. Most people only answer that even though Python dictionary operations are atomic, it is safer to use a Lock for insertion/update. I'm handling a huge amount of streaming data, so locking every time is not ideal for me. Please advise if there is a better approach.
If the process of hashing or comparing the keys in your dictionary could invoke arbitrary Python code (basically, if the keys aren't all Python built-in types implemented in C, e.g. str, int, float, etc.), then yes, it would be possible for a race condition to occur in which the GIL is released while a bucket collision is being resolved (during the equality test), and another thread could leap in and cause the object being compared against to disappear from the dict. They try to ensure it doesn't actually crash the interpreter, but it has been a source of errors in the past.
If that's a possibility (or you're on a non-CPython interpreter, where there is no GIL providing basic guarantees like this), then you should really use a lock to coordinate access. On CPython, as long as you're on modern Python 3, the cost will be fairly low; contention on the lock should be fairly low since the GIL ensures only one thread is actually running at once; most of the time your lock should be uncontended (because the contention is on the GIL), so the incremental cost of using it should be fairly small.
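For the state table described in the question, a minimal locked wrapper might look like the following sketch (the class and method names are my own, not from the question):
import threading

class StateTable:
    """A dict guarded by one lock; all CRUD operations go through the lock."""
    def __init__(self):
        self._table = {}
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            return self._table.get(key, default)

    def put(self, key, value):
        with self._lock:
            self._table[key] = value

    def delete(self, key):
        with self._lock:
            self._table.pop(key, None)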
A note: You might consider using collections.OrderedDict to simplify the process of limiting the size of your table. With OrderedDict, you can implement the size limit as a strict LRU (least-recently used) system by making additions to the table done as:
with lock:
    try:
        try:
            odict.move_to_end(key)  # If key already existed, make sure it's "renewed"
        finally:
            odict[key] = value  # set new value whether or not key already existed
    except KeyError:
        # move_to_end raising KeyError means a newly added key, so we might
        # have grown larger than the limit
        if len(odict) > maxsize:
            odict.popitem(False)  # Pops oldest item
and usage done as:
with lock:
    # move_to_end optional; if using the key means it should live longer, do it;
    # if only setting the key should refresh it, omit move_to_end
    odict.move_to_end(key)
    return odict[key]
This does need a lock, but it also reduces the work for garbage collection when it grows too large from "check every key" (O(n) work) to "pop the oldest item off without looking at anything else" (O(1) work).
A lock is used to avoid race conditions, so that no two threads can change the dict at the same time. It is advisable to use a lock; otherwise you might run into a race condition that causes the program to fail. A mutex lock is enough to coordinate the two threads.

Python Multithreading a for loop with limited Threads

I am just learning Python and don't have much experience with multithreading. I am trying to send some JSON via the Requests session.post method. This is called in the function at the bottom of the many for loops I need to run through the dictionary.
Is there a way to let this run in paralell?
I also have to limit my number of threads; otherwise the post calls get blocked because they come too fast after each other. Help would be much appreciated.
def doWork(session, List, RefHashList):
    for itemRefHash in RefHashList:
        for equipment in res['Response']['data']['items']:
            if equipment['itemHash'] == itemRefHash:
                if equipment['characterIndex'] != 0:
                    SendJsonViaSession(session, getCharacterIdFromIndex(res, equipment['characterIndex']), itemRefHash, equipment['quantity'])
First, structuring your code differently might improve the speed without the added complexity of threading.
def doWork(session, res, RefHashList):
    for equipment in res['Response']['data']['items']:
        i = equipment['itemHash']
        k = equipment['characterIndex']
        if i in RefHashList and k != 0:
            SendJsonViaSession(session, getCharacterIdFromIndex(res, k), i, equipment['quantity'])
To start with, we will look up equipment['itemHash'] and equipment['characterIndex'] only once.
Instead of explicitly looping over RefHashList, you could use the in operator. This moves the loop into the Python virtual machine, which is faster.
And instead of a nested if-conditional, you could use a single conditional using and.
Note: I have removed the unused parameter List, and replaced it with res. It is generally good practice to write functions that only act on parameters that they are given, not global variables.
Second, how much extra performance do you need? How much time is there on average between the SendJsonViaSession calls, and how small can this time become before calls get blocked? If the difference between those numbers is small, it is probably not worth implementing a threaded sender.
Third, a design feature of the standard Python implementation is that only one thread at a time can be executing Python bytecode. So it is not certain that threading will improve performance.
Edit:
There are several ways to run stuff in parallel in Python. There is multiprocessing.Pool which uses processes, and multiprocessing.dummy.ThreadPool which uses threads. And from Python 3.2 onwards there is concurrent.futures, which can use processes or threads.
The thing is, neither of them has rate limiting. So you could get blocked for making too many calls.
Every time you call SendJsonViaSession you'd have to save the current time somehow so that all processes or threads can use it. And before every call, you would have to read that time and wait if it is too close to the last call.
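As a sketch of that idea (everything here is illustrative: MIN_GAP, the stub sender, and the pool size are all assumptions, not values from the question), a shared timestamp guarded by a lock can enforce a minimum gap between calls across a limited pool of threads:
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MIN_GAP = 0.5            # assumed minimum gap between calls, in seconds
_last_call = 0.0
_rate_lock = threading.Lock()

def SendJsonViaSession(session, *args):   # stand-in for the question's sender
    print(time.monotonic(), args)

def rate_limited_send(session, *args):
    global _last_call
    with _rate_lock:                      # one thread at a time consults the clock
        wait = _last_call + MIN_GAP - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        _last_call = time.monotonic()
    SendJsonViaSession(session, *args)

# Usage with a limited pool of worker threads:
with ThreadPoolExecutor(max_workers=4) as pool:
    for job in [("a",), ("b",), ("c",)]:
        pool.submit(rate_limited_send, None, *job)
Sleeping while holding the lock is deliberate here: it makes waiting threads take turns, which is exactly the rate limit we want.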
Edit2:
If a call to SendJsonViaSession only takes 0.3 seconds, you should be able to do 3 calls/second sequentially. But your code only does 1 call/second. This implies that the speed restriction is somewhere else. You'd have to profile your code to see where the problem lies.
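For the profiling step, a quick sketch using the standard library profiler (doWork, session, res and RefHashList are the question's names):
import cProfile
import pstats

# Profile one full run and print the ten most expensive calls.
cProfile.run("doWork(session, res, RefHashList)", "dowork.prof")
pstats.Stats("dowork.prof").sort_stats("cumulative").print_stats(10)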

Will the collections.deque "pop" methods release GIL?

I have a piece of code with a processing thread and a monitor thread. In the processing thread, I have a call to the collections.deque.popleft function. I wanted to know whether this function releases the GIL, because I want to run my monitor thread even when the processing function is blocked on the popleft call.
Instead of answering this specific question I'll answer a different question:
What is the Global Interpreter Lock (GIL), and when will it block my program?
In short, the GIL protects the interpreter's state from becoming corrupted by concurrent threads.
For a sense of what it is for, consider the low-level implementation of dict, which somewhere has an array of keys, organized for quick lookup. When you write some code like:
myDict['foo'] = 'bar'
the Python interpreter needs to adjust its collection of keys. That might involve making more room for the additional key as well as adding the particular key to that array.
If multiple concurrent threads are modifying that dict, then one thread might reallocate the array while another is in the middle of modifying it, which could cause unpredictable, probably bad behavior (anything from corrupted data to a segfault, a Heartbleed-like leak of sensitive memory contents, or arbitrary code execution).
Since that's not the sort of state you can reasonably describe or prevent at the level of your Python application, the runtime goes to great lengths to prevent those sorts of problems from occurring. The way it does this is that certain parts of the interpreter, such as the modification of a dict, are surrounded by a PyGILState_Ensure()/PyGILState_Release() pair, so that critical operations always reach a consistent state.
Note however that the scope of this lock is very narrow; it doesn't attempt to protect against general data races. It won't protect you from writing a program in which multiple threads overwrite each other's work in a common container (say, a collections.deque); it only ensures that even if you do write such a program, it won't cause the interpreter to crash: you'll always have a valid, working deque. You can add additional application locks, as in queue.Queue, to give good concurrent semantics to your application.
Since every operation that the GIL protects is a change in the interpreter state, it never blocks on external events; since those events won't cause the interpreter state to be changed, a signaling condition variable cannot corrupt memory.
The only time you might have a problem is when you have several unblocked threads, since they are potentially all executing code in the low level interpreter, they'll compete for the GIL, and only one thread can hold it, blocking other threads that also want to do some computation.
Unless you are writing C extensions, you probably don't need to worry about it, and unless you have multiple compute-bound threads in Python, you won't be affected by it either.
Yes -- deque is thread-safe (thanks @hemanths): http://docs.python.org/2/library/collections.html#collections.deque
No, because collections.deque is not thread-safe. Use a Queue, or make your own deque subclass.
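Whichever answer you side with on deque, the usual route to the asker's goal (a monitor thread that keeps running while the processing thread blocks) is queue.Queue: its blocking get waits on a condition variable and does not hold the GIL while waiting. A minimal sketch (names and timings are my own):
import queue
import threading
import time

work = queue.Queue()

def processing():
    while True:
        item = work.get()          # blocks; the GIL is released while waiting
        if item is None:           # sentinel: time to shut down
            break
        print("processing", item)

def monitor():
    for _ in range(3):
        print("queue depth:", work.qsize())
        time.sleep(0.5)

threading.Thread(target=processing).start()
threading.Thread(target=monitor).start()
for i in range(5):
    work.put(i)
work.put(None)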

python threadsafe object cache

I have implemented a Python webserver. Each HTTP request spawns a new thread.
I have a requirement to cache objects in memory, and since it's a webserver, I want the cache to be thread-safe. Is there a standard implementation of a thread-safe object cache in Python? I found the following:
http://freshmeat.net/projects/lrucache/
This does not look to be thread-safe. Can anybody point me to a good implementation of a thread-safe cache in Python?
Thanks!
Well a lot of operations in Python are thread-safe by default, so a standard dictionary should be ok (at least in certain respects). This is mostly due to the GIL, which will help avoid some of the more serious threading issues.
There's a list here: http://coreygoldberg.blogspot.com/2008/09/python-thread-synchronization-and.html that might be useful.
The atomic nature of those operations just means that you won't have an entirely inconsistent state if two threads access a dictionary at the same time, so you won't get a corrupted value. However, you would (as with most multi-threaded programming) not be able to rely on the specific order of those atomic operations.
So to cut a long story short...
If you have fairly simple requirements and aren't too bothered about the ordering of what gets written into the cache, then you can use a dictionary and know that you'll always get a consistent/not-corrupted value (it just might be out of date).
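One idiom for that simple case is dict.setdefault: under CPython's GIL, the "insert if missing, return the stored value" step behaves atomically on a plain dict, though the value expression may still be evaluated more than once under a race. A sketch (build_page is a made-up stand-in):
def build_page(key):               # hypothetical expensive constructor
    return "<html>%s</html>" % key

cache = {}

def get_page(key):
    # Two threads may both build the value, but only one copy is stored,
    # and both callers get that same stored copy back.
    return cache.setdefault(key, build_page(key))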
If you want to ensure that things are a bit more consistent with regard to reading and writing then you might want to look at Django's local memory cache:
http://code.djangoproject.com/browser/django/trunk/django/core/cache/backends/locmem.py
Which uses a read/write lock for locking.
A thread per request is often a bad idea. If your server experiences huge spikes in load, it will bring the box to its knees. Consider using a thread pool that can grow to a limited size during peak usage and shrink to a smaller size when load is light.
Point 1. The GIL does not help you here. An example of a (non-thread-safe) cache for something called "stubs" would be:
stubs = {}

def maybe_new_stub(host):
    """ returns stub from cache and populates the stubs cache if new is created """
    if host not in stubs:
        stub = create_new_stub_for_host(host)
        stubs[host] = stub
    return stubs[host]
What can happen is that Thread 1 calls maybe_new_stub('localhost'), and it discovers we do not have that key in the cache yet. Now we switch to Thread 2, which calls the same maybe_new_stub('localhost'), and it also learns the key is not present. Consequently, both threads call create_new_stub_for_host and put it into the cache.
The map itself is protected by the GIL, so we cannot break it by concurrent access. The logic of the cache, however, is not protected, and so we may end up creating two or more stubs, and dropping all except one on the floor.
Point 2. Depending on the nature of the program, you may not want a global cache. Such a shared cache forces synchronization between all your threads. For performance reasons, it is good to make the threads as independent as possible. I believe I do need one; you may actually not.
Point 3. You may use a simple lock. I took inspiration from https://codereview.stackexchange.com/questions/160277/implementing-a-thread-safe-lrucache and came up with the following, which I believe is safe to use for my purposes:
import threading

stubs = {}
lock = threading.Lock()

def maybe_new_stub(host):
    """ returns stub from cache and populates the stubs cache if new is created """
    with lock:
        if host not in stubs:
            channel = grpc.insecure_channel('%s:6666' % host)
            stub = cli_pb2_grpc.BrkStub(channel)
            stubs[host] = stub
        return stubs[host]
Point 4. It would be best to use an existing library. I haven't found any I am prepared to vouch for yet.
You probably want to use memcached instead. It's very fast, very stable, very popular, has good python libraries, and will allow you to grow to a distributed cache should you need to:
http://www.danga.com/memcached/
I'm not sure any of these answers are doing what you want.
I have a similar problem, and I'm using a drop-in replacement for lrucache called cachetools, which allows you to pass in a lock to make it a bit safer.
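For completeness, a sketch of that approach (assuming a current cachetools release; expensive_load is a made-up stand-in):
import threading
import cachetools

def expensive_load(key):               # hypothetical expensive loader
    return {"key": key}

cache = cachetools.LRUCache(maxsize=1024)
lock = threading.Lock()

@cachetools.cached(cache, lock=lock)   # cache lookups/stores happen under the lock
def get_object(key):
    return expensive_load(key)
Note that the lock guards the cache itself, not the decorated function, so under a race the value may still be computed more than once; only one result ends up cached.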
For a thread safe object you want threading.local:
from threading import local
safe = local()
safe.cache = {}
You can then put and retrieve objects in safe.cache with thread safety.
