Are Python built-in containers thread-safe? - python

I would like to know if the Python built-in containers (list, vector, set...) are thread-safe? Or do I need to implement a locking/unlocking environment for my shared variable?

You need to implement your own locking for all shared variables that will be modified in Python. You don't have to worry about reading from the variables that won't be modified (ie, concurrent reads are ok), so immutable types (frozenset, tuple, str) are probably safe, but it wouldn't hurt. For things you're going to be changing - list, set, dict, and most other objects, you should have your own locking mechanism (while in-place operations are ok on most of these, threads can lead to super-nasty bugs - you might as well implement locking, it's pretty easy).
By the way, I don't know if you know this, but locking is very easy in Python - create a threading.lock object, and then you can acquire/release it like this:
import threading
list1Lock = threading.Lock()
with list1Lock:
# change or read from the list here
# continue doing other stuff (the lock is released when you leave the with block)
In Python 2.5, do from __future__ import with_statement; Python 2.4 and before don't have this, so you'll want to put the acquire()/release() calls in try:...finally: blocks:
import threading
list1Lock = threading.Lock()
try:
list1Lock.acquire()
# change or read from the list here
finally:
list1Lock.release()
# continue doing other stuff (the lock is released when you leave the with block)
Some very good information about thread synchronization in Python.

Yes, but you still need to be careful of course
For example:
If two threads are racing to pop() from a list with only one item, One thread will get the item successfully and the other will get an IndexError
Code like this is not thread-safe
if L:
item=L.pop() # L might be empty by the time this line gets executed
You should write it like this
try:
item=L.pop()
except IndexError:
# No items left

They are thread-safe as long as you don't disable the GIL in C code for the thread.

The queue module implements multi-producer, multi-consumer queues. It is especially useful in threaded programming when information must be exchanged safely between multiple threads. The Queue class in this module implements all the required locking semantics.
https://docs.python.org/3/library/queue.html

Related

Python multithreading program is giving unexpected output

I know that there is no guarantee regarding the order of execution for the threads. But my doubt is when I ran below code,
import threading
def doSomething():
print("Hello ")
d = threading.Thread(target=doSomething, args=())
d.start()
print("done")
Output that is coming is either
Hello done
or this
Hello
done
May be if I try too much then it might give me below as well
done
Hello
But I am not convinced with the first output. Since order can be different but how come both outputs are available in the same line. Does that means that one thread is messing up with other threads working?
This is a classic race condition. I can't personally reproduce it, and it would likely vary by interpreter implementation and the precise configuration applied to stdout. On Python interpreters without a GIL, there is basically no protection against races, and this behavior is expected to a certain extent. Python interpreters do tend to try to protect you from egregious data corruption due to threading, unlike C/C++, but even if they ensure every byte written ends up actually printed, they usually wouldn't try to make explicit guarantees against interleaving; Hdelolnoe would be a possible (if fairly unlikely given likely implementations) output when you're making no effort whatsoever to synchronize access to stdout.
On CPython, the GIL protects you more, and writing a single string to stdout is more likely to be atomic, but you're not writing a single string. Essentially, the implementation of print is to write objects one by one to the output file object as it goes, it doesn't batch up to a single string then call write just once. What this means is that:
print("Hello ") # Implicitly outputs default end argument of '\n' after printing provided args
is roughly equivalent to:
sys.stdout.write("Hello ")
sys.stdout.write("\n")
If the underlying stack of file objects that implements sys.stdout decides to engage in real I/O in response to the first write, they'll release the GIL before performing the actual write, allowing the main thread to catch up and potentially grab the GIL before the worker thread is given a chance to write the newline. The main thread then outputs the done and then the newlines from each print come out in some unspecified (and irrelevant) order based on further potential races.
Assuming you're on CPython, you could probably fix this by changing the code to this equivalent code using single write calls:
import threading
import sys
def doSomething():
sys.stdout.write("Hello \n")
d = threading.Thread(target=doSomething) # If it takes no arguments, no need to pass args
d.start()
sys.stdout.write("done\n")
and you'd be back to a race condition that only swaps the order, without interleaving (the language spec wouldn't guarantee a thing, but most reasonable implementations would be atomic for this case). If you want it to work with any guarantees without relying on the quirks of the implementation, you have to synchronize:
import threading
lck = threading.Lock()
def doSomething():
with lck:
print("Hello ")
d = threading.Thread(target=doSomething)
d.start()
with lck:
print("done")

Is it right to append item into the same list by multi-thread without lock?

Hrere's the detail question:
I want use multi-thread way to do a batch-http-request work, then gather all these result into a list and sort all items.
So I want to define a empty list origin_list in main process first, and start some threads to just append result into this list after pass origin_list to ervery thread.
And It seemed that I got the expected results in then end, so I think I got the right result list finally without thread lock for the list is a mutable object, am I right?
My main codes are as below:
def do_request_work(final_item_list,request_url):
request_results = request.get(request_url).text
# do request work
finnal_item_list.append(request_results )
def do_sort_work(final_item_list):
# do sort work
return final_item_list
def main():
f_item_list = []
request_list = [url1, url2, ...]
with ThreadPoolExecutor(max_workers=20) as executor:
executor.map(
partial(
do_request_work,
f_item_list
),
request_list)
sorted_list = do_sort_work(f_item_list)
Any commentary is very welcome. great thanks.
I think, that this is a quite questionable solution even without taking thread safety into account.
First of all python has GIL, which
In CPython, the global interpreter lock, or GIL, is a mutex that
protects access to Python objects, preventing multiple threads from
executing Python bytecodes at once.
Thus, I doubt about much performance benefit here, even noting that
potentially blocking or long-running operations, such as I/O, image
processing, and NumPy number crunching, happen outside the GIL.
all python work will be executed one thread in a time.
From the other perspective, the same lock may help you with the thread safety here, so only one thread will modify final_item_list in a time, but I am not sure.
Anyway, I would use multiprocessing module here with integrated parallel map:
from multiprocessing import Pool
def do_request_work(request_url):
request_results = request.get(request_url).text
# do request work
return request_results
if __name__ == '__main__':
request_list = [url1, url2, ...]
with Pool(20) as p:
f_item_list = p.map(do_request_work, request_list)
Which will guarantee you parallel lock-free execution of requests, since every process will receive only their part of work and just return the result, when ready.
Look at this thread: I'm seeking advise on multi-tasking on Python36 platform, Procedure setup.
Relevant to python3.5+
Running Tasks Concurrently¶
awaitable asyncio.gather(*aws, loop=None, return_exceptions=False)
Run awaitable objects in the aws sequence concurrently.
I use this very often, just be aware that its not thread-safe, so do not change values inside, otherwise you will have use deepcopy.
Other things to look at:
https://github.com/kennethreitz/grequests
https://github.com/jreese/aiomultiprocess
aiohttp

how to safely pass a variables value between threads in python

I read on the python documentation that Queue.Queue() is a safe way of passing variables between different threads. I didn't really know that there was a safety issue with multithreading. For my application, I need to develop multiple objects with variables that can be accessed from multiple different threads. Right now I just have the threads accessing the object variables directly. I wont show my code here because there's way too much of it, but here is an example to demonstrate what I'm doing.
from threading import Thread
import time
import random
class switch:
def __init__(self,id):
self.id=id
self.is_on = False
def self.toggle():
self.is_on = not self.is_on
switches = []
for i in range(5):
switches[i] = switch(i)
def record_switch():
switch_record = {}
while True:
time.sleep(10)
current = {}
current['time'] = time.srftime(time.time())
for i in switches:
current[i.id] = i.is_on
switch_record.update(current)
def toggle_switch():
while True:
time.sleep(random.random()*100)
for i in switches:
i.toggle()
toggle = Thread(target=toggle_switch(), args = ())
record = Thread(target=record_switch(), args = ())
toggle.start()
record.start()
So as I understand, the queue object can be used only to put and get values, which clearly won't work for me. Is what I have here "safe"? If not, how can I program this so that I can safely access a variable from multiple different threads?
Whenever you have threads modifying a value other threads can see, then you are going to have safety issues. The worry is that a thread will try to modify a value when another thread is in the middle of modifying it, which has risky and undefined behavior. So no, your switch-toggling code is not safe.
The important thing to know is that changing the value of a variable is not guaranteed to be atomic. If an action is atomic, it means that action will always happen in one uninterrupted step. (This differs very slightly from the database definition.) Changing a variable value, especially a list value, can often times take multiple steps on the processor level. When you are working with threads, all of those steps are not guaranteed to happen all at once, before another thread starts working. It's entirely possible that thread A will be halfway through changing variable x when thread B suddenly takes over. Then if thread B tries to read variable x, it's not going to find a correct value. Even worse, if thread B tries to modify variable x while thread A is halfway through doing the same thing, bad things can happen. Whenever you have a variable whose value can change somehow, all accesses to it need to be made thread-safe.
If you're modifying variables instead of passing messages, you should be using aLockobject.
In your case, you'd have a global Lock object at the top:
from threading import Lock
switch_lock = Lock()
Then you would surround the critical piece of code with the acquire and release functions.
for i in switches:
switch_lock.acquire()
current[i.id] = i.is_on
switch_lock.release()
for i in switches:
switch_lock.acquire()
i.toggle()
switch_lock.release()
Only one thread may ever acquire a lock at a time (this kind of lock, anyway). When any of the other threads try, they'll be blocked and wait for the lock to become free again. So by putting locks around critical sections of code, you make it impossible for more than one thread to look at, or modify, a given switch at any time. You can put this around any bit of code you want to be kept exclusive to one thread at a time.
EDIT: as martineau pointed out, locks are integrated well with the with statement, if you're using a version of Python that has it. This has the added benefit of automatically unlocking if an exception happens. So instead of the above acquire and release system, you can just do this:
for i in switches:
with switch_lock:
i.toggle()

Multiple python threads writing to different records in same list simultaneously - is this ok?

I am trying to fix a bug where multiple threads are writing to a list in memory. Right now I have a thread lock and am occasionally running into problems that are related to the work being done in the threads.
I was hoping to simply make an hash of lists, one for each thread, and remove the thread lock. It seems like each thread could write to its own record without worrying about the others, but perhaps the fact that they are all using the same owning hash would itself be a problem.
Does anyone happen to know if this will work or not? If not, could I, for example, dynamically add a list to a package for each thread? Is that essentially the same thing?
I am far from a threading expert so any advice welcome.
Thanks,
import threading
def job(root_folder,my_list):
for current,files,dirs in os.walk(root):
my_list.extend(files)
time.sleep(1)
my_lists = [[],[],[]]
my_folders = ["C:\\Windows","C:\\Users","C:\\Temp"]
my_threads = []
for folder,a_list in zip(my_folders,my_lists):
my_threads.append(threading.Thread(target=job,args=(folder,a_list)
for thread in my_threads:
thread.start()
for thread in my_threads:
thread.join()
my_full_list = my_lists[0] + my_lists[1] + my_lists[2]
this way each thread just modifies its own list and at the end combines all the individual lists
also as pointed out this gives zero performance gain (actually probably slower than not threading it... ) you may get performance gains using multiprocessing instead ...
Don't use list. Use Queue (python2) or queue (python3).
There is 3 kinds of queue: fifo, lifo and priority. The last one is for ordered data.
You may put data at one side (with thread):
q.put(data)
And get at the other side (maybe in a loop for, say, database):
while not q.empty:
print q.get()
https://docs.python.org/2/library/queue.html

Communication between threads in Python (without using Global Variables)

Let's say if we have a main thread which launches two threads for test modules - " test_a" and " test_b".
Both the test module threads maintain their state whether they are done performing test or if they encountered any error, warning or if they want to update some other information.
How main thread can get access to this information and act accordingly.
For example, if " test_a" raised an error flag; How "main" will know and stop rest of the tests before existing with error ?
One way to do this is using global variables but that gets very ugly.. Very soon.
The obvious solution is to share some kind of mutable variable, by passing it in to the thread objects/functions at constructor/start.
The clean way to do this is to build a class with appropriate instance attributes. If you're using a threading.Thread subclass, instead of just a thread function, you can usually use the subclass itself as the place to stick those attributes. But I'll show it with a list just because it's shorter:
def test_a_func(thread_state):
# ...
thread_state[0] = my_error_state
# ...
def main_thread():
test_states = [None]
test_a = threading.Thread(target=test_a_func, args=(test_states,))
test_a.start()
You can (and usually want to) also pack a Lock or Condition into the mutable state object, so you can properly synchronize between main_thread and test_a.
(Another option is to use a queue.Queue, an os.pipe, etc. to pass information around, but you still need to get that queue or pipe to the child thread—which you do in the exact same way as above.)
However, it's worth considering whether you really need to do this. If you think of test_a and test_b as "jobs", rather than "thread functions", you can just execute those jobs on a pool, and let the pool handle passing results or errors back.
For example:
try:
with concurrent.futures.ThreadPoolExecutor(workers=2) as executor:
tests = [executor.submit(job) for job in (test_a, test_b)]
for test in concurrent.futures.as_completed(tests):
result = test.result()
except Exception as e:
# do stuff
Now, if the test_a function raises an exception, the main thread will get that exception—and, because that means exiting the with block, and all of the other jobs get cancelled and thrown away, and the worker threads shut down.
If you're using 2.5-3.1, you don't have concurrent.futures built in, but you can install the backport off PyPI, or you can rewrite things around multiprocessing.dummy.Pool. (It's slightly more complicated that way, because you have to create a sequence of jobs and call map_async to get back an iterator over AsyncResult objects… but really that's still pretty simple.)

Categories