How to use global variable in python, in a threadsafe way

I want to use a global variable, initialize it once, and access it in a thread-safe way.
Can someone share an example please?

If you only need read-only access and the value is initialized before any threads are spawned, you don't need to worry about thread safety.
If that is not the case, Python's threading library is probably what you need, more precisely its locks. A really good read on the subject, with quite a lot of examples: http://effbot.org/zone/thread-synchronization.htm

You do have a problem if you are using multiprocessing.Process, because each process gets its own copy of the global. In that case, take a look at Managers and Queues in the multiprocessing module.

The threading library is what you want:
import threading
mydata = threading.local()
mydata.x = 1

If you initialize it once, and you do so when the module is loaded (that is, before it can be accessed from other threads), you will have no problems with thread safety at all, provided the value is only read afterwards. No synchronization is needed.
But if you mean a more complex scenario, you have to explain it further to get a reasonable code example.

Related

Is Python reload thread safe?

Using Python, I am writing a nasty crawler system that crawls the websites of local governments, more than 100 websites in total. In case one of their webpages changes, I have to use reload to do a hot update. But I am wondering whether reload is thread safe. Say I am reloading module Crawler1 in thread 1 while, at the same time, thread 2 is using Crawler1. Will thread 1's reload cause thread 2 to fail? If so, I have to use a lock or something; otherwise, I can happily do the reload without extra work. Can anyone help me out?

Is Python reload thread safe?
No.
reload() re-executes all the pure-Python code in the module, and any pure-Python step can be interrupted by a thread switch at any time. So this definitely isn't safe.

reload = re-execute the top-level code in Crawler1.
Generally speaking, without more info or a code sample, you can:
1. Encapsulate the "operational" top-level code that kicks things off, e.g. put it in a function or a class, and invoke that instead of reloading the whole module. This may involve calling/adding some cleanup function.
2. Use a global variable, which thread1 and thread2 will flip and be aware of to prevent stepping on each other. This doesn't scale quite as well, but can perhaps prevent/delay usage of locks.
Using locks is actually not that hard; they even support context managers:
https://docs.python.org/3/library/threading.html#with-locks
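A minimal sketch of the lock idea, assuming Python 3 (where reload lives in importlib). The helper names reload_module and call_in_module are invented here, and this version deliberately takes the simplest route of serializing every call into the module:

```python
import importlib
import threading

# One lock shared by the reloading thread and by every thread using the module.
_reload_lock = threading.Lock()


def reload_module(module):
    """Reload a module while no other thread is calling into it."""
    with _reload_lock:
        return importlib.reload(module)


def call_in_module(module, func_name, *args, **kwargs):
    """Look up and call module.func_name while no reload is in progress."""
    with _reload_lock:
        # The lookup happens under the lock, so we always get the function
        # from a fully executed version of the module.
        func = getattr(module, func_name)
        return func(*args, **kwargs)
```

Usage would be something like call_in_module(Crawler1, "crawl", url) in the workers and reload_module(Crawler1) in the updater, where crawl is a hypothetical function name. Holding the lock for the whole call means crawls no longer run in parallel; whether that is acceptable depends on the workload.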

What is and isn't automatically thread-local in Python Threading?

I'm having a hard time wrapping my head around Python threading, especially since the documentation explicitly tells you to RTFS at some points, instead of kindly including the relevant info. I'll admit I don't feel qualified to read the threading module. I've seen lots of dirt-simple examples, but they all use global variables, which is offensive and makes me wonder if anyone really knows when or where it's required to use them as opposed to just convenient.
In particular, I'd like to know:
In threading.Thread(target=x), is x shared or private? Does each thread have its own stack, or are all threads using the same context simultaneously?
What is the preferred way to pass mutable variables to threads? Immutable ones are obviously through Thread(args=[],kwargs={}) and that's what all the examples cover. If it's global, I'll have to hold my nose and use it, but it seems like there has to be a better way. I suppose I could wrap everything in a class and just pass the instance in, but it'd be nice to point at regular variables, too.
When do I need threading.local()? In the x above?
Do I have to subclass Thread to update data, as many examples show?
I'm used to Win32 threads and pthreads, where it's explicitly laid out in docs what is and isn't shared with different uses of threads. Those are pretty low-level, and I'd like to avoid _thread if possible to be pythonic.
I'm not sure if it's relevant, but I'm trying to thread OpenMP-style to get the hang of it - make a for loop run concurrently using a queue and some threads. It was easy with globals and locks, but now I'd like to nail down scopes for better lock use.

In threading.Thread(target=x), is x shared or private?
It is private. Each thread has its own private invocation of x.
This is similar to recursion, for example (regardless of multithreading). If x calls itself, each invocation of x gets its own "private" frame, with its own private local variables.
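To make that concrete, here is a small sketch (count_to and demo are made-up names): each thread runs the same function, the local variable total is separate per call, and only the dict that is passed in explicitly is shared.

```python
import threading


def count_to(limit, results, name):
    total = 0                 # local variable: lives in this call's own frame
    for i in range(limit):
        total += i
    results[name] = total     # the dict is shared because it was passed in


def demo():
    results = {}
    threads = [
        threading.Thread(target=count_to, args=(n, results, f"t{n}"))
        for n in (10, 100, 1000)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each thread writes to a different key here, so no lock is needed for this particular sketch.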
What is the preferred way to pass mutable variables to threads? Do I have to subclass Thread to update data?
I view the target argument as a quick shortcut, good for quick experiments, but not much else. Using it where it ought not be used leads to all the limitations you describe in your question (and the hacks you describe in the possible solutions you contemplate).
Most of the time, you'd want to subclass threading.Thread. The code creating/managing the threads would pass all mutable shared objects to your thread-classes' __init__, and they should keep those objects as their attributes, and access them when running (within their run method).
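A sketch of that pattern, with invented names (Worker, square_all) and squaring numbers standing in for real work: the shared queue, result list and lock are handed to __init__, stored as attributes, and used inside run.

```python
import queue
import threading


class Worker(threading.Thread):
    def __init__(self, tasks, results, results_lock):
        super().__init__()
        # Shared mutable objects are passed in and kept as attributes.
        self.tasks = tasks
        self.results = results
        self.results_lock = results_lock

    def run(self):
        while True:
            try:
                item = self.tasks.get_nowait()
            except queue.Empty:
                return  # all tasks were queued up front, so empty means done
            with self.results_lock:
                self.results.append(item * item)


def square_all(numbers, num_workers=4):
    tasks = queue.Queue()
    for n in numbers:
        tasks.put(n)
    results = []
    lock = threading.Lock()
    workers = [Worker(tasks, results, lock) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

The order of the results depends on thread scheduling, so sort them if order matters.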
When do I need threading.local()?
You rarely do, so you probably don't.
I'd like to avoid _thread if possible to be pythonic
Without a doubt, avoid it.

Does thread-local mean thread safe?

Specifically I'm talking about Python. I'm trying to hack something (just a little) by seeing an object's value without ever passing it in, and I'm wondering if it is thread safe to use thread local to do that. Also, how do you even go about doing such a thing?

No. Thread-local means that each thread gets its own copy of that variable. Using it is (at least normally) thread-safe, simply because each thread uses its own variable, separate from the variables of the same name that are accessible to other threads. On the other hand, thread-local variables are not (normally) useful for communication between threads.
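A small sketch of what "its own copy" means in practice (ctx, worker and the demo functions are illustrative names only):

```python
import threading

ctx = threading.local()


def worker(name, seen):
    ctx.name = name          # sets the attribute for this thread only
    seen[name] = ctx.name


def demo():
    ctx.name = "main"
    seen = {}
    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}", seen))
        for i in range(3)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    seen["main"] = ctx.name  # untouched by the workers
    return seen


def fresh_thread_sees_name():
    """Check whether a brand-new thread can see the attribute at all."""
    out = []
    t = threading.Thread(target=lambda: out.append(hasattr(ctx, "name")))
    t.start()
    t.join()
    return out[0]
```

The main thread's value survives the workers' assignments, and a new thread does not see the attribute until it sets one itself.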

Python: time a method call and stop it if time is exceeded

I need to dynamically load code (comes as source), run it and get the results. The code that I load always includes a run method, which returns the needed results. Everything looks ridiculously easy, as usual in Python, since I can do
exec(source) #source includes run() definition
result = run(params)
#do stuff with result
The only problem is, the run() method in the dynamically generated code can potentially not terminate, so I need to only run it for up to x seconds. I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it (or can I?). Performance is also an issue to consider, since all of this is happening in a long while loop.
Any suggestions on how to proceed?
Edit: to clear things up per dcrosta's request: the loaded code is not untrusted, but generated automatically on the machine. The purpose for this is genetic programming.

The only "really good" solutions -- imposing essentially no overhead -- are going to be based on SIGALRM, either directly or through a nice abstraction layer; but as already remarked Windows does not support this. Threads are no use, not because it's hard to get results out (that would be trivial, with a Queue!), but because forcibly terminating a runaway thread in a nice cross-platform way is unfeasible.
This leaves high-overhead multiprocessing as the only viable cross-platform solution. You'll want a process pool to reduce process-spawning overhead (since presumably the need to kill a runaway function is only occasional, most of the time you'll be able to reuse an existing process by sending it new functions to execute). Again, Queue (the multiprocessing kind) makes getting results back easy (albeit with a modicum more caution than for the threading case, since in the multiprocessing case deadlocks are possible).
If you don't need to strictly serialize the executions of your functions, but rather can arrange your architecture to try two or more of them in parallel, AND are running on a multi-core machine (or multiple machines on a fast LAN), then suddenly multiprocessing becomes a high-performance solution, easily paying back for the spawning and IPC overhead and more, exactly because you can exploit as many processors (or nodes in a cluster) as you can use.

You could use the multiprocessing library to run the code in a separate process, and call .join() on the process to wait for it to finish, with the timeout parameter set to whatever you want. The library provides several ways of getting data back from another process - using a Value object (seen in the Shared Memory example on that page) is probably sufficient. You can use the terminate() call on the process if you really need to, though it's not recommended.

You could also use Stackless Python, as it allows for cooperative scheduling of microthreads. Here you can specify a maximum number of instructions to execute before returning. Setting up the routines and getting the return value out is a little more tricky though.

I could spawn a new thread for this, and specify a time for .join() method, but then I cannot easily get the result out of it
If the timeout expires, that means the method didn't finish, so there's no result to get. If you have incremental results, you can store them somewhere and read them out however you like (keeping threadsafety in mind).
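For what it's worth, a sketch of the thread-plus-join(timeout) variant with a queue to carry the result back. The helper name run_with_timeout is made up, and note that on timeout the thread is only abandoned, not stopped:

```python
import queue
import threading
import time


def run_with_timeout(func, args=(), timeout=1.0):
    """Return (True, result) if func finishes in time, else (False, None)."""
    results = queue.Queue(maxsize=1)

    def target():
        try:
            results.put((True, func(*args)))
        except Exception as exc:        # hand errors back to the caller
            results.put((False, exc))

    # daemon=True so a runaway worker cannot keep the interpreter alive
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        return False, None              # timed out; the thread keeps running
    ok, payload = results.get_nowait()
    if not ok:
        raise payload
    return True, payload
```

A call such as run_with_timeout(run, args=(params,), timeout=2.0) would then fit the loop from the question, but a run() that never terminates keeps burning CPU in the background.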
Using SIGALRM-based systems is dicey, because they can deliver async signals at any time, even during an except or finally handler where you're not expecting one. (Other languages deal with this better, unfortunately.) For example:
try:
    pass  # code
finally:
    cleanup1()
    cleanup2()
    cleanup3()
A signal passed up via SIGALRM might happen during cleanup2(), which would cause cleanup3() to never be executed. Python simply does not have a way to terminate a running thread in a way that's both uncooperative and safe.
You should just have the code check the timeout on its own.
import threading
from datetime import datetime, timedelta
# Thread-local storage, so each thread tracks its own deadline.
local = threading.local()
class ExecutionTimeout(Exception):
    pass
def start(max_duration=timedelta(seconds=1)):
    local.start_time = datetime.now()
    local.max_duration = max_duration
def check():
    # Raise once the calling thread has used up its time budget.
    if datetime.now() - local.start_time > local.max_duration:
        raise ExecutionTimeout()
def do_work():
    start()
    while True:
        check()
        # do stuff here; break out of the loop when finished
    return 10
try:
    print(do_work())
except ExecutionTimeout:
    print("Timed out")
(Of course, this belongs in a module, so the code would actually look like "timeout.start()"; "timeout.check()".)
If you're generating code dynamically, then generate a timeout.check() call at the start of each loop.

Consider using the stopit package, which can be useful in some cases where you need timeout control. Its documentation emphasizes the limitations.
https://pypi.python.org/pypi/stopit

A quick Google search for "python timeout" turns up a TimeoutFunction class.

Executing untrusted code is dangerous, and should usually be avoided unless it's impossible to do so. I think you're right to be worried about the time of the run() method, but the run() method could do other things as well: delete all your files, open sockets and make network connections, begin cracking your password and email the result back to an attacker, etc.
Perhaps if you can give some more detail on what the dynamically loaded code does, the SO community can help suggest alternatives.

python threadsafe object cache

I have implemented a python webserver. Each http request spawns a new thread.
I have a requirement to cache objects in memory, and since it's a webserver, I want the cache to be thread safe. Is there a standard implementation of a thread-safe object cache in Python? I found the following:
http://freshmeat.net/projects/lrucache/
This does not look to be thread safe. Can anybody point me to a good implementation of thread safe cache in python?
Thanks!

Well a lot of operations in Python are thread-safe by default, so a standard dictionary should be ok (at least in certain respects). This is mostly due to the GIL, which will help avoid some of the more serious threading issues.
There's a list here: http://coreygoldberg.blogspot.com/2008/09/python-thread-synchronization-and.html that might be useful.
However, the atomic nature of those operations just means that you won't end up with an entirely inconsistent state if two threads access a dictionary at the same time, so you wouldn't get a corrupted value. You would (as with most multi-threaded programming) not be able to rely on the specific order of those atomic operations.
So to cut a long story short...
If you have fairly simple requirements and aren't too bothered about the ordering of what gets written into the cache, then you can use a dictionary and know that you'll always get a consistent, non-corrupted value (it just might be out of date).
If you want to ensure that things are a bit more consistent with regard to reading and writing, then you might want to look at Django's local memory cache, which uses a read/write lock for locking:
http://code.djangoproject.com/browser/django/trunk/django/core/cache/backends/locmem.py

Thread per request is often a bad idea. If your server experiences huge spikes in load, it will bring the box to its knees. Consider using a thread pool that can grow to a limited size during peak usage and shrink to a smaller size when load is light.
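A minimal sketch of the bounded-pool idea using the standard library's ThreadPoolExecutor. The handle_request and serve functions are placeholders, not part of any web framework, and this pool only caps the number of threads, so the shrinking behaviour described above is not covered here:

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(request_id):
    """Hypothetical request handler; stands in for real work."""
    return f"handled {request_id}"


def serve(request_ids, max_workers=4):
    # At most max_workers threads exist, no matter how many requests arrive;
    # extra requests wait in the executor's internal queue.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(handle_request, request_ids))
```

pool.map returns results in the order of the inputs, regardless of which thread finished first.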

Point 1. The GIL does not help you here. An example of a (non-thread-safe) cache for something called "stubs" would be
stubs = {}
def maybe_new_stub(host):
    """Return the stub for host from the cache, creating and caching it if it is new."""
    if host not in stubs:
        stub = create_new_stub_for_host(host)
        stubs[host] = stub
    return stubs[host]
What can happen is that Thread 1 calls maybe_new_stub('localhost'), and it discovers we do not have that key in the cache yet. Now we switch to Thread 2, which calls the same maybe_new_stub('localhost'), and it also learns the key is not present. Consequently, both threads call create_new_stub_for_host and put it into the cache.
The map itself is protected by the GIL, so we cannot break it by concurrent access. The logic of the cache, however, is not protected, and so we may end up creating two or more stubs, and dropping all except one on the floor.
Point 2. Depending on the nature of the program, you may not want a global cache. Such a shared cache forces synchronization between all your threads. For performance reasons, it is good to make the threads as independent as possible. I believe I do need one; you may actually not.
Point 3. You may use a simple lock. I took inspiration from https://codereview.stackexchange.com/questions/160277/implementing-a-thread-safe-lrucache and came up with the following, which I believe is safe to use for my purposes
import threading
import grpc
import cli_pb2_grpc  # gRPC module generated for the author's own service
stubs = {}
lock = threading.Lock()
def maybe_new_stub(host):
    """Return the stub for host from the cache, creating and caching it if it is new."""
    # The lock makes the check and the insert one atomic step.
    with lock:
        if host not in stubs:
            channel = grpc.insecure_channel('%s:6666' % host)
            stub = cli_pb2_grpc.BrkStub(channel)
            stubs[host] = stub
        return stubs[host]
Point 4. It would be best to use an existing library. I haven't found any I am prepared to vouch for yet.
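The same pattern without the gRPC dependency, as a self-contained sketch. LockedCache and the make_stub factory are invented for illustration; the demo checks that the factory runs only once even when many threads ask for the same key:

```python
import threading


class LockedCache:
    """Dict-backed cache whose check-then-create step is guarded by a lock."""

    def __init__(self, factory):
        self._factory = factory          # called as factory(key) on a miss
        self._items = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            if key not in self._items:
                self._items[key] = self._factory(key)
            return self._items[key]


def demo(num_threads=16):
    calls = []

    def make_stub(host):                 # hypothetical expensive constructor
        calls.append(host)
        return object()

    cache = LockedCache(make_stub)
    got = []

    def worker():
        got.append(cache.get("localhost"))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return calls, got
```

One trade-off of this design is that the factory runs while the lock is held, so a slow creation for one key blocks lookups of every other key.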

You probably want to use memcached instead. It's very fast, very stable, very popular, has good python libraries, and will allow you to grow to a distributed cache should you need to:
http://www.danga.com/memcached/

I'm not sure any of these answers are doing what you want.
I have a similar problem and I'm using a drop-in replacement for lrucache called cachetools which allows you to pass in a lock to make it a bit safer.

For a thread safe object you want threading.local:
from threading import local
safe = local()
safe.cache = {}
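You can then put and retrieve objects in safe.cache without locking, because each thread sees its own separate attributes on safe. Note that this also means the cache is not shared between threads: safe.cache as assigned above exists only in the thread that ran the assignment, and every other thread has to create its own.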
