Overwriting class variable and concurrent Flask requests - python

I'm running a python Flask server to perform tricky algorithms, one of which assigns cables to tubes.
from typing import List

class Tube:
    max_capacity = 5
    cables: List[str]

    def has_capacity(self):
        return len(self.cables) < self.max_capacity
The max capacity was always 5, but now there's a new customer that actually has tubes that can fit 6 cables.
When I receive a request, I now just set Tube.max_capacity = request.args.get('max_capacity', 5). Then each instance of Tube will have the correct setting.
I was wondering if this will keep working if there are multiple requests being handled at the same time?
Are the Flask (I use Gunicorn as WSGI) processes all separate from each other such that this is safe to do? I don't want to end up with strange bugs because the max capacity changed halfway through a request because another request came in.
EDIT:
I tried this out and it appears to work as intended:
import time
from random import randint

@app.route('/concurrency')
def concurrency():
    my_value = randint(0, 100)
    Concurrency.value = my_value
    time.sleep(8)
    return f"My value: {my_value} should be equal to Concurrency.value {Concurrency.value}"

class Concurrency:
    value = 10
Still, I want to know more about how multiple Flask/Gunicorn requests work to be certain.

WSGI applications are typically served using multiple processes, possibly on different servers, and requests from the same user will be handled by the first available process. IOW: you do NOT want to change any module-level or class-level variables on a per-request basis; this is guaranteed to mess up everything.
It's impossible to tell you exactly how to solve the issue without much more context, but in all cases, you'll have to rethink your design.
EDIT:
how do processes behave? If one of them sets the value, does another process see that value as well?
Of course not. Each process is totally isolated from the others, so changing a module-level variable or class attribute will only affect the current process. But since processes are not tied to clients (which process will handle a given request is totally unpredictable), a change made in one process will not necessarily be seen by the next request if that request is served by another process. AND:
Or, is a process re-used, and then still has the value from the previous request?
Processes are of course reused, but that doesn't mean the same process will be reused for the next request from the same user. And this is the second part of the issue: when serving another user, your process will still use the "updated" max_capacity value from the previous user.
IOW, what you're doing is guaranteed to mess up everything for all your users. That's why we use external (out-of-process) means to store and share per-user data between requests: either sessions (for volatile data) or a database (for permanent storage).
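If the capacity is only needed while a single request is being processed, one alternative to sessions or a database is to stop mutating the class and pass the value to each instance instead. This is a minimal sketch, not the answerer's method: `DEFAULT_MAX_CAPACITY` and `parse_max_capacity` are hypothetical names, and a plain dict stands in for Flask's `request.args`.

```python
from typing import List, Optional

DEFAULT_MAX_CAPACITY = 5  # hypothetical name for the old hard-coded value


class Tube:
    def __init__(self, max_capacity: int = DEFAULT_MAX_CAPACITY,
                 cables: Optional[List[str]] = None):
        # Per-instance state: nothing shared between requests is modified.
        self.max_capacity = max_capacity
        self.cables = cables if cables is not None else []

    def has_capacity(self) -> bool:
        return len(self.cables) < self.max_capacity


def parse_max_capacity(args) -> int:
    # Query-string values arrive as strings, so convert explicitly.
    return int(args.get("max_capacity", DEFAULT_MAX_CAPACITY))


# A plain dict stands in for request.args in this sketch.
tube_new_customer = Tube(parse_max_capacity({"max_capacity": "6"}), ["c"] * 5)
tube_old_customer = Tube(parse_max_capacity({}), ["c"] * 5)
```

Each request builds its own tubes with its own capacity, so a concurrent request for another customer cannot change the value halfway through.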

Related

Share state between threads in bottle

In my Bottle app running on pythonanywhere, I want objects to be persisted between requests.
If I write something like this:
from bottle import route, SimpleTemplate

X = {'count': 0}

@route('/count')
def count():
    X['count'] += 1
    tpl = SimpleTemplate('Hello {{count}}!')
    return tpl.render(count=X['count'])
The count increments, meaning that X persists between requests.
I am currently running this on pythonanywhere, which is a managed service where I have no control over the web server (nginx I presume?) threading, load balancing (if any) etc...
My question is, is this coincidence because it's only using one thread while on minimal load from me doing my tests?
More generally, at which point will this stop working? E.g. I have more than one thread/socket/instance/load-balanced server etc...?
Beyond that, what are my best options to make something like this work (sticking to Bottle), even if I have to move to a barebones server?
Here's what Bottle docs have to say about their request object:
A thread-safe instance of LocalRequest. If accessed from within a request callback, this instance always refers to the current request (even on a multi-threaded server).
But I don't fully understand what that means, or where global variables like the one I used stand with regards to multi-threading.
TL;DR: You'll probably want to use an external database to store your state.
If your application is tiny, and you're planning to always have exactly one server process running, then your current approach can work; "all" you need to do is acquire a lock around every (!) access to the shared state (the dict X in your sample code). (I put "all" in scare quotes there because it's likely to become more complicated than it sounds at first.)
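For the single-process case described above, the locking could look roughly like this. It is only a sketch using the standard library: `X_lock`, `increment` and `worker` are hypothetical names, and plain threads stand in for concurrent requests.

```python
import threading

X = {"count": 0}
X_lock = threading.Lock()  # guards every read and write of X


def increment():
    # Hold the lock for the whole read-modify-write sequence.
    with X_lock:
        X["count"] += 1
        return X["count"]


def worker(n):
    for _ in range(n):
        increment()


# Eight threads stand in for eight concurrent requests.
threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

This only protects threads inside one process; as soon as a second process is started, each process has its own `X` and its own lock.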
But, since you're asking about multithreading, I'll assume that your application is more than a toy, meaning that you plan to receive substantial traffic and/or want to handle multiple requests concurrently. In this case, you'll want multiple processes, which means that your approach--storing state in memory--cannot work. Memory is not shared across processes. The (general) way to share state across processes is to store the state externally, e.g. in a database.
Are you familiar with Redis? That'd be on my short list of candidates.
I got the answers by contacting PythonAnywhere support, who had this to say:
When you run a website on a free PythonAnywhere account, just
one process handles all of your requests -- so a global variable like
the one you use there will be fine. But as soon as you want to scale
up, and get (say) a hacker account, then you'll have multiple processes
(note, not threads) -- and of course each one will have its own global
variables, so things will go wrong.
So that part deals with the PythonAnywhere specifics on why it works, and when it would stop working on there.
The answer to the second part, about how to share variables between multiple Bottle processes, I also got from their support (most helpful!) once they understood that a database would not work well in this situation.
Different processes cannot of course share variables, and the most viable solution would be to:
write your own kind of caching server to handle keeping stuff in memory [...] You'd have one process that ran all of the time, and web API requests would access it somehow (an internal REST API?). It could maintain stuff in memory [...]
Ps: I didn't expect other replies to tell me to store state in a database, I figured that the fact I'm asking this means I have a good reason not to use a database, apologies for time wasted!

How to load files to Google-App-Engine in standard environment

I am using the Google-App-Engine standard (not flex) environment with Python 2.7, and I need to load some pre-trained models (Gensim's Word2vec and Keras's LSTM).
I need to load them once (since loading is very slow, around 1.5 seconds) and keep them available for fast access for several hours.
What is the best & fastest way to do so?
Thanks!
IMHO the best place for read-only data (including imported code!) needed to be accessed at any time by individual requests is the global application variables area.
Such variables would typically be loaded exactly once per GAE instance lifetime and available until the instance goes away.
Since loading of the data is expensive you need to be aware that it could impact the response time for requests coming in while the instance is starting up (i.e. while the loading request is still active). There are 2 ways to address this:
one would be to use "lazy" loading of the data - effective if just a small percentage of the incoming requests actually need the data. But the requests which actually need the data when it's not available will still be affected, so it'll just reduce the impact of the problem. The method is described in detail in the App Engine Startup time and the Global Variable problem article:
from google.appengine.ext import ndb

# a global variable
gCDNServer = None

def getCDN():
    global gCDNServer
    if gCDNServer is None:
        # Settings is an ndb model defined elsewhere; fetch the entity on first use
        gCDNServer = Settings.query(Settings.name == "gCDNServer").get().value
    return gCDNServer
another approach, which would completely eliminate the problem, is to make your app support warmup requests (available only if you're using automatic scaling). The data would be loaded by the warmup request handler and will always be available for "live" requests (because no "live" requests will be routed to the instance until the warmup request handling completes).
It might be possible to add logic to drop the data from memory (to reduce the app's memory footprint) if/when you know it'll no longer be needed (i.e. after those several hours you mentioned have expired), but that complicates the picture, especially if you configured your app as threadsafe. I'd simply separate the code which doesn't need the data from the code which does into different services, and let autoscaling shut down the instances with the global data when they're no longer needed.
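If the app is configured as threadsafe, the lazy-loading approach above may need a lock so that two requests arriving at the same time don't both run the expensive load. A minimal sketch using only the standard library follows; `LazyResource` and `load_model` are hypothetical names, and a dict stands in for the real Word2vec/LSTM model.

```python
import threading


class LazyResource:
    """Loads a value on first use, at most once per process."""

    def __init__(self, loader):
        self._loader = loader
        self._lock = threading.Lock()
        self._value = None
        self._loaded = False

    def get(self):
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock: another thread may have loaded it.
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value


load_calls = []  # records how many times the loader actually ran


def load_model():
    load_calls.append(1)
    return {"name": "stand-in model"}  # placeholder for the slow model load


model = LazyResource(load_model)

# Eight threads stand in for requests hitting a freshly started instance.
threads = [threading.Thread(target=model.get) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The value still lives in one instance's memory only, so every new instance pays the load cost once.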

Django Threading Structure

First of all: yes, I checked and googled this topic, but I can't find anything that gives me a clear answer to my question. I am a beginner in Django and am studying its documentation, where I read about the thread-safety considerations for the render method of nodes for template tags (link to the documentation). My question concerns the part where it states that once a node is parsed, the render method for that node might be called multiple times. I am confused whether it is talking about the template tag being used at different places in the same document for the same user, at the level of that user's single instance on the server, or about the template tag being used for multiple requests coming from users all around the world who share the same Django instance in memory. If it's the latter, doesn't Django create a new instance at the server level for every new user request, with separate resources in memory for every user, or am I wrong about this?
It's the latter.
A WSGI server usually runs a number of persistent processes, and in each process it runs a number of threads. While some automatic scaling can be applied, the number of processes and threads is more or less constant, and determines how many concurrent requests Django can handle. The days where each request would create a new CGI process are long gone, and in most cases persistent processes are much more efficient.
Each process has its own memory, and the communication between processes is usually handled by the database, the cache etc. They can't communicate directly through memory.
Each thread within a process shares the same memory. That means that any object that is not locally scoped (e.g. only defined inside a function), is accessible from the other threads. The cached template loader parses each template once per process, and each thread uses the same parsed nodes. That also means that if you set e.g. self.foo = 'bar' in one thread, each thread will then read 'bar' when accessing self.foo. Since multiple threads run at the same time, this can quickly become a huge mess that's impossible to debug, which is why thread safety is so important.
As the documentation says, as long as you don't store data on self, but put it into context.render_context, you should be fine.
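To illustrate the difference without pulling in Django, here is a plain-Python stand-in. It is only a sketch: this `CycleNode` is not Django's class, and an ordinary dict plays the role of `context.render_context`. The node object is shared by every render, but its mutable state is kept per render, keyed by the node.

```python
import itertools


class CycleNode:
    """Stand-in for a parsed template node shared by all renders."""

    def __init__(self, cyclevars):
        self.cyclevars = cyclevars  # read-only configuration, safe to share

    def render(self, render_context):
        # Per-render state lives in render_context, keyed by this node,
        # and is never stored on self.
        if self not in render_context:
            render_context[self] = itertools.cycle(self.cyclevars)
        return next(render_context[self])


node = CycleNode(["row1", "row2"])  # parsed once, reused for every request


def render_template(node, times):
    render_context = {}  # fresh for each template render
    return [node.render(render_context) for _ in range(times)]


first = render_template(node, 3)
second = render_template(node, 3)
```

Because each render starts with its own context, the second render begins the cycle from the start again instead of continuing where the first one stopped.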

memcache.get returns wrong object (Celery, Django)

Here is what we have currently:
we're trying to get cached django model instance, cache key includes name of model and instance id. Django's standard memcached backend is used. This procedure is a part of common procedure used very widely, not only in celery.
sometimes (randomly and/or very rarely) cache.get(key) returns the wrong object: either an int or a different model instance; even a same-model-different-id case appeared. We catch this by checking that the model name and id match the cache key.
bug appears only in context of three of our celery tasks, never reproduces in python shell or other celery tasks. UPD: appears under long-running CPU-RAM intensive tasks only
cache stores correct value (we checked that manually at the moment the bug just appeared)
calling the same task again with the same arguments might not reproduce the issue, although the probability is much higher, so bug appearances tend to "group" in the same period of time
restarting celery solves the issue for the random period of time (minutes - weeks)
*NEW* this isn't connected with memory overflow. We always have at least 2Gb free RAM when this happens.
*NEW* we have cache_instance = cache.get_cache("cache_entry") in static code. During investigation, I found that at the moment the bug happens cache_instance.get(key) returns wrong value, although get_cache("cache_entry").get(key) on the next line returns correct one. This means either bug disappears too quickly or for some reason cache_instance object got corrupted.
Isn't cache instance object returned by django's cache thread safe?
*NEW* we logged very strange case: as another wrong object from cache, we got model instance w/o id set. This means, the instance was never saved to DB therefore couldn't be cached. (I hope)
*NEW* At least one MemoryError was logged these days
I know, all of this sounds like some sort of magic.. And really, any ideas how that's possible or how to debug this would be very appreciated.
PS: My current assumption is that this is connected with multiprocessing: since the cache instance is created in static code, before the worker processes fork, all workers would end up sharing the same socket. (Does that sound plausible?)
Solved it finally:
Celery has a dynamic scaling feature: it can add/kill workers according to load.
It does this by forking an existing worker.
Open sockets and files are copied to the forked process, so both processes share them, which leads to a race condition: one process may read the response intended for the other, and vice versa.
from django.core.cache import cache: this object stores a pre-connected memcached socket. Don't use it when your process could be dynamically forked, and don't use stored connections, pools and the like.
OR store them under the current PID, and check it each time you access the cache.
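The "store them under the current PID" idea can be sketched roughly as follows. `PerProcess` is a hypothetical helper, not part of Django or Celery, and `object` stands in for a real client factory that would open a memcached connection.

```python
import os


class PerProcess:
    """Keeps one value per process and rebuilds it after a fork."""

    def __init__(self, factory):
        self._factory = factory
        self._pid = None
        self._value = None

    def get(self):
        pid = os.getpid()
        if self._pid != pid:
            # First use, or we are in a forked child: the stored value was
            # created by another process, so build a fresh one.
            self._value = self._factory()
            self._pid = pid
        return self._value


# `object` is a placeholder; a real factory would create a new connection.
connections = PerProcess(factory=object)

client_a = connections.get()
client_b = connections.get()  # same process, so the same client is reused
```

A forked child sees a different `os.getpid()` on its first call and therefore creates its own client instead of reusing the parent's socket.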
This has been bugging me for a while until I found this question and answer. I just want to add some things I've learnt.
You can easily reproduce this problem with a local memcached instance:
from django.core.cache import cache
import os

def write_read_test():
    pid = os.getpid()
    cache.set(pid, pid)
    for x in range(5):
        value = cache.get(pid)
        if value != pid:
            print("Unexpected response {} in process {}. Attempt {}/5".format(
                value, pid, x+1))
    # exit the child so it does not continue the fork loop below
    os._exit(0)

# touch the cache before forking so the connection is shared by the children
cache.set("access cache", "before fork")
for x in range(5):
    if os.fork() == 0:
        write_read_test()
What you can do is close the cache client as Django does in the request_finished signal:
https://github.com/django/django/blob/master/django/core/cache/__init__.py#L128
If you put a cache.close() after the fork, everything works as expected.
For celery you could connect to a signal that is fired after the worker is forked and execute cache.close().
This also affects gunicorn when preload is active and the cache is initialized before forking the workers.
For gunicorn, you could use post_fork in your gunicorn configuration:
def post_fork(server, worker):
    from django.core.cache import cache
    cache.close()

Set / get objects with memcached

In a Django Python app, I launch jobs with Celery (a task manager). When each job is launched, they return an object (lets call it an instance of class X) that lets you check on the job and retrieve the return value or errors thrown.
Several people (someday, I hope) will be able to use this web interface at the same time; therefore, several instances of class X may exist at the same time, each corresponding to a job that is queued or running in parallel. It's difficult to come up with a way to hold onto these X objects because I cannot use a global variable (a dictionary that allows me to look up each X objects from a key); this is because Celery uses different processes, not just different threads, so each would modify its own copy of the global table, causing mayhem.
Subsequently, I received the great advice to use memcached to share the memory across the tasks. I got it working and was able to set and get integer and string values between processes.
The trouble is this: after a great deal of debugging today, I learned that memcached's set and get don't seem to work for classes. This is my best guess: Perhaps under the hood memcached serializes objects to the shared memory; class X (understandably) cannot be serialized because it points at live data (the status of the job), and so the serial version may be out of date (i.e. it may point to the wrong place) when it is loaded again.
Attempts to use a SQLite database were similarly fruitless; not only could I not figure out how to serialize objects as database fields (using my Django models.py file), I would be stuck with the same problem: the handles of the launched jobs need to stay in RAM somehow (or use some fancy OS tricks underneath), so that they update as the jobs finish or fail.
My best guess is that (despite the advice that thankfully got me this far) I should be launching each job in some external queue (for instance Sun/Oracle Grid Engine). However, I couldn't come up with a good way of doing that without using a system call, which I thought may be bad style (and potentially insecure).
How do you keep track of jobs that you launch in Django or Django Celery? Do you launch them by simply putting the job arguments into a database and then have another job that polls the database and runs jobs?
Thanks a lot for your help, I'm quite lost.
I think django-celery does this work for you. Did you have a look at the tables made by django-celery? E.g. djcelery_taskstate holds all data for a given task, like state, worker_id and so on. For periodic tasks there is a table called djcelery_periodictask.
In a Django view you can access the TaskMeta object:
from djcelery.models import TaskMeta

task = TaskMeta.objects.get(task_id=task_id)
print(task.status)
