I have a Flask-RESTful API that acts as a gateway to TCP devices that cannot handle concurrent calls.
Since Resource objects are spawned per request, I cannot queue and manage them from a single point.
I tried to create a decorator that Resources needing synchronization can use. In the decorator, I append the id of the TCP device (load_id) to a list in the global scope and remove it after the request is handled.
The problem is that when concurrent requests are made, the first Resource gets an empty list and appends to it, and while it is still executing a second Resource is created for the second request. This second Resource instance also gets an empty list, so I cannot actually make Resource instances share a list.
I tried this without a decorator, inside the get and put methods explicitly, with locks defined on database model objects, and with a common handler object that manages locks on objects uniquely identified by load_id, but to no avail; I always get an outdated list.
Here is a stripped-down version of one attempt:
from functools import wraps

loads_with_query_in_progress = []  # Global scope

def disallow_async_calls(func):
    @wraps(func)
    def decorator(*args, **kwargs):
        global loads_with_query_in_progress
        load_id = kwargs.get("load_id", None)
        load = Load.query.get(load_id)
        if load in loads_with_query_in_progress:  # Load is in the list. Aborting.
            raise Exception
        else:
            loads_with_query_in_progress.append(load)  # APPEND
        try:
            decorated_function_output = func(*args, **kwargs)
        except Exception as e:
            loads_with_query_in_progress.remove(load)  # Exception handling cleanup
            raise e
        loads_with_query_in_progress.remove(load)  # Remove lock
        return decorated_function_output
    return decorator
class LoadStateAPI(Resource):
    decorators = [auth.login_required,
                  disallow_async_calls]
    ...

    def get(self, load_id):
        load = Load.query.get(load_id)
        try:
            rqObj = RelayQueryObject(load)
            rqObj.execute()
        except:
            raise
        if rqObj.fsmState == CommState.COMPLETED:
            return {'state': rqObj.response}, 200
In this code, during the first request the line commented with #APPEND modifies loads_with_query_in_progress in its scope. But when another request is spawned, the variable loads_with_query_in_progress is fetched unmodified.
Is there any way to resolve this async-sync conversion?
The discrepancy was due to the fact that production runs under uWSGI, and uWSGI uses several worker processes. Each process has its own copy of the "shared" objects, so different requests were mutating different lists, which produced the phantom differences. We missed this while debugging because all processes log to the same file, so the output looked like it came from a single process.
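Since uWSGI workers are separate processes, an in-memory list can never provide the exclusion; the lock has to live somewhere every process can reach. As a minimal sketch (assuming a POSIX host; the lock-file path and error message are illustrative, not part of the original code), the decorator can take a file lock keyed by load_id:

import fcntl
from functools import wraps

def disallow_concurrent_calls(func):
    @wraps(func)
    def decorator(*args, **kwargs):
        load_id = kwargs.get("load_id", None)
        lock_path = "/tmp/load_%s.lock" % load_id  # illustrative location
        with open(lock_path, "w") as lock_file:
            try:
                # Non-blocking: fail immediately if another process holds the lock.
                fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except (IOError, OSError):
                raise Exception("Load %s is already being queried" % load_id)
            try:
                return func(*args, **kwargs)
            finally:
                fcntl.flock(lock_file, fcntl.LOCK_UN)
    return decorator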
The Problem
I am using a pattern to cache SQLAlchemy objects in Redis. Whenever an instance is modified and committed, I want to clear the relevant caches so that it will be reloaded when next fetched. This clear must happen after commit to avoid race conditions (another thread querying cache, missing, and reloading stale data into the cache).
I have fought with this for a long time, coming up with various solutions that work sometimes but nothing bulletproof. This seems like a straightforward enough problem that there should be a solution to it. How do I trigger some code every time a change is committed to a SQLAlchemy instance?
What I've Tried
Events
I've tried to stitch together some SQLAlchemy events to achieve my goal, with varying levels of success. Listening to "after_insert" and "after_update" tells me when an object is modified, and "after_commit" tells me that whatever was modified was saved. So I had a scheme where the first two events would record the modified object and register an "after_commit" listener, which would in turn pass the objects to my cache-clearing function. Like this:
def _register_after_commit(_: Mapper, __: Connection, target: MyClass) -> None:
    """ Generic callback that adds this function for a target change without params """
    targets.add(target)  # Clear cache uses this set to know which instances to clear
    event.listen(get_session(), "after_commit", clear_cache)

event.listen(MyClass, "after_insert", _register_after_commit)
event.listen(MyClass, "after_update", _register_after_commit)
This works most of the time, but I occasionally get DetachedInstanceError when accessing the attributes on the target that I need in order to clear it from the cache (e.g. id). I've read that this happens because of automatic expiration on commit, which makes SQLAlchemy want to refresh all attributes. I can't turn off auto-expiring, nor can I expunge every object that passes through here; either of those could break other pieces of the code base.
A Custom Session
I made my own session class that looked something like this:
class SessionWithCallback(scoped_session):
    """ A version of orm.Session which can call a method after commit completes """

    def __init__(self, session_factory, scopefunc=None) -> None:
        super().__init__(session_factory=session_factory, scopefunc=scopefunc)
        self._callbacks = {}

    def add_callback(self, func, *args, **kwargs) -> None:
        """
        Adds a callback to be called after commit, ensuring only a single
        instance of the callback for each set of args and kwargs is used
        """
        key = f"{func}.{args}.{kwargs}"
        self._callbacks[key] = (func, args, kwargs)

    def run_callbacks(self) -> None:
        """
        Executes all callbacks
        """
        for (func, args, kwargs) in self._callbacks.values():
            func(*args, **kwargs)
        self._callbacks = {}

    def commit(self) -> None:
        """ Flush and commit the current transaction """
        super().commit()
        self.run_callbacks()
Then, instead of using the "after_commit" event, _register_after_commit would call the current session's add_callback function. This seemed to work when running tests with plain SQLAlchemy, but it fell apart when integrating with a Flask app that uses these models and utilizes Flask-SQLAlchemy. I followed the instructions for customizing the session (overriding create_session on the SQLAlchemy instance), but as soon as I commit anything I get an exception saying that scoped_session has no attribute add_callback. I stepped through and it is using my class internally somehow, but the session it gives me is not an instance of my class. Confusing.
I've Considered
Storing primary keys in my listeners, then requiring the callbacks to open a session and query for a new instance if they need more info (a sketch of this idea follows after this list). It might work, but it feels like extra I/O that I shouldn't need, and if several different callbacks fire for one instance, having them all query feels like a lot of duplicated work.
Having some global place to store callbacks instead of on the Session, so that I can avoid the add_callback function. I'd still need to make this session-specific and thread-safe, though. That's easy enough in Flask, but Flask isn't the only app that needs to share this code.
Just doing these cache clears manually... but that's bound to cause developer error.
Spawning some time-delayed job to clear the cache from "after_insert"/"after_update". That gets really complicated really fast and sounds like a real headache. For instance, how do you decide how long to wait?
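For reference, a rough sketch of the first idea, assuming the cache can be cleared from a primary key alone; clear_cache_for is a hypothetical helper and MyClass stands in for the mapped classes involved:

from sqlalchemy import event
from sqlalchemy.orm import Session

@event.listens_for(Session, "after_flush")
def _collect_changed_keys(session, flush_context):
    # Capture keys while the instances are still attached and fully loaded.
    pending = session.info.setdefault("pending_cache_clears", set())
    for obj in list(session.new) + list(session.dirty):
        if isinstance(obj, MyClass):
            pending.add(obj.id)

@event.listens_for(Session, "after_commit")
def _clear_after_commit(session):
    # Runs only after a successful commit, so a concurrent reader cannot
    # re-cache stale data between the change and the clear.
    for key in session.info.pop("pending_cache_clears", set()):
        clear_cache_for(key)  # hypothetical cache-clearing helper

An "after_rollback" listener could pop the same set so that rolled-back changes don't trigger any clearing.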
I am trying to do something like
from dask.distributed import Client  # assuming the client comes from dask.distributed

resource = MyResource()

def fn(x):
    something = dosomething(x, resource)
    return something

client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to construct.
Therefore I would like to construct it once on each worker and have it available for use by fn.
How do I do this?
Or is there some other way to make resource available on all workers?
You can always construct a lazy resource, something like
class GiveAResource():
    resource = [None]

    def get_resource(self):
        if self.resource[0] is None:
            self.resource[0] = MyResource()
        return self.resource[0]
An instance of this will serialise between processes fine, so you can include it as an input to any function to be executed on workers, and then calling .get_resource() on it will get your local expensive resource (which will get remade on any worker which appears later on).
This class would be best defined in a module rather than dynamic code.
There is no locking here, so if several threads ask for the resource at the same time when it has not been needed so far, you will get redundant work.
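If that matters, a small per-process lock closes the gap. A minimal sketch, assuming GiveAResource lives in an importable module as suggested above (so the class, and its lock, are pickled by reference rather than by value):

import threading

class GiveAResource(object):
    _resource = [None]
    _lock = threading.Lock()  # per process; only guards threads within one worker

    def get_resource(self):
        with self._lock:
            if self._resource[0] is None:
                self._resource[0] = MyResource()
            return self._resource[0]

An instance can still be passed to client.map exactly as before; the lock only comes into play inside each worker.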
I'm fairly new to Python and to Spark but let me see if I can explain what I am trying to do.
I have a bunch of different types of pages that I want to process. I created a base class for all the common attributes of those pages and then have a page-specific class inherit from the base class. The idea is that the Spark runner will be able to do the exact same thing for all pages by changing just the page type when called.
Runner
def CreatePage(pageType):
    if pageType == "Foo":
        return PageFoo(pageType)
    elif pageType == "Bar":
        return PageBar(pageType)

def Main(pageType):
    page = CreatePage(pageType)
    pageList_rdd = sc.parallelize(page.GetPageList())
    return pageList_rdd.mapPartitions(lambda pageNumber: CreatePage(pageType).ProcessPage(pageNumber))

print Main("Foo")
PageBaseClass.py
class PageBase(object):
    def __init__(self, pageType):
        self.pageType = pageType
        self.dbConnection = None

    def GetDBConnection(self):
        if self.dbConnection is None:
            # Set up a db connection so we can share this amongst all nodes.
            self.dbConnection = DataUtils.MySQL.GetDBConnection()
        return self.dbConnection

    def ProcessPage(self, pageNumber):
        raise NotImplementedError()
PageFoo.py
class PageFoo(PageBase):
    def __init__(self, pageType):
        super(PageFoo, self).__init__(pageType)
        self.dbConnection = self.GetDBConnection()

    def ProcessPage(self, pageNumber):
        cursor = self.dbConnection.cursor()
        result = cursor.execute("SELECT SOMETHING")
        # other processing
There is a lot of other page-specific functionality that I am omitting for brevity, but the idea is that I'd like to keep all the logic of how to process a page in the page class, and be able to share resources like the db connection and an S3 bucket.
I know that the way that I have it right now, it is creating a new Page object for every item in the rdd. Is there a way to do this so that it is only creating the one object? Is there a better pattern for this? Thanks!
A few suggestions:
Don't create connections directly. Use a connection pool instead (since each executor uses a separate process, setting the pool size to one is just fine), and make sure that connections are automatically closed on timeout.
Use the Borg pattern to store the pool, and adjust your code to use it to retrieve connections (a sketch follows after these suggestions).
You won't be able to share connections between nodes, or even within a single node (see the comment about separate processes). The best guarantee you can get is a single connection per partition (or per group of partitions when the interpreter is reused).
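A minimal sketch of the Borg idea, assuming DataUtils.MySQL.GetDBConnection() from the question is the factory for new connections (a real setup would wrap an actual pool with a timeout rather than a bare connection):

class SharedDBConnection(object):
    # Borg pattern: every instance shares state via a class-level dict,
    # so the connection is created at most once per executor process.
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state

    def get_connection(self):
        if getattr(self, "db_connection", None) is None:
            self.db_connection = DataUtils.MySQL.GetDBConnection()
        return self.db_connection

Inside ProcessPage you would then call SharedDBConnection().get_connection() rather than keeping a connection on each Page instance.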
I am building a messaging app in Python that interfaces with Twitter. I want to create a global FIFO queue that the TwitterConnector object can access to insert new messages.
This is the main section of the app:
#!/usr/bin/python
from twitterConnector import TwitterConnector
from collections import deque  # deque properly lives in collections, not Queue

# instantiate the request queue
requestQueue = deque()

# instantiate the twitter connector
twitterInterface = TwitterConnector()

# Get all new mentions
twitterInterface.GetMentions()
Specifically I want to be able to call the .GetMentions() method of the TwitterConnector class and process any new mentions and put those messages into the queue for processing separately.
This is the twitterConnector class so far:
class TwitterConnector:
    def __init__(self):
        self.connection = twitter.Api(consumer_key=self.consumer_key,
                                      consumer_secret=self.consumer_secret,
                                      access_token_key=self.access_token_key,
                                      access_token_secret=self.access_token_secret)
        self.mentions_since_id = None

    def VerifyCredentials(self):
        print self.connection.VerifyCredentials()

    def GetMentions(self):
        mentions = self.connection.GetMentions(since_id=self.mentions_since_id)
        for mention in mentions:
            print mention.text
            global requestQueue.add(mention) # <- error
Any assistance would be greatly appreciated.
Update: let me see if I can clarify the use case a bit. My TwitterConnector is intended to retrieve messages and will eventually transform each Twitter status object into a simplified object that contains the information needed for downstream processing. In the same method it will perform some other calculations to try to determine other needed information. It is actually the transformed objects that I want to put into the queue; hopefully that makes sense. As a simplified example, this method would map the Twitter status object to a new object with properties such as: origin network (twitter), username, message, and location if available. This new object would then be put into the queue.
Eventually there will be other connectors for other messaging systems that will perform similar actions. They will receive a message, populate the new message object, and place it into the same queue for further processing. Once the message has been completely processed, a response would be formulated and added to an appropriate queue for transmittal. Using the new object described above, once the actual processing of the tweet content has taken place, a response might be returned to the originator via Twitter based upon the origin network and username.
I had thought of passing a reference to the queue in the constructor (or as an argument to the method). I had also thought of modifying the method to return a list of the new objects and iterating over them to put them into the queue, but I wanted to make sure there wasn't a better or more efficient way to handle this. I would also like to be able to do the same thing with a logging object and a database handler. Thoughts?
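As a rough sketch, the transformed object described above might look something like this (class and field names are illustrative only):

class InboundMessage(object):
    """Network-agnostic message built by a connector and queued for processing."""
    def __init__(self, origin_network, username, message, location=None):
        self.origin_network = origin_network  # e.g. "twitter"
        self.username = username
        self.message = message
        self.location = location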
The problem is that global should appear on a line of its own:
global requestQueue
requestQueue.append(mention)  # deque uses append, not add
Moreover, requestQueue must be defined in the module that defines the class.
If you are importing the requestQueue symbol from another module, you don't need the global statement at all:
from some_module import requestQueue

class A(object):
    def foo(self, o):
        requestQueue.append(o)  # Should work
It is generally a good idea to avoid globals; a better design often exists. For details on the issue of globals see for instance [1], [2], [3], [4].
If by using globals you wish to share a single requestQueue between multiple TwitterConnector instances, you could also pass the queue to the connector in its constructor:
class TwitterConnector:
    def __init__(self, requestQueue):
        self.requestQueue = requestQueue
        ...

    def GetMentions(self):
        mentions = self.connection.GetMentions(since_id=self.mentions_since_id)
        for mention in mentions:
            print mention.text
            self.requestQueue.append(mention)
Correspondingly, you need to provide your requestQueue to the constructor as:
twitterInterface = TwitterConnector(requestQueue)
I noticed a strange behaviour today: it seems that, in the following example, the config.CLIENT variable stays persistent across requests – even if the view gets passed an entirely different client_key, the query that fetches the client is only executed once (per many requests), and then config.CLIENT stays assigned.
It does not seem to be a database caching issue.
It happens with mod_python as well as with the test server (the variable is reassigned when the test server is restarted).
What am I missing here?
# views.py
from my_app import config

def get_client(client_key=None):
    if config.CLIENT is None:
        config.CLIENT = get_object_or_404(Client, key__exact=client_key, is_active__exact=True)
    return config.CLIENT

def some_view(request, client_key):
    client = get_client(client_key)
    ...
    return some_response

# config.py
CLIENT = None
Multiple requests are processed by the same process, and global variables like your CLIENT live as long as the process does. You shouldn't rely on global variables when processing requests: use local variables when you only need a value for the duration of building a response, or put data into the database when something must persist across multiple requests.
If you need to keep some value for the duration of a request, you can either add it to thread locals (you can find examples that add user info to thread locals this way) or simply pass it as a variable into other functions.
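A minimal sketch of the thread-locals option, with module and function names of my own choosing (not Django APIs):

# request_state.py (hypothetical helper module)
import threading

_locals = threading.local()

def set_client(client):
    _locals.client = client

def get_client():
    return getattr(_locals, "client", None)

A view (or middleware) would call set_client() with the client looked up for the current request, and code further down the stack would call get_client() instead of reading a module-level global, so one request's client no longer bleeds into the next as long as every request sets it first.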
OK, just to make it slightly clearer (and in response to the comment by Felix), I'm posting the code that does what I needed. The whole problem arose from a fundamental misunderstanding on my part and I'm sorry for any confusion I might have caused.
import config

# This will be called once per request/view
def init_client(client_key):
    config.CLIENT = get_object_or_404(Client, key__exact=client_key, is_active__exact=True)

# This might be called from other modules that are unaware of requests, views etc.
def get_client():
    return config.CLIENT