How to implement a FIFO queue that supports namespaces - python

I'm using the following approach to handle a FIFO queue based on Google App Engine db.Model (see this question).
from google.appengine.ext import db
from google.appengine.ext import webapp
from google.appengine.ext.webapp.util import run_wsgi_app

class QueueItem(db.Model):
    created = db.DateTimeProperty(required=True, auto_now_add=True)
    data = db.BlobProperty(required=True)

    @staticmethod
    def push(data):
        """Add a new queue item."""
        return QueueItem(data=data).put()

    @staticmethod
    def pop():
        """Pop the oldest item off the queue."""
        def _tx_pop(candidate_key):
            # Try and grab the candidate key for ourselves. This will fail if
            # another task beat us to it.
            task = QueueItem.get(candidate_key)
            if task:
                task.delete()
            return task
        # Grab some tasks and keep trying until we find one that hasn't been
        # taken by someone else ahead of us.
        while True:
            candidate_keys = QueueItem.all(keys_only=True).order('created').fetch(10)
            if not candidate_keys:
                # No tasks in queue
                return None
            for candidate_key in candidate_keys:
                task = db.run_in_transaction(_tx_pop, candidate_key)
                if task:
                    return task
This queue works as expected (very well).
Right now my code has a method that accesses this FIFO queue, invoked from a deferred task:
def deferred_worker():
    data = QueueItem.pop()
    do_something_with(data)
I would like to enhance this method and the queue data structure by adding a client_ID parameter representing a specific client that needs to access its own queue.
Something like:
def deferred_worker(client_ID):
    data = QueueItem_of_this_client_ID.pop()  # I need to implement this
    do_something_with(data)
How could I code the Queue to be client_ID aware?
Constraints:
- The number of clients is dynamic and not predefined
- Taskqueue is not an option (1. there is a maximum of ten queues; 2. I would like to have full control over my queue)
Do you know how I could add this behaviour using the new Namespaces API? (Remember that I'm not calling the db.Model from a webapp.RequestHandler.)
Another option: I could add a client_ID db.StringProperty to QueueItem and use it as a filter in the pop method:
QueueItem.all(keys_only=True).filter('client_ID =', an_ID).order('created').fetch(10)
Any better idea?

Assuming your "client class" is really a request handler the client calls, you could do something like this:
from google.appengine.api import users
from google.appengine.api.namespace_manager import set_namespace

class ClientClass(webapp.RequestHandler):
    def get(self):
        # For this example let's assume the user_id is your unique id.
        # You could just as easily use a parameter you are passed.
        user = users.get_current_user()
        if user:
            # If there is a user, use their queue. Otherwise the global queue.
            set_namespace(user.user_id())
        item = QueueItem.pop()
        self.response.out.write(str(item))
        QueueItem.push('The next task.')
Alternatively, you could also set the namespace app-wide.
By setting the default namespace, all calls to the datastore will be "within" that namespace unless you explicitly specify otherwise. Just note that to fetch and run tasks you'll have to know the namespace, so you probably want to maintain a list of namespaces in the default namespace for cleanup purposes.
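For instance, a minimal sketch of a namespace-aware push that also records which client namespaces exist (KnownNamespace is a hypothetical bookkeeping model, not part of the original code) might look like this:

from google.appengine.api import namespace_manager
from google.appengine.ext import db

class KnownNamespace(db.Model):
    """Hypothetical bookkeeping model; stored in the default namespace."""
    name = db.StringProperty(required=True)

def push_for_client(client_id, data):
    previous = namespace_manager.get_namespace()
    try:
        # Record the client namespace in the default namespace, then switch.
        KnownNamespace.get_or_insert(client_id, name=client_id)
        namespace_manager.set_namespace(client_id)
        return QueueItem.push(data)
    finally:
        namespace_manager.set_namespace(previous)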

As I said in response to your query on my original answer, you don't need to do anything to make this work with namespaces: the datastore, on which the queue is built, already supports namespaces. Just set the namespace as desired, as described in the docs.
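Concretely, the deferred worker from the question could be a sketch along these lines; saving and restoring the previous namespace is just defensive and not required by the API:

from google.appengine.api import namespace_manager

def deferred_worker(client_ID):
    previous = namespace_manager.get_namespace()
    try:
        # Every datastore call below, including QueueItem.pop(),
        # now operates inside this client's namespace.
        namespace_manager.set_namespace(client_ID)
        data = QueueItem.pop()
        if data is not None:
            do_something_with(data)
    finally:
        namespace_manager.set_namespace(previous)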


Are there drawbacks to executing code inside an SQLAlchemy managed session, and if so, why?

I have seen different "patterns" for handling this case, so I am wondering whether one has any drawbacks compared to the other.
So let's assume that we wish to create a new object of class MyClass and add it to the database. We can do the following:
class MyClass:
    pass

def builder_method_for_myclass():
    # A lot of code here..
    return MyClass()

my_object = builder_method_for_myclass()

with db.managed_session() as s:
    s.add(my_object)
which seems to keep the session open only for adding the new object. But I have also seen cases where the entire builder method is called and executed within the managed session, like so:
class MyClass:
    pass

def builder_method_for_myclass():
    # A lot of code here..
    return MyClass()

with db.managed_session() as s:
    my_object = builder_method_for_myclass()
Are there any downsides to either of these methods, and if so, what are they? I can't find anything specific about this in the documentation.
When you build objects that depend on objects fetched from a session, you have to be in a session. So a factory function can only execute outside a session in the simplest cases. Usually you have to pass the session around or make it available on a thread local.
For example, in this case, to build a product I need to fetch the product category from the database into the session. So my product factory function depends on the session instance. The new product is created and added to the same session that the category is also in. An implicit commit should also occur when the session ends, i.e. when the context manager completes.
def build_product(session, category_name):
    category = session.query(ProductCategory).where(
        ProductCategory.name == category_name).first()
    return Product(category=category)

with db.managed_session() as s:
    my_product = build_product(s, "clothing")
    s.add(my_product)
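If you would rather not pass the session around explicitly, the thread-local option mentioned above typically means a scoped_session. A rough sketch, reusing the same placeholder models (the engine URL here is only an example):

from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

# Example engine; in a real application this comes from your configuration.
engine = create_engine("sqlite:///:memory:")
Session = scoped_session(sessionmaker(bind=engine))

def build_product(category_name):
    # The factory fetches the current thread's session itself,
    # so callers no longer have to pass one in.
    session = Session()
    category = session.query(ProductCategory).filter(
        ProductCategory.name == category_name).first()
    return Product(category=category)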

Initializing state on dask-distributed workers

I am trying to do something like
resource = MyResource()

def fn(x):
    something = do_something(x, resource)
    return something

client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to construct.
Therefore I would like to construct it once on each worker and have it available to be used by fn.
How do I do this?
Or is there some other way to make resource available on all workers?
You can always construct a lazy resource, something like
class GiveAResource():
    resource = [None]

    def get_resource(self):
        if self.resource[0] is None:
            self.resource[0] = MyResource()
        return self.resource[0]
An instance of this will serialise between processes fine, so you can include it as an input to any function to be executed on workers, and then calling .get_resource() on it will get your local expensive resource (which will get remade on any worker which appears later on).
This class would be best defined in a module rather than dynamic code.
There is no locking here, so if several threads ask for the resource at the same time when it has not been needed so far, you will get redundant work.
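A usage sketch, reusing the question's placeholder names (fn, MyResource, do_something, data), might look like this:

from dask.distributed import Client

giver = GiveAResource()  # cheap to pickle; no real resource built yet

def fn(x, giver):
    # The first call on a given worker constructs MyResource();
    # subsequent calls on that worker reuse the cached instance.
    resource = giver.get_resource()
    return do_something(x, resource)

client = Client()
futures = client.map(fn, data, giver=giver)
results = client.gather(futures)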

Celery pickling not playing nice with Cassandra driver, can't figure out the root cause

I'm experiencing some behavior that I can't quite figure out. I'm using Cassandra to store message objects, and I'm using Celery for async pulls and pushes to the database. Everything is working fine, except for a single Celery task; the other tasks that use the same code/classes work. Here's a rough breakdown of the code logic:
db_manager = DBManager()

class User(object):
    def __init__(self, user_id):
        # ... normal init stuff ...
        self.loader()

    @run_async
    def loader(self):
        # ... loads from database if found, otherwise pulls from API ...

    # THIS WORKS
    @celery.task(name='user-to-db', filter=task_method)
    def to_db(self):
        # db_manager is a custom backend that handles relevant db reads, writes, etc.
        db_manager.add('users', self.user_payload)

    # THIS WORKS
    @celery.task(name='load-friends', filter=task_method)
    def load_friends(self):
        # Checks secondary redis index for friends of user
        friends = redis.srandmember('users:the-users-id:friends', self.id, 20)
        if not friends:
            profiles = load_friends_from_api(user_id=self.id)
        else:
            query = "SELECT * FROM keyspace.users WHERE id IN ({friends})".format(friends=friends)
        # Init a User object for every friend
        loaded_friends = [User(friend) for friend in profiles]
        # Returns a class container with all the instances of User(friend), accessible through a class property
        return FriendContainer(self.id, loaded_friends)

    # THIS DOES NOT WORK
    @celery.task(name='get-user-messages', filter=task_method)
    def get_user_messages(self):
        # THIS IS WHERE IT FAILS #
        messages = db_manager.get("SELECT message FROM keyspace.message_timelines WHERE user_id = {user_id}".format(user_id=self.id))
        # THAT LINE ABOVE #
        # Init a message class object for every message payload in database
        msgs = [Message(m, user=self) for m in messages]
        # Returns a message container class holding all the message objects, accessible through a class property
        return MessageContainer(msgs)
This last class method throws an error:
File "/usr/local/lib/python2.7/dist-packages/kombu/serialization.py", line 356, in pickle_dumps
return dumper(obj, protocol=pickle_protocol)
EncodeError: Can't pickle <class 'cassandra.io.eventletreactor.message'>: attribute lookup cassandra.io.eventletreactor.message failed
cassandra.io.eventletreactor.message points to a user-defined type in Cassandra that I use as a container for message objects per user. The line that throws this error is:
messages = db_manager.get("SELECT message FROM keyspace.message_timelines WHERE user_id = {user_id}".format(user_id=self.id))
This is the method from DBManager():
class DBManager(object):
    # ... stuff ...

    def get(self, query):
        # I do some stuff to prepare the query, namely substituting `WHERE this = that`
        # with `WHERE this = ?` to create a Cassandra prepared statement.
        statement = cassandra.prepare(query_prepared)
        # I want these messages as a dict, not the default namedtuple
        cassandra.row_factory = dict_factory
        # User id is parsed out of the query
        results = cassandra.execute(statement, (user_id,))
        rows = results.current_rows
        # rows is a list of dicts, no weird class references or anything in there
        return rows
I've read that making Celery tasks out of class methods is/was somewhat experimental, but I can't figure out why all the other methods-as-tasks that use the same instance of DBManager are working.
The problem seems to be localized to some issue with the user-defined type message that's not playing nice within the Cassandra driver; however, if I run the get method from DBManager within the Celery task itself, it works. That is, if I copy/paste the code that is throwing the error from DBManager.get into User.get_user_messages, it works fine. If I try to call DBManager.get from within User.get_user_messages, it breaks.
I just can't figure out where the problem is. I can do all of the following just fine:
- Run the get_user_messages method without Celery, and it works.
- Run the get_user_messages method WITH Celery if I run the get method code right in the Celery task method itself.
- Run other methods registered as Celery tasks that point to other methods in DBManager that use the Cassandra driver, even ones that insert the same message user-defined type into the database.
I've tried pickling ALL THE THINGS all the way down myself, in various combinations, and can't reproduce the error.
What I have not tried:
- Changing the serializer to json or yaml. There are a few convenience items in the db payload that won't serialize with either of those two.
- Using dill instead of pickle. It seems like this should work without having to switch serializers, given that I can get various parts working separately.
I could just say screw it and run the query directly through the Cassandra driver instead of my DBManager class, but I feel like this should be solvable and I'm just missing something really, really obvious, so obvious that I'm not seeing it. Any suggestions on where to look would be greatly appreciated.
In case of relevance: Cassandra 3.3, CQL 3.4, DataStax python driver 3.1
Meh, I found the problem, and it WAS really obvious. I guess I didn't actually try pickling all the things, just most of the things, and I didn't catch this in my 4am debugging stupor.
At any rate, cassandra.row_factory = dict_factory, when called on a user defined type, doesn't actually return everything as a dict. It gives a dict of {'label': message(x='this', y='that')}, where message is a namedtuple. The Cassandra driver dynamically creates the namedtuple inside of a class instance, and so pickle couldn't find it.
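One possible workaround (a sketch, not necessarily what was done here) is to flatten namedtuple values into plain dicts before the rows leave DBManager.get, since every namedtuple exposes _asdict():

def _plain(value):
    # Recursively convert namedtuples (like the driver-generated `message`
    # row type) and containers into plain dicts/lists so pickle can handle them.
    if hasattr(value, "_asdict"):
        return {k: _plain(v) for k, v in value._asdict().items()}
    if isinstance(value, dict):
        return {k: _plain(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [_plain(v) for v in value]
    return value

# Inside DBManager.get, instead of returning results.current_rows directly:
rows = [_plain(row) for row in results.current_rows]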

Access global queue object in python

I am building a messaging app in Python that interfaces with Twitter. I want to create a global FIFO queue that the TwitterConnector object can access to insert new messages.
This is the main section of the app:
#!/usr/bin/python
from twitterConnector import TwitterConnector
from collections import deque

# Instantiate the request queue
requestQueue = deque()

# Instantiate the twitter connector
twitterInterface = TwitterConnector()

# Get all new mentions
twitterInterface.GetMentions()
Specifically I want to be able to call the .GetMentions() method of the TwitterConnector class and process any new mentions and put those messages into the queue for processing separately.
This is the twitterConnector class so far:
class TwitterConnector:
    def __init__(self):
        self.connection = twitter.Api(consumer_key=self.consumer_key,
                                      consumer_secret=self.consumer_secret,
                                      access_token_key=self.access_token_key,
                                      access_token_secret=self.access_token_secret)
        self.mentions_since_id = None

    def VerifyCredentials(self):
        print self.connection.VerifyCredentials()

    def GetMentions(self):
        mentions = self.connection.GetMentions(since_id=self.mentions_since_id)
        for mention in mentions:
            print mention.text
            global requestQueue.add(mention)  # <- error
Any assistance would be greatly appreciated.
Update: let me see if I can clarify the use case a bit. My TwitterConnector is intended to retrieve messages and will eventually transform the twitter status object into a simplified object that contains the needed information for downstream processing. In the same method, it will perform some other calculations to try to determine other needed information. It is actually the transformed objects that I want to put into the queue. Hopefully that makes sense. As a simplified example, this method would map the twitter status object to a new object with properties such as: origin network (twitter), username, message, and location if available. This new object would then be put into the queue.
Eventually there will be other connectors for other messaging systems that will perform similar actions. They will receive a message, populate the new message object, and place it into the same queue for further processing. Once the message has been completely processed, a response would be formulated and then added to an appropriate queue for transmittal. Using the new object described above, once the actual processing of the tweet content has taken place, a response might be returned to the originator via Twitter based on the origin network and username.
I had thought of passing a reference to the queue in the constructor (or as an argument to the method). I had also thought of modifying the method to return a list of the new objects and iterating over them to put them into the queue, but I wanted to make sure there wasn't a better or more efficient way to handle this. I also wish to be able to do the same thing with a logging object and a database handler. Thoughts?
The problem is that global should appear on a line of its own:
global requestQueue
requestQueue.append(mention)  # note: deque uses append(), not add()
Moreover, requestQueue must be defined in the module that defines the class.
If you are importing the requestQueue symbol from another module, you don't need the global at all.
from some_module import requestQueue

class A(object):
    def foo(self, o):
        requestQueue.append(o)  # Should work
It is generally a good idea to avoid globals; a better design often exists. For details on the issues with globals see for instance ([1], [2], [3], [4]).
If by using globals you wish to share a single requestQueue between multiple TwitterConnector instances, you could also pass the queue to the connector in its constructor:
class TwitterConnector:
    def __init__(self, requestQueue):
        self.requestQueue = requestQueue
        ...

    def GetMentions(self):
        mentions = self.connection.GetMentions(since_id=self.mentions_since_id)
        for mention in mentions:
            print mention.text
            self.requestQueue.append(mention)
Correspondingly, you need to provide your requestQueue to the constructor as:
twitterInterface = TwitterConnector(requestQueue)

Initializing cherrypy.session early

I love CherryPy's API for sessions, except for one detail. Instead of saying cherrypy.session["spam"] I'd like to be able to just say session["spam"].
Unfortunately, I can't simply have a global from cherrypy import session in one of my modules, because the cherrypy.session object isn't created until the first time a page request is made. Is there some way to get CherryPy to initialize its session object immediately instead of on the first page request?
I have two ugly alternatives if the answer is no:
First, I can do something like this
def import_session():
    global session
    while not hasattr(cherrypy, "session"):
        sleep(0.1)
    session = cherrypy.session

Thread(target=import_session).start()
This feels like a big kludge, but I really hate writing cherrypy.session["spam"] every time, so to me it's worth it.
My second solution is to do something like
class SessionKludge:
    def __getitem__(self, name):
        return cherrypy.session[name]

    def __setitem__(self, name, val):
        cherrypy.session[name] = val

session = SessionKludge()
but this feels like an even bigger kludge and I'd need to do more work to implement the other dictionary functions such as .get
So I'd definitely prefer a simple way to initialize the object myself. Does anyone know how to do this?
For CherryPy 3.1, you would need to find the right subclass of Session, run its 'setup' classmethod, and then set cherrypy.session to a ThreadLocalProxy. That all happens in cherrypy.lib.sessions.init, in the following chunks:
# Find the storage class and call setup (first time only).
storage_class = storage_type.title() + 'Session'
storage_class = globals()[storage_class]
if not hasattr(cherrypy, "session"):
    if hasattr(storage_class, "setup"):
        storage_class.setup(**kwargs)

# Create cherrypy.session which will proxy to cherrypy.serving.session
if not hasattr(cherrypy, "session"):
    cherrypy.session = cherrypy._ThreadLocalProxy('session')
Reducing (replace FileSession with the subclass you want):
FileSession.setup(**kwargs)
cherrypy.session = cherrypy._ThreadLocalProxy('session')
The "kwargs" consist of "timeout", "clean_freq", and any subclass-specific entries from tools.sessions.* config.
