Managing global data in a namespaced, multitenant App Engine application - Python

I'm designing a multitenant system using namespaces.
Users authenticate via OpenID, and a User model is saved in Cloud Datastore. Users will be grouped into Organizations, also modeled in the database. Application data needs to be partitioned by organization.
So the idea is to map namespaces to "Organizations".
When a user logs in, their Organization is looked up and saved in a session.
WSGI middleware inspects the session and sets the namespace accordingly.
My question relates to how best to manage switching between data that is "global" (i.e. User and Organization) and application data (namespaced by organization)
My current approach is to use Python decorators and context managers to temporarily switch to the global namespace for operations that access such global data, e.g.
standard_datastore_op()
with global_namespace():
    # Fetch the Organization from the global namespace by key.
    org = org_key.get()
another_standard_datastore_op(org.name)
or
@global_namespace
def process_login(user_id):
    user = User.get_by_id(user_id)
This also implies that models have cross-namespace KeyProperties:
class DomainData(ndb.Model):  # in the current user's namespace
    title = ndb.StringProperty()
    foreign_org = ndb.KeyProperty(Organization)  # in the "global" namespace
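(Because an ndb key records its namespace, dereferencing such a cross-namespace KeyProperty shouldn't require switching namespaces first; a sketch, assuming the stored key was created in the global namespace:)
data = DomainData.query().get()
# foreign_org's key already embeds the global namespace, so .get()
# fetches from there regardless of the current namespace setting.
org = data.foreign_org.get()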
Does this seem a reasonable approach? It feels a little fragile to me, but I suspect that is because I'm new to working with namespaces in App Engine. My alternative idea is to extract all "global" data from Cloud Datastore into an external web service, but I'd rather avoid that if possible.
Advice gratefully received. Thanks in advance.

Decorators are a perfectly fine approach, and they have the added benefit of clearly labeling which functions operate outside organization-specific namespace bounds.
from google.appengine.api import namespace_manager

def global_namespace(global_namespace_function):
    def wrapper(*args, **kwargs):
        # Save the current namespace.
        previous_namespace = namespace_manager.get_namespace()
        try:
            # The empty string is the default namespace; change it to
            # whatever you want to use as 'global'.
            namespace_manager.set_namespace('')
            # Run the code that requires the global namespace.
            return global_namespace_function(*args, **kwargs)
        finally:
            # Restore the saved namespace.
            namespace_manager.set_namespace(previous_namespace)
    return wrapper
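Since the question also uses global_namespace() as a context manager, the same save/switch/restore logic can be packaged with contextlib (a sketch, using the same empty-string-as-global convention):
import contextlib
from google.appengine.api import namespace_manager

@contextlib.contextmanager
def global_namespace():
    # Save the current namespace and switch to the global one.
    previous_namespace = namespace_manager.get_namespace()
    namespace_manager.set_namespace('')
    try:
        yield
    finally:
        # Restore the saved namespace even if the body raised.
        namespace_manager.set_namespace(previous_namespace)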
On a related note, we also have documentation on using namespaces for multitenancy.

Related

Advice on database access approach in a custom environment using Python Pyramid

I’m new to Pyramid. I’ve used Python for a few months. I've created a Python application on Linux to maintain an Oracle database using weekly data feeds from a vendor. To get that done, one of the things I did was to create a customized database wrapper class using the cx_Oracle package. I had specific requirements for maintaining history in the DB. All Oracle access goes through the methods in this wrapper. I now want to use Pyramid to create a simple reporting browser interface to the Oracle DB. To allow me the greatest flexibility, I’d like to use the wrapper I already have to get to the data on Oracle instead of SQLAlchemy (or possibly with it, I'm not sure).
In my Pyramid app, I’ve considered importing my wrapper in my views.py __init__ method, but that seems to get executed with every browser submit.
Can anyone suggest how I might create a persistent connection to Oracle that I can use over and over from my reporting application which uses my wrapper class? I’m finding Pyramid a bit opaque. I’m never sure what’s happening behind the scenes, but I’m willing to operate on trust until I get the swing of it. I need the benefit of the automatic authorization/authentication and login.
What I’m really looking for is a good approach from experienced Pyramid users before going down the wrong track.
Many thanks.
You should definitely use SQLAlchemy, as it provides connection pooling and the like. In your case SQLAlchemy would use cx_Oracle underneath, but it lets you concentrate on writing actual code instead of maintaining and pooling connections yourself.
You should follow the patterns in Wiki2 tutorial to set up basic SQLAlchemy.
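A minimal sketch of such a setup (the connection string and pool size are placeholders):
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

# SQLAlchemy drives cx_Oracle underneath and manages the pool for you.
engine = create_engine(
    'oracle+cx_oracle://scott:tiger@dbhost:1521/orcl',
    pool_size=5,
)
Session = scoped_session(sessionmaker(bind=engine))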
So, basically, the question boils down to "how do I use my existing API with Pyramid", right? This is quite easy as Pyramid is database-agnostic and very transparent in this area despite what you say :) Basically, you import it, call its methods, retrieve data and send it to the template:
from pyramid.view import view_config

import pokemon_api as api

@view_config(route_name='list_pokemon', renderer='pokemon_list.mako')
def list_pokemon(request):
    # To illustrate how to get data from the request to send to your API
    # (query-string values arrive as strings, hence the int() casts).
    batch_start = int(request.GET.get('batch_start', 0))
    batch_end = batch_start + int(request.GET.get('batch_size', 0))
    sort_order = request.GET.get('sort_by', 'name')
    # Here we call our API - we don't actually care where it gets the
    # data from, it's a black box. ('from' is a reserved word in Python,
    # so the keyword arguments are named start/end here.)
    pokemon = api.retrieve_pokemon(start=batch_start, end=batch_end, sort=sort_order)
    # Send the data to the renderer/template.
    return {
        'pokemon': pokemon,
    }
from pyramid.httpexceptions import HTTPFound

@view_config(route_name='add_pokemon', request_method='POST')
def add_pokemon(request):
    """
    Add a new Pokemon.
    """
    # Form fields come from request.POST on a POST request.
    name = request.POST.get('name', '')
    weight = request.POST.get('weight', 0)
    height = request.POST.get('height', 0)
    api.create(name=name, weight=weight, height=height)
    # Go back to the listing.
    return HTTPFound('/pokemon_list')
And if your API needs some initialization, you can do it at startup time:
from pyramid.config import Configurator

import pokemon_api as api

def main(global_config, **settings):
    """ This function returns a Pyramid WSGI application. """
    config = Configurator(settings=settings)
    ...
    api.init("MAGIC_CONNECTION_STRING")
    return config.make_wsgi_app()
Of course, this assumes your API already handles transactions, connections, pooling and other boring stuff :)
One last point to mention: in Python you generally don't import things inside methods; you import them at the top of the file, at module-level scope. There are exceptions to that rule, but I don't see why you would need one in this case. Also, importing a module should be free of side effects (which yours might have, since "importing my wrapper in my views.py __init__ method" seems to be causing trouble).

Google app engine - Python: How to use Namespaces and Ancestors?

I'm trying to understand how namespaces and ancestors work in GAE.
So I made a simple multitenant "ToDo List" application where each tenant is a user. So I set the namespaces in appengine_config.py like so:
def namespace_manager_default_namespace_for_request():
    # Note: users.get_current_user() returns None when nobody is signed
    # in, so this assumes all handlers require login.
    name = users.get_current_user().user_id()
    return name
And then in my main.py I manage the views like so:
# Make tasks strongly consistent by giving them a parent.
DEFAULT_TASKS_ANCESTOR = ndb.Key('Agenda', 'default_agenda')

class TasksList(webapp2.RequestHandler):
    def get(self):
        namespace = namespace_manager.get_namespace()
        tasks_query = Task.query(ancestor=DEFAULT_TASKS_ANCESTOR)
        tasks = tasks_query.fetch(10, use_cache=False)
        context = {
            'tasks': tasks,
            'namespace': namespace,
        }
        template = JINJA_ENVIRONMENT.get_template('templates/tasks_list.html')
        self.response.write(template.render(context))

    def post(self):
        name = cgi.escape(self.request.get('name'))
        task = Task(parent=DEFAULT_TASKS_ANCESTOR)
        task.name = name
        task.put()
        self.redirect('/tasks_list')
...And that leaked data:
I logged in as userA, created a task, then logged out and logged in again as a different user (userB), and I could see the task from userA.
I confirmed from the admin panel that the task entities are in different namespaces, and that the ancestor keys are also different (since they include the namespace?).
I also turned caching off by using .fetch(10, use_cache=False), but the problem was still there, so it was not a caching issue.
So finally I created the parent key just before calling .query() and .put(), and it did work! Like so:
DEFAULT_TASKS_ANCESTOR = ndb.Key('Agenda', 'default_agenda')
tasks_query = Task.query(ancestor=DEFAULT_TASKS_ANCESTOR)
But now I have questions...
1. Why did it work? Since those task entities don't share the same namespace, how could I query data from another namespace? Even if I query with an ancestor key from another namespace by mistake, should I get its data?
2. Is DEFAULT_TASKS_ANCESTOR from namespaceA the same ancestor as DEFAULT_TASKS_ANCESTOR from namespaceB, or are they two completely different ancestors?
3. Do namespaces really compartmentalize data in the datastore, or is that not the case?
4. If ancestors play that big a role even with namespaces, should I use namespaces or ancestors to compartmentalize data in a multitenant application?
Thank you in advance!
I think your problem was defining the DEFAULT_TASKS_ANCESTOR constant outside of the request handlers. In that case, the datastore used the default namespace to create the key, so by querying with that ancestor key you always queried the default namespace.
From Google's datastore documentation:
By default, the datastore uses the current namespace setting in the namespace manager for datastore requests. The API applies this current namespace to Key or Query objects when they are created. Therefore, you need to be careful if an application stores Key or Query objects in serialized forms, since the namespace is preserved in those serializations.
As you found out, it worked when you defined the ancestor key inside the request handlers. You can also set the namespace for a particular query in its constructor (e.g. ndb.Query(namespace='1234')), so it is possible to query entities from a different namespace.
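A small sketch of both points, using the kind and id from the question:
from google.appengine.ext import ndb

# The namespace is baked into a key when it is created:
key_a = ndb.Key('Agenda', 'default_agenda', namespace='userA')
key_b = ndb.Key('Agenda', 'default_agenda', namespace='userB')
assert key_a != key_b  # same kind and id, but different namespaces

# An ancestor query targets the namespace recorded in its ancestor key,
# so this fetches userA's tasks even if the current namespace is userB:
tasks = Task.query(ancestor=key_a).fetch(10)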
For your other questions:
2. These are, in my understanding, different keys, since they include different namespaces.
3. To get a hard compartmentalization, you could check on the server side whether the namespace used for a query equals the user ID of the current user.
4. It really depends on the overall structure of the application. IMHO, ancestor keys are more relevant with regard to strong consistency. Think of a user who has several todo lists: he should be able to see them all, but strong consistency may only be required at the list level, as sketched below.
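A sketch of that list-level ancestor idea (the TodoList kind and the 'groceries' id are hypothetical):
from google.appengine.ext import ndb

# One ancestor per todo list instead of one app-wide ancestor, so
# strongly consistent reads are scoped to a single list.
list_key = ndb.Key('TodoList', 'groceries')
task = Task(parent=list_key, name='buy milk')
task.put()
tasks = Task.query(ancestor=list_key).fetch(10)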

How to users.set_current_user?

How to create a custom login system with google-app-engine's User class?
I'm making my own login system. I have tried os.environ['USER_EMAIL'] = 'john.doe@example.com', but it doesn't work. I guess the information is not saved in a session, so if I visit another page then users.get_current_user() has no memory of what I had set it to previously.
I have seen google-app-engine unit tests that do this to fake logins. I think those work because the tests run within the same session. I need to store my own custom information across multiple sessions.
I have also tried setting environment variables, but without luck.
os.environ['USER_EMAIL'] = 'john.doe@example.com'
os.environ['USER_ID'] = 'todo_user_id'
os.environ['USER_IS_ADMIN'] = '1'
os.environ['AUTH_DOMAIN'] = 'todo_auth_domain'
os.environ['FEDERATED_IDENTITY'] = 'todo_federated_identity'
os.environ['FEDERATED_PROVIDER'] = 'todo_federated_provider'
You can't - the Users API only supports signing in with methods supported by that API. If you want to implement your own signin, you will need to use a third-party session library and keep track of logged in users the 'regular' way.
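For example, with webapp2's session support (webapp2_extras.sessions; the handlers are illustrative, and the app config needs a 'secret_key' entry):
import webapp2
from webapp2_extras import sessions

class BaseHandler(webapp2.RequestHandler):
    def dispatch(self):
        # Attach a session store to every request and save it afterwards.
        self.session_store = sessions.get_store(request=self.request)
        try:
            super(BaseHandler, self).dispatch()
        finally:
            self.session_store.save_sessions(self.response)

    @property
    def session(self):
        return self.session_store.get_session()

class LoginHandler(BaseHandler):
    def post(self):
        # Validate credentials your own way, then remember the user:
        self.session['user_email'] = self.request.get('email')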

Alternative to singleton?

I'm a Python & App Engine (and server-side!) newbie, and I'm trying to create a very simple CMS. Each deployment of the application would have one (and only one) company object, instantiated from something like:
class Company(db.Model):
    name = db.StringProperty()
    profile = db.TextProperty()
    addr = db.TextProperty()
I'm trying to provide the facility to update the company profile and other details.
My first thought was to have a Company entity singleton. But having looked at (although far from totally grasped) this thread I get the impression that it's difficult, and inadvisable, to do this.
So then I thought that perhaps for each deployment of the CMS I could, as a one-off, run a script (triggered by a totally obscure URL) which simply instantiates Company. From then on, I would get this instance with theCompany = Company.all()[0]
Is this advisable?
Then I remembered that someone in that thread suggested simply using a module. So I just created a Company.py file and stuck a few variables in it. I've tried this in the SDK and it seems to work; to my surprise, modified variable values "survived" between requests.
Forgive my ignorance, but I assume these values are only held in memory rather than on disk, unlike Datastore data? Is this a robust solution? (And would the module variables be in scope for all invocations of my application's scripts?)
Global variables are "app-cached." This means that each particular instance of your app will remember these variables' values between requests. However, when an instance is shutdown these values will be lost. Thus I do not think you really want to store these values in module-level variables (unless they are constants which do not need to be updated).
I think your original solution will work fine. You could even create the original entity using the remote API tool so that you don't need an obscure page to instantiate the one and only Company object.
You can also make the retrieval of the singleton Company entity a bit faster if you retrieve it by key.
If you will need to retrieve this entity frequently, you can avoid round-trips to the datastore with a caching technique. The fastest is to app-cache the Company entity after you've retrieved it from the datastore. To protect against the entity becoming too out of date, you can also app-cache the time you last retrieved it, and if that time is more than N seconds old, re-fetch it from the datastore. For more details on this option and how it compares to the alternatives, check out Nick Johnson's article Storage options on App Engine.
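A minimal sketch of that time-bounded app cache (the TTL value and the key_name 'c' are placeholders; see the key_name suggestion below):
import time

_company = None          # app-cached entity
_company_fetched_at = 0  # when we last fetched it
_TTL_SECONDS = 60        # how stale we are willing to be

def get_company():
    global _company, _company_fetched_at
    now = time.time()
    if _company is None or now - _company_fetched_at > _TTL_SECONDS:
        # Re-fetch from the datastore at most once per TTL window.
        _company = Company.get_by_key_name('c')
        _company_fetched_at = now
    return _company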
It sounds like you are trying to provide a way for your app to be configurable on a per-application basis.
Why not use the datastore to store your company entity with a key_name? Then you will always know how to fetch the company entity, and you'll be able to edit the company without redeploying.
company = Company(key_name='c')
# set stuff on company....
company.put()
# later in code...
company = Company.get_by_key_name('c')
Use memcache to store the details of the company and avoid repeated datastore calls.
In addition to memcache, you can use module variables to cache the values. They are cached, as you have seen, between requests.
I think the approach you read about is the simplest:
Use module variables, initialized to None.
Provide accessors (getters/setters) for these variables.
When a variable is accessed, if its value is None, fetch it from the datastore; otherwise, just use it.
This way you'll have app-wide variables provided by the module (which won't be instantiated again and again); they will be shared, and you won't lose them.
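A sketch of that pattern (the key_name 'c' is borrowed from the earlier answer):
_company = None  # module-level cache, shared within an instance

def get_company():
    global _company
    if _company is None:
        _company = Company.get_by_key_name('c')
    return _company

def set_company(company):
    global _company
    company.put()
    _company = company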

Design pattern to organize non-trivial ORM queries?

I am developing a web API with 10 tables or so in the backend, with several one-to-many and many-to-many associations. The API essentially is a database wrapper that performs validated updates and conditional queries. It's written in Python, and I use SQLAlchemy for ORM and CherryPy for HTTP handling.
So far I have separated the 30-some queries the API performs into functions of their own, which look like this:
# in module "services.inventory"
def find_inventories(session, user_id, *inventory_ids, **kwargs):
    query = session.query(Inventory, Product)
    query = query.filter_by(user_id=user_id, deleted=False)
    ...
    return query.all()
def find_inventories_by(session, app_id, user_id, by_app_id, by_type, limit, page):
    ...
# in another service module
def remove_old_goodie(session, app_id, user_id):
    try:
        old = _current_goodie(session, app_id, user_id)
        services.inventory._remove(session, app_id, user_id, [old.id])
    except ServiceException as e:
        # log it and do stuff
        ...
The CherryPy request handler calls the query methods, which are scattered across several service modules, as needed. The rationale behind this solution is, since they need to access multiple model classes, they don't belong to individual models, and also these database queries should be separated out from direct handling of API accesses.
I realize that the above code might be called Foreign Methods in the realm of refactoring. I could well live with this way of organizing things for a while, but as things are starting to look a little messy, I'm looking for a way to refactor this code.
Since the queries are tied directly to the API and its business logic, they are hard to generalize like getters and setters.
Repeating the session argument like that smells, but since the current implementation of the API creates a new CherryPy handler instance (and therefore a new session object) for each API call, there is no global way of getting at the current session.
Is there a well-established pattern to organize such queries? Should I stick with the Foreign Methods and just try to unify the function signature (argument ordering, naming conventions etc.)? What would you suggest?
The standard way to have global access to the current session in a threaded environment is ScopedSession. There are some important aspects to get right when integrating with your framework, mainly transaction control and clearing out sessions between requests. A common pattern is to have an autocommit=False (the default) ScopedSession in a module and wrap any business logic execution in a try/except clause that rolls back in case of exception and commits if the method succeeded, then finally calls Session.remove(). The business logic would then import the Session object into global scope and use it like a regular session.
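A sketch of that pattern (the engine URL and the wrapper function are illustrative):
from sqlalchemy import create_engine
from sqlalchemy.orm import scoped_session, sessionmaker

engine = create_engine('sqlite:///app.db')  # placeholder URL
Session = scoped_session(sessionmaker(bind=engine, autocommit=False))

def run_in_transaction(func, *args, **kwargs):
    # Commit on success, roll back on error, and always clear the
    # thread-local session so nothing leaks into the next request.
    try:
        result = func(*args, **kwargs)
        Session.commit()
        return result
    except Exception:
        Session.rollback()
        raise
    finally:
        Session.remove()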
There seems to be an existing CherryPy-SQLAlchemy integration module, but as I'm not too familiar with CherryPy, I can't comment on its quality.
Having queries encapsulated as functions is just fine. Not everything needs to be in a class. If they get too numerous, just split them into separate modules by topic.
What I have found useful is to factor out common criteria fragments. They usually fit rather well as classmethods on model classes. Aside from increasing readability and reducing duplication, they work as implementation-hiding abstractions to some extent, making refactoring the database less painful. (Example: instead of (Foo.valid_from <= func.current_timestamp()) & (Foo.valid_until > func.current_timestamp()) you'd have Foo.is_valid())
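A sketch of that classmethod fragment, assuming a declarative Foo model:
from sqlalchemy import Column, DateTime, Integer, func
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'foo'
    id = Column(Integer, primary_key=True)
    valid_from = Column(DateTime)
    valid_until = Column(DateTime)

    @classmethod
    def is_valid(cls):
        # Reusable criterion fragment; callers never touch the columns.
        return ((cls.valid_from <= func.current_timestamp()) &
                (cls.valid_until > func.current_timestamp()))

# usage: session.query(Foo).filter(Foo.is_valid())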
SQLAlchemy strongly suggests that the session maker be part of some global configuration:
"It is intended that the sessionmaker() function be called within the global scope of an application, and the returned class be made available to the rest of the application as the single class used to instantiate sessions."
Queries living in separate modules isn't an interesting problem. The Django ORM works this way. A web site usually consists of multiple Django "applications", which sounds like your site with its many "service modules".
Knitting together multiple services is the point of an application. There aren't a lot of alternatives that are better.
