How to keep global variables persistent over multiple google appengine instances?

Our situation is as follows:
We are working on a school project in which multiple teams walk around a city with smartphones and play a city game as they go.
As such, we can have 10 active smartphones moving around the city, all posting their location and requesting data from Google App Engine.
Someone sits behind a web browser, watching all these teams walk around and sending them messages, etc.
We use the datastore that Google App Engine provides to store all the data these teams send and request, to store and retrieve the messages, etc.
However, we soon found out we were at our maximum quota of reads and writes, so we searched for a way to retrieve periodic updates (which cost the most reads and writes) without using any of the limited resources Google provides. And obviously, because it's a school project, we don't want to pay for more reads and writes.
Storing this information in global variables seemed an easy and quick solution, which it was... but when we started to test in earnest we noticed that some of our data went missing and then reappeared. It turned out that so many requests were hitting the cloud that a new instance was started, and global variables are not shared between instances.
So our question is:
Can we somehow make sure these global variables are always the same on every running instance of Google App Engine?
OR
Can we limit the number of instances ever running to 1, no matter how many requests are made?
OR
Is there perhaps a better way to store this data, without using the datastore and without using globals?

You should be using memcache. If you use the ndb (new database) library, entities you fetch by key are cached automatically (query results themselves are not served from the cache). Obviously this won't improve your writes much, but it should significantly improve the number of reads you can do.
You need to back it with the datastore as data can be ejected from memcache at any time. If you're willing to take the (small) chance of losing updates you could just use memcache. You could do something like store just a message ID in the datastore and have the controller periodically verify that every message ID has a corresponding entry in memcache. If one is missing the controller would need to reenter it.
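The "verify that every message ID still has a cache entry" idea above can be sketched roughly as follows. This is only an illustration: a plain dict stands in for memcache, and `reseed_missing` and `controller_copy` are hypothetical names, not App Engine APIs. On App Engine the dict lookups would become memcache calls.

```python
def reseed_missing(message_ids, cache, controller_copy):
    """Re-enter every message whose cache entry has been evicted.

    message_ids     -- IDs read from the datastore (the only durable part)
    cache           -- dict standing in for memcache (entries may vanish)
    controller_copy -- the controller's own copy of the message bodies
    Returns the list of IDs that had to be re-entered.
    """
    missing = [mid for mid in message_ids if mid not in cache]
    for mid in missing:
        cache[mid] = controller_copy[mid]
    return missing


# Usage: "m3" was evicted from the cache, so it is the only one re-entered.
cache = {"m1": "hello", "m2": "go north"}
controller_copy = {"m1": "hello", "m2": "go north", "m3": "meet at the square"}
reentered = reseed_missing(["m1", "m2", "m3"], cache, controller_copy)
```

The datastore then only ever sees one small read of the ID list per check, instead of one read per message per client.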

Interesting question. Some bad news first: I don't think there's a better way of storing data; no, you won't be able to stop new instances from spawning; and no, you cannot make separate instances always have the same data.
What you could do is have the instances periodically sync themselves with a master record in the datastore. By choosing the sync frequency intelligently and downloading/uploading the information in one lump, you could limit the number of reads/writes to a level that works for you. This is firmly in kludge territory, though.
Despite finding the quota for just about everything else, I can't find the limits for free reads/writes, so it is possible that they're ludicrously small, but the fact that you're hitting them with a mere 10 smartphones raises a red flag for me. Are you certain that the smartphones are being polled (or calling in) at a sensible frequency? It sounds like you might be hammering them unnecessarily.
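A minimal sketch of the "one lump per interval" idea, assuming each instance buffers updates in memory and writes them to the master record in a single call. `LumpSync`, `write_master` and the 30-second default are all hypothetical; `write_master` stands in for whatever single datastore write you would use.

```python
import time


class LumpSync:
    """Buffer updates in memory and write them out as one lump per interval."""

    def __init__(self, write_master, interval_s=30.0, clock=time.monotonic):
        self._write_master = write_master  # callable taking one dict (one write)
        self._interval_s = interval_s
        self._clock = clock                # injectable so the logic is testable
        self._pending = {}
        self._last_flush = clock()

    def record(self, team_id, location):
        # Later updates for the same team overwrite earlier ones in the buffer.
        self._pending[team_id] = location
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self):
        if self._pending:
            self._write_master(dict(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()
```

With ten phones reporting every few seconds, this turns many small writes into one write per interval per instance, at the cost of losing whatever is still buffered if the instance is shut down.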

Consider the Jabber (XMPP) protocol for communication between peers. The free quota for it is quite high.

First, definitely use memcache as Tim Delaney said. That alone will probably solve your problem.
If not, you should consider a push model. The advantage is that your clients won't be asking you for new data all the time, only when something has actually changed. If the update is small enough that you can deliver it in the push message, you won't need to worry about datastore reads on memcache misses, or any other duplicate work, for all those clients: you read the data once when it changes and push it out to everyone.
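The difference from polling can be sketched in a few lines. This is not C2DM, APN or XMPP code; `UpdateHub` and its `deliver` callbacks are hypothetical stand-ins for whatever transport actually carries the message to a phone.

```python
class UpdateHub:
    """Push a payload to every subscriber, but only when it has changed."""

    def __init__(self):
        self._subscribers = []
        self._last = None

    def subscribe(self, deliver):
        # deliver: callable invoked with the payload (stand-in for a push channel)
        self._subscribers.append(deliver)

    def publish(self, payload):
        """Return the number of clients notified (0 if nothing changed)."""
        if payload == self._last:
            return 0
        self._last = payload
        for deliver in self._subscribers:
            deliver(payload)
        return len(self._subscribers)
```

The data is read once per change and fanned out, rather than being read once per client per poll.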
The first option for push is C2DM (Android) or APN (iOS). These are limited in the amount of data they can send and in the frequency of updates.
If you want to get fancier you could use XMPP instead. This would let you do more frequent updates with (I believe) bigger payloads but might require more engineering. For a starting point, see Stack Overflow questions about Android and iOS.
Have fun!

Related

Two flask apps using one database

Hello, I don't think this is the right place for this question, but I don't know where else to ask it. I want to make a website and an API for that website using the same SQLAlchemy database. Would running them at the same time, independently, be safe, or would this cause corruption from two writes happening at the same time?

SQLAlchemy (SQLA) is a Python toolkit for talking to SQL databases. It is not its own database. If you're running your website (perhaps Flask?) and managing your API from the same script, you can simply use the same reference to your SQLA instance. That is, when you use SQLA to connect to a database and save the result to a variable, what you are really saving is the connection, and you keep referencing that variable rather than taking the less efficient route of creating a new connection every time. So when you say
using the same SQLAlchemy database
I believe you are actually referring to the actual underlying database itself, not the SQLA wrapper/connection to it.
If your website and API are not running in the same script (or even if they are, depending on how your API handles simultaneous requests), you may encounter a race condition, which, according to Wikipedia, is defined as:
the condition of an electronics, software, or other system where the system's substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This may be what you are referring to when you mentioned
would this cause corruption from two write happening at the same time.
With a file-based database such as SQLite, concurrent access is guarded by file locks: the library takes a lock on the database file while it writes, and a second connection that tries to write at the same time has to wait or fails with a "database is locked" error. (Merely opening a file with with open(filename): does not lock it on most operating systems.) This may be the real issue you might encounter when using two simultaneous connections to a SQLite db.
However, if you are using something like MySQL, where you connect to a database server process and NOT to a file, the server arbitrates concurrent access itself, so you will not hit this kind of file-lock error. You may still run into that nasty race condition, as in the following made-up scenario:
Stack Overflow queries the reputation of an account to see if it should be banned due to negative reputation.
AT THE EXACT SAME TIME, someone upvotes an answer made by that account, which moves its reputation just clear of the ban threshold.
The outcome is now determined by the speed of execution of these 2 tasks.
If the upvoter has, say, a slow computer, and the "upvote" does not get processed by Stack Overflow before the reputation query completes, the account will be banned. However, if there is some lag on Stack Overflow's end, and the upvote is processed before the account query finishes, the account will not get banned.
The key concept behind this example is that all of these steps can occur within fractions of a second, and the outcome depends on the speed of execution on both ends.
To address the issue of data corruption, most databases have a system in place that properly orders database reads and writes. However, semantic issues may still arise, such as the example given above.
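One common way to narrow a race like the one in the example is to make the check and the change a single statement, so the decision is never based on a stale read. The sketch below uses the standard-library sqlite3 module with an in-memory database; the table layout, `BAN_THRESHOLD` and the function names are made up for illustration.

```python
import sqlite3

BAN_THRESHOLD = -10  # hypothetical value, for illustration only


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, reputation INTEGER, banned INTEGER DEFAULT 0)"
    )
    conn.execute("INSERT INTO accounts (id, reputation) VALUES (1, ?)", (BAN_THRESHOLD,))
    conn.commit()
    return conn


def upvote(conn, account_id):
    # The increment happens inside the database, not as read-then-write in Python.
    with conn:
        conn.execute(
            "UPDATE accounts SET reputation = reputation + 1 WHERE id = ?",
            (account_id,),
        )


def ban_if_at_threshold(conn, account_id):
    # Check and ban in ONE statement: the WHERE clause sees the current value.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET banned = 1 WHERE id = ? AND reputation <= ?",
            (account_id, BAN_THRESHOLD),
        )
    return cur.rowcount == 1
```

The order in which the upvote and the ban check arrive still decides the outcome, but whichever runs second always acts on the value the first one left behind.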

Two applications can use the same database as the DB is a separate application that will be accessed by each flask app.
What you are asking can be done, and it is the methodology used by many large web applications, especially when the API is written in a different framework than the main application.
Since SQL databases are ACID compliant, they have a system in place to queue the multiple read/write requests put to them and perform them in the correct order while ensuring data reliability.
One question to ask, though, is whether it is useful to write two separate applications. For most Flask-only projects the best approach would be to split the project using blueprints, with a "main" blueprint and an "api" blueprint.

Share state between threads in bottle

In my Bottle app running on pythonanywhere, I want objects to be persisted between requests.
If I write something like this:
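    from bottle import route, SimpleTemplate

    # Module-level state: shared by every request handled by this process.
    X = {'count': 0}

    @route('/count')
    def count():
        X['count'] += 1
        tpl = SimpleTemplate('Hello {{count}}!')
        return tpl.render(count=X['count'])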
The count increments, meaning that X persists between requests.
I am currently running this on pythonanywhere, which is a managed service where I have no control over the web server (nginx I presume?) threading, load balancing (if any) etc...
My question is, is this coincidence because it's only using one thread while on minimal load from me doing my tests?
More generally, at which point will this stop working? E.g. I have more than one thread/socket/instance/load-balanced server etc...?
Beyond that, what are my best options to make something like this work (sticking to Bottle), even if I have to move to a bare-bones server?
Here's what Bottle docs have to say about their request object:
A thread-safe instance of LocalRequest. If accessed from within a request callback, this instance always refers to the current request (even on a multi-threaded server).
But I don't fully understand what that means, or where global variables like the one I used stand with regards to multi-threading.

TL;DR: You'll probably want to use an external database to store your state.
If your application is tiny, and you're planning to always have exactly one server process running, then your current approach can work; "all" you need to do is acquire a lock around every (!) access to the shared state (the dict X in your sample code). (I put "all" in scare quotes there because it's likely to become more complicated than it sounds at first.)
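For the single-process case, "a lock around every access" could look roughly like this. The names `X_LOCK`, `increment` and `hammer` are hypothetical; only `X` comes from the question.

```python
import threading

X = {"count": 0}
X_LOCK = threading.Lock()  # guards every read and write of X


def increment():
    # Read-modify-write must happen while holding the lock.
    with X_LOCK:
        X["count"] += 1
        return X["count"]


def hammer(n_threads=8, per_thread=1000):
    """Increment from several threads at once and return the final count."""
    def worker():
        for _ in range(per_thread):
            increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return X["count"]
```

This protects against threads inside one process only. It does nothing once the server runs more than one process.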
But, since you're asking about multithreading, I'll assume that your application is more than a toy, meaning that you plan to receive substantial traffic and/or want to handle multiple requests concurrently. In this case, you'll want multiple processes, which means that your approach--storing state in memory--cannot work. Memory is not shared across processes. The (general) way to share state across processes is to store the state externally, e.g. in a database.
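As a rough illustration of externally stored state, the sketch below keeps the counter in a SQLite file and opens two independent connections to it, standing in for two server processes. SQLite is used here only because it ships with Python; with Redis the same idea would be a single INCR on a key. All function and table names are made up.

```python
import os
import sqlite3
import tempfile


def open_store(path):
    """Open (and if needed initialise) the shared counter store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counters ("
        "name TEXT PRIMARY KEY, value INTEGER NOT NULL)"
    )
    conn.commit()
    return conn


def increment_shared(conn, name):
    """Increment the named counter inside one transaction and return its value."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO counters (name, value) VALUES (?, 0)", (name,))
        conn.execute("UPDATE counters SET value = value + 1 WHERE name = ?", (name,))
        row = conn.execute("SELECT value FROM counters WHERE name = ?", (name,)).fetchone()
    return row[0]


def demo_two_connections():
    """Two connections (think: two processes) see the same counter."""
    path = os.path.join(tempfile.mkdtemp(), "state.db")
    a, b = open_store(path), open_store(path)
    try:
        return [increment_shared(a, "count"),
                increment_shared(b, "count"),
                increment_shared(a, "count")]
    finally:
        a.close()
        b.close()
```

Because the state lives outside any one process, it no longer matters which process happens to serve a given request.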
Are you familiar with Redis? That'd be on my short list of candidates.

I got the answers by contacting PythonAnywhere support, who had this to say:
When you run a website on a free PythonAnywhere account, just
one process handles all of your requests -- so a global variable like
the one you use there will be fine. But as soon as you want to scale
up, and get (say) a hacker account, then you'll have multiple processes
(note, not threads) -- and of course each one will have its own global
variables, so things will go wrong.
So that part deals with the PythonAnywhere specifics on why it works, and when it would stop working on there.
The answer to the second part, about how to share variables between multiple Bottle processes, I also got from their support (most helpful!) once they understood that a database would not work well in this situation.
Different processes cannot of course share variables, and the most viable solution would be to:
write your own kind of caching server to handle keeping stuff in memory [...] You'd have one process that ran all of the time, and web API requests would access it somehow (an internal REST API?). It could maintain stuff in memory [...]
PS: I didn't expect other replies to tell me to store state in a database; I figured that the fact I'm asking this means I have a good reason not to use a database. Apologies for time wasted!

Google App Engine, Datastore and Task Queues, Performance Bottlenecks?

We're designing a system that will take thousands of rows at a time and send them via JSON to a REST API built on Google App Engine. Typically 3-300KB of data but let's say in extreme cases a few MB.
The REST API app will then adapt this data to models on the server and save them to the Datastore. Are we likely to (eventually if not immediately) encounter any performance bottlenecks here with Google App Engine, whether it's working with that many models or saving so many rows of data at a time to the datastore?
The client does a GET to get thousands of records, then a PUT with thousands of records. Is there any reason for this to take more than a few seconds and require the Task Queue API?

The only bottleneck in App Engine (apart from the single entity group limitation) is how many entities you can process in a single thread on a single instance. This number depends on your use case and the quality of your code. Once you reach a limit, you can (a) use a more powerful instance, (b) use multi-threading and/or (c) add more instances to scale up your processing capacity to any level you desire.
The Task Queue API is a very useful tool for large data loads. It allows you to split your job into a large number of smaller tasks, set the desired processing rate, and let App Engine automatically adjust the number of instances to meet that rate. Another option is the MapReduce API.
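The "split the job into smaller tasks" step is independent of App Engine and can be sketched on its own. `split_into_tasks` and the default batch size of 500 are assumptions for illustration; each returned batch would become the payload of one queued task.

```python
def split_into_tasks(rows, batch_size=500):
    """Split a list of rows into consecutive batches of at most batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]


# Usage: 1200 incoming rows become three task payloads (500, 500, 200 rows).
batches = split_into_tasks(list(range(1200)))
```

The request handler then only has to enqueue the batches and return, while the actual datastore writes happen at whatever rate the queue is configured for.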

This is a really good question, one that I've been asked in interviews and have seen pop up in a lot of different situations as well. Your system essentially consists of two things:
Saving (or writing) models to the datastore.
Reading from the datastore.
From my experience of this problem, when you treat these two things separately you're able to come up with solid solutions to both. I typically use a cache, such as memcached, to keep data easily accessible for reading. At the same time, for writing, I try to have a main DB and a few slave instances as well. All the writes go to the slave instances (thereby not locking up the main DB for reads that sync to the cache), and the writes to the slave DBs can be distributed round-robin, thereby ensuring that your inserts are not skewed by any of the model's attributes having a high occurrence.

Datastore vs Memcache for high request rate game

I have been using the datastore with ndb for a multiplayer app. This appears to be using a lot of reads/writes and will undoubtedly go over quota and cost a substantial amount.
I was thinking of changing all the game data to be stored only in memcache. I understand that data stored here can be lost at any time, but as the data will only be needed for, at most, 10 minutes and as it's just a game, that wouldn't be too bad.
Am I right to move to solely use memcache, or is there a better method, and is memcache essentially 'free' short term data storage?

Yes, memcache is free and you can use it as free "data storage". Just keep in mind that it can be purged at any time (more likely if it is heavily used) and that it is not always available. To check for memcache availability, use the Capabilities API.

As a commenter on another answer noted, there are now two memcache offerings: shared and dedicated. Shared is the original service, and is still free. Dedicated is in preview, and presently costs $0.12 per GB per hour.
Dedicated memcache allows you to have a certain amount of space set aside. However, it's important to understand that you can still experience partial or complete flushes at any time with dedicated memcache, due to things like machine reboots. Because of this, it's not a suitable replacement for the datastore.
However, it is true that you can greatly reduce your datastore usage with judicious use of memcache. Using it as a write-through cache, for example, can greatly reduce your datastore reads (albeit not the writes).
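A write-through cache in this sense can be sketched with two dicts standing in for the datastore and memcache. `WriteThroughStore` and its counters are hypothetical and exist only to show which operations still reach the datastore.

```python
class WriteThroughStore:
    """Every write goes to the datastore AND the cache; reads prefer the cache."""

    def __init__(self):
        self.datastore = {}   # stand-in for the durable datastore
        self.cache = {}       # stand-in for memcache (may be flushed at any time)
        self.datastore_reads = 0
        self.datastore_writes = 0

    def put(self, key, value):
        self.datastore[key] = value   # writes are NOT saved by the cache
        self.datastore_writes += 1
        self.cache[key] = value       # keep the cache in step with the write

    def get(self, key):
        if key in self.cache:
            return self.cache[key]    # served without touching the datastore
        self.datastore_reads += 1     # cache miss, e.g. after a flush
        value = self.datastore[key]
        self.cache[key] = value
        return value

    def flush_cache(self):
        self.cache.clear()
```

Reads only hit the datastore after a flush or eviction, while the number of writes is unchanged, which matches the "reads, albeit not the writes" caveat above.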
Hope this helps.

Using callLater in Twisted to keep track of auction endings

I was wondering if it would be a good idea to use callLater in Twisted to keep track of auction endings. It would be a callLater on the order of 100,000's of seconds, though does that matter? Seems like it would be very convenient. But then again it seems like a horrible idea if the server crashes.
Keeping a database of when all the auctions are ending seems like the most secure solution, but checking the whole database each second to see if any auction has ended seems very expensive.
If the server crashes, maybe the server can recreate all the callLater's from database entries of auction end times. Are there other potential concerns for such a model?

One of the Divmod projects, Axiom, might be applicable here. Axiom is an object database. One of its unexpected, useful features is a persistent scheduling system.
You schedule events using APIs provided by the database. When the events come due, a callback you specified is called. The events persist across process restarts, since they're represented as database objects. Large numbers of scheduled events are supported, because the only ongoing work is tracking when the next event is due.
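The "only track the next event" idea can be illustrated without Axiom. The sketch below is not Axiom's API; `AuctionScheduler` is a hypothetical class that is rebuilt from stored (end_time, auction_id) rows after a restart and keeps them in a heap, so a Twisted application would need just one callLater at a time, armed for `next_due()`.

```python
import heapq


class AuctionScheduler:
    """Keep auction end times in a heap; only the earliest one needs a timer."""

    def __init__(self, stored_end_times=()):
        # stored_end_times: (end_time, auction_id) rows loaded from the database,
        # which is how the schedule survives a server crash.
        self._heap = list(stored_end_times)
        heapq.heapify(self._heap)

    def schedule(self, end_time, auction_id):
        heapq.heappush(self._heap, (end_time, auction_id))

    def next_due(self):
        """Time of the earliest pending auction end, or None if there is none."""
        return self._heap[0][0] if self._heap else None

    def pop_due(self, now):
        """Remove and return the IDs of all auctions that have ended by `now`."""
        ended = []
        while self._heap and self._heap[0][0] <= now:
            ended.append(heapq.heappop(self._heap)[1])
        return ended
```

When the single timer fires, the application would call `pop_due`, close those auctions, and re-arm the timer for the new `next_due()`, so the database never has to be scanned every second.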
The canonical Divmod site went down some time ago (sadly the company is no longer an operating concern), but the code is all available at http://launchpad.net/divmod.org and the documentation is being slowly rehosted at http://divmod.readthedocs.org/.
