I'm working on a project using Google App Engine where I need to have users vote on a poll live, and they only have 15 seconds to submit their vote. I already have the delivery of the options working using Pusher.com, but I'm struggling to work out the right way to go about the voting.
A set of options is generated every 30-60 seconds. After that, votes are counted and a new set is delivered to the users; the old votes are useless and don't need to be stored. The number of options varies every time, usually around 5, but it could be up to 20 on rare occasions. Here comes the tricky part: there are also 2 sets of sub options, which are all different for every main option. These are secondary, however, and only matter if the particular option won. Also, not every main option has them. So a sample set could be this:
Option A
sub options:
- X
- Y
- Z
sub options 2
- F
- G
- H
Option B
sub options 2
- X
- Z
Option C
Option D
sub options
- sub X
- sub Y
- sub Z
At first I thought about using a simple database table, but Google App Engine has concurrent user limits, and it gets expensive to go to the higher tiers, where I would be wasting a bunch of resources such as the storage limit, since I don't need to save the results. I need this to scale to a couple thousand concurrent users.
From what I read in the Google documentation on sharded counters, it seems they can only hold integers, so I can't store an array or a string, which would be ideal (an example of a single vote's data would be {'option':'2', 'sub':'0', 'sub2':'1'}). I've been playing around, and the only idea I've come up with is to create an integer counter for every possible vote combination, but that just seems inefficient; there could often be over a hundred counters. Any idea how I could set this up? Also, there doesn't seem to be any documentation on how to delete a counter while the app is running.
I should also add that I'm a beginner, self-taught programmer, and this is my first time stepping out of my comfort zone of PHP, JavaScript, and very simple Python.
Thank you so much for taking the time to read this.
This may be a naive solution, but you could simply store the votes in memory when using instances with basic or manual scaling, and push them to the datastore every few seconds, thus never hitting the maximum allowed concurrent updates.
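A rough sketch of that idea, assuming a single instance (basic or manual scaling); the VoteHandler class and flush_votes helper are hypothetical, and flush_votes would be called every few seconds from your round-end logic or a deferred task:

import collections
import json

import webapp2

# Per-instance, in-memory tally keyed by the vote combination,
# e.g. {('2', '0', '1'): 17, ...}
vote_counts = collections.Counter()

class VoteHandler(webapp2.RequestHandler):
    def post(self):
        # Each vote arrives as e.g. {'option': '2', 'sub': '0', 'sub2': '1'}.
        vote = json.loads(self.request.body)
        key = (vote.get('option'), vote.get('sub'), vote.get('sub2'))
        vote_counts[key] += 1

def flush_votes():
    # Snapshot the tally, reset it for the next round, and persist or
    # broadcast the snapshot (datastore, Pusher, ...).
    snapshot = dict(vote_counts)
    vote_counts.clear()
    return snapshot

Keep in mind the tally lives in one instance's memory, so this only works while a single instance handles the votes (or if you aggregate per instance).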
Related
I'm working on creating a browser-based game in Django and Python, and I'm trying to come up with a solution to one of the problems I'm having.
Essentially, every second, multiple user variables need to be updated. For example, there's a currency variable that should increase by some amount every second, progressively getting larger as you level-up and all of that jazz.
I feel like it's a bad idea to do this with cronjobs (and from my research, other people think that too), so right now I'm thinking I should just create a thread that loops through all of the users in the database that performs these updates.
Am I on the right track here, or is there a better solution? In Django, how can I start a thread the second the server starts?
I appreciate the insight.
One possible solution would be to use a separate, daemonized, lightweight Python script to perform all the in-game business logic and let Django be just the frontend to your game. To bind them together you might pick any high-performance asynchronous messaging library like ZeroMQ (for instance, to pass the player's actions to that script). This stack would also have the benefit of a frontend that is separated from, and completely agnostic of, the backend implementation.
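If you go that route, a minimal sketch with pyzmq might look like the following; the socket address and the message format are placeholders, not anything prescribed by ZeroMQ:

import zmq

# In the Django view (frontend): push the player's action to the daemon.
def send_action(player_id, action):
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PUSH)
    sock.connect("tcp://127.0.0.1:5555")
    sock.send_json({"player": player_id, "action": action})

# In the standalone game daemon (backend): receive and process actions.
def game_loop():
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.PULL)
    sock.bind("tcp://127.0.0.1:5555")
    while True:
        msg = sock.recv_json()
        # ...apply the action to the game state here...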
Generally a better solution would be to not update everyone's currency every second. Instead, store the timestamp of the user's most recent transaction, their income rate, and their latest balance. With these three pieces of data, you can calculate their current balance whenever needed.
When the user makes a purchase for example, do the math to calculate what their new balance is. Some pseudocode:
def current_balance(user):
    # Balance is derived on demand from the last stored balance plus the
    # income accrued since the last transaction (now() being a numeric
    # timestamp, e.g. time.time()).
    return user.latest_balance + user.income * (now() - user.last_timestamp)

def purchase(user, price):
    if price > current_balance(user):
        print("You don't have enough cash.")
        return
    user.latest_balance = current_balance(user) - price
    user.last_timestamp = now()
    user.save()
    print("You just bought a thing!")
With a system like this, your database will only get updated upon user interaction, which will make your system scale oodles better.
I want to make a web page that generates a uniform random number between 0 and 99 every 10 seconds, and displays a list of the 100 most recent numbers (which are the same for everyone visiting the site). It should update live.
My design is the following:
A long-running Python process (e.g. managed by supervisord) that runs in an eternal loop, generating numbers at 10-second intervals, writing the numbers to a file or SQL database, and pruning the old numbers since they are no longer needed.
The web server process then simply reads the file and displays it to the user (either on initial load, or from an Ajax call to get the most recent numbers).
I don't feel great about this solution. It's pretty heavy on file system I/O, which is not really a bottleneck or anything, but I wonder if there's a smarter way that is still simple. If I could store the list as an in-memory data structure shared between processes, I could have one process push and pop values every 10 seconds, and the web server processes could just read that data structure. I read a bit about Unix domain sockets, but it wasn't clear that they were a great fit for my problem.
Is there a more efficient approach that is still simple?
EDIT: the approach suggested by Martijn Pieters in his answer (don't generate anything until someone visits) is sensible, and I am considering it too, since the website doesn't get very heavy traffic. The problem I see is race conditions, since you then have multiple processes trying to write to the same file/DB. If the values in the file/DB are stale, we need to generate new ones, but one process might read the old values before another process has had the chance to update them. File locking as described in this question is a possibility, but many people in the answers warn about having multiple processes write to the same file.
You are overcomplicating things.
Don't generate any numbers until you have an actual request. Then see how old your last number is, generate enough numbers to cover the intervening time period, update your tables, return the result.
There is no actual need to generate a random number every 10 seconds here. You only need to produce the illusion that the numbers have been generated every 10 seconds; that more than suffices for your use case.
A good database will handle concurrent access for you, and most will also let you set exclusive locks. Grab a lock when you need to update the numbers. Fail to grab the lock? Something else is already updating the numbers.
Pre-generate numbers; nothing says you have to generate numbers only for time slots that have already passed. Randomize which requests pre-generate to minimize lock contention. Append the numbers to the end of the pool, so that if you accidentally run this twice, all you get is double the extra random numbers, and you can wait twice as long before you need to generate more.
Most of all, generating a sequence of random numbers is cheap, so doing this during any request is hardly going to slow down your responses.
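As a rough sketch of the catch-up calculation (assuming you store the timestamp of the newest number somewhere; the names here are made up):

import random
import time

INTERVAL = 10  # one "generated" number per 10-second slot

def catch_up(last_timestamp):
    # How many 10-second slots have elapsed since the newest stored number?
    missed = int((time.time() - last_timestamp) // INTERVAL)
    return [random.randint(0, 99) for _ in range(missed)]

You would call this while holding the exclusive lock, append the result to your table, and then serve the most recent 100 numbers.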
I would pre-generate a lot of numbers (say, enough numbers for 1 week; do the math) and store them. That way, the Ajax calls would only load the next number on the list. When you are running out of numbers, pre-generate again. The process of generating and writing into the DB would only be executed once in a while (e.g. once a week).
EDIT: For a full week, you would need 60,480 numbers at most. Using what Martijn Pieters recommends (only reading a new number when a visitor actually asks for one), and depending on your specific needs (you may need to burn the numbers even if nobody is seeing them), those numbers may last well beyond the week.
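As a tiny sketch of the pre-generation step (storage is left up to you):

import random

def pregenerate_week():
    # One number per 10-second slot for 7 days: 6 * 60 * 24 * 7 = 60480.
    return [random.randint(0, 99) for _ in range(6 * 60 * 24 * 7)]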
For a model like:
class Thing(ndb.Model):
    visible = ndb.BooleanProperty()
    made_by = ndb.KeyProperty(kind=User)
    belongs_to = ndb.KeyProperty(kind=AnotherThing)
Essentially I'm performing an 'or' query, but comparing different properties, so I can't use the built-in OR. I want to get all Thing entities (belonging to a particular AnotherThing) which either have visible set to True, or have visible set to False and were made_by the current user.
Which would be less demanding on the datastore (i.e. financially cost less):
1. Query to get everything, i.e. Thing.query(Thing.belongs_to == some_thing.key), and iterate through the results, storing the visible ones and the ones that aren't visible but are made_by the current user?
2. Query to get the visible ones, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == True), and query separately to get the non-visible ones by the current user, i.e. Thing.query(Thing.belongs_to == some_thing.key, Thing.visible == False, Thing.made_by == current_user)?
Number 1 would get many unneeded results, like non-visible Things by other users, which I think means many reads of the datastore. Number 2 is two whole queries though, which is also possibly unnecessarily heavy, right? I'm still trying to work out what kinds of interaction with the database cause what kinds of costs.
I'm using ndb, tasklets and memcache where necessary, in case that's relevant.
Number two is going to cost less, for two reasons. First, you pay for each read of the datastore and for each entity returned by a query, so you would pay more for the first option, where you have to read and query all the data; with the second option you only pay for what you need.
Secondly, you also pay for backend or frontend instance time, and you will spend time iterating through all your results with the first method, whereas you spend no extra time with the second method.
I can't see a case where the first option is better (maybe if you only have a few entities?).
To understand what reads and queries cost, scroll down a little on:
https://developers.google.com/appengine/docs/billing
You will see how read, write and small operations are added up for reads, writes and queries.
I would also just query for the ones that are owned by the current user, instead of visible=False and owner=current_user; this way you don't need a composite index, which will save some time. You can also make visible a partial index, saving some space as well (only index it when True, assuming you never need to query for False ones). You will need to do a little work to remove duplicates, but that is probably not too bad.
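A rough sketch of that two-query-and-deduplicate idea (fetch_async/get_result are standard ndb calls; the function name is made up):

def visible_or_mine(some_thing, current_user_key):
    # Run both queries concurrently and merge, deduplicating by key since
    # the current user's own visible Things show up in both result sets.
    visible_future = Thing.query(
        Thing.belongs_to == some_thing.key,
        Thing.visible == True).fetch_async()
    mine_future = Thing.query(
        Thing.belongs_to == some_thing.key,
        Thing.made_by == current_user_key).fetch_async()
    results = {t.key: t for t in visible_future.get_result()}
    results.update({t.key: t for t in mine_future.get_result()})
    return list(results.values())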
You are probably best benchmarking both cases using real-world data. It's hard to determine things like this in the abstract, as there are many subtleties that may affect overall performance.
I would expect option 2 to be better, though. Loading tons of objects that you don't care about puts a heavy burden on the datastore that I don't think an extra query comes close to. Of course, it depends on how many extra things there are, etc.
What is the appropriate way to handle counts in App Engine (ndb or db)?
I have two projects: one is django-nonrel and the other is a pure Django project, but both need the ability to take a query and get a count back. The results could be greater than 1,000.
I saw some posts that said I could use sharded counters, but they count all entities. I need to be able to know how many entities have the following properties: x=1, y=True, z=3.
#Is this the appropriate way?
count = some_entity.gql(query_string).count(SOME_LARGE_NUMBER)
The datastore is not good at this sort of query, because of tradeoffs to make it distributed. These include fairly slow reads, and very limited indexing.
If there is a limited set of statistics you need (number of users, articles, etc.) then you can keep running totals in a separate entity. This means you need to do two writes (puts) when something changes: one for the entity that changes, and one to update the stats entity. But you only need one read (get) to get your statistics, instead of however many entities they are distilled from.
You may be uncomfortable with this because it goes against what we all learned about normalisation, but it is far more efficient and in many cases works fine. You can always have a cron job periodically do your queries to check the statistics are accurate, if this is critical.
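As a small sketch of the running-totals idea (the ArticleStats entity and create_article function are hypothetical; ndb shown, but db works the same way):

from google.appengine.ext import ndb

class ArticleStats(ndb.Model):
    count = ndb.IntegerProperty(default=0)

@ndb.transactional(xg=True)
def create_article(article):
    # Two writes in one cross-group transaction: the new entity and the
    # stats entity that tracks the running total.
    stats = ndb.Key(ArticleStats, 'global').get() or ArticleStats(id='global')
    stats.count += 1
    ndb.put_multi([article, stats])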
Since you are using db.Model, here is one way to count all your entities matching some filters, even when the count might exceed the old 1,000 hard limit (if that limit is still applicable):
FETCH_LIMIT = 1000

def count_model(x=1, y=True, z=3):
    model_qry = MyModel.all(keys_only=True)
    model_qry.filter('x =', x)
    model_qry.filter('y =', y)
    model_qry.filter('z =', z)

    count = None
    total = 0
    cursor = None
    # Count in batches of FETCH_LIMIT, resuming from the cursor each time,
    # until a batch comes back empty.
    while count != 0:
        if cursor:
            count = model_qry.with_cursor(cursor).count(limit=FETCH_LIMIT)
        else:
            count = model_qry.count(limit=FETCH_LIMIT)
        total += count
        cursor = model_qry.cursor()
    return total
If you're going to run the above during a request you might hit the request deadline, so consider using Task Queues instead.
Also, as FoxyLad proposed, it is much better to keep running totals in a separate entity for performance reasons, with the above method running as a cron job on a regular basis to keep the stats in perfect sync.
I am writing a backup system in Python, with a Django front-end. I have decided to implement the scheduling in a slightly strange way - the client will poll the server (every 10 minutes or so), for a list of backups that need doing. The server will only respond when the time to backup is reached. This is to keep the system platform independent - so that I don't rely on cronjobs or suchlike. Therefore the Django front-end (which exposes an XML-RPC API) has to store the schedule in a database, and interpret that schedule to decide if a client should start backing up or not.
At present, the schedule is stored using 3 fields: days, hours and minutes. These are comma-separated lists of integers representing the days of the week (0-6), hours of the day (0-23) and minutes of the hour (0-59). Deciding whether a client should start backing up is a horribly inefficient operation: Python must loop over all the days since a point 7 days in the past, then the hours, then the minutes. I have done some optimization to make sure it doesn't loop too much - but still!
This works relatively well, although the implementation is pretty ugly. The problem I have is how to display and interpret this information via the HTML form on the front-end. Currently I just have huge lists of multi-select fields, which obviously doesn't work well.
Can anyone suggest a different method for implementing the schedule that would be more efficient, and also easier to represent in an HTML form?
Take a look at django-chronograph. It has a pretty nice interface for scheduling jobs at all sorts of intervals. You might be able to borrow some ideas from that. It relies on python-dateutil, which you might also find useful for specifying repeating events.
Your question is a bit ambiguous—do you mean: "Back up every Sunday, Monday and Friday at time X."?
If so, use a bitmask to store the recurring schedule as an integer:
Let's say that you want a backup as mentioned above - on Sundays, Mondays and Fridays. Encode the days of the week as an integer (represented in binary):
S M T W T F S
1 1 0 0 0 1 0 = 98
To find out if today (eg. Friday) is a backup day, simply do a bitwise and:
>>> 0b1100010 & 0b0000010 != 0
True
To get the current day as an integer, you need to offset it by one since weekday() assumes week starts on Monday:
current_day = (timezone.now().weekday() + 1) % 7
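Putting the pieces together, a small helper might look like this (a sketch assuming the Sunday-first bit layout above and Django's timezone utilities):

from django.utils import timezone

def is_backup_day(days_recurrence):
    # weekday(): Monday=0 ... Sunday=6; shift so that Sunday=0 ... Saturday=6.
    current_day = (timezone.now().weekday() + 1) % 7
    # Sunday is the most significant of the 7 bits, so index from the left.
    return (days_recurrence & (1 << (6 - current_day))) != 0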
In summary, the schema for your Schedule object would look something like:
class Schedule(models.Model):
    days_recurrence = models.PositiveSmallIntegerField(db_index=True)
    time = models.TimeField()
With this schema, you would need a new Schedule object for each time of the day you would like to back-up. This is a fast lookup since the bitwise operation costs around 2 cycles and since you're indexing the field days_recurrence, you have a worst-case day-lookup of O(logn) which should cut down your complexity considerably. If you want to squeeze more performance out of this, you can also use a bitmask for an hour then store the minute.