3 questions about appengine indexes - python

I have the following situation:
class M(db.Model):
    a = db.ReferenceProperty(A)
    x = db.ReferenceProperty(X)
    y = db.ReferenceProperty(Y)
    z = db.ReferenceProperty(Z)
    items = db.StringListProperty()
    date = db.DateTimeProperty()
I want to make queries that filter on (a), (x, y, or z), and (items), ordered by date, i.e.
mm = M.all().filter('a =', a1).filter('x =', x1).filter('items =', i).order('-date')
There will never be a query with filter on x and y at the same time, for example.
So, my questions are:
1) How many (and which) indexes should I create?
2) How many 'strings' can I add on items? (I'd like to add in the order of thousands)
3) How many index records will I have on a single "M" if there are 1000 items?
I don't quite understand this index stuff yet, and it is killing me. Your help will be much appreciated :)

This article explains indexes/exploding indexes quite well, and it actually fits your example: https://developers.google.com/appengine/docs/python/datastore/queries#Big_Entities_and_Exploding_Indexes
Your biggest issue will be the fact that you will probably run into the 5000 index entries per entity limit with thousands of items. If you take an index on a, x, items (1000 items), date, that is |a| * |x| * |items| * |date| == 1 * 1 * 1000 * 1 == 1000 index entries.
If you have 5001 entries in items, the put() will fail with the appropriate exception.
From the example you provided, whether you filter on x, y or anything else seems irrelevant, as there is only 1 value for that property, and therefore you do not run the risk of an exploding index: 1 * 1 == 1.
Now, if you had two list properties, you would want to make sure that they are indexed separately, otherwise you'd get an exploding index. For example, if you had 2 list properties with 100 items each, that would produce 100*100 indexes, unless you split them, which would then result in only 200 (assuming all other properties were non-lists).

For the criteria you have given, you only need to create three compound indexes: (a, x, items, -date), (a, y, items, -date), (a, z, items, -date). Note that a list property creates an index entry for each item in the list.
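As a rough sketch, the corresponding index.yaml entries would look something like this (kind and property names are taken from the model above; double-check the exact syntax against the datastore index documentation):
indexes:
- kind: M
  properties:
  - name: a
  - name: x
  - name: items
  - name: date
    direction: desc
- kind: M
  properties:
  - name: a
  - name: y
  - name: items
  - name: date
    direction: desc
- kind: M
  properties:
  - name: a
  - name: z
  - name: items
  - name: date
    direction: desc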
There is a limit of 5000 total index entries per entity. If you only have three compound indexes, that is 5000/3 ≈ 1666 items (capped at 1000 for a single list property).
In the case of three compound indexes only, 3*1000 = 3000.
NOTE: the above assumes you do not have built-in indexes per property (i.e. the properties are saved as unindexed). Otherwise you need to account for built-in indexes at 2N, where N is the number of single property values (the 2 is for asc and desc). In your case this would be 2 * (5 + no_items), since items is a list property and every entry creates an index entry; for 1000 items that is 2 * (5 + 1000) = 2010 built-in entries.

See also https://developers.google.com/appengine/articles/indexselection, which describes App Engine's (relatively recent) improved query planning capabilities. Basically, you can reduce the number of index entries needed to: (number of filters + 1) * (number of orders).
Though, as the article discusses, there can be reasons you might still use compound indexes; essentially, there is a time/space tradeoff.

Related

Get 10 random elements in Django template [duplicate]

How do I get two distinct random records using Django? I've seen questions about how to get one but I need to get two random records and they must differ.
The order_by('?')[:2] solution suggested by other answers is actually an extraordinarily bad thing to do for tables that have large numbers of rows. It results in an ORDER BY RAND() SQL query. As an example, here's how mysql handles that (the situation is not much different for other databases). Imagine your table has one billion rows:
To accomplish ORDER BY RAND(), it needs a RAND() column to sort on.
To do that, it needs a new table (the existing table has no such column).
To do that, mysql creates a new, temporary table with the new columns and copies the existing ONE BILLION ROWS OF DATA into it.
As it does so, it does as you asked, and runs rand() for every row to fill in that value. Yes, you've instructed mysql to GENERATE ONE BILLION RANDOM NUMBERS. That takes a while. :)
A few hours/days later, when it's done it now has to sort it. Yes, you've instructed mysql to SORT THIS ONE BILLION ROW, WORST-CASE-ORDERED TABLE (worst-case because the sort key is random).
A few days/weeks later, when that's done, it faithfully grabs the two measly rows you actually needed and returns them for you. Nice job. ;)
Note: just for a little extra gravy, be aware that mysql will initially try to create that temp table in RAM. When that's exhausted, it puts everything on hold to copy the whole thing to disk, so you get that extra knife-twist of an I/O bottleneck for nearly the entire process.
Doubters should look at the generated query to confirm that it's ORDER BY RAND() then Google for "order by rand()" (with the quotes).
A much better solution is to trade that one really expensive query for three cheap ones (limit/offset instead of ORDER BY RAND()):
import random
last = MyModel.objects.count() - 1
index1 = random.randint(0, last)
# Here's one simple way to keep even distribution for
# index2 while still guaranteeing not to match index1.
index2 = random.randint(0, last - 1)
if index2 == index1: index2 = last
# This syntax will generate "OFFSET=indexN LIMIT=1" queries
# so each returns a single record with no extraneous data.
MyObj1 = MyModel.objects.all()[index1]
MyObj2 = MyModel.objects.all()[index2]
If you specify the random operator in the ORM, I'm pretty sure it will give you two distinct random results, won't it?
MyModel.objects.order_by('?')[:2] # 2 random results.
For future readers.
Get the list of ids of all records:
my_ids = MyModel.objects.values_list('id', flat=True)
my_ids = list(my_ids)
Then pick n random ids from all of the above ids (using the stdlib random module):
import random

n = 2
rand_ids = random.sample(my_ids, n)
And get records for these ids:
random_records = MyModel.objects.filter(id__in=rand_ids)
Object.objects.order_by('?')[:2]
This would return two randomly ordered records. You can add distinct() if there are records with the same value in your dataset.
For sampling n random values from a sequence, the standard random module can be used:
random.Random().sample(range(0, last), 2)
This will pick 2 distinct random samples from among the sequence elements 0 to last-1.
from django.db import models
from random import randint
from django.db.models.aggregates import Count

class ProductManager(models.Manager):
    def random(self, count=5):
        # Pick a random starting index, then return `count` consecutive objects.
        index = randint(0, self.aggregate(count=Count('id'))['count'] - count)
        return self.all()[index:index + count]
You can get a different number of objects this way:
class ModelName(models.Model):
    # Define model fields etc.

    @classmethod
    def get_random(cls, n=2):
        """Returns a number of random objects. Pass the number when calling."""
        import random
        n = int(n)  # number of objects to return
        count = cls.objects.count()
        selection = random.sample(range(count), n)  # n distinct random indices
        selected_objects = []
        for each in selection:
            selected_objects.append(cls.objects.all()[each])
        return selected_objects

Keeping count of values available from among multiple sets

I have the following situation:
I am generating n combinations of size 3, made from n values. Each kth combination [0...n] is pulled from a pool of values, located in the kth index of a list of n sets. Each value can appear 3 times. So if I have 10 values, then I have a list of size 10, and each index holds a set of values 0-10.
So, it seems to me that a good way to do this is to have something keeping count of all the available values from among all the sets. That way, if a value is rare (let's say there is only 1 left), and I had a structure where I could look up the rarest value and have the structure tell me which index it was located in, then generating the possible combinations would be much easier.
How could I do this? What about one structure to keep count of elements, and a dictionary to keep track of list indices that contain the value?
Edit: I guess I should add that a specific problem I am looking to solve here is how to update the set for every index of the list (or whatever other structures I end up using), so that when I use a value 3 times, it is made unavailable for every other combination.
Thank you.
Another edit
It seems that this may be a little too abstract to be asking for solutions when it's hard to understand what I am even asking for. I will come back with some code soon, please check back in 1.5-2 hours if you are interested.
how to update the set for every index of the list (or whatever other structures i end up using), so that when I use a value 3 times, it is made unavailable for every other combination.
I assume you want to sample the values truly randomly, right? What if you put 3 of each value into a list, shuffle it with random.shuffle, and then just keep popping values from the end of the list when you're building your combination? If I'm understanding your problem right, here's example code:
from random import shuffle

valid_values = [i for i in range(10)]  # the valid values are 0 through 9 in my example, update accordingly for yours
vals = 3 * valid_values  # I have 3 of each valid value
shuffle(vals)  # randomly shuffle them
while len(vals) != 0:
    combination = (vals.pop(), vals.pop(), vals.pop())  # combinations are 3 values?
    print(combination)
EDIT: Updated code based on the added information that you have sets of values (but this still assumes you can use more than one value from a given set):
from random import shuffle

my_sets_of_vals = [......]  # list of sets
valid_values = list()
for i in range(len(my_sets_of_vals)):
    for val in my_sets_of_vals[i]:
        valid_values.append((i, val))  # this could probably be a list comprehension, but I forgot the syntax
vals = 3 * valid_values  # I have 3 of each valid value
shuffle(vals)  # randomly shuffle them
while len(vals) != 0:
    combination = (vals.pop()[1], vals.pop()[1], vals.pop()[1])  # combinations are 3 values?
    print(combination)
Based on the edit, you could make an object for each value. It could hold the number of times you have used the element and the element itself. When you find you have used an element three times, remove it from the list; a sketch follows.
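A minimal sketch of that idea (the class and variable names here are illustrative, not from the question):
class CountedValue(object):
    """Wraps a value together with how many times it has been used."""
    def __init__(self, value, max_uses=3):
        self.value = value
        self.uses = 0
        self.max_uses = max_uses

    def use(self):
        self.uses += 1
        return self.uses >= self.max_uses  # True means the caller should drop it

pool = [CountedValue(v) for v in range(10)]  # 10 example values
chosen = pool[0]
if chosen.use():
    pool.remove(chosen)  # used 3 times, so it is no longer available to any combination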

Evenly Distributed Slice of Django QuerySet (or Python list)

I'd like to get some values from a database and slice "n" of them, say to draw on a graph.
I'm starting with a Django QuerySet, but I don't think this is something I can reduce just via the ORM, so I'm fine with getting a list of values and then using whatever libs in Python are available to get a sample of the data.
So if I have a dataset of a thousand items, I'd like to be able to grab an evenly distributed range of non-random samples from that dataset, always including the start and end points and then evenly distributing the elements in between.
For instance, if:
data = [x for x in xrange(0, 777)]
and I wanted ten of them, how would I get a list back not of every 10th item, but exactly ten list items distributed evenly over the total number of elements in the list?
I'm trying:
number_of_results = 10
step = len(data) / number_of_results
data[::step]
But I'm hoping there's a more efficient way (and also a way that keeps the end point and returns exactly number_of_results items, even if the step between the items can't be exactly even).
There might be more efficient ways of doing it, but this is just a thought:
qs_ids = list(ModelA.objects.order_by('id').values_list('id', flat=True))
number_of_results = 10
step = len(qs_ids) / number_of_results
ids = qs_ids[::step]
qs = ModelA.objects.filter(id__in=ids).order_by('id')
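If you need exactly number_of_results items with the first and last elements always included, a minimal sketch (plain Python, not ORM-specific, so adapt it to your queryset) is to compute the indices directly instead of relying on the step:
def evenly_spaced(seq, n):
    """Pick exactly n items from seq, always including the first and last."""
    if n >= len(seq):
        return list(seq)
    if n == 1:
        return [seq[0]]
    last = len(seq) - 1
    return [seq[int(round(i * last / float(n - 1)))] for i in range(n)]

data = [x for x in xrange(0, 777)]
sample = evenly_spaced(data, 10)  # exactly 10 items; data[0] and data[776] are included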

Program running too slow! : Suggest Algorithmic/Implementation Optimization

I have a huge python list (A) of lists. The length of list A is around 90,000. Each inner list contains around 700 tuples of (datetime.date, string). Now, I am analyzing this data. What I am doing is taking a window of size x in the inner lists, where x = len(inner list) * (some fraction <= 1), and saving each ordered pair (a, b) where a occurs before b in that window (the inner lists are sorted with respect to time). I move this window up to the last element, adding one element at a time from one end and removing one from the other, which takes O(window-size) time since I only consider the new tuples. My code:
for i in xrange(window_size):
    j = i + 1
    while j < window_size:
        check_and_update(cur, my_list[i][1], my_list[j][1], log)
        j = j + 1
i = 1
while i <= len(my_list) - window_size:
    j = i
    k = i + window_size - 1
    while j < k:
        check_and_update(cur, my_list[j][1], my_list[k][1], log)
        j += 1
    i += 1
Here cur is a sqlite3 database cursor, my_list is a list containing the tuples, log is an opened log file, and I run this code for every list in A. In the method check_and_update() I look the tuple up in my database; if it exists I increment its total number of occurrences so far, otherwise I insert it. Code:
def check_and_update(cur, start, end, log):
    t = str(start) + ":" + str(end)
    cur.execute("INSERT OR REPLACE INTO Extra (tuple, count) "
                "VALUES (?, coalesce((SELECT count + 1 FROM Extra WHERE tuple = ?), 1))", [t, t])
As expected, this number of tuples is HUGE, and I had previously experimented with a dictionary, which ate up memory quite fast. So I resorted to SQLite3, but now it is too slow. I have tried indexing, but with no help. Probably my program is spending way too much time querying and updating the database. Do you have any optimization ideas for this problem? Perhaps a change of algorithm, or some different approach/tools. Thank you!
Edit: My goal here is to find the total number of tuples of strings that occur within the window grouped by the number of different innerlists they occur in. I extract this information with this query:
for i in range(1, size + 1):
    cur.execute('select * from Extra where count = ?', (i,))  # bind i as a single parameter
    # other stuff
For example (I am ignoring the date entries and will write them as 'dt'):
My_list = [
    [(dt, 'user1'), (dt, 'user2'), (dt, 'user3')],
    [(dt, 'user3'), (dt, 'user4')],
    [(dt, 'user2'), (dt, 'user3'), (dt, 'user1')],
]
Here, if I take fraction = 1, the results are:
only 1 occurrence in window: 5 (user 1-2,1-3,3-4,2-1,3-1)
only 2 occurrence in window: 2 (user 2-3)
Let me get this straight.
You have up to about 22 billion potential tuples (for 90,000 lists, any of 700 entries paired with any of the entries that follow it, 350 on average), which might be fewer depending on the window size. You want to find, grouped by the number of inner lists that they appear in, how many tuples there are.
Data of this size has to live on disk. The rule for data that lives on disk due to size is, "Never randomly access, instead generate and then sort."
So I'd suggest that you write out each tuple to a log file, one tuple per line. Sort that file. Now all instances of any given tuple are in one place. Then run through the file, and for each tuple emit the count of how many times it appears (that is, how many inner lists it is in). Sort that second file. Now run through that file, and you can extract how many tuples appeared 1x, 2x, 3x, etc.
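A minimal sketch of that pipeline, assuming the tuples were written one per line to a file named pairs.txt (the file names and the call to the external sort command are illustrative assumptions):
import subprocess
from itertools import groupby

# Sort the raw tuple log on disk; an external sort handles files far larger than RAM.
subprocess.check_call(["sort", "-o", "pairs.sorted.txt", "pairs.txt"])

# Pass 1: all copies of a tuple are now adjacent, so one linear scan yields
# how many inner lists each tuple appeared in.
with open("pairs.sorted.txt") as src, open("counts.txt", "w") as dst:
    for pair, group in groupby(line.rstrip("\n") for line in src):
        dst.write("%d\n" % sum(1 for _ in group))

# Pass 2: histogram of those counts -> how many tuples appeared in 1, 2, 3, ... lists.
# (The distinct counts are few, so a small in-memory dict works instead of a second sort.)
histogram = {}
with open("counts.txt") as src:
    for line in src:
        c = int(line)
        histogram[c] = histogram.get(c, 0) + 1
print(histogram)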
If you have multiple machines, it is easy to convert this into a MapReduce. (Which is morally the same approach, but you get to parallelize a lot of stuff.)
Apache Hadoop is one of the MapReduce implementations suited to this kind of problem.

Compare DB row values efficiently

I want to loop through a database of documents and calculate a pairwise comparison score.
A simplistic, naive method would nest a loop within another loop. This would result in the program comparing documents twice and also comparing each document to itself.
Is there a name for the algorithm for doing this task efficiently?
Is there a name for this approach?
Thanks.
Assume all items have a number ItemNumber
Simple solution: always have the 2nd element's ItemNumber greater than the first item's.
e.g.
for (firstitem = 1 to maxitemnumber)
    for (seconditem = firstitem + 1 to maxitemnumber)
        compare(firstitem, seconditem)
Visual note: if you think of the comparisons as a matrix (the item number of one on one axis, the item number of the other on the other axis), this looks at one of the triangles:
........
x.......
xx......
xxx.....
xxxx....
xxxxx...
xxxxxx..
xxxxxxx.
I don't think it's complicated enough to qualify for a name.
You can avoid duplicate pairs just by forcing a comparison on any value which might be different between different rows - the primary key is an obvious choice, e.g.
Unique pairings:
SELECT a.item as a_item, b.item as b_item
FROM table AS a, table AS b
WHERE a.id<b.id
Potentially there are a lot of ways in which the comparison operation can be used to generate data summaries and therefore identify potentially similar items - for single words, soundex is an obvious choice - however, you don't say what your comparison metric is.
C.
You can keep track of which documents you have already compared, e.g. (with numbers ;))
compared = set()
for i in [1, 2, 3]:
    for j in [1, 2, 3]:
        pair = frozenset((i, j))
        if i != j and pair not in compared:
            compared.add(pair)
            compare(i, j)
Another idea would be to create the combinations of documents first and iterate over them (a sketch follows). But in order to generate them, you have to iterate over both lists and then you iterate over the result list again, so I don't think that it has any advantage.
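In Python that "generate the pairs first" idea is essentially itertools.combinations; a minimal sketch, with compare() assumed to exist as in the other snippets:
from itertools import combinations

docs = [1, 2, 3]  # placeholder documents
for a, b in combinations(docs, 2):  # each unordered pair exactly once, no self-pairs
    compare(a, b)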
Update:
If you have the documents already in a list, then Hogan's answer is indeed better. But I think it needs a better example:
docs = [1, 2, 3]
l = len(docs)
for i in range(l):
    for j in range(i + 1, l):
        compare(docs[i], docs[j])
Something like this?
src = [1, 2, 3]
for i, x in enumerate(src):
    for y in src[i + 1:]:  # skip x itself to avoid self-comparison
        compare(x, y)
Or you might wish to generate a list of pairs instead:
pairs = [(x, y) for i, x in enumerate(src) for y in src[i + 1:]]
