Iterating datastore entities conveniently - Python

Please educate me on how to do this the right way, as I feel my current way is long-winded.
I know iterating over all entities in App Engine is not quite how it is designed to be used, but sometimes I want to gather statistics about my entities, for example how many users are female. In reality the criteria might be something more complicated, but in any case something that requires examining each entity.
Here is some pseudoish code on how I am iterating over entities:
def handle_count_female_users(cursor=None, counter=0):
    q = User.all()
    if cursor:
        q.with_cursor(cursor)
    MAX_FETCH = 100
    users = q.fetch(MAX_FETCH)
    count_of_female_users = len(filter(lambda user: user.gender == 'female', users))
    total_count = counter + count_of_female_users
    if len(users) == MAX_FETCH:
        # A full batch means there may be more entities, so chain another task.
        Task(
            url="/count_female_users",
            params={
                'counter': str(total_count),
                'cursor': q.cursor()
            }
        ).add()
    else:
        # Now finally have the result
        logging.info("We have %s female users in total." % total_count)
I have routing code that automatically maps GET /foo to be handled by handle_foo, something I've found convenient. As you can see, even with that, I have a lot of code supporting the looping that has almost nothing to do with what I actually want to accomplish.
What I would really want to do is something like:
tally_entities(
    entity_class=User,
    filter_criteria=lambda user: user.gender == 'female',
    result=lambda count: logging.info("We have %s female users in total" % count)
)
Any ideas how to get closer to this ideal, or is there some even better way?

Sounds like a good use case for mapreduce:
http://code.google.com/p/appengine-mapreduce/wiki/GettingStartedInPython
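With that library you write a mapper function that is applied to every entity, and the framework handles the batching, cursors, and task chaining for you. A minimal sketch based on the counters API from the linked guide; the function name here is illustrative, and it would still need to be registered in mapreduce.yaml with User as the entity kind:

from mapreduce import operation as op

def count_female(entity):
    # Called once per User entity by the framework, which
    # aggregates the counter across all shards.
    if entity.gender == 'female':
        yield op.counters.Increment('female_users')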

Related

sqlite: Calling database each time function is called

I have a function that is called to calculate a response every time the user inputs something. The function gets the response from a database. What I don't understand is why I have to redefine the variable (I have called it intents_db) that contains all the data from the database each time the function is called. I have tried putting it outside the function, but then my program only works the first time, and returns an empty answer the second time the user inputs something.
def response(sentence, user_id='1'):
    results = classify_intent(sentence)
    intents_db = c.execute("SELECT row_num, responses, tag, responses, intent_type, response_type, context_set,\
        context_filter FROM intents")
    if results:
        # loop as long as there are matches to process
        while results:
            if results[0][1] > answer_threshold:
                for i in intents_db:
                    # print('tag:', i[2])
                    if i[2] == results[0][0]:
                        print(i[6])
                        if i[6] != 'N/A':
                            if show_details:
                                print('context: ', i[6])
                            context[user_id] = i[6]
                            responses = i[1].split('&/&')
                            print(random.choice(responses))
                        if i[7] == 'N/A' in i or \
                                (user_id in context and i[7] in i and i[7] == context[user_id]):
                            # a random response from the intent
                            responses = i[1].split('&/&')
                            print(random.choice(responses))
                            print(i[4], i[5])
                print(results[0][1])
            elif results[0][1] <= answer_threshold:
                print(results[0][1])
                for i in intents_db:
                    if i[2] == 'unknown':
                        # a random response from the intent
                        responses = i[1].split('&/&')
                        print(random.choice(responses))
                        initial_comm_output = random.choice(responses)
                        return initial_comm_output
            results.pop(0)
    else:
        initial_comm_output = "Something unexpected happened when calculating response. Please restart me"
        return initial_comm_output
    return results
Also, I started getting into databases and sqlite3 because I want to build a large database long term. Therefore it also seems inefficient that I have to load the whole table at all. Is there some way I can load only the row of data I need? I have a row_num column in my database, so if it were somehow possible to say something like:
"SELECT WHERE row_num=2 FROM intents"
that would be great, but I can't figure out how to do it.
cursor.execute() returns an iterator (the cursor itself), and you can only loop over it once.
If you want to reuse it, turn it into a list:
intents_db = list(c.execute("..."))
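A minimal standalone demonstration of the exhaustion, using an in-memory database:

import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE intents (tag TEXT)")
c.execute("INSERT INTO intents VALUES ('greeting')")

rows = c.execute("SELECT tag FROM intents")
print(list(rows))  # [('greeting',)] -- the first pass consumes the cursor
print(list(rows))  # [] -- nothing is left on the second pass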
Therefore it also seems inefficient that I have to load the whole table at all. Is there some way I can load only the row of data I need? I have a row_num column in my database, so if it were somehow possible to say something like "SELECT WHERE row_num=2 FROM intents" that would be great, but I can't figure out how to do it.
You nearly got it: it is
intents_db = c.execute("""SELECT row_num, responses, tag, responses, intent_type,
                          response_type, context_set, context_filter
                          FROM intents WHERE row_num=2""")
But don't make the mistake many database beginners make and insert a variable from your program directly into that string. That leaves the program open to SQL injection.
Rather, do
row_num = 2
intents_db = c.execute("""SELECT row_num, responses, tag, responses, intent_type,
                          response_type, context_set, context_filter
                          FROM intents WHERE row_num=?""", (row_num,))
Of course, you can also set conditions for other fields.
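For instance, filtering on the tag column instead works the same way (the tag value here is just an example):

tag = 'greeting'
intents_db = c.execute("SELECT row_num, responses, tag FROM intents WHERE tag=?", (tag,))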

Ordering objects by rating accounting for the number or ratings

I'm trying to do something similar to the first response in this SO question: SQL ordering by rating/votes, where resources may be rated (one rating per user per resource), but when ordering the resources based on their ratings, any resources with fewer than X separate ratings will appear below those with X or more.
I'm implementing this in Django and I'd very much prefer to avoid the use of raw query and keep within the Django model and query framework.
So far, this is what I have:
data = []
data_top = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_top:
    data.append(d)
data_bottom = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
for d in data_bottom:
    data.append(d)
This all works and returns the ordering by rating that I need. However, it doesn't feel very efficient, what with running two queries and looping over the results of each.
Is there a better way I can code this, either in a single query, or at least avoiding looping though each query set?
Any help much appreciated.
from itertools import chain

main_query = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
)
data_top_query = main_query.exclude(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data_bottom_query = main_query.exclude(rate_count__gte=settings.ORB_RESOURCE_MIN_RATINGS).order_by(order_by)
data = list(chain(data_top_query, data_bottom_query))
Using itertools.chain is faster than looping over each list and appending elements one by one.
Also, the querysets only get evaluated when list is called on them (they don't hit the database until then).
FYI, the above will hit the db twice when evaluated.
You're currently querying twice and iterating twice, but you can cut it down to one of each easily: just query for the items ordered by rating, then partition them as you iterate:
data_top = []
data_bottom = []
data = Resource.objects.all().annotate(
    rating=Avg('resourcerating__rating'),
    rate_count=Count('resourcerating')
).order_by(order_by)
for d in data:
    if d.rate_count >= settings.ORB_RESOURCE_MIN_RATINGS:
        data_top.append(d)
    else:
        data_bottom.append(d)
data = data_top + data_bottom
This can also be done with the query only, by creating another aggregate column which contains the value rate_count < settings.ORB_RESOURCE_MIN_RATINGS (return 0 for values above or at the threshold, 1 for below) and sorting on (new_column, rating). Pretty sure this would require some custom SQL, but perhaps someone else knows otherwise.
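On newer Django versions (1.8+), conditional expressions can express exactly this without raw SQL. A sketch, not tested against the poster's schema; below_threshold is just an illustrative annotation name:

from django.db.models import Avg, Case, Count, IntegerField, Value, When

data = (Resource.objects
        .annotate(rating=Avg('resourcerating__rating'),
                  rate_count=Count('resourcerating'))
        # 1 for resources below the minimum-ratings threshold, 0 otherwise
        .annotate(below_threshold=Case(
            When(rate_count__lt=settings.ORB_RESOURCE_MIN_RATINGS, then=Value(1)),
            default=Value(0),
            output_field=IntegerField()))
        # below-threshold resources sort after everything else
        .order_by('below_threshold', order_by))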

Paginating requests to an API

I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates I need to specify a start_pos and end_pos for offsetting this, in order to walk through the results.
Say the urllib request looks like http://someservice?query='test'&start_pos=X&end_pos=Y.
If I send an initial 'taster' query with the lowest possible data transfer, such as http://someservice?query='test'&start_pos=1&end_pos=1, in order to get back a result of, say, total_hits = 1234, I'd like to work out an approach to most cleanly request those 1234 results in batches of, again say, 100 or 1000 or...
This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:
hits_per_page = 100  # or 1000 or 200 or whatever, adjustable
total_hits = 1234    # retrieved with BSoup from the 'taster' query
base_url = "http://someservice?query='test'"
startdoc_positions = [n for n in range(1, total_hits, hits_per_page)]
enddoc_positions = [startdoc_position + hits_per_page - 1 for startdoc_position in startdoc_positions]
for start, end in zip(startdoc_positions, enddoc_positions):
    if end > total_hits:
        end = total_hits
    print "url to request is:\n ",
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)
p.s. I'm a long time consumer of StackOverflow, especially the Python questions, but this is my first question posted. You guys are just brilliant.
I'd suggest using
positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:
    ...  # request base_url with start_pos=start and end_pos=end
and then not worrying about whether end exceeds total_hits unless the API you're using really cares whether you request something out of range; most will handle this case gracefully.
P.S. Check out httplib2 as a replacement for the urllib/urllib2 combo.
It might be interesting to use some kind of generator for this scenario to iterate over the list.
def getitems(base_url, per_page=100):
    content = ...urllib...  # fetch the first page
    total_hits = get_total_hits(content)
    sofar = 0
    while sofar < total_hits:
        items_from_next_query = ...urllib...  # fetch the next batch
        for item in items_from_next_query:
            sofar += 1
            yield item
Mostly just pseudo code, but it could prove quite useful if you need to do this many times, by simplifying the logic it takes to get the items: the generator just yields each item in turn, which is quite natural in Python. It saves you quite a bit of duplicate code as well.
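A fleshed-out sketch of that idea; parse_total_hits and parse_items are hypothetical helpers standing in for the BeautifulSoup parsing of the XML:

import urllib2

def getitems(base_url, per_page=100):
    # Cheap 'taster' request just to learn the total hit count.
    first = urllib2.urlopen("%s&start_pos=1&end_pos=1" % base_url).read()
    total_hits = parse_total_hits(first)  # hypothetical XML parser
    start = 1
    while start <= total_hits:
        end = min(start + per_page - 1, total_hits)
        page = urllib2.urlopen(
            "%s&start_pos=%d&end_pos=%d" % (base_url, start, end)).read()
        for item in parse_items(page):  # hypothetical XML parser
            yield item
        start = end + 1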

Python: why does this code take forever (infinite loop?)

I'm developing an app in Google App Engine. One of my methods never completes, which makes me think it's caught in an infinite loop. I've stared at it, but can't figure it out.
Disclaimer: I'm using GAEUnit (http://code.google.com/p/gaeunit) to run my tests. Perhaps it's acting oddly?
This is the problematic function:
def _traverseForwards(course, c_levels):
    ''' Looks forwards in the dependency graph '''
    result = {'nodes': [], 'arcs': []}
    if c_levels == 0:
        return result
    model_arc_tails_with_course = set(_getListArcTailsWithCourse(course))
    q_arc_heads = DependencyArcHead.all()
    for model_arc_head in q_arc_heads:
        for model_arc_tail in model_arc_tails_with_course:
            if model_arc_tail.key() in model_arc_head.tails:
                result['nodes'].append(model_arc_head.sink)
                result['arcs'].append(_makeArc(course, model_arc_head.sink))
                # rec_result = _traverseForwards(model_arc_head.sink, c_levels - 1)
                # _extendResult(result, rec_result)
    return result
Originally, I thought it might be a recursion error, but I commented out the recursion and the problem persists. If this function is called with c_levels = 0, it runs fine.
The models it references:
class Course(db.Model):
    dept_code = db.StringProperty()
    number = db.IntegerProperty()
    title = db.StringProperty()
    raw_pre_reqs = db.StringProperty(multiline=True)
    original_description = db.StringProperty()

    def getPreReqs(self):
        return pickle.loads(str(self.raw_pre_reqs))

    def __repr__(self):
        return "%s %s: %s" % (self.dept_code, self.number, self.title)

class DependencyArcTail(db.Model):
    ''' A list of courses that is a pre-req for something else '''
    courses = db.ListProperty(db.Key)

    def equals(self, arcTail):
        for this_course in self.courses:
            if not (this_course in arcTail.courses):
                return False
        for other_course in arcTail.courses:
            if not (other_course in self.courses):
                return False
        return True

class DependencyArcHead(db.Model):
    ''' Maintains a course, and a list of tails with that course as their sink '''
    sink = db.ReferenceProperty()
    tails = db.ListProperty(db.Key)
Utility functions it references:
def _makeArc(source, sink):
    return {'source': source, 'sink': sink}

def _getListArcTailsWithCourse(course):
    ''' returns a LIST, not SET
        there may be duplicate entries
    '''
    q_arc_heads = DependencyArcHead.all()
    result = []
    for arc_head in q_arc_heads:
        for key_arc_tail in arc_head.tails:
            model_arc_tail = db.get(key_arc_tail)
            if course.key() in model_arc_tail.courses:
                result.append(model_arc_tail)
    return result
Am I missing something pretty obvious here, or is GAEUnit acting up?
Also - the test that is making this run slow has no more than 5 models of any kind in the datastore. I know this is potentially slow, but my app only does this once then subsequently caches it.
Ignoring the commented out recursion, I don't think this should be an infinite loop - you are just doing some for-loops over finite results sets.
However, it does seem like this would be really slow. You're looping over entire tables and then doing more datastore queries in every nested loop. It seems unlikely that this sort of request would complete in a timely manner on GAE unless your tables are really, really small.
Some rough numbers:
If H = the number of entities in DependencyArcHead and T = the average number of tails in each DependencyArcHead, then:
_getListArcTailsWithCourse is doing about H*T queries (an underestimate). In the "worst" case, the result returned from this function will have H*T elements.
_traverseForwards loops over all these results H times, and thus does another H*(H*T) queries.
Even if H and T are only on the order of tens, you could be doing thousands of queries. If they're bigger, then ... (and this ignores any additional queries you'd do if you uncommented the recursive call).
In short, I think you may want to try to organize your data a little differently if possible. I'd make a specific suggestion, but what exactly you're trying to do isn't clear to me.
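One concrete reduction that applies regardless of how the data is reorganized (a sketch, not the author's code): db.get() accepts a list of keys, so each head's tails can be fetched in a single batch call instead of one db.get per key:

def _getListArcTailsWithCourse(course):
    result = []
    for arc_head in DependencyArcHead.all():
        # One batch fetch per head instead of len(arc_head.tails) fetches;
        # db.get returns None for any key that no longer exists.
        for model_arc_tail in db.get(arc_head.tails):
            if model_arc_tail and course.key() in model_arc_tail.courses:
                result.append(model_arc_tail)
    return result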

App Engine Datastore IN Operator - how to use?

Reading: http://code.google.com/appengine/docs/python/datastore/gqlreference.html
I want to use the IN operator, but am unsure how to make it work. Let's assume the following models:
class User(db.Model):
    name = db.StringProperty()

class UniqueListOfSavedItems(db.Model):
    str = db.StringProperty()
    datesaved = db.DateTimeProperty()

class UserListOfSavedItems(db.Model):
    name = db.ReferenceProperty(User, collection='user')
    str = db.ReferenceProperty(UniqueListOfSavedItems, collection='itemlist')
How can I do a query which gets me the list of saved items for a user? Obviously I can do:
q = db.GqlQuery("SELECT * FROM UserListOfSavedItems WHERE name = :1", user[0].name)
but that gets me a list of keys. I want to now take that list and get it into a query to get the str field out of UniqueListOfSavedItems. I thought I could do:
q2 = db.Gql("SELECT * FROM UniqueListOfSavedItems WHERE := str in q")
but something's not right...any ideas? Is it (am at my day job, so can't test this now):
q2 = db.Gql("SELECT * FROM UniqueListOfSavedItems __key__ := str in q)
side note: what a devilishly difficult problem to search on because all I really care about is the "IN" operator.
Since you have a list of keys, you don't need to do a second query - you can do a batch fetch, instead. Try this:
#and this should get me the items that a user saved
useritems = db.get(saveditemkeys)
(Note you don't even need the guard clause - a db.get on 0 entities is short-circuited appropriately.)
What's the difference, you may ask? Well, a db.get takes about 20-40ms. A query, on the other hand (GQL or not) takes about 160-200ms. But wait, it gets worse! The IN operator is implemented in Python, and translates to multiple queries, which are executed serially. So if you do a query with an IN filter for 10 keys, you're doing 10 separate 160ms-ish query operations, for a total of about 1.6 seconds latency. A single db.get, in contrast, will have the same effect and take a total of about 30ms.
+1 to Adam for getting me on the right track. Based on his pointer, and doing some searching at Code Search, I have the following solution.
usersaveditems = db.GqlQuery("SELECT * FROM UserListOfSavedItems WHERE name = :1", userkey)
saveditemkeys = []
for item in usersaveditems:
    # this should create a list of keys (references) to the saved item table
    saveditemkeys.append(item.str.key())
if len(saveditemkeys) > 0:
    # and this should get me the items that a user saved
    useritems = db.GqlQuery("SELECT * FROM UniqueListOfSavedItems WHERE __key__ IN :1", saveditemkeys)
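Putting that query together with Adam's batch fetch gives a sketch like the following; get_value_for_datastore reads the stored key without dereferencing the ReferenceProperty, so it avoids one datastore fetch per item:

usersaveditems = db.GqlQuery(
    "SELECT * FROM UserListOfSavedItems WHERE name = :1", userkey)
# Collect the raw keys without fetching each referenced entity.
saveditemkeys = [UserListOfSavedItems.str.get_value_for_datastore(item)
                 for item in usersaveditems]
useritems = db.get(saveditemkeys)  # one batch fetch; fine even for an empty list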
