Python: DISTINCT on GQuery result set (GQL, GAE)

Imagine you have an entity kind in the Google App Engine datastore that stores links for anonymous users.
You would like to perform the following SQL query, which is not supported:
SELECT DISTINCT user_hash FROM links
Instead you could use:
user = db.GqlQuery("SELECT user_hash FROM links")
How to use Python most efficiently to filter the result, so it returns a DISTINCT result set?
How to count the DISTINCT result set?

Reviving this question for completion:
The DISTINCT keyword was introduced in release 1.7.4.
You can find the updated GQL reference (for example, for Python) here.
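With that release or later, the original query can be written directly; a minimal sketch, assuming the db GQL interface and the kind and property names from the question:
from google.appengine.ext import db

# Sketch: DISTINCT requires SDK 1.7.4+; `links` and `user_hash` are the
# kind and property names from the question.
q = db.GqlQuery("SELECT DISTINCT user_hash FROM links")
distinct_hashes = [link.user_hash for link in q]
print len(distinct_hashes)  # count of distinct user_hash values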

A set is a good way to deal with that:
>>> a = ['google.com', 'livejournal.com', 'livejournal.com', 'google.com', 'stackoverflow.com']
>>> b = set(a)
>>> b
set(['livejournal.com', 'google.com', 'stackoverflow.com'])
>>>
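To answer the counting part of the question, the number of distinct values is simply the length of the set:
>>> len(b)
3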
One suggestion regarding the first answer: sets and dicts are better at retrieving unique results quickly, since membership testing in a list is O(n) versus O(1) for the other two types. So if you want to store additional data, or build something like the unique_results list mentioned below, it may be better to do something like:
>>> unique_results = {}
>>> for item in a:
...     unique_results[item] = ''
...
>>> unique_results
{'livejournal.com': '', 'google.com': '', 'stackoverflow.com': ''}

One option would be to put the results into a set object:
http://www.python.org/doc/2.6/library/sets.html#sets.Set
The resulting set will consist only of the distinct values passed into it.
Failing that, building up a new list containing only the unique objects would work. Something like:
unique_results = []
for obj in user:
    if obj not in unique_results:
        unique_results.append(obj)
That for loop can also be condensed, for example into a list comprehension backed by an auxiliary set of already-seen items; see the sketch below.
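A minimal sketch of that condensed form, assuming the query results are hashable:
seen = set()
# set.add() returns None, so `not seen.add(obj)` is always true; it only
# serves to record obj as seen while keeping the first-occurrence order.
unique_results = [obj for obj in user if obj not in seen and not seen.add(obj)]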

Sorry to dig this question up, but in GAE I cannot compare model objects directly like that; I have to compare their .key() values instead:
Beware, this is very inefficient:
def unique_result(array):
    urk = {}  # unique results, keyed by str(entity key)
    for c in array:
        if str(c.key()) not in urk:
            urk[str(c.key())] = c
    return urk.values()
If anyone has a better solution, please share.
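A slightly leaner sketch of the same idea, tracking the string form of each entity key in a set instead of a dict (assuming the old db API's .key() method, as above):
def unique_result(entities):
    seen_keys = set()
    unique = []
    for entity in entities:
        k = str(entity.key())
        if k not in seen_keys:  # O(1) membership test
            seen_keys.add(k)
            unique.append(entity)
    return unique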

Related

Django turn lists into one list

I want to put this data into one list so I can sort it by timestamp. I tried itertools.chain but that didn't really work.
Thank you for your help :)
I'm very bad at making clear what I want to do, so I'm sorry upfront if this takes some explaining.
If I try a chain I get the values back like this.
I want to display it on the HTML page like this:
date, name, rating, text (newline)
likes comments
That would work the way I did it, but if I want to sort it by time it wouldn't, so I tried to think of a way to turn it into a sortable list which can be displayed. Is that understandable?
['Eva Simia', 'Peter Alexander', {'scale': 5, 'value': 5}, {'scale': 5, 'value': 5}, 1, 0, 1, 0]
it should look like this:
['Peter Alexander, scale:5, value:5, 1,0]
['Eva Simia, scale:5, value:5, 1,0]
for i in user:
    name.append(i['name'])
for i in next_level:
    rating_values.append(i['rating'])
for i in comment_values:
    comments_count.append(i['count'])
for i in likes_values:
    likes_count.append(i['count'])
for s in rating_values:
    ratings.append(s['value'])
for s in date:
    ratings.append(s['date'])
ab = itertools.chain([name], [rating_values],
                     [comments_count], [likes_values],
                     [comment_values], [date])
list(ab)
Updated after clarification:
The problem as I understand it:
You have a dataset that is split into several lists, one list per field.
Every list has the records in the same order. That is, user[x]'s rating value is necessarily rating_values[x].
You need to merge that information into a single list of composite items. You'd use zip() for that:
merged = zip(user, next_level, comment_values, likes_values, rating_values, date)
# merged is now [(user[0], next_level[0], comment_values[0], ...),
# (user[1], next_level[1], comment_values[1], ...),
# ...]
From there, you can simply sort your list using sorted():
result = sorted(merged, key=lambda i: (i[5], i[0]))
The key argument must be a function. It is called once for each item in the list and must return the key that will be used to compare items. Here, we build a small function on the fly that returns the date and the username, which means items are sorted first by date and, when dates are equal, by username.
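For illustration, a minimal self-contained sketch of the same pattern with made-up two-element lists (only three fields here, so the sort key uses indexes 2 and 0):
user = ['Eva Simia', 'Peter Alexander']
rating_values = [5, 4]
date = ['2013-05-02', '2013-05-01']

merged = zip(user, rating_values, date)
result = sorted(merged, key=lambda i: (i[2], i[0]))
# result: [('Peter Alexander', 4, '2013-05-01'), ('Eva Simia', 5, '2013-05-02')]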
[Past answer about itertools.chain, before the clarification]
ab = list(itertools.chain(
    (i['name'] for i in user),
    (i['rating'] for i in next_level),
    (i['count'] for i in comment_values),
    (i['count'] for i in likes_values),
    (i['value'] for i in rating_values),
    (i['date'] for i in date),
))
The point of using itertools.chain is usually to avoid needless copies and intermediary objects. To do that, you want to pass it iterators.
chain will return an iterator that will iterate through each of the given iterators, one at a time, moving to the next iterator when current stops.
Note that every generator expression has to be wrapped in parentheses, or Python will complain. Do not use square brackets instead, as that would build an intermediary list.
You can join lists simply by using +.
l = name + rating_values + comments_count + ...
date, rating_values, likes_values, comment_values, next_level, user = (
    list(t) for t in zip(*sorted(zip(date, rating_values, likes_values,
                                     comment_values, next_level, user)))
)
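This works because the inner zip() pairs the parallel lists up item by item, sorted() orders those tuples by their first element (date here), and zip(*...) splits them back into separate, now consistently ordered lists. A tiny sketch with made-up data:
date = ['2013-05-02', '2013-05-01']
user = ['Eva', 'Peter']
date, user = (list(t) for t in zip(*sorted(zip(date, user))))
# date -> ['2013-05-01', '2013-05-02'], user -> ['Peter', 'Eva']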

Is there something simple like a set for un-hashable objects?

For hashable objects inside a dict, I can easily pare down duplicate values stored in the dict using a set. For example:
a = {'test': 1, 'key': 1, 'other': 2}
b = set(a.values())
print(b)
This would display {1, 2}.
The problem I have is that I am using a dict to store the mapping between variable keys in __dict__ and the corresponding processing functions, which will be passed to an engine to order and process them; some of these functions may be fast and some may be slower due to accessing an API. Because each function may use multiple variables, it needs multiple mappings in the dict. I'm wondering if there is a way to do this, or if I am stuck writing my own solution?
Ended up building a callable class, since caching could speed things up for me:
from collections.abc import Callable

class RemoveDuplicates(Callable):
    input_cache = []
    output_cache = []

    def __call__(self, in_list):
        if in_list in self.input_cache:
            idx = self.input_cache.index(in_list)
            return self.output_cache[idx]
        else:
            self.input_cache.append(in_list)
            out_list = self._remove_duplicates(in_list)
            self.output_cache.append(out_list)
            return out_list

    def _remove_duplicates(self, src_list):
        result = []
        for item in src_list:
            if item not in result:
                result.append(item)
        return result
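A quick usage sketch (the input data here is made up):
dedupe = RemoveDuplicates()
data = [{'a': 1}, {'a': 1}, {'b': 2}]   # dicts are un-hashable
print(dedupe(data))                      # [{'a': 1}, {'b': 2}]
print(dedupe(data))                      # same result, served from the cache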
If the objects can be ordered, you can use itertools.groupby to eliminate the duplicates:
>>> a = {'test': 1, 'key': 1, 'other': 2}
>>> b = [k for k, it in itertools.groupby(sorted(a.values()))]
>>> print(b)
[1, 2]
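The same approach also works for un-hashable values, as long as they can be sorted consistently; for example, with made-up lists:
>>> import itertools
>>> values = [[1, 2], [1, 2], [3]]
>>> [k for k, it in itertools.groupby(sorted(values))]
[[1, 2], [3]]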
Is there something simple like a set for un-hashable objects
Not in the standard library, but you can look beyond it and search for a BTree-based implementation of a dictionary. I googled and found a few hits, of which the first one (BTrees) seems promising and interesting.
Quoting from the wiki:
The BTree-based data structures differ from Python dicts in several fundamental ways. One of the most important is that while dicts require that keys support hash codes and equality comparison, the BTree-based structures don't use hash codes and require a total ordering on keys.
Of course, it is a trivial fact that a set can be implemented as a dictionary in which the value is unused.
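A minimal sketch of that idea using the third-party BTrees package (the package and its OOSet type are an assumption here, not part of the standard library):
# pip install BTrees  -- assumption: the ZODB BTrees distribution
from BTrees.OOBTree import OOSet

values = [[1], [1], [2]]      # lists are un-hashable but totally ordered
unique = OOSet()
for v in values:
    if v not in unique:
        unique.insert(v)
print(list(unique))           # [[1], [2]]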
You could (indirectly) use the bisect module to create a sorted collection of your values, which would greatly speed up insertion of new values and membership testing in general; together, these can be used to ensure that only unique values get put into it.
In the code below, I've used un-hashable set values for the sake of illustration.
# see http://code.activestate.com/recipes/577197-sortedcollection
from sortedcollection import SortedCollection
a = {'test': {1}, 'key': {1}, 'other': {2}}
sc = SortedCollection()
for value in a.values():
    if value not in sc:
        sc.insert(value)
print(list(sc))  # --> [{1}, {2}]

Can I search for items in list of objects without a for loop?

Not sure if this is possible or not, but I'm using sqlalchemy and my queries return all the items as objects. I want to search for items within that list of objects without having to do a for loop each time.
Here's how I know now to currently do it:
for x in objects_from_query:
    print x.name, ' - ', x.age
If I want to find out whether I have data for a user named 'bob' in my list, I would have to:
for x in objects_from_query:
    if x.name == 'bob':
        print 'bob exists!'
Because I have to do this a lot, I'm wondering if there's a faster way to find whether bob exists without having to do a for loop every time. Typically with lists I do something like objects_from_query.index("bob"), but is there something similar when, instead of a normal list, it's a list of objects?
If you're using sqlalchemy, you can use the .filter method on the Query object, which translates to SQL as a WHERE clause. Something like the following would work:
import sqlalchemy as sql
from sqlalchemy import orm

Session = orm.sessionmaker(bind=sql.create_engine('sqlite:///sql.db'))
session = Session()
bobs = session.query(Names).filter(Names.name == "bob")  # Names is your mapped model class
Answer: no. You need a loop whatever data structure you use: van Emde Boas trees, arrays, heaps, etc.
This is why you should learn programming: so that you can find a data structure that lets that loop find what you need as fast as possible.
How about a filter to get all objects with the name bob, like:
filter(lambda x: x.name == 'bob', objects_from_query)
This still probably runs a loop underneath, but if you're looking for a more concise way of writing it, this is pretty OK.
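If you only need the first match, or just to know whether one exists, a generator expression with next() is another concise option (a sketch, not specific to SQLAlchemy):
# Returns the first object whose name is 'bob', or None if there is none
bob = next((x for x in objects_from_query if x.name == 'bob'), None)
if bob is not None:
    print 'bob exists!'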
I can't really talk about the first one, but for the second one, something like this could speed up the check:
print(any(x.name == 'bob' for x in objects_from_query))
>>> class thing():
...     name = ""
...     age = 0
...
>>> [x.age for x in l if x.name == "bob"]
[1]
List comprehension.

How to compare an element of a tuple (int) to determine if it exists in a list

I have the two following lists:
# List of tuples representing the index of resources and their unique properties
# Format of (ID,Name,Prefix)
resource_types=[('0','Group','0'),('1','User','1'),('2','Filter','2'),('3','Agent','3'),('4','Asset','4'),('5','Rule','5'),('6','KBase','6'),('7','Case','7'),('8','Note','8'),('9','Report','9'),('10','ArchivedReport',':'),('11','Scheduled Task',';'),('12','Profile','<'),('13','User Shared Accessible Group','='),('14','User Accessible Group','>'),('15','Database Table Schema','?'),('16','Unassigned Resources Group','#'),('17','File','A'),('18','Snapshot','B'),('19','Data Monitor','C'),('20','Viewer Configuration','D'),('21','Instrument','E'),('22','Dashboard','F'),('23','Destination','G'),('24','Active List','H'),('25','Virtual Root','I'),('26','Vulnerability','J'),('27','Search Group','K'),('28','Pattern','L'),('29','Zone','M'),('30','Asset Range','N'),('31','Asset Category','O'),('32','Partition','P'),('33','Active Channel','Q'),('34','Stage','R'),('35','Customer','S'),('36','Field','T'),('37','Field Set','U'),('38','Scanned Report','V'),('39','Location','W'),('40','Network','X'),('41','Focused Report','Y'),('42','Escalation Level','Z'),('43','Query','['),('44','Report Template ','\\'),('45','Session List',']'),('46','Trend','^'),('47','Package','_'),('48','RESERVED','`'),('49','PROJECT_TEMPLATE','a'),('50','Attachments','b'),('51','Query Viewer','c'),('52','Use Case','d'),('53','Integration Configuration','e'),('54','Integration Command f'),('55','Integration Target','g'),('56','Actor','h'),('57','Category Model','i'),('58','Permission','j')]
# This is a list of resource ID's that we do not want to reference directly, ever.
unwanted_resource_types=[0,1,3,10,11,12,13,14,15,16,18,20,21,23,25,27,28,32,35,38,41,47,48,49,50,57,58]
I'm attempting to compare the two in order to build a third list containing the 'Name' of each unique resource type that currently exists in unwanted_resource_types. e.g. The final result list should be:
result = ['Group','User','Agent','ArchivedReport','ScheduledTask','...','...']
I've tried the following that (I thought) should work:
result = []
for res in resource_types:
    if res[0] in unwanted_resource_types:
        result.append(res[1])
and when that failed to populate result I also tried:
result = []
for res in resource_types:
    for type in unwanted_resource_types:
        if res[0] == type:
            result.append(res[1])
also to no avail. Is there something I'm missing? I believe this would be the right place to use a list comprehension, but that's still in my grey basket of understanding fully (the Python docs are a bit too succinct for me in this case).
I'm also open to completely rethinking this problem, but I do need to retain the list of tuples as it's used elsewhere in the script. Thank you for any assistance you may provide.
Your resource types are using strings, and your unwanted resources are using ints, so you'll need to do some conversion to make it work.
Try this:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
or using a list comprehension:
result = [item[1] for item in resource_types if int(item[0]) in unwanted_resource_types]
The numbers in resource_types are numbers contained within strings, whereas the numbers in unwanted_resource_types are plain numbers, so your comparison is failing. This should work:
result = []
for res in resource_types:
    if int(res[0]) in unwanted_resource_types:
        result.append(res[1])
The problem is that your triples contain strings and your unwanted resources contain numbers, change the data to
resource_types=[(0,'Group','0'), ...
or use int() to convert the strings to ints before comparison, and it should work. Your result can be computed with a list comprehension as in
result=[rt[1] for rt in resource_types if int(rt[0]) in unwanted_resource_types]
If you change ('0', ...) into (0, ... you can leave out the int() call.
Additionally, you may change the unwanted_resource_types variable into a set, like
unwanted_resource_types=set([0,1,3, ... ])
to improve speed (if speed is an issue, else it's unimportant).
The one-liner:
result = map(lambda x: dict(map(lambda a: (int(a[0]), a[1]), resource_types))[x], unwanted_resource_types)
without any explicit loop does the job.
Ok - you don't want to use this in production code - but it's fun. ;-)
Comment:
The inner dict(map(lambda a: (int(a[0]), a[1]), resource_types)) creates a dictionary from the input data:
{0: 'Group', 1: 'User', 2: 'Filter', 3: 'Agent', ...
The outer map chooses the names from the dictionary.
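A more readable equivalent that builds the lookup dictionary once and reuses it (a sketch using the same variable names as above):
# Map each numeric ID to its name, then look the unwanted IDs up directly
names_by_id = dict((int(item[0]), item[1]) for item in resource_types)
result = [names_by_id[rid] for rid in unwanted_resource_types]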

How can I filter by key, or keys, a query in Python for Google App Engine?

I have a query and I can apply filters on them without any problem. This works fine:
query.filter('foo =', 'bar')
But what if I want to filter my query by key or a list of keys?
I have them as Key() property or as a string and by trying something like this, it didn't work:
query.filter('key =', 'some_key') #no success
query.filter('key IN', ['key1', 'key2']) #no success
Whilst it's possible to filter on key - see #dplouffe's answer - it's not a good idea. 'IN' clauses execute one query for each item in the clause, so you end up doing as many queries as there are keys, which is a particularly inefficient way to achieve your goal.
Instead, use a batch fetch operation, as #Luke documents, then filter any elements you don't want out of the list in your code.
You can filter queries by doing a GQL Query like this:
result = db.GqlQuery('select * from Model where __key__ IN :1', [db.Key.from_path('Model', 'Key1'), db.Key.from_path('Model', 'Key2')]).fetch(2)
or
result = Model.get([db.Key.from_path('Model', 'Key1'), db.Key.from_path('Model', 'Key2')])
You cannot filter on a Key. Oops, I was wrong about that. You can filter on a key and other properties at the same time if you have an index set up to handle it. It would look like this:
key = db.Key.from_path('MyModel', 'keyname')
MyModel.all().filter("__key__ =", key).filter('foo = ', 'bar')
You can also look up a number of models by their keys, key IDs, or key names with the get family of methods.
# if you have the key already, or can construct it from its path
models = MyModel.get(Key.from_path(...), ...)
# if you have keys with names
models = MyModel.get_by_key_name('asdf', 'xyz', ...)
# if you have keys with IDs
models = MyModel.get_by_id(123, 456, ...)
You can fetch many entities this way. I don't know the exact limit. If any of the keys doesn't exist, you'll get a None in the list for that entity.
If you need to filter on some property as well as the key, you'll have to do that in two steps. Either fetch by the keys and check for the property, or query on the property and validate the keys.
Here's an example of filtering after fetching. Note that you don't use the Query class's filter method. Instead just filter the list.
models = MyModel.get_by_key_name('asdf', ...)
filtered = itertools.ifilter(lambda x: x.foo == 'bar', models)
Have a look at: https://developers.google.com/appengine/docs/python/ndb/entities?hl=de#multiple
list_of_entities = ndb.get_multi(list_of_keys)
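For completeness, a sketch of building that key list with ndb (the 'Link' kind name is made up for illustration):
from google.appengine.ext import ndb

# ndb.Key(kind, id_or_name); 'Link' is a hypothetical kind
list_of_keys = [ndb.Key('Link', 'key1'), ndb.Key('Link', 'key2')]
list_of_entities = ndb.get_multi(list_of_keys)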
