Matching all records in a datastore query - Python

Is there a way to substitute:
def get_objects(attr1, attr2, ...):
    objects = Entities.all()
    if attr1 is not None:
        objects.filter('attr1 =', attr1)
    if attr2 is not None:
        objects.filter('attr2 =', attr2)
    ...
    return objects
with a single query:
Entities.all().filter('attr1 =', attr1).filter('attr2 =', attr2)
by using some sort of 'match all' value (maybe a regexp query)?
The problem with the first approach is that (apart from being ugly) it creates indexes for all possible filter sequences.

The datastore doesn't support regex queries or OR queries.
However, if you're only using equality filters, indexes shouldn't be created automatically: such queries can be served using a merge-join strategy as long as the number of filters stays low. If you add too many filters, you'll get an error indicating that the existing indexes can't be used to execute the query efficiently; trying to add the required indexes in a case like this will usually run into the exploding-indexes problem.
The ugliness of the first approach can be solved by passing a list or dict of filters to your function instead of individual variables, then using a loop or comprehension instead of a chain of if statements.
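A minimal sketch of that approach, assuming the old google.appengine.ext.db API and a hypothetical filters dict mapping property names to values (None values are skipped):

from google.appengine.ext import db

class Entities(db.Model):
    attr1 = db.StringProperty()
    attr2 = db.StringProperty()

def get_objects(filters):
    # Build a single query from a dict of property -> value,
    # applying only equality filters for the non-None values.
    query = Entities.all()
    for prop, value in filters.items():
        if value is not None:
            query.filter('%s =' % prop, value)
    return query

# Only attr1 produces a filter here; attr2 is skipped.
results = get_objects({'attr1': 'foo', 'attr2': None}).fetch(100)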

Related

Meaning of distinct and how to use it in Python (PyMongo)

I don't understand what distinct means or how to use it. I have searched for related answers, but it seems like distinct is somehow related to lists. I'd really appreciate the help.
list_of_stocks = db.stocks.distinct("symbol")
As the OP confirmed, this is a PyMongo call to a MongoDB database, which allows for a distinct find, in the form of:
db.collection_name.distinct("property_name")
This returns all distinct values for a given property in a collection.
Optionally, if you specify a document filter (effectively a find() filter) as a second parameter, your query will first be reduced by that filter, then the distinct will be applied. For example:
list_of_stocks = db.stocks.distinct("symbol", {"exchange": "NASDAQ"})
The distinct keyword is used in databases to return a record set containing only the distinct values for a particular column.
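A self-contained sketch, assuming a locally running MongoDB server and a hypothetical stocks collection in a database named test_database:

from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumes a local MongoDB server
db = client.test_database                 # hypothetical database name

# All distinct values of "symbol" across the collection.
symbols = db.stocks.distinct("symbol")

# Distinct symbols, computed only over documents matching the filter.
nasdaq_symbols = db.stocks.distinct("symbol", {"exchange": "NASDAQ"})

print(symbols)
print(nasdaq_symbols)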

Peewee dynamically generate where clause to check for duplicates

I am currently using peewee as an ORM in a Python project.
I am trying to determine, given a set of objects, whether any of them already exist in the database. For objects whose uniqueness is based on a single key, that is simple -- I can generate a list of the keys and do a select against the database:
Model.select().where(Model.column << ids)
However, in some cases the uniqueness is determined by two columns. (Note that I don't have the primary key in hand at this moment, which is why I can't just rely on id.)
I tried to generalize the logic so that a list of all the column names that determine uniqueness can be passed in. Here is my code:
import operator
from functools import reduce

clauses = []
for obj in db_objects:
    # _get_unique_key returns a tuple of all the column values
    # that determine uniqueness for this object.
    uniq_key = self._get_unique_key(obj)
    subclause = [getattr(self.model_class, uniq_column) == value
                 for uniq_column, value in zip(self.uniq_columns, uniq_key)]
    clauses.append(reduce(operator.and_, subclause))

dups = self.model_class.select().where(reduce(operator.or_, clauses)).execute()
Note that self.uniq_columns contains the names of all the columns that together determine uniqueness, and _get_unique_key returns a tuple of those columns' values.
When I run this I get an error that the maximum recursion depth has been exceeded, which I suppose is due to how peewee resolves expressions. One way around it might be to break the clauses up into batches (i.e. build a clause for every 100 objects, issue the query, and repeat until all the objects have been processed).
I wanted to see if there is a better way instead.
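A minimal sketch of the batching workaround described above, reusing the clauses list from the code and an arbitrary batch size of 100:

import operator
from functools import reduce

def find_duplicates(model_class, clauses, batch_size=100):
    # Issue the OR-of-ANDs query in batches, so no single expression
    # tree grows deep enough to hit the recursion limit.
    dups = []
    for i in range(0, len(clauses), batch_size):
        batch = clauses[i:i + batch_size]
        dups.extend(model_class.select().where(reduce(operator.or_, batch)))
    return dups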

How to filter on a calculated column of a query while preserving mapped entities

I have a query which selects an entity and some calculated fields:
q = session.query(Recipe, func.avg(Recipe.somefield)).join(...)
I then use what I select in a way that assumes I can access each result row by its Recipe attribute:
for entry in q.all():
    recipe = entry.Recipe  # access the KeyedTuple by its Recipe attribute
    ...
Now I need to wrap my query in an additional select, say to filter by calculated field AVG:
q = q.subquery()
q = session.query(q).filter(q.c.avg_1 > 1)
And now I cannot access entry.Recipe anymore!
Is there a way to make SQLAlchemy adapt the query to an enclosing one, like aliased(adapt_on_names=True) or select_entity_from()? I tried using those but got an error.
As Michael Bayer mentioned in a relevant Google Groups thread, such adaptation is already done by the Query.from_self() method. My problem was that I didn't know how to refer to the column I want to filter on, because it is calculated, i.e. there is no table column to refer to!
I might resort to using literals (.filter('avg_1 > 10')), but I'd prefer to stay in ORM style.
So, this is what I came up with - an explicit column expression
row_number_column = func.row_number().over(
    partition_by=Recipe.id
).label('row_number')

query = query.add_column(row_number_column)
query = query.from_self().filter(row_number_column == 1)
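The same pattern applied to the original avg column would look roughly like this (a sketch in pre-1.4 Query style, assuming the Recipe model and session from the question):

from sqlalchemy import func

# Label the calculated column so it can be referenced after wrapping.
avg_column = func.avg(Recipe.somefield).label('avg_somefield')

q = session.query(Recipe, avg_column).group_by(Recipe.id)
# from_self() wraps the query in a subquery and adapts both Recipe and
# the labelled column, so entry.Recipe keeps working after the filter.
q = q.from_self().filter(avg_column > 1)

for entry in q.all():
    recipe = entry.Recipe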

GAE ndb filter by key(id)

I want to filter by Key value (if I can, by id).
The Developers Console offers searching data by key value.
I want to do the same in my code:
DataModel.query(DataModel.key > ndb.Key('DataModel', id_value)).order(
    DataModel.date,
    DataModel.times).fetch(2000)
This raises an error.
My id_value is an integer.
How can I search and filter to get data that have higher id than id_value?
Filtering by inequality on the key is fine; what's wrong is that you cannot combine an inequality filter on one property with ordering by a different property. To quote https://cloud.google.com/appengine/docs/python/ndb/queries under Limitations:
combining too many filters, using inequalities for multiple
properties, or combining an inequality with a sort order on a
different property are all currently disallowed.
The last one of these three limitations is what you're running into.
One option is to fetch all the results from the query and then sort them in your program (instead of in the .order() clause).
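A minimal sketch of that first option, assuming the DataModel kind and id_value from the question:

from google.appengine.ext import ndb

results = DataModel.query(
    DataModel.key > ndb.Key('DataModel', id_value)).fetch(2000)
# Sort in memory instead of in the query, so the inequality filter
# and the ordering no longer conflict.
results.sort(key=lambda e: (e.date, e.times))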
A second option is to query on all three fields to reduce the number of results, and then sort them in memory as above:
DataModel.query(DataModel.key > ndb.Key('DataModel', id_value),
                DataModel.date > some_date,
                DataModel.times > some_times)
A third option is to use MapReduce to process large quantities of data.

Django models - how to filter out duplicate values by PK after the fact?

I build a list of Django model objects by making several queries. Then I want to remove any duplicates (all of these objects are of the same type, with an auto_increment int PK), but I can't use set() because they aren't hashable.
Is there a quick and easy way to do this? I'm considering using a dict instead of a list with the id as the key.
In general it's better to combine all your queries into a single query if possible, e.g.:
q = Model.objects.filter(Q(field1=f1)|Q(field2=f2))
instead of
q1 = Model.objects.filter(field1=f1)
q2 = Model.objects.filter(field2=f2)
If the combined query returns duplicated models, then use distinct():
q = Model.objects.filter(Q(field1=f1)|Q(field2=f2)).distinct()
If your query really is impossible to execute with a single command, then you'll have to resort to using a dict or other technique recommended in the other answers. It might be helpful if you posted the exact query on SO and we could see if it would be possible to combine into a single query. In my experience, most queries can be done with a single queryset.
Is there a quick and easy way to do this? I'm considering using a dict instead of a list with the id as the key.
That's exactly what I would do if you were locked into your current structure of making several queries. A simple dictionary.values() call will then give you your list back.
If you have a little more flexibility, why not use Q objects? Instead of actually making the queries, store each condition in a Q object and combine them with a bitwise OR ("|") to execute a single query. This will achieve your goal and save database hits.
Django Q objects
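A minimal sketch of that approach, reusing the field1/field2 and f1/f2 names from the earlier answer:

import operator
from functools import reduce
from django.db.models import Q

# Each (field, value) pair would otherwise be a separate query.
conditions = [('field1', f1), ('field2', f2)]

# OR the Q objects together so the database is hit only once.
combined = reduce(operator.or_, (Q(**{field: value}) for field, value in conditions))
results = Model.objects.filter(combined)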
You can use a set if you add the __hash__ function to your model definition so that it returns the id (assuming this doesn't interfere with other hash behaviour you may have in your app):
class MyModel(models.Model):
    def __hash__(self):
        return self.pk
If the order doesn't matter, use a dict.
Remove "duplicates" depends on how you define "duplicated".
If you want EVERY column (except the PK) to match, that's a pain in the neck -- it's a lot of comparing.
If, on the other hand, you have some "natural key" column (or a short set of columns), then you can easily query for and remove them:
master = MyModel.objects.get(id=theMasterKey)
# Exclude the master row itself so only the duplicates are deleted.
dups = MyModel.objects.filter(fld1=master.fld1, fld2=master.fld2).exclude(pk=master.pk)
dups.delete()
If you can identify some shorter set of key fields for duplicate identification, this works pretty well.
Edit
If the model objects haven't been saved to the database yet, you can make a dictionary on a tuple of these keys.
unique = {}
...
key = (anObject.fld1, anObject.fld2)
if key not in unique:
    unique[key] = anObject
I use this one:
dict(zip(map(lambda x: x.pk, items), items)).values()
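An equivalent, arguably more readable dict comprehension (assuming items is the combined list of model instances; later occurrences of a pk overwrite earlier ones, exactly as in the one-liner above):

unique_items = list({item.pk: item for item in items}.values())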
