Peewee dynamically generate where clause to check for duplicates - python

I am currently using peewee as an ORM in a python project.
I am trying to determine, given a set of objects, whether any of them already exist in the database. For objects whose uniqueness is based on a single column this is simple -- I can generate a list of the keys and do a select against the database:
Model.select().where(Model.column << ids)
However, in some cases the uniqueness is determined by two columns. (Note that I don't have the primary key in hand at this moment, which is why I can't just rely on id.)
I tried to generalize the logic so that a list of all the column names that determine uniqueness can be passed in. Here is my code:
clauses = []
for obj in db_objects:
    # _get_unique_key returns a tuple of all the column values
    # that determine uniqueness for this object.
    uniq_key = self._get_unique_key(obj)
    subclause = [getattr(self.model_class, uniq_column) == value
                 for uniq_column, value in zip(self.uniq_columns, uniq_key)]
    clauses.append(reduce(operator.and_, subclause))
dups = self.model_class.select().where(reduce(operator.or_, clauses)).execute()
Note that self.uniq_columns contains the names of all the columns that together determine uniqueness, and _get_unique_key returns a tuple of those column values.
When I run this I get an error that the maximum recursion depth has been exceeded. I suppose this is due to how peewee resolves expressions. One way around it might be to break my clauses up into batches (i.e. build the clause for 100 objects at a time, issue the query, and repeat until all the objects have been processed).
Wanted to see if there was a better way instead.
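For reference, a minimal sketch of the batching workaround described above (the find_existing name and the batch size of 100 are assumptions; the clause-building code is the same as in the question):

import operator
from functools import reduce

BATCH_SIZE = 100  # assumed batch size; tune to taste

def find_existing(self, db_objects):
    """Collect rows whose unique-key columns collide with any of db_objects."""
    dups = []
    for start in range(0, len(db_objects), BATCH_SIZE):
        clauses = []
        for obj in db_objects[start:start + BATCH_SIZE]:
            uniq_key = self._get_unique_key(obj)
            subclause = [getattr(self.model_class, col) == value
                         for col, value in zip(self.uniq_columns, uniq_key)]
            clauses.append(reduce(operator.and_, subclause))
        # one OR-of-ANDs query per batch keeps each expression tree small
        dups.extend(self.model_class.select().where(reduce(operator.or_, clauses)))
    return dups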

Related

Checking whether specified multiple keys exist in Datastore table without fetching the entity

I have, let's say, 1000 key names whose existence I want to check in the Google App Engine datastore, without fetching the entities themselves. One of the reasons, besides possible speedups, is that keys-only fetching is free (no cost).
ndb.get_multi() allows me to pass in the list of keys, but it will retrieve the entities. I need a function that does just that but without fetching the entities, returning just True or False based on whether the specified keys exist.
I'd probably use a keys-only query...:
q = EntityKind.query(EntityKind.key.IN(wanted_keys))
keys_present = set(q.iter(keys_only=True))
That gives you keys_present as a set of those keys in wanted_keys that are actually present in the datastore. Not quite the same as your desired mapping from key to bool, but the latter can easily be built:
key_there = {k: (k in keys_present) for k in wanted_keys}
...should you actually want it (a dict with bool values is usually just a less-wieldy stand-in for a set!-).
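If you start from bare key names rather than Key objects, the wanted_keys list can be built up front; a small sketch (the key_names variable is an assumption, EntityKind as above):

from google.appengine.ext import ndb

# turn raw key names into full Key objects (numeric ids would work the same way)
wanted_keys = [ndb.Key(EntityKind, name) for name in key_names]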

Django: distinct on a foreign key, then ordering

I have two models, Track and Pair. Each Pair has a track1, track2 and popularity. I'm trying to get an ordered list by popularity (descending) of pairs, with no two pairs having the same track1. Here's what I've tried so far:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me the following error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
...so I tried this:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('popularity', 'track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me entries with duplicate track1__ids. Does anyone know of a way of solving this problem? I'm guessing I'll have to use raw() or something similar but I don't know how I'd approach a problem like this. I'm using PostgreSQL for the DB backend so DISTINCT should be supported.
First, let's clarify: DISTINCT is standard SQL, while DISTINCT ON is a PostgreSQL extension.
The error (DISTINCT ON expressions must match initial ORDER BY expressions) indicates that you should fix your ORDER BY clause, not the DISTINCT ON (if you change the DISTINCT ON instead, you end up with different results, as you have already experienced).
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
This will give you your expected results:
lstPairs = Pair.objects.order_by('track1__id','-popularity').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
In SQL:
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
But probably in the wrong order.
If you want your original order, you can use a sub-query here:
SELECT *
FROM (
    SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
    FROM pairs
    ORDER BY track1__id, popularity DESC
    -- LIMIT here, if necessary
) AS sub
ORDER BY popularity DESC, track1__id
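If you would rather stay in the ORM, a rough equivalent is to run the DISTINCT ON query as in the snippet above and restore the popularity ordering in Python (same names as in the question; the sorted() call is the only addition):

deduped = (Pair.objects
           .order_by('track1__id', '-popularity')
           .distinct('track1__id')[:iNumPairs]
           .values_list('track1__id', 'track2__id', 'popularity'))

# re-sort the limited result set by popularity, descending
lstPairs = sorted(deduped, key=lambda row: row[2], reverse=True)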
See the documentation on distinct.
First:
On PostgreSQL only, you can pass positional arguments (*fields) in order to specify the names of fields to which the DISTINCT should apply.
You don't specify your database backend; if it is not PostgreSQL, you have no chance of making it work.
Second:
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
I think that you should use raw(), or get the entire list of Pairs ordered by popularity and then filter for track1 uniqueness in Python (a sketch of that follows).
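A rough sketch of that second option, filtering for track1 uniqueness in Python (assumes track1 is a ForeignKey so track1_id is available without extra queries; iNumPairs as in the question):

seen = set()
lstPairs = []
for pair in Pair.objects.order_by('-popularity'):
    # keep only the first (most popular) pair for each track1
    if pair.track1_id not in seen:
        seen.add(pair.track1_id)
        lstPairs.append((pair.track1_id, pair.track2_id, pair.popularity))
    if len(lstPairs) == iNumPairs:
        break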

Multiple linked list with SQLAlchemy and MySQL

I want to have multiple linked lists in a SQL table, using MySQL and SQLAlchemy (0.7). Each list starts with a node whose parent is 0 and ends with a node whose child is 0. The id represents the list, not the individual element; an element is identified by the full primary key.
With some omitted syntax (not relevant to the problem) it should look something like this:
id(INT, PK)
content (TEXT)
parent(INT, FK(id), PK)
child(INT, FK(id), PK)
As the table holds multiple linked lists, how can I return an entire list from the database when I select a specific id where parent is 0?
For example:
SELECT * FROM ... WHERE id = 3 AND parent = 0
Given that you have multiple linked lists stored in the same table, I assume that you store the HEAD and/or the TAIL of each in some other tables. A few ideas:
1) Keep the linked list:
The first big improvement (also proposed in the comments) from the data-querying perspective would be to have some common identifier (let's call it ListID) for all the nodes in the same list. Here there are a few options:
If each list is referenced only from one object (data row) [I would even phrase the question as "Does the list belong to a single object?"], then this ListID could simply be the (primary) identifier of the holder object, with a ForeignKey on top to ensure data integrity.
In this case, querying the whole list is very simple. In fact, you can define the relationship and navigate it like my_object.my_list_items.
If the list is used/referenced by multiple objects, then one could create another table consisting only of one column, ListID (PK), and each Node/Item would again have a ForeignKey to it, or something similar.
Else, large lists can be loaded in two queries/SQL statements:
query the HEAD/TAIL by its ID
query the whole list based on received ListID of the HEAD/TAIL
In fact, this can be done with one query like the one below (Single-query example), which is more efficient from the IO perspective, but doing it in two steps has the advantage that you immediately have a reference to the HEAD (or TAIL) node.
Single-query example:
# single-query version using a join (not tested)
from sqlalchemy.orm import aliased

Head = aliased(Node)
qry = session.query(Node).join(Head, Node.ListID == Head.ListID).filter(Head.ID == head_node_id)
In any case, in order to traverse the linked list, you would have to get the HEAD/TAIL by its ID, then traverse as usual.
Note: Here I am not certain whether SA would recognize that the referenced objects are already loaded into the session, or would issue another SQL statement for each of them, which would defeat the purpose of bulk loading.
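As an illustration, a sketch of the bulk-load-then-traverse approach in plain Python, which sidesteps the per-node query concern (parent/child/id come from the question's schema, ListID is the column proposed above, list_id is an assumed variable; it also assumes real node ids are never 0, so the child == 0 sentinel simply falls off the lookup dict):

# load every node of one list in a single query
nodes = session.query(Node).filter(Node.ListID == list_id).all()
by_id = dict((n.id, n) for n in nodes)

# walk from the head (parent == 0) following the child pointers
current = next(n for n in nodes if n.parent == 0)
ordered = []
while current is not None:
    ordered.append(current)
    current = by_id.get(current.child)  # child == 0 -> not in by_id -> None -> stop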
2) Replace the linked list with the Ordering List extension:
Please read the Ordering List documentation. It may well be that the Ordering List implementation is good enough for you to use instead of the linked list.
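A minimal sketch of what that could look like (the TodoList/TodoItem model names and the position column are illustrative assumptions; ordering_list keeps the position column in sync as items are appended, inserted, or reordered):

from sqlalchemy import Column, ForeignKey, Integer, Text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.ext.orderinglist import ordering_list
from sqlalchemy.orm import relationship

Base = declarative_base()

class TodoList(Base):                      # the "holder" object owning one list
    __tablename__ = 'todo_list'
    id = Column(Integer, primary_key=True)
    items = relationship('TodoItem',
                         order_by='TodoItem.position',
                         collection_class=ordering_list('position'))

class TodoItem(Base):
    __tablename__ = 'todo_item'
    id = Column(Integer, primary_key=True)
    list_id = Column(Integer, ForeignKey('todo_list.id'))
    content = Column(Text)
    position = Column(Integer)             # maintained automatically by ordering_list

# usage: appending keeps 'position' consecutive without manual bookkeeping
# todo = TodoList()
# todo.items.append(TodoItem(content='first'))
# todo.items.append(TodoItem(content='second'))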

Which one is more efficient?

I have a Python program for deleting duplicates from a list of names.
But I'm in a dilemma about which of the two approaches is the more efficient.
I have uploaded a list of names to a SQLite DB, into a column in a table.
Is it better to compare the names and delete the duplicates in the DB, or to load them into Python, delete the duplicates there, and push the de-duplicated list back to the DB?
I'm confused and here is a piece of code to do it on SQLite:
dup_killer (member_id, date) SELECT * FROM talks GROUP BY member_id,
If you use the names as a key in the database, the database will make sure they are not duplicated. So there would be no reason to ship the list to Python and de-dup there.
If you haven't inserted the names into the database yet, you might as well de-dup them in Python first. It is probably faster to do it in Python using the built-in features than to incur the overhead of repeated attempts to insert to the database.
(By the way: you can really speed up the insertion of many names if you wrap all the inserts in a single transaction. Start a transaction, insert all the names, and finish the transaction. The database does some work to make sure that the database is consistent, and it's much more efficient to do that work once for a whole list of names, rather than doing it once per name.)
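For example, a minimal sketch of batching the inserts into one transaction with the sqlite3 module (the names.db file and the names table/column are assumptions; INSERT OR IGNORE also lets the unique key silently drop duplicates):

import sqlite3

conn = sqlite3.connect('names.db')

# the connection context manager wraps everything below in a single transaction
with conn:
    conn.execute('CREATE TABLE IF NOT EXISTS names (name TEXT PRIMARY KEY)')
    conn.executemany('INSERT OR IGNORE INTO names (name) VALUES (?)',
                     ((n,) for n in list_of_all_names))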
If you have the list in Python, you can de-dup it very quickly using built-in features. The two common features that are useful for de-duping are the set and the dict.
I have given you three examples. The simplest case is where you have a list that just contains names, and you want to get a list with just unique names; you can just put the list into a set. The second case is that your list contains records and you need to extract the name part to build the set. The third case shows how to build a dict that maps a name onto a record, then inserts the record into a database; like a set, a dict will only allow unique values to be used as keys. When the dict is built, it will keep the last value from the list with the same name.
# list already contains names
unique_names = set(list_of_all_names)
unique_list = list(unique_names)  # unique_list now contains only unique names

# extract the name field from each record and make a set
unique_names = set(x.name for x in list_of_all_records)
unique_list = list(unique_names)  # unique_list now contains only unique names
# make a dict mapping name to the complete record
d = dict((x.name, x) for x in list_of_all_records)

# insert complete records into the database using name as the key
for name in d:
    insert_into_database(d[name])

Matching all records in a datastore query

Is there a way to substitute:
def get_objects(attr1, attr2, ...):
    objects = Entities.all()
    if attr1 != None:
        objects.filter('attr1', attr1)
    if attr2 != None:
        objects.filter('attr2', attr2)
    ....
    return objects
With a single query:
Entities.all().filter('attr1',attr1).filter('attr2',attr2)
By using some sort of 'match all' value (maybe a regexp query)?
The problem with the first query is that (apart from being ugly) it creates indexes for all possible filter sequences.
The datastore doesn't support regex queries or OR queries.
However, if you're only using equality filters, indexes shouldn't be automatically created; these types of queries can be served using a merge-join strategy as long as the number of filters remains low (if you try to add too many filters, you'll get an error indicating that the existing indexes can't be used to execute the query efficiently; however, trying to add the required indexes in a case like this will usually result in the exploding indexes problem.)
The ugliness in the first approach can probably be solved by passing a list to your function instead of individual variables, then using a list comprehension instead of a bunch of if statements.
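A minimal sketch of that idea, reusing the filter style from the question (the list-of-pairs argument format is an assumption):

def get_objects(filters):
    # filters: list of (property_name, value) pairs; a None value means "no filter"
    query = Entities.all()
    for name, value in [(n, v) for n, v in filters if v is not None]:
        query.filter(name, value)
    return query

# e.g. get_objects([('attr1', 'foo'), ('attr2', None)]) filters only on attr1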
