meaning of distinct and how to use it in python(pymongo)

I don't understand what distinct means or how to use it. I have searched for related answers, but it seems that distinct is somehow related to lists. I'd really appreciate the help.
list_of_stocks = db.stocks.distinct("symbol")

As the OP confirmed, this is a PyMongo call to a MongoDB database, which allows for a distinct find, in the form of:
db.collection_name.distinct("property_name")
This returns all distinct values for a given property in a collection.
Optionally, if you specify a document filter (effectively a find() filter) as a second parameter, your query will first be reduced by that filter, then the distinct will be applied. For example:
list_of_stocks = db.stocks.distinct("symbol", {"exchange": "NASDAQ"})
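
To make the meaning concrete, here is a minimal plain-Python sketch of what distinct computes. It is an emulation for illustration only, not PyMongo's implementation: the sample documents and the helper name distinct_values are made up, the filter supports only simple equality, and real MongoDB does not promise any particular order of the returned values (it also unpacks array-valued fields, which this sketch ignores).

```python
def distinct_values(documents, key, query=None):
    """Emulate collection.distinct(key, query) on a list of dicts."""
    query = query or {}
    seen = []
    for doc in documents:
        # Apply the optional equality filter first, as find() would.
        if any(doc.get(field) != wanted for field, wanted in query.items()):
            continue
        # Documents without the field contribute nothing.
        if key not in doc:
            continue
        value = doc[key]
        if value not in seen:
            seen.append(value)
    return seen


# Hypothetical sample data standing in for db.stocks
stocks = [
    {"symbol": "AAPL", "exchange": "NASDAQ"},
    {"symbol": "AAPL", "exchange": "NASDAQ"},
    {"symbol": "IBM", "exchange": "NYSE"},
    {"symbol": "MSFT", "exchange": "NASDAQ"},
]

all_symbols = distinct_values(stocks, "symbol")
nasdaq_symbols = distinct_values(stocks, "symbol", {"exchange": "NASDAQ"})
print(all_symbols)      # each symbol appears once
print(nasdaq_symbols)   # only symbols from NASDAQ documents
```

So the result is a plain list with every duplicate collapsed, which is why distinct tends to show up in discussions about lists.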

distinct is a database concept: it returns a result set containing only the distinct (unique) values of a particular column.

Related

Finding document containing array of nested names in pymongo (CrossRef data)

I have a dataset of CrossRef works records stored in a collection called works in MongoDB and I am using a Python application to query this database.
I am trying to find documents based on one author's name. Removing extraneous details, a document might look like this:
{'DOI': 'some-doi',
 'author': [{'given': 'Albert', 'family': 'Einstein', 'affiliation': []},
            {'given': 'R.', 'family': 'Feynman', 'affiliation': []},
            {'given': 'Paul', 'family': 'Dirac', 'affiliation': ['University of Florida']}]
}
It isn't clear to me how to combine the queries to get just Albert Einstein's papers.
I have indexes on author.family and author.given. I've tried:
cur = works.find({'author.family':'Einstein','author.given':'Albert'})
This returns all of the documents by people called 'Albert' and all of those by people called 'Einstein'. I can filter this manually, but it's obviously less than ideal.
I also tried:
cur = works.find({'author':{'given':'Albert','family':'Einstein','affiliation':[]}})
But this returns nothing (after a very long delay). I've tried this with and without 'affiliation'. There are a few questions on SO about querying nested fields, but none seem to concern the case where we're looking for 2 specific things in 1 nested field.

Your issue is that author is a list.
You can use an aggregate query to unwind this list to objects, and then your query would work:
cur = works.aggregate([{'$unwind': '$author'},
                       {'$match': {'author.family': 'Einstein', 'author.given': 'Albert'}}])
Alternatively, use $elemMatch, which matches documents where at least one array element satisfies all of the specified conditions.
cur = works.find({"author": {'$elemMatch': {'family': 'Einstein', 'given': 'Albert'}}})
Also consider using multikey indexes.
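
The difference between the dotted-field query from the question and $elemMatch can be sketched in plain Python. This is only an emulation of the matching rules on in-memory dicts, not MongoDB code: the two helper functions and the sample records (including the author names in the first record) are hypothetical.

```python
def matches_dotted(doc, conditions):
    """Emulate {'author.family': ..., 'author.given': ...}: each condition
    may be satisfied by a different element of the author array."""
    return all(
        any(author.get(field) == wanted for author in doc["author"])
        for field, wanted in conditions.items()
    )


def matches_elem_match(doc, conditions):
    """Emulate {'author': {'$elemMatch': {...}}}: a single element must
    satisfy every condition."""
    return any(
        all(author.get(field) == wanted for field, wanted in conditions.items())
        for author in doc["author"]
    )


# Hypothetical record: "Albert" and "Einstein" belong to two different authors.
mixed = {
    "DOI": "doi-1",
    "author": [
        {"given": "Albert", "family": "Michelson"},
        {"given": "Hans", "family": "Einstein"},
    ],
}
einstein = {
    "DOI": "doi-2",
    "author": [{"given": "Albert", "family": "Einstein"}],
}
wanted = {"family": "Einstein", "given": "Albert"}

dotted_hits = [d["DOI"] for d in (mixed, einstein) if matches_dotted(d, wanted)]
elem_hits = [d["DOI"] for d in (mixed, einstein) if matches_elem_match(d, wanted)]
print(dotted_hits)  # both documents match
print(elem_hits)    # only the paper with a single author named Albert Einstein
```

This is why the first query in the question returns too many documents: the two conditions are checked against the array as a whole rather than against one author at a time.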

How to filter on calculated column of a query and meanwhile preserve mapped entities

I have a query which selects an entity A and some calculated fields
q = session.query(Recipe, func.avg(Recipe.somefield)).join(.....)
I then use what I select in a way which assumes I can subscript result with "Recipe" string:
for entry in q.all():
    recipe = entry.Recipe  # Access KeyedTuple by Recipe attribute
    ...
Now I need to wrap my query in an additional select, say to filter by calculated field AVG:
q = q.subquery()
q = session.query(q).filter(q.c.avg_1 > 1)
And now I cannot access entry.Recipe anymore!
Is there a way to make SQLAlchemy adapt a query to an enclosing one, like aliased(adapt_on_names=True) or select_from_entity()?
I tried using those but got an error.

As Michael Bayer mentioned in a relevant Google Group thread, this adaptation is already done by the Query.from_self() method. My problem was that I didn't know how to refer to the column I wanted to filter on.
This is because the column is calculated, i.e. there is no table to refer to!
I could resort to literals (.filter('avg_1>10')), but I'd prefer to stay in the more ORM-style.
So this is what I came up with: an explicit column expression
row_number_column = func.row_number().over(
    partition_by=Recipe.id
).label('row_number')
query = query.add_column(
    row_number_column
)
query = query.from_self().filter(row_number_column == 1)

Django: distinct on a foreign key, then ordering

I have two models, Track and Pair. Each Pair has a track1, track2 and popularity. I'm trying to get an ordered list by popularity (descending) of pairs, with no two pairs having the same track1. Here's what I've tried so far:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me the following error:
ProgrammingError: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
...so I tried this:
lstPairs = Pair.objects.order_by('-popularity','track1__id').distinct('popularity', 'track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
This gave me entries with duplicate track1__ids. Does anyone know of a way of solving this problem? I'm guessing I'll have to use raw() or something similar but I don't know how I'd approach a problem like this. I'm using PostgreSQL for the DB backend so DISTINCT should be supported.

First, let's clarify: DISTINCT is standard SQL, while DISTINCT ON is a PostgreSQL extension.
The error (DISTINCT ON expressions must match initial ORDER BY expressions) indicates that you should fix your ORDER BY clause, not the DISTINCT ON (if you change the latter, you'll end up with different results, as you already experienced).
The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s). The ORDER BY clause will normally contain additional expression(s) that determine the desired precedence of rows within each DISTINCT ON group.
This will give you your expected results:
lstPairs = Pair.objects.order_by('track1__id','-popularity').distinct('track1__id')[:iNumPairs].values_list('track1__id', 'track2__id', 'popularity')
In SQL:
SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
FROM pairs
ORDER BY track1__id, popularity DESC
But the rows will probably be in the wrong order, sorted by track1__id rather than by popularity.
If you want your original order, you can use a sub-query here:
SELECT *
FROM (
    SELECT DISTINCT ON (track1__id) track1__id, track2__id, popularity
    FROM pairs
    ORDER BY track1__id, popularity DESC
    -- LIMIT here, if necessary
) AS best_pairs
ORDER BY popularity DESC, track1__id

See the documentation on distinct.
First:
On PostgreSQL only, you can pass positional arguments (*fields) in order to specify the names of fields to which the DISTINCT should apply.
You don't specify what your database backend is; if it is not PostgreSQL, you have no chance of making it work.
Second:
When you specify field names, you must provide an order_by() in the QuerySet, and the fields in order_by() must start with the fields in distinct(), in the same order.
I think that you should use raw(), or get the entire list of Pairs ordered by popularity and then make the filtering by track1 uniqueness in Python.
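
The "filter in Python" option could look roughly like the sketch below. It is only an illustration under assumptions: the helper name top_pairs_unique_track1 is made up, and the tuples stand in for (track1__id, track2__id, popularity) rows such as those produced by values_list(); no Django is involved here.

```python
def top_pairs_unique_track1(pairs, limit):
    """Keep the most popular pair per track1, most popular first."""
    seen = set()
    result = []
    # Sort by popularity descending, then by track1 id to break ties.
    for pair in sorted(pairs, key=lambda p: (-p[2], p[0])):
        if len(result) >= limit:
            break
        track1_id = pair[0]
        if track1_id in seen:
            # A more popular pair with this track1 was already kept.
            continue
        seen.add(track1_id)
        result.append(pair)
    return result


# Hypothetical rows: (track1 id, track2 id, popularity)
sample_pairs = [(1, 2, 10), (1, 3, 50), (2, 3, 40), (3, 1, 5), (2, 1, 45)]
best = top_pairs_unique_track1(sample_pairs, limit=2)
print(best)
```

The trade-off is that every pair has to be fetched before filtering, so this only makes sense when the table is small enough to load into memory.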

Matching all records in a datastore query

Is there a way to substitute:
def get_objects(attr1, attr2, ..):
    objects = Entities.all()
    if attr1 is not None:
        objects.filter('attr1', attr1)
    if attr2 is not None:
        objects.filter('attr2', attr2)
    ....
    return objects
With a single query:
Entities.all().filter('attr1',attr1).filter('attr2',attr2)
By using some sort of 'match all' sign (maybe a regexp query)?
The problem with the first query is that (apart from being ugly) it creates indexes for all possible filter sequences.

The datastore doesn't support regex queries or OR queries.
However, if you're only using equality filters, indexes shouldn't be automatically created; these types of queries can be served using a merge-join strategy as long as the number of filters remains low (if you try to add too many filters, you'll get an error indicating that the existing indexes can't be used to execute the query efficiently; however, trying to add the required indexes in a case like this will usually result in the exploding indexes problem.)
The ugliness in the first approach can probably be solved by passing a list to your function instead of individual variables, then using a list comprehension instead of a bunch of if statements.
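
One way to tidy up the if statements, sketched here with keyword arguments and a loop rather than a literal list comprehension. The helper apply_filters and the FakeQuery class are hypothetical; FakeQuery only records the filters it receives so that the example runs without the datastore. It assumes, as the chained call in the question suggests, that filter() returns the query object.

```python
def apply_filters(query, **criteria):
    """Apply an equality filter for every criterion that is not None."""
    for name, value in criteria.items():
        if value is not None:
            query = query.filter(name, value)
    return query


class FakeQuery:
    """Hypothetical stand-in for a datastore query; records filters only."""

    def __init__(self):
        self.filters = []

    def filter(self, name, value):
        self.filters.append((name, value))
        return self  # allow chaining, like the query in the question


q = apply_filters(FakeQuery(), attr1="x", attr2=None, attr3=3)
print(q.filters)  # attr2 is skipped because its value is None
```

With the real datastore, the first argument would be Entities.all() instead of FakeQuery().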

Django models - how to filter out duplicate values by PK after the fact?

I build a list of Django model objects by making several queries. Then I want to remove any duplicates, (all of these objects are of the same type with an auto_increment int PK), but I can't use set() because they aren't hashable.
Is there a quick and easy way to do this? I'm considering using a dict instead of a list with the id as the key.

In general it's better to combine all your queries into a single query if possible, i.e.
q = Model.objects.filter(Q(field1=f1)|Q(field2=f2))
instead of
q1 = Model.objects.filter(field1=f1)
q2 = Model.objects.filter(field2=f2)
If the first query is returning duplicated Models then use distinct()
q = Model.objects.filter(Q(field1=f1)|Q(field2=f2)).distinct()
If your query really is impossible to execute with a single command, then you'll have to resort to using a dict or other technique recommended in the other answers. It might be helpful if you posted the exact query on SO and we could see if it would be possible to combine into a single query. In my experience, most queries can be done with a single queryset.

Is there a quick and easy way to do this? I'm considering using a dict instead of a list with the id as the key.
That's exactly what I would do if you were locked into your current structure of making several queries. Then a simple dictionary.values() will return your list back.
If you have a little more flexibility, why not use Q objects? Instead of actually making the queries, store each query in a Q object and use a bitwise or ("|") to execute a single query. This will achieve your goal and save database hits.
Django Q objects

You can use a set if you add the __hash__ function to your model definition so that it returns the id (assuming this doesn't interfere with other hash behaviour you may have in your app):
class MyModel(models.Model):
    def __hash__(self):
        return self.pk

If the order doesn't matter, use a dict.

Removing "duplicates" depends on how you define "duplicate".
If you want EVERY column (except the PK) to match, that's a pain in the neck: it's a lot of comparing.
If, on the other hand, you have some "natural key" column (or short set of columns), then you can easily query and remove these.
master = MyModel.objects.get(id=theMasterKey)
# Exclude the master row itself so that only the extra copies are deleted
dups = MyModel.objects.filter(fld1=master.fld1, fld2=master.fld2).exclude(id=master.id)
dups.delete()
If you can identify some shorter set of key fields for duplicate identification, this works pretty well.
Edit
If the model objects haven't been saved to the database yet, you can make a dictionary on a tuple of these keys.
unique = {}
...
key = (anObject.fld1, anObject.fld2)
if key not in unique:
    unique[key] = anObject

I use this one:
dict(zip(map(lambda x: x.pk,items),items)).values()
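
The dict-based approach suggested in several answers can be sketched as follows. The Item class is a hypothetical stand-in for a model instance with a pk attribute, so the example runs without Django. Unlike the one-liner above, which keeps the last object seen for each pk, this version keeps the first one; on Python 3.7+ the dict also preserves the original order.

```python
class Item:
    """Hypothetical stand-in for a Django model instance with a pk."""

    def __init__(self, pk, name):
        self.pk = pk
        self.name = name


def unique_by_pk(items):
    """Return items with duplicate primary keys removed, first one wins."""
    unique = {}
    for item in items:
        # setdefault only stores the item if this pk has not been seen yet.
        unique.setdefault(item.pk, item)
    return list(unique.values())


items = [Item(1, "a"), Item(2, "b"), Item(1, "a again"), Item(3, "c")]
deduped = unique_by_pk(items)
print([(i.pk, i.name) for i in deduped])
```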
