Embedded document corresponding to the maximum value in pymongo - python

I have the following schema for documents in my collection. Each document lists all the submissions made under one name.
- "_id": ObjectId
- "name": str
- "is_team": bool
- "submissions": List
- time: datetime
- score: float
Example:
{"name": "Intrinsic Nubs",
"is_team": true,
"submissions": [
{
"score": 61.77466359705439,
"time": {
"$date": {
"$numberLong": "1656009267652"
}
}
},
{
"score": 81.77466359705439,
"time": {
"$date": {
"$numberLong": "1656009267680"
}
}
}]}
I need to collect all documents whose is_team is true and, for each one, get the name, the maximum score, and the time corresponding to that maximum score.
Example:
[{"name": "Intrinsic Nubs", "MaxScore": 81.77466359705439, "time":{ "$date": {"$numberLong": "1656009267680"}}}]

Here's another way to produce your desired output.
db.collection.aggregate([
{ // limit docs
"$match": {"is_team": true}
},
{ // set MaxScore
"$set": {"MaxScore": {"$max": "$submissions.score"}}
},
{ "$project": {
"_id": 0,
"name": 1,
"MaxScore": 1,
"time": {
// get time at MaxScore
"$arrayElemAt": [
"$submissions.time",
{"$indexOfArray": ["$submissions.score", "$MaxScore"]}
]
}
}
}
])
Try it on mongoplayground.net.
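From Python, the same pipeline can be passed to pymongo as a plain list of dicts. The sketch below is illustrative only: the helper names max_score_pipeline and max_score_locally are made up for this example, and the second function merely mimics the three stages in plain Python so the logic can be checked without a server. Skipping documents with no submissions is an assumption of the sketch, not something the pipeline does.

```python
def max_score_pipeline():
    """Return the aggregation pipeline above as Python data for pymongo."""
    return [
        {"$match": {"is_team": True}},
        {"$set": {"MaxScore": {"$max": "$submissions.score"}}},
        {"$project": {
            "_id": 0,
            "name": 1,
            "MaxScore": 1,
            "time": {
                "$arrayElemAt": [
                    "$submissions.time",
                    {"$indexOfArray": ["$submissions.score", "$MaxScore"]},
                ]
            },
        }},
    ]


def max_score_locally(docs):
    """Plain-Python mirror of the pipeline, for illustration only."""
    results = []
    for doc in docs:
        # Assumption: documents without submissions are skipped here.
        if doc.get("is_team") is not True or not doc.get("submissions"):
            continue
        scores = [s["score"] for s in doc["submissions"]]
        best = max(scores)
        index = scores.index(best)  # first occurrence, like $indexOfArray
        results.append({
            "name": doc["name"],
            "MaxScore": best,
            "time": doc["submissions"][index]["time"],
        })
    return results
```

With a live connection, something like list(collection.aggregate(max_score_pipeline())) would run it, where collection stands for your own pymongo collection object.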

Query
keep documents with is_team=true
$reduce over submissions to find the member with the highest score, and return it
you can add a $project afterwards; I kept all fields so you can see the change
Playmongo
aggregate(
[{"$match": {"is_team": {"$eq": true}}},
{"$set":
{"name": "$name",
"max-submision":
{"$reduce":
{"input": "$submissions",
"initialValue": {"score": 0},
"in":
{"$cond":
[{"$gt": ["$$this.score", "$$value.score"]}, "$$this",
"$$value"]}}}}}])
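The $reduce stage behaves like a fold over the submissions array. A plain-Python sketch of the same fold (the function name max_submission is hypothetical) makes one consequence of the initial value visible: because the fold starts from {"score": 0} and uses a strict greater-than test, an empty list, or a list whose scores are all zero or negative, yields the placeholder {"score": 0} rather than a real submission.

```python
from functools import reduce


def max_submission(submissions, initial=None):
    """Fold over submissions, keeping the one with the highest score."""
    # Same starting value as the $reduce stage unless the caller overrides it.
    start = {"score": 0} if initial is None else initial
    return reduce(
        lambda value, this: this if this["score"] > value["score"] else value,
        submissions,
        start,
    )
```

If negative scores are possible in your data, starting from {"score": float("-inf")} in this sketch avoids the placeholder problem; whether that matters depends on your scoring scheme.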


How to get the sum of a value using an id condition over a date range with Elasticsearch?

I'm trying to write a query to get the sum of a value per month of documents with a particular Id. To do this I'm trying:
query = {
"size": 0,
"aggs" : {
"articles_over_time" : {
"date_histogram" : {
"field" : "timestamp",
"interval" : "month"
}
},
"value": {
"sum": {
"field": "generatedTotal"
}
}
}
}
I expected this query to give me the sum of generatedTotal per month, but it gives me the sum of generatedTotal over all documents. How can I get the sum of generatedTotal per month for a particular generatorId?
Example of a document in the Elasticsearch index:
{'id': 0, 'timestamp': '2018-01-01', 'generatorId': '150', 'generatedTotal': 2166.8759558092734}
If you do it separately like that, it counts as 2 different aggregations. You first need to query for the specific generatorId that you want, then do the second aggs within the first aggs:
{
"size": 0,
"query": {
"term": {
"generatorId": "150"
}
},
"aggs": {
"articles_over_time": {
"date_histogram": {
"field": "timestamp",
"interval": "month"
},
"aggs": {
"monthlyGeneratedTotal": {
"sum": {
"field": "generatedTotal"
}
}
}
}
}
}
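Since the question builds the query as a Python dict, the nested form can be wrapped in a small helper. This is a sketch under assumptions: the function name monthly_sum_query and its interval_key parameter are made up here. Recent Elasticsearch versions use calendar_interval in place of interval for date histograms, so the key name is left configurable; check which one your cluster expects.

```python
def monthly_sum_query(generator_id, interval_key="interval"):
    """Build the filtered, nested aggregation as a plain dict."""
    return {
        "size": 0,
        # Restrict the documents first, so the buckets only see one generator.
        "query": {"term": {"generatorId": generator_id}},
        "aggs": {
            "articles_over_time": {
                "date_histogram": {"field": "timestamp", interval_key: "month"},
                # The sum is nested inside the histogram: one sum per bucket.
                "aggs": {
                    "monthlyGeneratedTotal": {"sum": {"field": "generatedTotal"}}
                },
            }
        },
    }
```

The returned dict can be passed as the request body to whichever Elasticsearch client call you already use.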
Four sample documents (one has a different generatorId and will not be counted in the aggregations):
{"timestamp": "2018-02-01", "generatedTotal": 3, "generatorId": "150"}
{"timestamp": "2018-01-01", "generatedTotal": 1, "generatorId": "150"}
{"timestamp": "2018-01-01", "generatedTotal": 2, "generatorId": "150"}
{"timestamp": "2018-01-01", "generatedTotal": 2, "generatorId": "160"}
Then you will get the following aggregations:
{
"aggregations": {
"articles_over_time": {
"buckets": [
{
"key_as_string": "2018-01-01T00:00:00.000Z",
"key": 1514764800000,
"doc_count": 2,
"monthlyGeneratedTotal": {
"value": 3.0
}
},
{
"key_as_string": "2018-02-01T00:00:00.000Z",
"key": 1517443200000,
"doc_count": 1,
"monthlyGeneratedTotal": {
"value": 3.0
}
}
]
}
}
}
I hope this answers your question.
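The bucket values above can be sanity-checked with a few lines of plain Python that filter by generatorId and then sum per month. This is only a client-side mirror of the arithmetic, not how Elasticsearch computes it; the function name monthly_totals is hypothetical, and it assumes timestamps are ISO date strings as in the sample documents.

```python
from collections import defaultdict


def monthly_totals(docs, generator_id):
    """Sum generatedTotal per month for one generatorId."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        if doc["generatorId"] != generator_id:
            continue  # plays the role of the term query
        month = doc["timestamp"][:7]  # "YYYY-MM" prefix of an ISO date string
        totals[month] += doc["generatedTotal"]
        counts[month] += 1
    return {
        month: {"doc_count": counts[month], "value": totals[month]}
        for month in sorted(totals)
    }
```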

Elasticsearch Hybrid Query - Always returning a score of 0

I’m currently trying to do a hybrid search on two indexes: a full text index and knn_vector (word embeddings) index. Currently, over 10’000 documents from Wikipedia are indexed on an ES stack, indexed on both of these fields (see mapping: “content”, “embeddings”). The queries are well known n-grams (1,2,3) that should yield results (words are taken from the wikipedia pages that are indexed).
It is also important to note that the knn_vector index is defined as a nested object.
This is the current mapping of the items indexed:
mapping = {
"settings": {
"index": {
"knn": True,
"knn.space_type": "cosinesimil"
}
},
"mappings": {
"dynamic": 'strict',
"properties": {
"elasticId":
{ 'type': 'text' },
"owners":
{ 'type': 'text' },
"type":
{ 'type': 'keyword' },
"accessLink":
{ 'type': 'keyword' },
"content":
{ 'type': 'text'},
"embeddings": {
'type': 'nested',
"properties": {
"vector": {
"type": "knn_vector",
"dimension": VECTOR_DIM,
},
},
},
}
}
}
My goal is to compare the query scores on both indexes to understand whether one is more effective than the other (full text vs. knn_vectors), and how Elasticsearch chooses which object to return based on the score from each index.
I understand I could simply split the queries (two separate queries), but ideally, we might want to use a hybrid search of this type in production.
This is the current query that searches on both full text and the knn_vectors:
def MakeHybridSearch(query):
    query_vector = convert_to_embeddings(query)
    result = elastic.search({
        "explain": True,
        "profile": True,
        "size": 2,
        "query": {
            "function_score": { #function_score
                "functions": [
                    {
                        "filter": {
                            "match": {
                                "text": {
                                    "query": query,
                                    'boost': "5",
                                },
                            },
                        },
                        "weight": 2
                    },
                    {
                        "filter": {
                            'script': {
                                'source': 'knn_score',
                                'params': {
                                    'field': 'doc_vector',
                                    'vector': query_vector,
                                    'space_type': "l2"
                                }
                            }
                        },
                        "weight": 4
                    }
                ],
                "max_boost": 5,
                "score_mode": "replace",
                "boost_mode": "multiply",
                "min_score": 5
            }
        }
    }, index='files_en', size=1000)
The current problem is that none of the queries return anything.
Result:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
Even when the query does return a response, it returns hits with a score of 0 (score = 0).
Is there an error in the query structure? Could this be on the mapping side? If not, is there a better way of doing this?
Thank you for your help!
I've seen issues similar to this when a solution doesn't allow for prefiltering prior to running k-NN and requires k-NN be run prior to filters. This often leads to null set results as none of the hits from the initial k-NN align with the subsequent filters.
Here is an option that might help as it allows for prefiltering prior to k-NN: https://medium.com/gsi-technology/scalable-semantic-vector-search-with-elasticsearch-e79f9145ba8e
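To make the failure mode concrete, here is a toy brute-force sketch in plain Python. It is not how Elasticsearch or any k-NN plugin is implemented; the function names, the squared-L2 distance, and the sample data are all assumptions for illustration. When the filter is applied after taking the k nearest neighbours, the result can be empty even though matching documents exist; filtering first avoids that.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn(docs, query_vector, k):
    """Brute-force k nearest neighbours by squared L2 distance."""
    return sorted(docs, key=lambda d: squared_l2(d["vector"], query_vector))[:k]


def post_filter_search(docs, query_vector, k, predicate):
    """k-NN first, filter second: may drop every hit."""
    return [d for d in knn(docs, query_vector, k) if predicate(d)]


def pre_filter_search(docs, query_vector, k, predicate):
    """Filter first, k-NN second: neighbours come from matching docs only."""
    return knn([d for d in docs if predicate(d)], query_vector, k)
```

For example, with two red documents near the query and two blue ones far away, asking for the 2 nearest blue documents returns nothing with post_filter_search and both blue documents with pre_filter_search.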
AWS Elasticsearch now supports post-filtering:
https://docs.aws.amazon.com/opensearch-service/latest/developerguide/knn.html
The sample query on that page is:
{
"size": 2,
"query": {
"knn": {
"my_vector2": {"vector": [2, 3, 5, 6], "k": 2}
}
},
"post_filter": {
"range": {
"price": {"gte": 6, "lte": 10}
}
}
}
AWS Elasticsearch also supports pre-filtering using 'script_score' (but it is different from 'script_score' in the original Elasticsearch):
https://opendistro.github.io/for-elasticsearch-docs/docs/knn/knn-score-script/
The sample query on that page is:
{
"size": 2,
"query": {
"script_score": {
"query": {
"bool": {
"filter": {
"term": {
"color": "BLUE"
}
}
}
},
"script": {
"lang": "knn",
"source": "knn_score",
"params": {
"field": "my_binary",
"query_value": "U29tZXRoaW5nIEltIGxvb2tpbmcgZm9y",
"space_type": "hammingbit"
}
}
}
}
}

MongoDB - Get SUM of values INSIDE of the array

I have a JSON document stored in MongoDB with a structure like this:
[{ "SessionKey": "172e3b6b-509e-4ef3-950c-0c1dc5c83bab",
"Query": {"Date": "2020-03-04"},
"Flights": [
{"LegId":"13235",
"PricingOptions": [
{"Agents": [1963108],
"Price": 61763.64 },
{"Agents": [4035868],
"Price": 62395.83 }]},
{"LegId": "13236",
"PricingOptions": [{
"Agents": [2915951],
"Price": 37188.0}]}
...
The result I'm trying to get is "LegId":"sum_per_flight", in this case -> {'13235': (61763.64+62395.83), '13236': 37188.0} and then get flights with price < N
I've tried to run this pipeline for the aggregation step, but it returns a list of ALL prices, and I don't know how to sum them up properly:
result = collection.aggregate([
{'$match': {'Query.Date': '2020-03-01'}},
{'$group': {'_id': {'Flight':'$Flights.LegId', 'Price':'$Flights.PricingOptions.Price'}}} ])
Also I've tried this pipeline, but it returns 0 for 'total_price_per_flight':
result = collection.aggregate({'$project': {
'Flights.LegId':1,
'total_price_per_flight': {'$sum': '$Flights.PricingOptions.Price'}
}})
You need to use $unwind to flatten the Flights array so that each flight can be processed individually.
With the $reduce operator, we iterate over the PricingOptions array and sum the Price fields (accumulating the prices).
In the last step we group the documents back into their original structure. Before that, you may apply the "get flights with price < N" filter.
db.collection.aggregate([
{
"$match": {
"Query.Date": "2020-03-04"
}
},
{
$unwind: "$Flights"
},
{
$addFields: {
"Flights.LegId": {
$arrayToObject: [
[
{
k: "$Flights.LegId",
v: {
$reduce: {
input: "$Flights.PricingOptions",
initialValue: 0,
in: {
$add: [
"$$value",
"$$this.Price"
]
}
}
}
}
]
]
}
}
},
{
$group: {
_id: "$_id",
SessionKey: {
$first: "$SessionKey"
},
Query: {
$first: "$Query"
},
Flights: {
$push: "$Flights"
}
}
}
])
MongoPlayground
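For checking results on a document already loaded in Python, the same per-flight sum and the "price < N" filter can be written directly. This is a client-side sketch, not a replacement for the pipeline; the function name price_per_leg and the max_total parameter are hypothetical.

```python
def price_per_leg(document, max_total=None):
    """Map each LegId to the sum of its PricingOptions prices.

    If max_total is given, keep only flights whose total is below it.
    """
    totals = {}
    for flight in document["Flights"]:  # one iteration per unwound flight
        total = sum(option["Price"] for option in flight["PricingOptions"])
        if max_total is None or total < max_total:
            totals[flight["LegId"]] = total
    return totals
```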

Filter MongoDB query to find documents only if a field in a list of objects is not empty

I have a MongoDB document structure like following:
Structure
{
"stores": [
{
"items": [
{
"feedback": [],
"item_category": "101",
"item_id": "10"
},
{
"feedback": [],
"item_category": "101",
"item_id": "11"
}
]
},
{
"items": [
{
"feedback": [],
"item_category": "101",
"item_id": "10"
},
{
"feedback": ["A feedback"],
"item_category": "101",
"item_id": "11"
},
{
"feedback": [],
"item_category": "101",
"item_id": "12"
},
{
"feedback": [],
"item_category": "102",
"item_id": "13"
},
{
"feedback": [],
"item_category": "102",
"item_id": "14"
}
],
"store_id": 500
}
]
}
This is a single document in a collection. Some fields have been removed to give a minimal representation of the data.
What I want is to get items only if the feedback field in the items array is not empty. The expected result is:
Expected result
{
"stores": [
{
"items": [
{
"feedback": ["A feedback"],
"item_category": "101",
"item_id": "11"
}
],
"store_id": 500
}
]
}
This is what I tried, based on the examples in this link, which I think cover much the same situation, but it didn't work. What's wrong with my query? Isn't it the same situation as the zipcode search example in the link? It returns everything, as in the first JSON listing (Structure):
What I tried
query = {
'date': {'$gte': since, '$lte': until},
'stores.items': {"$elemMatch": {"feedback": {"$ne": []}}}
}
Thanks.
Please try this:
db.yourCollectionName.aggregate([
{ $match: { 'date': { '$gte': since, '$lte': until }, 'stores.items': { "$elemMatch": { "feedback": { "$ne": [] } } } } },
{ $unwind: '$stores' },
{ $match: { 'stores.items': { "$elemMatch": { "feedback": { "$ne": [] } } } } },
{ $unwind: '$stores.items' },
{ $match: { 'stores.items.feedback': { "$ne": [] } } },
{ $group: { _id: { _id: '$_id', store_id: '$stores.store_id' }, items: { $push: '$stores.items' } } },
{ $project: { _id: '$_id._id', store_id: '$_id.store_id', items: 1 } },
{ $group: { _id: '$_id', stores: { $push: '$$ROOT' } } },
{ $project: { 'stores._id': 0 } }
])
All these stages are needed because you are operating on an array of arrays. The query is written on the assumption that you are dealing with a large data set. Since you are filtering on dates, if far fewer documents remain after the first $match, you can drop the $match stage that sits between the two $unwind stages.
References: $match, $unwind, $project, $group
This aggregate query gets the needed result (using the provided sample document and run from the mongo shell):
db.stores.aggregate( [
{ $unwind: "$stores" },
{ $unwind: "$stores.items" },
{ $addFields: { feedbackExists: { $gt: [ { $size: "$stores.items.feedback" }, 0 ] } } },
{ $match: { feedbackExists: true } },
{ $project: { _id: 0, feedbackExists: 0 } }
] )
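If the documents are small enough to post-process in Python, the expected result can also be produced client-side. The sketch below is an assumption-laden alternative, not part of either answer: the function name items_with_feedback is made up, it keeps the nested stores/items shape, and it drops any store left with no items.

```python
def items_with_feedback(document):
    """Keep only items with non-empty feedback, preserving the nesting."""
    stores = []
    for store in document.get("stores", []):
        kept = [item for item in store.get("items", []) if item.get("feedback")]
        if kept:  # drop stores that have no matching items
            pruned = dict(store)  # shallow copy keeps fields such as store_id
            pruned["items"] = kept
            stores.append(pruned)
    return {"stores": stores}
```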

How to make pymongo aggregation with count all elements and grouping by one request

I have a collection with fields like this:
{
"_id":"5cf54857bbc85fd0ff5640ba",
"book_id":"5cf172220fb516f706d00591",
"tags":{
"person":[
{"start_match":209, "length_match":6, "word":"kimmel"}
],
"organization":[
{"start_match":107, "length_match":12, "word":"philadelphia"},
{"start_match":209, "length_match":13, "word":"kimmel center"}
],
"location":[
{"start_match":107, "length_match":12, "word":"philadelphia"}
]
},
"deleted":false
}
I want to collect the distinct words in each category and count them.
So, the output should be like this:
{
"response": [
{
"tag": "location",
"tag_list": [
{
"count": 31,
"phrase": "philadelphia"
},
{
"count": 15,
"phrase": "usa"
}
]
},
{
"tag": "organization",
"tag_list": [ ... ]
},
{
"tag": "person",
"tag_list": [ ... ]
},
]
}
A pipeline like this works:
def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]
But it makes one request per tag. I want to know how to do it in a single request.
Thank you for your attention.
As noted, there is a slight mismatch in the question data to the current claimed pipeline process since $unwind can only be used on arrays and the tags as presented in the question is not an array.
For the data presented in the question you basically want a pipeline like this:
db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
Again as per the note, since tags is in fact an object, what you actually need in order to collect data based on its sub-keys, as the question is asking, is to turn it into an array of items.
The usage of $replaceRoot in your current pipeline would seem to indicate that $objectToArray is of fair use here, as it is available from later patch releases of MongoDB 3.4, being the bare minimal version you should be using in production right now.
That $objectToArray actually does pretty much what the name says and produces an array ( or "list" to be more pythonic ) of entries broken into key and value pairs. These are essentially a "list" of objects ( or "dict" entries ) which have the keys k and v respectively. The output of the first pipeline stage would look like this on the supplied document:
{
"book_id": "5cf172220fb516f706d00591",
"tags": [
{
"k": "person",
"v": [
{
"start_match": 209,
"length_match": 6,
"word": "kimmel"
}
]
}, {
"k": "organization",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}, {
"start_match": 209,
"length_match": 13,
"word": "kimmel center"
}
]
}, {
"k": "location",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}
]
}
],
"deleted" : false
}
So you should be able to see how you can now easily access those k values and use them in grouping, and of course v is a standard array as well. So it's just the two $unwind stages as shown and then two $group stages. The first $group collects over the combination of keys, and the second collects by the main grouping key whilst adding the other accumulations to a "list" within that entry.
Of course the output by the above listing is not exactly how you asked for in the question, but the data is basically there. You can optionally add an $addFields or $project stage to essentially rename the _id key as the final aggregation stage:
{ "$addFields": {
"_id": "$$REMOVE",
"tag": "$_id"
}}
Or simply do something pythonic with a little list comprehension on the cursor output:
cursor = db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor]
print({ 'response': output })
And final output as a "list" you can use for response:
{
"tag_list": [
{
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "location"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel"
}
],
"tag": "person"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel center"
}, {
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "organization"
}
Noting that using a list comprehension approach you have a bit more control over the order of "keys" as output, as MongoDB itself would simply append NEW key names in a projection keeping existing keys ordered first. If that sort of thing is important to you that is. Though it really should not be since all Object/Dict like structures should not be considered to have any set order of keys. That's what arrays ( or lists ) are for.
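To check what the $objectToArray, two $unwind, and two $group stages compute, the same counting can be mirrored in plain Python on documents already in memory. This is only a sketch for verification: the function name count_tag_words is hypothetical, and the sorting (tags alphabetically, phrases by descending count) is an assumption added here, since the pipeline itself does not guarantee any order.

```python
from collections import Counter, defaultdict


def count_tag_words(docs):
    """Count each (tag, word) pair, then collect the counts per tag."""
    counts = Counter()
    for doc in docs:
        # .items() plays the role of $objectToArray plus the first $unwind.
        for tag, matches in doc.get("tags", {}).items():
            for match in matches:  # second $unwind
                counts[(tag, match["word"])] += 1  # first $group
    grouped = defaultdict(list)
    for (tag, word), count in counts.items():  # second $group
        grouped[tag].append({"count": count, "phrase": word})
    return {
        "response": [
            {
                "tag": tag,
                "tag_list": sorted(items, key=lambda i: (-i["count"], i["phrase"])),
            }
            for tag, items in sorted(grouped.items())
        ]
    }
```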
