I am using pymongo's MongoClient to do a distinct count over multiple fields.
I found a similar example here: Link
But it doesn't work for me.
For example, by given:
data = [{"name": random.choice(all_names),
         "value": random.randint(1, 1000)} for i in range(1000)]
collection.insert(data)
I want to count how many distinct name/value combinations there are. So I followed the Link above and wrote this just as a test (I know this solution is not what I want; I am just following the pattern of the Link and trying to understand how it works, since at least this code returns me something):
collection.aggregate([
    {
        "$group": {
            "_id": {
                "name": "$name",
                "value": "$value",
            }
        }
    },
    {
        "$group": {
            "_id": {
                "name": "$_id.name",
            },
            "count": {"$sum": 1},
        },
    }
])
But the console gives me this:
on namespace test.$cmd failed: exception: A pipeline stage
specification object must contain exactly one field.
So, what is the right code to do this? Thank you for all your help.
Finally I found a solution: Group by Null
res = col.aggregate([
    {
        "$group": {
            "_id": {
                "name": "$name",
                "value": "$value",
            },
        }
    },
    {
        "$group": {"_id": None, "count": {"$sum": 1}}
    },
])
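Since the second $group collapses everything into a single document, the count can be read straight off the cursor. The same distinct-pair counting logic can also be checked in plain Python; this is a sketch with made-up sample data, not code from the original post:

```python
# Emulate the two-stage pipeline in plain Python:
# stage 1 groups on the (name, value) pair, stage 2 counts the groups.
data = [
    {"name": "alice", "value": 10},
    {"name": "alice", "value": 10},   # duplicate pair, counted once
    {"name": "alice", "value": 20},
    {"name": "bob", "value": 10},
]

# First $group: one bucket per distinct (name, value) combination.
distinct_pairs = {(d["name"], d["value"]) for d in data}

# Second $group with "_id": None: a single document whose count
# is the number of buckets produced by the first stage.
count = len(distinct_pairs)
print(count)  # 3
```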
I wrote a pymongo aggregation pipeline for my MongoDB which not only takes a long time but also doesn't return exactly what it is supposed to. I expect the free_id field to be matched via regex, as that works with the $in command as well, but it doesn't seem to work here: this still returns documents with free_ids matching XXX-XXX/YXYXYXYXY/.*.
.aggregate([
    {"$match": {"free_id": {
        "$nin": ["XXX-XXX/YXYXYXYXY/.*"]}}},
    {
        "$group": {
            "_id": "$free_id",
            "count": {"$sum": 1},
            "ts": {"$last": "$payload.ts"}
        }
    },
    {
        "$project": {
            "internal_id": {"$slice": [{"$split": ["$_id", '/']}, 1, 1]},
            "document_count": "$count",
            "last_ts": "$ts"
        }
    },
    {
        "$group": {
            "_id": {"$first": "$internal_id"},
            "docs": {"$sum": "$document_count"},
            "last": {"$last": "$last_ts"}
        }
    },
])
I have already tried negating a $in and double-checked the calls, etc. In theory I expect a response similar to what I get by using find({"free_id": {'$not': {'$regex': "XXX-XXX/YXYXYXYXY/.*"}}}).
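One likely cause: a plain string inside $nin is compared by equality, not as a pattern, so only a free_id that is literally "XXX-XXX/YXYXYXYXY/.*" would be excluded. A compiled pattern (which PyMongo sends to the server as a BSON regex) is matched as a regex instead. The intended exclusion can be sketched client-side to make the semantics concrete (the sample free_id values here are made up):

```python
import re

# Server-side, the equivalent match stage would be something like
# (an assumption, not from the original post):
#   {"$match": {"free_id": {"$not": re.compile(r"^XXX-XXX/YXYXYXYXY/")}}}
pattern = re.compile(r"^XXX-XXX/YXYXYXYXY/")

free_ids = [
    "XXX-XXX/YXYXYXYXY/abc",   # matches the pattern, should be excluded
    "XXX-XXX/ZZZZZZZZZ/abc",   # should be kept
    "XXX-XXX/YXYXYXYXY/.*",    # the literal string; the only value $nin would drop
]

# Client-side equivalent of the "$not"-regex filter.
kept = [f for f in free_ids if not pattern.match(f)]
print(kept)  # ['XXX-XXX/ZZZZZZZZZ/abc']
```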
I have a collection with fields like this:
{
"_id":"5cf54857bbc85fd0ff5640ba",
"book_id":"5cf172220fb516f706d00591",
"tags":{
"person":[
{"start_match":209, "length_match":6, "word":"kimmel"}
],
"organization":[
{"start_match":107, "length_match":12, "word":"philadelphia"},
{"start_match":209, "length_match":13, "word":"kimmel center"}
],
"location":[
{"start_match":107, "length_match":12, "word":"philadelphia"}
]
},
"deleted":false
}
I want to collect the different words in the categories and count them.
So, the output should be like this:
{
"response": [
{
"tag": "location",
"tag_list": [
{
"count": 31,
"phrase": "philadelphia"
},
{
"count": 15,
"phrase": "usa"
}
]
},
{
"tag": "organization",
"tag_list": [ ... ]
},
{
"tag": "person",
"tag_list": [ ... ]
},
]
}
A pipeline like this works:
def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]
But it makes one request per tag. I want to know how to do it in a single request.
Thank you for your attention.
As noted, there is a slight mismatch between the question's data and the claimed pipeline, since $unwind can only be used on arrays and tags as presented in the question is not an array.
For the data presented in the question you basically want a pipeline like this:
db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
Again, as per the note: since tags is in fact an object, what you actually need in order to collect data based on its sub-keys, as the question asks, is to turn it into an array of items.
The usage of $replaceRoot in your current pipeline suggests that $objectToArray is fair game here, as it is available from the later patch releases of MongoDB 3.4, the bare minimum version you should be using in production right now.
That $objectToArray does pretty much what the name says and produces an array (or "list", to be more Pythonic) of entries broken into key and value pairs. These are essentially a "list" of objects (or "dict" entries) with the keys k and v respectively. The output of the first pipeline stage on the supplied document would look like this:
{
"book_id": "5cf172220fb516f706d00591",
"tags": [
{
"k": "person",
"v": [
{
"start_match": 209,
"length_match": 6,
"word": "kimmel"
}
]
}, {
"k": "organization",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}, {
"start_match": 209,
"length_match": 13,
"word": "kimmel center"
}
]
}, {
"k": "location",
"v": [
{
"start_match": 107,
"length_match": 12,
"word": "philadelphia"
}
]
}
],
"deleted" : false
}
So you should be able to see how you can now easily access those k values and use them in grouping, and of course v is a standard array as well. So it's just the two $unwind stages as shown and then two $group stages: the first $group collects over the combination of keys, and the second collects on the main grouping key whilst pushing the other accumulations into a "list" within each entry.
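The behaviour of $objectToArray is much like Python's dict.items(), and the whole unwind/group sequence can be mirrored with a Counter to see what each stage contributes. This is a plain-Python sketch over the question's sample tags, not a replacement for the pipeline:

```python
from collections import Counter

# The question's sample tags, trimmed to the fields that matter here.
tags = {
    "person": [{"word": "kimmel"}],
    "organization": [{"word": "philadelphia"}, {"word": "kimmel center"}],
    "location": [{"word": "philadelphia"}],
}

# $objectToArray: the dict becomes a list of {"k": key, "v": value} pairs.
as_array = [{"k": k, "v": v} for k, v in tags.items()]

# The two $unwind stages: one record per (tag, entry) combination.
pairs = [(item["k"], entry["word"]) for item in as_array for entry in item["v"]]

# First $group: count per (tag, phrase) pair.
counts = Counter(pairs)

# Second $group: regroup by tag, pushing {count, phrase} into a list.
response = {}
for (tag, phrase), count in counts.items():
    response.setdefault(tag, []).append({"count": count, "phrase": phrase})

print(response["person"])  # [{'count': 1, 'phrase': 'kimmel'}]
```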
Of course the output of the above listing is not exactly what you asked for in the question, but the data is basically there. You can optionally add an $addFields or $project stage to rename the _id key as the final aggregation stage:
{ "$addFields": {
"_id": "$$REMOVE",
"tag": "$_id"
}}
Or simply do something pythonic with a little list comprehension on the cursor output:
cursor = db.collection.aggregate([
{ "$addFields": {
"tags": { "$objectToArray": "$tags" }
}},
{ "$unwind": "$tags" },
{ "$unwind": "$tags.v" },
{ "$group": {
"_id": {
"tag": "$tags.k",
"phrase": "$tags.v.word"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.tag",
"tag_list": {
"$push": {
"count": "$count",
"phrase": "$_id.phrase"
}
}
}}
])
output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor]
print({ 'response': output })
And final output as a "list" you can use for response:
{
"tag_list": [
{
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "location"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel"
}
],
"tag": "person"
},
{
"tag_list": [
{
"count": 1,
"phrase": "kimmel center"
}, {
"count": 1,
"phrase": "philadelphia"
}
],
"tag": "organization"
}
Note that with the list-comprehension approach you have a bit more control over the order of the output "keys", since MongoDB itself simply appends new key names in a projection, keeping existing keys ordered first. That is, if that sort of thing is important to you; it really should not be, since Object/Dict-like structures should not be considered to have any set order of keys. That's what arrays (or lists) are for.
I'm trying to improve the performance of my app and my knowledge of MongoDB. I have been able to execute a fire-and-forget query that both creates fields if they don't exist and otherwise increments a value, as follows:
date = "2018-6"
sid = "012345"
cid = "06789"
key = "MESSAGES.{}.{}.{}.{}".format(date, sid, cid, hour)
db.stats.update({}, { "$inc": { key : 1 }})
This produces a single document with the following structure:
document:
{
    "MESSAGES": {
        "2018-6": {
            "012345": { "06789": 1 },
            "011111": { "06667": 5 }
        },
        "2018-5": {
            "012345": { "06789": 20 },
            "011111": { "06667": 15 }
        }
    }
}
As you can probably imagine, it has become a bit of a nightmare to query this structure as the data grows. I'd like to achieve the same fire-and-forget query but with a better, indexable schema. Something like:
documents:
[{
    "SID": "012345",
    "MESSAGES": {
        "MONTHS": [{
            "KEY": "2018-6",
            "CHANNELS": [{
                "KEY": "06789",
                "COUNT": 1
            }]
        }, {
            "KEY": "2018-5",
            "CHANNELS": [{
                "KEY": "06789",
                "COUNT": 20
            }]
        }]
    }
},
{
    "SID": "011111",
    "MESSAGES": {
        "MONTHS": [{
            "KEY": "2018-6",
            "CHANNELS": [{
                "KEY": "06667",
                "COUNT": 5
            }]
        }, {
            "KEY": "2018-5",
            "CHANNELS": [{
                "KEY": "06667",
                "COUNT": 15
            }]
        }]
    }
}]
I'm working with quite a large amount of data and these queries can happen many times a second, so it's important that I execute just one operation if at all possible. Any advice you can give is very welcome; feel free to criticise anything you see here too, as my goal is to learn.
Thanks in advance!
UPDATED WITH ATTEMPT:
db.test.updateOne({"SERVER_ID": "23894723487sdf"}, {
    "$addToSet": {
        "MESSAGES": {
            "DATE": "2018-6",
            "CHANNELS": [{
                "ID": "239048349",
                "COUNT": NumberInt(1)
            }]
        }
    },
    "$inc": {
        "MESSAGES.CHANNELS.$.COUNT": 1
    }
}, {upsert: true})
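Note that $addToSet and a positional $inc on the same array cannot cooperate in one update: the positional $ operator needs an array match in the query portion to know which element to target. One common alternative (a sketch of a different schema, not something from the original post) is one document per (server, month, channel) bucket, so a single upsert both creates the document and increments the counter:

```python
# Hypothetical per-bucket schema: one document per (sid, date, cid).
# With pymongo this would be a single fire-and-forget upsert, e.g.:
#
#     db.stats.update_one(
#         {"SID": sid, "DATE": "2018-6", "CHANNEL": cid},
#         {"$inc": {"COUNT": 1}},
#         upsert=True,
#     )
#
# with a compound index on (SID, DATE, CHANNEL) to keep lookups fast.
# The create-or-increment semantics of that upsert, emulated in plain Python:
stats = {}

def record_message(sid, date, cid):
    key = (sid, date, cid)
    # $inc with upsert=True: create the counter at 1, or add 1 if it exists.
    stats[key] = stats.get(key, 0) + 1

record_message("012345", "2018-6", "06789")
record_message("012345", "2018-6", "06789")
record_message("011111", "2018-6", "06667")

print(stats[("012345", "2018-6", "06789")])  # 2
```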
With PyMongo, group by one key seems to be ok:
results = collection.group(key={"scan_status":0}, condition={'date': {'$gte': startdate}}, initial={"count": 0}, reduce=reducer)
results:
{u'count': 215339.0, u'scan_status': u'PENDING'} {u'count': 617263.0, u'scan_status': u'DONE'}
but when I try to do group by multiple keys I get an exception:
results = collection.group(key={"scan_status":0,"date":0}, condition={'date': {'$gte': startdate}}, initial={"count": 0}, reduce=reducer)
How can I do group by multiple fields correctly?
If you are trying to count over two keys then, while it is possible using .group(), your better option is .aggregate().
This uses "native code operators" rather than the JavaScript interpreted code required by .group() to perform the same basic "grouping" action you are trying to achieve.
Particularly here is the $group pipeline operator:
result = collection.aggregate([
    # Match the possible documents
    { "$match": { "date": { "$gte": startdate } } },
    # Group the documents and "count" via $sum on the values
    { "$group": {
        "_id": {
            "scan_status": "$scan_status",
            "date": "$date"
        },
        "count": { "$sum": 1 }
    }}
])
In fact you probably want something that reduces the "date" into a distinct period. As in:
result = collection.aggregate([
    # Match the possible documents
    { "$match": { "date": { "$gte": startdate } } },
    # Group the documents and "count" via $sum on the values
    { "$group": {
        "_id": {
            "scan_status": "$scan_status",
            "date": {
                "year": { "$year": "$date" },
                "month": { "$month": "$date" },
                "day": { "$dayOfMonth": "$date" }
            }
        },
        "count": { "$sum": 1 }
    }}
])
Using the Date Aggregation Operators as shown here.
Or perhaps with basic "date math":
# PyMongo can encode datetime.datetime values, but not datetime.date
from datetime import datetime

result = collection.aggregate([
    # Match the possible documents
    { "$match": { "date": { "$gte": startdate } } },
    # Group the documents and "count" via $sum on the values,
    # using "epoch" "1970-01-01" as a base to convert to integer
    { "$group": {
        "_id": {
            "scan_status": "$scan_status",
            "date": {
                "$subtract": [
                    { "$subtract": [ "$date", datetime.utcfromtimestamp(0) ] },
                    { "$mod": [
                        { "$subtract": [ "$date", datetime.utcfromtimestamp(0) ] },
                        1000 * 60 * 60 * 24
                    ]}
                ]
            }
        },
        "count": { "$sum": 1 }
    }}
])
Which will return integer values from "epoch" time instead of a composite value object.
But all of these options are better than .group() as they use natively coded routines and perform their actions much faster than the JavaScript code you would otherwise need to supply.
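The epoch arithmetic in that last pipeline can be sanity-checked in plain Python: subtracting the delta modulo one day in milliseconds snaps a timestamp back to midnight. A quick check using the same constants as the pipeline (the sample timestamp is made up):

```python
from datetime import datetime, timedelta

epoch = datetime(1970, 1, 1)
ts = datetime(2018, 6, 15, 13, 45, 30)  # arbitrary example timestamp

day_ms = 1000 * 60 * 60 * 24

# Same arithmetic as the $subtract/$mod stages, done client-side in ms.
delta_ms = int((ts - epoch).total_seconds() * 1000)
truncated_ms = delta_ms - (delta_ms % day_ms)

# Converting back shows the value lands on midnight of the same day.
print(epoch + timedelta(milliseconds=truncated_ms))  # 2018-06-15 00:00:00
```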
I am using python to generate a mongodb database collection and I need to find some specific values from the database, the document is like:
{
    "_id": ObjectId(215454541245),
    "category": "food",
    "venues": {"Thai Restaurant": 251, "KFC": 124, "Chinese Restaurant": 21, .....}
}
My question is: I want to query this database and find all venues which have a value smaller than 200, so in my example "KFC" and "Chinese Restaurant" would be returned.
Does anyone know how to do that?
If you can change your schema it would be much easier to issue queries against your collection. As it is, having dynamic values as your keys is considered a bad design pattern with MongoDB as they are extremely difficult to query.
A recommended approach would be to follow an embedded model like this:
{
"_id": ObjectId("553799187174b8c402151d06"),
"category": "food",
"venues": [
{
"name": "Thai Restaurant",
"value": 251
},
{
"name": "KFC",
"value": 124
},
{
"name": "Chinese Restaurant",
"value": 21
}
]
}
Thus with this structure you could then issue the query to find all venues which have a value smaller than 200:
db.collection.findOne({"venues.value": { "$lt": 200 } },
{
"venues": { "$elemMatch": { "value": { "$lt": 200 } } },
"_id": 0
});
This will return the result:
/* 0 */
{
"venues" : [
{
"name" : "KFC",
"value" : 124
}
]
}
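One caveat: an $elemMatch projection returns only the first matching array element, which is why the result above shows "KFC" but not "Chinese Restaurant". To get every matching venue in one document, a $filter aggregation stage (available from MongoDB 3.2) applies the same condition to the whole array; its behaviour matches a client-side list filter, sketched here over the answer's sample document:

```python
# The aggregation stage this corresponds to would be something like
# (a sketch, assuming MongoDB 3.2+, not from the original answer):
#   {"$project": {"_id": 0, "venues": {
#       "$filter": {"input": "$venues", "as": "v",
#                   "cond": {"$lt": ["$$v.value", 200]}}}}}
doc = {
    "category": "food",
    "venues": [
        {"name": "Thai Restaurant", "value": 251},
        {"name": "KFC", "value": 124},
        {"name": "Chinese Restaurant", "value": 21},
    ],
}

# The same $lt condition applied client-side keeps every match, not just the first.
under_200 = [v for v in doc["venues"] if v["value"] < 200]
print([v["name"] for v in under_200])  # ['KFC', 'Chinese Restaurant']
```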