I wanted to write pipeline code that gives me the 5 users with the most tweets. I tried to use $push, and the MongoDB documentation also showed $sort. I get a syntax error on the text line, but at least to me it is not an obvious one.
It would be really nice if someone could point me in the right direction. I watched some videos and read some pages, but I could not find what is wrong with my code.
pipeline = [
{"$group" : {
"_id": "$user.screen_name",
{
"$push": {"texts" : "$text"}},
{
"$sort" : {"texts":-1}}},
{
"$limit" :5}}
]
The aggregation pipeline documentation gives a well-structured explanation of how aggregation works, with examples.
As for your question, you are asking the same thing more than once.
Anyway, in your query $group should not contain $sort and $limit (check the $group syntax): they are separate pipeline stages. $push is also misplaced (check the $push syntax): it has to be the value of a named output field. So your aggregation query should look like this:
pipeline = [{
"$group": {
"_id": "$user.screen_name",
"tweet_data": {
"$push": {
"texts": "$text"
}
}
}
}, {
"$sort": {
"texts": -1
}
}, {
"$limit": 5
}]
"I wanted to write pipleline code, that gives me the 5 users with the most tweets"
I can't say whether this is an improvement over @yogesh's answer, but given your description you only need to count the tweets, not pass them all along your pipeline. At the very least, using $sum would be much more memory efficient:
pipeline = [{
"$group": {
"_id": "$user.screen_name",
"count": { "$sum": 1 }
}
}, {
"$sort": {
"count": -1
}
}, {
"$limit": 5
}]
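A minimal sketch of how the count-based pipeline could be sanity-checked without a server. The sample documents and the helper name top_users are made up for illustration; the in-memory function only mirrors what $group with $sum, $sort and $limit are expected to do.

```python
from collections import Counter

# The count-based pipeline as a plain Python list.
# With pymongo it would be passed to collection.aggregate(pipeline).
pipeline = [
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 5},
]


def top_users(tweets, n=5):
    """In-memory equivalent: count tweets per screen name, keep the n largest."""
    counts = Counter(t["user"]["screen_name"] for t in tweets)
    return [{"_id": name, "count": c} for name, c in counts.most_common(n)]


# Hypothetical sample data, only for illustration.
sample_tweets = [
    {"user": {"screen_name": "alice"}, "text": "a1"},
    {"user": {"screen_name": "bob"}, "text": "b1"},
    {"user": {"screen_name": "alice"}, "text": "a2"},
    {"user": {"screen_name": "carol"}, "text": "c1"},
    {"user": {"screen_name": "bob"}, "text": "b2"},
    {"user": {"screen_name": "alice"}, "text": "a3"},
]

for row in top_users(sample_tweets):
    print(row)
```

Only the counter travels through the pipeline, so no tweet text has to be held in memory per user.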
Related
I wrote a pymongo aggregation pipeline for my MongoDB, which not only takes a long time but also doesn't exactly return what it is supposed to. I expect the free_id field to be matched via regex, as it works with the $in command as well, but it doesn't seem to work, as this still returns documents with free_ids matching XXX-XXX/YXYXYXYXY/.*.
.aggregate([
{"$match": {"free_id": {
"$nin": ["XXX-XXX/YXYXYXYXY/.*"]}}},
{
"$group": {
"_id": "$free_id",
"count": {"$sum": 1},
"ts": {"$last": "$payload.ts"}
}
},
{
"$project": {
"internal_id": {"$slice": [{"$split": ["$_id", '/']}, 1, 1]},
"document_count": "$count",
"last_ts": "$ts"
}
},
{
"$group": {
"_id": {"$first": "$internal_id"},
"docs": {"$sum": "$document_count"},
"last": {"$last": "$last_ts"}
}
},
])
I already tried with negating a $in and double checked the calls etc. Theoretically I expect a similar response as I would get by using find({"free_id": {'$not': {'$regex': "XXX-XXX/YXYXYXYXY/.*"}}}).
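One likely cause is that a plain string inside $nin is compared literally, so "XXX-XXX/YXYXYXYXY/.*" only excludes documents whose free_id is exactly that text. Since $match takes the same query syntax as find(), the $not/$regex form can be used as the first stage. The sketch below only builds the stage and checks the pattern locally with Python's re module; anchoring the pattern with ^ and the sample IDs are assumptions.

```python
import re

# Assumption: the IDs to exclude start with this prefix, hence the ^ anchor.
EXCLUDE_PATTERN = r"^XXX-XXX/YXYXYXYXY/"

# Same query document that find() accepts, wrapped in a $match stage.
match_stage = {"$match": {"free_id": {"$not": {"$regex": EXCLUDE_PATTERN}}}}


def is_kept(free_id):
    """Local check of which IDs the pattern would let through."""
    return re.search(EXCLUDE_PATTERN, free_id) is None


# Made-up example IDs, only for illustration.
sample_ids = ["XXX-XXX/YXYXYXYXY/1", "AAA-AAA/ZZZ/2"]
kept = [i for i in sample_ids if is_kept(i)]
print(kept)
```

The remaining $group and $project stages would follow match_stage unchanged.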
I'm trying to get, from my MongoDB, all documents with the latest date.
My database looks like this:
_id:"1"
date:"21-12-20"
report:"some stuff"
_id:"2"
date:"11-11-11"
report:"qualcosa"
_id:5fe08735b5a28812866cbc8a
date:"21-12-20"
report:Object
_id:5fe0b35e2f465c2a2bbfc0fd
date:"20-12-20"
report:"ciao"
and I would like to get a result like this:
_id:"1"
date:"21-12-20"
report:"some stuff"
_id:5fe08735b5a28812866cbc8a
date:"21-12-20"
report:Object
I tried to run this script :
db.collection.find({}).sort([("date", -1)]).limit(1)
but it gives me only one document.
How can I get all the documents with the greatest date automatically?
Try to remove limit(1) and it's gonna work
If you add .limit(1) it's only ever going to give you one document.
Either use the answer as a query to another .find(), or you can write an aggregate query. If your data set is a modest size, I prefer the former for clarity.
# limit(1) must be applied to the cursor, before converting it to a list
max_date = list(db.collection.find({}).sort([("date", -1)]).limit(1))
if len(max_date) > 0:
    docs = list(db.collection.find({'date': max_date[0]['date']}))
Use an aggregation pipeline like this:
db.collection.aggregate([
{ $group: { _id: null, data: { $push: "$$ROOT" } } },
{
$set: {
data: {
$filter: {
input: "$data",
cond: { $eq: [{ $max: "$data.date" }, "$$this.date"] }
}
}
}
},
{ $unwind: "$data" },
{ $replaceRoot: { newRoot: "$data" } }
])
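One caveat: both approaches above compare date exactly as stored. Here it is a string, so the comparison is lexicographic and, for example, "31-01-20" sorts above "21-12-20" even though it would be the earlier day in a day-month-year reading. The sketch below parses the strings before comparing. The format "%d-%m-%y" is a guess from the sample values, and the documents are simplified copies of those in the question.

```python
from datetime import datetime

DATE_FORMAT = "%d-%m-%y"  # assumption: day-month-two-digit-year

# Simplified copies of the documents from the question (IDs shortened).
docs = [
    {"_id": "1", "date": "21-12-20", "report": "some stuff"},
    {"_id": "2", "date": "11-11-11", "report": "qualcosa"},
    {"_id": "3", "date": "21-12-20", "report": {}},
    {"_id": "4", "date": "20-12-20", "report": "ciao"},
]


def latest_docs(documents):
    """Return every document whose parsed date equals the newest date."""
    if not documents:
        return []
    parsed = [(datetime.strptime(d["date"], DATE_FORMAT), d) for d in documents]
    newest = max(p for p, _ in parsed)
    return [d for p, d in parsed if p == newest]


for doc in latest_docs(docs):
    print(doc["_id"], doc["date"])
```

Storing the field as a real date type would let the server-side sort and $max behave chronologically without this extra step.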
I'm performing a search with this aggregate and would like to get my total count (to deal with my pagination).
results = mongo.db.perfumes.aggregate(
[
{"$match": {"$text": {"$search": db_query}}},
{
"$lookup": {
"from": "users",
"localField": "author",
"foreignField": "username",
"as": "creator",
}
},
{"$unwind": "$creator"},
{
"$project": {
"_id": "$_id",
"perfumeName": "$name",
"perfumeBrand": "$brand",
"perfumeDescription": "$description",
"date_updated": "$date_updated",
"perfumePicture": "$picture",
"isPublic": "$public",
"perfumeType": "$perfume_type",
"username": "$creator.username",
"firstName": "$creator.first_name",
"lastName": "$creator.last_name",
"profilePicture": "$creator.avatar",
}
},
{"$sort": {"perfumeName": 1}},
]
)
How could I get the count of results in my route so I can pass it to my template?
I cannot use results.count() as it is a CommandCursor.
Help please? Thank you!!
Using len() on the list of results would be easier, but if you still want an aggregation query that returns the count and the actual docs at the same time, try $facet or $group:
Query 1 :
{
$facet: {
docs: [ { $match: {} } ], // passes all docs into an array field
count: [ { $count: "count" } ] // counts no.of docs
}
},
/** re-create count field from array of one object to just a number */
{
$addFields: { count: { $arrayElemAt: [ "$count.count", 0 ] } }
}
Test : mongoplayground
Query 2 :
/** Group all docs without any condition & push all docs into an array field & count no.of docs flowing through iteration using `$sum` */
{
$group: { _id: "", docs: { $push: "$$ROOT" }, count: { $sum: 1 } }
}
Test : mongoplayground
Note :
Add one of these queries at the end of your current aggregation pipeline. Remember that if no docs are left after the $match or $unwind stages, the first query returns a document with docs: [] and no count field, while the second query returns just []. Code accordingly.
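A minimal sketch of how the $facet variant could be wired into a Python route. The helper names with_total_count and unpack_facet are hypothetical, and the sample results stand in for what aggregate() would return; no database is contacted here.

```python
def with_total_count(pipeline):
    """Return a copy of the pipeline with a $facet stage appended."""
    return list(pipeline) + [
        {"$facet": {"docs": [{"$match": {}}], "count": [{"$count": "count"}]}}
    ]


def unpack_facet(facet_results):
    """Split the single $facet document into (docs, total)."""
    first = next(iter(facet_results), None)
    if first is None:
        return [], 0
    counts = first.get("count", [])
    # With no input documents the count array is empty, so default to 0.
    total = counts[0]["count"] if counts else 0
    return first.get("docs", []), total


# Stand-ins for the cursor contents, only for illustration.
fake_results = [{"docs": [{"name": "a"}, {"name": "b"}], "count": [{"count": 2}]}]
empty_results = [{"docs": [], "count": []}]

print(unpack_facet(fake_results))
print(unpack_facet(empty_results))
```

In the route this would be used roughly as docs, total = unpack_facet(mongo.db.perfumes.aggregate(with_total_count(stages))), where stages is the existing list of stages. Note that $facet returns everything in one document, so the 16 MB BSON document limit applies to the combined result.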
If you look at the CommandCursor docs, it does not support count().
You can use the length filter in a Jinja template, provided results is converted to a list first (a cursor has no length).
{{ results | length }}
I hope the above helps.
I'm curious about the best approach to count the instances of a particular field, across all documents, in a given ElasticSearch index.
For example, if I've got the following documents in index goober:
{
'_id':'foo',
'field1':'a value',
'field2':'a value'
},
{
'_id':'bar',
'field1':'a value',
'field2':'a value'
},
{
'_id':'baz',
'field1':'a value',
'field3':'a value'
}
I'd like to know something like the following:
{
'index':'goober',
'field_counts': {
'field1':3,
'field2':2,
'field3':1
}
}
Is this doable with a single query? or multiple? For what it's worth, I'm using python elasticsearch and elasticsearch-dsl clients.
I've successfully issued a GET request to /goober and retrieved the mappings, and am learning how to submit requests for aggregations for each field, but I'm interested in learning how many times a particular field appears across all documents.
Coming from using Solr, still getting my bearings with ES. Thanks in advance for any suggestions.
The below will return you the count of docs with "field2":
POST /INDEX/_search
{
"size": 0,
"query": {
"bool": {
"filter": {
"exists": {
"field": "field2"
}
}
}
}
}
And here is an example using multiple aggregates (will return each agg in a bucket with a count), using field exist counts:
POST /INDEX/_search
{
"size": 0,
"aggs": {
"field_has1": {
"filter": {
"exists": {
"field": "field1"
}
}
},
"field_has2": {
"filter": {
"exists": {
"field": "field2"
}
}
}
}
}
The behavior within each agg on the second example will mimic the behavior of the first query. In many cases, you can take a regular search query and nest those lookups within aggregate buckets.
Quick time-saver based on existing answer:
import requests

interesting_fields = ['field1', 'field2']
body = {
'size': 0,
'aggs': {f'has_{field_name}': {
"filter": {
"exists": {
"field": f'export.{field_name}'
}
}
} for field_name in interesting_fields},
}
print(requests.post('http://localhost:9200/INDEX/_search', json=body).json())
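To get from the raw response to the field_counts shape asked for in the question, the doc_count of each filter aggregation can be collected into a dict. This is a sketch: the helper names are hypothetical, and sample_response is a hand-written stand-in that follows the usual shape of a filter aggregation response, not output from a real cluster.

```python
def build_exists_body(fields):
    """Build a search body with one exists-filter aggregation per field."""
    return {
        "size": 0,
        "aggs": {
            f"has_{name}": {"filter": {"exists": {"field": name}}}
            for name in fields
        },
    }


def field_counts(response, fields):
    """Map each field name to the doc_count of its filter aggregation."""
    aggs = response.get("aggregations", {})
    return {name: aggs[f"has_{name}"]["doc_count"] for name in fields}


fields = ["field1", "field2", "field3"]

# Hand-written stand-in for the JSON returned by POST /goober/_search.
sample_response = {
    "aggregations": {
        "has_field1": {"doc_count": 3},
        "has_field2": {"doc_count": 2},
        "has_field3": {"doc_count": 1},
    }
}

print({"index": "goober", "field_counts": field_counts(sample_response, fields)})
```

The field list could be taken from the index mappings, so a single search request covers every mapped field.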
I am using pymongo MongoClient to do multiple fields distinct count.
I found a similar example here: Link
But it doesn't work for me.
For example, given:
data = [{"name": random.choice(all_names),
"value": random.randint(1, 1000)} for i in range(1000)]
collection.insert_many(data)
I want to count how many distinct (name, value) combinations there are. So I followed the link above and wrote this just as a test (I know this solution is not what I want; I am just following the pattern from the link and trying to understand how it works. At least this code should return something):
collection.aggregate([
{
"$group": {
"_id": {
"name": "$name",
"value": "$value",
}
}
},
{
"$group": {
"_id": {
"name": "$_id.name",
},
"count": {"$sum": 1},
},
}
])
But the console gives me this:
on namespace test.$cmd failed: exception: A pipeline stage
specification object must contain exactly one field.
So, what is the right code to do this? Thank you for all your help.
Finally I found a solution: Group by Null
res = col.aggregate([
{
"$group": {
"_id": {
"name": "$name",
"value": "$value",
},
}
},
{
"$group": {"_id": None, "count": {"$sum": 1}}
},
])
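For a quick check of what this pipeline is expected to return, the same number can be computed in memory on a small sample. This is only a sketch: count_distinct_pairs and the sample documents are made up for illustration, and the pipeline is repeated as a plain list so the two can be compared side by side.

```python
# The two-stage pipeline as a plain list: first one group per distinct
# (name, value) pair, then a single group that counts those groups.
pipeline = [
    {"$group": {"_id": {"name": "$name", "value": "$value"}}},
    {"$group": {"_id": None, "count": {"$sum": 1}}},
]


def count_distinct_pairs(documents):
    """In-memory equivalent: number of distinct (name, value) pairs."""
    return len({(d["name"], d["value"]) for d in documents})


# Hypothetical sample data, only for illustration.
sample = [
    {"name": "ann", "value": 1},
    {"name": "ann", "value": 1},
    {"name": "ann", "value": 2},
    {"name": "bob", "value": 1},
]

print(count_distinct_pairs(sample))
```

With pymongo, the aggregate call yields a single document of the form {"_id": None, "count": N}, so N would be read from the first result.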