I want to get the total number of records returned by an aggregation cursor in pymongo 3.0+. Is there any way to get the total count without iterating over the cursor?
cursor = db.collection.aggregate([
    {"$match": options},
    {"$group": {"_id": groupby, "count": {"$sum": 1}}}
])
cursorlist = [c for c in cursor]
print(len(cursorlist))
Is there any way to skip the above iteration?
You could add another $group stage with an _id of None, which accumulates values over all the input documents as a whole. That gives you the total count as well as the original grouped counts, albeit pushed into a single array:
>>> pipeline = [
... {"$match": options},
... {"$group": {"_id": groupby, "count": {"$sum":1}}},
... {"$group": {"_id": None, "total": {"$sum": 1}, "details":{"$push":{"groupby": "$_id", "count": "$count"}}}}
... ]
>>> list(db.collection.aggregate(pipeline))
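The result then comes back as (at most) one document, so the total can be read off without walking the grouped results; a minimal sketch building on the pipeline above:
>>> result = list(db.collection.aggregate(pipeline))
>>> total = result[0]["total"] if result else 0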
Input : Multiple csv with the same columns (800 million rows) [Time Stamp, User ID, Col1, Col2, Col3]
Memory available : 60GB of RAM and 24 core CPU
(Input/output example omitted.)
Problem : I want to group by User ID, sort by Time Stamp, and take the unique values of Col1, dropping duplicates while retaining the order based on the Time Stamp.
Solutions Tried :
Tried using joblib to load the CSVs in parallel and pandas to sort and write to CSV (I get an error at the sorting step)
Used dask (new to Dask):
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

# Cannot use the full 60 gigs as there are others on the server
cluster = LocalCluster(dashboard_address=f':{port}', n_workers=4, threads_per_worker=4, memory_limit='7GB')
client = Client(cluster)
ddf = dd.read_csv("/path/*.csv")
ddf = ddf.set_index("Time Stamp")
ddf.to_csv("/outdir/")
Questions :
Assuming dask will use disk to sort and write the multipart output, will it preserve the order after I read the output using read_csv?
How do I achieve the second part of the problem in dask? In pandas, I'd just apply the function below and gather the results in a new dataframe (a rough pandas version is sketched after the function).
def getUnique(user_group):  ## assuming the rows for each user are sorted by timestamp
    res = list()
    for val in user_group["Col1"]:
        if val not in res:
            res.append(val)
    return res
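For reference, this is roughly the pandas version I have in mind (it assumes a single DataFrame df with the columns above, which of course does not fit in memory at 800 million rows, hence the question):
import pandas as pd

# df is assumed to hold the concatenated data with columns
# ["Time Stamp", "User ID", "Col1", "Col2", "Col3"]
result = (
    df.sort_values("Time Stamp")
      .groupby("User ID")["Col1"]
      .apply(lambda s: list(dict.fromkeys(s)))  # dedupe while keeping first-seen (timestamp) order
)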
Please point me to a better alternative to dask if there is one.
So, I think I would approach this with two passes. In the first pass, I would run through all the csv files and build a data structure keyed by user_id and col1 that holds the "best" timestamp. In this case, "best" will be the lowest.
Note: the use of dictionaries here only serves to clarify what we are attempting to do and if performance or memory was an issue, I would first look to reimplement without them where possible.
So, starting with CSV data like:
[
{"user_id": 1, "col1": "a", "timestamp": 1},
{"user_id": 1, "col1": "a", "timestamp": 2},
{"user_id": 1, "col1": "b", "timestamp": 4},
{"user_id": 1, "col1": "c", "timestamp": 3},
]
After processing all the csv files I hope to have an interim representation of:
{
1: {'a': 1, 'b': 4, 'c': 3}
}
Note that a representation like this could be created in parallel for each CSV and then re-distilled into a final interim representation via a pass 1.5 if you wanted to do that.
Now we can create a final representation based on the keys of this nested structure, sorted by the innermost value. Giving us:
[
{'user_id': 1, 'col1': ['a', 'c', 'b']}
]
Here is how I might first approach this task before tweaking things for performance.
import csv

all_csv_files = [
    "some.csv",
    "bunch.csv",
    "of.csv",
    "files.csv",
]

data = {}
for csv_file in all_csv_files:
    #with open(csv_file, "r") as file_in:
    #    rows = csv.DictReader(file_in)

    ## ----------------------------
    ## demo data
    ## ----------------------------
    rows = [
        {"user_id": 1, "col1": "a", "timestamp": 1},
        {"user_id": 1, "col1": "a", "timestamp": 2},
        {"user_id": 1, "col1": "b", "timestamp": 4},
        {"user_id": 1, "col1": "c", "timestamp": 3},
    ]
    ## ----------------------------

    ## ----------------------------
    ## First pass to determine the "best" timestamp
    ## for a user_id/col1
    ## ----------------------------
    for row in rows:
        user_id = row['user_id']
        col1 = row['col1']
        ts_new = row['timestamp']

        ts_old = (
            data
            .setdefault(user_id, {})
            .setdefault(col1, ts_new)
        )

        if ts_new < ts_old:
            data[user_id][col1] = ts_new
    ## ----------------------------

print(data)

## ----------------------------
## second pass to set order of col1 for a given user_id
## ----------------------------
data_out = [
    {
        "user_id": outer_key,
        "col1": [
            inner_kvp[0]
            for inner_kvp
            in sorted(outer_value.items(), key=lambda v: v[1])
        ]
    }
    for outer_key, outer_value
    in data.items()
]
## ----------------------------

print(data_out)
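If you did want the "pass 1.5" mentioned above (one interim dict per CSV, built in parallel, then re-distilled), a minimal merge sketch might look like this (merge_interims is a hypothetical helper, not part of the code above):
def merge_interims(interims):
    ## combine per-CSV {user_id: {col1: best_ts}} dicts, keeping the lowest timestamp
    merged = {}
    for interim in interims:
        for user_id, col_map in interim.items():
            target = merged.setdefault(user_id, {})
            for col1, ts in col_map.items():
                if col1 not in target or ts < target[col1]:
                    target[col1] = ts
    return merged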
I have a json like the one below
{
"headers": [
"product_id",
"price",
"is_discontinued"
],
"product_stores": [
[
2093085822,
23.58,
false
],
[
2093085837,
16.1,
true
],
...
This is a column in a dataframe that also contains other columns such as ID, created_at, etc. I would like to have all the product_ids from the column; it could be a string like "2093085822,2093085837", but only if "is_discontinued" is true.
Similar to Rafael's answer, although you can also use named variables in the comprehension so it remains readable:
output = [pid for pid, price, discontinued in json_dict["product_stores"] if discontinued]
Keep in mind that all sub-lists need to be of length 3 for this to work.
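If you then want the comma-separated string mentioned in the question, you can join the ids from the output list above (a small follow-up sketch):
id_string = ",".join(str(pid) for pid in output)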
The solution is pretty simple. Just use a list comprehension:
json_dict = # Load the json here
what_you_want = [item[0] for item in json_dict["product_stores"] if item[2]]
totalHotelsInTown = hotels.aggregate([
    {"$group": {"_id": "$Town", "TotalRestaurantInTown": {"$sum": 1}}}
])

NumOfHotelsInTown = {}
for item in totalHotelsInTown:
    NumOfHotelsInTown[item['_id']] = item['TotalRestaurantInTown']

results = hotels.aggregate([
    {"$match": {"cuisine": cuisine}},
    {"$group": {"_id": "$town", "HotelsCount": {"$sum": 1}}},
    {"$project": {"HotelsCount": 1,
                  "Percent": {"$multiply": [{"$divide": ["$HotelsCount", NumOfHotelsInTown["$_id"]]}, 100]}}},
    {"$sort": {"Percent": 1}},
    {"$limit": 1}
])
I want to pass the value of the "_id" field as a key to a Python dictionary, but the interpreter is taking "$_id" itself as the key instead of its value and raising a KeyError because of that. Any help would be much appreciated. Thanks!
The 'NumOfHotelsInTown' dictionary has key-value pairs of place and number of hotels. When I try to retrieve the value from the NumOfHotelsInTown dictionary, I am giving the key dynamically with "$_id".
The exact error I am getting is:
{"$group": {"_id": "$borough", "HotelsCount": {"$sum": 1} }}, {"$project": {"HotelsCount":1,"Percent": {"$multiply": [{"$divide": ["$HotelsCount", NumOfHotlesInTown["$_id"]]}, 100]}}}, {"$sort": {"Percent": 1}},
KeyError: '$_id'
I see what you're trying to do, but you can't dynamically run Python code during a MongoDB aggregate.
What you should do instead:
Get the total counts for every borough (which you have already done)
Get the total counts for every borough for a given cuisine (which you have part of)
Use Python, not MongoDB, to compare the two totals and produce the percentages
For example:
group_by_borough = {"$group": {"_id": "$borough", "TotalRestaurantInBorough": {"$sum": 1}}}

count_of_restaurants_by_borough = my_collection.aggregate([group_by_borough])
restaurant_count_by_borough = {doc["_id"]: doc["TotalRestaurantInBorough"] for doc in count_of_restaurants_by_borough}

count_of_cuisines_by_borough = my_collection.aggregate([{"$match": {"cuisine": cuisine}}, group_by_borough])
cuisine_count_by_borough = {doc["_id"]: doc["TotalRestaurantInBorough"] for doc in count_of_cuisines_by_borough}

percentages = {}
for borough, count in restaurant_count_by_borough.items():
    percentages[borough] = cuisine_count_by_borough.get(borough, 0) / float(count) * 100

# And if you wanted it sorted you can use an OrderedDict
from collections import OrderedDict
percentages = OrderedDict(sorted(percentages.items(), key=lambda x: x[1]))
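If you only need the single lowest percentage (what the $sort plus $limit in the original pipeline was after), a small follow-up sketch on the percentages dict above:
lowest_borough, lowest_percent = min(percentages.items(), key=lambda x: x[1])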
With RethinkDB, how do I update arrays in nested objects so that certain values are filtered out?
Consider the following program, I would like to know how to write an update query that filters out the value 2 from arrays contained in votes sub objects in documents from the 'dinners' table:
import rethinkdb as r
from pprint import pprint

with r.connect(db='mydb') as conn:
    pprint(r.table('dinners').get('xxx').run(conn))

    r.table('dinners').insert({
        'id': 'xxx',
        'votes': {
            '1': [1, 2, ],
        },
    }, conflict='replace').run(conn)

    # How can I update the 'xxx' document so that the value 2 is
    # filtered out from all arrays contained in the 'votes' sub object?
You can use the usual filter method together with object coercion:
def update_dinner(dinner):
    return {
        'votes': dinner['votes']
            .keys()
            .map(lambda key: [
                key,
                dinner['votes'][key].filter(lambda vote_val: vote_val.ne(2)),
            ])
            .coerce_to('object')
    }

r.table('dinners').update(update_dinner).run(conn)
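Assuming the sample document from the question, the updated 'xxx' document should come back looking roughly like this (the value 2 removed from each votes array):
{'id': 'xxx', 'votes': {'1': [1]}}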
I have a MongoDB query as follows :
data = db.collection.aggregate([
    {"$match": {"created_at": {"$gte": start, "$lt": end}}},
    {"$group": {"_id": "$stage", "count": {"$sum": 1}}},
    {"$match": {"count": {"$gt": m  # listing cut off here in the original question
Which results in the following output:
{u'count': 296, u'_id': u'10.57.72.93'}
{u'count': 230, u'_id': u'111.11.111.111'}
{u'count': 2240, u'_id': u'111.11.11.11'}
I am trying to sort the output by the 'count' column:
data.sort('count', pymongo.DESCENDING)
...but I am getting the following error:
'CommandCursor' object has no attribute 'sort'
Can anyone explain the reason for this error?
Use $sort, as shown in the aggregation example in the docs:
from bson.son import SON
data = db.collection.aggregate([
{"$match":{"created_at":{"$gte":start,"$lt":end}}},
{"$group":{"_id":"$stage","count":{"$sum":1}}},
{"$match":{"count": ... }},
{"$sort": SON([("count", -1)])} # <---
])
An alternative, more general solution: use sorted() with a custom key function:
data = db.collection.aggregate(...)
data = sorted(data, key=lambda x: x['count'], reverse=True)
You may want to use $sort. Check the MongoDB docs as well.
Aggregation pipelines have a $sort pipeline stage:
data = db.collection.aggregate([
{ "$match":{"created_at":{"$gte":start,"$lt":end} }},
{ "$group":{ "_id":"$stage","count":{"$sum":1} }},
{ "$match":{
"count":{ "$gt": 296 } # query trimmed because your question listing is incomplete
}},
{ "$sort": { "count": -1 } } # Actual sort stage
])
The other .sort() method is for a "cursor" which is different from what the aggregation pipeline does.
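For contrast, a regular find() query does return a cursor that supports .sort(); a minimal sketch, assuming the same collection and fields as above:
import pymongo

cursor = db.collection.find({"created_at": {"$gte": start, "$lt": end}})
for doc in cursor.sort("created_at", pymongo.DESCENDING):
    print(doc)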