With a schema like this:
{
  "doc1": {
    "items": [
      { "item_id": 1 },
      { "item_id": 2 },
      { "item_id": 3 }
    ]
  },
  "doc2": {
    "items": [
      { "item_id": 1 },
      { "item_id": 2 },
      { "item_id": 1 }
    ]
  }
}
I want to query for documents whose items array contains a duplicate item, where a duplicate means two items with the same item_id.
For the example above, the query should return only doc2, because it has two items with the same item_id.
Something like this?
qry = {
    "items": {
        "$size": {
            "$ne": {
                "items.unique_count"  # obviously this doesn't exist, not sure how to do it
            }
        }
    }
}
result = MyDocument.find(qry)
One option, similar to the suggestions from @rickhg12hs and you, is to compare the size of the items array with the number of distinct item_id values:
db.collection.aggregate([
  {$match: {
    $expr: {
      $ne: [
        {$size: "$items"},
        {$size: {
          $reduce: {
            input: "$items",
            initialValue: [],
            in: {$setUnion: ["$$value", ["$$this.item_id"]]}
          }
        }}
      ]
    }
  }}
])
See how it works on the playground example
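For readers calling this from PyMongo, the same pipeline can be written as a Python list, and the comparison it performs can be checked in plain Python. This is only a sketch: has_duplicate_item_ids is a hypothetical helper that mirrors the $expr logic on in-memory dicts, and the sample documents are the ones from the question.

```python
# The pipeline from the answer, written as Python data that could be passed
# to PyMongo's collection.aggregate(); it is not executed here.
pipeline = [
    {"$match": {
        "$expr": {
            "$ne": [
                {"$size": "$items"},
                {"$size": {
                    "$reduce": {
                        "input": "$items",
                        "initialValue": [],
                        "in": {"$setUnion": ["$$value", ["$$this.item_id"]]},
                    }
                }},
            ]
        }
    }}
]


def has_duplicate_item_ids(doc):
    """Hypothetical helper: True when two items share an item_id."""
    item_ids = [item["item_id"] for item in doc["items"]]
    # A set keeps one copy of each id, like the $setUnion accumulator.
    return len(item_ids) != len(set(item_ids))


docs = {
    "doc1": {"items": [{"item_id": 1}, {"item_id": 2}, {"item_id": 3}]},
    "doc2": {"items": [{"item_id": 1}, {"item_id": 2}, {"item_id": 1}]},
}
duplicated = [name for name, doc in docs.items() if has_duplicate_item_ids(doc)]
print(duplicated)  # ['doc2']
```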
Pandas nested array columns
I have a pandas column whose values are nested arrays; printing it gives the output shown below.
How can I get the length of the nested parameters array as a new column value?
print(df["arraycolumn"])
Prints:
[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
] }, {
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"C02f6wppb"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"nancy.admin@hyenacapital.net",
"profileId":"100230688039070881323"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"app_name",
"value":"Zapier"
},
{
"name":"client_type",
"value":"WEB"
}
]
You can use .apply to apply an arbitrary Python function to each element of a Series:
def get_parameters(obj):
    # Each cell is a list of events; take the parameters of the first one.
    return obj[0]["parameters"]

df["arraycolumn_parameters_length"] = (
    df["arraycolumn"]
    .apply(lambda y: len(get_parameters(y)))
)
Of course, you might have to add error checking or other additional logic to the get_parameters function above, as needed.
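As one possible form of that error checking, here is a minimal sketch of a more defensive get_parameters. It assumes that an empty list, a missing value, or an event without a "parameters" key should all count as zero parameters; that policy is an assumption, not something the question specifies.

```python
def get_parameters(obj):
    """Return the parameters list of the first event, or [] if it is absent.

    Assumes obj is a list of event dicts, as printed in the question.
    """
    if not isinstance(obj, list) or not obj:
        return []
    first = obj[0]
    if not isinstance(first, dict):
        return []
    parameters = first.get("parameters")
    return parameters if isinstance(parameters, list) else []


# Illustrative cells, not taken from the real data.
cells = [
    [{"type": "auth", "parameters": [{"name": "api_name", "value": "admin"}]}],
    [],                  # no events at all
    None,                # missing value
    [{"type": "auth"}],  # event without a parameters key
]
lengths = [len(get_parameters(cell)) for cell in cells]
print(lengths)  # [1, 0, 0, 0]
```

The function can be passed to .apply in the same way as the original one.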
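How about using the .str accessor? Each cell is a list holding one dict, so take element 0 first and then look up the key:
df['arraycolumn'].str[0].str['parameters'].str.len()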
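A small self-contained check of the .str approach, assuming each cell is a list whose first element is the event dict (as the .apply answer also assumes). The two sample rows are made up for illustration.

```python
import pandas as pd

# Two rows shaped like the printed column: a list holding one event dict.
df = pd.DataFrame({
    "arraycolumn": [
        [{"type": "auth", "parameters": [{"name": "api_name"}, {"name": "method_name"}]}],
        [{"type": "auth", "parameters": [{"name": "api_name"}]}],
    ]
})

# .str[0] picks the first list element, .str["parameters"] looks up the dict
# key, and .str.len() measures the resulting list.
via_str = df["arraycolumn"].str[0].str["parameters"].str.len()

# The same result with .apply, for comparison.
via_apply = df["arraycolumn"].apply(lambda events: len(events[0]["parameters"]))

df["arraycolumn_parameters_length"] = via_str
print(df["arraycolumn_parameters_length"].tolist())
```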
I am converting a dictionary to a DataFrame in Python and the output has changed. Did I write anything wrong here?
In the dictionary output source_id appears once, but after converting to a DataFrame it appears 4 times. I want parsed_address to stay a single multi-valued field, so that the DataFrame keeps the same shape as the dictionary.
Dictionary Output:
dic--->
{
"source_id":123,
"parsed_address":[
{
"address_type":"Primary",
"address":[
{
"address_line_1":"18 Atherton",
"address_line_2":"nan"
}
]
},
{
"address_type":"directory",
"address":[
{
"address_line_1":"130 E Chapman Ave Apt 312",
"address_line_2":"nan"
}
]
},
{
"address_type":"home",
"address":[
{
"address_line_1":"9722 Walnut St",
"address_line_2":"nan"
}
]
},
{
"address_type":"work",
"address":[
{
"address_line_1":"1325 S Grand Ave",
"address_line_2":"nan"
}
]
}
]
}
DataFrame Output:
[
{
"parsed_address":{
"address_type":"Primary",
"address":[
{
"address_line_1":"18 Atherton",
"address_line_2":null
}
]
},
"source_id":"90353002"
},
{
"parsed_address":{
"address_type":"directory",
"address":[
{
"address_line_1":"130 E Chapman Ave Apt 312",
"address_line_2":null
}
]
},
"source_id":"90353002"
},
{
"parsed_address":{
"address_type":"home",
"address":[
{
"address_line_1":"9722 Walnut St",
"address_line_2":null
}
]
},
"source_id":"90353002"
},
{
"parsed_address":{
"address_type":"work",
"address":[
{
"address_line_1":"1325 S Grand Ave",
"address_line_2":null
}
]
},
"source_id":"90353002"
}
]
Code:
print("dic--->",dic)
df4 = pd.DataFrame(dic)
print("df4--->",df4)
Converted dictionary to dataframe and output has changed.
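This happens because pandas recycles (broadcasts) scalar values: parsed_address is a list of four items, so the constructor builds four rows and repeats the scalar source_id on each of them.
This behavior is also why you can assign a scalar to a new column and the number of rows stays the same.
To get a single row, wrap the parsed_address list in another list.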
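A minimal sketch of the difference, using a shortened version of the dictionary from the question (two addresses instead of four, fewer fields). The names df_repeated and df_single are only for illustration.

```python
import pandas as pd

dic = {
    "source_id": 123,
    "parsed_address": [
        {"address_type": "Primary", "address": [{"address_line_1": "18 Atherton"}]},
        {"address_type": "directory", "address": [{"address_line_1": "130 E Chapman Ave Apt 312"}]},
    ],
}

# Passing the dict directly: the 2-item list sets the row count and the
# scalar source_id is repeated on every row.
df_repeated = pd.DataFrame(dic)

# Wrapping the list in another list makes the whole list a single cell value.
df_single = pd.DataFrame({**dic, "parsed_address": [dic["parsed_address"]]})

print(len(df_repeated), len(df_single))  # 2 1
```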
I have a JSON document stored in MongoDB with a structure like this:
[{ "SessionKey": "172e3b6b-509e-4ef3-950c-0c1dc5c83bab",
"Query": {"Date": "2020-03-04"},
"Flights": [
{"LegId":"13235",
"PricingOptions": [
{"Agents": [1963108],
"Price": 61763.64 },
{"Agents": [4035868],
"Price": 62395.83 }]},
{"LegId": "13236",
"PricingOptions": [{
"Agents": [2915951],
"Price": 37188.0}]}
...
The result I'm trying to get is a mapping of "LegId" to the summed price per flight, in this case {'13235': (61763.64+62395.83), '13236': 37188.0}, and then to keep only the flights with price < N.
I've tried to run this pipeline for the aggregation step, but it returns a list of ALL prices (I don't know how to sum them up properly):
result = collection.aggregate([
    {'$match': {'Query.Date': '2020-03-01'}},
    {'$group': {'_id': {'Flight': '$Flights.LegId', 'Price': '$Flights.PricingOptions.Price'}}},
])
Also I've tried this pipeline, but it returns 0 for 'total_price_per_flight':
result = collection.aggregate({'$project': {
    'Flights.LegId': 1,
    'total_price_per_flight': {'$sum': '$Flights.PricingOptions.Price'}
}})
You need to use $unwind to flatten the Flights array so that each flight can be processed individually.
With the $reduce operator, we iterate over the PricingOptions array and accumulate the Price fields into a sum.
The last step restores your documents to their original structure. Before that step, you may apply the "get flights with price < N" filter.
db.collection.aggregate([
  {
    "$match": {
      "Query.Date": "2020-03-04"
    }
  },
  {
    $unwind: "$Flights"
  },
  {
    $addFields: {
      "Flights.LegId": {
        $arrayToObject: [
          [
            {
              k: "$Flights.LegId",
              v: {
                $reduce: {
                  input: "$Flights.PricingOptions",
                  initialValue: 0,
                  in: {
                    $add: [
                      "$$value",
                      "$$this.Price"
                    ]
                  }
                }
              }
            }
          ]
        ]
      }
    }
  },
  {
    $group: {
      _id: "$_id",
      SessionKey: {
        $first: "$SessionKey"
      },
      Query: {
        $first: "$Query"
      },
      Flights: {
        $push: "$Flights"
      }
    }
  }
])
MongoPlayground
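To make the expected values concrete, the sketch below computes the same per-flight totals in plain Python on the sample document from the question. It does not run against MongoDB; total_per_flight and flights_cheaper_than are hypothetical helper names, and the max_price value stands in for the N of the question and is made up.

```python
document = {
    "SessionKey": "172e3b6b-509e-4ef3-950c-0c1dc5c83bab",
    "Query": {"Date": "2020-03-04"},
    "Flights": [
        {"LegId": "13235", "PricingOptions": [
            {"Agents": [1963108], "Price": 61763.64},
            {"Agents": [4035868], "Price": 62395.83},
        ]},
        {"LegId": "13236", "PricingOptions": [
            {"Agents": [2915951], "Price": 37188.0},
        ]},
    ],
}


def total_per_flight(doc):
    """Map each LegId to the sum of its PricingOptions prices."""
    return {
        flight["LegId"]: sum(option["Price"] for option in flight["PricingOptions"])
        for flight in doc["Flights"]
    }


def flights_cheaper_than(doc, max_price):
    """Keep only the flights whose summed price is below max_price."""
    totals = total_per_flight(doc)
    return {leg: total for leg, total in totals.items() if total < max_price}


totals = total_per_flight(document)
cheap = flights_cheaper_than(document, max_price=100000)
print(sorted(cheap))  # ['13236']
```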
I'm trying to write a PyMongo aggregation that groups documents and averages their arrays element by element, and I cannot find any examples that match my problem.
Data example
{
Subject: "Dave",
Strength: [1,2,3,4]
},
{
Subject: "Dave",
Strength: [1,2,3,5]
},
{
Subject: "Dave",
Strength: [1,2,3,6]
},
{
Subject: "Stuart",
Strength: [4,5,6,7]
},
{
Subject: "Stuart",
Strength: [6,5,6,7]
},
{
Subject: "Kevin",
Strength: [1,2,3,4]
},
{
Subject: "Kevin",
Strength: [9,4,3,4]
}
Wanted results
{
Subject: "Dave",
mean_strength = [1,2,3,5]
},
{
Subject: "Stuart",
mean_strength = [5,5,6,7]
},
{
Subject: "Kevin",
mean_strength = [5,3,3,4]
}
I have tried this approach but MongoDB is interpreting the arrays as Null?
pipe = [{'$group': {'_id': 'Subject', 'mean_strength': {'$avg': '$Strength'}}}]
results = db.Walk.aggregate(pipeline=pipe)
Out: [{'_id': 'SubjectID', 'total': None}]
I've looked through the MongoDB documentation and I cannot find or understand if there is any way to do this?
You could use $unwind with includeArrayIndex. As the name suggests, includeArrayIndex adds the array index to the output. This allows for grouping by Subject and array position in Strength. After calculating the average, the results need to be sorted to ensure the second $group and $push add the results back into the right order. Finally there is a $project to include and rename the relevant columns.
db.test.aggregate([
  {
    "$unwind": {
      "path": "$Strength",
      "includeArrayIndex": "rownum"
    }
  },
  {
    "$group": {
      "_id": {
        "Subject": "$Subject",
        "rownum": "$rownum"
      },
      "mean_strength": {
        "$avg": "$Strength"
      }
    }
  },
  {
    "$sort": {
      "_id.Subject": 1,
      "_id.rownum": 1
    }
  },
  {
    "$group": {
      "_id": "$_id.Subject",
      "mean_strength": {
        "$push": "$mean_strength"
      }
    }
  },
  {
    "$project": {
      "_id": 0,
      "Subject": "$_id",
      "mean_strength": 1
    }
  }
])
For your test input, this returns:
{ "mean_strength" : [ 5, 5, 6, 7 ], "Subject" : "Stuart" }
{ "mean_strength" : [ 5, 3, 3, 4 ], "Subject" : "Kevin" }
{ "mean_strength" : [ 1, 2, 3, 5 ], "Subject" : "Dave" }
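The same sequence of steps can be traced in plain Python on the question's data. This is only an illustration of what each stage does, not PyMongo code; the variable names are made up.

```python
from collections import defaultdict

rows = [
    {"Subject": "Dave", "Strength": [1, 2, 3, 4]},
    {"Subject": "Dave", "Strength": [1, 2, 3, 5]},
    {"Subject": "Dave", "Strength": [1, 2, 3, 6]},
    {"Subject": "Stuart", "Strength": [4, 5, 6, 7]},
    {"Subject": "Stuart", "Strength": [6, 5, 6, 7]},
    {"Subject": "Kevin", "Strength": [1, 2, 3, 4]},
    {"Subject": "Kevin", "Strength": [9, 4, 3, 4]},
]

# $unwind with includeArrayIndex: one record per (Subject, index, value).
unwound = [
    (row["Subject"], index, value)
    for row in rows
    for index, value in enumerate(row["Strength"])
]

# First $group: collect the values seen for each (Subject, index) pair.
values_by_position = defaultdict(list)
for subject, index, value in unwound:
    values_by_position[(subject, index)].append(value)

# $sort and second $group: average each position and push the averages
# back in index order.
mean_strength = defaultdict(list)
for (subject, index), values in sorted(values_by_position.items()):
    mean_strength[subject].append(sum(values) / len(values))

print(dict(mean_strength))
```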
You can try the aggregation below.
For example, Dave has [[1,2,3,4], [1,2,3,5], [1,2,3,6]] after the group stage.
Here is how the matrix is built up.
Reduce function:
Pass     Current Value (c)   Accumulated Value (b)        Next Value
First    [1,2,3,5]           [[1],[2],[3],[4]]            [[1,1],[2,2],[3,3],[5,4]]
Second   [1,2,3,6]           [[1,1],[2,2],[3,3],[5,4]]    [[1,1,1],[2,2,2],[3,3,3],[6,5,4]]
Map function: calculates the average of each inner array from the reduce stage, giving [1,2,3,5].
[{"$group":{"_id":"$Subject","Strength":{"$push":"$Strength"}}}, //Push all arrays
{"$project":{"mean_strength":{
"$map":{//Calculate avg for each reduced indexed pairs.
"input":{
"$reduce":{
"input":{"$slice":["$Strength",1,{"$subtract":[{"$size":"$Strength"},1]}]}, //Start from second array.
"initialValue":{ //Initialize to the first array with all elements transformed to array of single values.
"$map":{
"input":{"$range":[0,{"$size":{"$arrayElemAt":["$Strength",0]}}]},
"as":"a",
"in":[{"$arrayElemAt":[{"$arrayElemAt":["$Strength",0]},"$$a"]}]
}
},
"in":{
"$let":{"vars":{"c":"$$this","b":"$$value"}, //Create variables for current and accumulated values
"in":{"$map":{ //Creates map of same indexed values from each iteration
"input":{"$range":[0,{"$size":"$$b"}]},
"as":"d",
"in":{
"$concatArrays":[ //Concat values at same index
[{"$arrayElemAt":["$$c","$$d"]}], //current value, wrapped in an array for $concatArrays
{"$arrayElemAt":["$$b","$$d"]} //accumulated values, already an array
]
}
}
}
}
}
}
},
"as":"e",
"in":{"$avg":"$$e"}
}
}}}
]
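The reduce and map steps can be traced in plain Python with functools.reduce, using Dave's arrays from the example. This is a sketch of the logic only; combine is a hypothetical helper name, and the order of values inside each inner list does not affect the average.

```python
from functools import reduce

# Dave's arrays after the $group stage.
strength = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 6]]

# initialValue: the first array with every element wrapped in its own list.
initial = [[value] for value in strength[0]]


def combine(accumulated, current):
    """Prepend current[i] to accumulated[i] for every index i."""
    return [[current[i]] + accumulated[i] for i in range(len(accumulated))]


# $reduce over the remaining arrays (the $slice starts from the second one).
per_index = reduce(combine, strength[1:], initial)

# $map with $avg over each per-index list.
mean_strength = [sum(values) / len(values) for values in per_index]

print(per_index)      # [[1, 1, 1], [2, 2, 2], [3, 3, 3], [6, 5, 4]]
print(mean_strength)  # [1.0, 2.0, 3.0, 5.0]
```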
Based on the description in the question, try the following aggregate query:
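db.collection.aggregate(
  // Pipeline
  [
    // Stage 1: emit one document per array element, keeping its index
    {
      $unwind: { path: "$Strength", includeArrayIndex: "arrayIndex" }
    },
    // Stage 2: average the values found at each index for each subject
    {
      $group: {
        _id: { Subject: '$Subject', arrayIndex: '$arrayIndex' },
        mean_strength: { $avg: '$Strength' }
      }
    },
    // Stage 3: sort by index so that $push rebuilds the array in order
    {
      $sort: { '_id.Subject': 1, '_id.arrayIndex': 1 }
    },
    // Stage 4
    {
      $group: {
        _id: { Subject: '$_id.Subject' },
        mean_strength: { $push: '$mean_strength' }
      }
    },
    // Stage 5
    {
      $project: {
        Subject: '$_id.Subject',
        mean_strength: '$mean_strength',
        _id: 0
      }
    }
  ]
);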