Find the longest group after groupby on normalized json in pandas - python

My code below groups by profileId and builds, for each group, a list of the lengths of the parameters arrays. But how can I instead return the id whose list has the largest sum?
Original JSON read into the df (not the same data as in the printed output below; the real payload was too long):
{
  "kind": "admin#reports#activities",
  "etag": "\"5g8\"",
  "nextPageToken": "A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
  "items": [
    {
      "kind": "admin#reports#activity",
      "id": {
        "time": "2022-05-05T23:59:39.421Z",
        "uniqueQualifier": "5526793068617678141",
        "applicationName": "token",
        "customerId": "cds"
      },
      "etag": "\"jkYcURYoi8\"",
      "actor": {
        "email": "blah@blah.net",
        "profileId": "1323"
      },
      "ipAddress": "107.178.193.87",
      "events": [
        {
          "type": "auth",
          "name": "activity",
          "parameters": [
            { "name": "api_name", "value": "admin" },
            { "name": "method_name", "value": "directory.users.list" },
            { "name": "client_id", "value": "722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com" },
            { "name": "num_response_bytes", "intValue": "7158" },
            { "name": "product_bucket", "value": "GSUITE_ADMIN" },
            { "name": "app_name", "value": "Untitled project" },
            { "name": "client_type", "value": "WEB" }
          ]
        }
      ]
    },
    {
      "kind": "admin#reports#activity",
      "id": {
        "time": "2022-05-05T23:58:48.914Z",
        "uniqueQualifier": "-4002873813067783265",
        "applicationName": "token",
        "customerId": "df"
      },
      "etag": "\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
      "actor": {
        "email": "blah.blah@bebe.net",
        "profileId": "1324"
      },
      "ipAddress": "54.80.168.30",
      "events": [
        {
          "type": "auth",
          "name": "activity",
          "parameters": [
            { "name": "api_name", "value": "gmail" },
            { "name": "method_name", "value": "gmail.users.messages.list" },
            { "name": "client_id", "value": "927538837578.apps.googleusercontent.com" },
            { "name": "num_response_bytes", "intValue": "2" },
            { "name": "product_bucket", "value": "GMAIL" },
            { "name": "client_type", "value": "WEB" }
          ]
        }
      ]
    }
  ]
}
Current code:
df = pd.json_normalize(response['items'])
df['test'] = df.groupby('actor.profileId')['events'].apply(
    lambda x: [len(x.iloc[i][0]['parameters']) for i in range(len(x))]
)
Output:
ID
1002306    [7, 7, 7, 5]
1234444    [3, 5, 6]
1222222    [1, 3, 4, 5]
Desired output:
id         total
1002306    26

There’s no need to construct the intermediate df and do a groupby on it. You can pass the record and meta paths to json_normalize to flatten the JSON data directly. Your task then reduces to counting the number of rows per actor.profileId and finding the maximum.
# Flatten to one row per parameter, carrying the 'actor' dict along as metadata
df = pd.json_normalize(response['items'], ['events', 'parameters'], ['actor'])
# Pull the profileId out of the 'actor' dict
df['actor.profileId'] = df['actor'].str['profileId']
# The row count per profileId is the total number of parameters; keep the largest
out = df.value_counts('actor.profileId').pipe(lambda x: x.iloc[[0]])
Output:
actor.profileId
1323 7
dtype: int64
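If you just want the winning id as a scalar rather than a one-row Series, an equivalent sketch (assuming the same flattened df as above) is:
top_id = df.groupby('actor.profileId').size().idxmax()  # profileId with the most parameter rows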

Related

Select mongo documents whose subdoc array has duplicate field?

With a schema like this:
{
  "doc1": {
    "items": [
      { "item_id": 1 },
      { "item_id": 2 },
      { "item_id": 3 }
    ]
  },
  "doc2": {
    "items": [
      { "item_id": 1 },
      { "item_id": 2 },
      { "item_id": 1 }
    ]
  }
}
I want to query for documents that contain a duplicate item in their items array field. A duplicate means items with the same item_id field.
So the result for the example above should return doc2 only, because it has two items with the same item_id.
Something like this?
qry = {
    "items": {
        "$size": {
            "$ne": {
                "items.unique_count"  # obviously this doesn't exist, not sure how to do it
            }
        }
    }
}
result = MyDocument.find(qry)
One option, similar to @rickhg12hs's suggestion and your own, is:
db.collection.aggregate([
  {
    $match: {
      $expr: {
        $ne: [
          { $size: "$items" },
          {
            $size: {
              $reduce: {
                input: "$items",
                initialValue: [],
                in: { $setUnion: ["$$value", ["$$this.item_id"]] }
              }
            }
          }
        ]
      }
    }
  }
])
See how it works on the playground example
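For reference, a sketch of the same pipeline from PyMongo (the client, database, and collection names are placeholders):
from pymongo import MongoClient

client = MongoClient()  # assumes a reachable MongoDB instance
coll = client["mydb"]["mycoll"]  # placeholder names

# Keep documents where the number of items differs from the number of
# distinct item_id values, i.e. where at least one duplicate exists.
pipeline = [
    {"$match": {"$expr": {"$ne": [
        {"$size": "$items"},
        {"$size": {"$reduce": {
            "input": "$items",
            "initialValue": [],
            "in": {"$setUnion": ["$$value", ["$$this.item_id"]]},
        }}},
    ]}}},
]
docs_with_duplicates = list(coll.aggregate(pipeline))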

Pandas Nested Array Columns

I have a pandas column whose values are nested arrays; printing it gives the output below. How can I get the length of the nested parameters array as a new column value?
print(df["arraycolumn"])
Prints:
[
  {
    "type": "auth",
    "name": "activity",
    "parameters": [
      { "name": "api_name", "value": "admin" },
      { "name": "method_name", "value": "directory.users.list" },
      { "name": "client_id", "value": "722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com" },
      { "name": "num_response_bytes", "intValue": "7158" },
      { "name": "product_bucket", "value": "GSUITE_ADMIN" },
      { "name": "app_name", "value": "Untitled project" },
      { "name": "client_type", "value": "WEB" }
    ]
  }
] }, {
  "kind": "admin#reports#activity",
  "id": {
    "time": "2022-05-05T23:58:48.914Z",
    "uniqueQualifier": "-4002873813067783265",
    "applicationName": "token",
    "customerId": "C02f6wppb"
  },
  "etag": "\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
  "actor": {
    "email": "nancy.admin@hyenacapital.net",
    "profileId": "100230688039070881323"
  },
  "ipAddress": "54.80.168.30",
  "events": [
    {
      "type": "auth",
      "name": "activity",
      "parameters": [
        { "name": "api_name", "value": "gmail" },
        { "name": "method_name", "value": "gmail.users.messages.list" },
        { "name": "client_id", "value": "927538837578.apps.googleusercontent.com" },
        { "name": "num_response_bytes", "intValue": "2" },
        { "name": "product_bucket", "value": "GMAIL" },
        { "name": "app_name", "value": "Zapier" },
        { "name": "client_type", "value": "WEB" }
      ]
You can use .apply to apply an arbitrary Python function to each element of a Series:
def get_parameters(obj):
    # Each cell holds a one-element list of event dicts; return its parameters
    return obj[0]["parameters"]

df["arraycolumn_parameters_length"] = (
    df["arraycolumn"]
    .apply(lambda y: len(get_parameters(y)))
)
Of course, you might have to add error checking or other additional logic to the get_parameters function above, as needed.
How about using .str?
df['arraycolumn'].str[0].str['parameters'].str.len()  # first (only) event, then its parameters list
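A quick self-contained check of the .str chain (toy data shaped like the column above, values abbreviated):
import pandas as pd

df = pd.DataFrame({"arraycolumn": [
    [{"type": "auth", "name": "activity",
      "parameters": [{"name": "api_name", "value": "admin"}] * 7}],
    [{"type": "auth", "name": "activity",
      "parameters": [{"name": "api_name", "value": "gmail"}] * 6}],
]})

print(df["arraycolumn"].str[0].str["parameters"].str.len())  # 7, 6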

I am converting dictionary to dataframe in python and output has changed

I am converting a dictionary to a DataFrame in Python and the output has changed. Did I write anything wrong here?
In the dictionary output, source_id appears once, but after converting to a DataFrame it appears 4 times. I want parsed_address to stay multivalued, i.e. the DataFrame output should match the dictionary output.
Dictionary Output:
dic--->
{
  "source_id": 123,
  "parsed_address": [
    { "address_type": "Primary",   "address": [{ "address_line_1": "18 Atherton",               "address_line_2": "nan" }] },
    { "address_type": "directory", "address": [{ "address_line_1": "130 E Chapman Ave Apt 312", "address_line_2": "nan" }] },
    { "address_type": "home",      "address": [{ "address_line_1": "9722 Walnut St",            "address_line_2": "nan" }] },
    { "address_type": "work",      "address": [{ "address_line_1": "1325 S Grand Ave",          "address_line_2": "nan" }] }
  ]
}
DataFrame Output:
[
  { "parsed_address": { "address_type": "Primary",   "address": [{ "address_line_1": "18 Atherton",               "address_line_2": null }] }, "source_id": "90353002" },
  { "parsed_address": { "address_type": "directory", "address": [{ "address_line_1": "130 E Chapman Ave Apt 312", "address_line_2": null }] }, "source_id": "90353002" },
  { "parsed_address": { "address_type": "home",      "address": [{ "address_line_1": "9722 Walnut St",            "address_line_2": null }] }, "source_id": "90353002" },
  { "parsed_address": { "address_type": "work",      "address": [{ "address_line_1": "1325 S Grand Ave",          "address_line_2": null }] }, "source_id": "90353002" }
]
Code:
print("dic--->", dic)
df4 = pd.DataFrame(dic)
print("df4--->", df4)
This is recycling of scalar values: when one column is given a list and another a scalar, the scalar is repeated for every row of the list.
This behavior is also why you can assign a scalar to a new column and the number of rows stays the same.
You can wrap the parsed address into another list to get a single row, as sketched below.
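A minimal sketch of the fix, with the dictionary abbreviated from the example above:
import pandas as pd

dic = {
    "source_id": 123,
    "parsed_address": [
        {"address_type": "Primary", "address": [{"address_line_1": "18 Atherton", "address_line_2": "nan"}]},
        {"address_type": "home", "address": [{"address_line_1": "9722 Walnut St", "address_line_2": "nan"}]},
    ],  # truncated
}

# Wrapping dic in a list treats it as a single record, so parsed_address
# stays one multivalued cell and source_id appears only once.
df4 = pd.DataFrame([dic])
print(df4)  # one row: source_id = 123, parsed_address = <list of dicts>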

MongoDB - Get SUM of values INSIDE of the array

I have JSON document recorded to MongoDB with structure like so:
[{ "SessionKey": "172e3b6b-509e-4ef3-950c-0c1dc5c83bab",
"Query": {"Date": "2020-03-04"},
"Flights": [
{"LegId":"13235",
"PricingOptions": [
{"Agents": [1963108],
"Price": 61763.64 },
{"Agents": [4035868],
"Price": 62395.83 }]},
{"LegId": "13236",
"PricingOptions": [{
"Agents": [2915951],
"Price": 37188.0}]}
...
The result I'm trying to get is "LegId": "sum_per_flight", in this case -> {'13235': 124159.47 (61763.64 + 62395.83), '13236': 37188.0}, and then to get flights with price < N.
I've tried to run this pipeline for the aggregation step (but it returns a list of ALL prices; I don't know how to sum them up properly):
result = collection.aggregate([
    {'$match': {'Query.Date': '2020-03-01'}},
    {'$group': {'_id': {'Flight': '$Flights.LegId', 'Price': '$Flights.PricingOptions.Price'}}}
])
Also I've tried this pipeline, but it returns 0 for 'total_price_per_flight':
result = collection.aggregate({'$project': {
    'Flights.LegId': 1,
    'total_price_per_flight': {'$sum': '$Flights.PricingOptions.Price'}
}})
You need to use $unwind to flatten the Flights array so that each flight can be processed individually.
With the $reduce operator, we iterate over the PricingOptions array and sum the Price fields (accumulating the prices).
In the last step we return the documents to their original structure. Before that, you may apply your "get flights with price < N" filter.
db.collection.aggregate([
  {
    "$match": { "Query.Date": "2020-03-04" }
  },
  {
    $unwind: "$Flights"
  },
  {
    $addFields: {
      "Flights.LegId": {
        $arrayToObject: [[
          {
            k: "$Flights.LegId",
            v: {
              $reduce: {
                input: "$Flights.PricingOptions",
                initialValue: 0,
                in: { $add: ["$$value", "$$this.Price"] }
              }
            }
          }
        ]]
      }
    }
  },
  {
    $group: {
      _id: "$_id",
      SessionKey: { $first: "$SessionKey" },
      Query: { $first: "$Query" },
      Flights: { $push: "$Flights" }
    }
  }
])
MongoPlayground
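As a sketch of the "price < N" part from PyMongo (collection name and N are placeholders; $sum over the array expression replaces the $reduce above, which is equivalent here):
from pymongo import MongoClient

client = MongoClient()  # assumes a reachable MongoDB instance
coll = client["mydb"]["flights"]  # placeholder names

N = 50000  # example threshold
pipeline = [
    {"$match": {"Query.Date": "2020-03-04"}},
    {"$unwind": "$Flights"},
    # Sum the prices of this flight's PricingOptions
    {"$addFields": {"total_price": {"$sum": "$Flights.PricingOptions.Price"}}},
    # Keep only flights cheaper than N
    {"$match": {"total_price": {"$lt": N}}},
]
cheap_flights = list(coll.aggregate(pipeline))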

Mongodb group average array

I'm trying to do a PyMongo aggregate - $group averages of arrays - and I cannot find any examples that match my problem.
Data example:
{ Subject: "Dave",   Strength: [1,2,3,4] },
{ Subject: "Dave",   Strength: [1,2,3,5] },
{ Subject: "Dave",   Strength: [1,2,3,6] },
{ Subject: "Stuart", Strength: [4,5,6,7] },
{ Subject: "Stuart", Strength: [6,5,6,7] },
{ Subject: "Kevin",  Strength: [1,2,3,4] },
{ Subject: "Kevin",  Strength: [9,4,3,4] }
Wanted results:
{ Subject: "Dave",   mean_strength: [1,2,3,5] },
{ Subject: "Stuart", mean_strength: [5,5,6,7] },
{ Subject: "Kevin",  mean_strength: [5,3,3,4] }
I have tried this approach, but MongoDB interprets the arrays as null:
pipe = [{'$group': {'_id': 'Subject', 'mean_strength': {'$avg': '$Strength'}}}]
results = db.Walk.aggregate(pipeline=pipe)
Out: [{'_id': 'SubjectID', 'total': None}]
I've looked through the MongoDB documentation and cannot work out whether there is any way to do this.
You could use $unwind with includeArrayIndex. As the name suggests, includeArrayIndex adds the array index to the output. This allows for grouping by Subject and array position in Strength. After calculating the average, the results need to be sorted to ensure the second $group and $push add the results back into the right order. Finally there is a $project to include and rename the relevant columns.
db.test.aggregate([
  {
    "$unwind": { "path": "$Strength", "includeArrayIndex": "rownum" }
  },
  {
    "$group": {
      "_id": { "Subject": "$Subject", "rownum": "$rownum" },
      "mean_strength": { "$avg": "$Strength" }
    }
  },
  {
    "$sort": { "_id.Subject": 1, "_id.rownum": 1 }
  },
  {
    "$group": {
      "_id": "$_id.Subject",
      "mean_strength": { "$push": "$mean_strength" }
    }
  },
  {
    "$project": { "_id": 0, "Subject": "$_id", "mean_strength": 1 }
  }
])
For your test input, this returns:
{ "mean_strength" : [ 5, 5, 6, 7 ], "Subject" : "Stuart" }
{ "mean_strength" : [ 5, 3, 3, 4 ], "Subject" : "Kevin" }
{ "mean_strength" : [ 1, 2, 3, 5 ], "Subject" : "Dave" }
You can try the aggregation below.
For example, Dave has [[1,2,3,4], [1,2,3,5], [1,2,3,6]] after the $group stage.
Here is the iteration matrix for the reduce function:
Pass      Current Value (c)   Accumulated Value (b)        Next Value
First     [1,2,3,5]           [[1],[2],[3],[4]]            [[1,1],[2,2],[3,3],[5,4]]
Second    [1,2,3,6]           [[1,1],[2,2],[3,3],[5,4]]    [[1,1,1],[2,2,2],[3,3,3],[5,4,6]]
The map function then takes the average of each array value from the reduce stage, producing [1,2,3,5].
[{"$group":{"_id":"$Subject","Strength":{"$push":"$Strength"}}}, //Push all arrays
{"$project":{"mean_strength":{
"$map":{//Calculate avg for each reduced indexed pairs.
"input":{
"$reduce":{
"input":{"$slice":["$Strength",1,{"$subtract":[{"$size":"$Strength"},1]}]}, //Start from second array.
"initialValue":{ //Initialize to the first array with all elements transformed to array of single values.
"$map":{
"input":{"$range":[0,{"$size":{"$arrayElemAt":["$Strength",0]}}]},
"as":"a",
"in":[{"$arrayElemAt":[{"$arrayElemAt":["$Strength",0]},"$$a"]}]
}
},
"in":{
"$let":{"vars":{"c":"$$this","b":"$$value"}, //Create variables for current and accumulated values
"in":{"$map":{ //Creates map of same indexed values from each iteration
"input":{"$range":[0,{"$size":"$$b"}]},
"as":"d",
"in":{
"$concatArrays":[ //Concat values at same index
{"$arrayElemAt":["$$c","$$d"]}, //current
[{"$arrayElemAt":["$$b","$$d"]}] //accumulated
]
}
}
}
}
}
}
},
"as":"e",
"in":{"$avg":"$$e"}
}
}}}
]
Per the description in the question above, try executing the following aggregate query:
db.collection.aggregate(
  // Pipeline
  [
    // Stage 1
    {
      $unwind: { path: "$Strength", includeArrayIndex: "arrayIndex" }
    },
    // Stage 2
    {
      $group: {
        _id: { Subject: '$Subject', arrayIndex: '$arrayIndex' },
        mean_strength: { $avg: '$Strength' }
      }
    },
    // Stage 3
    {
      $group: {
        _id: { 'Subject': '$_id.Subject' },
        mean_strength: { $push: '$mean_strength' }
      }
    },
    // Stage 4
    {
      $project: {
        Subject: '$_id.Subject',
        mean_strength: '$mean_strength',
        _id: 0
      }
    }
  ]
);
