Python Pandas group by mean() for a certain count of rows

I need to take the groupby mean() of only the first 2 values of each category. How do I define that?
df looks like:

   category  value
-> a         2
-> a         5
   a         4
   a         8
-> b         6
-> b         3
   b         1
-> c         2
-> c         2
   c         7
Reading only the arrowed rows, the output should be:
category  mean
a         3.5
b         4.5
c         2
How can I do this? I am trying the following, but I do not know where to restrict it to only the first 2 observations from each category:
output = df.groupby(['category'])['value'].mean().reset_index()
Your help is appreciated, thanks in advance.

You can also do this via groupby() and agg():
out = df.groupby('category', as_index=False)['value'].agg(lambda x: x.head(2).mean())
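Since head(2) operates on each group in its original row order, this picks up exactly the first two rows per category as they appear in df. A quick check with the sample frame (column names may differ slightly across pandas versions):

print(out)
#   category  value
# 0        a    3.5
# 1        b    4.5
# 2        c    2.0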

Try apply on each group of values and use head(2) to get just the first 2 values, then take the mean:
import pandas as pd

df = pd.DataFrame({
    'category': {0: 'a', 1: 'a', 2: 'a', 3: 'a', 4: 'b', 5: 'b',
                 6: 'b', 7: 'c', 8: 'c', 9: 'c'},
    'value': {0: 2, 1: 5, 2: 4, 3: 8, 4: 6, 5: 3, 6: 1, 7: 2,
              8: 2, 9: 7}
})

output = df.groupby('category', as_index=False)['value'] \
           .apply(lambda a: a.head(2).mean())
print(output)

output:
  category  value
0        a    3.5
1        b    4.5
2        c    2.0
Or create a boolean index to filter df with:
m = df.groupby('category').cumcount().lt(2)
output = df[m].groupby('category')['value'].mean().reset_index()
print(output)

  category  value
0        a    3.5
1        b    4.5
2        c    2.0
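A closely related variant, a minimal sketch assuming the default row order is the one you want: GroupBy.head(2) keeps the first two rows of each group directly, so no explicit mask is needed:

# keep the first 2 rows per category, then average per category
output = (df.groupby('category').head(2)
            .groupby('category', as_index=False)['value'].mean())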

Related

Filter for column value and set its other column to an array

I have a df such as
Letter  Stats
B       0
B       1
C       22
B       0
C       0
B       3
How can I filter for a value in the Letter column and then convert the Stats column for that value into an array?
Basically I want to filter for B and convert the Stats column to an array. Thanks!
Here is one way to do it:
# receives a dataframe and a letter as parameters,
# returns the Stats values as a list for the passed Letter
def grp(df, letter):
    return df.loc[df['Letter'].eq(letter)]['Stats'].values.tolist()

# pass the dataframe and the letter
result = grp(df, 'B')
print(result)

[0, 1, 0, 3]
Data used:
data = {'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
        'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)
Although I believe the solution proposed by @Naveed is enough for this problem, one small extension could be suggested.
If you would like to get the result as a pandas Series and obtain some statistics (max, min, median, etc.) for it:
data = {'Letter': {0: 'B', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B'},
        'Stats': {0: 0, 1: 1, 2: 22, 3: 0, 4: 0, 5: 3}}
df = pd.DataFrame(data)

letter = 'B'
ser = pd.Series(name=letter, data=df.loc[df['Letter'].eq(letter)]['Stats'].values)
print(f"Max value: {ser.max()} | Min value: {ser.min()} | Median value: {ser.median()}")
Output:
Max value: 3 | Min value: 0 | Median value: 0.5
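For what it's worth, a more compact variant of the same filter, assuming a NumPy array is an acceptable result type:

# boolean-mask the Stats column and materialize it as a NumPy array
arr = df.loc[df['Letter'].eq('B'), 'Stats'].to_numpy()
print(arr)  # [0 1 0 3]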

Pandas group_by multiple columns with conditional count

I have a dataframe, for instance:
df = pd.DataFrame({'Host': {0: 'N', 1: 'B', 2: 'N', 3: 'N', 4: 'N', 5: 'V', 6: 'B'},
                   'Registration': {0: 'Registered', 1: 'MR', 2: 'Registered',
                                    3: 'Registered', 4: '', 5: 'Registered',
                                    6: 'Registered'},
                   'Val': {0: 'N', 1: 'B', 2: 'N', 3: 'N', 4: '', 5: 'V', 6: 'B'},
                   'Sum': {0: 100.0, 1: 0.0, 2: 300.0, 3: 150.0, 4: 0.0, 5: 0.0,
                           6: 20.0}})
I want to get the count, for each Host. Something like:
df.groupby("Host").count()
"""
Host Registration Val Sum
B 2 2 2
N 4 4 4
V 1 1 1
"""
But I want the count to be conditional as a function of each column. For example, in Sum I want to count only rows greater than 0.0, and in the other columns only rows that are not empty. So my expected output would be:
"""
      Registration  Val  Sum
Host
B                2    2    1
N                3    3    3
V                1    1    0
"""
Not sure how to do that. My best attempt has been:
df.groupby("Host").agg({'Registration': lambda x: (x != "").count(),
                        'Val': lambda x: (x != "").count(),
                        'Sum': lambda x: (x != 0).count()})
But this produces the same output as df.groupby("Host").count()
Any suggestions?
First, your solution: to count True values, use sum instead of count (count tallies every non-NA value, including False, which is why your attempt matched plain count()):
df = df.groupby("Host").agg({'Registration': lambda x: (x != "").sum(),
                             'Val': lambda x: (x != "").sum(),
                             'Sum': lambda x: (x != 0).sum()})
print (df)

      Registration  Val  Sum
Host
B                2    2  1.0
N                3    3  3.0
V                1    1  0.0
Improved solution: create boolean columns before aggregating with sum:
df = df.assign(Registration = df['Registration'].ne(""),
               Val = df['Val'].ne(""),
               Sum = df['Sum'].ne(0)).groupby("Host").sum()
print (df)

      Registration  Val  Sum
Host
B                2    2    1
N                3    3    3
V                1    1    0
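The same counts can also be written with named aggregation (available since pandas 0.25), a sketch assuming df is still the original frame from the question; this keeps explicit control of the output column names:

out = df.groupby('Host').agg(
    Registration=('Registration', lambda s: s.ne('').sum()),
    Val=('Val', lambda s: s.ne('').sum()),
    Sum=('Sum', lambda s: s.ne(0).sum()),
)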

Finding unique values of one column for specific column values

Given the following df:
data = {'identifier': {0: 'a', 1: 'a', 3: 'b', 4: 'b', 5: 'c'},
        'gt_50': {0: 1, 1: 1, 3: 0, 4: 0, 5: 0},
        'gt_10': {0: 1, 1: 1, 3: 1, 4: 1, 5: 1}}
df = pd.DataFrame(data)
I want to find the nunique of the column "identifier" for each column that starts with "gt_", counting only rows where the value is one.
Expected output:
gt_50    1
gt_10    3
I could write a for loop, filter the frame on one gt_ column per iteration and count the uniques, but I think that's not very clean.
Is there a way to do this in a clean way?
Use DataFrame.melt, selecting the gt_ columns with filter for the unpivot, then keep the rows equal to 1 with DataFrame.query, and finally count unique values with DataFrameGroupBy.nunique:
out = (df.melt('identifier', value_vars=df.filter(regex='^gt_').columns)
         .query('value == 1')
         .groupby('variable')['identifier']
         .nunique())
print (out)

variable
gt_10    3
gt_50    1
Name: identifier, dtype: int64
Or:
s = df.set_index('identifier').filter(regex='^gt_').stack()
out = s[s.eq(1)].reset_index().groupby('level_1')['identifier'].nunique()
print (out)

level_1
gt_10    3
gt_50    1
Name: identifier, dtype: int64
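If a plain dictionary is an acceptable result, a short comprehension over the matching columns also works (a sketch, assuming the gt_ columns hold only 0/1 flags):

# one nunique per gt_ column, counting identifiers where the flag is 1
out = {c: df.loc[df[c].eq(1), 'identifier'].nunique()
       for c in df.filter(regex='^gt_').columns}
print(out)  # {'gt_50': 1, 'gt_10': 3}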

Creating aggregated summed table in Python [duplicate]

Is there an easy way of creating a summed aggregate table in Python from columns in an existing table? I have only used SQL previously, where this would be done with code such as:
select AreaCode, Measure, sum(Value) as 'VALUE'
from Table
group by AreaCode, Measure
In my current table (sticking with the example above) I have hundreds of rows containing AreaCode, Measure and Value that I want to aggregate into a new table in Python.
Given a pandas dataframe named table looking like this:
table
#  AreaCode Measure  Value
#0        A      M1     13
#1        A      M1      1
#2        B      M1     15
#3        B      M1      1
#4        A      M2     54
#5        A      M2      1
#6        B      M2     17
#7        B      M2      1
The code to perform the aggregation would be:
table.groupby(['AreaCode', 'Measure'], as_index=False).sum()
#  AreaCode Measure  Value
#0        A      M1     14
#1        A      M2     55
#2        B      M1     16
#3        B      M2     18
Code to generate table and test this solution:
table = pd.DataFrame({'AreaCode': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'A', 5: 'A', 6: 'B', 7: 'B'},
                      'Measure': {0: 'M1', 1: 'M1', 2: 'M1', 3: 'M1', 4: 'M2', 5: 'M2', 6: 'M2', 7: 'M2'},
                      'Value': {0: 13, 1: 1, 2: 15, 3: 1, 4: 54, 5: 1, 6: 17, 7: 1}})
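To mirror the SQL one-for-one, including the VALUE alias, a sketch using Series.reset_index(name=...) to name the summed column:

# equivalent of: select AreaCode, Measure, sum(Value) as 'VALUE' ... group by AreaCode, Measure
out = table.groupby(['AreaCode', 'Measure'])['Value'].sum().reset_index(name='VALUE')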

Performing groupby function using two columns as parameters regardless of the order of the columns

Given the following dataframe:
Node_1  Node_2  Time
A       B       6
A       B       4
B       A       2
B       C       5
How can one obtain, using groupby or other methods, the dataframe as follows:
Node_1  Node_2  Mean_Time
A       B       4
B       C       5
The first row's Mean_Time is obtained by averaging all A->B and B->A routes: (6 + 4 + 2)/3 = 4.
You could sort each row of the Node_1 and Node_2 columns using np.sort:
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
df.loc[:, nodes.columns] = arr
which results in df now looking like:
  Node_1 Node_2  Time
0      A      B     6
1      A      B     4
2      A      B     2
3      B      C     5
With the Node columns sorted, you can groupby/agg as usual:
cols = nodes.columns.tolist()
result = df.groupby(cols).agg('mean').reset_index()
import numpy as np
import pandas as pd
data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
        'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
        'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)
nodes = df.filter(regex='Node')
arr = np.sort(nodes.values, axis=1)
cols = nodes.columns.tolist()
df.loc[:, nodes.columns] = arr
result = df.groupby(cols).agg('mean').reset_index()
print(result)
yields
  Node_1 Node_2  Time
0      A      B     4
1      B      C     5
Something along these lines should give you the desired result... This got a lot uglier than it was :D
import pandas as pd

data = {'Node_1': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
        'Node_2': {0: 'B', 1: 'B', 2: 'A', 3: 'C'},
        'Time': {0: 6, 1: 4, 2: 2, 3: 5}}
df = pd.DataFrame(data)

# Create a new order-independent key column to group by
df["Node"] = df[["Node_1", "Node_2"]].apply(lambda x: tuple(sorted(x)), axis=1)
# Create the Mean_time column (select 'Time' explicitly so only it is averaged)
df["Mean_time"] = df.groupby('Node')['Time'].transform('mean')
# Drop duplicate rows and drop the Node and Time columns
df = df.drop_duplicates("Node").drop(['Node', 'Time'], axis=1)
print(df)
Returns:
  Node_1 Node_2  Mean_time
0      A      B          4
3      B      C          5
An alternative would be to use ('mean' rather than np.mean, since numpy is not imported here):
df = (df.groupby('Node', as_index=False)
        .agg({'Node_1': lambda x: list(x)[0],
              'Node_2': lambda x: list(x)[0],
              'Time': 'mean'})
        .drop('Node', axis=1))
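A fully vectorized variant of the same normalization, a sketch assuming df is the original four-row frame and that the node labels compare as plain strings: take the elementwise minimum and maximum of the two node columns, then group on those.

import numpy as np

lo = np.minimum(df['Node_1'], df['Node_2'])  # alphabetically smaller node
hi = np.maximum(df['Node_1'], df['Node_2'])  # alphabetically larger node
out = (df.assign(Node_1=lo, Node_2=hi)
         .groupby(['Node_1', 'Node_2'], as_index=False)['Time'].mean()
         .rename(columns={'Time': 'Mean_Time'}))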
