Statistics for Grouped DataFrames with Pandas

I have a DataFrame that can be grouped by two columns: Level_1 and Sub_level.
The data looks like this:
Level_1 Sub_level Value
0 Group A A1 100
1 Group A A2 200
2 Group A A1 150
3 Group B B1 100
4 Group B B2 200
5 Group A A1 200
6 Group A A1 300
7 Group A A1 400
8 Group B B2 450
...
I would like to get the frequency/count of each Sub_level relative to its Level_1 group, i.e.
Level_1 Sub_level Pct_of_total
Group A A1 5 / 6 (as there are 6 Group A rows in Level_1, 5 of which have A1 as Sub_level)
A2 1 / 6
Group B B1 1 / 3 (as there are 3 Group B rows in Level_1, 1 of which has B1 as Sub_level)
B2 2 / 3
Of course, the fractions in the new column Pct_of_total should be expressed as percentages.
Any clues?
Thanks,
/N

I think you need groupby + size for the per-group counts, then groupby by the first index level (Level_1) with transform('sum') for the group totals. Finally divide with div:
df1 = df.groupby(['Level_1','Sub_level'])['Value'].size()
print (df1)
Level_1 Sub_level
Group A A1 5
A2 1
Group B B1 1
B2 2
Name: Value, dtype: int64
df2 = df1.groupby(level=0).transform('sum')
print (df2)
Level_1 Sub_level
Group A A1 6
A2 6
Group B B1 3
B2 3
Name: Value, dtype: int64
df3 = df1.div(df2).reset_index(name='Pct_of_total')
print (df3)
Level_1 Sub_level Pct_of_total
0 Group A A1 0.833333
1 Group A A2 0.166667
2 Group B B1 0.333333
3 Group B B2 0.666667
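If you want the percentages directly, a shorter route (a sketch, assuming the same df and a reasonably recent pandas) is to take normalized value counts per group and scale them:
pct = (df.groupby('Level_1')['Sub_level']
         .value_counts(normalize=True)  # fraction of each Sub_level within its Level_1
         .mul(100)                      # to percent
         .rename('Pct_of_total')
         .reset_index())
print (pct)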

Related

How to sum values in dataframe until certain values in other column by group?

I have a dataframe:
id life_day value
a1 1 10
a1 2 20
a1 3 10
a1 4 5
a1 5 5
a1 6 1
b2 1 7
b2 3 11
b2 4 10
b2 5 20
I want to sum the values for each id up to life_day 4. So the desired result is:
id life_day value
a1 4 45
b2 4 28
How to do that? I tried df[df["life_day"] == 90].groupby("id").sum() but it brings wrong results.
Your approach almost works, but I don't know why you wrote == 90 in df["life_day"] == 90, and it looks like you want the max of life_day, not the sum.
df[df['life_day'] <= 4].groupby('id').agg({'life_day': 'max', 'value': 'sum'})
life_day value
id
a1 4 45
b2 4 28
Use the pandas where condition to mask the rows, then groupby and agg:
df.where(df['life_day'].le(4)).groupby('id').agg({'life_day':'last','value':'sum'}).reset_index()
id life_day value
0 a1 4.0 45.0
1 b2 4.0 28.0
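Note that where keeps the non-matching rows as NaN instead of dropping them, which is why the output above is upcast to float (4.0, 45.0). A sketch of one way to restore integer dtypes after aggregating:
out = (df.where(df['life_day'].le(4))
         .groupby('id')                        # NaN ids are dropped by groupby
         .agg({'life_day': 'last', 'value': 'sum'})
         .astype(int)                          # safe: no NaN left after agg
         .reset_index())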

How to filter values in data frame by grouped values in column

I have a dataframe:
id value
a1 0
a1 1
a1 2
a1 3
a2 0
a2 1
a3 0
a3 1
a3 2
a3 3
I want to filter the ids and keep only those which reach a value of at least 3. So in this example id a2 must be removed since it only has values 0 and 1. So the desired result is:
id value
a1 0
a1 1
a1 2
a1 3
a3 0
a3 1
a3 2
a3 3
a3 4
a3 5
How to do that in pandas?
Group by IDs and find their max values. Find the IDs whose max value is at or above 3:
keep = df.groupby('id')['value'].max() >= 3
Select the rows with the IDs that match:
df[df['id'].isin(keep[keep].index)]
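A more compact variant of the same idea (a sketch): transform broadcasts each group's max back onto its rows, so a single boolean mask does the whole job:
df[df.groupby('id')['value'].transform('max') >= 3]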
Use a boolean mask to keep the rows that match the condition, then replace the bad id (a2) by the next id (a3). Finally, group again by id and apply a cumulative count.
# keep only ids whose sorted values are exactly [0, 1, 2, 3]
mask = df.groupby('id')['value'] \
         .transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])
# drop non-matching rows, restore the original index, and back-fill
# so that a2's positions take on a3's id
df1 = df[mask].reindex(df.index).bfill()
# renumber value as a cumulative count within each id
df1['value'] = df1.groupby('id').cumcount()
Output:
>>> df1
id value
0 a1 0
1 a1 1
2 a1 2
3 a1 3
4 a3 0
5 a3 1
6 a3 2
7 a3 3
8 a3 4
9 a3 5

Pandas sort by subtotal of each group

I'm still new to pandas, but is there a way to sort a df by the subtotal of each group?
Area Unit Count
A A1 5
A A2 2
B B1 10
B B2 1
B B3 3
C C1 10
So I want to sort them by the subtotal of each Area, which gives A subtotal = 7, B subtotal = 14, C subtotal = 10.
The sort should be like:
Area Unit Count
B B1 10
B B2 1
B B3 3
C C1 10
A A1 5
A A2 2
*Note that although B3's Count is greater than B2's, the rows within each Area should keep their original order and not be re-sorted by Count.
Create a helper column 'sorter', which is the group sum of the Count variable, and sort your dataframe by it:
df['sorter'] = df.groupby("Area").Count.transform('sum')
df.sort_values('sorter', ascending=False).reset_index(drop=True).drop('sorter', axis=1)
Area Unit Count
0 B B1 10
1 B B2 1
2 B B3 3
3 C C1 10
4 A A1 5
5 A A2 2
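One caveat: sort_values uses quicksort by default, which is not stable, so the original row order within each Area is not strictly guaranteed. A sketch that requests a stable sort explicitly:
df['sorter'] = df.groupby('Area').Count.transform('sum')
df.sort_values('sorter', ascending=False, kind='mergesort') \
  .reset_index(drop=True).drop('sorter', axis=1)  # mergesort is stable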

How to count the following number of rows in pandas (new)

Related to the question below, I would like to count the number of following rows.
Thanks to the answer there, I could handle the data.
But I ran into some trouble and an exception.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First, I would like to split df wherever B starts with "a":
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
I would like to count each sub-df's rows:
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How could this be done?
I would be happy if someone could tell me how to handle this kind of problem.
You can aggregate by a custom Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1
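On pandas 0.25 or later, named aggregation (a sketch on the same grouping key; note the output column is plainly named A here rather than "A") yields the final columns without a separate rename:
df1 = (df.B
         .groupby(df.B.str.startswith('a').cumsum())
         .agg(A='first', number='size'))
df1.index.name = None
print (df1)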

How to select rows which match a certain row

I have a dataframe below
A B
a0 1
b0 1
c0 2
a1 3
b1 4
b2 3
First, I would like to select the rows where df.A starts with "a":
df[df.A.str.startswith("a")]
A B
a0 1
a1 3
Using those rows as boundaries, I would like to cut df like below:
sub1
A B
a0 1
b0 1
c0 2
sub2
A B
a1 3
b1 4
b2 3
Then I would like to extract the rows whose column B value matches that of the row whose column A starts with "a":
sub1
A B
a0 1
b0 1
sub2
A B
a1 3
b2 3
Then append the pieces:
result
A B
a0 1
b0 1
a1 3
b2 3
How can I cut and append df like this?
I tried the cut method but it didn't work well.
I think you can use where with a mask to create NaNs, which are then forward filled with the B values by ffill.
Note that for ffill to work, the row whose A value starts with "a" has to come first in each group:
print (df.B.where(df.A.str.startswith("a")))
0 1.0
1 NaN
2 NaN
3 3.0
4 NaN
5 NaN
Name: B, dtype: float64
print (df.B.where(df.A.str.startswith("a")).ffill())
0 1.0
1 1.0
2 1.0
3 3.0
4 3.0
5 3.0
Name: B, dtype: float64
df = df[df.B == df.B.where(df.A.str.startswith("a")).ffill()]
print (df)
A B
0 a0 1
1 b0 1
3 a1 3
5 b2 3
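An equivalent groupby-based sketch, reusing the cumsum trick from the previous answer: label each stretch of rows starting at an "a" row, then compare B against the first B of its group:
grp = df.A.str.startswith('a').cumsum()  # group label increments at each "a" row
print (df[df.B.eq(df.groupby(grp)['B'].transform('first'))])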
