Pandas sort by subtotal of each group - python

Still new into pandas but is there a way to sort df by subtotal of each group.
Area Unit Count
A A1 5
A A2 2
B B1 10
B B2 1
B B3 3
C C1 10
So I want to sort them by subtotal of each Area which results to A subtotal = 7, B subtotal=14, C subtotal = 10
The sort should be like
Area Unit Count
B B1 10
B B2 1
B B3 3
C C1 10
A A1 5
A A2 2
*Note that despite the value of B3 > B1 it should not be affected by the sort.

create a helper column 'sorter', which is the sum of the count variable, and sort ur dataframe with it
df['sorter'] = df.groupby("Area").Count.transform('sum')
df.sort_values('sorter',ascending=False).reset_index(drop=True).drop('sorter',axis=1)
Area Unit Count
0 B B1 10
1 B B2 1
2 B B3 3
3 C C1 10
4 A A1 5
5 A A2 2

Related

Pandas: how to unpivot df correctly?

I have the following dataframe df:
A B Var Value
0 A1 B1 T1name T1
1 A2 B2 T1name T1
2 A1 B1 T2name T2
3 A2 B2 T2name T2
4 A1 B1 T1res 1
5 A2 B2 T1res 1
6 A1 B1 T2res 2
7 A2 B2 T2res 2
I now want to 'half' my dataframe because Var contains variables that should not go under the same column. My intended outcome is:
A B Name Value
0 A1 B1 T1 1
1 A2 B2 T1 1
2 A1 B1 T2 2
3 A2 B2 T2 2
What should I use to unpivot this correctly?
Just filter where the string contains res and assign a new column with the first two characters of the var columns
df[df['Var'].str.contains('res')].assign(Name=df['Var'].str[:2]).drop(columns='Var')
A B Value Name
4 A1 B1 1 T1
5 A2 B2 1 T1
6 A1 B1 2 T2
7 A2 B2 2 T2
Note that this creates a slice of the original DataFrame and not a copy
then :
df = df[~df['Var'].isin(['T1name','T2name'])]
output :
A B Var Value
4 A1 B1 T1res 1
5 A2 B2 T1res 1
6 A1 B1 T2res 2
7 A2 B2 T2res 2
There are different options available looking at the df. Regex seems to be on top of the list. If regex doesn't work, maybe think of redefining your problem:
Filter Value by dtype, replace unwanted characters in df and rename columns. Code below
df[df['Value'].str.isnumeric()].replace(regex=r'res$', value='').rename(columns={'Var':'Name'})
A B Name Value
4 A1 B1 T1 1
5 A2 B2 T1 1
6 A1 B1 T2 2
7 A2 B2 T2 2

How to filter values in data frame by grouped values in column

I have a dataframe:
id value
a1 0
a1 1
a1 2
a1 3
a2 0
a2 1
a3 0
a3 1
a3 2
a3 3
I want to filter id's and leave only those which have value higher than 3. So in this example id a2 must be removed since it only has values 0 and 1. So desired result is:
id value
a1 0
a1 1
a1 2
a1 3
a3 0
a3 1
a3 2
a3 3
a3 4
a3 5
How to to that in pandas?
Updated.
Group by IDs and find their max values. Find the IDs whose max value is at or above 3:
keep = df.groupby('id')['value'].max() >= 3
Select the rows with the IDs that match:
df[df['id'].isin(keep[keep].index)]
Use boolean mask to keep rows that match condition then replace bad id (a2) by the next id (a3). Finally, group again by id an apply a cumulative sum.
mask = df.groupby('id')['value'] \
.transform(lambda x: sorted(x.tolist()) == [0, 1, 2, 3])
df1 = df[mask].reindex(df.index).bfill()
df1['value'] = df1.groupby('id').agg('cumcount')
Output:
>>> df1
id value
0 a1 0
1 a1 1
2 a1 2
3 a1 3
4 a3 0
5 a3 1
6 a3 2
7 a3 3
8 a3 4
9 a3 5

Count the value of a column if is greater than 0 in a groupby result

I have a groupby array in which I need to group by A, then show a count of instances of B separated by B1 and B2 and finally the percentage of those instances that are > 0.1 so I did this to get the first 2:
A B C
id
118 a1 B1 0
119 a1 B1 0
120 a1 B1 101.1
121 a1 B1 106.67
122 a1 B2 103.33
237 a1 B2 100
df = pd.DataFrame(df.groupby(
['A', 'B'])['B'].aggregate('count')).unstack(level=1)
to which I get the first part right:
B
B B1 B2
A
a1 4 2
a2 7 9
a3 9 17
a4 8 8
a5 7 8
But then when I need to get the percentage of the count that is > 0
prcnt_complete = df[['A', 'B', 'C']]
prcnt_complete['passed'] = prcnt_complete['C'].apply(lambda x: (float(x) > 1))
prcnt_complete = prcnt_complete.groupby(['A', 'B', 'passed']).count()
I get weird values that make no sense, sometimes the sum between True and False doesn't even add up. I'm trying to understand what in the order of things I'm doing wrong so that I can make sense of it.
The result I'm looking for is something like this:
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
a2 7 9 7 6
a3 9 17 9 5
Thanks,
You can do:
(df['C'].gt(1).groupby([df['A'],df['B']])
.agg(['size','sum'])
.rename(columns={'size':'B','sum':'passed'})
.unstack('B')
)
Output (from sample data):
B passed
B B1 B2 B1 B2
A
a1 4 2 2 2
While working on your problem, I also wanted to see if I can get the average percentage for B (while ignoring 0s). I was able to accomplish this as well while getting the counts.
DataFrame for this exercise:
A B C
0 a1 B1 0.00
1 a1 B1 0.00
2 a1 B1 98.87
3 a1 B1 101.10
4 a1 B2 106.67
5 a1 B2 103.00
6 a2 B1 0.00
7 a2 B1 0.00
8 a2 B1 33.00
9 a2 B1 100.00
10 a2 B2 80.00
11 a3 B1 90.00
12 a3 B2 99.00
Average while excluding the zeros
for this I had to add .replace(0, np.nan) before the groupby function.
A = ['a1','a1','a1','a1','a1','a1','a2','a2','a2','a2','a2','a3','a3']
B = ['B1','B1','B1','B1','B2','B2','B1','B1','B1','B1','B2','B1','B2']
C = [0,0,98.87,101.1,106.67,103,0,0,33,100,80,90,99]
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':A,'B':B,'C':C})
df = pd.DataFrame(df.replace(0, np.nan)
.groupby(['A', 'B'])
.agg({'B':'size','C':['count','mean']})
.rename(columns={'size':'Count','count':'Passed','mean':'Avg Score'})).unstack(level=1)
df.columns = df.columns.droplevel(0)
Count Passed Avg Score
B B1 B2 B1 B2 B1 B2
A
a1 4 2 2 2 99.985 104.835
a2 4 1 2 1 66.500 80.000
a3 1 1 1 1 90.000 99.000

Replace value with the value of nearest neighbor in Pandas dataframe

I have a problem with getting nearest values for some rows in pandas dataframe and fill another column with values from those rows.
data sample I have:
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 100
A A5 250 3
A A6 250 100
B B1 0 1
B B2 30 2
The thing is, wherever match_v is equal to 100, I need to replace that 100 with a value from the row where r_value is the closest to r_value from origin row(where match_v is equal to 100), but just withing group (grouped by id)
Expected output
id su_id r_value match_v
A A1 0 1
A A2 0 1
A A3 70 2
A A4 120 2
A A5 250 3
A A6 250 3
B B1 0 1
B B2 30 2
I have tried with creating lead and leg with shift and then finding differences. But doesn't work well and it somehow messed up already good values.
I haven't tried anything else cause I really don't have any idea.
Any help or hint is welcomed and I if you need any additional info, I'm here.
Thanks in advance.
More like merge_asof
s=df.loc[df.match_v!=100]
s=pd.merge_asof(df.sort_values('r_value'),s.sort_values('r_value'),on='r_value',by='id',direction='nearest')
df['match_v']=df['su_id'].map(s.set_index('su_id_x')['match_v_y'])
df
Out[231]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Here is another way using numpy broadcast , build for speed up calculation
l=[]
for x , y in df.groupby('id'):
s1=y.r_value.values
s=abs((s1-s1[:,None])).astype(float)
s[np.tril_indices(s.shape[0], 0)] = 999999
s=s.argmin(0)
s2=y.match_v.values
l.append(s2[s][s2==100])
df.loc[df.match_v==100,'match_v']=np.concatenate(l)
df
Out[264]:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
You could define a custom function which does the calculation and substitution, and then use it with groupby and apply.
def mysubstitution(x):
for i in x.index[x['match_v'] == 100]:
diff = (x['r_value'] - (x['r_value'].iloc[i])).abs()
exclude = x.index.isin([i])
closer_idx = diff[~exclude].idxmin()
x['match_v'].iloc[i] = x['match_v'].iloc[closer_idx]
return x
ddf = df.groupby('id').apply(mysubstitution)
ddf is:
id su_id r_value match_v
0 A A1 0 1
1 A A2 0 1
2 A A3 70 2
3 A A4 120 2
4 A A5 250 3
5 A A6 250 3
6 B B1 0 1
7 B B2 30 2
Assuming there is always at least one valid value within the group when a 100 is first encountered.
m = dict()
for i in range(len(df)):
if df.loc[i, "match_v"] == 100:
df.loc[i, "match_v"] = m[df.loc[i, "id"]]
else:
m[df.loc[i, "id"]] = df.loc[i, "match_v"]

How to count the following number of rows in pandas (new)

Relate to the question below,I would like to count the number of following rows.
Thanks to the answer,I could handle data.
But I met some trouble and exception.
How to count the number of following rows in pandas
A B
1 a0
2 a1
3 b1
4 a0
5 b2
6 a2
7 a2
First,I would like to cut df.with startswith("a")
df1
A B
1 a0
df2
A B
2 a1
3 b1
df3
A B
4 a0
5 b2
df4
A B
6 a2
df5
A B
7 a2
I would like to count each df's rows
"a" number
a0 1
a1 2
a0 2
a2 1
a2 1
How could be this done?
I am happy someone tell me how to handle this kind of problem.
You can use aggregate by custom Series created with cumsum:
print (df.B.str.startswith("a").cumsum())
0 1
1 2
2 2
3 3
4 3
5 4
6 5
Name: B, dtype: int32
df1 = df.B.groupby(df.B.str.startswith("a").cumsum()).agg(['first', 'size'])
df1.columns =['"A"','number']
df1.index.name = None
print (df1)
"A" number
1 a0 1
2 a1 2
3 a0 2
4 a2 1
5 a2 1

Categories