pandas groupby one column and then groupby another column

pandas groupby one column and then groupby another column - python

I have a df,
code id amount
BB10 531 20
BB10 531 30
BB10 532 50
BR11 631 10
BR11 632 5
IN20 781 10
IN20 781 20
IN20 781 30
I want to first groupby df using code and get the total amount within each group,
df.groupby('code')['amount'].agg('sum')
then I like to know the percentage of amount for a specific id within a specific code group, e.g. for 531 its amount is 50 within BB10, with a amount percentage of 50%; the result df should look like,
code id amount pct
BB10 531 50 50%
BB10 532 50 50%
BR11 631 10 66.7%
BR11 632 5 33.3%
IN20 781 60 100%

First aggregate by both columns sum, then get total per code divide amount, multiple by 100 and round:
df1 = df.groupby(['code','id'], as_index=False)['amount'].sum()
df1['pct']=df1['amount'].div(df1.groupby('code')['amount'].transform('sum')).mul(100).round(1)
print (df1)
code id amount pct
0 BB10 531 50 50.0
1 BB10 532 50 50.0
2 BR11 631 10 66.7
3 BR11 632 5 33.3
4 IN20 781 60 100.0
Last if need percentages convert values to strings and add %:
df1['pct'] = df1['pct'].astype(str) + '%'
print (df1)
code id amount pct
0 BB10 531 50 50.0%
1 BB10 532 50 50.0%
2 BR11 631 10 66.7%
3 BR11 632 5 33.3%
4 IN20 781 60 100.0%

Related

After pandas .agg the histogram produced is taking a count of each record instead of taking the value within the column

Based on the data below I want to create a histogram. How can I do this? The code below is reading the data as a count of 1 for each record instead of taking the value from patient_unique_count
df5=df.groupby(["PANDEMIC","PATIENT_AGE_GRP"]).agg(
{"PATIENT_ID": pd.Series.nunique}).rename(columns={'PATIENT_ID':'PATIENT_UNIQUE_COUNT'})
sns.histplot(x='PATIENT_AGE_GRP', data=df5, kde=True, hue='PANDEMIC')
plt.show()
PATIENT_UNIQUE_COUNT
PANDEMIC PATIENT_AGE_GRP
AFTER 15-19 14
20-24 21
25-29 58
30-34 90
35-39 156
40-44 194
45-49 266
50-54 369
55-59 535
60-64 660
65-69 829
70-74 823
75-79 713
80-84 657
85-89 576
90+ 595
<1 1
NA 5
BEFORE 15-19 13
20-24 14
25-29 41
30-34 56
35-39 144
40-44 179
45-49 279
50-54 466
55-59 758
60-64 873
65-69 929
70-74 890
75-79 860
80-84 789
85-89 726
90+ 757
NA 11
The code is reading the data as a count of 1 for each record instead of taking the value from patient_unique_count. I would also like the histograms to be side by side for before and after pandemic flag.

Aggregate functions on a 3-level pandas grupby object

I want to make a new df with simple metrics like mean, sum, min, max calculated on the Value column in the df visible below, grouped by ID, Date and Key.
index
ID
Key
Date
Value
x
y
z
0
655
321
2021-01-01
50
546
235
252345
1
675
321
2021-01-01
50
345
345
34545
2
654
356
2021-02-02
70
345
346
543
I am doing it like this:
final = df.groupby(['ID','Date','Key'])['Value'].first().mean(level=[0,1]).reset_index().rename(columns={'Value':'Value_Mean'})
I use .first() because one Key can occur multiple times in the df but they all have the same Value. I want to aggregate on ID and Date so I am using level=[0,1].
and then I am adding next metrics with pandas merge as:
final = final.merge(df.groupby(['ID','Date','Key'])['Value'].first().max(level=[0,1]).reset_index().rename(columns={'Value':'Value_Max'}), on=['ID','Date'])
And I go like that with other metrics. I wonder if there is a more sophisticated way to do it than repeat it in multiple lines. I know that you can use .agg() and pass a dict with functions but it seems like in that way it isn't possible to specify the level which is important here.

Use DataFrame.drop_duplicates with named aggregation:
df = pd.DataFrame({'ID':[655,655,655,675,654], 'Key':[321,321,333,321,356],
'Date':['2021-01-01','2021-01-01','2021-01-01','2021-01-01','2021-02-02'],
'Value':[50,30,10,50,70]})
print (df)
ID Key Date Value
0 655 321 2021-01-01 50
1 655 321 2021-01-01 30
2 655 333 2021-01-01 10
3 675 321 2021-01-01 50
4 654 356 2021-02-02 70
final = (df.drop_duplicates(['ID','Date','Key'])
.groupby(['ID','Date'], as_index=False).agg(Value_Mean=('Value','mean'),
Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
final = (df.groupby(['ID','Date','Key'], as_index=False)
.first()
.groupby(['ID','Date'], as_index=False).agg(Value_Mean=('Value','mean'),
Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
df = (df.groupby(['ID','Date','Key'], as_index=False)
.first()
.groupby(['ID','Date'], as_index=False)['Value']
.agg(['mean', 'max'])
.add_prefix('Value_')
.reset_index())
print (df)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50

Rounding off to the nearest 50's pandas dataframe

I have a pandas dataframe , screenshot shown below:
ID Price
100 1040.0
101 1025.0
102 750.0
103 891.0
104 924.0
Expected output shown below
ID Price Price_new
100 1040.0 1050
101 1025.0 1050
102 750.0 750
103 891.0 900
104 920.0 900
This is what I have done but it's not what I want. I want to round off to the nearest fifty in such a way that at 1025 it should round to 1050.
df['Price_new'] = (df['Price'] / 50).round().astype(int) * 50

This is due to the issue : round with python 3
s = (df['Price'] % 50)
df['new'] = df['Price'] + np.where(s>=25,50-s,-s)
df
Out[33]:
ID Price new
0 100 1040 1050
1 101 1025 1050
2 102 750 750
3 103 891 900
4 104 924 900

Follow my suggestion:
import pandas as pd
dt = pd.DataFrame({'ID':[100,101,102,103,104], 'Price':
[1040,1025,750,891,924]})
#VERSION1
dt['Price_new'] = round((dt['Price']+1)/50).astype(int)*50
#VERSION2
dt['Price_new_v2'] = dt['Price']-(dt['Price'].map(lambda x: x%50)) +
(dt['Price'].map(lambda x: round((((x%50)+1)/50))))*50
ID Price Price_new Price_new_V2
0 100 1040 1050 1050
1 101 1025 1050 1050
2 102 750 750 750
3 103 891 900 900
4 104 924 900 900
Just plus 1 in your math you will be able to find your correct answer. But there is another way to do it, my opnião is more understandable than the second version even though I used the modulo operator.

How could i count the rating for each item_id?

From the u.item file, which is divided into [100000 rows x 4columns],
I have to find out which are the best movies.
I try, for each unique item_id (which is 1682) to find the overall rating for each one separately
import pandas as pd
import csv
ratings = pd.read_csv("erg3/files/u.data", encoding="utf-8", delim_whitespace=True,
names = ["user_id", "item_id", "rating", "timestamp"]
)
The data has this form:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
....
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
My expected output :
item_id
1 1753
2 420
3 273
4 742
...
1570 1
1486 1
1626 1
1580 1
i used this best_m = ratings.groupby("item_id")["rating"].sum()
followed by best_m = best_m.sort_values(ascending=False)
And the output looks like :
50 2541
100 2111
181 2032
258 1936
174 1786
...
1581 1
1570 1
1486 1
1626 1
1580 1

Understanding groupby in pandas

I'm looking to get the sum of some values in a dataframe after it has been grouped.
some sample data:
Race officeID CandidateId total_votes precinct
Mayor 10 705 20 Bell
Mayor 10 805 30 Bell
Treasurer 12 505 10 Bell
Treasurer 12 506 40 Bell
Treasurer 12 507 30 Bell
Mayor 10 705 50 Park
Mayor 10 805 10 Park
Treasurer 12 505 5 Park
Treasurer 12 506 13 Park
Treasurer 12 507 16 Park
To get the sum of the votes for each candidate, I can do:
cand_votes = df.groupby('CandidateId').sum().total_votes
print cand_votes
CandidateId
505 15
506 53
507 46
705 70
805 40
To get total votes per office:
total_votes = df.groupby('officeID').sum().total_votes
print total_votes
officeID
10 110
12 114
But what if I want to get the percentage of the vote each candidate got? Would I have to apply some sort of function on each data object? Ideally I would like the final data object to look like:
officeID CandidateID total_votes vote_pct
10 705 70 .6363
10 805 40 .37

First, create a frame that that has the votes by candidate and office.
gb = df.groupby(['officeID','CandidateId'], as_index=False)['total_votes'].sum()
Then with that, you can aggregate by office and use a transform (which returns like indexed data) to calculate a percent of office.
gb['vote_pct'] = gb['total_votes'] / gb.groupby('officeID')['total_votes'].transform('sum')
In [146]: gb
Out[146]:
officeID CandidateId total_votes vote_pct
0 10 705 70 0.636364
1 10 805 40 0.363636
2 12 505 15 0.131579
3 12 506 53 0.464912
4 12 507 46 0.403509

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas groupby one column and then groupby another column - python

Related

After pandas .agg the histogram produced is taking a count of each record instead of taking the value within the column

Aggregate functions on a 3-level pandas grupby object

Rounding off to the nearest 50's pandas dataframe

How could i count the rating for each item_id?

Understanding groupby in pandas

Categories

Resources