I want to group my dataframe by two columns and then sort the aggregated results within those groups.
In [167]: df
Out[167]:
count job source
0 2 sales A
1 4 sales B
2 6 sales C
3 3 sales D
4 7 sales E
5 5 market A
6 3 market B
7 2 market C
8 4 market D
9 1 market E
In [168]: df.groupby(['job','source']).agg({'count':sum})
Out[168]:
count
job source
market A 5
B 3
C 2
D 4
E 1
sales A 2
B 4
C 6
D 3
E 7
I would now like to sort the 'count' column in descending order within each of the groups, and then take only the top three rows. To get something like:
count
job source
market A 5
D 4
B 3
sales E 7
C 6
B 4
You can do it in one go, by sorting first and then using head to take the first 3 rows of each group.
In[34]: df.sort_values(['job','count'],ascending=False).groupby('job').head(3)
Out[34]:
count job source
4 7 sales E
2 6 sales C
1 4 sales B
5 5 market A
8 4 market D
6 3 market B
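If you also want the summed counts from the question, you can aggregate first and then reuse the same sort + head pattern. A rough sketch (assuming the question's df; not part of the answer above):
# Aggregate per (job, source) first, then sort within each job
# and keep the top 3 rows (same pattern as above, applied to the sums).
agg = df.groupby(['job', 'source'], as_index=False)['count'].sum()
top3 = (agg.sort_values(['job', 'count'], ascending=[True, False])
           .groupby('job')
           .head(3))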
What you want to do is actually again a groupby (on the result of the first groupby): sort and take the first three elements per group.
Starting from the result of the first groupby:
In [60]: df_agg = df.groupby(['job','source']).agg({'count':sum})
We group by the first level of the index:
In [63]: g = df_agg['count'].groupby('job', group_keys=False)
Then we want to sort each group in descending order and take the first three elements:
In [64]: res = g.apply(lambda x: x.sort_values(ascending=False).head(3))
However, there is a shortcut function that does exactly this, nlargest:
In [65]: g.nlargest(3)
Out[65]:
job source
market A 5
D 4
B 3
sales E 7
C 6
B 4
dtype: int64
So in one go, this looks like:
df_agg['count'].groupby('job', group_keys=False).nlargest(3)
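Putting it all together as one self-contained sketch (the sample frame is reconstructed from the question, so treat it as an assumption):
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({'job': ['sales'] * 5 + ['market'] * 5,
                   'source': list('ABCDE') * 2,
                   'count': [2, 4, 6, 3, 7, 5, 3, 2, 4, 1]})

# Aggregate, then keep the three largest sums within each job
df_agg = df.groupby(['job', 'source']).agg({'count': 'sum'})
print(df_agg['count'].groupby('job', group_keys=False).nlargest(3))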
Here's another example of taking the top 3 in sorted order, and of sorting within the groups:
In [43]: import pandas as pd
In [44]: df = pd.DataFrame({"name":["Foo", "Foo", "Baar", "Foo", "Baar", "Foo", "Baar", "Baar"], "count_1":[5,10,12,15,20,25,30,35], "count_2" :[100,150,100,25,250,300,400,500]})
In [45]: df
Out[45]:
count_1 count_2 name
0 5 100 Foo
1 10 150 Foo
2 12 100 Baar
3 15 25 Foo
4 20 250 Baar
5 25 300 Foo
6 30 400 Baar
7 35 500 Baar
### Top 3 in sorted order:
In [46]: df.groupby(["name"])["count_1"].nlargest(3)
Out[46]:
name
Baar 7 35
6 30
4 20
Foo 5 25
3 15
1 10
dtype: int64
### Sorting within groups based on column "count_1":
In [48]: df.groupby(["name"]).apply(lambda x: x.sort_values(["count_1"], ascending = False)).reset_index(drop=True)
Out[48]:
count_1 count_2 name
0 35 500 Baar
1 30 400 Baar
2 20 250 Baar
3 12 100 Baar
4 25 300 Foo
5 15 25 Foo
6 10 150 Foo
7 5 100 Foo
Try this instead; it is a simple way to do a groupby and sort in descending order:
df.groupby(['companyName'])['overallRating'].sum().sort_values(ascending=False).head(20)
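This answer uses its own column names (companyName, overallRating). Translated to the question's columns, the same pattern would be something like the sketch below; note that it sorts across all groups rather than within each job:
# Sum per (job, source), then sort globally and keep the largest 20
df.groupby(['job', 'source'])['count'].sum().sort_values(ascending=False).head(20)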
If you don't need to sum a column, then use @tvashtar's answer. If you do need to sum, then you can use @joris's answer or this one, which is very similar to it.
df.groupby(['job']).apply(lambda x: (x.groupby('source')
.sum()
.sort_values('count', ascending=False))
.head(3))
When the grouped dataframe contains more than one aggregated column (a multi-indexed result with extra columns), the other approaches shown here drop those extra columns:
import numpy as np
import pandas as pd

edf = pd.DataFrame({"job":["sales", "sales", "sales", "sales", "sales",
                           "market", "market", "market", "market", "market"],
                    "source":["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"],
                    "count":[2, 4, 6, 3, 7, 5, 3, 2, 4, 1],
                    "other_col":[1, 2, 3, 4, 56, 6, 3, 4, 6, 11]})
gdf = edf.groupby(["job", "source"]).agg({"count": sum, "other_col": np.mean})
gdf.groupby(level=0, group_keys=False).apply(lambda g:g.sort_values("count", ascending=False))
This keeps other_col as well, while still ordering by the count column within each group.
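Since the sort keys here are just the 'job' index level plus the 'count' column, the same ordering can be obtained without apply on pandas versions where sort_values accepts index level names (a sketch, not the original answer):
# Sort by the 'job' index level, then by 'count' descending within each job
gdf.sort_values(['job', 'count'], ascending=[True, False])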
I was getting this error without using "by":
TypeError: sort_values() missing 1 required positional argument: 'by'
So, I changed it to this and now it's working:
df.groupby(['job','source']).agg({'count':sum}).sort_values(by='count',ascending=False).head(20)
You can do it in one line -
df.groupby(['job']).apply(lambda x: x.sort_values(['count'], ascending=False).head(3)
.drop('job', axis=1))
What apply() does is take each group produced by the groupby and pass it to the x in the lambda function.
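If the lambda feels cramped, the same thing spelled out with a named function (purely illustrative; the helper name is made up) is:
# Each call receives the sub-DataFrame for one job as `group`
def top3_by_count(group):
    return (group.sort_values('count', ascending=False)
                 .head(3)
                 .drop('job', axis=1))

df.groupby('job').apply(top3_by_count)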
@joris's answer helped a lot.
This is what worked for me.
df.groupby(['job'])['count'].nlargest(3)
I have a dataset, df, where I would like to:
filter the values in the 'id' column, group these values, take their average, and then sum these values
id use free total G_Used G_Free G_Total
a 5 5 10 4 1 5
b 14 6 20 5 1 6
a 10 5 15 9 1 10
c 6 4 10 10 10 20
b 10 5 15 5 5 10
b 5 5 10 1 4 5
c 4 1 5 3 1 4
Desired Output
use free total
12.5 7.5 20
filter to only the rows whose id is 'a' or 'c'
group by each id
take the mean of the 'use', 'free' and 'total' columns
sum these values
Intermediate steps:
keep only the a and c rows
id use free total G_Used G_Free G_Total
a 5 5 10 4 1 5
a 10 5 15 9 1 10
c 6 4 10 10 10 20
c 4 1 5 3 1 4
take mean of a
a
use free total
7.5 5 12.5
take mean of c
c
use free total
5 2.5 7.5
sum both a and c values for final desired output
use free total
12.5 7.5 20
This is what I am doing; however, the syntax is not correct for some of the code. I am still researching. Any suggestion is appreciated.
df1 = df[df.id = 'a' | 'b']
df2 = df1.groupby(['id'], as_index=False).agg({'use': 'mean', 'free': 'mean', 'total': 'mean'})
df3= df2.sum(['id'], axis = 0)
Use Series.isin to test membership first, then select the columns and take the per-group mean; the summed output is a Series that is converted to a one-row DataFrame with Series.to_frame and DataFrame.T for the transpose:
df1 = df[df.id.isin(['a','c'])]
df2 = df1.groupby('id')[['use','free','total']].mean().sum().to_frame().T
Your solution is similar, just using GroupBy.agg:
df1 = df[df.id.isin(['a','c'])]
df2 = df1.groupby('id').agg({'use': 'mean', 'free': 'mean', 'total': 'mean'}).sum().to_frame().T
print (df2)
use free total
0 12.5 7.5 20.0
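For reference, a self-contained sketch of this approach, with the sample rows reconstructed from the question:
import pandas as pd

# Sample rows reconstructed from the question
df = pd.DataFrame({'id':      ['a', 'b', 'a', 'c', 'b', 'b', 'c'],
                   'use':     [5, 14, 10, 6, 10, 5, 4],
                   'free':    [5, 6, 5, 4, 5, 5, 1],
                   'total':   [10, 20, 15, 10, 15, 10, 5],
                   'G_Used':  [4, 5, 9, 10, 5, 1, 3],
                   'G_Free':  [1, 1, 1, 10, 5, 4, 1],
                   'G_Total': [5, 6, 10, 20, 10, 5, 4]})

# Keep only ids 'a' and 'c', take per-id means, then sum across the ids
df1 = df[df['id'].isin(['a', 'c'])]
df2 = df1.groupby('id')[['use', 'free', 'total']].mean().sum().to_frame().T
print(df2)  # use 12.5, free 7.5, total 20.0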
I have this data:
import pandas as pd
import numpy as np
index = pd.MultiIndex.from_tuples(list(zip(*[['one', 'one', 'two', 'two'],['foo', 'bar', 'foo', 'bar']])))
df = pd.DataFrame(np.arange(12).reshape((3,4)), columns=index)
one two
foo bar foo bar
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
Is there a way to do simple vectorized calculations (like addition) over the level-1 columns within each level-0 column group, without having to reference the specific column pairs like:
df[('one','add')] = df[('one','foo')]+df[('one','bar')]
I'd like to get
one two
foo bar add foo bar add
0 0 1 1 2 3 5
1 4 5 9 6 7 13
2 8 9 17 10 11 21
I fiddled around with it for a bit and here is a one-liner that solves the problem in my opinion. It's fully vectorized and doesn't address specific column names. It also puts the add column in the right place.
df.stack(0).assign(add=df.stack(0).sum(axis=1)).stack(0).unstack(0).T
Unfortunately, because of the property of stack / unstack to do the stacking / unstacking into the innermost level, it needs the cryptic .stack(0).unstack(0) operation. It seems like those two operations should cancel each other out, but they actually shuffle the index levels while preserving order.
Here is the same thing split into 3 lines, without the assign statement.
df = df.stack(0)
df['add'] = df.sum(axis=1)
df = df.stack(0).unstack(0).T
Use pandas.DataFrame.sum with axis=1 and level=0:
df2 = df.sum(axis=1, level=0)
print(df2)
Output:
one two
0 1 5
1 9 13
2 17 21
You can then rename the new columns and combine the result with pandas.concat:
df2.columns = [(c, "add") for c in df2]
df2 = pd.concat([df, df2], axis=1).sort_index(axis=1)
print(df2)
Output:
one two
add bar foo add bar foo
0 1 1 0 5 3 2
1 9 5 4 13 7 6
2 17 9 8 21 11 10
An alternative solution here, using the same sum approach but without pd.concat:
df[("one", "add")] = None
df[("two", "add")] = None
df.iloc[:, -2:] = df.sum(axis=1, level=0).to_numpy()
df.sort_index(axis=1)
one two
add bar foo add bar foo
0 1.0 1 0 5.0 3 2
1 9.0 5 4 13.0 7 6
2 17.0 9 8 21.0 11 10
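Note that on recent pandas versions the level argument of sum has been removed; under that assumption, a roughly equivalent replacement for the df.sum(axis=1, level=0) call used above is:
# Transpose, group the level-0 labels (now on the rows), sum, transpose back
sums = df.T.groupby(level=0).sum().T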
Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after the groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more effective/elegant approach to do this? And is there a more elegant approach to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
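The row_number()-style numbering asked about can be mimicked with groupby().cumcount(); a minimal sketch (not part of this answer):
# 0-based row number within each id, analogous to SQL's row_number()
top2 = (df.assign(rn=df.groupby('id').cumcount())
          .query('rn < 2')
          .drop(columns='rn'))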
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
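For example, dropping that leftover index level as described above would look like this (a small illustrative follow-up):
# Top-2 values per id, with the original-index level removed
top2 = df.groupby('id')['value'].nlargest(2).reset_index(level=1, drop=True)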
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group by first and take the top k within each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x : x.sort_values(by = 'value', ascending = False).head(2).reset_index(drop = True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same number we would pass to nlargest: how many rows to keep per group.
reset_index is optional and not strictly necessary.
This works for duplicated values
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, for the Audit department we get the top 3 salaries as 110k, 100k and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the first 2 rows of each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here. On a sample with 100k rows and 8000 groups, a %timeit test showed them to be 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])
I currently have a dataset with two index levels, year and zip code, but multiple observations (prices) per zip code. How can I get the average price per zip code, so that I only have distinct observations per zip code and year?
Screenshot of current table
Use mean with the level parameter:
df = s.mean(level=[0,1])
Sample:
s = pd.DataFrame({
'B':[5,5,4,5,5,4],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}).set_index(['F','B'])['E']
print (s)
F B
a 5 5
5 3
4 6
b 5 9
5 2
4 4
Name: E, dtype: int64
df = s.mean(level=[0,1]).reset_index()
print (df)
F B E
0 a 5 4.0
1 a 4 6.0
2 b 5 5.5
3 b 4 4.0
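On newer pandas, where the level argument of mean has been removed, grouping by the index levels gives the same result (a sketch under that assumption):
# sort=False keeps the groups in their order of appearance
df = s.groupby(level=[0, 1], sort=False).mean().reset_index()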
I'm new to Pandas, so I'm wondering if there is a more Pandithic (coining it!) way to sort some data, group it, and then sum part of it. The problem is to find the 3 largest values in a series of values and then sum only them.
census_cp is a dataframe with information about counties of states. My current solution is:
cen_sort = census_cp.groupby('STNAME').head(3)
cen_sort = cen_sort.groupby('STNAME').sum().sort_values(by='CENSUS2010POP', ascending=False).head(n=3)
cen_sort = cen_sort.reset_index()
print(cen_sort['STNAME'].values.tolist())
I'm specifically curious whether there is a better way to do this, and also why I can't put the sum at the end of the previous line and chain together what seem to me to be obviously connected operations (get the top 3 of each and add them together).
I think you can use groupby with head and sum first, and then nlargest:
df = (census_cp.groupby('STNAME')
               .apply(lambda x: x.head(3).sum(numeric_only=True))
               .reset_index()
               .nlargest(3, 'CENSUS2010POP'))
Sample:
census_cp = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
'CENSUS2010POP':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})
print (census_cp)
CENSUS2010POP STNAME
0 4 a
1 5 b
2 6 s
3 5 c
4 6 s
5 2 c
6 3 b
7 4 c
8 5 d
9 6 b
10 4 c
11 5 s
12 4 s
13 3 c
14 6 a
15 5 e
df = census_cp.groupby('STNAME') \
.apply(lambda x: x.head(3).sum(numeric_only=True)) \
.reset_index() \
.nlargest(3, 'CENSUS2010POP')
print (df)
STNAME CENSUS2010POP
5 s 17
1 b 14
2 c 11
If instead you need to take the top 3 values within each group (nlargest), sum them, and then take the top 3 of those summed values, use:
df1 = (census_cp.groupby('STNAME')['CENSUS2010POP']
                .apply(lambda x: x.nlargest(3).sum())
                .nlargest(3)
                .reset_index())
print (df1)
STNAME CENSUS2010POP
0 s 17
1 b 14
2 c 13
Or:
df1 = (census_cp.groupby('STNAME')['CENSUS2010POP'].nlargest(3)
                .groupby(level=0)
                .sum()
                .nlargest(3)
                .reset_index())
print (df1)
print (df1)
STNAME CENSUS2010POP
0 s 17
1 b 14
2 c 13