I want to group my dataframe by two columns and then sort the aggregated results within the groups.
In [167]: df
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E
In [168]: df.groupby(['job','source']).agg({'count':sum})
Out[168]:
               count
job    source
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7
I would now like to sort the count column in descending order within each of the groups, and then take only the top three rows, to get something like:
job     source  count
market  A           5
        D           4
        B           3
sales   E           7
        C           6
        B           4
I also want to order the groups themselves by job, so that if the sum of count for sales is larger, the sales group is printed first:
job     source  count
sales   E           7
        C           6
        B           4
market  A           5
        D           4
        B           3
I am unable to get the top rows within each job.
IIUC, we can do a further groupby and use nlargest(3) to get the top n values. Then we can build an ordered list of the jobs and use it to create a categorical column for sorting.
s = df.groupby(['job','source']).agg({'count':sum}).groupby(level=0)['count']\
.nlargest(3).reset_index(0,drop=True).to_frame()
# see which of your indices is higher and create a sorting list.
sorter = s.groupby(level=0)['count'].sum().sort_values(ascending=False).index
#Index(['sales', 'market'], dtype='object', name='job')
s['sort'] = pd.Categorical(s.index.get_level_values(0),sorter)
df2 = s.sort_values('sort').drop('sort',axis=1)
print(df2)
               count
job    source
sales  E           7
       C           6
       B           4
market A           5
       D           4
       B           3
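As a possible shortcut (an untested sketch), instead of adding the categorical 'sort' column you could reorder s directly with the sorter index, since .loc with a list of first-level labels returns the groups in the order given:

df2 = s.loc[sorter]  # reorders the first index level to ['sales', 'market']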
You could use sort_values, as mentioned in another similar answer, to sort after the aggregation, and then group by job again and take head(3) to get the top N per job, like:
>>> df
   count     job source
0      2   sales      A
1      4   sales      B
2      6   sales      C
3      3   sales      D
4      7   sales      E
5      5  market      A
6      3  market      B
7      2  market      C
8      4  market      D
9      1  market      E
>>> agg = df.groupby(['job','source']).agg({'count':sum})
>>> agg
               count
job    source
market A           5
       B           3
       C           2
       D           4
       E           1
sales  A           2
       B           4
       C           6
       D           3
       E           7
>>> agg.reset_index().sort_values(['job', 'count'], ascending=False).set_index(['job', 'source']).groupby('job').head(3)
               count
job    source
sales  E           7
       C           6
       B           4
market A           5
       D           4
       B           3
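Note that sort_values(['job', 'count'], ascending=False) puts sales first only because 'sales' sorts after 'market' alphabetically; it coincides with ordering by total count in this example. If the groups must be ordered by their summed count, as the question asks, one possible sketch (reusing agg from above) is:

top = (agg.reset_index()
          .sort_values('count', ascending=False)
          .groupby('job').head(3))
# order the jobs by the total count of their selected rows
order = top.groupby('job')['count'].sum().sort_values(ascending=False).index
top = top.set_index('job').loc[order].set_index('source', append=True)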
This is a tricky one and I'm having a difficult time aggregating this data by week. So, starting on 5/26/20, what is the total quantity for each week? That is the desired dataframe. My data has 3 months' worth of data points, where some products have 0 quantities in a given week, and this needs to be reflected in the desired df.
Original DF:
Product Date Qty
A 5/26/20 4
A 5/28/20 2
A 5/31/20 2
A 6/02/20 1
A 6/03/20 5
A 6/05/20 2
B 5/26/20 1
B 5/27/20 8
B 6/02/20 2
B 6/06/20 10
B 6/14/20 7
Desired DF
Product Week Qty
A 1 9
A 2 7
A 3 0
B 1 11
B 2 10
B 3 7
We can do it with transform: subtract each product's minimum date from its dates to get the week number.
s = (df.Date-df.groupby('Product').Date.transform('min')).dt.days//7 + 1
s = df.groupby([df.Product, s]).Qty.sum().unstack(fill_value=0).stack().reset_index()
s
Out[348]:
  Product  Date   0
0       A     1   8
1       A     2   8
2       A     3   0
3       B     1   9
4       B     2  12
5       B     3   7
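Note that this assumes Date is already a datetime column; if it was read in as strings (as in the sample above), you would convert it first, for example (the format string here is an assumption based on the sample dates):

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')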
I have two columns in a dataframe: one contains strings (countries) and the other contains integers related to each country. How do I find which country has the biggest value using Python pandas?
Setup
df = pd.DataFrame(dict(Num=[*map(int, '352741845')], Country=[*'ABCDEFGHI']))
df
   Num Country
0    3       A
1    5       B
2    2       C
3    7       D
4    4       E
5    1       F
6    8       G
7    4       H
8    5       I
idxmax
df.loc[[df.Num.idxmax()]]

   Num Country
6    8       G

nlargest
df.nlargest(1, columns=['Num'])

   Num Country
6    8       G

sort_values and tail
df.sort_values('Num').tail(1)

   Num Country
6    8       G
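One caveat worth noting: if several countries tie for the maximum, idxmax and tail(1) return only one of them, while nlargest can keep all tied rows via its keep parameter, e.g.:

df.nlargest(1, columns=['Num'], keep='all')  # returns every row tied for the max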
Hi, I am dealing with some data using pandas.
I am facing a problem, but here I'll try to simplify it.
Suppose I have a dataset that looks like this:
   # Incidents Place  Month
0            3     A      1
1            5     B      1
2            2     C      2
3            2     B      2
4            6     C      3
5            3     A      1
So I want to sum the # of incidents by place, that is, I want to have a result like

P  #
A  6 (3+3)
B  7 (5+2)
C  8 (2+6)

stored in a pandas DataFrame. I don't care about other columns at this point.
Next, if I want to use the data in the Month column as well, I'd like to have a result that looks like

P  M  #
A  1  6 (3+3)
B  1  5
B  2  2
C  2  2
C  3  6
How can I achieve these results in pandas? I have tried groupby and some other functions, but I can't get to the result I want.
Any help is appreciated!
You can do it in this way:
In [35]: df
Out[35]:
   # Incidents Place  Month
0            3     A      1
1            5     B      1
2            2     C      2
3            2     B      2
4            6     C      3
5            3     A      1
In [36]: df.groupby('Place')['# Incidents'].sum().reset_index()
Out[36]:
  Place  # Incidents
0     A            6
1     B            7
2     C            8
In [37]: df.groupby(['Place', 'Month'])['# Incidents'].sum().reset_index()
Out[37]:
  Place  Month  # Incidents
0     A      1            6
1     B      1            5
2     B      2            2
3     C      2            2
4     C      3            6
See the pandas groupby documentation for lots of examples.
I want to group by two columns and get a cumulative group count. I looked for relevant code but couldn't find it; based on a few hints I coded something up, but it ends with an error. Can this be solved?
ID  ABC   XYZ
 1    A  .512
 2    A  .123
 3    B  .999
 4    B  .999
 5    B  .999
 6    C  .456
 7    C  .456
 8    C  .888
 9    d  .888
10    d  .888
The output should be as below (whenever either ABC or XYZ takes a new value, the counter should be incremented).
ID  ABC   XYZ  GID
 1    A  .123    1
 2    A  .512    2
 3    B  .999    3
 4    B  .999    3
 5    B  .999    3
 6    C  .456    4
 7    C  .456    4
 8    C  .888    5
 9    d  .888    6
10    d  .888    6
The code is as below
DF=DF.sort(['ABC','XYZ'] ,ascending = [1,0])
DF['GID'] = DF.groupby('ABC','XYZ').cumcount()
But it is ending up with an Error:
ValueError: No axis named XYZ for object type
I got the desired results like this.
c1 = DF.ABC != DF.ABC.shift()
c2 = DF.XYZ != DF.XYZ.shift()
DF['GID'] = (c1 | c2).cumsum()
DF
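For reference, the original error comes from DF.groupby('ABC','XYZ'): the second positional argument of groupby is axis, so 'XYZ' is interpreted as an axis name, and the columns must instead be passed as a list, DF.groupby(['ABC','XYZ']). With that in mind, a possible one-liner (a sketch, assuming groups should be numbered in order of appearance) is GroupBy.ngroup:

DF['GID'] = DF.groupby(['ABC', 'XYZ'], sort=False).ngroup() + 1  # 1-based group id in order of appearance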
Table 1
Category Date Value
A 01/01/2015 4
A 02/01/2015 1
B 01/01/2015 6
B 02/01/2015 7
Table 1 above has the values for each category organized by month.
Table 2
Category Date Value
A 03/01/2015 10
C 03/01/2015 66
D 03/01/2015 9
Suppose table 2 comes in, which has the values for each category in March, 2015.
Table 3
Category Date Value
A 01/01/2015 4
A 02/01/2015 1
A 03/01/2015 10
B 01/01/2015 6
B 02/01/2015 7
B 03/01/2015 0
C 01/01/2015 0
C 02/01/2015 0
C 03/01/2015 66
D 01/01/2015 0
D 02/01/2015 0
D 03/01/2015 9
I want to "outer-join" the two tables "vertically" on Python:
If Table2 has a category that Table1 doesn't have, then it adds that category to Table3 and assign a value of 0 for 01/01/2015 and 02/01/2015. Also, the category that is in table1 but not in table2 will also be added in table 3, by assigning a value of 0 for 03/01/2015. If both have the same categories, they will just be added vertically with the values in the table1 and table2.
Any advice or help will be greatly appreciated.. I've been thinking about this all day and still can't find an efficient way to do this.
Thanks so much!
I would do this using Pandas as follows (I'll call your tables df1 and df2):
First prepare the list of dates and categories for the final table using set together with concat to select only the unique values from your original tables:
import itertools
import numpy as np
import pandas as pd

dates = set(pd.concat([df1.Date, df2.Date]))
cats = set(pd.concat([df1.Category, df2.Category]))
Then we create the landing table by iterating through these sets (that's where itertools.product is a neat trick although note that you have to cast it as a list to feed it into the dataframe constructor):
df3 = pd.DataFrame(list(itertools.product(cats,dates)),columns = ['Category','Date'])
df3
Out[88]:
   Category        Date
0         D  01/01/2015
1         D  03/01/2015
2         D  02/01/2015
3         C  01/01/2015
4         C  03/01/2015
5         C  02/01/2015
6         A  01/01/2015
7         A  03/01/2015
8         A  02/01/2015
9         B  01/01/2015
10        B  03/01/2015
11        B  02/01/2015
Finally we bring in the values using merge (with how='left'), using np.fmax to consolidate the two sets (you have to use fmax instead of max so that the NaNs are ignored):
df3['Value'] = np.fmax(pd.merge(df3,df1,how='left')['Value'],pd.merge(df3,df2,how='left')['Value'])
df3
Out[127]:
   Category        Date  Value
0         D  01/01/2015    NaN
1         D  03/01/2015      9
2         D  02/01/2015    NaN
3         C  01/01/2015    NaN
4         C  03/01/2015     66
5         C  02/01/2015    NaN
6         A  01/01/2015      4
7         A  03/01/2015     10
8         A  02/01/2015      1
9         B  01/01/2015      6
10        B  03/01/2015    NaN
11        B  02/01/2015      7
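To match Table 3 exactly, you would still replace the remaining NaNs with 0 and, if desired, sort the rows; a final touch-up could look like this (a sketch; note that sorting by the Date strings only gives chronological order here because all dates share the same zero-padded format and year):

df3['Value'] = df3['Value'].fillna(0)
df3 = df3.sort_values(['Category', 'Date']).reset_index(drop=True)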