Pandas how to aggregate more than one column - python

Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid and count the 'order_id' column and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code, so that the resulting df looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
        size            sum
         fee order_id   fee order_id
userid
1          3        3     8        6
2          2        2     4        9
Is it possible to tell pandas to apply np.size to one column and np.sum to the other, rather than both functions to both columns?

Use DataFrameGroupBy.agg with a dict that maps each column to its function, then rename the columns:
d = {'order_id':'counts','fee':'sum'}
df = (test.groupby('userid')
          .agg({'order_id':'count', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
   userid  counts  sum
0       1       3    8
1       2       2    4
It is better to aggregate with size, though, because count excludes NaN values while size counts every row:
df = (test.groupby('userid')
          .agg({'order_id':'size', 'fee':'sum'})
          .rename(columns=d)
          .reset_index())
print (df)
   userid  counts  sum
0       1       3    8
1       2       2    4
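The difference between count and size only shows up when values are missing. Below is a minimal sketch using named aggregation (keyword arguments to agg) instead of a dict plus rename. The NaN in order_id and the extra rows column are assumptions added purely for illustration; they are not part of the original data.

```python
import numpy as np
import pandas as pd

# Same frame as in the question, except that one order_id is
# replaced by NaN (hypothetical) to expose count vs. size.
test = pd.DataFrame({
    'userid': [1, 1, 1, 2, 2],
    'order_id': [1, 2, np.nan, 4, 5],
    'fee': [2, 1, 5, 3, 1],
})

# Named aggregation: output_name=(input_column, function).
df = (
    test.groupby('userid')
        .agg(counts=('order_id', 'count'),  # skips NaN
             rows=('order_id', 'size'),     # counts every row
             sum=('fee', 'sum'))
        .reset_index()
)
print(df)
```

For userid 1 the counts column is one lower than rows, because count ignores the missing order_id.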

Related

Combining two pandas dataframes into one based on conditions

I got two dataframes, simplified they look like this:
Dataframe A
ID  item
1   apple
2   peach
Dataframe B
ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values of dataframe B and add them to dataframe A in separate columns depending on their flag value.
The result should look similar to this
Dataframe C
ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried to split dataframe B into two dataframes by flag value and then merge them with dataframe A, but there must be an easier solution.
Thank you in advance! :)
*edit: removed the pictures
You can use pd.merge and pd.pivot_table for this:
df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
ID item price_A price_B
0 1 apple 3 2
1 2 peach 2 4
Alternatively, as a single method chain (here the frames are named dfa and dfb):
(dfb
.merge(dfa, on="ID")
.pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
.add_prefix("price_")
.reset_index()
)
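For reference, here is a self-contained sketch of the same reshaping. It rebuilds the two frames from the question and swaps pivot_table for set_index plus unstack, which does no aggregation and therefore keeps the integer prices as they are. It assumes each (ID, flag) pair occurs only once; with duplicates, unstack raises an error and pivot_table would be the better fit.

```python
import pandas as pd

# Frames as described in the question.
df_A = pd.DataFrame({'ID': [1, 2], 'item': ['apple', 'peach']})
df_B = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'flag': ['A', 'B', 'B', 'A'],
    'price ($)': [3, 2, 4, 2],
})

merged = df_A.merge(df_B, on='ID')

# Move 'flag' from the row index into the columns, one column per vendor.
df_C = (
    merged.set_index(['ID', 'item', 'flag'])['price ($)']
          .unstack('flag')
          .add_prefix('price_')
          .reset_index()
)
df_C.columns.name = None  # drop the leftover 'flag' label on the column axis
print(df_C)
```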

Pandas in Python: how to exclude results with a count == 1?

Here's the code I currently have:
df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
which returns something like the following (I've made these values up):
CRIME_RATING
mean count
0 3.000000 1
1 3.118397 39
2 2.790698 32
3 5.125000 18
4 4.000000 1
5 4.222222 22
but I'd quite like to exclude indexes 0 and 4 from the resulting dataframe given that they both have a count of 1. Can this be done?
Use Series.ne to keep only the rows whose count is not equal to 1. Select the MultiIndex column with a tuple and filter with boolean indexing:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg({'CRIME_RATING': ['mean', 'count']}).reset_index()
df2 = df1[df1[('CRIME_RATING','count')].ne(1)]
If you want to avoid the MultiIndex, use named aggregation:
df1 = df.groupby(df['LOCAL_COUNCIL']).agg(mean = ('CRIME_RATING','mean'),
                                          count = ('CRIME_RATING','count'))
df2 = df1[df1['count'].ne(1)]
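A small runnable sketch of the named-aggregation variant follows. The council names and ratings are made up for illustration, since the question does not show its input data.

```python
import pandas as pd

# Hypothetical input; only the column names come from the question.
df = pd.DataFrame({
    'LOCAL_COUNCIL': ['North', 'South', 'South', 'East', 'East', 'East'],
    'CRIME_RATING': [3, 2, 4, 5, 4, 6],
})

stats = df.groupby('LOCAL_COUNCIL').agg(
    mean=('CRIME_RATING', 'mean'),
    count=('CRIME_RATING', 'count'),
)

# Keep only the councils with more than one rating, then flatten the index.
result = stats[stats['count'] > 1].reset_index()
print(result)
```

Use bracket access (stats['count']) rather than attribute access here, because stats.count refers to the DataFrame.count method.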

pandas reset index after performing groupby and retain selective columns

I want to take a pandas dataframe, do a count of unique elements by a column and retain 2 of the columns. But I get a multi-index dataframe after groupby which I am unable to (1) flatten (2) select only relevant columns. Here is my code:
import pandas as pd
df = pd.DataFrame({
'ID':[1,2,3,4,5,1],
'Ticker':['AA','BB','CC','DD','CC','BB'],
'Amount':[10,20,30,40,50,60],
'Date_1':['1/12/2018','1/14/2018','1/12/2018','1/14/2018','2/1/2018','1/12/2018'],
'Random_data':['ax','','nan','','by','cz'],
'Count':[23,1,4,56,34,53]
})
df2 = df.groupby(['Ticker']).agg(['nunique'])
df2.reset_index()
print(df2)
df2 still comes out with two levels of index. And has all the columns: Amount, Count, Date_1, ID, Random_data.
How do I reduce it to one level of index?
And retain only ID and Random_data columns?
Try this instead:
1) Select only the relevant columns (['ID', 'Random_data'])
2) Don't pass a list to .agg - just 'nunique' - the list is what is causing the multi index behaviour.
df2 = df.groupby(['Ticker'])[['ID', 'Random_data']].agg('nunique')
df2.reset_index()
Ticker ID Random_data
0 AA 1 1
1 BB 2 2
2 CC 2 2
3 DD 1 1
Use DataFrameGroupBy.nunique and select the columns with a list after groupby:
df2 = df.groupby('Ticker')[['Date_1','Count','ID']].nunique().reset_index()
print(df2)
Ticker Date_1 Count ID
0 AA 1 1 1
1 BB 2 2 2
2 CC 2 2 2
3 DD 1 1 1
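To see why passing a list to agg matters, the sketch below runs both forms on a trimmed copy of the question's frame and compares the column index of each result. The variable names multi and flat are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1],
    'Ticker': ['AA', 'BB', 'CC', 'DD', 'CC', 'BB'],
    'Random_data': ['ax', '', 'nan', '', 'by', 'cz'],
})

grouped = df.groupby('Ticker')[['ID', 'Random_data']]

# A list of functions adds a second column level: ('ID', 'nunique'), ...
multi = grouped.agg(['nunique'])

# A single function name keeps the columns flat.
flat = grouped.agg('nunique').reset_index()

print(multi.columns.nlevels)  # 2
print(flat.columns.tolist())  # ['Ticker', 'ID', 'Random_data']
```

Note that the empty string and the literal text 'nan' both count as ordinary values for nunique; only real missing values (NaN/None) are skipped.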

How to count data in a column based on another column separately?

I have two dataframes like this:
df1 = pd.DataFrame({'a':[1,2]})
df2 = pd.DataFrame({'a':[1,1,1,2,2,3,4,5,6,7,8]})
I want to count how often each of the two numbers in df1 occurs in df2. The correct answer looks like:
No Amount
1 3
2 2
Instead of:
No Amount
1 5
2 5
How can I solve this problem?
First filter df2 for values that are contained in df1['a'], then apply value_counts. The rest of the code just presents the data in your desired format.
result = (
df2[df2['a'].isin(df1['a'].unique())]['a']
.value_counts()
.reset_index()
)
result.columns = ['No', 'Amount']
>>> result
No Amount
0 1 3
1 2 2
From pandas 0.21.0 on you can use set_axis to rename columns within a method chain. Here's a one-line solution:
(df2[df2.a.isin(df1.a)]
    .squeeze()
    .value_counts()
    .reset_index()
    .set_axis(['No','Amount'], axis=1))
Output:
No Amount
0 1 3
1 2 2
You can also compute value_counts on the second df and map the result onto the first df:
df1['Amount'] = df1['a'].map(df2['a'].value_counts())
df1 = df1.rename(columns={'a':'No'})
Output :
No Amount
0 1 3
1 2 2
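One edge case with the map approach: a value in df1 that never occurs in df2 maps to NaN. The sketch below adds a hypothetical value 9 to df1 to show this and fills the gap with 0. The fillna/astype step is an addition, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 9]})  # 9 is hypothetical and absent from df2
df2 = pd.DataFrame({'a': [1, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8]})

counts = df2['a'].value_counts()  # Series indexed by value

# Values missing from df2 map to NaN, so fill them and restore an int dtype.
df1['Amount'] = df1['a'].map(counts).fillna(0).astype(int)
result = df1.rename(columns={'a': 'No'})
print(result)
```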

python pandas pivot_table count frequency in one column

I am still new to Python pandas' pivot_table and would like to ask for a way to count the frequencies of values in one column, grouped by an ID in another column. The DataFrame looks like the following.
import pandas as pd
df = pd.DataFrame({'Account_number':[1,1,2,2,2,3,3],
'Product':['A', 'A', 'A', 'B', 'B','A', 'B']
})
For the output, I'd like to get something like the following:
Product
A B
Account_number
1 2 0
2 1 2
3 1 1
So far, I tried this code:
df.pivot_table(rows = 'Account_number', cols= 'Product', aggfunc='count')
This code gives me the same table twice. What is the problem with the code above? Part of the reason I am asking is that this DataFrame is just an example. The real data I am working on has tens of thousands of account numbers.
You need to specify the aggfunc as len:
In [11]: df.pivot_table(index='Account_number', columns='Product',
aggfunc=len, fill_value=0)
Out[11]:
Product A B
Account_number
1 2 0
2 1 2
3 1 1
It looks like count is counting the instances of each column (Account_number and Product); it's not clear to me whether this is a bug...
Solution: Use aggfunc='size'
Using aggfunc=len or aggfunc='count' like all the other answers on this page will not work for DataFrames with more than three columns. By default, pandas will apply this aggfunc to all the columns not found in index or columns parameters.
For instance, if we had two more columns in our original DataFrame defined like this:
df = pd.DataFrame({'Account_number':[1, 1, 2 ,2 ,2 ,3 ,3],
'Product':['A', 'A', 'A', 'B', 'B','A', 'B'],
'Price': [10] * 7,
'Quantity': [100] * 7})
Output:
Account_number Product Price Quantity
0 1 A 10 100
1 1 A 10 100
2 2 A 10 100
3 2 B 10 100
4 2 B 10 100
5 3 A 10 100
6 3 B 10 100
If you apply the current solutions to this DataFrame, you would get the following:
df.pivot_table(index='Account_number',
columns='Product',
aggfunc=len,
fill_value=0)
Output:
Price Quantity
Product A B A B
Account_number
1 2 0 2 0
2 1 2 1 2
3 1 1 1 1
Solution
Instead, use aggfunc='size'. Since size always returns the same number for each column, pandas does not call it on every single column and just does it once.
df.pivot_table(index='Account_number',
columns='Product',
aggfunc='size',
fill_value=0)
Output:
Product A B
Account_number
1 2 0
2 1 2
3 1 1
In newer versions of pandas a slight modification is required: the rows and cols parameters of pivot_table were renamed to index and columns. It took me some time to figure this out, so I am adding it here so that others can use it directly.
df.pivot_table(index='Account_number', columns='Product', aggfunc=len,
fill_value=0)
You can use count:
df.pivot_table(index='Account_number', columns='Product', aggfunc='count')
I know this question is about pivot_table but for the problem given in the question, we can use crosstab:
out = pd.crosstab(df['Account_number'], df['Product'])
Output:
Product A B
Account_number
1 2 0
2 1 2
3 1 1
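As a quick check that crosstab is unaffected by extra columns, the sketch below uses the four-column frame from the earlier answer and compares crosstab with a plain groupby followed by size and unstack. The groupby route is an equivalent formulation added for comparison, not something proposed in the answers above.

```python
import pandas as pd

df = pd.DataFrame({
    'Account_number': [1, 1, 2, 2, 2, 3, 3],
    'Product': ['A', 'A', 'A', 'B', 'B', 'A', 'B'],
    'Price': [10] * 7,
    'Quantity': [100] * 7,
})

# crosstab only looks at the two Series it is given.
by_crosstab = pd.crosstab(df['Account_number'], df['Product'])

# Same table: count rows per (account, product) pair, then spread
# the products across columns, filling absent pairs with 0.
by_groupby = (
    df.groupby(['Account_number', 'Product'])
      .size()
      .unstack(fill_value=0)
)

print(by_crosstab)
print(by_groupby)
```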
