SQL: SELECT MAX(A), MIN(B), C FROM Table GROUP BY C
I want to do the same operation in pandas on a DataFrame. The closest I got was:
DF2 = DF1.groupby(by=['C']).max()
but that gives me the max of both columns. How do I apply a different aggregation to each column while grouping?
You can use the agg function:
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
Sample:
print(DF1)
A B C D
0 1 5 a a
1 7 9 a b
2 2 10 c d
3 3 2 c c
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
print(DF2)
A B
C
a 7 5
c 3 2
See also GroupBy-fu: improvements in grouping and aggregating data in pandas for a nice explanation.
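If you are on a newer pandas (0.25+), named aggregation gives the same result with explicit output column names; a minimal sketch using the sample data above, where the A_max and B_min names are just illustrative:
import pandas as pd

DF1 = pd.DataFrame({'A': [1, 7, 2, 3],
                    'B': [5, 9, 10, 2],
                    'C': ['a', 'a', 'c', 'c'],
                    'D': ['a', 'b', 'd', 'c']})

# Each keyword becomes an output column: name=(source column, aggregation)
DF2 = DF1.groupby('C').agg(A_max=('A', 'max'), B_min=('B', 'min'))
print(DF2)
#    A_max  B_min
# C
# a      7      5
# c      3      2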
Try the agg() function:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=list('ABC'))
print(df)
print(df.groupby('C').agg({'A': max, 'B':min}))
Output:
A B C
0 2 3 0
1 2 2 1
2 4 0 1
3 0 1 4
4 3 3 2
5 0 4 3
6 2 4 2
7 3 4 0
8 4 2 2
9 3 2 1
10 2 3 1
11 4 1 0
12 4 3 2
13 0 0 1
14 3 1 1
15 4 1 1
16 0 0 0
17 4 0 1
18 3 4 0
19 0 2 4
A B
C
0 4 0
1 4 0
2 4 2
3 0 4
4 0 1
Alternatively, you may want to look at the pandas.read_sql_query() function, which can run the SQL directly against a database connection.
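A minimal sketch of that approach, assuming the table lives in a SQLite file named example.db and is called my_table (both names are placeholders):
import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')  # hypothetical database file
query = "SELECT MAX(A) AS A, MIN(B) AS B, C FROM my_table GROUP BY C"
DF2 = pd.read_sql_query(query, conn)  # runs the SQL and returns a DataFrame
conn.close()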
You can use the agg function
import pandas as pd
import numpy as np
df.groupby('something').agg({'column1': np.max, 'column2': np.min})
Related
I have a dataframe like this:
Index  A  B  C  D  E
0      4  2  4  4  1
1      1  4  1  4  4
2      3  1  2  0  1
3      1  0  2  2  4
4      0  1  1  0  2
I want to square each cell in a row, add the squares up, and put the result in a column "sum of squares". How do I do that?
I expect this result:
Index  A  B  C  D  E  sum of squares
0      4  2  4  4  1  53
1      1  4  1  4  4  50
2      3  1  2  0  1  15
3      1  0  2  2  4  25
4      0  1  1  0  2  6
You can do this with apply() and sum().
Code:
import pandas as pd

lis = [(4, 2, 4, 4, 1),
       (1, 4, 1, 4, 4),
       (3, 1, 2, 0, 1),
       (1, 0, 2, 2, 4),
       (0, 1, 1, 0, 2)]
df = pd.DataFrame(lis, columns=['A', 'B', 'C', 'D', 'E'])

# Main code
new = df.apply(lambda num: num**2)      # square of each value, stored in new
df['sum_of_squares'] = new.sum(axis=1)  # row-wise sum of the squares
print(df)
Output:
A B C D E sum_of_squares
0 4 2 4 4 1 53
1 1 4 1 4 4 50
2 3 1 2 0 1 15
3 1 0 2 2 4 25
4 0 1 1 0 2 6
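As a side note, the same result can be obtained without apply by squaring the whole frame at once; a minimal sketch, assuming df holds the five numeric columns A to E as above:
# Vectorized alternative: square the selected columns, then sum across each row
df['sum_of_squares'] = (df[['A', 'B', 'C', 'D', 'E']] ** 2).sum(axis=1)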
From the dataframe
import pandas as pd
df1 = pd.DataFrame({'A':[1,1,1,1,2,2,2,2],'B':[1,2,3,4,5,6,7,8]})
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 5
5 2 6
6 2 7
7 2 8
I want to pop 2 rows where 'A' == 2, preferably in a single statement like
df2 = df1.somepopfunction(...)
to generate the following result:
print(df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
4 2 7
5 2 8
print(df2)
A B
0 2 5
1 2 6
The pandas pop function sounds promising, but it only pops complete columns.
What statement can replace the pseudocode
df2 = df1.somepopfunction(...)
to generate the desired results?
A pop function for removing rows does not exist in pandas; you need to filter first and then drop the filtered rows from df1:
df2 = df1[df1.A.eq(2)].head(2)
print (df2)
A B
4 2 5
5 2 6
df1 = df1.drop(df2.index)
print (df1)
A B
0 1 1
1 1 2
2 1 3
3 1 4
6 2 7
7 2 8
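If you want it to read like a single pop call, a small helper can wrap the filter-and-drop steps; note that pop_rows is a hypothetical name, not an existing pandas method:
def pop_rows(df, mask, n=None):
    # Return the matching rows (optionally only the first n)
    # and drop them from the original frame in place.
    popped = df[mask] if n is None else df[mask].head(n)
    df.drop(popped.index, inplace=True)
    return popped

df2 = pop_rows(df1, df1.A.eq(2), n=2)
Like the filter-and-drop approach above, this keeps the original index labels rather than renumbering them.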
I've generated the following dummy data, where the number of rows per id ranges from 1 to 5:
import pandas as pd, numpy as np
import random
import functools
import operator
uid = functools.reduce(operator.iconcat, [np.repeat(x,random.randint(1,5)).tolist() for x in range(10)], [])
df = pd.DataFrame(columns=list("ABCD"), data= np.random.randint(1, 5, size=(len(uid), 4)))
df['id'] = uid
df.head()
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0
3 4 3 3 1 1
4 1 3 4 4 1
I would like to group by id and then sum all the values, i.e.:
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0 = (1+1+2+2 + 1+2+4+4 + 2+3+3+2) = 27
Then duplicate the value for the group, so the result would be:
A B C D id val
0 1 1 2 2 0 27
1 1 2 4 4 0 27
2 2 3 3 2 0 27
I've tried calling df.groupby('id').sum(), but it sums each column separately.
You can use set_index, then stack, followed by groupby and sum, then Series.map:
df['val'] = df['id'].map(df.set_index("id").stack().groupby(level=0).sum())
Or, as suggested by @jezrael, sum has a level=0 argument which does the same as above (note that the level argument is deprecated in recent pandas versions, so the groupby form is preferred):
df['val'] = df['id'].map(df.set_index("id").stack().sum(level=0))
A B C D id val
0 4 2 4 1 0 34
1 3 4 4 2 0 34
2 2 4 3 1 0 34
3 1 1 1 3 1 6
4 2 3 1 4 2 50
You can first sum all columns except id and then use GroupBy.transform:
df['val'] = df.drop(columns='id').sum(axis=1).groupby(df['id']).transform('sum')
Another idea:
df['val'] = df['id'].map( df.groupby('id').sum().sum(axis=1))
M = df.groupby("id").sum()
print(M)
print(np.sum(M, axis=1))
First, M holds the column-wise sums within each id group. Then, summing each row of M gives the total per id, which is the wanted result.
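To attach those per-id totals back onto the original frame as a val column, as the question asks, one more map step is needed; a minimal sketch built on M from above:
# Map each row's id to the corresponding total computed from M
df['val'] = df['id'].map(M.sum(axis=1))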
I'd like to add up all the columns in a DataFrame, and I'd like this sum added as a new column in the DataFrame.
I want "all" the columns available, without mentioning the first and last columns in my query.
Is this possible?
Use sum:
import pandas as pd
import numpy as np
#random dataframe
np.random.seed(1)
df1 = pd.DataFrame(np.random.randint(10, size=(3,5)))
df1.columns = list('ABCDE')
print(df1)
A B C D E
0 5 8 9 5 0
1 0 1 7 6 9
2 2 4 5 2 4
df1['sum'] = df1.sum(axis=1)
print(df1)
A B C D E sum
0 5 8 9 5 0 27
1 0 1 7 6 9 23
2 2 4 5 2 4 17
Another solution for creating new columns is assign:
print(df1.assign(sum=df1.sum(axis=1)))
A B C D E sum
0 5 8 9 5 0 27
1 0 1 7 6 9 23
2 2 4 5 2 4 17
Another solution, with concat:
pd.concat([df, df.sum(axis=1)], axis=1)
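Note that the column appended by concat comes out named 0; a small sketch giving it a name (the label 'sum' is just an example):
pd.concat([df, df.sum(axis=1).rename('sum')], axis=1)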
You can do it like this:
df['sum'] = df.sum(axis=1)
I have the following python pandas data frame:
df = pd.DataFrame({
    'A': [1,1,1,1,2,2,2,3,3,4,4,4],
    'B': [5,5,6,7,5,6,6,7,7,6,7,7],
    'C': [1,1,1,1,1,1,1,1,1,1,1,1]
})
df
A B C
0 1 5 1
1 1 5 1
2 1 6 1
3 1 7 1
4 2 5 1
5 2 6 1
6 2 6 1
7 3 7 1
8 3 7 1
9 4 6 1
10 4 7 1
11 4 7 1
I would like to have another column storing the sum of the C values for each fixed (A, B) pair. That is, something like:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
I have tried with pandas groupby and it kind of works:
res = {}
for a, group_by_A in df.groupby('A'):
    group_by_B = group_by_A.groupby('B', as_index=False)
    res[a] = group_by_B['C'].sum()
but I don't know how to get the results from res back into df in an orderly fashion. I would be very happy with any advice on this. Thank you.
Here's one way (though it feels like this should work in one go with an apply, I can't get it to).
In [11]: g = df.groupby(['A', 'B'])
In [12]: df1 = df.set_index(['A', 'B'])
The size groupby function is the one you want; we have to match it to 'A' and 'B' as the index:
In [13]: df1['D'] = g.size() # unfortunately this doesn't play nice with as_index=False
# Same would work with g['C'].sum()
In [14]: df1.reset_index()
Out[14]:
A B C D
0 1 5 1 2
1 1 5 1 2
2 1 6 1 1
3 1 7 1 1
4 2 5 1 1
5 2 6 1 2
6 2 6 1 2
7 3 7 1 2
8 3 7 1 2
9 4 6 1 1
10 4 7 1 2
11 4 7 1 2
You could also do a one-liner using transform applied to the groupby:
df['D'] = df.groupby(['A','B'])['C'].transform('sum')
You could also do a one-liner using merge, as follows:
df = df.merge(pd.DataFrame({'D':df.groupby(['A', 'B'])['C'].size()}), left_on=['A', 'B'], right_index=True)
You can use this method:
columns = ['col1','col2',...]
df.groupby('col')[columns].sum()
If you want, you can also chain .sort_values(by='colx', ascending=True/False) after .sum() to sort the final output by a specific column (colx) in ascending or descending order.
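For example, a minimal sketch of that pattern using the df from the question above (the column choices are just illustrative):
# Sum C within each A group, then sort the result by the summed column
columns = ['C']
out = df.groupby('A')[columns].sum().sort_values(by='C', ascending=False)
print(out)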