Pandas sum all columns of group result into a single result - python

I've generated the following dummy data, where the number of rows per id ranges from 1 to 5:
import pandas as pd, numpy as np
import random
import functools
import operator
uid = functools.reduce(operator.iconcat, [np.repeat(x,random.randint(1,5)).tolist() for x in range(10)], [])
df = pd.DataFrame(columns=list("ABCD"), data= np.random.randint(1, 5, size=(len(uid), 4)))
df['id'] = uid
df.head()
   A  B  C  D  id
0  1  1  2  2   0
1  1  2  4  4   0
2  2  3  3  2   0
3  4  3  3  1   1
4  1  3  4  4   1
I would like to group by id and then sum all the values, i.e.:
   A  B  C  D  id
0  1  1  2  2   0
1  1  2  4  4   0
2  2  3  3  2   0   = (1+1+2+2) + (1+2+4+4) + (2+3+3+2) = 27
Then duplicate that value across the whole group, so the result would be:
   A  B  C  D  id  val
0  1  1  2  2   0   27
1  1  2  4  4   0   27
2  2  3  3  2   0   27
I've tried df.groupby('id').sum(), but it sums each column separately.

You can use set_index, then stack, followed by groupby and sum, then Series.map:
df['val'] = df['id'].map(df.set_index("id").stack().groupby(level=0).sum())
Or, as suggested by @jezrael, sum has a level=0 argument which does the same as above (note that the level argument was deprecated in pandas 1.3 and removed in 2.0, so the groupby form above is preferred on modern pandas):
df['val'] = df['id'].map(df.set_index("id").stack().sum(level=0))
   A  B  C  D  id  val
0  4  2  4  1   0   34
1  3  4  4  2   0   34
2  2  4  3  1   0   34
3  1  1  1  3   1    6
4  2  3  1  4   2   50

You can first sum all columns except id, then use GroupBy.transform:
df['val'] = df.drop(columns='id').sum(axis=1).groupby(df['id']).transform('sum')
Another idea:
df['val'] = df['id'].map(df.groupby('id').sum().sum(axis=1))

M = df.groupby("id").sum()
print(M)
print(M.sum(axis=1))
First, M holds the sum of each column grouped by id. Then, summing across each row of M gives the wanted per-id total.
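To attach these per-id totals back to the original frame, as the question asks, the row sums of M can be mapped onto id (a small sketch continuing from the code above):
totals = M.sum(axis=1)           # one total per id
df['val'] = df['id'].map(totals) # broadcast each total to its group's rows
print(df)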

Related

groupby cumsum (or cumcount) with cyclical data

I have a dataframe that looks like
ID  SWITCH
A   ON
A   ON
A   ON
A   OFF
A   OFF
A   OFF
A   ON
A   ON
A   ON
...
B   ON
B   ON
B   ON
B   OFF
B   OFF
B   OFF
B   ON
B   ON
B   ON
Column 'SWITCH' is cyclical data and I'd like to count the number of ON and OFF for each cycle like this:
ID  SWITCH  Cum. Count
A   ON      1
A   ON      2
A   ON      3
A   OFF     1
A   OFF     2
A   OFF     3
A   ON      1
A   ON      2
A   ON      3
...
B   ON      1
B   ON      2
B   OFF     1
B   OFF     2
B   OFF     3
B   ON      1
B   ON      2
B   ON      3
I tried cumsum and cumcount, but the count doesn't reset when the next 'ON' cycle arrives (it keeps counting from where the previous cycle left off).
What can I do?
You need to create a new column that indicates a change in the 'SWITCH' column; then you can use groupby to perform the cumulative count.
import pandas as pd

# Create sample data (note: this differs slightly from the question's
# table, in that B ends with six OFF rows)
df = pd.DataFrame({'ID': ['A'] * 9 + ['B'] * 9,
                   'SWITCH': ['ON'] * 3 + ['OFF'] * 3 + ['ON'] * 3
                           + ['ON'] * 3 + ['OFF'] * 3 + ['OFF'] * 3})

# 1 where SWITCH differs from the previous row, else 0
df['SWITCH_CHANGE'] = (df['SWITCH'] != df['SWITCH'].shift()).astype(int)

# cumsum of the change flags labels each run of equal values;
# cumcount within (ID, run) restarts the count for every run
df['Cum. Count'] = df.groupby(['ID', df.SWITCH_CHANGE.cumsum()])['SWITCH'].cumcount() + 1
print(df)
Result:
   ID SWITCH  SWITCH_CHANGE  Cum. Count
0   A     ON              1           1
1   A     ON              0           2
2   A     ON              0           3
3   A    OFF              1           1
4   A    OFF              0           2
5   A    OFF              0           3
6   A     ON              1           1
7   A     ON              0           2
8   A     ON              0           3
9   B     ON              0           1
10  B     ON              0           2
11  B     ON              0           3
12  B    OFF              1           1
13  B    OFF              0           2
14  B    OFF              0           3
15  B    OFF              0           4
16  B    OFF              0           5
17  B    OFF              0           6
Try using the cumsum of the differences as the grouping key as well:
switch_blocks = df['SWITCH'].ne(df['SWITCH'].shift()).cumsum()
df['cum.count'] = df.groupby(['ID', switch_blocks]).cumcount().add(1)
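Putting that together into a minimal runnable sketch (reusing the sample data from the answer above):
import pandas as pd

df = pd.DataFrame({'ID': ['A'] * 9 + ['B'] * 9,
                   'SWITCH': ['ON'] * 3 + ['OFF'] * 3 + ['ON'] * 3
                           + ['ON'] * 3 + ['OFF'] * 6})

# A new block starts whenever SWITCH changes value; cumsum labels the blocks
switch_blocks = df['SWITCH'].ne(df['SWITCH'].shift()).cumsum()
df['cum.count'] = df.groupby(['ID', switch_blocks]).cumcount().add(1)
print(df)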

Concat() alternate groups - python 3.0

My goal here is to concat() alternating groups from two dataframes.
Desired result:
group  ordercode  quantity
0      A          1
       B          1
       C          1
       D          1
0      A          1
       B          3
1      A          1
       B          2
       C          1
1      A          1
       B          1
       C          2
My dataframes:
import pandas as pd
df1 = pd.DataFrame([[0, "A", 1], [0, "B", 1], [0, "C", 1], [0, "D", 1],
                    [1, "A", 1], [1, "B", 2], [1, "C", 1]],
                   columns=["group", "ordercode", "quantity"])
df2 = pd.DataFrame([[0, "A", 1], [0, "B", 3],
                    [1, "A", 1], [1, "B", 1], [1, "C", 2]],
                   columns=["group", "ordercode", "quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge"),
but I got the result below:
group ordercode quantity
0 0 A 1
0 0 A 1
1 B 1
1 B 3
2 C 1
3 D 1
4 1 A 1
4 1 A 1
5 B 2
5 B 1
6 C 1
6 C 2
You can see that the concatenation interleaves individual rows rather than whole groups.
It has to print like:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2
and so on.
Note:
I created these DataFrames using the groupby() function:
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? It is sorting by index.
I tried .set_index("group"), but it didn't work either.
Use cumcount to create a helper column, then sort by it with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
group ordercode quantity g
0 0 A 1 0
1 0 B 1 0
2 0 C 1 0
3 0 D 1 0
4 0 A 1 0
5 0 B 3 0
6 1 C 2 0
7 1 A 1 1
8 1 B 2 1
9 1 C 1 1
10 1 A 1 1
11 1 B 1 1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
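For reference, a minimal sketch that interleaves the groups explicitly with boolean masks (assuming both frames use the same group labels) gives the requested ordering without a helper column:
import pandas as pd

# Take group 0 of df1, then group 0 of df2, then group 1 of df1, and so on
parts = []
for key in sorted(set(df1['group']).union(df2['group'])):
    parts.append(df1[df1['group'] == key])
    parts.append(df2[df2['group'] == key])
dfff = pd.concat(parts, ignore_index=True)
print(dfff)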

Groupby function creating 2 new columns [duplicate]

SQL: Select Max(A), Min(B), C from Table group by C
I want to do the same operation in pandas on a dataframe. The closest I got was:
DF2 = DF1.groupby(by=['C']).max()
but there I end up getting the max of both columns. How do I apply more than one operation while grouping?
You can use the agg function:
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
Sample:
print(DF1)
A B C D
0 1 5 a a
1 7 9 a b
2 2 10 c d
3 3 2 c c
DF2 = DF1.groupby('C').agg({'A': max, 'B': min})
print(DF2)
A B
C
a 7 5
c 3 2
GroupBy-fu: improvements in grouping and aggregating data in pandas - nice explanations.
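On pandas 0.25+, named aggregation expresses the same query with explicit output column names (a sketch using the sample frame above):
DF2 = DF1.groupby('C').agg(A=('A', 'max'), B=('B', 'min'))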
Try the agg() function:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,5,size=(20, 3)), columns=list('ABC'))
print(df)
print(df.groupby('C').agg({'A': max, 'B':min}))
Output:
A B C
0 2 3 0
1 2 2 1
2 4 0 1
3 0 1 4
4 3 3 2
5 0 4 3
6 2 4 2
7 3 4 0
8 4 2 2
9 3 2 1
10 2 3 1
11 4 1 0
12 4 3 2
13 0 0 1
14 3 1 1
15 4 1 1
16 0 0 0
17 4 0 1
18 3 4 0
19 0 2 4
A B
C
0 4 0
1 4 0
2 4 2
3 0 4
4 0 1
Alternatively, you may want to check the pandas.read_sql_query() function...
You can use the agg function:
import pandas as pd
import numpy as np
df.groupby('something').agg({'column1': np.max, 'column2': np.min})

Most efficient way to groupby => aggregate for large dataframe in pandas

I have a pandas dataframe with roughly 150,000,000 rows in the following format:
df.head()
Out[1]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
I want to aggregate it by ID & TERM, and count the number of rows. Currently I do the following:
df.groupby(['ID','TERM']).count()
Out[2]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
But this takes roughly two minutes. The same operation using R's data.table takes about 22 seconds. Is there a more efficient way to do this in Python?
For comparison, R data.table:
system.time({ df[,.(.N), .(ID, TERM)] })
#user: 30.32 system: 2.45 elapsed: 22.88
A NumPy solution would be like so -
def groupby_count(df):
    # Factorize TERM and grab the ID values
    unq, t = np.unique(df.TERM, return_inverse=True)
    ids = df.ID.values
    # Sort rows by (ID, TERM)
    sidx = np.lexsort([t, ids])
    ts = t[sidx]
    idss = ids[sidx]
    # Mark positions where either key changes; block lengths are the counts
    m0 = (idss[1:] != idss[:-1]) | (ts[1:] != ts[:-1])
    m = np.concatenate(([True], m0, [True]))
    ids_out = idss[m[:-1]]
    t_out = unq[ts[m[:-1]]]
    x_out = np.diff(np.flatnonzero(m) + 1)
    out_ar = np.column_stack((ids_out, t_out, x_out))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
A bit simpler version -
def groupby_count_v2(df):
    a = df.values
    # Sort rows by the first two columns (ID, TERM)
    sidx = np.lexsort(a[:, :2].T)
    b = a[sidx, :2]
    # Boundaries where either key changes; block lengths are the counts
    m = np.concatenate(([True], (b[1:] != b[:-1]).any(1), [True]))
    out_ar = np.column_stack((b[m[:-1], :2], np.diff(np.flatnonzero(m) + 1)))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
Sample run -
In [332]: df
Out[332]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
In [333]: groupby_count(df)
Out[333]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
Let's randomly shuffle the rows and verify that our solution still works -
In [339]: df1 = df.iloc[np.random.permutation(len(df))]
In [340]: df1
Out[340]:
ID TERM X
7 2 F 1
6 2 B 1
0 1 A 0
3 1 B 0
5 2 A 1
2 1 A 6
1 1 A 4
4 1 B 10
In [341]: groupby_count(df1)
Out[341]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
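For comparison, a pandas-only alternative that is usually faster than count() here is size(), since it counts rows directly instead of scanning a value column (a sketch; timings will vary with the data):
out = df.groupby(['ID', 'TERM']).size().reset_index(name='X')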

return rows with unique pairs across columns

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows that don't match another row when flipped. For instance, (1, 3) is a combination I want returned only once; if the same pair exists with the columns flipped (3, 1), it can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index,
                  columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting with DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
