groupby cumsum (or cumcount) with cyclical data - python

I have a dataframe that looks like:

ID  SWITCH
A   ON
A   ON
A   ON
A   OFF
A   OFF
A   OFF
A   ON
A   ON
A   ON
...
B   ON
B   ON
B   ON
B   OFF
B   OFF
B   OFF
B   ON
B   ON
B   ON
Column 'SWITCH' holds cyclical data, and I'd like to count the ON and OFF rows within each cycle, like this:
ID  SWITCH  Cum. Count
A   ON      1
A   ON      2
A   ON      3
A   OFF     1
A   OFF     2
A   OFF     3
A   ON      1
A   ON      2
A   ON      3
...
B   ON      1
B   ON      2
B   ON      3
B   OFF     1
B   OFF     2
B   OFF     3
B   ON      1
B   ON      2
B   ON      3
I tried cumsum and cumcount, but neither resets the count when the next 'ON' cycle starts (the count keeps running on from the previous cycle).
What can I do?

You need to create a helper column that flags where the 'SWITCH' value changes; the cumulative sum of that flag gives each run a label, and you can then use groupby to perform the cumulative count within each run.
import pandas as pd

# Create sample data
df = pd.DataFrame({'ID': ['A'] * 9 + ['B'] * 9,
                   'SWITCH': ['ON'] * 3 + ['OFF'] * 3 + ['ON'] * 3 +
                             ['ON'] * 3 + ['OFF'] * 3 + ['OFF'] * 3})

# 1 wherever the value differs from the previous row, 0 otherwise
df['SWITCH_CHANGE'] = (df['SWITCH'] != df['SWITCH'].shift()).astype(int)

# The cumulative sum of the change flag gives every ON/OFF run its own label;
# grouping by ID plus that label restarts the count for each run
df['Cum. Count'] = df.groupby(['ID', df.SWITCH_CHANGE.cumsum()])['SWITCH'].cumcount() + 1
print(df)
Result:

    ID SWITCH  SWITCH_CHANGE  Cum. Count
0    A     ON              1           1
1    A     ON              0           2
2    A     ON              0           3
3    A    OFF              1           1
4    A    OFF              0           2
5    A    OFF              0           3
6    A     ON              1           1
7    A     ON              0           2
8    A     ON              0           3
9    B     ON              0           1
10   B     ON              0           2
11   B     ON              0           3
12   B    OFF              1           1
13   B    OFF              0           2
14   B    OFF              0           3
15   B    OFF              0           4
16   B    OFF              0           5
17   B    OFF              0           6

Try grouping by the cumulative sum of the change indicator as well:
switch_blocks = df['SWITCH'].ne(df['SWITCH'].shift()).cumsum()
df['cum.count'] = df.groupby(['ID', switch_blocks]).cumcount().add(1)
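A self-contained sketch of this block-id idea (using a single ID and made-up sample data for brevity):

import pandas as pd

df = pd.DataFrame({'ID': ['A'] * 9,
                   'SWITCH': ['ON'] * 3 + ['OFF'] * 3 + ['ON'] * 3})

# Each time SWITCH differs from the previous row, cumsum() starts a new block id
switch_blocks = df['SWITCH'].ne(df['SWITCH'].shift()).cumsum()

# Counting within (ID, block) restarts at 1 for every ON/OFF run
df['cum.count'] = df.groupby(['ID', switch_blocks]).cumcount().add(1)
print(df['cum.count'].tolist())  # [1, 2, 3, 1, 2, 3, 1, 2, 3]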

Related

Pandas sum all columns of group result into a single result

I've generated the following dummy data, where the number of rows per id ranges from 1 to 5:
import pandas as pd, numpy as np
import random
import functools
import operator
uid = functools.reduce(operator.iconcat, [np.repeat(x,random.randint(1,5)).tolist() for x in range(10)], [])
df = pd.DataFrame(columns=list("ABCD"), data= np.random.randint(1, 5, size=(len(uid), 4)))
df['id'] = uid
df.head()
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0
3 4 3 3 1 1
4 1 3 4 4 1
I would like to group by id and then sum all the values, i.e.:
A B C D id
0 1 1 2 2 0
1 1 2 4 4 0
2 2 3 3 2 0 = (1+1+2+2) + (1+2+4+4) + (2+3+3+2) = 27
Then duplicate the value for the group so the result would be:
A B C D id val
0 1 1 2 2 0 27
1 1 2 4 4 0 27
2 2 3 3 2 0 27
I tried df.groupby('id').sum(), but it sums each column separately.
You can use set_index and then stack, followed by groupby and sum, then Series.map:
df['val'] = df['id'].map(df.set_index("id").stack().groupby(level=0).sum())
Or, as suggested by @jezrael, sum has a level=0 argument which does the same as above:
df['val'] = df['id'].map(df.set_index("id").stack().sum(level=0))
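Note: the level argument of sum was deprecated in pandas 1.3 and removed in 2.0, so on recent versions the groupby(level=0).sum() form above is the one to use.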
A B C D id val
0 4 2 4 1 0 34
1 3 4 4 2 0 34
2 2 4 3 1 0 34
3 1 1 1 3 1 6
4 2 3 1 4 2 50
You can first sum all columns without id and then use GroupBy.transform:
df['val'] = df.drop('id', axis=1).sum(axis=1).groupby(df['id']).transform('sum')
Another idea:
df['val'] = df['id'].map(df.groupby('id').sum().sum(axis=1))
M = df.groupby("id").sum()
print(M)
print(np.sum(M, axis=1))
First, M holds the per-id sum of each column. Summing M across the rows (axis=1) then gives the total we want for each id.
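Completing this last answer: to broadcast the per-id totals back onto every row (which is what the question asks for), the Series from above can be mapped through the id column, e.g.:

# M.sum(axis=1) is a Series indexed by id; map() looks up each row's total
df['val'] = df['id'].map(M.sum(axis=1))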

arithmetic on pandas dataframe row-wise

The df I have:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
The df I want:
A B C
a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
I can get the desired df by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a','b','c' rows for multiple IDs 'X', 'Y' etc.
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
How can I create the same output with multiple IDs?
My original method,
df.loc['d'] = df.loc['b'] - df.loc['a']
fails with KeyError: 'b'.
Desired output:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1
IIUC,
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
Or maybe
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame([s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
                                     columns=s.columns,
                                     index=pd.MultiIndex(levels=[[s.name], ['d']],
                                                         codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick when you want to manipulate a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))
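The trick here: unstack(0).T moves the a/b/c labels into columns, so d = b - a becomes plain column arithmetic inside assign, and the final stack/unstack pair restores the original (ID, letter) by column layout.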
Use pd.IndexSlice to slice out a and b. Call diff, slice on b, and rename it to d. Finally, append the result to the original df:
idx = pd.IndexSlice
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
df2 = pd.concat([df, df1]).sort_index().astype(int)
Out[106]:
A B C
X a 1 2 3
b 2 1 4
c 1 1 1
d 1 -1 1
Y a 1 2 3
b 2 2 4
c 1 1 1
d 1 0 1

Concat() alternate group by python3.0

My goal here is to concat() alternate groups between two dataframes.
desired result :
group ordercode quantity
0 A 1
B 1
C 1
D 1
0 A 1
B 3
1 A 1
B 2
C 1
1 A 1
B 1
C 2
My dataframes:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge")
but I got the result below:
group ordercode quantity
0 0 A 1
0 0 A 1
1 B 1
1 B 3
2 C 1
3 D 1
4 1 A 1
4 1 A 1
5 B 2
5 B 1
6 C 1
6 C 2
You can see that the concatenation interleaves the rows one by one rather than group by group. It should print like:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2
and so on.
Note:
I created these DataFrames using the groupby() function:
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df) // 3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
Question: where did I go wrong? The result is sorted by index. I also tried .set_index("group"), but it didn't work either.
Use cumcount to create a helper column, then sort with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
group ordercode quantity g
0 0 A 1 0
1 0 B 1 0
2 0 C 1 0
3 0 D 1 0
4 0 A 1 0
5 0 B 3 0
6 1 C 2 0
7 1 A 1 1
8 1 B 2 1
9 1 C 1 1
10 1 A 1 1
11 1 B 1 1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
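An alternative sketch that avoids the helper column, assuming df1 and df2 as defined in the question: tag each row with its source frame via the keys argument of concat, then sort by (group, source). The multi-column sort is stable, so the original row order inside each frame is preserved:

dfff = (pd.concat([df1, df2], keys=[0, 1], names=['src', None])  # tag rows with their source frame
          .reset_index(level='src')
          .sort_values(['group', 'src'])  # within each group, df1's block comes before df2's
          .drop(columns='src')
          .reset_index(drop=True))
print(dfff)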

Most efficient way to groupby => aggregate for large dataframe in pandas

I have a pandas dataframe with roughly 150,000,000 rows in the following format:
df.head()
Out[1]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
I want to aggregate it by ID & TERM, and count the number of rows. Currently I do the following:
df.groupby(['ID','TERM']).count()
Out[2]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
But this takes roughly two minutes, whereas the same operation using R's data.table takes about 22 seconds. Is there a more efficient way to do this in Python?
For comparison, R data.table:
system.time({ df[,.(.N), .(ID, TERM)] })
#user: 30.32 system: 2.45 elapsed: 22.88
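For a pure row count, groupby(...).size() is usually faster than .count(), since it doesn't have to check each column for nulls; a small sketch of that variant:

# size() returns a Series of group sizes; reset_index(name='X') flattens it
counts = df.groupby(['ID', 'TERM']).size().reset_index(name='X')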
A NumPy solution would be like so -
def groupby_count(df):
    unq, t = np.unique(df.TERM, return_inverse=True)
    ids = df.ID.values
    # Sort by (ID, TERM) so that equal pairs become contiguous runs
    sidx = np.lexsort([t, ids])
    ts = t[sidx]
    idss = ids[sidx]
    # True at the start of every new (ID, TERM) run, plus an end sentinel
    m0 = (idss[1:] != idss[:-1]) | (ts[1:] != ts[:-1])
    m = np.concatenate(([True], m0, [True]))
    ids_out = idss[m[:-1]]
    t_out = unq[ts[m[:-1]]]
    # Distance between consecutive run starts = run length = group count
    x_out = np.diff(np.flatnonzero(m) + 1)
    out_ar = np.column_stack((ids_out, t_out, x_out))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
A bit simpler version -
def groupby_count_v2(df):
    a = df.values
    # Sort rows lexicographically by the first two columns (ID, TERM)
    sidx = np.lexsort(a[:, :2].T)
    b = a[sidx, :2]
    # Flag the first row of each distinct (ID, TERM) run, with an end sentinel
    m = np.concatenate(([True], (b[1:] != b[:-1]).any(axis=1), [True]))
    out_ar = np.column_stack((b[m[:-1], :2], np.diff(np.flatnonzero(m) + 1)))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
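In both versions the idea is the same: lexsort orders the rows by (ID, TERM), m flags the first row of each run of identical keys (plus an end sentinel), and np.diff over the flat indices of those flags yields each run's length, i.e. the per-group count.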
Sample run -
In [332]: df
Out[332]:
ID TERM X
0 1 A 0
1 1 A 4
2 1 A 6
3 1 B 0
4 1 B 10
5 2 A 1
6 2 B 1
7 2 F 1
In [333]: groupby_count(df)
Out[333]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1
Let's randomly shuffle the rows and verify that our solution still works -
In [339]: df1 = df.iloc[np.random.permutation(len(df))]
In [340]: df1
Out[340]:
ID TERM X
7 2 F 1
6 2 B 1
0 1 A 0
3 1 B 0
5 2 A 1
2 1 A 6
1 1 A 4
4 1 B 10
In [341]: groupby_count(df1)
Out[341]:
ID TERM X
0 1 A 3
1 1 B 2
2 2 A 1
3 2 B 1
4 2 F 1

How to create a dataframe that indicates the row mins and maxes of a given dataframe?

I need to produce a matrix as input for conditional formatting in an automated chart creation process. My colleague has to display the numbers and color the max and min of each row. For his process, a second matrix whose entries indicate the row minima and maxima would be ideal.
So what do I need to deliver?
Let's say I have the following dataframe:
Cat Product Brand1 Brand2 Brand3
A a 6 9 5
A b 11 7 7
A c 9 5 5
B d 7 3 10
B e 5 8 8
B f 10 6 6
C g 8 4 4
C h 6 2 9
C i 4 7 7
From that, I want to generate the following dataframe, indicating "1" as row max and "2" as row min:
Cat Product Brand1 Brand2 Brand3
A a 0 1 2
A b 1 2 2
A c 1 2 2
B d 0 2 1
B e 2 1 1
B f 1 2 2
C g 1 2 2
C h 0 2 1
C i 2 1 1
The indicators "1" and "2" could be something else, even letters. The zeros could also be NaN.
How can this be achieved?
You can use numpy.where to replace values via boolean masks created with eq:
df = df.set_index(['Cat','Product'])
m1 = df.eq(df.max(axis=1), axis=0)
m2 = df.eq(df.min(axis=1), axis=0)
df = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)), index=df.index, columns=df.columns)
df = df.reset_index()
print (df)
Cat Product Brand1 Brand2 Brand3
0 A a 0 1 2
1 A b 1 2 2
2 A c 1 2 2
3 B d 0 2 1
4 B e 2 1 1
5 B f 1 2 2
6 C g 1 2 2
7 C h 0 2 1
8 C i 2 1 1
Another solution:
df = df.set_index(['Cat','Product'])
m1 = df.values == df.values.max(axis=1)[:, None]
m2 = df.values == df.values.min(axis=1)[:, None]
df = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)), index=df.index, columns=df.columns)
df = df.reset_index()
print (df)
The output is identical to the first solution's.
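Since the question notes the zeros could also be NaN, here is a small variant sketch of the same np.where chain, assuming the masks m1 and m2 from the first solution (computed while 'Cat' and 'Product' are still the index):

# NaN instead of 0 for cells that are neither row max nor row min;
# the columns become float because NaN is a float value
df_nan = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, np.nan)),
                      index=m1.index, columns=m1.columns)
df_nan = df_nan.reset_index()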
