pandas add column to groupby dataframe - python

I have this simple dataframe df:
df = pd.DataFrame({'c':[1,1,1,2,2,2,2],'type':['m','n','o','m','m','n','n']})
My goal is to count the values of type for each c, and then add a column with the size of each c group. Starting with:
In [27]: g = df.groupby('c')['type'].value_counts().reset_index(name='t')
In [28]: g
Out[28]:
   c type  t
0  1    m  1
1  1    n  1
2  1    o  1
3  2    m  2
4  2    n  2
The first problem is solved. Then I can also:
In [29]: a = df.groupby('c').size().reset_index(name='size')
In [30]: a
Out[30]:
   c  size
0  1     3
1  2     4
How can I add the size column directly to the first DataFrame? So far I have used map:
In [31]: a.index = a['c']
In [32]: g['size'] = g['c'].map(a['size'])
In [33]: g
Out[33]:
   c type  t  size
0  1    m  1     3
1  1    n  1     3
2  1    o  1     3
3  2    m  2     4
4  2    n  2     4
which works, but is there a more straightforward way to do this?

Use transform to add a column back to the original DataFrame from a groupby aggregation; transform returns a Series with its index aligned to the original DataFrame:
In [123]:
g = df.groupby('c')['type'].value_counts().reset_index(name='t')
g['size'] = df.groupby('c')['type'].transform('size')
g
Out[123]:
   c type  t  size
0  1    m  1     3
1  1    n  1     3
2  1    o  1     3
3  2    m  2     4
4  2    n  2     4
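This works here because g's reset index (0 to 4) happens to line up with the first five labels of the transform output, which is indexed like the original df. A more defensive variant (a sketch, not from the original answer) merges the per-group sizes in by key:
import pandas as pd

df = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2],
                   'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n']})

g = df.groupby('c')['type'].value_counts().reset_index(name='t')
# Merge the per-group sizes in on 'c', so alignment is by key rather
# than by row position.
g = g.merge(df.groupby('c').size().rename('size').reset_index(), on='c')
print(g)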

Another solution uses transform with len:
df['size'] = df.groupby('c')['type'].transform(len)
print(df)
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
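Passing the string 'size' instead of len gives the same result and is generally faster, since string aggregation names dispatch to pandas' cythonized implementations rather than calling a Python function per group:
df['size'] = df.groupby('c')['type'].transform('size')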
Another solution uses Series.map with Series.value_counts:
df['size'] = df['c'].map(df['c'].value_counts())
print(df)
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
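For intuition, df['c'].value_counts() yields a Series keyed by the unique values of c, and map performs a per-row lookup in it. A minimal self-contained sketch:
import pandas as pd

df = pd.DataFrame({'c': [1, 1, 1, 2, 2, 2, 2],
                   'type': ['m', 'n', 'o', 'm', 'm', 'n', 'n']})
counts = df['c'].value_counts()   # index: unique values of c; values: frequencies (1 -> 3, 2 -> 4)
df['size'] = df['c'].map(counts)  # look up each row's c in counts
print(df)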

You can create the groupby object once and reuse it:
g = df.groupby('c')['type']
df = g.value_counts().reset_index(name='counts')
df['size'] = g.transform('size')
or
g.value_counts().reset_index(name='counts').assign(size=g.transform('size'))
Output:
   c type  counts  size
0  1    m       1     3
1  1    n       1     3
2  1    o       1     3
3  2    m       2     4
4  2    n       2     4

Related

Count duplicates on two columns and add value as a new column [duplicate]

Count Number of dates in a dataframe [duplicate]

arithmetic on pandas dataframe row-wise

I have this df:
   A  B  C
a  1  2  3
b  2  1  4
c  1  1  1
I want:
   A  B  C
a  1  2  3
b  2  1  4
c  1  1  1
d  1 -1  1
I can get the desired df by using:
df.loc['d']=df.loc['b']-df.loc['a']
However, my actual df has 'a', 'b', 'c' rows for multiple IDs, 'X', 'Y', etc.:
     A  B  C
X a  1  2  3
  b  2  1  4
  c  1  1  1
Y a  1  2  3
  b  2  2  4
  c  1  1  1
How can I create the same output with multiple IDs?
My original method:
df.loc['d']=df.loc['b']-df.loc['a']
fails with KeyError: 'b'.
Desired output:
     A  B  C
X a  1  2  3
  b  2  1  4
  c  1  1  1
  d  1 -1  1
Y a  1  2  3
  b  2  2  4
  c  1  1  1
  d  1  0  1
IIUC, you can iterate over the first index level:
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]
print(df.sort_index())
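For reference, a self-contained sketch of this approach; the construction of the sample MultiIndex frame is an assumption, since the question does not show it:
import pandas as pd

# Rebuild the sample frame from the question (assumed construction).
idx = pd.MultiIndex.from_product([['X', 'Y'], ['a', 'b', 'c']])
df = pd.DataFrame([[1, 2, 3], [2, 1, 4], [1, 1, 1],
                   [1, 2, 3], [2, 2, 4], [1, 1, 1]],
                  index=idx, columns=['A', 'B', 'C'])

# Append a 'd' row (b - a) under each top-level ID.
for i, sub in df.groupby(df.index.get_level_values(0)):
    df.loc[(i, 'd'), :] = sub.loc[(i, 'b')] - sub.loc[(i, 'a')]

print(df.sort_index())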
Or maybe:
k = (df.groupby(df.index.get_level_values(0), as_index=False)
       .apply(lambda s: pd.DataFrame(
           [s.loc[(s.name, 'b')].values - s.loc[(s.name, 'a')].values],
           columns=s.columns,
           index=pd.MultiIndex(levels=[[s.name], ['d']], codes=[[0], [0]])))
       .reset_index(drop=True, level=0))
pd.concat([k, df]).sort_index()
Data reshaping is a useful trick if you want to operate on a particular level of a MultiIndex. See the code below:
result = (df.unstack(0).T
            .assign(d=lambda x: x.b - x.a)
            .stack()
            .unstack(0))
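To see why this works, here is the same pipeline with the intermediate shapes spelled out (a sketch; the variable names are mine):
tmp = df.unstack(0)                       # index: a/b/c, columns: (A/B/C, X/Y)
tmp = tmp.T                               # index: (A/B/C, X/Y), columns: a/b/c
tmp = tmp.assign(d=lambda x: x.b - x.a)   # row arithmetic becomes column arithmetic
result = tmp.stack().unstack(0)           # back to an (ID, row-label) index with A/B/C columns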
Use pd.IndexSlice to slice a and b, call diff, slice on b, and rename it to d. Finally, append it to the original df:
idx = pd.IndexSlice
# diff gives b - a at the 'b' rows; keep those and relabel them 'd'
df1 = df.loc[idx[:, ['a', 'b']], :].diff().loc[idx[:, 'b'], :].rename({'b': 'd'})
df2 = df.append(df1).sort_index().astype(int)
Out[106]:
     A  B  C
X a  1  2  3
  b  2  1  4
  c  1  1  1
  d  1 -1  1
Y a  1  2  3
  b  2  2  4
  c  1  1  1
  d  1  0  1

Most efficient way to groupby => aggregate for large dataframe in pandas

I have a pandas dataframe with roughly 150,000,000 rows in the following format:
df.head()
Out[1]:
   ID TERM   X
0   1    A   0
1   1    A   4
2   1    A   6
3   1    B   0
4   1    B  10
5   2    A   1
6   2    B   1
7   2    F   1
I want to aggregate it by ID & TERM, and count the number of rows. Currently I do the following:
df.groupby(['ID','TERM']).count()
Out[2]:
   ID TERM  X
0   1    A  3
1   1    B  2
2   2    A  1
3   2    B  1
4   2    F  1
But this takes roughly two minutes, while the same operation using R's data.table takes about 23 seconds. Is there a more efficient way to do this in Python?
For comparison, R data.table:
system.time({ df[,.(.N), .(ID, TERM)] })
#user: 30.32 system: 2.45 elapsed: 22.88
A NumPy solution would look like this:
import numpy as np
import pandas as pd

def groupby_count(df):
    # factorize TERM: t holds integer codes into unq
    unq, t = np.unique(df.TERM, return_inverse=True)
    ids = df.ID.values
    # sort by ID and then TERM; lexsort treats its last key as primary
    sidx = np.lexsort([t, ids])
    ts = t[sidx]
    idss = ids[sidx]
    # True wherever a new (ID, TERM) group starts
    m0 = (idss[1:] != idss[:-1]) | (ts[1:] != ts[:-1])
    m = np.concatenate(([True], m0, [True]))
    ids_out = idss[m[:-1]]
    t_out = unq[ts[m[:-1]]]
    # group sizes are the gaps between consecutive group starts
    x_out = np.diff(np.flatnonzero(m) + 1)
    out_ar = np.column_stack((ids_out, t_out, x_out))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
A slightly simpler version:
def groupby_count_v2(df):
    a = df.values
    # sort rows so equal (ID, TERM) pairs become contiguous
    sidx = np.lexsort(a[:, :2].T)
    b = a[sidx, :2]
    # True wherever a new (ID, TERM) pair starts
    m = np.concatenate(([True], (b[1:] != b[:-1]).any(1), [True]))
    out_ar = np.column_stack((b[m[:-1], :2], np.diff(np.flatnonzero(m) + 1)))
    return pd.DataFrame(out_ar, columns=['ID', 'TERM', 'X'])
Sample run:
In [332]: df
Out[332]:
   ID TERM   X
0   1    A   0
1   1    A   4
2   1    A   6
3   1    B   0
4   1    B  10
5   2    A   1
6   2    B   1
7   2    F   1
In [333]: groupby_count(df)
Out[333]:
   ID TERM  X
0   1    A  3
1   1    B  2
2   2    A  1
3   2    B  1
4   2    F  1
Let's randomly shuffle the rows and verify that the solution still works:
In [339]: df1 = df.iloc[np.random.permutation(len(df))]
In [340]: df1
Out[340]:
   ID TERM   X
7   2    F   1
6   2    B   1
0   1    A   0
3   1    B   0
5   2    A   1
2   1    A   6
1   1    A   4
4   1    B  10
In [341]: groupby_count(df1)
Out[341]:
   ID TERM  X
0   1    A  3
1   1    B  2
2   2    A  1
3   2    B  1
4   2    F  1
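It is also worth benchmarking the pandas-native route: groupby(...).size() counts rows without touching the X column, and sort=False skips the final sort of the group keys. A sketch on the sample data (not part of the original answer; timings will vary by machine and pandas version):
import pandas as pd

df = pd.DataFrame({'ID':   [1, 1, 1, 1, 1, 2, 2, 2],
                   'TERM': list('AAABBABF'),
                   'X':    [0, 4, 6, 0, 10, 1, 1, 1]})

# Count rows per (ID, TERM) without per-column NaN checks.
counts = (df.groupby(['ID', 'TERM'], sort=False)
            .size()
            .reset_index(name='X'))
print(counts)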

How to create a dataframe that indicates the rows min and max of a given dataframe?

I need to produce a matrix as input for conditional formatting in an automated chart-creation process. My colleague has to display the numbers and color the max and min of each row. For his process, a second matrix whose entries indicate the row minima and maxima would be ideal.
So what do I need to deliver?
Let's say I have the following dataframe:
Cat Product  Brand1  Brand2  Brand3
A   a             6       9       5
A   b            11       7       7
A   c             9       5       5
B   d             7       3      10
B   e             5       8       8
B   f            10       6       6
C   g             8       4       4
C   h             6       2       9
C   i             4       7       7
From that, I want to generate the following DataFrame, with "1" marking the row max and "2" marking the row min:
Cat Product  Brand1  Brand2  Brand3
A   a             0       1       2
A   b             1       2       2
A   c             1       2       2
B   d             0       2       1
B   e             2       1       1
B   f             1       2       2
C   g             1       2       2
C   h             0       2       1
C   i             2       1       1
The indicators "1" and "2" could be something else, even letters. The zeros could also be NaN.
How can this be achieved?
You can use numpy.where to replace values using masks created with eq:
import numpy as np

df = df.set_index(['Cat','Product'])
m1 = df.eq(df.max(axis=1), axis=0)   # row-wise max mask
m2 = df.eq(df.min(axis=1), axis=0)   # row-wise min mask
df = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)), index=df.index, columns=df.columns)
df = df.reset_index()
print(df)
  Cat Product  Brand1  Brand2  Brand3
0   A       a       0       1       2
1   A       b       1       2       2
2   A       c       1       2       2
3   B       d       0       2       1
4   B       e       2       1       1
5   B       f       1       2       2
6   C       g       1       2       2
7   C       h       0       2       1
8   C       i       2       1       1
Another solution:
df = df.set_index(['Cat','Product'])
m1 = df.values == df.values.max(axis=1)[:, None]   # row-wise max mask on the raw array
m2 = df.values == df.values.min(axis=1)[:, None]   # row-wise min mask on the raw array
df = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)), index=df.index, columns=df.columns)
df = df.reset_index()
print(df)
  Cat Product  Brand1  Brand2  Brand3
0   A       a       0       1       2
1   A       b       1       2       2
2   A       c       1       2       2
3   B       d       0       2       1
4   B       e       2       1       1
5   B       f       1       2       2
6   C       g       1       2       2
7   C       h       0       2       1
8   C       i       2       1       1
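For completeness, a self-contained version of the first approach; the DataFrame construction is my assumption from the sample table:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Cat':     list('AAABBBCCC'),
    'Product': list('abcdefghi'),
    'Brand1':  [6, 11, 9, 7, 5, 10, 8, 6, 4],
    'Brand2':  [9, 7, 5, 3, 8, 6, 4, 2, 7],
    'Brand3':  [5, 7, 5, 10, 8, 6, 4, 9, 7],
})

df = df.set_index(['Cat', 'Product'])
m1 = df.eq(df.max(axis=1), axis=0)   # row-wise max mask
m2 = df.eq(df.min(axis=1), axis=0)   # row-wise min mask
out = pd.DataFrame(np.where(m1, 1, np.where(m2, 2, 0)),
                   index=df.index, columns=df.columns).reset_index()
print(out)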
