I'm working with a Pandas DataFrame having the following structure:
import pandas as pd
df = pd.DataFrame({'brand': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'target': [0, 1, 0, 1, 0, 1],
                   'freq': [5600, 220, 5700, 90, 5000, 100]})
print(df)
  brand  target  freq
0     A       0  5600
1     A       1   220
2     B       0  5700
3     B       1    90
4     C       0  5000
5     C       1   100
For each brand, I would like to calculate the ratio of positive targets, e.g. for brand A the ratio is 220/(220+5600) = 0.0378.
My resulting DataFrame should look like the following:
  brand  target  freq   ratio
0     A       0  5600  0.0378
1     A       1   220  0.0378
2     B       0  5700  0.0156
3     B       1    90  0.0156
4     C       0  5000  0.0196
5     C       1   100  0.0196
I know that I should group my DataFrame by brand and then apply some function to each group (since I want to keep all rows in my final result I think I should use transform here). I tested a couple of things but without any success. Any help is appreciated.
First sort by brand and target so the target == 1 row is the last row of each group, then divide that last value by the group sum in GroupBy.transform with a lambda function:
df = df.sort_values(['brand','target'])
df['ratio'] = df.groupby('brand')['freq'].transform(lambda x: x.iat[-1] / x.sum())
print (df)
  brand  target  freq     ratio
0     A       0  5600  0.037801
1     A       1   220  0.037801
2     B       0  5700  0.015544
3     B       1    90  0.015544
4     C       0  5000  0.019608
5     C       1   100  0.019608
Or divide the Series created by GroupBy.last and GroupBy.sum:
df = df.sort_values(['brand','target'])
g = df.groupby('brand')['freq']
df['ratio'] = g.transform('last').div(g.transform('sum'))
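A sorting-free alternative (my sketch, not from the answers above): sum only the target == 1 frequencies per brand and divide by each brand's total, so row order never matters:
# freq where target == 1, else 0; transform('sum') broadcasts per-brand totals
pos = df['freq'].where(df['target'].eq(1), 0)
df['ratio'] = pos.groupby(df['brand']).transform('sum') \
    / df.groupby('brand')['freq'].transform('sum')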
I am trying to groupby-aggregate a dataframe using lambda functions that are being created programmatically, so that I can simulate a one-hot encoder of the categories present in a column.
Dataframe:
import numpy as np

df = pd.DataFrame(np.array([[10, 'A'], [10, 'B'], [20, 'A'], [30, 'B']]),
                  columns=['ID', 'category'])
ID category
10 A
10 B
20 A
30 B
Expected result:
ID A B
10 1 1
20 1 0
30 0 1
What I am trying:
one_hot_columns = ['A','B']
lambdas = [lambda x: 1 if x.eq(column).any() else 0 for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
Result:
ID A B
10 1 1
20 0 0
30 1 1
But the above is not quite the expected result. Not sure what I am doing wrong.
I know I could do this with get_dummies, but using lambdas is more convenient for automation. Also, I can ensure the order of the output columns.
Use crosstab:
pd.crosstab(df.ID, df['category']).reset_index()
Output:
category ID A B
0 10 1 1
1 20 1 0
2 30 0 1
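One caveat worth adding here: crosstab counts occurrences, so a repeated (ID, category) pair would produce values greater than 1. If strict 0/1 indicators are needed, the counts can be capped:
# .gt(0) turns any positive count into True, astype(int) back into 0/1
res = pd.crosstab(df.ID, df['category']).gt(0).astype(int).reset_index()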
You can use pd.get_dummies with GroupBy.sum:
In [4331]: res = pd.get_dummies(df, columns=['category']).groupby('ID', as_index=False).sum()
In [4332]: res
Out[4332]:
ID category_A category_B
0 10 1 1
1 20 1 0
2 30 0 1
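If you want the plain A/B column names with this approach (an assumption about the desired output, not part of the answer), one option is stripping the prefix get_dummies adds:
# rename 'category_A' -> 'A', 'category_B' -> 'B'
res.columns = res.columns.str.replace('category_', '', regex=False)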
Or use pd.concat with pd.get_dummies:
In [4329]: res = pd.concat([df, pd.get_dummies(df.category)], axis=1).groupby('ID', as_index=False).sum()
In [4330]: res
Out[4330]:
ID A B
0 10 1 1
1 20 1 0
2 30 0 1
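As an aside on why the lambda attempt returned the same values in both columns: Python closures look up the loop variable when the lambda is called, not when it is created, so every lambda ends up testing the last value of one_hot_columns ('B'). Binding the value as a default argument is the usual fix; a sketch of the corrected version:
one_hot_columns = ['A', 'B']
# col=column freezes the current loop value inside each lambda
lambdas = [lambda x, col=column: int(x.eq(col).any()) for column in one_hot_columns]
df_g = df.groupby('ID').category.agg(lambdas)
df_g.columns = one_hot_columns  # agg names the results '<lambda_0>', '<lambda_1>', ...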
I have data similar to this:
data = {'A': [10,20,30,10,-10], 'B': [100,200,300,100,-100], 'C':[1000,2000,3000,1000, -1000]}
df = pd.DataFrame(data)
df
    A    B     C
0  10  100  1000
1  20  200  2000
2  30  300  3000
3  10  100  1000
4 -10 -100 -1000
Here the rows at index 0, 3 and 4 are equal in absolute value, but one of them is negative. For such scenarios I want a fourth column D populated with the value 'Exact opposite' for index 3 and 4 (any one of the duplicate rows).
Similar to:
    A    B     C               D
0  10  100  1000
1  20  200  2000
2  30  300  3000
3  10  100  1000  Exact opposite
4 -10 -100 -1000  Exact opposite
One approach I can think of is adding a column that sums the values of all the columns:
column_names=['A','B','C']
df['Sum Val'] = df[column_names].sum(axis=1)
    A    B     C  Sum Val
0  10  100  1000     1110
1  20  200  2000     2220
2  30  300  3000     3330
3  10  100  1000     1110
4 -10 -100 -1000    -1110
and then checking whether any negative value has a corresponding equal positive value, but I could not proceed from there. Please pardon any mistakes.
Like this maybe:
In [69]: import numpy as np

# Create column 'D' flagging rows that duplicate another row once signs are ignored
In [68]: df['D'] = np.where(df.abs().duplicated(keep=False), 'Duplicate', '')

# If the flagged rows sum to 0, they cancel out, i.e. they are 'exact opposite'
In [78]: ix = df.index[df.D.eq('Duplicate')]
    ...: if df.loc[ix, ['A', 'B', 'C']].sum(axis=1).sum() == 0:
    ...:     df.loc[ix, 'D'] = 'Exact Opposite'
    ...:
In [79]: df
Out[79]:
A B C D
0 10 100 1000 Exact Opposite
1 20 200 2000
2 30 300 3000
3 -10 -100 -1000 Exact Opposite
To follow your logic, just add abs with groupby, so the output returns the paired indices as a list:
df.reset_index().groupby(df['Sum Val'].abs())['index'].agg(list)
Out[367]:
Sum Val
1110    [0, 3, 4]
2220          [1]
3330          [2]
Name: index, dtype: object
import pandas as pd

data = {'A': [10, 20, 30, -10], 'B': [100, 200, 300, -100], 'C': [1000, 2000, 3000, -1000]}
df = pd.DataFrame(data)
print(df)

# Sum each row, then label rows whose total has an exact negative counterpart
df['total'] = df.sum(axis=1)
df['total'] = df['total'].apply(lambda x: "Exact opposite" if sum(df['total'] == -1 * x) else "")
print(df)
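For the exact expected output on the question's five-row data, here is a further sketch (my addition; it assumes numpy is imported as np): flag a row when its element-wise negation exists in the frame, and keep only one of several identical positive rows, per the "(any one value)" note:
vals = df[['A', 'B', 'C']]
rows = set(map(tuple, vals.values))
# True where the exact negation of this row appears somewhere in df
has_negation = vals.apply(lambda r: tuple(-r) in rows, axis=1)
# duplicated(keep='last') drops the earlier copy of identical rows, so only
# index 3 and 4 end up labelled, matching the expected output
df['D'] = np.where(has_negation & ~vals.duplicated(keep='last'), 'Exact opposite', '')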
I'm trying to find a way to do a more advanced group by aggregate in pandas. For example:
d = {'name': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'e'], 'amount': [2, 5, 2, 3, 7, 2, 4, 1]}
df = pd.DataFrame(data=d)
df_per_category = df.groupby(['name']) \
    .agg({'amount': ['count', 'sum']}) \
    .sort_values(by=[('amount', 'count')], ascending=False)
df_per_category[('amount', 'sum')].plot.barh()
df_per_category
Produces:
       amount     
        count  sum
name              
b           3   12
a           2    7
c           1    2
d           1    4
e           1    1
When you have a dataset where 70% of the items have just one count and 30% have multiple counts, it would be nice if you could group that 70%. To begin simply: group all the records that have just one count and put them under a name like other. The result would look like:
       amount     
        count  sum
name              
b           3   12
a           2    7
other       3    7
Is there a pandas way to do this? Right now I'm thinking of just looping through my aggregate result and creating a new dataframe manually.
Current solution:
name = []
count = []
amount = []
# bucket upper bound -> [count, amount] accumulators
aggregates = {
    5: [0, 0],
    10: [0, 0],
    25: [0, 0],
    50: [0, 0],
}
l = list(aggregates)
first_aggregates = l
last_aggregate = l[-1] + 1
aggregates.update({last_aggregate: [0, 0]})

def aggregate_small_values(c):
    n = c.name
    s = c[('amount', 'sum')]
    c = c[('amount', 'count')]
    if c <= 2:
        # fold small groups into the first bucket their sum fits in
        if s < last_aggregate:
            for a in first_aggregates:
                if s <= a:
                    aggregates[a][0] += c
                    aggregates[a][1] += s
                    break
        else:
            aggregates[last_aggregate][0] += c
            aggregates[last_aggregate][1] += s
    else:
        # groups with more than two records keep their own row
        name.append(n)
        count.append(c)
        amount.append(s)

df_per_category.apply(aggregate_small_values, axis=1)

for a in first_aggregates:
    name.append(f'{a} and smaller')
    count.append(aggregates[a][0])
    amount.append(aggregates[a][1])
name.append(f'{last_aggregate} and bigger')
count.append(aggregates[last_aggregate][0])
amount.append(aggregates[last_aggregate][1])

df_agg = pd.DataFrame(index=name, data={'count': count, 'amount': amount})
df_agg.plot.barh(title='Boodschappen 2021')
df_agg
yields the bucketed DataFrame, plotted as a horizontal bar chart (plot not shown).
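For comparison, a more pandas-native sketch of the same bucketing (my rewrite, assuming df_per_category from above and numpy imported as np): pd.cut assigns each small group's sum to a bucket, and grouping by the resulting categorical keeps empty buckets at zero, like the loop does:
sums = df_per_category[('amount', 'sum')]
counts = df_per_category[('amount', 'count')]
small = counts <= 2

# bucket edges mirror the `aggregates` keys above
buckets = pd.cut(sums[small], bins=[0, 5, 10, 25, 50, np.inf],
                 labels=['5 and smaller', '10 and smaller', '25 and smaller',
                         '50 and smaller', '51 and bigger'])
df_agg = pd.concat([
    pd.DataFrame({'count': counts[~small], 'amount': sums[~small]}),
    pd.DataFrame({'count': counts[small], 'amount': sums[small]}).groupby(buckets).sum(),
])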
If you need to replace name by other when its count is 1, use Series.duplicated with keep=False:
df.loc[~df['name'].duplicated(keep=False), 'name'] = 'other'
print (df)
name amount
0 a 2
1 a 5
2 b 2
3 b 3
4 b 7
5 other 2
6 other 4
7 other 1
If you need to replace by percentages (here anything below 20% is set to other), use Series.value_counts with normalize=True and then Series.map to build a mask the same size as the original df:
s = df['name'].value_counts(normalize=True)
print (s)
b 0.375
a 0.250
d 0.125
e 0.125
c 0.125
Name: name, dtype: float64
df.loc[df['name'].map(s).lt(0.2), 'name'] = 'other'
print (df)
name amount
0 a 2
1 a 5
2 b 2
3 b 3
4 b 7
5 other 2
6 other 4
7 other 1
To filter by counts (here below 3):
s = df['name'].value_counts()
print (s)
b 3
a 2
d 1
e 1
c 1
Name: name, dtype: int64
df.loc[df['name'].map(s).lt(3), 'name'] = 'other'
print (df)
name amount
0 other 2
1 other 5
2 b 2
3 b 3
4 b 7
5 other 2
6 other 4
7 other 1
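To get from there to the aggregated table asked for in the question, a sketch combining the count-based replacement with the original groupby (assuming the unmodified df):
# replace names that occur only once by 'other', then aggregate as before
counts = df['name'].map(df['name'].value_counts())
df_per_category = (df.assign(name=df['name'].where(counts > 1, 'other'))
                     .groupby('name')
                     .agg(count=('amount', 'count'), sum=('amount', 'sum'))
                     .sort_values('count', ascending=False))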
I am looking for a solution that produces column C (first row = 1000) within a function framework, i.e. without iteration but with vectorization:
index    A    B     C
1        0    0  1000
2      100    0   900
3        0    0   900
4        0  200  1100
5        0    0  1100
The function should look similar to this:
def calculate(self):
    df = pd.DataFrame()
    df['A'] = self.some_value
    df['B'] = self.some_other_value
    df['C'] = df['C'].shift(1) - df['A'] + df['B']  # ...
But the reference to the prior row does not work. What can be done to accomplish the task outlined?
This should work:
df['C'] = 1000 + (-df['A'] + df['B']).cumsum()
df
Out[80]:
A B C
0 0 0 1000
1 100 0 900
2 0 0 900
3 0 200 1100
4 0 0 1100
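Why this works (my explanation, not part of the original answer): each C value is a running balance, C[i] = start + the sum of (B - A) over all rows up to i, and cumsum computes exactly that running sum. A quick check against an explicit loop, assuming the starting balance of 1000:
# rebuild C row by row and compare with the vectorized result
expected = []
bal = 1000
for a, b in zip(df['A'], df['B']):
    bal += b - a
    expected.append(bal)
assert (df['C'] == expected).all()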