Python DataFrame: compute season-wise performance compared to a reference (100%)

I have a dataframe with a Season column. Each season has two variety items. Within each season I want to compute the performance of every item relative to item A, where item A's performance is marked as 100%.
code:
xdf = pd.DataFrame({'Season':[1,1,2,2,3,3],'item':['A','B','A','B','A','B'],'value':[25,30,50,75,40,60]})
xdf =
Season item value
0 1 A 25
1 1 B 30
2 2 A 50
3 2 B 75
4 3 A 40
5 3 B 60
Expected answer:
xdf =
Season item value %value
0 1 A 25 100 # This value(25) is taken as reference i.e. 100%
1 1 B 30 120 # Now value of `B` in Season 1 is 30. So, performance % = 30/25 *100
2 2 A 50 100 # This value(50) is taken as reference i.e. 100%
3 2 B 75 150 # Now value of `B` in Season 2 is 75. So, performance % = 75/50 *100
4 3 A 40 100 # This value(40) is taken as reference i.e. 100%
5 3 B 60 150 # Now value of `B` in Season 3 is 60. So, performance % = 60/40 *100

Let us create a MultiIndex on Season and item in order to simplify the calculation:
s = xdf.set_index(['Season', 'item'])['value']
xdf['%value'] = s.div(s.xs('A', level=1)).mul(100).tolist()
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
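For reference, the same ratios can also be computed without the intermediate MultiIndex; a minimal sketch (my own variant, assuming exactly one 'A' row per Season):
# broadcast each season's 'A' value to every row of that season, then take the ratio
ref = xdf['value'].where(xdf['item'].eq('A')).groupby(xdf['Season']).transform('first')
xdf['%value'] = xdf['value'].div(ref).mul(100)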

If your rows are correctly ordered ('A' before 'B'), you can use pct_change. If needed sort by item first.
xdf['%value'] = xdf.groupby('Season')['value'].pct_change().fillna(0) * 100 + 100
print(xdf)
# Output
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
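Note that pct_change compares each row with the immediately preceding row, so this only expresses items relative to 'A' when each season has exactly two rows ordered 'A' first; with more items per season, the groupby/transform or MultiIndex approaches above are the safer choice.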

Hopefully this loop-based approach solves your problem.
df = xdf.copy()
for i in xdf['Season'].unique():
    _df = xdf[xdf['Season'] == i]
    idx = _df.iloc[1:].index      # rows other than the first (reference) row
    val = _df.iloc[0].value       # reference value (the season's first row, item 'A' here)
    for j in idx:
        cv = xdf.iloc[j].value
        xdf.at[j, 'value'] = (100 / val) * cv
xdf.loc[xdf.groupby('Season')['value'].head(1).index, 'value'] = 100   # reference rows become 100
df['%value'] = xdf['value']
df

Related

Last row of some column in dataframe not included

I am trying to compute the average of value_b over each run of rows with the same column_a value (e.g. a run of 0s), just before it changes to another value.
An example of the dataframe:
column_a  value_b  sum_c  count_d_  avg_e
       0       10     10         1
       0       20     30         2
       0       30     60         3     20
       1       10     10         1
       1       20     30         2
       1       30     60         3     20
       0       10     10         1
       0       20     30         2     15
       1       10     10         1
       1       20     30         2
       1       30     60         3     20
       0       10     10         1
       0       20
However, sum and count are unavailable for the last row, so the average cannot be calculated for it.
part of the code...
# sum and avg for each section
for i, row in df.iloc[0:-1].iterrows():
    if df['column_a'][i] == 0:
        sum = sum + df['value_b'][i]
        df['sum_c'][i] = sum
        count = count + 1
        df['count_d'][i] = count
    else:
        sum = 0
        count = 0
        df['sum_c'][i] = sum
        df['count_d'][i] = count

totcount = 0
for m, row in df.iloc[0:-1].iterrows():
    if df.loc[m, 'column_a'] == 0:
        if df.loc[m+1, 'sum_c'] == 0:
            totcount = df.loc[m, 'count_d']
            avg_e = df.loc[m, 'sum_c'] / totcount
            df.loc[m, 'avg_e'] = avg_e
I have tried using df.iloc[0:].iterrows() instead, but it produces an error.
You can rewrite your full code with groupby.cumsum, groupby.cumcount, groupby.transform('mean') and masking with where.
# compute a mask with True for the last value per successive group
m = df['column_a'].ne(df['column_a'].shift(-1))
# make a grouper (cumulative sum over the reversed mask)
group = m[::-1].cumsum()
# for each group
g = df.groupby(group)['value_b']
# compute the cumsum
df['sum_c'] = g.cumsum()
# compute the cumcount
df['count_d_'] = g.cumcount().add(1)
# compute the mean and assign to the last row per group
df['avg_e'] = g.transform('mean').where(m)
Output:
column_a value_b sum_c count_d_ avg_e
0 0 10 10 1 NaN
1 0 20 30 2 NaN
2 0 30 60 3 20.0
3 1 10 10 1 NaN
4 1 20 30 2 NaN
5 1 30 60 3 20.0
6 0 10 10 1 NaN
7 0 20 30 2 15.0
8 1 10 10 1 NaN
9 1 20 30 2 NaN
10 1 30 60 3 20.0
11 0 10 10 1 NaN
12 0 20 30 2 15.0
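For reference, on the sample data the grouper takes these values (the labels run from the end of the frame, which groupby does not care about):
# column_a: 0 0 0 1 1 1 0 0 1 1 1 0 0
# group   : 5 5 5 4 4 4 3 3 2 2 2 1 1
Each run of identical column_a values gets its own label, so cumsum and cumcount restart at every change and transform('mean') averages within the run.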
It is the expected behavior of df.iloc[0:-1] to return all the rows except the last one. When using slicing, remember that the stop index you provide is not included in the returned range. Since -1 is the index of the last row, [0:-1] excludes the last row.
The solution given by @mozway is more elegant anyway, but if for any reason you still want to use iterrows(), you can use df.iloc[0:].
The error you got when you did so is likely due to df.loc[m+1, 'sum_c']. At the last row, m+1 does not exist in the index, so that lookup fails with a KeyError.
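A minimal illustration of the slicing behaviour, using a small throwaway frame rather than the question's data:
import pandas as pd
demo = pd.DataFrame({'x': [10, 20, 30]})
demo.iloc[0:-1]   # rows 0 and 1 only; the stop position (-1, the last row) is excluded
demo.iloc[0:]     # all three rows, including the last one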

How to drop rows with a value less than a percentage of the maximum per group

I have a pandas dataframe with a time series of a signal with some peaks identified:
Time (s) Intensity Peak
1 1 a
2 10 a
3 30 a
4 100 a
5 40 a
6 20 a
7 2 a
1 20 b
2 100 b
3 300 b
4 80 b
5 20 b
6 2 b
I would like to drop the rows where the Intensity value is less than 10% of the maximum Intensity value for each peak (10% of the maximum is 10 for peak a and 30 for peak b), in order to obtain:
Time (s) Intensity Peak
3 30 a
4 100 a
5 40 a
6 20 a
2 100 b
3 300 b
4 80 b
How would I do that? I tried looking for a groupby function that would do that but I just cannot seem to find something that fits.
Thank you!
Use groupby to generate a mask:
filtered = df[df.groupby('Peak')['Intensity'].apply(lambda x: x > x.max() / 10)]
Output:
>>> filtered
Time(s) Intensity Peak
2 3 30 a
3 4 100 a
4 5 40 a
5 6 20 a
8 2 100 b
9 3 300 b
10 4 80 b
You could use GroupBy.transform with max to get max from each group and take 10% using Series.div. Now, compare that with df['Intensity'] and use it for boolean indexing.
max_vals = df.groupby('Peak')['Intensity'].transform('max').div(10)
mask = df['Intensity'] > max_vals
df[mask]
# Time (s) Intensity Peak
# 2 3 30 a
# 3 4 100 a
# 4 5 40 a
# 5 6 20 a
# 8 2 100 b
# 9 3 300 b
# 10 4 80 b

In Python, how to count the number of elements in multiple columns that are higher than the element in one column?

I'd like to count, for each row, the elements in columns 1 to 3 that are greater than the element in column 0. For example, I have a dataframe as below.
a b c d
0 50 60 40 20.0
1 40 10 30 40.0
2 30 40 20 35.0
3 20 0 30 25.0
4 10 5 40 NaN
And I want to count elements in columns b,c,d that are greater than element in column a. So the result should be as below.
a b c d count
0 50 60 40 20.0 1
1 40 10 30 40.0 0
2 30 40 20 35.0 2
3 20 0 30 25.0 2
4 10 5 40 NaN 1
Thank you for the help in advance.
Use DataFrame.gt along axis=0 to create a boolean mask, then use DataFrame.sum along axis=1 on this mask to get the count:
df['count'] = df[['b', 'c', 'd']].gt(df['a'], axis=0).sum(axis=1)
Result:
a b c d count
0 50 60 40 20.0 1
1 40 10 30 40.0 0
2 30 40 20 35.0 2
3 20 0 30 25.0 2
4 10 5 40 NaN 1
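For clarity, this is the intermediate boolean mask for the sample data; comparisons against NaN evaluate to False, so missing values are never counted:
df[['b', 'c', 'd']].gt(df['a'], axis=0)
#        b      c      d
# 0   True  False  False
# 1  False  False  False
# 2   True  False   True
# 3  False   True   True
# 4  False   True  False
Summing each row of this mask (True counts as 1) gives the count column shown above.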
For each of the columns involved, we can do the comparison and convert the boolean results to int (1 for True, 0 for False):
def greater_value(df, reference, column):
    return (df[column] > df[reference]).astype(int)
And then add up the results:
df['count'] = greater_value(df, 'a', 'b') + greater_value(df, 'a', 'c') + greater_value(df, 'a', 'd')
Or generalize across a supplied set of column names:
def count_greater(df, reference, *columns):
    return sum(greater_value(df, reference, column) for column in columns)
df['count'] = count_greater(df, 'a', 'b', 'c', 'd')

Pandas: calculate aggregate value with respect to the current row

Let's say we have this data:
df = pd.DataFrame({
    'group_id': [100, 100, 100, 101, 101, 101, 101],
    'amount': [30, 40, 10, 20, 25, 80, 40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
It looks like this:
amount
group_id id
100 0 30
1 40
2 10
101 3 20
4 25
5 80
6 40
The goal is to compute a new column that is, for each row, the sum of all amounts in the same group that are smaller than the current one. I.e. we want this result:
amount sum_of_smaller_amounts
group_id id
100 0 30 10
1 40 40 # 30 + 10
2 10 0 # smallest amount
101 3 20 0 # smallest
4 25 20
5 80 85 # 20 + 25 + 40
6 40 45 # 20 + 25
Ideally this should be (very) efficient as the real dataframe could be millions of rows.
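As a sanity check against the answers below, here is a brute-force sketch of the definition itself (my own, quadratic per group, so only suitable for validating on small data):
df['sum_of_smaller_amounts'] = (
    df.groupby('group_id')['amount']
      .transform(lambda s: s.apply(lambda v: s[s < v].sum()))
)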
Better solution (I think):
df_sort = df.sort_values('amount')
df['sum_smaller_amount'] = (df_sort.groupby('group_id')['amount']
                            .transform(lambda x: x.mask(x.duplicated(), 0).cumsum())
                            - df['amount'])
Output:
amount sum_smaller_amount
group_id id
100 0 30 10.0
1 40 40.0
2 10 0.0
101 3 20 0.0
4 25 20.0
5 80 85.0
6 40 45.0
Another way to do this to use a cartesian product and filter:
df.merge(df.reset_index(), on='group_id', suffixes=('_sum_smaller',''))\
.query('amount_sum_smaller < amount')\
.groupby(['group_id','id'])[['amount_sum_smaller']].sum()\
.join(df, how='right').fillna(0)
Output:
amount_sum_smaller amount
group_id id
100 0 10.0 30
1 40.0 40
2 0.0 10
101 3 0.0 20
4 20.0 25
5 85.0 80
6 45.0 40
You want sort_values and cumsum:
df['new_amount'] = (df.sort_values('amount')
                      .groupby(level='group_id')['amount']
                      .cumsum() - df['amount'])
Output:
amount new_amount
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
Update: fix for repeated values:
# the data
df = pd.DataFrame({
    'group_id': [100, 100, 100, 100, 101, 101, 101, 101],
    'amount': [30, 40, 10, 30, 20, 25, 80, 40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
# sort values:
df_sorted = df.sort_values('amount')
# cumsum
s1 = df_sorted.groupby('group_id')['amount'].cumsum()
# value counts
s2 = df_sorted.groupby(['group_id', 'amount']).cumcount() + 1
# instead of just subtracting df['amount'], we subtract amount * counts
df['new_amount'] = s1 - df['amount'].mul(s2)
Output (note the two values 30 in group 100)
amount new_amount
group_id id
100 0 30 10
1 40 70
2 10 0
3 30 10
101 4 20 0
5 25 20
6 80 85
7 40 45
I'm intermediate with pandas and not sure about efficiency, but here's a solution:
temp_df = df.sort_values(['group_id','amount'])
temp_df = temp_df.mask(temp_df['amount'] == temp_df['amount'].shift(), other=0).groupby(level='group_id').cumsum()
df['sum'] = temp_df.sort_index(level='id')['amount'] - df['amount']
Result:
amount sum
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
7 40 45
You can substitute the last line with these if they help efficiency somehow:
df['sum'] = df.subtract(temp_df).multiply(-1)
# or
df['sum'] = (~df).add(temp_df + 1)

Conditional groupby and shift in pandas

I have a dataframe that looks like this, which stores the order-by-order information of an order book. Type N = new order, D = delete order and E = execution. The same order_id might be reused.
So basically the problem is that delete and execution rows do not carry a proper price; it should be inferred from the last new order with the same order_id. Could someone suggest a method to achieve this? Thank you.
Input
type order_id price
0 N 10 99
1 E 10 0
1 E 10 0
1 D 10 0
0 N 11 98
1 N 10 97
1 D 10 0
Output
type order_id price
0 N 10 99
1 E 10 **99**
1 E 10 **99**
1 D 10 **99**
0 N 11 98
1 N 10 97
1 D 10 **97**
Seems like you need replace + ffill; here I am assuming your df is in the correct order.
df.replace(0,np.nan).ffill()
Out[758]:
type order_id price
0 N 10 99.0
1 E 10 99.0
1 E 10 99.0
1 D 10 99.0
0 N 11 98.0
1 N 10 97.0
1 D 10 97.0
Or, adding groupby:
df.assign(price=df.price.replace(0,np.nan)).groupby(df.type.eq('N').cumsum()).price.ffill().values
I think you need:
df['price'] = df['price'].mask(df['type'].isin(['E','D']))
#df['price'] = df['price'].where(df['type'] == 'N')
df['price'] = df.groupby(df['order_id'].ne(df['order_id'].shift()).cumsum())['price'].ffill()
print (df)
type order_id price
0 N 10 99.0
1 E 10 99.0
1 E 10 99.0
1 D 10 99.0
0 N 11 98.0
1 N 10 97.0
1 D 10 97.0
Explanation:
First, replace the price values that need filling with NaN, using mask (or the inverted condition with where).
Then group by a helper Series built from consecutive order_id values and forward-fill the NaNs with DataFrameGroupBy.ffill.
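To make the helper Series concrete, for the sample input it looks like this:
df['order_id'].ne(df['order_id'].shift()).cumsum()
# order_id: 10 10 10 10 11 10 10
# grouper :  1  1  1  1  2  3  3
The re-used order_id 10 at the end gets its own group (3), so ffill fills its 'D' row with 97 rather than the earlier 99.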
