Python DataFrame: compute season-wise performance compared to a reference (100%)

I have a dataframe with a Season column. Each season has two variety items. Within each season I want to compute the performance of every item relative to item A, where item A's performance is marked as 100%.
code:
xdf = pd.DataFrame({'Season':[1,1,2,2,3,3],'item':['A','B','A','B','A','B'],'value':[25,30,50,75,40,60]})
xdf =
Season item value
0 1 A 25
1 1 B 30
2 2 A 50
3 2 B 75
4 3 A 40
5 3 B 60
Expected answer:
xdf =
Season item value %value
0 1 A 25 100 # This value(25) is taken as reference i.e. 100%
1 1 B 30 120 # Now value of `B` in Season 1 is 30. So, performance % = 30/25 *100
2 2 A 50 100 # This value(50) is taken as reference i.e. 100%
3 2 B 75 150 # Now value of `B` in Season 2 is 75. So, performance % = 75/50 *100
4 3 A 40 100 # This value(40) is taken as reference i.e. 100%
5 3 B 60 150 # Now value of `B` in Season 3 is 60. So, performance % = 60/40 *100

Let us create a MultiIndex on Season and item in order to simplify the calculation:
s = xdf.set_index(['Season', 'item'])['value']
xdf['%value'] = s.div(s.xs('A', level=1)).mul(100).tolist()
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
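For reference, the same ratios can also be computed without the intermediate MultiIndex; a minimal sketch (my own variant, assuming exactly one 'A' row per Season):
# broadcast each season's 'A' value to every row of that season, then take the ratio
ref = xdf['value'].where(xdf['item'].eq('A')).groupby(xdf['Season']).transform('first')
xdf['%value'] = xdf['value'].div(ref).mul(100)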

If your rows are correctly ordered ('A' before 'B'), you can use pct_change. If needed sort by item first.
xdf['%value'] = xdf.groupby('Season')['value'].pct_change().fillna(0) * 100 + 100
print(xdf)
# Output
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
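Note that pct_change compares each row with the immediately preceding row, so this only expresses items relative to 'A' when each season has exactly two rows ordered 'A' first; with more items per season, the groupby/transform or MultiIndex approaches above are the safer choice.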

Hopefully this loop-based approach solves your problem.
df = xdf.copy()
for i in xdf['Season'].unique():
    _df = xdf[xdf['Season'] == i]
    idx = _df.iloc[1:].index      # rows other than the first (reference) row
    val = _df.iloc[0].value       # reference value (the season's first row, item 'A' here)
    for j in idx:
        cv = xdf.iloc[j].value
        xdf.at[j, 'value'] = (100 / val) * cv
xdf.loc[xdf.groupby('Season')['value'].head(1).index, 'value'] = 100   # reference rows become 100
df['%value'] = xdf['value']
df

Related

Last row of some column in dataframe not included

I am trying to compute the average of value_b over each run of rows with the same column_a value (e.g. a run of 0s), just before it changes to another value.
An example of the dataframe:
column_a  value_b  sum_c  count_d_  avg_e
       0       10     10         1
       0       20     30         2
       0       30     60         3     20
       1       10     10         1
       1       20     30         2
       1       30     60         3     20
       0       10     10         1
       0       20     30         2     15
       1       10     10         1
       1       20     30         2
       1       30     60         3     20
       0       10     10         1
       0       20
However, sum and count are unavailable for the last row, so the average cannot be calculated for it.
part of the code...
# sum and avg for each section
for i, row in df.iloc[0:-1].iterrows():
    if df['column_a'][i] == 0:
        sum = sum + df['value_b'][i]
        df['sum_c'][i] = sum
        count = count + 1
        df['count_d'][i] = count
    else:
        sum = 0
        count = 0
        df['sum_c'][i] = sum
        df['count_d'][i] = count

totcount = 0
for m, row in df.iloc[0:-1].iterrows():
    if df.loc[m, 'column_a'] == 0:
        if df.loc[m+1, 'sum_c'] == 0:
            totcount = df.loc[m, 'count_d']
            avg_e = df.loc[m, 'sum_c'] / totcount
            df.loc[m, 'avg_e'] = avg_e
I have tried using df.iloc[0:].iterrows() instead, but it produces an error.
You can rewrite your full code with groupby.cumsum, groupby.cumcount, groupby.transform('mean') and masking with where.
# compute a mask with True for the last value per successive group
m = df['column_a'].ne(df['column_a'].shift(-1))
# make a grouper (cumulative sum over the reversed mask)
group = m[::-1].cumsum()
# for each group
g = df.groupby(group)['value_b']
# compute the cumsum
df['sum_c'] = g.cumsum()
# compute the cumcount
df['count_d_'] = g.cumcount().add(1)
# compute the mean and assign to the last row per group
df['avg_e'] = g.transform('mean').where(m)
Output:
column_a value_b sum_c count_d_ avg_e
0 0 10 10 1 NaN
1 0 20 30 2 NaN
2 0 30 60 3 20.0
3 1 10 10 1 NaN
4 1 20 30 2 NaN
5 1 30 60 3 20.0
6 0 10 10 1 NaN
7 0 20 30 2 15.0
8 1 10 10 1 NaN
9 1 20 30 2 NaN
10 1 30 60 3 20.0
11 0 10 10 1 NaN
12 0 20 30 2 15.0
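For reference, on the sample data the grouper takes these values (the labels run from the end of the frame, which groupby does not care about):
# column_a: 0 0 0 1 1 1 0 0 1 1 1 0 0
# group   : 5 5 5 4 4 4 3 3 2 2 2 1 1
Each run of identical column_a values gets its own label, so cumsum and cumcount restart at every change and transform('mean') averages within the run.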
It is the expected behavior of df.iloc[0:-1] to return all the rows except the last one. When using slicing, remember that the stop index you provide is not included in the returned range. Since -1 is the index of the last row, [0:-1] excludes the last row.
The solution given by @mozway is more elegant anyway, but if for any reason you still want to use iterrows(), you can use df.iloc[0:].
The error you got when you did so is likely due to df.loc[m+1, 'sum_c']. At the last row, m+1 does not exist in the index, so that lookup fails with a KeyError.
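A minimal illustration of the slicing behaviour, using a small throwaway frame rather than the question's data:
import pandas as pd
demo = pd.DataFrame({'x': [10, 20, 30]})
demo.iloc[0:-1]   # rows 0 and 1 only; the stop position (-1, the last row) is excluded
demo.iloc[0:]     # all three rows, including the last one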

How to drop rows with a value less than a percentage of the maximum per group

I have a pandas dataframe with a time series of a signal with some peaks identified:
Time (s) Intensity Peak
1 1 a
2 10 a
3 30 a
4 100 a
5 40 a
6 20 a
7 2 a
1 20 b
2 100 b
3 300 b
4 80 b
5 20 b
6 2 b
I would like to drop the rows where the Intensity value is less than 10% of the maximum Intensity value for each peak (10% of the maximum is 10 for peak a and 30 for peak b), in order to obtain:
Time (s) Intensity Peak
3 30 a
4 100 a
5 40 a
6 20 a
2 100 b
3 300 b
4 80 b
How would I do that? I tried looking for a groupby function that would do that but I just cannot seem to find something that fits.
Thank you!
Use groupby to generate a mask:
filtered = df[df.groupby('Peak')['Intensity'].apply(lambda x: x > x.max() / 10)]
Output:
>>> filtered
Time(s) Intensity Peak
2 3 30 a
3 4 100 a
4 5 40 a
5 6 20 a
8 2 100 b
9 3 300 b
10 4 80 b
You could use GroupBy.transform with max to get max from each group and take 10% using Series.div. Now, compare that with df['Intensity'] and use it for boolean indexing.
max_vals = df.groupby('Peak')['Intensity'].transform('max').div(10)
mask = df['Intensity'] > max_vals
df[mask]
# Time (s) Intensity Peak
# 2 3 30 a
# 3 4 100 a
# 4 5 40 a
# 5 6 20 a
# 8 2 100 b
# 9 3 300 b
# 10 4 80 b

In Python, how to count the number of elements in multiple columns that are higher than the element in one column?

I'd like to count, for each row, the elements in columns 1 to 3 that are greater than the element in column 0. For example, I have a dataframe as below.
a b c d
0 50 60 40 20.0
1 40 10 30 40.0
2 30 40 20 35.0
3 20 0 30 25.0
4 10 5 40 NaN
And I want to count elements in columns b,c,d that are greater than element in column a. So the result should be as below.
a b c d count
0 50 60 40 20.0 1
1 40 10 30 40.0 0
2 30 40 20 35.0 2
3 20 0 30 25.0 2
4 10 5 40 NaN 1
Thank you for the help in advance.
Use DataFrame.gt along axis=0 to create a boolean mask, then use DataFrame.sum along axis=1 on this mask to get the count:
df['count'] = df[['b', 'c', 'd']].gt(df['a'], axis=0).sum(axis=1)
Result:
a b c d count
0 50 60 40 20.0 1
1 40 10 30 40.0 0
2 30 40 20 35.0 2
3 20 0 30 25.0 2
4 10 5 40 NaN 1
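For clarity, this is the intermediate boolean mask for the sample data; comparisons against NaN evaluate to False, so missing values are never counted:
df[['b', 'c', 'd']].gt(df['a'], axis=0)
#        b      c      d
# 0   True  False  False
# 1  False  False  False
# 2   True  False   True
# 3  False   True   True
# 4  False   True  False
Summing each row of this mask (True counts as 1) gives the count column shown above.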
For each of the columns involved, we can do the comparison and convert the boolean results to int (1 for True, 0 for False):
def greater_value(df, reference, column):
    return (df[column] > df[reference]).astype(int)
And then add up the results:
df['count'] = greater_value(df, 'a', 'b') + greater_value(df, 'a', 'c') + greater_value(df, 'a', 'd')
Or generalize across a supplied set of column names:
def count_greater(df, reference, *columns):
    return sum(greater_value(df, reference, column) for column in columns)
df['count'] = count_greater(df, 'a', 'b', 'c', 'd')

Pandas: calculate aggregate value with respect to the current row

Let's say we have this data:
df = pd.DataFrame({
    'group_id': [100, 100, 100, 101, 101, 101, 101],
    'amount': [30, 40, 10, 20, 25, 80, 40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
It looks like this:
amount
group_id id
100 0 30
1 40
2 10
101 3 20
4 25
5 80
6 40
The goal is to compute a new column that is, for each row, the sum of all amounts in the same group that are smaller than the current one. I.e. we want this result:
amount sum_of_smaller_amounts
group_id id
100 0 30 10
1 40 40 # 30 + 10
2 10 0 # smallest amount
101 3 20 0 # smallest
4 25 20
5 80 85 # 20 + 25 + 40
6 40 45 # 20 + 25
Ideally this should be (very) efficient as the real dataframe could be millions of rows.
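As a sanity check against the answers below, here is a brute-force sketch of the definition itself (my own, quadratic per group, so only suitable for validating on small data):
df['sum_of_smaller_amounts'] = (
    df.groupby('group_id')['amount']
      .transform(lambda s: s.apply(lambda v: s[s < v].sum()))
)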
Better solution (I think):
df_sort = df.sort_values('amount')
df['sum_smaller_amount'] = (df_sort.groupby('group_id')['amount']
                            .transform(lambda x: x.mask(x.duplicated(), 0).cumsum())
                            - df['amount'])
Output:
amount sum_smaller_amount
group_id id
100 0 30 10.0
1 40 40.0
2 10 0.0
101 3 20 0.0
4 25 20.0
5 80 85.0
6 40 45.0
Another way to do this to use a cartesian product and filter:
df.merge(df.reset_index(), on='group_id', suffixes=('_sum_smaller',''))\
.query('amount_sum_smaller < amount')\
.groupby(['group_id','id'])[['amount_sum_smaller']].sum()\
.join(df, how='right').fillna(0)
Output:
amount_sum_smaller amount
group_id id
100 0 10.0 30
1 40.0 40
2 0.0 10
101 3 0.0 20
4 20.0 25
5 85.0 80
6 45.0 40
You want sort_values and cumsum:
df['new_amount'] = (df.sort_values('amount')
                      .groupby(level='group_id')['amount']
                      .cumsum() - df['amount'])
Output:
amount new_amount
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
Update: fix for repeated values:
# the data
df = pd.DataFrame({
    'group_id': [100, 100, 100, 100, 101, 101, 101, 101],
    'amount': [30, 40, 10, 30, 20, 25, 80, 40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
# sort values:
df_sorted = df.sort_values('amount')
# cumsum
s1 = df_sorted.groupby('group_id')['amount'].cumsum()
# value counts
s2 = df_sorted.groupby(['group_id', 'amount']).cumcount() + 1
# instead of just subtracting df['amount'], we subtract amount * counts
df['new_amount'] = s1 - df['amount'].mul(s2)
Output (note the two values 30 in group 100)
amount new_amount
group_id id
100 0 30 10
1 40 70
2 10 0
3 30 10
101 4 20 0
5 25 20
6 80 85
7 40 45
I'm intermediate with pandas and not sure about efficiency, but here's a solution:
temp_df = df.sort_values(['group_id','amount'])
temp_df = temp_df.mask(temp_df['amount'] == temp_df['amount'].shift(), other=0).groupby(level='group_id').cumsum()
df['sum'] = temp_df.sort_index(level='id')['amount'] - df['amount']
Result:
amount sum
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
7 40 45
You can substitute the last line with these if they help efficiency somehow:
df['sum'] = df.subtract(temp_df).multiply(-1)
# or
df['sum'] = (~df).add(temp_df + 1)

Conditional groupby and shift in pandas

I have a dataframe that looks like this, which stores the order-by-order information of an order book. Type N = new order, D = delete order and E = execution. The same order_id might be reused.
So basically the problem is that delete and execution rows do not carry a proper price; it should be inferred from the last new order with the same order_id. Could someone suggest a method to achieve this? Thank you.
Input
type order_id price
0 N 10 99
1 E 10 0
1 E 10 0
1 D 10 0
0 N 11 98
1 N 10 97
1 D 10 0
Output
type order_id price
0 N 10 99
1 E 10 **99**
1 E 10 **99**
1 D 10 **99**
0 N 11 98
1 N 10 97
1 D 10 **97**
Seems like you need replace + ffill; here I am assuming your df is in the correct order.
df.replace(0,np.nan).ffill()
Out[758]:
type order_id price
0 N 10 99.0
1 E 10 99.0
1 E 10 99.0
1 D 10 99.0
0 N 11 98.0
1 N 10 97.0
1 D 10 97.0
Or, adding groupby:
df.assign(price=df.price.replace(0,np.nan)).groupby(df.type.eq('N').cumsum()).price.ffill().values
I think you need:
df['price'] = df['price'].mask(df['type'].isin(['E','D']))
#df['price'] = df['price'].where(df['type'] == 'N')
df['price'] = df.groupby(df['order_id'].ne(df['order_id'].shift()).cumsum())['price'].ffill()
print (df)
type order_id price
0 N 10 99.0
1 E 10 99.0
1 E 10 99.0
1 D 10 99.0
0 N 11 98.0
1 N 10 97.0
1 D 10 97.0
Explanation:
First, replace the price values that need filling with NaN, using mask (or the inverted condition with where).
Then group by a helper Series built from consecutive order_id values and forward-fill the NaNs with DataFrameGroupBy.ffill.
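To make the helper Series concrete, for the sample input it looks like this:
df['order_id'].ne(df['order_id'].shift()).cumsum()
# order_id: 10 10 10 10 11 10 10
# grouper :  1  1  1  1  2  3  3
The re-used order_id 10 at the end gets its own group (3), so ffill fills its 'D' row with 97 rather than the earlier 99.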
