I am trying to find the average of a value for each run of index 0, i.e. for every stretch of consecutive 0s in column_a before it switches to another index.
An example of the dataframe:
column_a  value_b  sum_c  count_d_  avg_e
       0       10     10         1
       0       20     30         2
       0       30     60         3     20
       1       10     10         1
       1       20     30         2
       1       30     60         3     20
       0       10     10         1
       0       20     30         2     15
       1       10     10         1
       1       20     30         2
       1       30     60         3     20
       0       10     10         1
       0       20
However, for the last row the sum and count are not yet available, so the avg cannot be calculated for it.
Part of the code:
# sum and avg for each section
for i, row in df.iloc[0:-1].iterrows():
    if df['column_a'][i] == 0:
        sum = sum + df['value_b'][i]
        df['sum_c'][i] = sum
        count = count + 1
        df['count_d'][i] = count
    else:
        sum = 0
        count = 0
        df['sum_c'][i] = sum
        df['count_d'][i] = count

totcount = 0
for m, row in df.iloc[0:-1].iterrows():
    if df.loc[m, 'column_a'] == 0:
        if df.loc[m+1, 'sum_c'] == 0:
            totcount = df.loc[m, 'count_d']
            avg_e = df.loc[m, 'sum_c'] / totcount
            df.loc[m, 'avg_e'] = avg_e
I have also tried using df.iloc[0:].iterrows(), but it produces an error.
You can rewrite your full code with groupby.cumsum, groupby.cumcount, groupby.transform('mean') and masking with where.
# compute a mask with True for the last value per successive group
m = df['column_a'].ne(df['column_a'].shift(-1))[::-1]
# make a grouper
group = m.cumsum()
# for each group
g = df.groupby(group)['value_b']
# compute the cumsum
df['sum_c'] = g.cumsum()
# compute the cumcount
df['count_d_'] = g.cumcount().add(1)
# compute the mean and assign to the last row per group
df['avg_e'] = g.transform('mean').where(m)
Output:
column_a value_b sum_c count_d_ avg_e
0 0 10 10 1 NaN
1 0 20 30 2 NaN
2 0 30 60 3 20.0
3 1 10 10 1 NaN
4 1 20 30 2 NaN
5 1 30 60 3 20.0
6 0 10 10 1 NaN
7 0 20 30 2 15.0
8 1 10 10 1 NaN
9 1 20 30 2 NaN
10 1 30 60 3 20.0
11 0 10 10 1 NaN
12 0 20 30 2 15.0
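For reference, the input frame used above can be rebuilt from the question's column_a/value_b values (a minimal sketch; sum_c, count_d_ and avg_e are then produced by the code):
import pandas as pd

# column_a / value_b taken from the question's example table
df = pd.DataFrame({
    'column_a': [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
    'value_b':  [10, 20, 30, 10, 20, 30, 10, 20, 10, 20, 30, 10, 20],
})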
It is the expected behavior of df.iloc[0:-1] to return all rows except the last one. When slicing, remember that the stop index you provide is not included in the returned range. Since -1 is the index of the last row, [0:-1] excludes it.
The solution given by #mozway is admittedly more elegant, but if for any reason you still want to use iterrows(), you can use df.iloc[0:].
The error you got when you did so is most likely due to df.loc[m+1, 'sum_c']: at the last row, m+1 is out of bounds and raises a KeyError.
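A small demonstration of the slicing behaviour, using a toy frame (a sketch, not taken from the original post):
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3, 4]})
# iloc slices are half-open: the stop position is excluded,
# so [0:-1] drops the last row while [0:] keeps every row
print(len(toy.iloc[0:-1]))  # 3
print(len(toy.iloc[0:]))    # 4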
I have a dataframe with seasons as a column. Each season has two variety items. I want to compute the season-wise performance of all items relative to item A, where item A's performance is marked as 100%.
code:
xdf = pd.DataFrame({'Season':[1,1,2,2,3,3],'item':['A','B','A','B','A','B'],'value':[25,30,50,75,40,60]})
xdf =
Season item value
0 1 A 25
1 1 B 30
2 2 A 50
3 2 B 75
4 3 A 40
5 3 B 60
Expected answer:
xdf =
Season item value %value
0 1 A 25 100 # This value(25) is taken as reference i.e. 100%
1 1 B 30 120 # Now value of `B` in Season 1 is 30. So, performance % = 30/25 *100
2 2 A 50 100 # This value(50) is taken as reference i.e. 100%
3 2 B 75 150 # Now value of `B` in Season 2 is 75. So, performance % = 75/50 *100
4 3 A 40 100 # This value(40) is taken as reference i.e. 100%
5 3 B 60 150 # Now value of `B` in Season 3 is 60. So, performance % = 60/40 *100
Let us create a MultiIndex on Season and item in order to simplify the calculation:
s = xdf.set_index(['Season', 'item'])['value']
xdf['%value'] = s.div(s.xs('A', level=1)).mul(100).tolist()
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
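To see why .tolist() is needed, it helps to look at the reference series (a sketch, reusing the xdf defined above): s.xs('A', level=1) keeps only the 'A' rows and is indexed by Season, so the division aligns on the shared Season level, but the result still carries the (Season, item) MultiIndex while xdf uses a default RangeIndex, hence the index is dropped with .tolist() before assignment.
s = xdf.set_index(['Season', 'item'])['value']
ref = s.xs('A', level=1)   # one reference value per Season: 25, 50, 40
print(ref)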
If your rows are correctly ordered ('A' before 'B'), you can use pct_change. If needed, sort by item first.
xdf['%value'] = xdf.groupby('Season')['value'].pct_change().fillna(0) * 100 + 100
print(xdf)
# Output
Season item value %value
0 1 A 25 100.0
1 1 B 30 120.0
2 2 A 50 100.0
3 2 B 75 150.0
4 3 A 40 100.0
5 3 B 60 150.0
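As a quick check of the formula: for Season 1, pct_change on the B row gives (30 - 25) / 25 = 0.20, and 0.20 * 100 + 100 = 120; for each A row pct_change is NaN, which fillna(0) turns into 0, giving 0 * 100 + 100 = 100.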
Hopefully this solves your problem.
df = xdf.copy()
for i in xdf['Season'].unique():
    _df = xdf[xdf['Season'] == i]
    idx = _df.iloc[1:].index
    val = _df.iloc[0].value
    for j in idx:
        cv = xdf.iloc[j].value
        xdf.at[j, 'value'] = (100 / val) * cv
xdf.loc[xdf.groupby('Season')['value'].head(1).index, 'value'] = 100
df['%value'] = xdf['value']
df
My end goal is to sum the minutes in column min, but only from initial to final in the periods column. This needs to be grouped by id.
I have thousands of ids and not all of them have the same amount of min between initial and final.
Periods are sorted in a "journey" fashion; each record represents a period of time for its id.
Pseudocode:
Iterate rows and sum all values in column "min"
where the sum starts at periods == 'initial' and ends at periods == 'final'
Example with 2 ids
id  periods     min
 1  period_x     10
 1  initial       2
 1  progress      3
 1  progress_1    4
 1  final         5
 2  period_y     10
 2  period_z      2
 2  initial       3
 2  progress_1   20
 2  final         3
Desired output
id  periods     min  sum
 1  period_x     10   14
 1  initial       2   14
 1  progress      3   14
 1  progress_1    4   14
 1  final         5   14
 2  period_y     10   26
 2  period_z      2   26
 2  initial       3   26
 2  progress_1   20   26
 2  final         3   26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df.id.where( df.zone_name.isin(L)).groupby(df['if']).transform('sum')
But this doesn't count what is in between initial and final
Create groups using cumsum and then return the sum of group 1, then apply that sum to the entire column. "Group 1" is anything per id that is between initial and final:
import numpy as np
df['grp'] = df['periods'].isin(['initial','final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform('max')
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
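The np.where(df['periods'] == 'final', 1, ...) step is what keeps the 'final' row in the group: the cumulative sum of the initial/final flags reaches 2 on the 'final' row itself, so without forcing it back to 1 its minutes would be left out of the sum.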
I have the below sorted dataframe and I want to set the last value of each id in the id column to 0
id value
1 500
1 50
1 36
2 45
2 150
2 70
2 20
2 10
I am able to set the last value of the whole column to 0 using df['value'].iloc[-1] = 0. How can I set the last value for both id 1 and id 2 to get the output below?
id value
1 500
1 50
1 0
2 45
2 150
2 70
2 20
2 0
You can use drop_duplicates with keep='last' to get the last row of each id, then use the index of those rows to set the value to 0:
df.loc[df['id'].drop_duplicates(keep='last').index, 'value'] = 0
print(df)
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
df.loc[~df.id.duplicated('last'),'value']=0
Broken down
m=df.id.duplicated('last')
df.loc[~m,'value']=0
id value
0 1 500
1 1 50
2 1 0
3 2 45
4 2 150
5 2 70
6 2 20
7 2 0
How it works
m = df.id.duplicated('last')  # marks every row of an id as True except its last occurrence
~m reverses that, so only the last occurrence of each id becomes True
df.loc[~m, 'value'] = 0  # the loc accessor reaches the rows where ~m is True in the value column and writes 0 into them
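For the example id column, the mask looks like this (a sketch reusing the frame from the question):
m = df.id.duplicated('last')
print(m.tolist())
# [True, True, False, True, True, True, True, False]
# ~m is True only on the last row of each id,
# which is exactly where 'value' gets overwritten with 0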
If you are willing to use numpy, here is a fast solution:
import numpy as np
import pandas as pd

# Recreate the example
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2, 2],
    "value": [500, 50, 36, 45, 150, 70, 20, 10]
})
# Solution
df["value"] = np.where(~df.id.duplicated(keep="last"), 0, df["value"].to_numpy())
I have a pretty similar question to another question on here.
Let's assume I have two dataframes:
df
volumne
11
24
30
df2
range_low range_high price
10 20 1
21 30 2
How can I filter the second dataframe based on a single value from the first dataframe, keeping only the rows whose range contains it?
So for example (value 11 from df) leads to:
df3
range_low range_high price
10 20 1
whereas (value 30 from df) leads to an empty frame:
df3
I am looking for a way to check if a specific value falls in a range of another dataframe, and to filter that dataframe based on this condition. In non-Python pseudocode:
Find 11 in
(10, 20), if True: df3 = filter on this row
(21, 30), if True: df3= filter on this row
if not
return empty frame
For a loop solution, use:
for v in df['volumne']:
    df3 = df2[(df2['range_low'] < v) & (df2['range_high'] > v)]
    print(df3)
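Note that df3 is overwritten on every pass through the loop, so after it finishes only the result for the last value remains; collect the filtered frames in a list (or concatenate them) if you need all of them.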
For a non-loop solution it is possible to use a cross join, but with large DataFrames there may be memory problems:
df = df.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
print (df)
volumne a range_low range_high price
0 11 1 10 20 1
1 11 1 21 30 2
2 24 1 10 20 1
3 24 1 21 30 2
4 30 1 10 20 1
5 30 1 21 30 2
df3 = df[(df['range_low'] < df['volumne']) & (df['range_high'] > df['volumne'])]
print (df3)
volumne a range_low range_high price
0 11 1 10 20 1
3 24 1 21 30 2
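As a side note, in pandas 1.2+ the helper column can be avoided with df.merge(df2, how='cross') (assuming that version is available); the memory caveat for large frames is the same.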
I have a similar problem (but with date ranges), and if df2 is too large, it takes forever.
If the volumes are always integers, a faster solution is to create an intermediate dataframe where you associate each possible volume with a price (in one iteration) and then merge.
price_list = []
for index, row in df2.iterrows():
    x = pd.DataFrame(range(row['range_low'], row['range_high'] + 1), columns=['volume'])
    x['price'] = row['price']
    price_list.append(x)
df_prices = pd.concat(price_list)
You will get something like this:
volume price
0 10 1
1 11 1
2 12 1
3 13 1
4 14 1
5 15 1
6 16 1
7 17 1
8 18 1
9 19 1
10 20 1
0 21 2
1 22 2
2 23 2
3 24 2
4 25 2
5 26 2
6 27 2
7 28 2
8 29 2
9 30 2
Then you can quickly associate a price with each volume in df:
df.merge(df_prices,on='volume')
volume price
0 11 1
1 24 2
2 30 2
Currently I'm working with weekly data for different subjects, but it might have some long stretches without data, so what I want to do is keep only the longest streak of consecutive weeks for every id. My data looks like this:
id week
1 8
1 15
1 60
1 61
1 62
2 10
2 11
2 12
2 13
2 25
2 26
My expected output would be:
id week
1 60
1 61
1 62
2 10
2 11
2 12
2 13
I got a bit close by marking with a 1 when week == week.shift() + 1. The problem is that this approach doesn't mark the first occurrence in a streak, and I also can't filter the longest one:
df.loc[ (df['id'] == df['id'].shift())&(df['week'] == df['week'].shift()+1),'streak']=1
This, according to my example, would bring this:
id week streak
1 8 nan
1 15 nan
1 60 nan
1 61 1
1 62 1
2 10 nan
2 11 1
2 12 1
2 13 1
2 25 nan
2 26 1
Any ideas on how to achieve what I want?
Try this:
df['consec'] = df.groupby(['id',df['week'].diff(-1).ne(-1).shift().bfill().cumsum()]).transform('count')
df[df.groupby('id')['consec'].transform('max') == df.consec]
Output:
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4
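For readers puzzled by the grouper expression, here is a step-by-step sketch of what it builds (assuming the example df from the question; the labels do not reset per id, which is fine because 'id' is also part of the groupby):
streak_id = (
    df['week'].diff(-1)  # week minus the next week: -1 while a streak continues
      .ne(-1)            # True on the last row of each streak
      .shift()           # push the marker down so it flags the first row of the next streak
      .bfill()           # shift() leaves a leading NaN, backfill it
      .cumsum()          # running count of streak starts -> one label per streak
)
# expected labels for the example data: 1, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5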
Not as concise as #ScottBoston but I like this approach
import numpy as np
import pandas as pd

def max_streak(s):
    a = s.values  # Let's deal with an array
    # I need to know where the differences are not `1`.
    # Also, because I plan to use `diff` again, I'll wrap
    # the boolean array with `True` to make things cleaner
    b = np.concatenate([[True], np.diff(a) != 1, [True]])
    # Tell the locations of the breaks in streak
    c = np.flatnonzero(b)
    # `diff` again tells me the length of the streaks
    d = np.diff(c)
    # `argmax` will tell me the location of the largest streak
    e = d.argmax()
    return c[e], d[e]

def make_thing(df):
    start, length = max_streak(df.week)
    return df.iloc[start:start + length].assign(consec=length)

pd.concat([
    make_thing(g) for _, g in df.groupby('id')
])
id week consec
2 1 60 3
3 1 61 3
4 1 62 3
5 2 10 4
6 2 11 4
7 2 12 4
8 2 13 4