I want to select the previous row's value only if it meets a certain condition
E.g.
df:
Value Marker
10 0
12 0
50 1
42 1
52 0
23 1
I want to select the previous row's value where Marker == 0 if the current row's Marker == 1.
Result:
df:
Value Marker Prev_Value
10 0 nan
12 0 nan
50 1 12
42 1 12
52 0 nan
23 1 52
I tried:
df['Prev_Value'] = np.where(df['Marker'] == 1, df['Value'].shift(), np.nan)
but that does not pick the conditional previous value like I want.
condition = (df.Marker.shift() == 0) & (df.Marker == 1)
df['Prev_Value'] = np.where(condition, df.Value.shift(), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
You could try this:
df['Prev_Value'] = np.where(df['Marker'].diff() == 1, df['Value'].shift(1), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
If you want to get the value from the most recent row where Marker == 0 whenever Marker == 1 (even across consecutive 1s), you could try this:
prevro = []
for i in reversed(df.index):
    if df.iloc[i, 1] == 1:
        # values of the rows above i where Marker == 0
        prevro_zero = df.iloc[0:i, 0][df.iloc[0:i, 1].eq(0)].tolist()
        if len(prevro_zero) > 0:
            prevro.append(prevro_zero[-1])  # most recent such value
        else:
            prevro.append(np.nan)
    else:
        prevro.append(np.nan)
df['Prev_Value'] = list(reversed(prevro))
print(df)
Output:
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 12.0
4 52 0 NaN
5 23 1 52.0
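As an alternative to the loop, the same result can be obtained vectorized. A minimal sketch, assuming the example df above: forward-fill the last Value seen on Marker == 0 rows, then expose it only on Marker == 1 rows.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [10, 12, 50, 42, 52, 23],
                   'Marker': [0, 0, 1, 1, 0, 1]})

# carry forward the most recent Value from rows where Marker == 0
last_zero = df['Value'].where(df['Marker'].eq(0)).ffill()

# show it only where Marker == 1, NaN elsewhere
df['Prev_Value'] = np.where(df['Marker'].eq(1), last_zero, np.nan)
print(df)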
Given a simple dataframe:
df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y'],
                   'Flag': [0,1,0,0,1,0,1,0,0],
                   'time': [10,34,40,43,44,12,20,46,51]})
I want to calculate the timedelta from the last flag == 1 for each user.
I did the diffs:
df.sort_values(['user', 'time']).groupby('user')['time'].diff().fillna(pd.Timedelta(10000000)).dt.total_seconds()/60
But it doesn't solve my issue: I need the time delta from the last 1, and if there wasn't any, fill with some number N.
Please advise
For example:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 NaN
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 NaN
7 y 0 46 26.0
8 y 0 51 31.0
I am not sure I understood correctly, but if you want to compute the time delta only between 1's per group of user, you can apply your computation on the dataframe sliced to 1's only, using groupby:
df['delta'] = (df[df['Flag'].eq(1)]        # select 1's only
               .groupby('user')            # group by user
               ['time'].diff()             # compute the diff
               .dt.total_seconds() / 60    # convert to minutes
               )
output:
user Flag time delta
0 x 0 0 days 10:30:00 NaN
1 x 1 0 days 11:34:00 NaN
2 x 0 0 days 11:43:00 NaN
3 y 0 0 days 13:43:00 NaN
4 y 1 0 days 14:40:00 NaN
5 y 0 0 days 15:32:00 NaN
6 y 1 0 days 18:30:00 230.0
7 w 0 0 days 19:30:00 NaN
8 w 0 0 days 20:11:00 NaN
Edit: here is a working solution for the updated question.
IIUC the update, you want the difference to the last 1 per user, and, if the flag is 1, the difference to the last valid value per user, if any.
In summary, it creates subgroups for the ranges starting with 1s, then uses these groups to calculate the diffs. Finally, it masks the 1s with the diff to their previous value (if existing).
(df.assign(mask=df['Flag'].eq(1),
           group=lambda d: d.groupby('user')['mask'].cumsum(),
           # diff from last 1
           diff=lambda d: (d.groupby(['user', 'group'])['time']
                            .apply(lambda g: g - (g.iloc[0] if g.name[1] > 0
                                                  else float('nan')))),
           )
 # mask 1s with their own diff
 .assign(diff=lambda d: d['diff'].mask(d['mask'],
                                       (d[d['mask'].groupby(d['user']).cumsum().eq(0) | d['mask']]
                                        .groupby('user')['time'].diff())
                                       )
         )
 .drop(['mask', 'group'], axis=1)  # cleanup temp columns
)
Output:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 24.0
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 8.0
7 y 0 46 26.0
8 y 0 51 31.0
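A flatter sketch of the same logic, assuming the example df above: forward-fill the time of the last 1 per user for the Flag == 0 rows, and diff over the kept rows (the 1s plus the rows before the first 1) for the Flag == 1 rows.
import pandas as pd

df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y'],
                   'Flag': [0,1,0,0,1,0,1,0,0],
                   'time': [10,34,40,43,44,12,20,46,51]})

# time of the most recent Flag == 1 row per user, carried forward
last_one = df['time'].where(df['Flag'].eq(1)).groupby(df['user']).ffill()

# Flag == 0 rows: distance to that last 1 (NaN before the first 1)
df['diff'] = df['time'] - last_one

# Flag == 1 rows: distance to the previous kept row (the previous 1,
# or a row before the first 1), mirroring the chained solution above
keep = df['Flag'].eq(1) | df.groupby('user')['Flag'].cumsum().eq(0)
df['diff'] = df['diff'].mask(df['Flag'].eq(1),
                             df.loc[keep].groupby('user')['time'].diff())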
Following is my test dataframe:
df.head(25)
freq pow group avg
0 0.000000 10.716615 0 NaN
1 0.022888 -9.757687 0 NaN
2 0.045776 -9.203844 0 NaN
3 0.068665 -8.746512 0 NaN
4 0.091553 -8.725540 0 NaN
...
95 2.174377 -12.697743 0 NaN
96 2.197266 -7.398328 0 NaN
97 2.220154 -23.002036 0 NaN
98 2.243042 -22.591483 0 NaN
99 2.265930 -13.686127 0 NaN
I am trying to assign values from 1-24 in the group column based on ranges of values in the freq column.
For example, using df.loc[(df['freq'] >= 0.1) & (df['freq'] <= 0.2)] yields the following:
freq pow group avg
5 0.114441 -8.620905 0 NaN
6 0.137329 -10.633629 0 NaN
7 0.160217 -9.098974 0 NaN
8 0.183105 -9.381907 0 NaN
So, if I select any particular range as shown above, I would want to change the values in the group column from 0 to 1 as shown below.
freq pow group avg
5 0.114441 -8.620905 1 NaN
6 0.137329 -10.633629 1 NaN
7 0.160217 -9.098974 1 NaN
8 0.183105 -9.381907 1 NaN
Similarly, I want to change more values in the group column from 0 to anything from 1-24 depending on the range I provide.
For example, df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2)] results in
freq pow group avg
44 1.007080 -13.826930 0 NaN
45 1.029968 -12.892703 0 NaN
46 1.052856 -14.353349 0 NaN
47 1.075745 -15.389498 0 NaN
48 1.098633 -12.519916 0 NaN
49 1.121521 -15.118952 0 NaN
50 1.144409 -15.986558 0 NaN
51 1.167297 -13.262798 0 NaN
52 1.190186 -12.629713 0 NaN
But I want to change the value of the group column for this range from 0 to 2:
freq pow group avg
44 1.007080 -13.826930 2 NaN
45 1.029968 -12.892703 2 NaN
46 1.052856 -14.353349 2 NaN
47 1.075745 -15.389498 2 NaN
48 1.098633 -12.519916 2 NaN
49 1.121521 -15.118952 2 NaN
50 1.144409 -15.986558 2 NaN
51 1.167297 -13.262798 2 NaN
52 1.190186 -12.629713 2 NaN
and once I have allocated custom group values across the entire table, I would need to calculate the mean of the pow values for each range and fill the avg column accordingly.
I cannot seem to figure out how to modify multiple values in the group column to a single value for a particular range of the freq column. Any help would be greatly appreciated. Thank you in advance.
You can use:
df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'group'] = 2
and in the same way for the mean:
df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'avg'] = df.loc[(df['freq'] >= 1) & (df['freq'] <= 1.2), 'pow'].mean()
You can loop these commands, changing the range of df['freq'] and the group integer (1-24 as mentioned in the question) on each iteration, as sketched below.
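For example, a minimal sketch of that loop; the (low, high) boundaries below are hypothetical placeholders for the 24 ranges you actually need:
# hypothetical boundaries; extend this list up to your 24 ranges
ranges = [(0.1, 0.2), (1.0, 1.2)]

for group_id, (low, high) in enumerate(ranges, start=1):
    mask = (df['freq'] >= low) & (df['freq'] <= high)
    df.loc[mask, 'group'] = group_id
    df.loc[mask, 'avg'] = df.loc[mask, 'pow'].mean()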
df = pd.read_csv('test.txt',dtype=str)
print(df)
HE WE
0 aa NaN
1 181 76
2 22 13
3 NaN NaN
I want to overwrite values in this dataframe using another dataframe with the following indexes:
dff = pd.DataFrame({'HE' : [100,30]},index=[1,2])
print(dff)
HE
1 100
2 30
for i in dff.index:
    df._set_value(i, 'HE', dff._get_value(i, 'HE'))
print(df)
HE WE
0 aa NaN
1 100 76
2 30 13
3 NaN NaN
Is there a way to change it all at once without using 'for'?
Use DataFrame.update (it works in place):
df.update(dff)
print (df)
HE WE
0 aa NaN
1 100 76.0
2 30 13.0
3 NaN NaN
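Note that update aligns on index and column labels and only overwrites with non-NaN values from the other frame. A self-contained sketch of the same call, building the frame directly instead of reading test.txt:
import numpy as np
import pandas as pd

df = pd.DataFrame({'HE': ['aa', '181', '22', np.nan],
                   'WE': [np.nan, '76', '13', np.nan]})
dff = pd.DataFrame({'HE': [100, 30]}, index=[1, 2])

df.update(dff)  # modifies df in place, returns None
print(df)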
daychange SS
0.017065 0
-0.009259 100
0.031542 0
-0.004530 0
0.000709 0
0.004970 100
-0.021900 0
0.003611 0
I have two columns and I want to calculate the sum of the next 5 'daychange' values if SS == 100.
I am using the following right now but it does not work quite the way I want it to:
df['total'] = df.loc[df['SS'] == 100,['daychange']].sum(axis=1)
Since pandas 1.1 you can create a forward rolling window and select the rows you want to include in your dataframe. With different arguments my notebook kernel got terminated: use with caution.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=5)
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum()[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
Exclude the starting row with SS == 100 from the sum
That sum would then start at the next row after rows with SS == 100. Since the rolling sums are computed for all rows anyway, you can shift the result:
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum().shift(-1)[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
Slow hacky solution using indices of selected rows
This feels like a hack, but works and avoids computing unnecessary rolling values
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
For the sum of the next five rows excluding the rows with SS == 100 you can adjust the slices or shift the series
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x + 1: x + 6].sum())
# df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.shift(-1).iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
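If your pandas version predates FixedForwardWindowIndexer, a common alternative is to reverse the series, apply an ordinary backward-looking window, and reverse the result back; a sketch on the same data:
# forward-looking 5-row sum without FixedForwardWindowIndexer
fwd = df.daychange[::-1].rolling(5, min_periods=1).sum()[::-1]
df['total'] = fwd[df.SS == 100]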
Just curious on the behavior of 'where' and why you would use it over 'loc'.
If I create a dataframe:
df = pd.DataFrame({'ID': [1,2,3,4,5,6,7,8,9,10],
                   'Run Distance': [234,35,77,787,243,5435,775,123,355,123],
                   'Goals': [12,23,56,7,8,0,4,2,1,34],
                   'Gender': ['m','m','m','f','f','m','f','m','f','m']})
And then apply the 'where' function:
df2 = df.where(df['Goals']>10)
I get the following, which keeps the rows where Goals > 10 but leaves everything else as NaN:
Gender Goals ID Run Distance
0 m 12.0 1.0 234.0
1 m 23.0 2.0 35.0
2 m 56.0 3.0 77.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 m 34.0 10.0 123.0
If however I use the 'loc' function:
df2 = df.loc[df['Goals']>10]
It returns the dataframe subsetted without the NaN values:
Gender Goals ID Run Distance
0 m 12 1 234
1 m 23 2 35
2 m 56 3 77
9 m 34 10 123
So essentially I am curious why you would use 'where' over 'loc/iloc' and why it returns NaN values?
Think of loc as a filter - give me only the parts of the df that conform to a condition.
where originally comes from numpy. It runs over an array and checks whether each element fits a condition. So it gives you back the entire array, with a result or NaN. A nice feature of where is that you can also get back something different, e.g. df2 = df.where(df['Goals']>10, other=0), to replace values that don't meet the condition with 0:
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 0 0 0 0
4 0 0 0 0
5 0 0 0 0
6 0 0 0 0
7 0 0 0 0
8 0 0 0 0
9 10 123 34 m
Also, while where is only for conditional filtering, loc is the standard way of selecting in Pandas, along with iloc. loc uses row and column labels, while iloc uses their integer positions (a positional counterpart is sketched after the output below). So with loc you could choose to return, say, df.loc[0:1, ['Gender', 'Goals']]:
Gender Goals
0 m 12
1 m 23
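The iloc counterpart is positional; a short sketch assuming the same df, where Gender and Goals sit at positions 3 and 2:
df2 = df.iloc[0:2, [3, 2]]  # the 0:2 position slice excludes row 2, unlike loc's inclusive 0:1 label slice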
If you check the docs for DataFrame.where, it replaces the values where the condition is False - by default with NaN, but it is possible to specify a value:
df2 = df.where(df['Goals']>10)
print (df2)
ID Run Distance Goals Gender
0 1.0 234.0 12.0 m
1 2.0 35.0 23.0 m
2 3.0 77.0 56.0 m
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN
8 NaN NaN NaN NaN
9 10.0 123.0 34.0 m
df2 = df.where(df['Goals']>10, 100)
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
3 100 100 100 100
4 100 100 100 100
5 100 100 100 100
6 100 100 100 100
7 100 100 100 100
8 100 100 100 100
9 10 123 34 m
Another syntax is called boolean indexing and is for filtering rows - it keeps only the rows that match the condition.
df2 = df.loc[df['Goals']>10]
#alternative
df2 = df[df['Goals']>10]
print (df2)
ID Run Distance Goals Gender
0 1 234 12 m
1 2 35 23 m
2 3 77 56 m
9 10 123 34 m
With loc it is also possible to filter rows by condition and columns by name(s):
s = df.loc[df['Goals']>10, 'ID']
print (s)
0 1
1 2
2 3
9 10
Name: ID, dtype: int64
df2 = df.loc[df['Goals']>10, ['ID','Gender']]
print (df2)
ID Gender
0 1 m
1 2 m
2 3 m
9 10 m
loc retrieves only the rows that match the condition.
where returns the whole dataframe, replacing the rows that don't match the condition (NaN by default).