daychange SS
0.017065 0
-0.009259 100
0.031542 0
-0.004530 0
0.000709 0
0.004970 100
-0.021900 0
0.003611 0
I have two columns and I want to calculate the sum of the next 5 'daychange' values where SS = 100.
I am using the following right now, but it does not work quite the way I want it to:
df['total'] = df.loc[df['SS'] == 100,['daychange']].sum(axis=1)
Since pandas 1.1 you can create a forward-looking rolling window and select the rows you want to include in your dataframe. Note that with different arguments my notebook kernel got terminated, so use this with caution.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=5)
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum()[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
Exclude the starting row with SS == 100 from the sum
That means summing the five rows that follow each row with SS == 100. Since the rolling sums are computed for all rows anyway, you can shift the result:
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum().shift(-1)[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
Slow hacky solution using indices of selected rows
This feels like a hack, but works and avoids computing unnecessary rolling values
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
For the sum of the next five rows, excluding the starting row with SS == 100 itself, you can adjust the slices or shift the series:
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x + 1: x + 6].sum())
# df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.shift(-1).iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
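An equivalent formulation without the forward indexer (a sketch; total2 is a hypothetical column name): a forward-looking window is just a backward-looking window computed on the reversed series.
# Forward rolling sum == backward rolling sum on the reversed series,
# then reverse back; assignment aligns by index.
fwd = df['daychange'][::-1].rolling(5, min_periods=1).sum()[::-1]
df['total2'] = fwd[df['SS'] == 100]  # hypothetical column name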
EDIT: Upon request I provide an example that is closer to the real data I am working with.
So I have a table data that looks something like
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.265421 -0.623274 0.041326
4 -2.325031 -0.218792 -1.245911
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.042513 -0.128535
1 1.366463 -0.665195 0.35151
2 0.90347 0.094012 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.009618 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
(think: collection of time series) and a second table valid_range
start stop
run
0 1 3
1 2 5
For each run I want to drop all rows that do not satisfy start ≤ step ≤ stop.
I tried the following (table-generating code at the end):
for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]
However, this results in:
value0 value1 value2
run step
0 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
1 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...
Generating code for table
import numpy as np
from pandas import DataFrame, MultiIndex, Index

rng = np.random.default_rng(0)
valid_range = DataFrame({"start": [1, 2], "stop": [3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[], []], codes=[[], []], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=3)
First, merge data and valid_range on 'run' using the merge method:
>>> data
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.26542 -0.623274 0.041326
4 -2.32503 -0.218792 -1.24591
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.04251 -0.128535
1 1.36646 -0.665195 0.35151
2 0.90347 0.0940123 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.00962 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
>>> valid_range
start stop
run
0 1 3
1 2 5
>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged
run step value0 value1 value2 start stop
0 0 0 0.12573 -0.132105 0.640423 1 3
1 0 1 0.1049 -0.535669 0.361595 1 3
2 0 2 1.304 0.947081 -0.703735 1 3
3 0 3 -1.26542 -0.623274 0.041326 1 3
4 0 4 -2.32503 -0.218792 -1.24591 1 3
5 0 5 -0.732267 -0.544259 -0.3163 1 3
6 1 0 0.411631 1.04251 -0.128535 2 5
7 1 1 1.36646 -0.665195 0.35151 2 5
8 1 2 0.90347 0.0940123 -0.743499 2 5
9 1 3 -0.921725 -0.457726 0.220195 2 5
10 1 4 -1.00962 -0.209176 -0.159225 2 5
11 1 5 0.540846 0.214659 0.355373 2 5
Then build the condition with eval and use the resulting boolean array to mask data. Note the question asks for start ≤ step ≤ stop, so the comparison must be inclusive:
>>> cond = merged.eval('start <= step <= stop').to_numpy()
>>> data[cond]
             value0     value1    value2
run step
0   1        0.1049  -0.535669  0.361595
    2         1.304   0.947081 -0.703735
    3      -1.26542  -0.623274  0.041326
1   2       0.90347  0.0940123 -0.743499
    3     -0.921725  -0.457726  0.220195
    4      -1.00962  -0.209176 -0.159225
    5      0.540846   0.214659  0.355373
Or, if you prefer, here is a similar approach using query:
res = (
    data.reset_index()
        .merge(valid_range, on='run', how='left')
        .query('start <= step <= stop')
        .drop(columns=['start', 'stop'])
        .set_index(['run', 'step'])
)
I would use groupby like this:
(df.groupby(level=0)
   .apply(lambda x: x[x['small'] > 1])
   .reset_index(level=0, drop=True)  # remove duplicate index
)
which gives:
big small
animal animal attribute
cow cow speed 30.0 20.0
weight 250.0 150.0
falcon falcon speed 320.0 250.0
lama lama speed 45.0 30.0
weight 200.0 100.0
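The same groupby-and-filter pattern can be adapted to the run/step data from the question. A sketch, assuming the data and valid_range frames defined earlier:
# Filter each run's rows by that run's own [start, stop] range;
# g.name is the run key inside apply.
res = (
    data.groupby(level='run', group_keys=False)
        .apply(lambda g: g[(valid_range.loc[g.name, 'start'] <= g.index.get_level_values('step'))
                           & (g.index.get_level_values('step') <= valid_range.loc[g.name, 'stop'])])
)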
Given a simple dataframe:
df = pd.DataFrame({'user': ['x','x','x','x','x','y','y','y','y'],
                   'Flag': [0,1,0,0,1,0,1,0,0],
                   'time': [10,34,40,43,44,12,20,46,51]})
I want to calculate the timedelta from the last flag == 1 for each user.
I did the diffs:
df.sort_values(['user', 'time']).groupby('user')['time'].diff().fillna(pd.Timedelta(10000000)).dt.total_seconds()/60
But it doesn't solve my issue: I need the time delta between the 1's, and if there wasn't a previous 1, fill with some number N.
Please advise.
For example:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 NaN
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 NaN
7 y 0 46 26.0
8 y 0 51 31.0
I am not sure I understood correctly, but if you want to compute the time delta only between 1's per user, you can apply your computation to the dataframe sliced to 1's only, combined with groupby (note that this assumes datetime-like time values, as in the output shown below):
df['delta'] = (df[df['Flag'].eq(1)]       # select 1's only
               .groupby('user')           # group by user
               ['time'].diff()            # compute the diff
               .dt.total_seconds()/60     # convert to minutes
               )
output:
user Flag time delta
0 x 0 0 days 10:30:00 NaN
1 x 1 0 days 11:34:00 NaN
2 x 0 0 days 11:43:00 NaN
3 y 0 0 days 13:43:00 NaN
4 y 1 0 days 14:40:00 NaN
5 y 0 0 days 15:32:00 NaN
6 y 1 0 days 18:30:00 230.0
7 w 0 0 days 19:30:00 NaN
8 w 0 0 days 20:11:00 NaN
Edit: here is a working solution for the updated question.
IIUC the update, you want the difference to the last 1 per user and, for rows where the flag is 1, the difference to the last valid value per user, if any.
In summary, it creates a subgroup for each range starting with a 1, then uses these groups to calculate the diffs. Finally, it masks the 1s with the diff to their previous value (if one exists):
(df.assign(mask=df['Flag'].eq(1),
           group=lambda d: d.groupby('user')['mask'].cumsum(),
           # diff from last 1
           diff=lambda d: d.groupby(['user', 'group'])['time']
                           .apply(lambda g: g - (g.iloc[0] if g.name[1] > 0 else float('nan'))),
           )
   # mask 1s with the diff to their previous valid value
   .assign(diff=lambda d: d['diff'].mask(d['mask'],
                                         (d[d['mask'].groupby(d['user']).cumsum().eq(0) | d['mask']]
                                          .groupby('user')['time'].diff())
                                         )
           )
   .drop(['mask', 'group'], axis=1)  # clean up temp columns
)
Output:
user Flag time diff
0 x 0 10 NaN
1 x 1 34 24.0
2 x 0 40 6.0
3 x 0 43 9.0
4 x 1 44 10.0
5 y 0 12 NaN
6 y 1 20 8.0
7 y 0 46 26.0
8 y 0 51 31.0
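A more compact alternative that reproduces the expected diff column from the original question (a sketch; it assumes the integer time column of the example, and diff2 is a hypothetical column name):
# Keep times only at Flag == 1 rows, shift them down one row per user so a
# row never sees its own time, forward-fill, and subtract.
last1 = (df['time'].where(df['Flag'].eq(1))
           .groupby(df['user']).shift()
           .groupby(df['user']).ffill())
df['diff2'] = df['time'] - last1  # NaN before the first 1 of each user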
I want to select the previous row's value only if it meets a certain condition
E.g.
df:
Value Marker
10 0
12 0
50 1
42 1
52 0
23 1
I want to select the previous row's value where marker == 0 if the current row's marker == 1.
Result:
df:
Value Marker Prev_Value
10 0 nan
12 0 nan
50 1 12
42 1 12
52 0 nan
23 1 52
I tried:
df['Prev_Value'] = np.where(df['Marker'] == 1, df['Value'].shift(), np.nan)
but that does not pick the conditional previous value like I want.
condition = (df.Marker.shift() == 0) & (df.Marker == 1)
df['Prev_Value'] = np.where(condition, df.Value.shift(), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
You could try this:
df['Prev_Value'] = np.where(df['Marker'].diff() == 1, df['Value'].shift(1), np.nan)
Output:
df
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 NaN
4 52 0 NaN
5 23 1 52.0
If you want to get the value of the previous Marker == 0 row whenever marker == 1, you could try this:
prevro = []
for i in reversed(df.index):
    if df.iloc[i, 1] == 1:
        prevro_zero = df.iloc[0:i, 0][df.iloc[0:i, 1].eq(0)].tolist()
        if len(prevro_zero) > 0:
            prevro.append(prevro_zero[-1])
        else:
            prevro.append(np.nan)
    else:
        prevro.append(np.nan)
df['Prev_Value'] = list(reversed(prevro))
print(df)
Output:
Value Marker Prev_Value
0 10 0 NaN
1 12 0 NaN
2 50 1 12.0
3 42 1 12.0
4 52 0 NaN
5 23 1 52.0
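A vectorized alternative to the loop (a sketch): carry the value of the last Marker == 0 row forward with ffill, then expose it only on Marker == 1 rows.
# Values are kept only on Marker == 0 rows; ffill propagates the most recent
# one; the final where() blanks everything except the Marker == 1 rows.
prev0 = df['Value'].where(df['Marker'].eq(0)).ffill()
df['Prev_Value'] = prev0.where(df['Marker'].eq(1))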
I have a dataframe and I would like to create a column with event labels. If a condition is true, the event gets a number, but successive event values should share the same label. Do you have any idea? I tried to use .apply and .rolling, but without any success.
DataFrame:
df = pd.DataFrame({'Signal_1' : [0,0,0,1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,1]})
Signal_1 ExpectedColumn
0 0 NaN
1 0 NaN
2 0 NaN
3 1 1
4 1 1
5 0 NaN
6 0 NaN
7 1 2
8 1 2
9 1 2
10 1 2
11 0 NaN
12 0 NaN
13 0 NaN
14 1 3
15 1 3
16 1 3
17 1 3
18 1 3
Here is a way to do it: first create a count-up flag, then take its cumulative sum, and finally set the rows where the signal is 0 to NaN.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Signal_1' : [0,0,0,1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,1]})
# Only count up when the previous sample = 0, and the current sample = 1
df["shift"] = df["Signal_1"].shift(1)
df["countup"] = np.where((df["Signal_1"] == 1) & (df["shift"] == 0),1,0)
# Cumsum the countup flag and set to NaN when sample = 0
df["result"] = df["countup"].cumsum()
df["result"] = np.where(df["Signal_1"] == 0, np.NaN, df["result"] )
I have -many- csv files with the same number of columns (different number of rows) in the following pattern:
Files 1:
A1,B1,C1
A2,B2,C2
A3,B3,C3
A4,B4,C4
File 2:
*A1*,*B1*,*C1*
*A2*,*B2*,*C2*
*A3*,*B3*,*C3*
File ...
Output:
A1+*A1*+...,B1+*B1*+...,C1+*C1*+...
A2+*A2*+...,B2+*B2*+...,C2+*C2*+...
A3+*A3*+...,B3+*B3*+...,C3+*C3*+...
A4+... ,B4+... ,C4+...
For example:
Files 1:
1,0,0
1,0,1
1,0,0
0,1,0
Files 2:
1,1,0
1,1,1
0,1,0
Output:
2,1,0
2,1,2
1,1,0
0,1,0
I am trying to use Python pandas and was thinking of something like this to read the files into variables:
dic = {}
for i in range(14253, 14352):
    try:
        dic['df_{0}'.format(i)] = pandas.read_csv('output_' + str(i) + '.csv')
    except:
        pass
and then to sum the columns:
for residue in residues:
    for number in range(14254, 14255):
        df = dic['df_14253'][residue]
        df += dic['df_' + str(number)][residue]
residues is a list of strings which are the column names.
The problem is that my files have different numbers of rows and are only summed up to the last row of df1. How could I add them up to the last row of the longest file, so that no data is lost? I think pandas' groupby.sum could be an option, but I don't understand how to use it.
To add an example - now I get this:
Files 1:
1,0,0
1,0,1
1,0,0
0,1,0
Files 2:
1,1,0
1,1,1
0,1,0
File 3:
1,0,0
0,0,1
1,0,0
1,0,0
1,0,0
1,0,1
File ...:
Output:
3,1,0
2,1,3
2,1,0
1,1,0
1,0,0
1,0,1
You can use Panel in pandas, a 3D object that is a collection of DataFrames:
dfs = {i: pd.DataFrame.from_csv('file' + str(i) + '.csv', sep=',',
                                header=None, index_col=None) for i in range(n)}  # n files
panel=pd.Panel(dfs)
dfs_sum=panel.sum(axis=0)
dfs is a dictionary of DataFrames. Panel automatically pads missing values with NaN and computes the correct sum. For example:
In [500]: panel[1]
Out[500]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [501]: panel[2]
Out[501]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 NaN NaN NaN
9 NaN NaN NaN
10 NaN NaN NaN
11 NaN NaN NaN
In [502]: panel[3]
Out[502]:
0 1 2
0 1 0 0
1 1 0 1
2 1 0 0
3 0 1 0
4 1 0 0
5 1 0 1
6 1 0 0
7 0 1 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
In [503]: panel.sum(0)
Out[503]:
0 1 2
0 3 0 0
1 3 0 3
2 3 0 0
3 0 3 0
4 2 0 0
5 2 0 2
6 2 0 0
7 0 2 0
8 1 0 0
9 1 0 1
10 1 0 0
11 0 1 0
Looking for this exact same thing, I found out that Panel is now deprecated, so I am posting the news here:
class pandas.Panel(data=None, items=None, major_axis=None, minor_axis=None, copy=False, dtype=None)
Deprecated since version 0.20.0: The recommended way to represent 3-D data are with a MultiIndex on a DataFrame via the to_frame() method or with the xarray package. Pandas provides a to_xarray() method to automate this conversion.
to_frame(filter_observations=True)
Transform wide format into long (stacked) format as a DataFrame whose columns are the Panel's items and whose index is a MultiIndex formed of the Panel's major and minor axes.
I would recommend using
pandas.DataFrame.sum
DataFrame.sum(axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
Parameters:
axis : {index (0), columns (1)}
Axis for the function to be applied on.
One can use it the same way as in B.M.'s answer.
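On current pandas versions the same aligned sum can be written without Panel. A sketch, assuming the dict of DataFrames dfs built above with default integer indexes:
import pandas as pd

# Concatenate all frames and group by row position; shorter files simply
# contribute to fewer groups, so nothing from the longer files is lost.
total = pd.concat(dfs.values()).groupby(level=0).sum()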