I have a dataframe and I would like to create a column with event labels. If a condition is true, the row gets an event number, but successive event rows should share the same label. Do you have any ideas? I tried to use .apply and .rolling, but without any success.
DataFrame:
df = pd.DataFrame({'Signal_1' : [0,0,0,1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,1]})
Signal_1 ExpectedColumn
0 0 NaN
1 0 NaN
2 0 NaN
3 1 1
4 1 1
5 0 NaN
6 0 NaN
7 1 2
8 1 2
9 1 2
10 1 2
11 0 NaN
12 0 NaN
13 0 NaN
14 1 3
15 1 3
16 1 3
17 1 3
18 1 3
Here is a way to do it: first create a count-up flag, then take its cumulative sum, and finally set the result to NaN wherever the signal is 0.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Signal_1' : [0,0,0,1,1,0,0,1,1,1,1,0,0,0,1,1,1,1,1]})
# Only count up when the previous sample = 0, and the current sample = 1
df["shift"] = df["Signal_1"].shift(1)
df["countup"] = np.where((df["Signal_1"] == 1) & (df["shift"] == 0),1,0)
# Cumsum the countup flag and set to NaN when sample = 0
df["result"] = df["countup"].cumsum()
df["result"] = np.where(df["Signal_1"] == 0, np.NaN, df["result"] )
I've got a dataframe df = pd.DataFrame({'A':[1,1,2,2],'values':np.arange(10,30,5)})
How can I group by A to get the sum of values, placed in a new column sum_values_A, but only once at the bottom of each group? E.g.:
A values sum_values_A
0 1 10 NaN
1 1 15 25
2 2 20 NaN
3 2 25 45
I tried
df['sum_values_A'] = df.groupby('A')['values'].transform('sum')
df['sum_values_A'] = df.groupby('A')['sum_values_A'].unique()
But I couldn't find a way to get the unique sums placed at the bottom of each group.
You can use:
df.loc[~df['A'].duplicated(keep='last'), 'sum_values_A'] = (
    df.groupby('A')['values'].transform('sum')
)
print(df)
Or:
m = ~df['A'].duplicated(keep='last')
df.loc[m, 'sum_values_A'] = df.loc[m, 'A'].map(df.groupby('A')['values'].sum())
Output:
A values sum_values_A
0 1 10 NaN
1 1 15 25.0
2 2 20 NaN
3 2 25 45.0
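Equivalently, the mask and the group totals can be combined in one pass with transform and Series.where; a sketch that should reproduce the output above:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 2, 2], 'values': np.arange(10, 30, 5)})

# Compute group totals everywhere, then keep them only on each group's last row
last = ~df['A'].duplicated(keep='last')
df['sum_values_A'] = df.groupby('A')['values'].transform('sum').where(last)
print(df)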
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[1,1,2,2],'values':np.arange(10,30,5)})
# cumulative sums per group, kept only on duplicated (non-first) rows;
# select the 'values' column so assign receives a Series, not a DataFrame
df = df.assign(sum_values_A=df.groupby('A')['values'].cumsum()
                            [df['A'].duplicated(keep='first')])
>>> df
A values sum_values_A
0 1 10 NaN
1 1 15 25.0
2 2 20 NaN
3 2 25 45.0
Note that duplicated(keep='first') marks every row after the first in a group, so for groups longer than two rows this would also expose intermediate running sums, not only the final total.
EDIT: Upon request I provide an example that is closer to the real data I am working with.
So I have a table data that looks something like
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.265421 -0.623274 0.041326
4 -2.325031 -0.218792 -1.245911
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.042513 -0.128535
1 1.366463 -0.665195 0.35151
2 0.90347 0.094012 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.009618 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
(think: collection of time series) and a second table valid_range
start stop
run
0 1 3
1 2 5
For each run I want to drop all rows that do not satisfy start ≤ step ≤ stop.
I tried the following (table-generating code at the end):
for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]
However, this results in:
value0 value1 value2
run step
0 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
1 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...
Generating code for the tables:
import numpy as np
from pandas import DataFrame, MultiIndex, Index
rng = np.random.default_rng(0)
valid_range = DataFrame({"start": [1, 2], "stop":[3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[],[]], codes=[[],[]], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=(3))
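As an aside, the same frame can be built in a single call with MultiIndex.from_product; this is a sketch under the same rng seed (the exact values depend on the order of draws, so treat them as illustrative):
import numpy as np
from pandas import DataFrame, MultiIndex

rng = np.random.default_rng(0)
midx = MultiIndex.from_product([range(2), range(6)], names=["run", "step"])
data = DataFrame(rng.normal(size=(12, 3)),  # one draw for all 12 rows
                 index=midx,
                 columns=[f"value{k}" for k in range(3)])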
First, merge data and valid_range on 'run', using the merge method:
>>> data
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.26542 -0.623274 0.041326
4 -2.32503 -0.218792 -1.24591
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.04251 -0.128535
1 1.36646 -0.665195 0.35151
2 0.90347 0.0940123 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.00962 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
>>> valid_range
start stop
run
0 1 3
1 2 5
>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged
run step value0 value1 value2 start stop
0 0 0 0.12573 -0.132105 0.640423 1 3
1 0 1 0.1049 -0.535669 0.361595 1 3
2 0 2 1.304 0.947081 -0.703735 1 3
3 0 3 -1.26542 -0.623274 0.041326 1 3
4 0 4 -2.32503 -0.218792 -1.24591 1 3
5 0 5 -0.732267 -0.544259 -0.3163 1 3
6 1 0 0.411631 1.04251 -0.128535 2 5
7 1 1 1.36646 -0.665195 0.35151 2 5
8 1 2 0.90347 0.0940123 -0.743499 2 5
9 1 3 -0.921725 -0.457726 0.220195 2 5
10 1 4 -1.00962 -0.209176 -0.159225 2 5
11 1 5 0.540846 0.214659 0.355373 2 5
Then select the rows which satisfy the condition using eval, and use the resulting boolean array to mask data. Note that the question asks for start ≤ step ≤ stop, so the comparison must be inclusive:
>>> cond = merged.eval('start <= step <= stop').to_numpy()
>>> data[cond]
            value0     value1    value2
run step
0   1       0.1049  -0.535669  0.361595
    2        1.304   0.947081 -0.703735
    3     -1.26542  -0.623274  0.041326
1   2      0.90347  0.0940123 -0.743499
    3    -0.921725  -0.457726  0.220195
    4     -1.00962  -0.209176 -0.159225
    5     0.540846   0.214659  0.355373
Or, if you prefer, here is a similar approach using query:
res = (
    data.reset_index()
    .merge(valid_range, on='run', how='left')
    .query('start <= step <= stop')
    .drop(columns=['start', 'stop'])
    .set_index(['run', 'step'])
)
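A merge-free alternative is to broadcast the bounds onto the rows through the 'run' index level; a sketch assuming the same data and valid_range tables:
# Align start/stop to every row of data via its 'run' level, then mask
runs = data.index.get_level_values('run')
steps = data.index.get_level_values('step')
start = valid_range['start'].reindex(runs).to_numpy()
stop = valid_range['stop'].reindex(runs).to_numpy()
res = data[(start <= steps) & (steps <= stop)]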
I would go with groupby like this:
(df.groupby(level=0)
   .apply(lambda x: x[x['small'] > 1])
   .reset_index(level=0, drop=True)  # drop the duplicated group level
)
which gives:
                    big  small
animal attribute
cow    speed       30.0   20.0
       weight     250.0  150.0
falcon speed      320.0  250.0
lama   speed       45.0   30.0
       weight     200.0  100.0
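Since the lambda only filters rows, the groupby detour shouldn't actually be needed; plain boolean indexing over the whole frame ought to return the same rows (a sketch assuming the same df):
df[df['small'] > 1]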
daychange SS
0.017065 0
-0.009259 100
0.031542 0
-0.004530 0
0.000709 0
0.004970 100
-0.021900 0
0.003611 0
I have two columns, and I want to calculate the sum of the next 5 'daychange' values whenever SS = 100.
I am using the following right now, but it does not work the way I want it to:
df['total'] = df.loc[df['SS'] == 100, ['daychange']].sum(axis=1)
Since pandas 1.1 you can create a forward-looking rolling window and select the rows you want to keep in your dataframe. With different arguments my notebook kernel got terminated, so use this with caution:
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=5)
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum()[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
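Note that the forward window is simply truncated near the end of the frame; with min_periods=1 the value at row 5 is just the sum of the remaining rows 5-7, which you can verify directly:
df.daychange.iloc[5:8].sum()  # 0.004970 - 0.021900 + 0.003611 = -0.013319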
Exclude the starting row with SS == 100 from the sum
That would be the sum starting from the row after each SS == 100 row. Since all rolling values are computed anyway, you can shift the result:
df['total'] = df.daychange.rolling(indexer, min_periods=1).sum().shift(-1)[df.SS == 100]
df
Out:
daychange SS total
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
Slow, hacky solution using the indices of the selected rows
This feels like a hack, but it works and avoids computing unnecessary rolling values. Note that it relies on the default RangeIndex, because the label x is reused as a position in iloc:
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.023432
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.013319
6 -0.021900 0 NaN
7 0.003611 0 NaN
For the sum of the next five rows excluding the SS == 100 row itself, you can adjust the slices or shift the series:
df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.iloc[x + 1: x + 6].sum())
# df['next5sum'] = df[df.SS == 100].index.to_series().apply(lambda x: df.daychange.shift(-1).iloc[x: x + 5].sum())
df
Out:
daychange SS next5sum
0 0.017065 0 NaN
1 -0.009259 100 0.010791
2 0.031542 0 NaN
3 -0.004530 0 NaN
4 0.000709 0 NaN
5 0.004970 100 -0.018289
6 -0.021900 0 NaN
7 0.003611 0 NaN
How can I fill null values in a column based on the value in another column?
A    B
0    5
1    NaN
1    6
0    NaN
For the null values in B: if the corresponding value in A is 0, fill with the previous value. I want it to look like this:
A    B
0    5
1    NaN
1    6
0    6
numpy.where + isnull + ffill
df.assign(
    B=np.where(df.A.eq(0) & df.B.isnull(), df.B.ffill(), df.B)
)
A B
0 0 5.0
1 1 NaN
2 1 6.0
3 0 6.0
Another way using loc; assignment through loc aligns on the index, so only the rows where A is 0 are overwritten:
df.loc[df['A'].eq(0), 'B'] = df['B'].ffill()
   A    B
0  0  5.0
1  1  NaN
2  1  6.0
3  0  6.0
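For reference, a self-contained, runnable version of the loc approach (the frame construction is an assumption matching the question's data):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 1, 1, 0], 'B': [5, np.nan, 6, np.nan]})
df.loc[df['A'].eq(0), 'B'] = df['B'].ffill()
print(df)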
A faster way (compared to the previous ones); since ffill leaves non-null values untouched, the explicit isnull check can be dropped:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[0,1,1,0], 'B':[5,np.nan,6,np.nan]})
df.B = np.where(df.A==0, df.B.ffill(), df.B)
and you get:
A B
0 0 5.0
1 1 NaN
2 1 6.0
3 0 6.0
I have two pandas DataFrames, df1 and df2:
>>> import pandas as pd
>>> import numpy as np
>>> from random import random
>>> df1 = pd.DataFrame({'x1': range(10), 'y1': np.repeat(0, 10).tolist()})
>>> df2 = pd.DataFrame({'x2': range(0, 10, 2), 'y2': [random() for _ in range(5)]})
>>> df1
x1 y1
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
>>> df2
x2 y2
0 0 0.075922
1 2 0.606703
2 4 0.272918
3 6 0.842641
4 8 0.576636
Now I want to fuse df2 into df1; that is, I want to replace the values of y1 in df1 with the values of y2 from df2 wherever the value of x1 in df1 equals the value of x2 in df2. The final result I need looks like the following:
>>> df1
x1 y1
0 0 0.075922
1 1 0
2 2 0.606703
3 3 0
4 4 0.272918
5 5 0
6 6 0.842641
7 7 0
8 8 0.576636
9 9 0
Although I can use the following loop to get the above result:
>>> for i in range(df1.shape[0]):
...     for j in range(df2.shape[0]):
...         if df1.iloc[i, 0] == df2.iloc[j, 0]:
...             df1.iloc[i, 1] = df2.iloc[j, 1]
...
I think there must be better ways to achieve this. Do you know any? Thank you in advance.
You can use df.update to update your df1 in place; update aligns on the index, and here the x2 values happen to coincide with df1's default RangeIndex, e.g.:
df1.update({'y1': df2.set_index('x2')['y2']})
Gives you:
x1 y1
0 0 0.075922
1 1 0.000000
2 2 0.606703
3 3 0.000000
4 4 0.272918
5 5 0.000000
6 6 0.842641
7 7 0.000000
8 8 0.576636
9 9 0.000000
Use map and then fall back to the original values with fillna; map looks each x1 up in the x2-indexed series, yielding NaN where there is no match:
df1['y1'] = df1['x1'].map(df2.set_index('x2')['y2']).fillna(df1['y1'])
print(df1)
x1 y1
0 0 0.696469
1 1 0.000000
2 2 0.286139
3 3 0.000000
4 4 0.226851
5 5 0.000000
6 6 0.551315
7 7 0.000000
8 8 0.719469
9 9 0.000000
You can also use update after setting indices of both dataframes:
import pandas as pd
import numpy as np
from random import random
df1=pd.DataFrame({'x1':range(10), 'y1':np.repeat(0,10).tolist()})
#set index of the first dataframe to be 'x1'
df1.set_index('x1', inplace=True)
#note the value column is deliberately named 'y1' here so update can match it by column name
df2 = pd.DataFrame({'x2': range(0, 10, 2), 'y1': [random() for _ in range(5)]})
#set index of the second dataframe to be 'x2'
df2.set_index('x2', inplace=True)
#update values in df1 with values in df2
df1.update(df2)
#reset index if necessary (though index will look exactly like x1 column)
df1 = df1.reset_index()
update() seems to be the best option here!
import pandas as pd
import numpy as np
from random import random
# your dataframes
df1 = pd.DataFrame({'x1': range(10), 'y1': np.repeat(0, 10).tolist()})
df2 = pd.DataFrame({'x2': range(0, 10, 2), 'y2': [random() for _ in range(5)]})
# printing df1 and df2 values before update
print(df1)
print(df2)
df1.update({'y1': df2.set_index('x2')['y2']})
# printing df1 after update was performed
print(df1)
Another method: adding the two dataframes together. (Note this only works here because y1 in df1 is all zeros, so the addition amounts to replacement.)
# first give df2 the same column names as df1
df2.columns = ['x1', 'y1']
#now set 'x1' as the index for both dfs (since this is what you want to 'join' on)
df1 = df1.set_index('x1')
df2 = df2.set_index('x1')
print(df1)
y1
x1
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
print(df2)
y1
x1
0 0.525232
2 0.907628
4 0.612100
6 0.497420
8 0.656509
#now you can simply add the two dfs to each other
df_new = df1 + df2
print(df_new)
y1
x1
0 0.317418
1 NaN
2 0.581443
3 NaN
4 0.728766
5 NaN
6 0.495450
7 NaN
8 0.171131
9 NaN
Two problems remain:
The dataframe has NaNs where you want 0s. These are the positions where df2 was not defined; they were effectively NaN in df2, and NaN + anything = NaN. This can be fixed with fillna.
You want 'x1' to be a column, not the index, so just reset the index:
df_new=df_new.reset_index().fillna(0)
print(df_new)
x1 y1
0 0 0.118903
1 1 0.000000
2 2 0.465557
3 3 0.000000
4 4 0.533266
5 5 0.000000
6 6 0.518484
7 7 0.000000
8 8 0.308733
9 9 0.000000