df["Size"] should output a new cumulative total without allowing the sum to drop below 0:
Size
0 11.0
1 18.0
2 -13.0
3 -4.0
4 -26.0
5 30.0
print(df["Cumulative"]) output should read:
Cumulative
0 11
1 29
2 16
3 12
4 0
5 30
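For reference, here is a minimal construction of this sample frame (a sketch; the Size values are taken from the listing above):
import pandas as pd
df = pd.DataFrame({'Size': [11.0, 18.0, -13.0, -4.0, -26.0, 30.0]})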
I hoped lambda might help, but I get an error:
df.Size = df.Size.astype(int)
df["Cumulative"] = df.Size.apply(lambda x: x.cumsum() if x.cumsum() > 0 else 0)
print(df)
Output:
AttributeError: 'int' object has no attribute 'cumsum'
This error appears no matter which data type I cast to ('str', 'float', etc.). The reason is that Series.apply passes each element to the lambda as a scalar, so x is a plain number and has no cumsum method.
Alternatively, I started with:
df.Size = df.Size.astype(int)
df["Cumulative"] = df.Size.cumsum()
Output:
Cumulative
0 11
1 29
2 16
3 12
4 -14
5 16
This output worked as expected but does not stop results from dropping below 0. Note that the plain cumsum cannot simply be post-processed (e.g. with clip(lower=0)): each reset changes every subsequent running total, so the recurrence is inherently sequential.
Update
You can use accumulate from itertools, threading the running balance through a function that clamps it at zero:
from itertools import accumulate
def reset_cumsum(bal, val):
    return max(bal + val, 0)  # enhancement suggested by @Chrysophylaxs
    # return bal if (bal := bal + val) > 0 else 0
df['Cumulative'] = list(accumulate(df['Size'], func=reset_cumsum, initial=0))[1:]
print(df)
# Output
Size Cumulative
0 11.0 11.0
1 18.0 29.0
2 -13.0 16.0
3 -4.0 12.0
4 -26.0 0.0
5 30.0 30.0
You can use expanding and compute the sum at each step: if the sum is greater than 0, return it; otherwise return 0:
>>> df['Size'].expanding().apply(lambda x: c if (c := x.sum()) > 0 else 0)
0 11.0
1 29.0
2 16.0
3 12.0
4 0.0
5 16.0
Name: Size, dtype: float64
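Note that this expanding approach never truly resets the running total: the sum is recomputed from row 0 at every step, so row 5 comes out as 16.0 rather than the expected 30.0, and each step costs O(n), making the whole pass quadratic.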
Related
I have a dataframe like this:
import pandas as pd
import numpy as np
data = {'trip': [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3],
        'timestamps': [1235471761, 1235471763, 1235471765, 1235471767, 1235471770, 1235471772, 1235471776, 1235471779, 1235471780, 1235471789, 1235471792, 1235471793, 1235471829, 1235471833, 1235471835, 1235471838, 1235471844, 1235471847, 1235471848, 1235471852, 1235471855, 1235471859, 1235471900, 1235471904, 1235471911, 1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (consider it the origin) of the "TimeDistance" column and compute a cumulative sum over its values; whenever the sum reaches 10, the cumsum restarts, and this continues until the end of the trip (as you can see, there are 3 trips in the "trip" column).
I want all the cumulative sums in a new column, say a "cumu" column.
Another important point: after the threshold is reached, the next row in the "cumu" column must be zero, and the summation restarts from that new origin.
I hope I've understood your question right. You can use a generator with .send():
def my_accumulate(maxval):
    val = 0
    yield  # prime the generator; the first sent value (the NaN) is discarded
    while True:
        if val < maxval:
            val += yield val  # yield the running total, then add the next sent value
        else:
            yield val  # the total has reached maxval: yield it, discard the next sent value,
            val = 0    # and restart from zero

def fn(x):
    a = my_accumulate(10)
    next(a)  # advance to the first yield
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x

df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
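For comparison, here is a hedged sketch using itertools.accumulate in the spirit of the first question's update; it assumes the frame is ordered by trip and that TimeDistance is NaN in the first row of each trip, so no explicit groupby is needed:
from itertools import accumulate
import numpy as np

def threshold_reset(total, val):
    # restart at a trip boundary (NaN) or once the running total has reached 10
    return 0 if (np.isnan(val) or total >= 10) else total + val

df['cumu'] = list(accumulate(df['TimeDistance'], func=threshold_reset, initial=0))[1:]
This reproduces the printed output above.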
Another solution:
df = df.groupby("trip").apply(
lambda x: x.assign(
cumu=(
val := 0,
*(
val := val + v if val < 10 else (val := 0)
for v in x["TimeDistance"][1:]
),
)
),
)
print(df)
Andrej's answer is better, as mine is probably not as efficient, and it depends on the DataFrame being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cumulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    if np.isnan(df.loc[i, 'TimeDistance']) or cumulative_sum >= 10:
        cumulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cumulative_sum += df.loc[i, 'TimeDistance']
        df.loc[i, 'cumu'] = cumulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0
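If the ordering assumption is a concern, here is a sketch of the same loop scoped per trip with groupby (a hypothetical variant, not part of the original answer):
import numpy as np

df['cumu'] = 0.0
for _, idx in df.groupby('trip').groups.items():
    cumulative_sum = 0.0
    for i in idx:
        td = df.loc[i, 'TimeDistance']
        if np.isnan(td) or cumulative_sum >= 10:
            cumulative_sum = 0.0
        else:
            cumulative_sum += td
        df.loc[i, 'cumu'] = cumulative_sum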
I have a big data frame with float values, and I want to apply two conditional (if/else) ranking operations.
My code:
df =
A B
0 78.2 98.2
1 54.0 58.0
2 45.0 49.0
3 20.0 10.0
# I want to compare each column's data with predefined limits and assign a rank.
# For col A: rank 1 if > 70, rank 2 if between 40 and 70, rank 3 if < 40
# For col B: rank 1 if > 80, rank 2 if between 45 and 80, rank 3 if < 45
# perform the logical operation
df['A_op','B_op'] = pd.cut(df, bins=[[np.NINF, 40, 70, np.inf],[np.NINF, 45, 80, np.inf]], labels=[[3, 2, 1],[3, 2, 1]])
Present output:
ValueError: Input array must be 1 dimensional
Expected output:
df =
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
It doesn't look like you need to use pd.cut for this. You can simply use np.select:
df["A_op"] = np.select([df["A"]>70, df["A"]<40],[1,3], 2)
df["B_op"] = np.select([df["B"]>80, df["B"]<45],[1,3], 2)
print (df)
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
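If you do want pd.cut, note that it handles only one Series at a time, so a per-column loop is needed; a sketch, assuming the same bin edges as in the question (labels run high-to-low because the bins run low-to-high):
import numpy as np
import pandas as pd

bins = {'A': [-np.inf, 40, 70, np.inf], 'B': [-np.inf, 45, 80, np.inf]}
for col, edges in bins.items():
    # intervals are (-inf, low], (low, high], (high, inf) -> ranks 3, 2, 1
    df[f'{col}_op'] = pd.cut(df[col], bins=edges, labels=[3, 2, 1]).astype(int)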
After a series of trials, I found a direct answer built on the same select method.
My answer:
rankdf = pd.DataFrame({'Ah': [70], 'Al': [40], 'Bh': [80], 'Bl': [45]})
hcols = ['Ah', 'Bh']
lcols = ['Al', 'Bl']
# input columns
ip_cols = ['A', 'B']
# create empty output columns in df
op_cols = ['A_op', 'B_op']
df = pd.concat([df, pd.DataFrame(columns=op_cols)])
# logic operation: the 1-row limits broadcast against every row of df
df[op_cols] = np.select([df[ip_cols] > rankdf[hcols].values, df[ip_cols] < rankdf[lcols].values], [1, 3], 2)
Present output:
A B A_op B_op
0 78.2 98.2 1 1
1 54.0 58.0 2 2
2 45.0 49.0 2 2
3 20.0 10.0 3 3
How can I find, for each group, the first element of each session, i.e. the element that starts a new run of continuous values?
import pandas as pd
df = pd.DataFrame({'group': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2],
                   'value': [1, 2, 3, 4, 5, 10, 11, 15, 16, 17, 18, 19, 20,  # 13
                             21, 22, 23, 24, 26, 27.28,
                             4, 5, 6, 8, 9, 10, 11, 12, 13, 14]})
display(df)
So far I am stuck here:
df['shifted_value'] = df['value'].shift(-1)
df['difference_next'] = df['shifted_value'] - df['value']
# this is obviously not yet correct - how can I get the first element (element 0 for each of the starting sessions)?
df['session_element_index'] = df.groupby(['group']).cumcount()
df.head()
In SQL I would use a window function and compare previous/next elements to determine whether a session starts or ends. Is there a nicer, more pandas-native way to do this in a vectorized fashion?
Use DataFrameGroupBy.diff, compare against 1 with ne (not equal), and filter with boolean indexing:
df1 = df[df.groupby('group')['value'].diff().ne(1)]
print (df1)
group value
0 1 1.00
5 1 10.00
7 1 15.00
17 1 26.00
18 1 27.28
19 2 4.00
22 2 8.00
If you need a counter column:
g = df.groupby('group')['value'].apply(lambda x: x.diff().ne(1).cumsum())
df['session_element_index'] = df.groupby(['group', g]).cumcount()
print (df.head(10))
group value session_element_index
0 1 1.0 0
1 1 2.0 1
2 1 3.0 2
3 1 4.0 3
4 1 5.0 4
5 1 10.0 0
6 1 11.0 1
7 1 15.0 0
8 1 16.0 1
9 1 17.0 2
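Once the counter exists, the session starts are simply the rows where it equals 0: df[df['session_element_index'].eq(0)] matches df1 above.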
As a first approach:
out = df.groupby("group", as_index=False).value \
        .apply(lambda s: ((s - s.shift()) != 1.0).cumsum() \
                         .drop_duplicates())
>>> out
0 0 1
5 2
7 3
17 4
18 5
1 19 1
22 2
Name: value, dtype: int64
>>> out.index.get_level_values(1)
Int64Index([0, 5, 7, 17, 18, 19, 22], dtype='int64')
I am trying to calculate the difference between certain rows based on the values in other columns.
Using the example data frame below, I want to calculate the difference in Time based on the values in the Code column. Specifically, I want to loop through and determine the time difference between B and A, i.e. Time at B minus Time at A.
I can do this manually using iloc, but I was hoping to find a more efficient way, especially if I have to repeat this process numerous times.
import pandas as pd
import numpy as np
k = 5
N = 15
d = {'Time': np.random.randint(k, k + 100, size=N),
     'Code': ['A','x','B','x','A','x','B','x','A','x','B','x','A','x','B']}
df = pd.DataFrame(data=d)
Output:
Code Time
0 A 89
1 x 39
2 B 24
3 x 62
4 A 83
5 x 57
6 B 69
7 x 10
8 A 87
9 x 62
10 B 86
11 x 11
12 A 54
13 x 44
14 B 71
Expected Output:
diff
1 -65
2 -14
3 -1
4 17
First filter by boolean indexing, then subtract with sub, using reset_index to create a default index so that Series a and b align; last, if you want a one-column DataFrame, add to_frame:
a = df.loc[df['Code'] == 'A', 'Time'].reset_index(drop=True)
b = df.loc[df['Code'] == 'B', 'Time'].reset_index(drop=True)
Similar alternative solution:
a = df.loc[df['Code'] == 'A'].reset_index()['Time']
b = df.loc[df['Code'] == 'B'].reset_index()['Time']
c = b.sub(a).to_frame('diff')
print (c)
diff
0 -65
1 -14
2 -1
3 17
Last, for a new index starting from 1, add rename:
c = b.sub(a).to_frame('diff').rename(lambda x: x + 1)
print (c)
diff
1 -65
2 -14
3 -1
4 17
Another approach, if you need to compute more differences, is to reshape with unstack:
df = df.set_index(['Code', df.groupby('Code').cumcount() + 1])['Time'].unstack()
print (df)
1 2 3 4 5 6 7
Code
A 89.0 83.0 87.0 54.0 NaN NaN NaN
B 24.0 69.0 86.0 71.0 NaN NaN NaN
x 39.0 62.0 57.0 10.0 62.0 11.0 44.0
# last, drop the NaN entries
c = df.loc['B'].sub(df.loc['A']).dropna()
print (c)
1 -65.0
2 -14.0
3 -1.0
4 17.0
dtype: float64
# subtract with NaN values present - fill_value=0 returns non-NaN values
d = df.loc['x'].sub(df.loc['A'], fill_value=0)
print (d)
1 -50.0
2 -21.0
3 -30.0
4 -44.0
5 62.0
6 11.0
7 44.0
dtype: float64
Assuming your Code is a repeat of 'A', 'x', 'B', 'x', you can just use
>>> (df.Time[df.Code == 'B'].reset_index() - df.Time[df.Code == 'A'].reset_index())[['Time']]
Time
0 -65
1 -14
2 -1
3 17
But note that the original assumption, that 'A' and 'B' values alternate, seems fragile.
If you want the indexes to run from 1 to 4, as in your question, you can assign the previous result to diff and then use
diff.index += 1
>>> diff
Time
1 -65
2 -14
3 -1
4 17
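If that fragility matters, here is a hedged sketch that pairs each 'B' with the most recent preceding 'A' instead (assuming at most one 'B' per 'A'; the pair column is introduced purely for illustration):
sub = df[df['Code'].isin(['A', 'B'])].copy()
sub['pair'] = sub['Code'].eq('A').cumsum()  # bucket each row by the last 'A' seen
wide = sub.pivot(index='pair', columns='Code', values='Time')
diff = (wide['B'] - wide['A']).dropna().to_frame('diff')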
Hi, I would like to implement a counter that counts the number of successive zero observations in a DataFrame (across multiple columns), resetting whenever a non-zero observation is found. I have used a for loop, but it is incredibly slow; I am sure there must be far more efficient ways. This is my code:
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty data frame with the same index and columns as df above:
df1 = pd.DataFrame(index=df.index, columns=df.columns)
df1 = df1.fillna(0)
Then I create my function, which iterates over the rows but only deals with one column at a time:
def zero_obs(x=df, y=df1):
    for i in range(len(x)):
        if x[i] == 0:
            y[i] = y[i-1] + 1
        else:
            y[i] = 0
    return y

for col in df.columns:
    df1[col] = zero_obs(x=df[col], y=df1[col])
Really appreciate any help!!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0
setup
Consider the dataframe df
df = pd.DataFrame(
    np.zeros((10, 2), dtype=int),
    columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    csum = x.eq(0).cumsum()                       # running count of zeros seen so far
    cpos = csum.where(x.ne(0)).ffill().fillna(0)  # that count frozen at the last non-zero
    return csum.sub(cpos)                         # zeros since the last non-zero
print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
Option 2
don't use apply
This function works just as well when applied to df directly, because every step operates column-wise:
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
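The float dtype in both outputs comes from the where/ffill step, which introduces NaN and forces a float intermediate; chain .astype(int) onto the result (e.g. df.apply(zero_obs).astype(int)) if you need integer counts. Option 2 works on the whole frame because eq, cumsum, where, ffill, fillna and sub all operate column-wise on a DataFrame.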