Implement a counter which resets in a Python pandas DataFrame - python

Hi, I would like to implement a counter that counts the number of successive zero observations in a DataFrame (across multiple columns), but resets whenever a non-zero observation is found. I have used a for loop, but it is incredibly slow; I am sure there must be far more efficient ways. This is my code:
Here is a snapshot of df
df.head()
ACL ACT ADH ADR AFE AFH AFT
2013-02-05 NaN NaN NaN NaN NaN NaN NaN
2013-02-12 -0.136861 -0.020406 0.046150 0.000000 -0.005321 NaN 0.058195
2013-02-19 -0.006632 0.041665 0.007365 0.012738 0.040930 NaN -0.037818
2013-02-26 -0.023848 -0.023999 -0.030677 -0.003144 0.050604 NaN -0.047604
2013-03-05 0.009771 -0.024589 -0.021073 -0.039432 0.047315 NaN 0.068727
I first initialise an empty DataFrame with the same index and columns as the df above
df1 = pd.DataFrame(index=df.index, columns=df.columns)
df1 = df1.fillna(0)
Then I create my function which iterates over the rows, but this only deals with one column at a time
def zero_obs(x=df, y=df1):
    for i in range(len(x)):
        if x[i] == 0:
            y[i] = y[i-1] + 1
        else:
            y[i] = 0
    return y

for col in df.columns:
    df1[col] = zero_obs(x=df[col], y=df1[col])
Really appreciate any help!
The output I expect is as follows:
df1.tail()
BRN AXL TTO AGL ACL
2017-01-03 3 125 0 0 0
2017-01-10 0 126 0 0 0
2017-01-17 1 127 0 0 0
2017-01-24 0 128 0 0 0
2017-01-31 0 129 1 0 0

setup
Consider the dataframe df
df = pd.DataFrame(
    np.zeros((10, 2), dtype=int),
    columns=list('AB')
)
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1
print(df)
A B
0 1 0
1 0 0
2 0 0
3 0 0
4 1 0
5 0 0
6 0 1
7 0 0
8 1 0
9 0 0
Option 1
pandas apply
def zero_obs(x):
    """`x` is assumed to be a `pd.Series`"""
    csum = x.eq(0).cumsum()
    cpos = csum.where(x.ne(0)).ffill().fillna(0)
    return csum.sub(cpos)
print(df.apply(zero_obs))
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
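To see how the helper Series behave, here is a small sketch (not part of the original answer) that rebuilds the setup frame and prints the intermediates for column B; csum and cpos are the names used inside zero_obs:
import numpy as np
import pandas as pd

# rebuild the setup frame from above
df = pd.DataFrame(np.zeros((10, 2), dtype=int), columns=list('AB'))
df.loc[[0, 4, 8], 'A'] = 1
df.loc[6, 'B'] = 1

x = df['B']
csum = x.eq(0).cumsum()                       # running count of zeros seen so far
cpos = csum.where(x.ne(0)).ffill().fillna(0)  # that count frozen at the last non-zero
print(pd.concat({'B': x, 'csum': csum, 'cpos': cpos, 'streak': csum - cpos}, axis=1))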
Option 2
don't use apply
This function works just as well on df
zero_obs(df)
A B
0 0.0 1.0
1 1.0 2.0
2 2.0 3.0
3 3.0 4.0
4 0.0 5.0
5 1.0 6.0
6 2.0 0.0
7 3.0 1.0
8 0.0 2.0
9 1.0 3.0
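A related idiom (a sketch, not part of the answer above) gets the same result by grouping each column on the running count of its non-zero values, reusing the setup df:
# within each run of zeros, the cumulative sum of the zero flags is the streak length
streak = pd.DataFrame({
    col: df[col].eq(0).groupby(df[col].ne(0).cumsum()).cumsum()
    for col in df.columns
})
print(streak)  # same values as the apply-based output above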

Related

How to calculate cumulative sum until a threshold and reset it after the threshold is reached considering groups in pandas dataframe in python?

I have a dataframe like this:
import pandas as pd
import numpy as np
data = {'trip': [1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3],
        'timestamps': [1235471761, 1235471763, 1235471765, 1235471767, 1235471770, 1235471772, 1235471776, 1235471779, 1235471780, 1235471789, 1235471792, 1235471793, 1235471829, 1235471833, 1235471835, 1235471838, 1235471844, 1235471847, 1235471848, 1235471852, 1235471855, 1235471859, 1235471900, 1235471904, 1235471911, 1235471913]}
df = pd.DataFrame(data)
df['TimeDistance'] = df.groupby('trip')['timestamps'].diff(1)
df
What I am looking for is to start from the first row (consider it the origin) of the "TimeDistance" column and take a cumulative sum over its values; whenever this sum reaches 10, restart the cumsum and continue until the end of the trip (as you can see, this dataframe has 3 trips in the "trip" column).
I want all the cumulative sums in a new column, let's say a "cumu" column.
Another important point is that after the threshold is reached, the next row in the "cumu" column must be zero and the summation restarts from that new origin.
I hope I've understood your question right. You can use a generator with .send():
def my_accumulate(maxval):
    val = 0
    yield                     # priming yield: the first value sent is discarded
    while True:
        if val < maxval:
            val += yield val  # yield the running sum, then add the next value sent
        else:
            yield val         # yield the value that crossed the threshold ...
            val = 0           # ... then reset; the value sent here is discarded

def fn(x):
    a = my_accumulate(10)
    next(a)
    x["cumu"] = [a.send(v) for v in x["TimeDistance"]]
    return x

df = df.groupby("trip").apply(fn)
print(df)
Prints:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0.0
1 1 1235471763 2.0 2.0
2 1 1235471765 2.0 4.0
3 1 1235471767 2.0 6.0
4 1 1235471770 3.0 9.0
5 1 1235471772 2.0 11.0
6 1 1235471776 4.0 0.0
7 1 1235471779 3.0 3.0
8 1 1235471780 1.0 4.0
9 1 1235471789 9.0 13.0
10 1 1235471792 3.0 0.0
11 1 1235471793 1.0 1.0
12 2 1235471829 NaN 0.0
13 2 1235471833 4.0 4.0
14 2 1235471835 2.0 6.0
15 2 1235471838 3.0 9.0
16 2 1235471844 6.0 15.0
17 2 1235471847 3.0 0.0
18 2 1235471848 1.0 1.0
19 2 1235471852 4.0 5.0
20 2 1235471855 3.0 8.0
21 2 1235471859 4.0 12.0
22 3 1235471900 NaN 0.0
23 3 1235471904 4.0 4.0
24 3 1235471911 7.0 11.0
25 3 1235471913 2.0 0.0
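A quick sanity check on how the generator behaves (hypothetical inputs, with my_accumulate defined as above):
acc = my_accumulate(10)
next(acc)  # prime the generator; it pauses at the bare `yield`
print([acc.send(v) for v in [float('nan'), 2, 3, 6, 1, 4]])
# -> [0, 2, 5, 11, 0, 4]
# The leading NaN is consumed by the priming yield (hence the first 0), and the
# value sent right after the threshold is crossed (the 1) is discarded by the reset.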
Another solution:
df = df.groupby("trip").apply(
    lambda x: x.assign(
        cumu=(
            val := 0,
            *(
                val := val + v if val < 10 else (val := 0)
                for v in x["TimeDistance"][1:]
            ),
        )
    ),
)
print(df)
Andrej's answer is better, as mine is probably not as efficient, and it depends on the df being ordered by trip and on TimeDistance being NaN in the first row of each trip.
cumulative_sum = 0
df['cumu'] = 0
for i in range(len(df)):
    if np.isnan(df.loc[i, 'TimeDistance']) or cumulative_sum >= 10:
        cumulative_sum = 0
        df.loc[i, 'cumu'] = 0
    else:
        cumulative_sum += df.loc[i, 'TimeDistance']
        df.loc[i, 'cumu'] = cumulative_sum
print(df) outputs:
trip timestamps TimeDistance cumu
0 1 1235471761 NaN 0
1 1 1235471763 2.0 2
2 1 1235471765 2.0 4
3 1 1235471767 2.0 6
4 1 1235471770 3.0 9
5 1 1235471772 2.0 11
6 1 1235471776 4.0 0
7 1 1235471779 3.0 3
8 1 1235471780 1.0 4
9 1 1235471789 9.0 13
10 1 1235471792 3.0 0
11 1 1235471793 1.0 1
12 2 1235471829 NaN 0
13 2 1235471833 4.0 4
14 2 1235471835 2.0 6
15 2 1235471838 3.0 9
16 2 1235471844 6.0 15
17 2 1235471847 3.0 0
18 2 1235471848 1.0 1
19 2 1235471852 4.0 5
20 2 1235471855 3.0 8
21 2 1235471859 4.0 12
22 3 1235471900 NaN 0
23 3 1235471904 4.0 4
24 3 1235471911 7.0 11
25 3 1235471913 2.0 0

Applying a lambda function to columns on pandas avoiding redundancy

I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
Id Name Count
0 1 Eve 10.0
1 2 Diana 3.0
2 3 NaN NaN
3 4 Mia 8.0
4 5 Mae 5.0
5 6 NaN 2.0
I want to test whether each column has a NaN value (0) or not (1), creating two new columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it raises The truth value of a Series is ambiguous. I want to do this without redundancy, but I see there is a mistake in my logic. Please, could you help me with this question?
The expected table is:
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Multiply the boolean mask by 1:
df[['Name_flag', 'Count_flag']] = df[['Name', 'Count']].notna() * 1
>>> df
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
For your problem of The truth value of a Series is ambiguous:
With apply you cannot return a scalar 0 or 1, because each x the function receives is a whole Series, not a single value. You would have to use applymap instead to apply a function elementwise, but comparing against NaN directly is awkward (NaN is not equal to itself).
Try:
df[['Name', 'Count']].applymap(lambda x: str(x) != 'nan') * 1
We can use notna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].notna().astype(int)
Id Name Count Name_flag Count_flag
0 1 Eve 10.00 1 1
1 2 Diana 3.00 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.00 1 1
4 5 Mae 5.00 1 1
5 6 NaN 2.00 0 1
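If the goal is to flag several columns at once without repeating yourself, one further sketch (the '_flag' naming via add_suffix is an assumption) builds all the flag columns in one go from the question's df:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                   'Name': ['Eve', 'Diana', np.nan, 'Mia', 'Mae', np.nan],
                   'Count': [10, 3, np.nan, 8, 5, 2]})

# 1 where a value is present, 0 where it is missing, with '_flag' appended
flags = df[['Name', 'Count']].notna().astype(int).add_suffix('_flag')
df = df.join(flags)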

Applying a function to chunks of the Dataframe

I have a DataFrame (df), for instance this simplified version:
A B
0 2.0 3.0
1 3.0 4.0
and generated 20 bootstrap resamples that are all now in the same df but differ in the Resample Nr.
A B
0 1 0 2.0 3.0
1 1 1 3.0 4.0
2 2 1 3.0 4.0
3 2 1 3.0 4.0
.. ..
.. ..
39 20 0 2.0 3.0
40 20 0 2.0 3.0
Now I want to apply a certain function to each Resample Nr., say:
C = sum(df['A'] * df['B']) / sum(df['B'] ** 2)
The output would look like this:
A B C
0 1 0 2.0 3.0 Calculated Value X1
1 1 1 3.0 4.0 Calculated Value X1
2 2 1 3.0 4.0 Calculated Value X2
3 2 1 3.0 4.0 Calculated Value X2
.. ..
.. ..
39 20 0 2.0 3.0 Calculated Value X20
40 20 0 2.0 3.0 Calculated Value X20
So there are 20 different new values.
I know there is a df.iloc command where I can specify my row selection df.iloc[row, column] but I would like to find a command where I don't have to repeat the code for the 20 samples.
My goal is to find a command that identifies the Resample Nr. automatically and then calculates the function for each Resample Nr.
How can I do this?
Thank you!
Use DataFrame.assign to create two new columns x and y that correspond to df['A'] * df['B'] and df['B']**2, then use DataFrame.groupby on the Resample Nr. (or level=1) and transform using sum:
s = df.assign(x=df['A'].mul(df['B']), y=df['B']**2)\
      .groupby(level=1)[['x', 'y']].transform('sum')
df['C'] = s['x'].div(s['y'])
Result:
A B C
0 1 0 2.0 3.0 0.720000
1 1 1 3.0 4.0 0.720000
2 2 1 3.0 4.0 0.750000
3 2 1 3.0 4.0 0.750000
39 20 0 2.0 3.0 0.666667
40 20 0 2.0 3.0 0.666667
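An alternative sketch computes C once per resample and broadcasts it back; the index layout here is a guess at the question's structure (the resample number on the second index level):
import pandas as pd

# hypothetical reconstruction: level 1 of the index is the resample number
idx = pd.MultiIndex.from_tuples([(0, 1), (1, 1), (2, 2), (3, 2)],
                                names=[None, 'resample'])
df = pd.DataFrame({'A': [2.0, 3.0, 3.0, 3.0],
                   'B': [3.0, 4.0, 4.0, 4.0]}, index=idx)

# one value of C per resample ...
c = df.groupby(level='resample').apply(
    lambda g: (g['A'] * g['B']).sum() / (g['B'] ** 2).sum())
# ... broadcast back onto every row of that resample
df['C'] = df.index.get_level_values('resample').map(c)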

Python Pandas: difference of column values insert into new column

I have a Pandas dataframe that looks like the following:
c1 c2 c3 c4
p1 q1 r1 20
p2 q2 r2 10
p3 q3 r1 30
The desired output looks like this:
c1 c2 c3 c4 NewColumn(c1.1)
p1 q1 r1 20 0
p2 q2 r2 10 p2-p1
p3 q3 r1 30 p3-p2
The shape of my dataset is (333650, 665). I want to do this for all columns. Is there any way to achieve this?
The code I am using:
data = pd.read_csv('Mydataset.csv')
i = 0
j = 1
while j < len(data['columnname']):
    j = data['columnname'][i+1] - data['columnname'][i]
    i += 1  # Next value of column.
    j += 1  # Next value new column.
    print(j)
Is this what you want? It finds the difference between the rows of a particular column using the shift method and assigns it to a new column.
Note that I am using the data from Dave.
df['New Column'] = df.a.sub(df.a.shift()).fillna(0)
a b c New Column
0 1 1 1 0.0
1 2 1 4 1.0
2 3 2 9 1.0
3 4 3 16 1.0
4 5 5 25 1.0
5 6 8 36 1.0
For multiple columns, this may suffice:
M = df.diff().fillna(0).add_suffix('_1')
#concatenate along the columns axis
pd.concat([df,M], axis = 1)
a b c a_1 b_1 c_1
0 1 1 1 0.0 0.0 0.0
1 2 1 4 1.0 0.0 3.0
2 3 2 9 1.0 1.0 5.0
3 4 3 16 1.0 1.0 7.0
4 5 5 25 1.0 2.0 9.0
5 6 8 36 1.0 3.0 11.0
You want the diff function:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html
df
a b c
0 1 1 1
1 2 1 4
2 3 2 9
3 4 3 16
4 5 5 25
5 6 8 36
df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0

Forward fill column with an index-based limit

I want to forward fill a column and I want to specify a limit, but I want the limit to be based on the index, not a simple number of rows as limit allows.
For example, say I have the dataframe given by:
df = pd.DataFrame({
    'data': [0.0, 1.0, np.nan, 3.0, np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    'group': [0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
})
which looks like
In [27]: df
Out[27]:
data group
0 0.0 0
1 1.0 0
2 NaN 0
3 3.0 1
4 NaN 1
5 5.0 0
6 NaN 0
7 NaN 0
8 NaN 1
9 NaN 1
If I group by the group column and forward fill in that group with limit=2, then my resulting data frame will be
In [35]: df.groupby('group').ffill(limit=2)
Out[35]:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 3.0
9 1 NaN
What I actually want to do here, however, is only forward fill onto rows whose indexes are within, say, 2 of the index of the last non-NaN value in each group, as opposed to the next 2 rows of each group. For example, if we just look at the groups of the dataframe:
In [36]: for i, group in df.groupby('group'):
...: print(group)
...:
data group
0 0.0 0
1 1.0 0
2 NaN 0
5 5.0 0
6 NaN 0
7 NaN 0
data group
3 3.0 1
4 NaN 1
8 NaN 1
9 NaN 1
I would want the second group here to only be forward filled to index 4, not to 8 and 9. The first group's NaN values are all within 2 indexes of the last non-NaN values, so they would be filled completely. The resulting dataframe would look like:
group data
0 0 0.0
1 0 1.0
2 0 1.0
3 1 3.0
4 1 3.0
5 0 5.0
6 0 5.0
7 0 5.0
8 1 NaN
9 1 NaN
FWIW in my actual use case, my index is a DateTimeIndex (and it is sorted).
I currently have a solution that sort of works: it loops through the dataframe filtered on the group indexes, creates a time range for every single event with a non-NaN value based on the index, and then combines those. But this is far too slow to be practical.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'data': [0.0, 1.0, 1, 3.0, np.nan, 22, np.nan, 5, np.nan, np.nan],
    'group': [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]})
df = df.reset_index()
df['stop_index'] = df['index'] + 2
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
print(df)
# index data group stop_index mask
# 0 0 0.0 0 2.0 True
# 1 1 1.0 0 3.0 True
# 2 2 1.0 1 4.0 True
# 3 3 3.0 0 5.0 True
# 4 4 1.0 1 4.0 True
# 5 5 22.0 0 7.0 True
# 6 6 NaN 1 4.0 False
# 7 7 5.0 0 9.0 True
# 8 8 NaN 1 4.0 False
# 9 9 NaN 1 4.0 False
# clean up df
df = df[['data', 'group']]
print(df)
yields
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
This copies the index into a column, then
makes a second stop_index column which is the index augmented by the size of
the (time) window.
df = df.reset_index()
df['stop_index'] = df['index'] + 2
Then it nulls out the rows of stop_index where data is null:
df['stop_index'] = df['stop_index'].where(pd.notnull(df['data']))
Then it forward-fills stop_index on a per-group basis:
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
Now (at last) we can define the desired mask, the places where we actually want to forward-fill data:
df['mask'] = df['index'] <= df['stop_index']
df.loc[df['mask'], 'data'] = df.groupby('group')['data'].ffill()
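Since the question mentions a DatetimeIndex, here is a sketch of the same stop_index idea with a time-based window (hypothetical data, a 2-day limit instead of 2 rows):
import numpy as np
import pandas as pd

idx = pd.date_range('2017-01-01', periods=6, freq='D')
df = pd.DataFrame({'data': [1.0, np.nan, np.nan, np.nan, 2.0, np.nan],
                   'group': [0, 0, 0, 0, 0, 0]}, index=idx)

df = df.reset_index()
df['stop_index'] = df['index'] + pd.Timedelta(days=2)            # timestamp-based stop
df['stop_index'] = df['stop_index'].where(df['data'].notnull())  # only at observed values
df['stop_index'] = df.groupby('group')['stop_index'].ffill()
mask = df['index'] <= df['stop_index']
df.loc[mask, 'data'] = df.groupby('group')['data'].ffill()
df = df.set_index('index')[['data', 'group']]                    # 2017-01-04 stays NaN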
IIUC, reindex each group onto the full index so that ffill(limit=2) counts positions in the original index rather than within the group, then slice the group's rows back out:
l = []
for i, group in df.groupby('group'):
    idx = group.index
    l.append(group.reindex(df.index).ffill(limit=2).loc[idx])
pd.concat(l).sort_index()
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 0.0
3 3.0 1.0
4 3.0 1.0
5 5.0 0.0
6 5.0 0.0
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Testing data
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 NaN 1
5 22 0
6 NaN 1
7 5.0 0
8 NaN 1
9 NaN 1
My method on the testing data:
data group
0 0.0 0.0
1 1.0 0.0
2 1.0 1.0
3 3.0 0.0
4 1.0 1.0
5 22.0 0.0
6 NaN 1.0 # not filled here, since there is no valid value for group 1 within the previous two index positions
7 5.0 0.0
8 NaN 1.0
9 NaN 1.0
Output with unutbu's method:
data group
0 0.0 0
1 1.0 0
2 1.0 1
3 3.0 0
4 1.0 1
5 22.0 0
6 1.0 1 # mismatch here
7 5.0 0
8 NaN 1
9 NaN 1
