So I have a data set with about 70,000 data points, and I'm trying to test out some code on a sample data set to make sure it will work on the large one. The sample data set follows this format:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
             'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'time': ['2009-07-09 15:00:00',
             '2009-07-09 18:33:00',
             '2009-07-09 20:55:00',
             '2009-07-10 00:01:00',
             '2009-07-10 09:00:00',
             '2009-07-10 15:00:00',
             '2009-07-10 18:00:00',
             '2009-07-11 00:01:00',
             '2009-07-12 03:10:00',
             '2009-07-09 06:00:00',
             '2009-07-10 15:00:00',
             '2009-07-11 18:00:00',
             '2009-07-11 21:00:00',
             '2009-07-12 00:30:00',
             '2009-07-12 12:05:00',
             '2009-07-12 15:00:00',
             '2009-07-13 21:00:00',
             '2009-07-14 00:01:00'],
    'Score': [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0,
              0.0, -1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0],
})
print(df)
I'm essentially trying to create two indicator columns. The first indicator column follows the rule that, for each condition (A and B), once a score of -1 appears, that row and every subsequent row in that condition should be marked "1". The second indicator column should indicate for each row whether at least 24 hours have passed since the last score of -1. The final result should look something like:
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
This is in a similar realm to the question I asked yesterday about Indicator 1, but because my large data set has so many conditions (700+), it isn't feasible to write out all the cond values individually, so I need help applying the Indicator 1 solution in that situation. For Indicator 2, I was working on using a rolling window function, but every rolling window example I found computes rolling sums or rolling means, which is not what I'm trying to compute here, so I'm unsure whether what I want can be done with a rolling window.
Try:
# convert to datetime if needed
df["time"] = pd.to_datetime(df["time"])
# get the first time the score is -1 for each cond
first = df["cond"].map(df[df["Score"].eq(-1)].groupby("cond")["time"].min())
# get the most recent time that the score is -1
recent = df.loc[df["Score"].eq(-1), "time"].reindex(df.index, method="ffill")
# check that the time is greater than or equal to the first -1
df["Indicator 1"] = df["time"].ge(first).astype(int)
# check that at least 1 day has passed since the most recent -1
df["Indicator 2"] = df["time"].sub(recent).dt.days.ge(1).astype(int)
>>> df
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
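One caveat worth noting (my own addition, not part of the answer above): recent is forward-filled across the whole frame, so if rows from different cond groups ever interleave, the most recent -1 time could leak from one condition into the next. A hedged per-group variant, assuming the same df as above:
# forward-fill the most recent -1 time within each condition only
recent = df["time"].where(df["Score"].eq(-1)).groupby(df["cond"]).ffill()
# rows before a group's first -1 stay NaT and therefore get 0
df["Indicator 2"] = df["time"].sub(recent).dt.days.ge(1).astype(int)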
A simple approach IMO, using cummax for the first indicator, and a diff from the first value per group combined with a mask for the second:
# indicator 1
df['Indicator 1'] = df['Score'].eq(-1).astype(int).groupby(df['cond']).cummax()
# indicator 2
# convert to datetime
df['time'] = pd.to_datetime(df['time'])
# groups starting by -1
m1 = df['Score'].eq(-1).groupby(df['cond']).cumsum()
# is the time difference greater than 24h since the group start
m2 = df.groupby(['cond', m1])['time'].apply(lambda s: s.sub(s.iloc[0]).gt('24h'))
df['Indicator 2'] = (m1.ne(0) & m2).astype(int)
Output:
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
Related
I have two dataframes:
daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01",end="2021-04-29")})
pc21 = pd.DataFrame({'Date': ["2021-01-21", "2021-03-11", "2021-04-22"]})
pc21['Date'] = pd.to_datetime(pc21['Date'])
What I want to do is the following: for every date in pc21 that also appears in daily, I want a new column in daily that equals 1 for that date and the 7 days after it (8 days in total), and 0 otherwise.
This is an example of a desired output:
# 2021-01-21 is in both dataframes, so I want a new column in 'daily' that looks like this:
Date newcol
.
.
.
2021-01-20 0
2021-01-21 1
2021-01-22 1
2021-01-23 1
2021-01-24 1
2021-01-25 1
2021-01-26 1
2021-01-27 1
2021-01-28 1
2021-01-29 0
.
.
.
Can anyone help me achieve this?
Thanks!
you can try the following approach:
res = (daily
       .merge(pd.concat([pd.date_range(d, freq="D", periods=8).to_frame(name="Date")
                         for d in pc21["Date"]]),
              how="left", indicator=True)
       .replace({"both": 1, "left_only": 0})
       .rename(columns={"_merge": "newcol"}))
Result:
In [15]: res
Out[15]:
Date newcol
0 2021-01-01 0
1 2021-01-02 0
2 2021-01-03 0
3 2021-01-04 0
4 2021-01-05 0
.. ... ...
114 2021-04-25 1
115 2021-04-26 1
116 2021-04-27 1
117 2021-04-28 1
118 2021-04-29 1
[119 rows x 2 columns]
daily['value'] = 0
pc21['value'] = 1

daily = (pd.merge(daily, pc21, on='Date', how='left')
           .rename(columns={'value_y': 'value'})
           .drop(columns='value_x')
           .ffill(limit=7)
           .fillna(0))

# drop the helper column from pc21 again
pc21 = pc21.drop(columns='value')
Output Subset
daily.query('value.eq(1)')
Date value
20 2021-01-21 1.0
21 2021-01-22 1.0
22 2021-01-23 1.0
23 2021-01-24 1.0
24 2021-01-25 1.0
25 2021-01-26 1.0
26 2021-01-27 1.0
27 2021-01-28 1.0
69 2021-03-11 1.0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].fillna(method="ffill", limit=7).fillna(0)
We generate the new column first:
If the Date of daily is in Date of pc21
then put 1
else
put a NaN
Then forward fill that column but with a limit of 7 so that we have 8 consecutive 1s
Lastly, fill the remaining NaNs with 0.
(you can put an astype(int) at the end to have integers).
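As a quick sanity check (a sketch of my own, rebuilding the daily and pc21 frames from the question), the window around 2021-01-21 comes out as in the desired output:
import numpy as np
import pandas as pd

daily = pd.DataFrame({'Date': pd.date_range(start="2021-01-01", end="2021-04-29")})
pc21 = pd.DataFrame({'Date': pd.to_datetime(["2021-01-21", "2021-03-11", "2021-04-22"])})

# 1 on a pc21 date, forward-filled for 7 more days, everything else 0
daily["new_col"] = np.where(daily.Date.isin(pc21.Date), 1, np.nan)
daily["new_col"] = daily["new_col"].ffill(limit=7).fillna(0).astype(int)

# the 8-day window starting on 2021-01-21 is 1, the days around it are 0
print(daily[daily.Date.between("2021-01-20", "2021-01-29")])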
I've got a dataframe of web requests called new_dh that looks like this (there are more columns):
s-sitename sc-win32-status
date_time
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 02:00:00 W3SVC1 0.0
2007-02-28 02:00:00 W3SVC1 0.0
2007-02-28 10:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
2007-02-28 23:00:00 W3SVC1 0.0
What I would like to do is group by the hour of the DatetimeIndex (the actual date of the request does not matter, just the hour, and all the times have already been rounded down to the hour) and return the following:
count
hour
0 2
1 2
2 2
10 1
23 3
Any help would be much appreciated.
I have tried
new_dh.groupby([new_dh.index.hour]).count()
but find myself printing many columns of the same value whereas I only want the above version
If you need a DatetimeIndex in the output, use DataFrame.resample:
new_dh.resample('H')['s-sitename'].count()
Or DatetimeIndex.floor:
new_dh.groupby(new_dh.index.floor('H'))['s-sitename'].count()
The problem with your solution is that GroupBy.count counts the values of every column per hour and excludes missing values, so if there are no missing values you get multiple columns with the same numbers. A possible solution is to specify the column after the groupby:
new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
Here the data was changed to show how count excludes missing values:
print (new_dh)
s-sitename sc-win32-status
date_time
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 00:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 01:00:00 W3SVC1 0.0
2006-11-01 02:00:00 NaN 0.0
2007-02-28 02:00:00 W3SVC1 0.0
2007-02-28 10:00:00 W3SVC1 0.0
2007-02-28 23:00:00 NaN 0.0
2007-02-28 23:00:00 NaN 0.0
2007-02-28 23:00:00 W3SVC1 0.0
df = new_dh.groupby([new_dh.index.hour]).count()
print (df)
s-sitename sc-win32-status
date_time
0 2 2
1 2 2
2 1 2
10 1 1
23 1 3
So if the column is specified:
s = new_dh.groupby([new_dh.index.hour])['s-sitename'].count()
print (s)
date_time
0 2
1 2
2 1
10 1
23 1
Name: s-sitename, dtype: int64
df = new_dh.groupby([new_dh.index.hour])['s-sitename'].count().to_frame()
print (df)
s-sitename
date_time
0 2
1 2
2 1
10 1
23 1
If you also need to count missing values, use GroupBy.size:
s = new_dh.groupby([new_dh.index.hour])['s-sitename'].size()
print (s)
date_time
0 2
1 2
2 2
10 1
23 3
Name: s-sitename, dtype: int64
df = new_dh.groupby([new_dh.index.hour])['s-sitename'].size().to_frame()
print (df)
s-sitename
date_time
0 2
1 2
2 2
10 1
23 3
new_dh['hour'] = new_dh.index.map(lambda x: x.hour)
new_dh.groupby('hour')['hour'].count()
Result
hour
0 2
1 2
2 2
10 1
23 3
Name: hour, dtype: int64
If you need a DataFrame as result:
new_dh.groupby('hour')['hour'].count().rename('count').to_frame()
In this case, the result will be:
count
hour
0 2
1 2
2 2
10 1
23 3
You can also do this using the groupby() and assign() methods.
If the 'date_time' column is not your index:
result=df.assign(hour=df['date_time'].dt.hour).groupby('hour').agg(count=('s-sitename','count'))
If it's your index, then use:
result=df.groupby(df.index.hour)['s-sitename'].count().to_frame('count')
result.index.name='hour'
Now if you print result, you will get your desired output:
count
hour
0 2
1 2
2 2
10 1
23 3
I need to subtract dates based on the progression of fault count.
Below is a table with the two input columns, Date and Fault_Count. The output columns I need are Option1 and Option2; the last two columns show the date-difference calculations. Basically, whenever the Fault_Count changes, I need to count the number of days from the start of the previous count value to the row where it changed. For example, the Fault_Count changed to 2 on 1/4/2020, so I need the number of days from when the Fault_Count started at 0 to when it changed to 2 (i.e. 1/4/2020 - 1/1/2020 = 3).
Date Fault_Count Option1 Option2 Option1calc Option2calc
1/1/2020 0 0 0
1/2/2020 0 0 0
1/3/2020 0 0 0
1/4/2020 2 3 3 1/4/2020-1/1/2020 1/4/2020-1/1/2020
1/5/2020 2 0 0
1/6/2020 2 0 0
1/7/2020 4 3 3 1/7/2020-1/4/2020 1/7/2020-1/4/2020
1/8/2020 4 0 0
1/9/2020 5 2 2 1/9/2020-1/7/2020 1/9/2020-1/7/2020
1/10/2020 5 0 0
1/11/2020 0 2 -2 1/11/2020-1/9/2020 (1/11/2020-1/9/2020)*-1 as the fault resets
1/12/2020 1 1 1 1/12/2020-1/11/2020 1/12/2020-1/11/2020
Below is the code.
import pandas as pd
d = {'Date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020', '1/6/2020', '1/7/2020', '1/8/2020', '1/9/2020', '1/10/2020', '1/11/2020', '1/12/2020'], 'Fault_Count' : [0, 0, 0, 2, 2, 2, 4, 4, 5, 5, 0, 1]}
df = pd.DataFrame(d)
df['Date'] = pd.to_datetime(df['Date'])
df['Fault_count_diff'] = df.Fault_Count.diff().fillna(0)
df['Cumlative_Sum'] = df.Fault_count_diff.cumsum()
I thought I could use the cumulative sum and a groupby to form the groups and then take the differences of the first value of each group. That's as far as I could get; I also noticed that the cumulative sum was not giving me properly ordered groups, since the Fault_Count sometimes resets.
Date Fault_Count Fault_count_diff Cumlative_Sum
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 2.0 2.0
4 2020-01-05 2 0.0 2.0
5 2020-01-06 2 0.0 2.0
6 2020-01-07 4 2.0 4.0
7 2020-01-08 4 0.0 4.0
8 2020-01-09 5 1.0 5.0
9 2020-01-10 5 0.0 5.0
10 2020-01-11 0 -5.0 0.0
11 2020-01-12 1 1.0 1.0
Desired output:
Date Fault_Count Option1 Option2
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 3.0 3.0
4 2020-01-05 2 0.0 0.0
5 2020-01-06 2 0.0 0.0
6 2020-01-07 4 3.0 3.0
7 2020-01-08 4 0.0 0.0
8 2020-01-09 5 2.0 2.0
9 2020-01-10 5 0.0 0.0
10 2020-01-11 0 2.0 -2.0
11 2020-01-12 1 1.0 1.0
Thanks for the help.
Use:
m1 = df['Fault_Count'].ne(df['Fault_Count'].shift(fill_value=0))
m2 = df['Fault_Count'].eq(0) & df['Fault_Count'].shift(fill_value=0).ne(0)
s = df['Date'].groupby(m1.cumsum()).transform('first')
df['Option1'] = df['Date'].sub(s.shift()).dt.days.where(m1, 0)
df['Option2'] = df['Option1'].where(~m2, df['Option1'].mul(-1))
Details:
Use Series.ne + Series.shift to create a boolean mask m1 which represents the boundary condition where Fault_Count changes; similarly, use Series.eq + Series.shift and Series.ne to create a boolean mask m2 which represents the condition where Fault_Count resets:
m1 m2
0 False False
1 False False
2 False False
3 True False
4 False False
5 False False
6 True False
7 False False
8 True False
9 False False
10 True True # --> Fault count reset
11 True False
Use Series.groupby on consecutive fault counts obtained using m1.cumsum and transform the Date column using groupby.first:
print(s)
0 2020-01-01
1 2020-01-01
2 2020-01-01
3 2020-01-04
4 2020-01-04
5 2020-01-04
6 2020-01-07
7 2020-01-07
8 2020-01-09
9 2020-01-09
10 2020-01-11
11 2020-01-12
Name: Date, dtype: datetime64[ns]
Use Series.sub to subtract the shifted s (via Series.shift) from Date, and use Series.where to fill 0 where mask m1 is False, assigning this to Option1. Similarly, we obtain Option2 from Option1 based on mask m2 (negating the value where the fault count resets):
print(df)
Date Fault_Count Option1 Option2
0 2020-01-01 0 0.0 0.0
1 2020-01-02 0 0.0 0.0
2 2020-01-03 0 0.0 0.0
3 2020-01-04 2 3.0 3.0
4 2020-01-05 2 0.0 0.0
5 2020-01-06 2 0.0 0.0
6 2020-01-07 4 3.0 3.0
7 2020-01-08 4 0.0 0.0
8 2020-01-09 5 2.0 2.0
9 2020-01-10 5 0.0 0.0
10 2020-01-11 0 2.0 -2.0
11 2020-01-12 1 1.0 1.0
Instead of df['Fault_count_diff'] = ... and the next line, do:
df['cycle'] = (df.Fault_Count.diff() < 0).cumsum()
Then, to get the number of days between each count change:
Option1. If all calendar dates are present in df:
ndays = df.groupby(['cycle', 'Fault_Count']).Date.size()
Option2. If there's the possibility of a date not showing up in df and you still want to get the calendar days between incidents:
ndays = df.groupby(['cycle', 'Fault_Count']).Date.min().diff().dropna()
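For reference, a rough sketch of my own (reusing the sample frame from the question) showing what the cycle column and the two ndays variants look like:
import pandas as pd

d = {'Date': ['1/1/2020', '1/2/2020', '1/3/2020', '1/4/2020', '1/5/2020', '1/6/2020',
              '1/7/2020', '1/8/2020', '1/9/2020', '1/10/2020', '1/11/2020', '1/12/2020'],
     'Fault_Count': [0, 0, 0, 2, 2, 2, 4, 4, 5, 5, 0, 1]}
df = pd.DataFrame(d)
df['Date'] = pd.to_datetime(df['Date'])

# a new cycle starts whenever the fault count drops (i.e. resets)
df['cycle'] = (df.Fault_Count.diff() < 0).cumsum()

# Option1: number of rows (calendar days, if every date is present) per count level
print(df.groupby(['cycle', 'Fault_Count']).Date.size())

# Option2: gap between the first dates of consecutive count levels
print(df.groupby(['cycle', 'Fault_Count']).Date.min().diff().dropna())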
I'm trying to use featuretools to calculate time-series features. Specifically, I'd like to take the difference between the current x and the previous x within a group key (user_id), but I'm having trouble adding this kind of relationship to the entityset.
df = pd.DataFrame({
    "user_id": [i % 2 for i in range(0, 6)],
    'x': range(0, 6),
    'time': pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00',
                            '2014-1-1 06:00', '2014-1-1 08:00',
                            '2014-1-1 10:00', '2014-1-1 12:00'])
})
print(df.to_string())
user_id x time
0 0 0 2014-01-01 04:00:00
1 1 1 2014-01-01 05:00:00
2 0 2 2014-01-01 06:00:00
3 1 3 2014-01-01 08:00:00
4 0 4 2014-01-01 10:00:00
5 1 5 2014-01-01 12:00:00
es = ft.EntitySet(id='test')
es.entity_from_dataframe(entity_id='data', dataframe=df,
                         variable_types={
                             'user_id': ft.variable_types.Categorical,
                             'x': ft.variable_types.Numeric,
                             'time': ft.variable_types.Datetime
                         },
                         make_index=True, index='index',
                         time_index='time')
I then try to invoke dfs, but I can't get the relationship right...
fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    trans_primitives=["diff"]
)
print(fm.to_string())
user_id x DIFF(x)
index
0 0 0 NaN
1 1 1 1.0
2 0 2 1.0
3 1 3 1.0
4 0 4 1.0
5 1 5 1.0
But what I'd actually want to get is the difference by user. That is, from the last value for each user:
user_id x DIFF(x)
index
0 0 0 NaN
1 1 1 NaN
2 0 2 2.0
3 1 3 2.0
4 0 4 2.0
5 1 5 2.0
How do I set up this kind of relationship in featuretools? I've tried several tutorials, but to no avail. I'm stumped.
Thanks!
Thanks for the question. You can get the expected output by normalizing an entity for users and applying a group by transform primitive. I'll go through a quick example using this data.
user_id x time
0 0 2014-01-01 04:00:00
1 1 2014-01-01 05:00:00
0 2 2014-01-01 06:00:00
1 3 2014-01-01 08:00:00
0 4 2014-01-01 10:00:00
1 5 2014-01-01 12:00:00
First, create the entity set and normalize an entity for the users.
es = ft.EntitySet(id='test')
es.entity_from_dataframe(
    dataframe=df,
    entity_id='data',
    make_index=True,
    index='index',
    time_index='time',
)
es.normalize_entity(
    base_entity_id='data',
    new_entity_id='users',
    index='user_id',
)
Then, apply the group by transform primitive in DFS.
fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    groupby_trans_primitives=["diff"],
)
fm.filter(regex="DIFF", axis=1)
You should get the difference by user.
DIFF(x) by user_id
index
0 NaN
1 NaN
2 2.0
3 2.0
4 2.0
5 2.0
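As a cross-check (my own sketch, not part of the Featuretools answer), the same per-user difference can be computed directly in pandas on the sample frame:
import pandas as pd

df = pd.DataFrame({
    "user_id": [i % 2 for i in range(0, 6)],
    'x': range(0, 6),
    'time': pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00',
                            '2014-1-1 06:00', '2014-1-1 08:00',
                            '2014-1-1 10:00', '2014-1-1 12:00'])
})

# difference from the previous x value within each user_id group
print(df.groupby('user_id')['x'].diff())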
Say I have the following DataFrame which has a 0/1 entry depending on whether something happened/didn't happen within a certain month.
Y = [0,0,1,1,0,0,0,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))
df = pd.DataFrame({'R': Y},index = X)
R
2010-01-01 0
2010-02-01 0
2010-03-01 1
2010-04-01 1
2010-05-01 0
2010-06-01 0
2010-07-01 0
2010-08-01 0
2010-09-01 1
2010-10-01 1
2010-11-01 1
What I want is to create a 2nd column that lists the # of months until the next occurrence of a 1.
That is, I need:
R F
2010-01-01 0 2
2010-02-01 0 1
2010-03-01 1 0
2010-04-01 1 0
2010-05-01 0 4
2010-06-01 0 3
2010-07-01 0 2
2010-08-01 0 1
2010-09-01 1 0
2010-10-01 1 0
2010-11-01 1 0
What I've tried: I haven't gotten far, but I'm able to fill in the first stretch:
A = list(df.index)
T = df[df['R']==1]
a = df.index[0]
b = T.index[0]
c = A.index(b) - A.index(a)
df.loc[a:b, 'F'] = np.linspace(c,0,c+1)
R F
2010-01-01 0 2.0
2010-02-01 0 1.0
2010-03-01 1 0.0
2010-04-01 1 NaN
2010-05-01 0 NaN
2010-06-01 0 NaN
2010-07-01 0 NaN
2010-08-01 0 NaN
2010-09-01 1 NaN
2010-10-01 1 NaN
2010-11-01 1 NaN
EDIT: It probably would have been better to provide an original example that spanned multiple years.
Y = [0,0,1,1,0,0,0,0,1,1,1,0,0,1,1,1,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))
df = pd.DataFrame({'R': Y},index = X)
Here is my way
s = df.R.cumsum()
df.loc[df.R == 0, 'F'] = s.groupby(s).cumcount(ascending=False) + 1
df.F.fillna(0, inplace=True)
df
Out[12]:
R F
2010-01-01 0 2.0
2010-02-01 0 1.0
2010-03-01 1 0.0
2010-04-01 1 0.0
2010-05-01 0 4.0
2010-06-01 0 3.0
2010-07-01 0 2.0
2010-08-01 0 1.0
2010-09-01 1 0.0
2010-10-01 1 0.0
2010-11-01 1 0.0
Create a series containing your dates, mask this series when your R series is not equal to 1, bfill, and subtract!
u = df.index.to_series()
ii = u.where(df.R.eq(1)).bfill()
12 * (ii.dt.year - u.dt.year) + (ii.dt.month - u.dt.month)
2010-01-01 2
2010-02-01 1
2010-03-01 0
2010-04-01 0
2010-05-01 4
2010-06-01 3
2010-07-01 2
2010-08-01 1
2010-09-01 0
2010-10-01 0
2010-11-01 0
Freq: MS, dtype: int64
Here is a way that worked for me, not as elegant as @user3483203's but it does the job.
df = df.reset_index()  # use an integer index so j + 1 steps to the next row
df['F'] = 0
for i in df.index:
    j = i
    while df.loc[j, 'R'] == 0:
        df.loc[i, 'F'] = df.loc[i, 'F'] + 1
        j = j + 1
df
################
Out[39]:
index R F
0 2010-01-01 0 2
1 2010-02-01 0 1
2 2010-03-01 1 0
3 2010-04-01 1 0
4 2010-05-01 0 4
5 2010-06-01 0 3
6 2010-07-01 0 2
7 2010-08-01 0 1
8 2010-09-01 1 0
9 2010-10-01 1 0
10 2010-11-01 1 0
My take
s = (df.R.diff().ne(0) | df.R.eq(1)).cumsum()
s.groupby(s).transform(lambda s: np.arange(len(s),0,-1) if len(s)>1 else 0)
2010-01-01 2
2010-02-01 1
2010-03-01 0
2010-04-01 0
2010-05-01 4
2010-06-01 3
2010-07-01 2
2010-08-01 1
2010-09-01 0
2010-10-01 0
2010-11-01 0
Freq: MS, Name: R, dtype: int64
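To attach this as the F column from the expected output, a small sketch of my own (assuming the df, np, and pd from the question):
# assign the countdown back to the frame as an integer column
s = (df.R.diff().ne(0) | df.R.eq(1)).cumsum()
df['F'] = s.groupby(s).transform(
    lambda g: np.arange(len(g), 0, -1) if len(g) > 1 else 0
).astype(int)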