Python featuretools difference by data group

I'm trying to use featuretools to calculate time-series functions. Specifically, I'd like to compute the difference between the current x and the previous x per group key (user_id), but I'm having trouble adding this kind of relationship to the entityset.
import pandas as pd
import featuretools as ft

df = pd.DataFrame({
    "user_id": [i % 2 for i in range(0, 6)],
    'x': range(0, 6),
    'time': pd.to_datetime(['2014-1-1 04:00', '2014-1-1 05:00',
                            '2014-1-1 06:00', '2014-1-1 08:00',
                            '2014-1-1 10:00', '2014-1-1 12:00'])
})
print(df.to_string())
   user_id  x                time
0        0  0 2014-01-01 04:00:00
1        1  1 2014-01-01 05:00:00
2        0  2 2014-01-01 06:00:00
3        1  3 2014-01-01 08:00:00
4        0  4 2014-01-01 10:00:00
5        1  5 2014-01-01 12:00:00
es = ft.EntitySet(id='test')
es.entity_from_dataframe(entity_id='data', dataframe=df,
                         variable_types={
                             'user_id': ft.variable_types.Categorical,
                             'x': ft.variable_types.Numeric,
                             'time': ft.variable_types.Datetime
                         },
                         make_index=True, index='index',
                         time_index='time'
                         )
I then try to invoke dfs, but I can't get the relationship right...
fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    trans_primitives=["diff"]
)
print(fm.to_string())
       user_id  x  DIFF(x)
index
0            0  0      NaN
1            1  1      1.0
2            0  2      1.0
3            1  3      1.0
4            0  4      1.0
5            1  5      1.0
But what I'd actually want to get is the difference by user, i.e. the difference from the previous value for the same user:
       user_id  x  DIFF(x)
index
0            0  0      NaN
1            1  1      NaN
2            0  2      2.0
3            1  3      2.0
4            0  4      2.0
5            1  5      2.0
How do I get this kind of relationship in featuretools? I've tried several tutorials, but to no avail. I'm stumped.
Thanks!

Thanks for the question. You can get the expected output by normalizing an entity for users and applying a group by transform primitive. I'll go through a quick example using this data.
 user_id  x                time
       0  0 2014-01-01 04:00:00
       1  1 2014-01-01 05:00:00
       0  2 2014-01-01 06:00:00
       1  3 2014-01-01 08:00:00
       0  4 2014-01-01 10:00:00
       1  5 2014-01-01 12:00:00
First, create the entity set and normalize an entity for the users.
es = ft.EntitySet(id='test')
es.entity_from_dataframe(
    dataframe=df,
    entity_id='data',
    make_index=True,
    index='index',
    time_index='time',
)
es.normalize_entity(
    base_entity_id='data',
    new_entity_id='users',
    index='user_id',
)
Then, apply the group by transform primitive in DFS.
fm, fl = ft.dfs(
    target_entity="data",
    entityset=es,
    groupby_trans_primitives=["diff"],
)
fm.filter(regex="DIFF", axis=1)
You should get the difference by user.
       DIFF(x) by user_id
index
0                     NaN
1                     NaN
2                     2.0
3                     2.0
4                     2.0
5                     2.0
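For a quick sanity check, the same per-user difference can be computed directly in pandas on the original dataframe (just a reference point, not part of the featuretools workflow):
df.groupby('user_id')['x'].diff()
# 0    NaN
# 1    NaN
# 2    2.0
# 3    2.0
# 4    2.0
# 5    2.0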

Related

Group rows by certain time period depending on other factors

What I start with is a large dataframe (more than a million entries) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
There is no specific order other than a sort by id and then datetime. The dataset is not complete (there is no data for every day, but there can be multiple entries for the same day).
Now, for each row where indicator == 1, I want to collect every row with the same id and a datetime that is at most 10 days before it. All other rows that are not in range of an indicator can be dropped. Ideally I want the result saved as a set of time series, each of which will later be used in a neural network. (There can be more than one indicator == 1 case per id; the other_values columns should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or a similar way to group into groups A, B, ....
A naive Python for-loop is not an option, since it takes ages on a dataset like this.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.
Here is a way to do it with pd.merge_asof(). Let's create our data:
import pandas as pd

data = {'id': [1, 1, 1, 1, 1, 1, 1],
        'datetime': ['2020-01-14 00:12:00',
                     '2020-01-17 00:23:00',
                     '2020-01-17 00:13:00',
                     '2020-01-20 00:05:00',
                     '2020-03-10 00:07:00',
                     '2020-05-19 00:00:00',
                     '2020-05-20 00:00:00'],
        'ind': [0, 1, 0, 0, 0, 0, 1]
        }
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()  # midnight of the same day
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
Now let's join them using pd.merge_asof() with direction=forward and a tolerance of 10 days. This will join all data up to 10 days looking forward.
df = pd.merge_asof(df.drop('ind', axis=1), df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating groups. A new group should start at each row that has a value in ind2 (i.e. it falls within 10 days of an indicator) and for which any of the following holds:
The previous value of ind2 is NaN
The previous row belongs to a different id
The current date is more than 10 days after the previous date
With these rules we can create a Boolean that flags the start of each group, which we then cumulatively sum to create our group ids.
df['group_id'] = (df['ind2'].eq(1)
                  & (df['ind2'].shift().isna()                                   # previous row has no ind2
                     | (df['id'].shift() != df['id'])                            # previous row is a different id
                     | (df['date'] - df['date'].shift() > pd.Timedelta('10d')))  # gap of more than 10 days
                  ).cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we need to drop all the NaNs from ind2, remove date and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
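If you then want each 10-day window as its own time series for the neural network step, one simple follow-up on the final frame above is:
windows = [g.drop(columns='group_id') for _, g in df.groupby(['id', 'group_id'])]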
I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "datetime": [
        '2020-01-14 00:12:00',
        '2020-01-17 00:23:00',
        '2021-02-01 00:00:00',
        '2020-01-15 00:05:00',
        '2020-03-10 00:07:00',
        '2021-05-22 00:00:00',
    ],
    "indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)

timedelta = datetime.timedelta(days=10)

def consolidate(grp):
    # label every row that falls in the 10 days leading up to an indicator == 1 row
    grp['Group'] = None
    for time in grp[grp.indicator == 1]['datetime']:
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])

df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp[grp.indicator == 1]['datetime'].iloc[0]
    grp = grp[grp['datetime'].between(time - timedelta, time)]
    grp['Group'] = uuid.uuid4()
    return grp
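As before, apply it with df.groupby('id').apply(consolidate).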

Pandas: Creating multiple indicator columns after condition with dates

So I have a data set with about 70,000 data points, and I'm trying to test out some code on a sample data set to make sure it will work on the large one. The sample data set follows this format:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'cond': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
             'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
    'time': ['2009-07-09 15:00:00',
             '2009-07-09 18:33:00',
             '2009-07-09 20:55:00',
             '2009-07-10 00:01:00',
             '2009-07-10 09:00:00',
             '2009-07-10 15:00:00',
             '2009-07-10 18:00:00',
             '2009-07-11 00:01:00',
             '2009-07-12 03:10:00',
             '2009-07-09 06:00:00',
             '2009-07-10 15:00:00',
             '2009-07-11 18:00:00',
             '2009-07-11 21:00:00',
             '2009-07-12 00:30:00',
             '2009-07-12 12:05:00',
             '2009-07-12 15:00:00',
             '2009-07-13 21:00:00',
             '2009-07-14 00:01:00'],
    'Score': [0.0, 1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 1.0,
              0.0, -1.0, 0.0, 1.0, 0.0, 0.0, -1.0, 0.0, 0.0],
})
print(df)
I'm essentially trying to create two indicator columns. The first indicator column follows the rule that, for each condition (A and B), once there is a score of -1, that row and every subsequent row for that condition should be marked "1". The second indicator column should indicate for each row whether at least 24 hours have passed since the last score of -1. Thus the final result should look something like:
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
This is in a similar realm to the question I asked yesterday about Indicator 1, but I realized that because my large data set has so many conditions (700+), it isn't feasible to write out all the cond values individually, so I need help applying the Indicator 1 solution in that setting. For Indicator 2, I was working on a rolling window function, but all the rolling window examples I saw compute rolling sums or rolling means, which is not what I'm trying to compute here, so I'm unsure whether what I want is possible with a rolling window.
Try:
#convert to datetime if needed
df["time"] = pd.to_datetime(df["time"])
#get the first time the score is -1 for each ID
first = df["cond"].map(df[df["Score"].eq(-1)].groupby("cond")["time"].min())
#get the most recent time that the score is -1
recent = df.loc[df["Score"].eq(-1), "time"].reindex(df.index, method="ffill")
#check that the time is greater than the first -1
df["Indicator 1"] = df["time"].ge(first).astype(int)
#check that at least 1 day has passed since the most recent -1
df["Indicator 2"] = df["time"].sub(recent).dt.days.ge(1).astype(int)
>>> df
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 1
17 B 2009-07-14 00:01:00 0.0 1 1
A simple approach IMO, using cummax for the first indicator, and a diff from the first value per group combined with a mask for the second:
# indicator 1
df['Indicator 1'] = df['Score'].eq(-1).astype(int).groupby(df['cond']).cummax()
# indicator 2
# convert to datetime
df['time'] = pd.to_datetime(df['time'])
# groups starting by -1
m1 = df['Score'].eq(-1).groupby(df['cond']).cumsum()
# is the time difference greater than 24h since the group start
m2 = df.groupby(['cond', m1])['time'].apply(lambda s: s.sub(s.iloc[0]).gt('24h'))
df['Indicator 2'] = (m1.ne(0) & m2).astype(int)
Output:
cond time Score Indicator 1 Indicator 2
0 A 2009-07-09 15:00:00 0.0 0 0
1 A 2009-07-09 18:33:00 1.0 0 0
2 A 2009-07-09 20:55:00 0.0 0 0
3 A 2009-07-10 00:01:00 0.0 0 0
4 A 2009-07-10 09:00:00 0.0 0 0
5 A 2009-07-10 15:00:00 -1.0 1 0
6 A 2009-07-10 18:00:00 0.0 1 0
7 A 2009-07-11 00:01:00 0.0 1 0
8 A 2009-07-12 03:10:00 1.0 1 1
9 B 2009-07-09 06:00:00 0.0 0 0
10 B 2009-07-10 15:00:00 -1.0 1 0
11 B 2009-07-11 18:00:00 0.0 1 1
12 B 2009-07-11 21:00:00 1.0 1 1
13 B 2009-07-12 00:30:00 0.0 1 1
14 B 2009-07-12 12:05:00 0.0 1 1
15 B 2009-07-12 15:00:00 -1.0 1 0
16 B 2009-07-13 21:00:00 0.0 1 0
17 B 2009-07-14 00:01:00 0.0 1 0

Python Pandas: Trying to speed-up a per row per date in date_range operation

I have a dataframe of the following form where each row corresponds to a job run on a machine:
import pandas as pd
df = pd.DataFrame({
    'MachineID': [4, 3, 2, 2, 1, 1, 5, 3],
    'JobStartDate': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-01',
                     '2020-01-02', '2020-01-03', '2020-01-01', '2020-01-03'],
    'JobEndDate': ['2020-01-03', '2020-01-03', '2020-01-04', '2020-01-02',
                   '2020-01-04', '2020-01-05', '2020-01-02', '2020-01-04'],
    'IsTypeAJob': [1, 1, 0, 1, 0, 0, 1, 1]
})
df
>>> MachineID JobStartDate JobEndDate IsTypeAJob
0 4 2020-01-01 2020-01-03 1
1 3 2020-01-01 2020-01-03 1
2 2 2020-01-01 2020-01-04 0
3 2 2020-01-01 2020-01-02 1
4 1 2020-01-02 2020-01-04 0
5 1 2020-01-03 2020-01-05 0
6 5 2020-01-01 2020-01-02 1
7 3 2020-01-03 2020-01-04 1
In my data there are two types of jobs that can be run on a machine, either type A or type B. My goal is to count the number of type A and type B jobs per machine per day. Thus the desired result would look something like
MachineID Date TypeAJobs TypeBJobs
0 1 2020-01-02 0 1
1 1 2020-01-03 0 2
2 1 2020-01-04 0 2
3 1 2020-01-05 0 1
4 2 2020-01-01 1 1
5 2 2020-01-02 1 1
6 2 2020-01-03 0 1
7 2 2020-01-04 0 1
8 3 2020-01-01 1 0
9 3 2020-01-02 1 0
10 3 2020-01-03 2 0
11 3 2020-01-04 1 0
12 4 2020-01-01 1 0
13 4 2020-01-02 1 0
14 4 2020-01-03 1 0
15 5 2020-01-01 1 0
16 5 2020-01-02 1 0
I have tried approaches found here and here with a resample() and apply() method, but the computing time is too slow. This has to do with the fact that some date ranges span multiple years in my set, meaning one row can blow up into 2000+ new rows during resampling (my data contains around a million rows to begin with). Thus something like creating a new machine/date row for each date in the range of a certain job is too slow (with the goal of doing a groupby(['MachineID', 'Date']).sum() at the end).
I am currently thinking about a new approach where I begin by grouping by MachineID, then finding the earliest job start date and latest job end date for that machine. Then I could create a date range of days between these two dates (incrementing by day) which I would use to index a new per-machine data frame. Then for each job for that MachineID I could add 1 over the job's range of dates, i.e. in pseudocode:
df['TypeAJobs'][row['JobStartDate']:row['JobEndDate']] += 1 if it is a type A job or
df['TypeBJobs'][row['JobStartDate']:row['JobEndDate']] += 1 otherwise.
This seems like it would avoid creating a bunch of extra rows for each job as now we are creating extra rows for each machine. Furthermore, the addition operations seem like they would be fast since we are adding to an entire slice of a series at once. However, I don't know if something like this (indexing by date) is possible in Pandas. Maybe there is some conversion that can be done first? After doing the above, ideally I would have a number of data frames similar to the desired result but only with one MachineID, then I would concatenate these data frames to get the result.
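Roughly what I have in mind is something like the sketch below (using the column names from above; this is just to illustrate the idea, I don't know whether it is a sensible way to do it):
def per_machine_counts(jobs):
    # jobs: all rows for a single MachineID
    start = pd.to_datetime(jobs['JobStartDate']).min()
    end = pd.to_datetime(jobs['JobEndDate']).max()
    out = pd.DataFrame(0, index=pd.date_range(start, end, freq='D'),
                       columns=['TypeAJobs', 'TypeBJobs'])
    for _, row in jobs.iterrows():
        col = 'TypeAJobs' if row['IsTypeAJob'] == 1 else 'TypeBJobs'
        # .loc date slicing on a DatetimeIndex includes both endpoints
        out.loc[row['JobStartDate']:row['JobEndDate'], col] += 1
    return out

per_machine = {machine_id: per_machine_counts(grp) for machine_id, grp in df.groupby('MachineID')}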
I would love to hear any suggestions about the feasibility/effectiveness of this approach or another potential algorithm. Thanks so much for reading!
IIUC, try using pd.date_range and explode to create 'daily' rows, then groupby dates and IsTypeAJob and rename columns:
df_out = df.assign(JobDates=df.apply(lambda x: pd.date_range(x['JobStartDate'],
                                                             x['JobEndDate'],
                                                             freq='D'),
                                     axis=1))\
           .explode('JobDates')

df_out = df_out.groupby([df_out['MachineID'],
                         df_out['JobDates'].dt.floor('D'),
                         'IsTypeAJob'])['MachineID'].count()\
               .unstack()\
               .rename(columns={0: 'TypeBJobs', 1: 'TypeAJobs'})\
               .fillna(0).reset_index()
df_out
Output:
IsTypeAJob MachineID JobDates TypeBJobs TypeAJobs
0 1 2020-01-02 1.0 0.0
1 1 2020-01-03 2.0 0.0
2 1 2020-01-04 2.0 0.0
3 1 2020-01-05 1.0 0.0
4 2 2020-01-01 1.0 1.0
5 2 2020-01-02 1.0 1.0
6 2 2020-01-03 1.0 0.0
7 2 2020-01-04 1.0 0.0
8 3 2020-01-01 0.0 1.0
9 3 2020-01-02 0.0 1.0
10 3 2020-01-03 0.0 2.0
11 3 2020-01-04 0.0 1.0
12 4 2020-01-01 0.0 1.0
13 4 2020-01-02 0.0 1.0
14 4 2020-01-03 0.0 1.0
15 5 2020-01-01 0.0 1.0
16 5 2020-01-02 0.0 1.0
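An alternative way to build the same 'daily' rows is a list comprehension with pd.concat (it iterates row by row, so it is likely slower on a large frame):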
pd.concat([pd.DataFrame({'JobDates': pd.date_range(r.JobStartDate, r.JobEndDate, freq='D'),
                         'MachineID': r.MachineID,
                         'IsTypeAJob': r.IsTypeAJob}) for i, r in df.iterrows()])
Here is another way to do the job. The idea is similar to using str.get_dummies on both the start and end columns, but done with array broadcasting. Use cumsum to get 1 between the start and end dates and 0 otherwise. Build a dataframe with the dates as columns and (MachineID, IsTypeAJob) as the index, then do a similar reshaping operation to the answer from @Scott Boston to get the expected output shape.
# get all possible dates
dr = pd.date_range(df['JobStartDate'].min(),
                   df['JobEndDate'].max()).strftime("%Y-%m-%d").to_numpy()

df_ = (pd.DataFrame(
           np.cumsum((df['JobStartDate'].to_numpy()[:, None] == dr).astype(int)
                     - np.pad(df['JobEndDate'].to_numpy()[:, None] == dr,
                              ((0, 0), (1, 0)),  # pad is equivalent to a shift along columns
                              mode='constant')[:, :-1],
                     axis=1),
           index=pd.MultiIndex.from_frame(df[['MachineID', 'IsTypeAJob']]),
           columns=dr,)
       .sum(level=['MachineID', 'IsTypeAJob'])  # equivalent to groupby(['MachineID', 'IsTypeAJob']).sum()
       .replace(0, np.nan)  # to remove extra dates per original row during the stack
       .stack()
       .unstack(level='IsTypeAJob', fill_value=0)
       .astype(int)
       .reset_index()
       .rename_axis(columns=None)
       .rename(columns={'level_1': 'Date', 0: 'TypeBJobs', 1: 'TypeAJobs'})
)
and you get
MachineID Date TypeBJobs TypeAJobs
0 1 2020-01-02 1 0
1 1 2020-01-03 2 0
2 1 2020-01-04 2 0
3 1 2020-01-05 1 0
4 2 2020-01-01 1 1
5 2 2020-01-02 1 1
6 2 2020-01-03 1 0
7 2 2020-01-04 1 0
8 3 2020-01-01 0 1
9 3 2020-01-02 0 1
10 3 2020-01-03 0 2
11 3 2020-01-04 0 1
12 4 2020-01-01 0 1
13 4 2020-01-02 0 1
14 4 2020-01-03 0 1
15 5 2020-01-01 0 1
16 5 2020-01-02 0 1
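To see what the cumsum is doing for a single row, here is the pattern for one job running 2020-01-01 to 2020-01-03 on a five-day grid (a toy illustration, not part of the code above):
import numpy as np
start = np.array([1, 0, 0, 0, 0])        # JobStartDate == dr
end_shifted = np.array([0, 0, 0, 1, 0])  # JobEndDate == dr, shifted one column right by the pad
np.cumsum(start - end_shifted)           # array([1, 1, 1, 0, 0]) -> the job is active on 01-01 .. 01-03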

Pandas: fill one column with count of # of obs between occurrences in a 2nd column

Say I have the following DataFrame which has a 0/1 entry depending on whether something happened/didn't happen within a certain month.
Y = [0,0,1,1,0,0,0,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))
df = pd.DataFrame({'R': Y},index = X)
R
2010-01-01 0
2010-02-01 0
2010-03-01 1
2010-04-01 1
2010-05-01 0
2010-06-01 0
2010-07-01 0
2010-08-01 0
2010-09-01 1
2010-10-01 1
2010-11-01 1
What I want is to create a 2nd column that lists the # of months until the next occurrence of a 1.
That is, I need:
R F
2010-01-01 0 2
2010-02-01 0 1
2010-03-01 1 0
2010-04-01 1 0
2010-05-01 0 4
2010-06-01 0 3
2010-07-01 0 2
2010-08-01 0 1
2010-09-01 1 0
2010-10-01 1 0
2010-11-01 1 0
What I've tried: I haven't gotten far, but I'm able to fill the first bit
A = list(df.index)
T = df[df['R']==1]
a = df.index[0]
b = T.index[0]
c = A.index(b) - A.index(a)
df.loc[a:b, 'F'] = np.linspace(c,0,c+1)
R F
2010-01-01 0 2.0
2010-02-01 0 1.0
2010-03-01 1 0.0
2010-04-01 1 NaN
2010-05-01 0 NaN
2010-06-01 0 NaN
2010-07-01 0 NaN
2010-08-01 0 NaN
2010-09-01 1 NaN
2010-10-01 1 NaN
2010-11-01 1 NaN
EDIT: It probably would have been better to provide an original example that spanned multiple years.
Y = [0,0,1,1,0,0,0,0,1,1,1,0,0,1,1,1,0,1,1,1]
X = pd.date_range(start = "2010", freq = "MS", periods = len(Y))
df = pd.DataFrame({'R': Y},index = X)
Here is my way
# cumulative count of 1s: each run of zeros shares a group with the 1 that precedes it
s = df.R.cumsum()
# within each group, number the rows backwards; +1 so the row just before the next 1 gets 1
df.loc[df.R == 0, 'F'] = s.groupby(s).cumcount(ascending=False) + 1
df.F.fillna(0, inplace=True)
df
Out[12]:
R F
2010-01-01 0 2.0
2010-02-01 0 1.0
2010-03-01 1 0.0
2010-04-01 1 0.0
2010-05-01 0 4.0
2010-06-01 0 3.0
2010-07-01 0 2.0
2010-08-01 0 1.0
2010-09-01 1 0.0
2010-10-01 1 0.0
2010-11-01 1 0.0
Create a series containing your dates, mask this series when your R series is not equal to 1, bfill, and subtract!
u = df.index.to_series()
ii = u.where(df.R.eq(1)).bfill()
12 * (ii.dt.year - u.dt.year) + (ii.dt.month - u.dt.month)
2010-01-01 2
2010-02-01 1
2010-03-01 0
2010-04-01 0
2010-05-01 4
2010-06-01 3
2010-07-01 2
2010-08-01 1
2010-09-01 0
2010-10-01 0
2010-11-01 0
Freq: MS, dtype: int64
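To attach it to the frame, assign it back; since the year difference is included, this also covers the multi-year example from the edit:
df['F'] = 12 * (ii.dt.year - u.dt.year) + (ii.dt.month - u.dt.month)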
Here is a way that worked for me; it's not as elegant as @user3483203's, but it does the job.
df = df.reset_index()  # work on an integer index so j + 1 moves to the next row
df['F'] = 0
for i in df.index:
    j = i
    # walk forward until we hit the next 1, counting the steps
    while df.loc[j, 'R'] == 0:
        df.loc[i, 'F'] = df.loc[i, 'F'] + 1
        j = j + 1
df
################
Out[39]:
index R F
0 2010-01-01 0 2
1 2010-02-01 0 1
2 2010-03-01 1 0
3 2010-04-01 1 0
4 2010-05-01 0 4
5 2010-06-01 0 3
6 2010-07-01 0 2
7 2010-08-01 0 1
8 2010-09-01 1 0
9 2010-10-01 1 0
10 2010-11-01 1 0
My take
import numpy as np

# new group at every change in R and at every 1 (so each run of zeros groups together)
s = (df.R.diff().ne(0) | df.R.eq(1)).cumsum()
# count down within each multi-row group; singleton groups get 0
s.groupby(s).transform(lambda g: np.arange(len(g), 0, -1) if len(g) > 1 else 0)
2010-01-01 2
2010-02-01 1
2010-03-01 0
2010-04-01 0
2010-05-01 4
2010-06-01 3
2010-07-01 2
2010-08-01 1
2010-09-01 0
2010-10-01 0
2010-11-01 0
Freq: MS, Name: R, dtype: int64

How to use a loop to count the number of NaN

There are a lot of stations in the csv file, and I don't know how to use a loop to count the number of NaN values for every station. Here is what I have so far, counting them one station at a time. Can someone help me please? Thank you in advance.
station1= train_df[train_df['station'] == 28079004]
station1 = station1[['date', 'O_3']]
count_nan = len(station1) - station1.count()
print(count_nan)
I think you need to create an index from the station column with set_index, filter the columns to check for missing values, and finally count them with sum:
train_df = pd.DataFrame({'B': [4, 5, 4, 5, 5, 4],
                         'C': [7, 8, 9, 4, 2, 3],
                         'date': pd.date_range('2015-01-01', periods=6),
                         'O_3': [np.nan, 3, np.nan, 9, 2, np.nan],
                         'station': [28079004] * 2 + [28079005] * 4})
print (train_df)
B C date O_3 station
0 4 7 2015-01-01 NaN 28079004
1 5 8 2015-01-02 3.0 28079004
2 4 9 2015-01-03 NaN 28079005
3 5 4 2015-01-04 9.0 28079005
4 5 2 2015-01-05 2.0 28079005
5 4 3 2015-01-06 NaN 28079005
df = train_df.set_index('station')[['date', 'O_3']].isnull().sum(level=0).astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
Another solution:
df = train_df[['date', 'O_3']].isnull().groupby(train_df['station']).sum().astype(int)
print (df)
date O_3
station
28079004 0 1
28079005 0 2
Although jez already answered and that answer is probably better here, this is how a groupby approach would look:
import pandas as pd
import numpy as np
np.random.seed(444)
n = 10
train_df = pd.DataFrame({
    'station': np.random.choice(np.arange(28079004, 28079008), size=n),
    'date': pd.date_range('2018-01-01', periods=n),
    'O_3': np.random.choice([np.nan, 1], size=n)
})
print(train_df)
s = train_df.groupby('station')['O_3'].apply(lambda x: x.isna().sum())
print(s)
prints:
station date O_3
0 28079007 2018-01-01 NaN
1 28079004 2018-01-02 1.0
2 28079007 2018-01-03 NaN
3 28079004 2018-01-04 NaN
4 28079007 2018-01-05 NaN
5 28079004 2018-01-06 1.0
6 28079007 2018-01-07 NaN
7 28079004 2018-01-08 NaN
8 28079006 2018-01-09 NaN
9 28079007 2018-01-10 1.0
And the output (s):
station
28079004 2
28079006 1
28079007 4
