Excluding particular columns from being resampled using pandas resample - python

The following is a simplification of the problem at hand.
I have a dataframe containing three columns, the date a state began, the state itself, and a flag field. It looks similar to this:
df = pd.DataFrame(
    {'begin': pd.to_datetime(['2018-01-05', '2018-07-11', '2018-11-14', '2019-02-19']),
     'state': [1, 2, 3, 4],
     'started': [1, 0, 0, 0]}
)
df
begin state started
0 2018-01-05 1 1
1 2018-07-11 2 0
2 2018-11-14 3 0
3 2019-02-19 4 0
I want to resample the dates so that they have a monthly period, and I achieve this as follows:
df = df.set_index('begin', drop=False).resample('M').ffill()
df
begin state started
begin
2018-01-31 2018-01-05 1 1
2018-02-28 2018-01-05 1 1
2018-03-31 2018-01-05 1 1
2018-04-30 2018-01-05 1 1
2018-05-31 2018-01-05 1 1
2018-06-30 2018-01-05 1 1
2018-07-31 2018-07-11 2 0
2018-08-31 2018-07-11 2 0
2018-09-30 2018-07-11 2 0
2018-10-31 2018-07-11 2 0
2018-11-30 2018-11-14 3 0
2018-12-31 2018-11-14 3 0
2019-01-31 2018-11-14 3 0
2019-02-28 2019-02-19 4 0
Everything looks ok, except for the flag column (started). I need it to be 1 exactly once, at its first occurrence as in the original dataframe.
The desired output is:
begin state started
begin
2018-01-31 2018-01-05 1 1
2018-02-28 2018-01-05 1 0
2018-03-31 2018-01-05 1 0
2018-04-30 2018-01-05 1 0
2018-05-31 2018-01-05 1 0
2018-06-30 2018-01-05 1 0
2018-07-31 2018-07-11 2 0
2018-08-31 2018-07-11 2 0
2018-09-30 2018-07-11 2 0
2018-10-31 2018-07-11 2 0
2018-11-30 2018-11-14 3 0
2018-12-31 2018-11-14 3 0
2019-01-31 2018-11-14 3 0
2019-02-28 2019-02-19 4 0
Thus, for a given combination of begin and state, if started is 1, it should be 1 only at the first occurrence of that combination.
Is there an efficient way to achieve this?

If there are only 1 and 0 values in the started column, use DataFrame.duplicated with both columns specified in a list:
mask = df.duplicated(['begin','started'])
It is also possible to rewrite only the 1 values by chaining another mask:
mask = df.duplicated(['begin','started']) & df['started'].eq(1)
df.loc[mask, 'started'] = 0
Or:
df['started'] = np.where(mask, 0, df['started'])
print(df)
begin state started
begin
2018-01-31 2018-01-05 1 1
2018-02-28 2018-01-05 1 0
2018-03-31 2018-01-05 1 0
2018-04-30 2018-01-05 1 0
2018-05-31 2018-01-05 1 0
2018-06-30 2018-01-05 1 0
2018-07-31 2018-07-11 2 0
2018-08-31 2018-07-11 2 0
2018-09-30 2018-07-11 2 0
2018-10-31 2018-07-11 2 0
2018-11-30 2018-11-14 3 0
2018-12-31 2018-11-14 3 0
2019-01-31 2018-11-14 3 0
2019-02-28 2019-02-19 4 0
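Putting both steps together, a minimal end-to-end sketch of this answer (same frame and column names as the question):

```python
import pandas as pd

df = pd.DataFrame(
    {'begin': pd.to_datetime(['2018-01-05', '2018-07-11', '2018-11-14', '2019-02-19']),
     'state': [1, 2, 3, 4],
     'started': [1, 0, 0, 0]}
)

# Upsample to month-end rows, forward-filling each state
# ('M' = month end; spelled 'ME' on pandas >= 2.2)
df = df.set_index('begin', drop=False).resample('M').ffill()

# A repeated (begin, started) pair means the flag was forward-filled,
# so keep the 1 only on its first occurrence
mask = df.duplicated(['begin', 'started']) & df['started'].eq(1)
df.loc[mask, 'started'] = 0

print(df)
```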

Can you do:
df = df.set_index('begin', drop=False).resample('m').ffill()
df.loc[df['started'].duplicated(keep='first'), 'started'] = 0

Related

Group rows by certain time period depending on other factors

What I start with is a large dataframe (more than a million entries) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
There is no specific order other than a sort by id and then datetime. The dataset is not complete (there is not data for every day, but there can be multiple entries on the same day).
Now for each time where indicator==1, I want to collect every row with the same id and a datetime that is at most 10 days before. All other rows which are not in range of an indicator can be dropped. Ideally, I want it saved as a dataset of time series, each of which will later be used in a neural network. (There can be more than one indicator==1 case per id; other values should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or a similar way to group into group A, B, ... .
A naive Python for-loop is not feasible, since it takes ages for a dataset like this.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.
Here is a way to do it with pd.merge_asof(). Let's create our data:
data = {'id': [1, 1, 1, 1, 1, 1, 1],
        'datetime': ['2020-01-14 00:12:00',
                     '2020-01-17 00:23:00',
                     '2020-01-17 00:13:00',
                     '2020-01-20 00:05:00',
                     '2020-03-10 00:07:00',
                     '2020-05-19 00:00:00',
                     '2020-05-20 00:00:00'],
        'ind': [0, 1, 0, 0, 0, 0, 1]
        }
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
Now let's join them using pd.merge_asof() with direction=forward and a tolerance of 10 days. This will join all data up to 10 days looking forward.
df = pd.merge_asof(df, df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating groups. A new group starts at a row where ind2 is not NaN and any of the following holds:
The previous value of ind2 is NaN
The previous value of id is not the current value of id (we're at the first row of a new id)
The date is more than 10 days after the previous date
With these rules, we can create a Boolean mask which we can then cumulatively sum to create our groups.
df['group_id'] = (df['ind2'].notna()
                  & (df['ind2'].shift().isna()
                     | (df['id'].shift() != df['id'])
                     | (df['date'] - df['date'].shift() > pd.Timedelta('10d')))
                  ).cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we need to drop all the NaNs from ind2, remove date, and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
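For reference, the whole pipeline can be collapsed into one runnable block (keeping the ind column through the merge, as in the printed outputs):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 1, 1, 1],
                   'datetime': pd.to_datetime(['2020-01-14 00:12:00', '2020-01-17 00:23:00',
                                               '2020-01-17 00:13:00', '2020-01-20 00:05:00',
                                               '2020-03-10 00:07:00', '2020-05-19 00:00:00',
                                               '2020-05-20 00:00:00']),
                   'ind': [0, 1, 0, 0, 0, 0, 1]})

# Calendar date plus the indicator-only frame
df['date'] = df['datetime'].dt.normalize()
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)

# Attach the nearest indicator date up to 10 days ahead
df = pd.merge_asof(df, df2, by='id', on='date',
                   tolerance=pd.Timedelta('10d'), direction='forward')

# A new group starts where ind2 becomes non-NaN after a NaN,
# after an id change, or after a gap of more than 10 days
df['group_id'] = (df['ind2'].notna()
                  & (df['ind2'].shift().isna()
                     | (df['id'].shift() != df['id'])
                     | (df['date'] - df['date'].shift() > pd.Timedelta('10d')))
                  ).cumsum()

df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
print(df)
```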
I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "datetime": [
        '2020-01-14 00:12:00',
        '2020-01-17 00:23:00',
        '2021-02-01 00:00:00',
        '2020-01-15 00:05:00',
        '2020-03-10 00:07:00',
        '2021-05-22 00:00:00',
    ],
    "indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
def consolidate(grp):
    grp['Group'] = None
    for time in grp.loc[grp.indicator == 1, 'datetime']:
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])
df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp[grp.indicator == 1]['datetime'].iloc[0]
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = uuid.uuid4()
    return grp
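A self-contained version of the groupby/apply approach on the question's sample rows (group_keys=False keeps the original index; the exact uuid values will differ per run):

```python
import datetime
import uuid

import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'datetime': pd.to_datetime([
        '2020-01-14 00:12:00', '2020-01-17 00:23:00', '2021-02-01 00:00:00',
        '2020-01-15 00:05:00', '2020-03-10 00:07:00', '2021-05-22 00:00:00',
    ]),
    'indicator': [0, 1, 0, 0, 0, 1],
})

timedelta = datetime.timedelta(days=10)

def consolidate(grp):
    grp = grp.copy()          # don't mutate the original frame
    grp['Group'] = None
    for time in grp.loc[grp.indicator == 1, 'datetime']:
        window = grp['datetime'].between(time - timedelta, time)
        grp.loc[window, 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])

out = df.groupby('id', group_keys=False).apply(consolidate)
print(out)
```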

How do I use conditional logic with Datetime columns in Pandas?

I have two datetime columns - ColumnA and ColumnB. I want to create a new column - ColumnC, using conditional logic.
Originally, I created ColumnB from a YearMonth column of dates such as 201907, 201908, etc.
When ColumnA is NaN, I want to choose ColumnB.
Otherwise, I want to choose ColumnA.
Currently, my code below is causing ColumnC to have different formats. I'm not sure how to get rid of all of those 0's. I want the whole column to be YYYY-MM-DD.
ID YearMonth ColumnA ColumnB ColumnC
0 1 201712 2017-12-29 2017-12-31 2017-12-29
1 1 201801 2018-01-31 2018-01-31 2018-01-31
2 1 201802 2018-02-28 2018-02-28 2018-02-28
3 1 201806 2018-06-29 2018-06-30 2018-06-29
4 1 201807 2018-07-31 2018-07-31 2018-07-31
5 1 201808 2018-08-31 2018-08-31 2018-08-31
6 1 201809 2018-09-28 2018-09-30 2018-09-28
7 1 201810 2018-10-31 2018-10-31 2018-10-31
8 1 201811 2018-11-30 2018-11-30 2018-11-30
9 1 201812 2018-12-31 2018-12-31 2018-12-31
10 1 201803 NaN 2018-03-31 1522454400000000000
11 1 201804 NaN 2018-04-30 1525046400000000000
12 1 201805 NaN 2018-05-31 1527724800000000000
13 1 201901 NaN 2019-01-31 1548892800000000000
14 1 201902 NaN 2019-02-28 1551312000000000000
15 1 201903 NaN 2019-03-31 1553990400000000000
16 1 201904 NaN 2019-04-30 1556582400000000000
17 1 201905 NaN 2019-05-31 1559260800000000000
18 1 201906 NaN 2019-06-30 1561852800000000000
19 1 201907 NaN 2019-07-31 1564531200000000000
20 1 201908 NaN 2019-08-31 1567209600000000000
21 1 201909 NaN 2019-09-30 1569801600000000000
df['ColumnB'] = pd.to_datetime(df['YearMonth'], format='%Y%m', errors='coerce').dropna() + pd.offsets.MonthEnd(0)
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB'], format='%Y%m%d'), df['ColumnA'])
df['ColumnC'] = np.where(df['ColumnA'].isnull(),df['ColumnB'] , df['ColumnA'])
Just figured it out!
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB']), pd.to_datetime(df['ColumnA']))
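A simpler idiom worth noting: fillna makes this NaN-based choice directly and keeps the datetime64 dtype, which avoids the integer-nanosecond values entirely. A small sketch with made-up sample dates:

```python
import pandas as pd

df = pd.DataFrame({
    'ColumnA': pd.to_datetime(['2017-12-29', None, '2018-06-29']),
    'ColumnB': pd.to_datetime(['2017-12-31', '2018-03-31', '2018-06-30']),
})

# Take ColumnA where present, fall back to ColumnB where it is NaT
df['ColumnC'] = df['ColumnA'].fillna(df['ColumnB'])
print(df['ColumnC'].dt.strftime('%Y-%m-%d').tolist())
# → ['2017-12-29', '2018-03-31', '2018-06-29']
```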

Pandas: Resample from weekly to daily with offset

I want to resample this following dataframe from weekly to daily then ffill the missing values.
Note: 2018-01-07 and 2018-01-14 are Sundays.
Date Val
0 2018-01-07 1
1 2018-01-14 2
I tried:
df.Date = pd.to_datetime(df.Date)
df.set_index('Date', inplace=True)
offset = pd.offsets.DateOffset(-6)
df.resample('D', loffset=offset).ffill()
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
But I want
Date Val
0 2018-01-01 1
1 2018-01-02 1
2 2018-01-03 1
3 2018-01-04 1
4 2018-01-05 1
5 2018-01-06 1
6 2018-01-07 1
7 2018-01-08 2
8 2018-01-09 2
9 2018-01-10 2
10 2018-01-11 2
11 2018-01-12 2
12 2018-01-13 2
13 2018-01-14 2
What did I do wrong?
You can add a new last row manually by subtracting the offset from the last datetime:
df.loc[df.index[-1] - offset] = df.iloc[-1]
df = df.resample('D', loffset=offset).ffill()
print (df)
Val
Date
2018-01-01 1
2018-01-02 1
2018-01-03 1
2018-01-04 1
2018-01-05 1
2018-01-06 1
2018-01-07 1
2018-01-08 2
2018-01-09 2
2018-01-10 2
2018-01-11 2
2018-01-12 2
2018-01-13 2
2018-01-14 2
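Worth noting if you're on a recent pandas: loffset was deprecated in 1.1 and removed in 2.0. The same result can be produced without it by reindexing onto the shifted daily range and back-filling, e.g. with the question's two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Val': [1, 2]},
                  index=pd.to_datetime(['2018-01-07', '2018-01-14']))
df.index.name = 'Date'

# Daily index starting 6 days before the first week-ending Sunday
idx = pd.date_range(df.index.min() - pd.Timedelta(days=6),
                    df.index.max(), freq='D')

# Each day takes the value of the next Sunday on or after it
out = df.reindex(idx, method='bfill')
print(out)
```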

Pandas : How to sum column values over a date range

I am trying to sum the values of colA over a date range based on the "date" column, and store this rolling value in the new column "sum_col".
But I am getting the sum of all rows (=100), not just those in the date range.
I can't use rolling or groupby as my dates (in the real data) are not sequential (some days are missing).
Any idea how to do this? Thanks.
# Create data frame
df = pd.DataFrame()
# Create datetimes and data
df['date'] = pd.date_range('1/1/2018', periods=100, freq='D')
df['colA']= 1
df['colB']= 2
df['colC']= 3
StartDate = df.date- pd.to_timedelta(5, unit='D')
EndDate= df.date
dfx=df
dfx['StartDate'] = StartDate
dfx['EndDate'] = EndDate
dfx['sum_col']=df[(df['date'] > StartDate) & (df['date'] <= EndDate)].sum()['colA']
dfx.head(50)
I'm not sure whether you want 3 columns for the sum of colA, colB, colC respectively, or one column which sums all three, but here is an example of how you would sum the values for colA:
dfx['colAsum'] = dfx.apply(lambda x: df.loc[(df.date >= x.StartDate) &
                                            (df.date <= x.EndDate), 'colA'].sum(), axis=1)
e.g. (with periods=10):
date colA colB colC StartDate EndDate colAsum
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 6
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 6
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 6
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 6
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 6
If what I understand is correct:
for i in range(df.shape[0]):
    dfx.loc[i, 'sum_col'] = df[(df['date'] > StartDate[i]) & (df['date'] <= EndDate[i])].sum()['colA']
For example, in the range (2018-01-01, 2018-01-06] — exclusive of the start date — the sum is 5.
date colA colB colC StartDate EndDate sum_col
0 2018-01-01 1 2 3 2017-12-27 2018-01-01 1.0
1 2018-01-02 1 2 3 2017-12-28 2018-01-02 2.0
2 2018-01-03 1 2 3 2017-12-29 2018-01-03 3.0
3 2018-01-04 1 2 3 2017-12-30 2018-01-04 4.0
4 2018-01-05 1 2 3 2017-12-31 2018-01-05 5.0
5 2018-01-06 1 2 3 2018-01-01 2018-01-06 5.0
6 2018-01-07 1 2 3 2018-01-02 2018-01-07 5.0
7 2018-01-08 1 2 3 2018-01-03 2018-01-08 5.0
8 2018-01-09 1 2 3 2018-01-04 2018-01-09 5.0
9 2018-01-10 1 2 3 2018-01-05 2018-01-10 5.0
10 2018-01-11 1 2 3 2018-01-06 2018-01-11 5.0
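Both answers compute the sum row by row. If performance matters, a time-based rolling window is a vectorized alternative that copes with missing days, because the window is defined by timestamps rather than row counts. With the default right-closed window, rolling('5D', on='date') sums over (date - 5 days, date], the same interval as the loop above:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('1/1/2018', periods=10, freq='D'),
                   'colA': 1})

# '5D' window keyed on the date column:
# each row sums colA over the half-open interval (date - 5 days, date]
df['sum_col'] = df.rolling('5D', on='date')['colA'].sum()
print(df)
```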

pandas dataframe new column which checks previous day

I have a DataFrame which has a DatetimeIndex and a column named "Holiday" which is a flag with 1 or 0.
So if the datetime index is a holiday, the Holiday column contains 1, and otherwise 0.
I need a new column that says whether a given datetime index is the first day after a holiday or not. The new column should just look at whether its previous day has the Holiday flag set to 1 and then set its own flag to 1, otherwise 0.
EDIT
Doing:
df['DayAfter'] = df.Holiday.shift(1).fillna(0)
Has the Output:
Holiday DayAfter AnyNumber
Datum
...
2014-01-01 20:00:00 1 1.0 9
2014-01-01 20:30:00 1 1.0 2
2014-01-01 21:00:00 1 1.0 3
2014-01-01 21:30:00 1 1.0 3
2014-01-01 22:00:00 1 1.0 6
2014-01-01 22:30:00 1 1.0 1
2014-01-01 23:00:00 1 1.0 1
2014-01-01 23:30:00 1 1.0 1
2014-01-02 00:00:00 0 1.0 1
2014-01-02 00:30:00 0 0.0 2
2014-01-02 01:00:00 0 0.0 1
2014-01-02 01:30:00 0 0.0 1
...
If you check the first timestamp of 2014-01-02, the DayAfter flag is set correctly. But the other flags are 0. That's wrong.
Create an array of unique days that are holidays and offset them by one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()
Create a new column with the one day holiday offsets (days)
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days),1,0)
Holiday AnyNumber DayAfter
Datum
2014-01-01 20:00:00 1 9 0
2014-01-01 20:30:00 1 2 0
2014-01-01 21:00:00 1 3 0
2014-01-01 21:30:00 1 3 0
2014-01-01 22:00:00 1 6 0
2014-01-01 22:30:00 1 1 0
2014-01-01 23:00:00 1 1 0
2014-01-01 23:30:00 1 1 0
2014-01-02 00:00:00 0 1 1
2014-01-02 00:30:00 0 2 1
2014-01-02 01:00:00 0 1 1
2014-01-02 01:30:00 0 1 1
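The answer's two lines, run end to end on a small made-up frame around midnight of 2014-01-02:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2014-01-01 23:00:00', '2014-01-01 23:30:00',
                      '2014-01-02 00:00:00', '2014-01-02 00:30:00',
                      '2014-01-03 00:00:00'])
df = pd.DataFrame({'Holiday': [1, 1, 0, 0, 0]}, index=idx)
df.index.name = 'Datum'

# Unique holiday dates, shifted forward one day
days = pd.Series(df[df.Holiday == 1].index).add(pd.DateOffset(1)).dt.date.unique()

# Flag every timestamp whose calendar date is in the shifted set
df['DayAfter'] = np.where(pd.Series(df.index).dt.date.isin(days), 1, 0)
print(df)
```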
