Cumsum by day of two dataframes considering repeated hours - python

I have the following two dataframes:
print(df_diff)
                     pacients
2019-01-01 00:10:00         1
2019-01-01 00:20:00         1
2019-01-01 00:30:00        -1
2019-01-02 10:00:00         1
2019-01-02 11:30:00         1
2019-01-03 00:00:00        -1
2019-01-03 15:00:00        -1
2019-01-03 23:30:00        -1
2019-01-04 00:00:00         1
2019-01-04 00:00:00         1
2019-01-04 10:00:00        -1
2019-01-04 10:00:00        -1

print(df_census_occupation)
            pacients_census
2019-01-01               10
2019-01-02               20
2019-01-03               30
2019-01-04               10
And I need to transform them into:
                     pacients
2019-01-01 00:00:00        10
2019-01-01 00:10:00        11
2019-01-01 00:20:00        12
2019-01-01 00:30:00        11
2019-01-02 00:00:00        20
2019-01-02 10:00:00        21
2019-01-02 11:30:00        22
2019-01-03 00:00:00        30
2019-01-03 00:00:00        29
2019-01-03 15:00:00        28
2019-01-03 23:30:00        27
2019-01-04 00:00:00        10
2019-01-04 00:00:00        11
2019-01-04 00:00:00        12
2019-01-04 10:00:00        11
2019-01-04 10:00:00        10
It's like a cumsum by day, where each day starts over from a baseline taken from another dataframe (df_census_occupation). Care must be taken with repeated timestamps: there may be days where exactly the same time appears more than once in df_diff, and such times may also coincide with the start of the day in df_census_occupation. This is what happens at 2019-01-04 00:00:00, for example.
I tried using cumsum with masks and shifts, and also some groupby operations, but the code was becoming difficult to understand and it was not considering the repeated hours issue.
Auxiliary code to generate the two dataframes:
import datetime
import pandas as pd
df_diff_index = [
    "2019-01-01 00:10:00",
    "2019-01-01 00:20:00",
    "2019-01-01 00:30:00",
    "2019-01-02 10:00:00",
    "2019-01-02 11:30:00",
    "2019-01-03 00:00:00",
    "2019-01-03 15:00:00",
    "2019-01-03 23:30:00",
    "2019-01-04 00:00:00",
    "2019-01-04 00:00:00",
    "2019-01-04 10:00:00",
    "2019-01-04 10:00:00",
]
df_diff_index = [datetime.datetime.strptime(date, "%Y-%m-%d %H:%M:%S") for date in df_diff_index]
df_census_occupation_index = [
    "2019-01-01",
    "2019-01-02",
    "2019-01-03",
    "2019-01-04",
]
df_census_occupation_index = [datetime.datetime.strptime(date, "%Y-%m-%d") for date in df_census_occupation_index]
df_diff = pd.DataFrame({"pacients": [1, 1, -1, 1, 1, -1, -1, -1, 1, 1, -1, -1]}, index=df_diff_index)
df_census_occupation = pd.DataFrame({"pacients_census": [10, 20, 30, 10]}, index=df_census_occupation_index)

Concatenate the two dataframes, sort by index, then group by day and take the cumulative sum:
out = (pd.concat([df_census_occupation.rename(columns={'pacients_census': 'pacients'}), df_diff])
         .sort_index()
         .groupby(pd.Grouper(freq='D'))
         .cumsum())
Output:
                     pacients
2019-01-01 00:00:00        10
2019-01-01 00:10:00        11
2019-01-01 00:20:00        12
2019-01-01 00:30:00        11
2019-01-02 00:00:00        20
2019-01-02 10:00:00        21
2019-01-02 11:30:00        22
2019-01-03 00:00:00        30
2019-01-03 00:00:00        29
2019-01-03 15:00:00        28
2019-01-03 23:30:00        27
2019-01-04 00:00:00        10
2019-01-04 00:00:00        11
2019-01-04 00:00:00        12
2019-01-04 10:00:00        11
2019-01-04 10:00:00        10
Note: you may want to pass kind='mergesort' to sort_index so that the sort is stable, i.e. the census row goes before the data rows that share its timestamp.
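For reference, the same pipeline with the stable sort applied:
out = (pd.concat([df_census_occupation.rename(columns={'pacients_census': 'pacients'}), df_diff])
         .sort_index(kind='mergesort')  # stable: census rows stay ahead of diff rows with equal timestamps
         .groupby(pd.Grouper(freq='D'))
         .cumsum())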

Related

pandas.cut day time: how to set a bin from 22:00 to 6:00?

I'm trying to use pd.cut to divide 24 hours into the following intervals:
[6,11),[11,14),[14,17),[17,22),[22,6)
How could I achieve the last bin [22,6)?
Assuming some form of datetime column, try offsetting the datetime by 6 hours so that the lower bound becomes midnight, then cut based on those shifted hours instead, with custom labels:
import pandas as pd
# sample data
df = pd.DataFrame({
    'datetime': pd.date_range('2021-01-01', periods=24, freq='H')
})
df['bins'] = pd.cut((df['datetime'] - pd.Timedelta(hours=6)).dt.hour,
                    bins=[0, 5, 8, 11, 16, 24],
                    labels=['[6,11)', '[11,14)', '[14,17)', '[17,22)', '[22,6)'],
                    right=False)
df:
datetime bins
0 2021-01-01 00:00:00 [22,6)
1 2021-01-01 01:00:00 [22,6)
2 2021-01-01 02:00:00 [22,6)
3 2021-01-01 03:00:00 [22,6)
4 2021-01-01 04:00:00 [22,6)
5 2021-01-01 05:00:00 [22,6)
6 2021-01-01 06:00:00 [6,11)
7 2021-01-01 07:00:00 [6,11)
8 2021-01-01 08:00:00 [6,11)
9 2021-01-01 09:00:00 [6,11)
10 2021-01-01 10:00:00 [6,11)
11 2021-01-01 11:00:00 [11,14)
12 2021-01-01 12:00:00 [11,14)
13 2021-01-01 13:00:00 [11,14)
14 2021-01-01 14:00:00 [14,17)
15 2021-01-01 15:00:00 [14,17)
16 2021-01-01 16:00:00 [14,17)
17 2021-01-01 17:00:00 [17,22)
18 2021-01-01 18:00:00 [17,22)
19 2021-01-01 19:00:00 [17,22)
20 2021-01-01 20:00:00 [17,22)
21 2021-01-01 21:00:00 [17,22)
22 2021-01-01 22:00:00 [22,6)
23 2021-01-01 23:00:00 [22,6)

Choosing the minimum distance part 2

This question is already here, but now I have added an extra part to the previous question.
I have the following dataframe:
import pandas as pd

data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00',
                       '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
I have been trying to calculate the shortest time difference between the orders within each 15-minute window. E.g. I take a 15-minute window and its midpoint 00:07:30, calculate the difference between the first order '2019-01-01 00:00:00' and 00:07:30 and between the second order '2019-01-01 00:11:00' and 00:07:30, and keep only the order that is closer to 00:07:30 each day.
I did the following:
t = 0
s = datetime.datetime.fromtimestamp(t).strftime('%H:%M:%S')
#x = '00:00:00'
#y = '00:15:00'
tw = 900
g = 0
a = []
for k in range(30):
    begin = pd.Timestamp(s).to_pydatetime()
    begin1 = begin + datetime.timedelta(seconds=int(k*60))
    last = begin1 + datetime.timedelta(seconds=int(tw))
    x = begin1.strftime('%H:%M:%S')
    y = last.strftime('%H:%M:%S')
    for i in range(1, len(df_data)):
        #g += 1
        if x <= df_data.iat[i-1, 3] <= y:  # column 3 is the 'time' string
            half_time = (pd.Timestamp(y) - pd.Timestamp(x)) / 2
            half_window = (half_time + pd.Timestamp(x)).strftime('%H:%M:%S')
            for l in df_data['day_order']:
                for k in df_data['time_order']:  # note: this k shadows the outer loop variable
                    if l == k.strftime('%Y-%m-%d'):
                        distance1 = abs(pd.Timestamp(df_data.iat[i-1, 3]) - pd.Timestamp(half_window))
                        distance2 = abs(pd.Timestamp(df_data.iat[i, 3]) - pd.Timestamp(half_window))
                        if distance1 < distance2:
                            d = distance1
                        else:
                            d = distance2
                        a.append(d.seconds)
So the expected result for the first day is abs(00:11:00 - 00:07:30) = 00:03:30, which is less than abs(00:00:00 - 00:07:30) = 00:07:30. I would like to keep only the shorter time distance, i.e. 00:03:30, and ignore the first order of that day, and to do this for each day. I tried it with my code above, but it doesn't work. Any idea would be very appreciated. Thanks in advance.
Update:
I just added an extra step to the code above so that the time window shifts by one minute each iteration, e.g. from 00:00:00-00:15:00 to 00:01:00-00:16:00, and I look inside this period for the shortest distance, as previously described, ignoring times that do not belong to that window. I tried this procedure for 30 minutes and it worked with your suggested solution. However, it picked up times that do not belong to that period.
import pandas as pd
import datetime
data = {'id': [0, 0, 0, 0, 0, 0],
        'time_order': ['2019-01-01 0:00:00', '2019-01-01 00:11:00', '2019-01-02 00:04:00',
                       '2019-01-02 00:15:00', '2019-01-03 00:07:00', '2019-01-03 00:10:00']}
df_data = pd.DataFrame(data)
df_data['time_order'] = pd.to_datetime(df_data['time_order'])
df_data['day_order'] = df_data['time_order'].dt.strftime('%Y-%m-%d')
df_data['time'] = df_data['time_order'].dt.strftime('%H:%M:%S')
x = '00:00:00'
y = '00:15:00'
s = '00:00:00'
tw = 900
begin = pd.Timestamp(s).to_pydatetime()
for k in range(10):  # the window shifts 10 times, one minute per iteration
    begin1 = begin + datetime.timedelta(seconds=int(k*60))
    last = begin1 + datetime.timedelta(seconds=int(tw))
    x = begin1.strftime('%H:%M:%S')
    y = last.strftime('%H:%M:%S')
    print('\n========\n', x, y)
    diff = (pd.Timedelta(y) - pd.Timedelta(x)) / 2
    df_data2 = df_data[(last >= pd.to_datetime(df_data['time'])) & (pd.to_datetime(df_data['time']) > begin1)].copy()
    #print(df_data2)
    df_data2['diff'] = abs(pd.to_timedelta(df_data2['time']) - (diff + pd.Timedelta(x)))  # 'time' strings cast to Timedelta
    mins = df_data2.groupby('day_order').apply(lambda z: z[z['diff'] == min(z['diff'])])
    mins.reset_index(drop=True, inplace=True)
    print(mins)
Output after first 10 shifts:
========
00:00:00 00:15:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:03:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:00:30
========
00:01:00 00:16:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:02:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:04:30
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:01:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
========
00:02:00 00:17:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:01:30
1 0 2019-01-02 00:04:00 2019-01-02 00:04:00 0 days 00:05:30
2 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:05:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:00:30
========
00:03:00 00:18:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:00:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:04:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:00:30
========
00:04:00 00:19:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:00:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:03:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
========
00:05:00 00:20:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:01:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:02:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:02:30
========
00:06:00 00:21:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:02:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:01:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:03:30
========
00:07:00 00:22:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:03:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:00:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:04:30
========
00:08:00 00:23:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:04:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:00:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:05:30
========
00:09:00 00:24:00
id time_order day_order time diff
0 0 2019-01-01 00:11:00 2019-01-01 00:11:00 0 days 00:05:30
1 0 2019-01-02 00:15:00 2019-01-02 00:15:00 0 days 00:01:30
2 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:06:30
Now, notice that some iterations generated 4 rows of output. Looking at the diff column, you will find pairs of rows with the same time difference. This is because we treat positive and negative time differences as the same.
For example, in the output above for the second iteration, i.e. 00:01:00 to 00:16:00, there are two entries for 2019-01-03:
2 0 2019-01-03 00:07:00 2019-01-03 00:07:00 0 days 00:01:30
3 0 2019-01-03 00:10:00 2019-01-03 00:10:00 0 days 00:01:30
And this is because both of their differences are 00:01:30.
The midpoint of this range is at 00:01:00 + 00:07:30 = 00:08:30:
00:07:00 <----(- 01:30)----00:08:30---(+ 01:30)--->00:10:00
And that's why both orders were displayed.
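If a single row per day is preferred even when such a tie occurs, one option (a sketch; on a tie it keeps the first, i.e. earlier, order) is to take idxmin per group instead of the boolean filter:
# idxmin returns exactly one index label per group; ties resolve to the first occurrence
mins = df_data2.loc[df_data2.groupby('day_order')['diff'].idxmin()]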

How to increase number of truncated displayed rows in pandas series

In the terminal, I have pd.options.display.max_rows set to 60. But for a series that goes over 60 rows, the display is truncated down to show only 10 rows. How do I increase the number of truncated rows shown?
For example, the following (which is within the max_rows setting) shows 60 rows of data:
s = pd.date_range('2019-01-01', '2019-06-01').to_series()
s[:60]
But if I ask for 61 rows, it gets severely truncated:
In [44]: s[:61]
Out[44]:
2019-01-01 2019-01-01
2019-01-02 2019-01-02
2019-01-03 2019-01-03
2019-01-04 2019-01-04
2019-01-05 2019-01-05
...
2019-02-26 2019-02-26
2019-02-27 2019-02-27
2019-02-28 2019-02-28
2019-03-01 2019-03-01
2019-03-02 2019-03-02
Freq: D, Length: 61, dtype: datetime64[ns]
How can I set it so that I see, for example, 20 rows, every time it goes beyond the max_rows limit?
From the docs, you can use pd.options.display.min_rows.
Once display.max_rows is exceeded, the display.min_rows option determines how many rows are shown in the truncated repr.
Example:
>>> pd.set_option('max_rows', 59)
>>> pd.set_option('min_rows', 20)
>>> s = pd.date_range('2019-01-01', '2019-06-01').to_series()
>>> s[:60]
2019-01-01 2019-01-01
2019-01-02 2019-01-02
2019-01-03 2019-01-03
2019-01-04 2019-01-04
2019-01-05 2019-01-05
2019-01-06 2019-01-06
2019-01-07 2019-01-07
2019-01-08 2019-01-08
2019-01-09 2019-01-09
2019-01-10 2019-01-10
...
2019-02-20 2019-02-20
2019-02-21 2019-02-21
2019-02-22 2019-02-22
2019-02-23 2019-02-23
2019-02-24 2019-02-24
2019-02-25 2019-02-25
2019-02-26 2019-02-26
2019-02-27 2019-02-27
2019-02-28 2019-02-28
2019-03-01 2019-03-01
Freq: D, Length: 60, dtype: datetime64[ns]
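The shorthand keys above work because pandas matches option names by regex pattern; equivalently, with the fully qualified names:
>>> pd.set_option('display.max_rows', 59)
>>> pd.set_option('display.min_rows', 20)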

How do I display a subset of a pandas dataframe?

I have a dataframe df that contains datetimes for every hour between 2003-02-12 and 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want to exclude these dates for every year, then extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
And now put the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
This takes advantage of the fact that datetime-strings in the form mm-dd are sortable. Read everything in from the CSV file then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
You can try dropping on conditionals, maybe with a pattern match on the date string, or by parsing the date as a number and conditionally removing.
datesIdontLike = df[df['colname'] == <stringPattern>].index
newDF = df.drop(datesIdontLike)  # note: drop(..., inplace=True) returns None, so don't assign its result
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
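As a concrete sketch of the pattern-match idea (assuming a string column named 'DateTime' in 'YYYY-MM-DD HH:MM:SS' format; the column name is illustrative):
datesIdontLike = df[df['DateTime'].str.contains(r'-12-2[4-9]|-12-3[01]|-01-01')].index  # Dec 24-31 or Jan 1
newDF = df.drop(datesIdontLike)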
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string format the date to only include the month and day
# then keep it strictly less than '12-24' AND greater than or equal to '01-02'
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series where we replace the year by a static value (2000, for example). Let date be the column that stores the date; we can generate such a series as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
from datetime import date
df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
This gives us the following data for your sample data:
>>> df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless of the year, we will only consider dates between the 2nd of January and the 23rd of December (both inclusive).

Identify cells by condition within the same day

Let's say I have the DataFrame below. How would I get an extra column 'flag' with 1s where a day has an age greater than 90, but only if that happens on 2 consecutive days (48h in this case)? The output should contain 1s on 2 or more days, depending on how many days the condition is met. The dataset is much bigger, but I put just a small portion here so you get the idea.
Age
Dates
2019-01-01 00:00:00 29
2019-01-01 01:00:00 56
2019-01-01 02:00:00 82
2019-01-01 03:00:00 13
2019-01-01 04:00:00 35
2019-01-01 05:00:00 53
2019-01-01 06:00:00 25
2019-01-01 07:00:00 23
2019-01-01 08:00:00 21
2019-01-01 09:00:00 12
2019-01-01 10:00:00 15
2019-01-01 11:00:00 9
2019-01-01 12:00:00 13
2019-01-01 13:00:00 87
2019-01-01 14:00:00 9
2019-01-01 15:00:00 63
2019-01-01 16:00:00 62
2019-01-01 17:00:00 52
2019-01-01 18:00:00 43
2019-01-01 19:00:00 77
2019-01-01 20:00:00 95
2019-01-01 21:00:00 79
2019-01-01 22:00:00 77
2019-01-01 23:00:00 5
2019-01-02 00:00:00 78
2019-01-02 01:00:00 41
2019-01-02 02:00:00 10
2019-01-02 03:00:00 10
2019-01-02 04:00:00 88
2019-01-02 05:00:00 19
This would be the desired output:
Dates Age flag
0 2019-01-01 00:00:00 29 1
1 2019-01-01 01:00:00 56 1
2 2019-01-01 02:00:00 82 1
3 2019-01-01 03:00:00 13 1
4 2019-01-01 04:00:00 35 1
5 2019-01-01 05:00:00 53 1
6 2019-01-01 06:00:00 25 1
7 2019-01-01 07:00:00 23 1
8 2019-01-01 08:00:00 21 1
9 2019-01-01 09:00:00 12 1
10 2019-01-01 10:00:00 15 1
11 2019-01-01 11:00:00 9 1
12 2019-01-01 12:00:00 13 1
13 2019-01-01 13:00:00 87 1
14 2019-01-01 14:00:00 9 1
15 2019-01-01 15:00:00 63 1
16 2019-01-01 16:00:00 62 1
17 2019-01-01 17:00:00 52 1
18 2019-01-01 18:00:00 43 1
19 2019-01-01 19:00:00 77 1
20 2019-01-01 20:00:00 95 1
21 2019-01-01 21:00:00 79 1
22 2019-01-01 22:00:00 77 1
23 2019-01-01 23:00:00 5 1
24 2019-01-02 00:00:00 78 0
25 2019-01-02 01:00:00 41 0
26 2019-01-02 02:00:00 10 0
27 2019-01-02 03:00:00 10 0
28 2019-01-02 04:00:00 88 0
29 2019-01-02 05:00:00 19 0
The dates are the index of the dataframe and are incremented by 1h.
Thanks
You can first compare the column with Series.gt, then group by DatetimeIndex.date and check whether each day has at least one True using GroupBy.transform with 'any'. Next, build identifiers for runs of consecutive equal values with ne + shift + cumsum, keep only runs of at least N rows via GroupBy.transform with 'size', and finally cast the mask to integers to map True/False to 1/0:
df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='5H', periods=24))
# to test a 1H timestamp, use:
#df = pd.DataFrame({'Age': 10}, index=pd.date_range('2019-01-01', freq='H', periods=24 * 5))
df.loc[pd.Timestamp('2019-01-02 01:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-03 02:00:00'), 'Age'] = 95
df.loc[pd.Timestamp('2019-01-05 19:00:00'), 'Age'] = 95
#print(df)
# to require 48 consecutive values, change N = 48
N = 10
s = df['Age'].gt(90)                            # True where Age > 90
s1 = s.groupby(df.index.date).transform('any')  # True for every row of a day containing any Age > 90
g1 = s1.ne(s1.shift()).cumsum()                 # run identifiers for consecutive equal values
df['flag'] = (s.groupby(g1).transform('size').ge(N) & s1).astype(int)
print(df)
Age flag
2019-01-01 00:00:00 10 0
2019-01-01 05:00:00 10 0
2019-01-01 10:00:00 10 0
2019-01-01 15:00:00 10 0
2019-01-01 20:00:00 10 0
2019-01-02 01:00:00 95 1
2019-01-02 06:00:00 10 1
2019-01-02 11:00:00 10 1
2019-01-02 16:00:00 10 1
2019-01-02 21:00:00 10 1
2019-01-03 02:00:00 95 1
2019-01-03 07:00:00 10 1
2019-01-03 12:00:00 10 1
2019-01-03 17:00:00 10 1
2019-01-03 22:00:00 10 1
2019-01-04 03:00:00 10 0
2019-01-04 08:00:00 10 0
2019-01-04 13:00:00 10 0
2019-01-04 18:00:00 10 0
2019-01-04 23:00:00 10 0
2019-01-05 04:00:00 10 0
2019-01-05 09:00:00 10 0
2019-01-05 14:00:00 10 0
2019-01-05 19:00:00 95 0
Apparently, this could be a solution to the first version of the question: how to add a column whose row values are 1 if at least one of the rows with the same date (y-m-d) has an Age value greater than 90.
import pandas as pd
df = pd.DataFrame({
    'Dates': ['2019-01-01 00:00:00',
              '2019-01-01 01:00:00',
              '2019-01-01 02:00:00',
              '2019-01-02 00:00:00',
              '2019-01-02 01:00:00',
              '2019-01-03 02:00:00',
              '2019-01-03 03:00:00'],
    'Age': [29, 56, 92, 13, 1, 2, 93],
})
df.set_index('Dates', inplace=True)
df.index = pd.to_datetime(df.index)
df['flag'] = pd.DatetimeIndex(df.index).day  # day of month; assumes the data does not span multiple months
df['flag'] = df.flag.isin(df['flag'][df['Age'] > 90]).astype(int)
It returns:
Age flag
Dates
2019-01-01 00:00:00 29 1
2019-01-01 01:00:00 56 1
2019-01-01 02:00:00 92 1
2019-01-02 00:00:00 13 0
2019-01-02 01:00:00 1 0
2019-01-03 02:00:00 2 1
2019-01-03 03:00:00 93 1
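A variant that is robust across multiple months (a sketch; it keys on the full calendar date rather than the day of month):
# normalize() floors each timestamp to midnight, giving one comparable value per calendar day
flagged_days = df.index.normalize()[(df['Age'] > 90).to_numpy()]
df['flag'] = df.index.normalize().isin(flagged_days).astype(int)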
