I have a dataframe as shown below
import pandas as pd

df = pd.DataFrame({'person_id': [101, 101, 101, 101, 202, 202, 202],
                   'login_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM']})
df.login_date = pd.to_datetime(df.login_date)
df['logout_date'] = df.login_date + pd.Timedelta(days=5)
df['login_id'] = [1, 1, 1, 1, 8, 8, 8]
As you can see in the sample dataframe, the login_id is the same even though the login and logout dates are different for a person.
For example, person 101 has logged in and out at 4 different timestamps, but he has the same login_id each time, which is incorrect.
Instead, I would like to generate a new login_id column where each row gets a new login_id but retains the first login_id in the subsequent logins, so we can tell it is a sequence.
I tried the below, but it doesn't work well:
df.groupby(['person_id','login_date','logout_date'])['login_id'].rank(method="first", ascending=True) + 100000
I expect my output to be as shown below. You can see how 1 and 8, the first login_id for each person, are retained in the subsequent login_ids; we just append a sequence (00001, 00002, and so on) based on the number of rows.
Please note that I would like to apply this to big data, and the login_ids may not be single digits in the real data. For example, the first login_id could even be a random number like 576869578, in which case the subsequent login_id would be 57686957800001. Whatever the first login_id is for that person, append 00001, 00002, etc. based on the number of rows that person has.
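For concreteness, here is a minimal sketch of the mapping I am after, using the df defined above (new_login_id is just an illustrative name; building the suffix as a zero-padded string keeps it working for ids of any length):

# Position of each row within its person's group: 0, 1, 2, ...
seq = df.groupby('person_id').cumcount()
# First row keeps the original login_id; later rows get <id><5-digit counter>.
candidate = df['login_id'].astype(str) + seq.astype(str).str.zfill(5)
df['new_login_id'] = candidate.where(seq > 0, df['login_id'].astype(str))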
Update 2: Just realized my previous answers also added 100000 to the first index. Here is a version that uses GroupBy.transform() to add 100000 only to subsequent indexes:
cumcount = df.groupby(['person_id', 'login_id']).login_id.cumcount()
# First row per group keeps its original id (fillna with x.min());
# subsequent rows become id * 100000 plus the within-group counter.
df.login_id = df.groupby(['person_id', 'login_id']).login_id.transform(
    lambda x: x.shift().mul(100000).fillna(x.min())
).add(cumcount)
# person_id login_date logout_date login_id
# 0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 1
# 1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
# 2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
# 3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
# 4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 8
# 5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
# 6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
Update: A faster option is to build the sequence with GroupBy.cumcount():
cumcount = df.groupby(['person_id','login_id']).login_id.cumcount()
df.login_id = df.login_id.mul(100000).add(cumcount)
# person_id login_date logout_date login_id
# 0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 100000
# 1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001
# 2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002
# 3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003
# 4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 800000
# 5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001
# 6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002
You can build the sequence in a GroupBy.apply():
df.login_id = df.groupby(['person_id', 'login_id']).login_id.apply(
    lambda x: pd.Series([x.min() * 100000 + seq for seq in range(len(x))], x.index)
)
login_id = df.groupby('person_id').login_id.apply(list)
def modify_id(x):
    result = []
    for index, value in enumerate(x):
        if index > 0:
            # All rows after the first get <original id> * 100000 + position.
            value = (int(value) * 100000) + index
        result.append(value)
    return result

df['login_id'] = login_id.apply(modify_id).explode().to_list()
This will give the output:

person_id  login_date           logout_date          login_id
101        2013-05-07 09:27:00  2013-05-12 09:27:00  1
101        2013-09-08 11:21:00  2013-09-13 11:21:00  100001
101        2014-06-06 08:00:00  2014-06-11 08:00:00  100002
101        2014-06-06 05:00:00  2014-06-11 05:00:00  100003
202        2011-12-11 10:00:00  2011-12-16 10:00:00  8
202        2012-10-13 00:00:00  2012-10-18 00:00:00  800001
202        2012-12-13 11:45:00  2012-12-18 11:45:00  800002
You can make use of your original rank():
df['login_id'] = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
# print(df)
person_id login_date logout_date login_id
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 100000.0
1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001.0
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002.0
3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003.0
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 800000.0
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001.0
6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002.0
Then change the first row of each group:
def change_first(group):
    group.loc[group.index[0], 'login_id'] = group.iloc[0]['login_id'] / 100000
    return group

df['login_id'] = df.groupby(['person_id']).apply(change_first)['login_id']
# print(df)
person_id login_date logout_date login_id
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 1.0
1 101 2013-09-08 11:21:00 2013-09-13 11:21:00 100001.0
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 100002.0
3 101 2014-06-06 05:00:00 2014-06-11 05:00:00 100003.0
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 8.0
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 800001.0
6 202 2012-12-13 11:45:00 2012-12-18 11:45:00 800002.0
Or make use of where() to update only the rows where the condition is False:
df_ = df['login_id'] * 100000 + df.groupby(['person_id'])['login_id'].rank(method="first") - 1
firsts = df.groupby(['person_id']).head(1).index
df['login_id'] = df['login_id'].where(df.index.isin(firsts), df_)
Related
I have a dataframe as shown below. This is a continuation of the previous post.
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [101, 101, 101, 101, 202, 202, 202],
                   'person_type': ['A', 'A', 'B', 'C', 'D', 'B', 'A'],
                   'login_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM'],
                   'logout_date': [np.nan, '11/08/2013 11:21:00 AM', np.nan,
                                   '06/06/2014 05:00:00 AM', np.nan,
                                   '13/10/2012 12:00:00 AM', np.nan]})
df.login_date = pd.to_datetime(df.login_date)
df.logout_date = pd.to_datetime(df.logout_date)
I would like to apply 2 rules to the logout_date column:
Rule 1 - If person_type is B, C, D, or E AND logout_date is NaN, then copy the login_date value.
Rule 2 - If person_type is A AND logout_date is NaN, then add 2 days to the login_date.
When I try the below:
m1 = df['person_type'].isin(['B', 'C', 'D'])
m2 = df['person_type'].isin(['A'])
m3 = df['logout_date'].isna()
df['logout_datetime'] = np.select([m1 & m3, m2 & m3],
                                  [df['login_date'],
                                   df['login_date'] + pd.DateOffset(days=2)],
                                  default=df['logout_date'])
df['logout_date'] = np.select([m1 & m3, m2 & m3],
                              [df['login_date'].dt.date,
                               (df['login_date'] + pd.DateOffset(days=2)).dt.date],
                              default=df['logout_datetime'])
I would like to get the logout_date column directly by using np.select as shown in the sample code.
Currently I get an incorrect output, and I don't understand why some rows cause issues while other rows work fine.
Can you help me with this? I expect my output to have proper date values.
I think the problem is a missing conversion in the default parameter of np.select (default=df['logout_datetime']). Change it to a date default, e.g. default=df['logout_date'].dt.date, so that all values returned from np.select have the same type:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
                              [df['login_date'].dt.date,
                               (df['login_date'] + pd.DateOffset(days=2)).dt.date],
                              default=df['logout_date'].dt.date)
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09
1 101 A 2013-09-08 11:21:00 2013-11-08
2 101 B 2014-06-06 08:00:00 2014-06-06
3 101 C 2014-06-06 05:00:00 2014-06-06
4 202 D 2011-12-11 10:00:00 2011-12-11
5 202 B 2012-10-13 00:00:00 2012-10-13
6 202 A 2012-12-13 11:45:00 2012-12-15
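To see why the types matter: np.select combines all the choices and the default into a single NumPy array, so mixing datetime.date objects with datetime64 values falls back to a mixed object-dtype result. A minimal sketch with made-up values:

import numpy as np
import pandas as pd

ts = pd.Series(pd.to_datetime(['2013-05-07 09:27:00', '2013-09-08 11:21:00']))
# The choice yields datetime.date objects while the default keeps
# datetime64 values, so the combined result degrades to object dtype.
mixed = np.select([ts.dt.day > 7], [ts.dt.date], default=ts)
print(mixed.dtype)  # object: one entry is a date, the other a full timestamp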
If you need a default with datetimes, then Series.dt.normalize removes the times (sets them to 00:00:00) and all types are datetimes, so it works well:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
                              [df['login_date'].dt.normalize(),
                               (df['login_date'] + pd.DateOffset(days=2)).dt.normalize()],
                              default=df['logout_date'])
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09 00:00:00
1 101 A 2013-09-08 11:21:00 2013-11-08 11:21:00
2 101 B 2014-06-06 08:00:00 2014-06-06 00:00:00
3 101 C 2014-06-06 05:00:00 2014-06-06 05:00:00
4 202 D 2011-12-11 10:00:00 2011-12-11 00:00:00
5 202 B 2012-10-13 00:00:00 2012-10-13 00:00:00
6 202 A 2012-12-13 11:45:00 2012-12-15 00:00:00
For the original datetimes use:
df['logout_date'] = np.select([m1 & m3, m2 & m3],
                              [df['login_date'],
                               df['login_date'] + pd.DateOffset(days=2)],
                              default=df['logout_date'])
print (df)
person_id person_type login_date logout_date
0 101 A 2013-05-07 09:27:00 2013-05-09 09:27:00
1 101 A 2013-09-08 11:21:00 2013-11-08 11:21:00
2 101 B 2014-06-06 08:00:00 2014-06-06 08:00:00
3 101 C 2014-06-06 05:00:00 2014-06-06 05:00:00
4 202 D 2011-12-11 10:00:00 2011-12-11 10:00:00
5 202 B 2012-10-13 00:00:00 2012-10-13 00:00:00
6 202 A 2012-12-13 11:45:00 2012-12-15 11:45:00
I have a dataframe as shown below
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_id': [101, 101, 101, 101, 202, 202, 202],
                   'start_date': ['5/7/2013 09:27:00 AM', '09/08/2013 11:21:00 AM',
                                  '06/06/2014 08:00:00 AM', '06/06/2014 05:00:00 AM',
                                  '12/11/2011 10:00:00 AM', '13/10/2012 12:00:00 AM',
                                  '13/12/2012 11:45:00 AM'],
                   'end_date': ['5/12/2013 09:27:00 AM', np.nan, '06/11/2014 08:00:00 AM',
                                np.nan, '12/16/2011 10:00:00', '10/18/2012 00:00:00', np.nan],
                   'type': ['O', 'I', 'O', 'O', 'I', 'O', 'I']})
df.start_date = pd.to_datetime(df.start_date)
df['end_date'] = pd.to_datetime(df.end_date)
I would like to fillna() under the end_date column based on the two approaches below:
a) If NA is found in any row except the last row for that person, fill it by copying the value from the next row.
b) If NA is found in the last row for that person, fill it by adding 10 days to the start_date (because there is no next row for that person to copy from, so we use an arbitrary offset of 10 days).
Rules a and b apply only to persons with type=I.
For persons with type=O, just fill NA by copying the value from start_date.
This is what I tried. You can see I am writing almost the same code line twice:
df['end_date'] = np.where(df['type'].str.contains('I'),
                          pd.DatetimeIndex(df['end_date'].bfill()),
                          pd.DatetimeIndex(df.start_date.dt.date))
df['end_date'] = np.where(df['type'].str.contains('I'),
                          pd.DatetimeIndex(df['start_date'] + pd.DateOffset(10)),
                          pd.DatetimeIndex(df.start_date.dt.date))
Is there an elegant and efficient way to write this, as I have to apply it to big data with 15 million rows?
I expect my output to be as shown below.
Solution
s1 = df.groupby('person_id')['start_date'].shift(-1)
s1 = s1.fillna(df['start_date'] + pd.DateOffset(days=10))
s1 = df['end_date'].fillna(s1)
s2 = df['end_date'].fillna(df['start_date'])
df['end_date'] = np.where(df['type'].eq('I'), s1, s2)
Explanations
Group the dataframe on person_id and shift the column start_date one unit upwards:
>>> df.groupby('person_id')['start_date'].shift(-1)
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 NaT
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 NaT
Name: start_date, dtype: datetime64[ns]
Fill the NaN values in the shifted column with the values from the start_date column after adding an offset of 10 days:
>>> s1.fillna(df['start_date'] + pd.DateOffset(days=10))
0 2013-09-08 11:21:00
1 2014-06-06 08:00:00
2 2014-06-06 05:00:00
3 2014-06-16 05:00:00
4 2012-10-13 00:00:00
5 2012-12-13 11:45:00
6 2012-12-23 11:45:00
Name: start_date, dtype: datetime64[ns]
Now fill the NaN values in the end_date column with the above series s1:
>>> df['end_date'].fillna(s1)
0 2013-05-12 09:27:00
1 2014-06-06 08:00:00
2 2014-06-11 08:00:00
3 2014-06-16 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-23 11:45:00
Name: end_date, dtype: datetime64[ns]
Similarly, fill the NaN values in the end_date column with the values from the start_date column to create a series s2:
>>> df['end_date'].fillna(df['start_date'])
0 2013-05-12 09:27:00
1 2013-09-08 11:21:00
2 2014-06-11 08:00:00
3 2014-06-06 05:00:00
4 2011-12-16 10:00:00
5 2012-10-18 00:00:00
6 2012-12-13 11:45:00
Name: end_date, dtype: datetime64[ns]
Then use np.where to select the values from s1 / s2 based on whether the type is I or O:
>>> df
person_id start_date end_date type
0 101 2013-05-07 09:27:00 2013-05-12 09:27:00 O
1 101 2013-09-08 11:21:00 2014-06-06 08:00:00 I
2 101 2014-06-06 08:00:00 2014-06-11 08:00:00 O
3 101 2014-06-06 05:00:00 2014-06-06 05:00:00 O
4 202 2011-12-11 10:00:00 2011-12-16 10:00:00 I
5 202 2012-10-13 00:00:00 2012-10-18 00:00:00 O
6 202 2012-12-13 11:45:00 2012-12-23 11:45:00 I
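As a quick sanity check of the two rules against the final frame above (row labels refer to the sample data):

# Row 3: type O with a missing end_date -> copied from start_date.
assert df.loc[3, 'end_date'] == df.loc[3, 'start_date']
# Row 6: type I, last row for person 202 -> start_date + 10 days.
assert df.loc[6, 'end_date'] == df.loc[6, 'start_date'] + pd.DateOffset(days=10)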
Suppose I have two datetime inputs: 2020-01-20 11:35:00 and 2020-01-25 08:00:00.
I want an output DataFrame as:
time1 time2
-------------------------------------------
2020-01-20 11:35:00 | 2020-01-21 00:00:00
2020-01-21 00:00:00 | 2020-01-22 00:00:00
2020-01-22 00:00:00 | 2020-01-23 00:00:00
2020-01-23 00:00:00 | 2020-01-24 00:00:00
2020-01-24 00:00:00 | 2020-01-25 00:00:00
2020-01-25 00:00:00 | 2020-01-25 08:00:00
There is no built-in way to do this, but we can use iloc and pd.date_range to generate the daily range and then assign the exact first and last timestamps.
t1 = pd.Timestamp('2020-01-20 11:35:00')
t2 = pd.Timestamp('2020-01-25 08:00:00')
df = pd.DataFrame({'Time1' : pd.date_range(t1.date(),t2.date())})
df = df.assign(Time2 = df['Time1'] + pd.DateOffset(days=1))
df.iloc[0,0] = t1
df.iloc[-1,1] = t2
print(df)
Time1 Time2
0 2020-01-20 11:35:00 2020-01-21 00:00:00
1 2020-01-21 00:00:00 2020-01-22 00:00:00
2 2020-01-22 00:00:00 2020-01-23 00:00:00
3 2020-01-23 00:00:00 2020-01-24 00:00:00
4 2020-01-24 00:00:00 2020-01-25 00:00:00
5 2020-01-25 00:00:00 2020-01-25 08:00:00
You can use date_range with both dates and then create the dataframe.
d1 = pd.to_datetime('2020-01-20 11:35:00')
d2 = pd.to_datetime('2020-01-25 08:00:00')
l = pd.date_range(d1.date(), d2.date(), freq='d').tolist()[1:] #remove the first date
df = pd.DataFrame({'time1':[d1] + l, 'time2':l + [d2]})
print (df)
time1 time2
0 2020-01-20 11:35:00 2020-01-21 00:00:00
1 2020-01-21 00:00:00 2020-01-22 00:00:00
2 2020-01-22 00:00:00 2020-01-23 00:00:00
3 2020-01-23 00:00:00 2020-01-24 00:00:00
4 2020-01-24 00:00:00 2020-01-25 00:00:00
5 2020-01-25 00:00:00 2020-01-25 08:00:00
I have a dataframe df that contains datetimes for every hour of every day between 2003-02-12 and 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want to exclude these dates for every year, extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
Now apply the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) |
          ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
This takes advantage of the fact that datetime-strings in the form mm-dd are sortable. Read everything in from the CSV file then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
You can try dropping rows conditionally, either with a pattern match on the date string or by parsing the date and removing the rows that match.
datesIdontLike = df[df['colname'] == <stringPattern>].index
newDF = df.drop(datesIdontLike)
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
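A concrete sketch of that idea, assuming a datetime column named colname and matching on the mm-dd string (the frame here is made up for illustration):

import pandas as pd

df = pd.DataFrame({'colname': pd.date_range('2003-12-20', '2004-01-05', freq='D')})
# Rows falling on Dec 24-31 or Jan 1 are the ones to drop.
mmdd = df['colname'].dt.strftime('%m-%d')
datesIdontLike = df[(mmdd >= '12-24') | (mmdd == '01-01')].index
newDF = df.drop(datesIdontLike)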
You can use pandas and boolean filtering with strftime
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string format the date to only include the month and day
# then set it strictly less than '12-24' AND greater than or equal to `01-02`
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series where we replace the year with a static value (2000, for example). Let date be the column that stores the date; we can generate such a column as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
from datetime import date
df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
This gives us the following data for your sample data:
>>> df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless of what the year is, we will only consider dates between the 2nd of January and the 23rd of December (both inclusive).
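If comparing the datetime64 series dt against datetime.date objects raises a type error on your pandas version, the same filter can be written with Timestamps instead (a sketch of the identical condition):

df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]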
I have two data frames like the following: data frame A has datetimes down to the minute, while data frame B only has the hour.
df A:
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df B:
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A has all of its rows filled by merging on date and hour.
I can try to do it via:
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge:
merge_df = pd.merge(A, B, how='left', left_on=['date', 'hour'],
                    right_on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with the help of pandas' time series or date functionality?
Use map if you need to append only one column from B to A, with floor to set minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0

Row 0 is NaN because A's 2018-09-30 11:20:00 floors to 11:00 while B's nearest key is 10:00, so there is no matching key.
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
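If you prefer to keep using pd.merge as in the question, an equivalent sketch (not part of the original answer; it starts again from the unmodified A and B) merges on a floored key column:

merge_df = pd.merge(
    A.assign(key=A.dataDate.dt.floor('H')),
    B.assign(key=B.dataDate.dt.floor('H')),
    on='key', how='left', suffixes=('_A', '_B'),
).drop(columns='key')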