Pandas df comparing two dates condition - python

I'd like to add a column that is 1 if date_ is more than 12 months after buy_date, and 0 otherwise.
example df
customer_id date_ buy_date
34555 2019-01-01 2017-02-01
24252 2019-01-01 2018-02-10
96477 2019-01-01 2017-02-18
output df
customer_id date_ buy_date buy_date>_than_12_months
34555 2019-01-01 2017-02-01 1
24252 2019-01-01 2018-02-10 0
96477 2019-01-01 2017-02-18 1

As I understand it, you can add a year to buy_date, subtract the result from date_, and then check whether the difference in days is positive.
df['buy_date>_than_12_months'] = ((df['date_'] -
(df['buy_date']+pd.offsets.DateOffset(years=1)))
.dt.days.gt(0).astype(int))
print(df)
customer_id date_ buy_date buy_date>_than_12_months
0 34555 2019-01-01 2017-02-01 1
1 24252 2019-01-01 2018-02-10 0
2 96477 2019-01-01 2017-02-18 1
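The same check can probably be written as a direct comparison rather than a subtraction. The following is only a minimal sketch that rebuilds the example frame from the question; the name one_year_later is introduced here for illustration.

```python
import pandas as pd

# Rebuild the example frame from the question with real datetime columns.
df = pd.DataFrame({
    'customer_id': [34555, 24252, 96477],
    'date_': pd.to_datetime(['2019-01-01', '2019-01-01', '2019-01-01']),
    'buy_date': pd.to_datetime(['2017-02-01', '2018-02-10', '2017-02-18']),
})

# One calendar year after each purchase (hypothetical helper name).
one_year_later = df['buy_date'] + pd.DateOffset(years=1)

# 1 where date_ falls strictly after that anniversary, else 0.
df['buy_date>_than_12_months'] = (df['date_'] > one_year_later).astype(int)
print(df)
```

Comparing against the shifted date avoids going through .dt.days and reads closer to the wording of the question.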

import pandas as pd
import numpy as np
values = {'customer_id': [34555,24252,96477],
'date_': ['2019-01-01','2019-01-01','2019-01-01'],
'buy_date': ['2017-02-01','2018-02-10','2017-02-18'],
}
df = pd.DataFrame(values, columns = ['customer_id', 'date_', 'buy_date'])
df['date_'] = pd.to_datetime(df['date_'], format='%Y-%m-%d')
df['buy_date'] = pd.to_datetime(df['buy_date'], format='%Y-%m-%d')
print(df['date_'] - df['buy_date'])
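# Vectorized check: is date_ more than one calendar year after buy_date?
# (np.timedelta64(1, 'Y') is an ambiguous unit that newer pandas versions do not accept.)
df['buy_date>_than_12_months'] = (df['date_'] > df['buy_date'] + pd.DateOffset(years=1)).astype(int)
print(df)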

Related

Pandas change time values based on condition

I have a dataframe:
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
I would like to convert the time based on conditions: if the hour is less than 9, I want to set it to 9 and if the hour is more than 17, I need to set it to 17.
I tried this approach:
df['time'] = np.where(((df['time'].dt.hour < 9) & (df['time'].dt.hour != 0)), dt.time(9, 00))
I am getting an error: Can only use .dt accessor with datetimelike values.
Can anyone please help me with this? Thanks.
Here's a way to do what your question asks:
df.time = pd.to_datetime(df.time)
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
Input:
time
0 2022-06-06 08:45:00
1 2022-06-06 09:30:00
2 2022-06-06 18:00:00
3 2022-06-06 15:00:00
Output:
time
0 2022-06-06 09:45:00
1 2022-06-06 09:30:00
2 2022-06-06 17:00:00
3 2022-06-06 15:00:00
UPDATE:
Here's alternative code to try to address OP's error as described in the comments:
import pandas as pd
import datetime
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print('', 'df loaded as strings:', df, sep='\n')
df.time = pd.to_datetime(df.time, format='%H:%M:%S')
print('', 'df converted to datetime by pd.to_datetime():', df, sep='\n')
df.loc[df.time.dt.hour < 9, 'time'] = (df.time.astype('int64') + (9 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.loc[df.time.dt.hour > 17, 'time'] = (df.time.astype('int64') + (17 - df.time.dt.hour)*3600*1000000000).astype('datetime64[ns]')
df.time = [time.time() for time in pd.to_datetime(df.time)]
print('', 'df with time column adjusted to have hour between 9 and 17, converted to type "time":', df, sep='\n')
Output:
df loaded as strings:
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
df converted to datetime by pd.to_datetime():
time
0 1900-01-01 08:45:00
1 1900-01-01 09:30:00
2 1900-01-01 18:00:00
3 1900-01-01 15:00:00
df with time column adjusted to have hour between 9 and 17, converted to type "time":
time
0 09:45:00
1 09:30:00
2 17:00:00
3 15:00:00
UPDATE #2:
To not just change the hour for out-of-window times, but to simply apply 9:00 and 17:00 as min and max times, respectively (see OP's comment on this), you can do this:
df.loc[df['time'].dt.hour < 9, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[9]*len(df.index)}))
df.loc[df['time'].dt.hour > 17, 'time'] = pd.to_datetime(pd.DataFrame({
'year':df['time'].dt.year, 'month':df['time'].dt.month, 'day':df['time'].dt.day,
'hour':[17]*len(df.index)}))
df['time'] = [time.time() for time in pd.to_datetime(df['time'])]
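If the goal is simply to clamp every time into the 09:00 to 17:00 window, one alternative is to treat the strings as offsets from midnight and use Series.clip. This is a sketch under that assumption, not the approach from the answer above; the column name time_clamped and the anchor date 1900-01-01 are arbitrary choices made for this example.

```python
import pandas as pd

df = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00', '15:00:00']})

# Parse the HH:MM:SS strings as durations since midnight.
elapsed = pd.to_timedelta(df['time'])

# Clamp every value into the 09:00-17:00 window.
clamped = elapsed.clip(lower=pd.Timedelta(hours=9), upper=pd.Timedelta(hours=17))

# Render back to HH:MM:SS strings by adding an arbitrary anchor date.
df['time_clamped'] = (clamped + pd.Timestamp('1900-01-01')).dt.strftime('%H:%M:%S')
print(df)
```

Unlike shifting only the hour, this maps 08:45:00 to 09:00:00 and leaves in-window values untouched.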
Since your 'time' column contains strings, they can be kept as strings, with new string values assigned where appropriate. To filter for your criteria it is convenient to: create a datetime Series from the 'time' column, create boolean Series by comparing the datetime Series with your criteria, and use the boolean Series to select the rows that need to be changed.
Your data:
import numpy as np
import pandas as pd
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00']}
df = pd.DataFrame(data)
print(df.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
Convert to datetime, make boolean Series with your criteria
dts = pd.to_datetime(df['time'])
lt_nine = dts.dt.hour < 9
gt_seventeen = (dts.dt.hour >= 17)
print(lt_nine)
print(gt_seventeen)
>>>
0 True
1 False
2 False
3 False
Name: time, dtype: bool
0 False
1 False
2 True
3 False
Name: time, dtype: bool
Use the boolean series to assign a new value:
df.loc[lt_nine,'time'] = '09:00:00'
df.loc[gt_seventeen,'time'] = '17:00:00'
print(df.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
Or just stick with strings altogether and create the boolean Series using regex patterns and .str.match.
data = {'time':['08:45:00', '09:30:00', '18:00:00', '15:00:00','07:22:00','22:02:06']}
dg = pd.DataFrame(data)
print(dg.to_string())
>>>
time
0 08:45:00
1 09:30:00
2 18:00:00
3 15:00:00
4 07:22:00
5 22:02:06
# regex patterns
# group the alternatives so the ^ anchor clearly applies to every hour
pattern_lt_nine = '^(?:00|01|02|03|04|05|06|07|08)'
pattern_gt_seventeen = '^(?:17|18|19|20|21|22|23)'
Make boolean Series and assign new values
gt_seventeen = dg['time'].str.match(pattern_gt_seventeen)
lt_nine = dg['time'].str.match(pattern_lt_nine)
dg.loc[lt_nine,'time'] = '09:00:00'
dg.loc[gt_seventeen,'time'] = '17:00:00'
print(dg.to_string())
>>>
time
0 09:00:00
1 09:30:00
2 17:00:00
3 15:00:00
4 09:00:00
5 17:00:00
Time series / date functionality
Working with text data
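As a variation on the string-only approach, zero-padded HH:MM:SS strings sort in the same order as the times they represent, so plain string comparisons can stand in for the regex patterns. The sketch below assumes every value is zero-padded in exactly that format; the names too_early and too_late are made up for this example.

```python
import pandas as pd

dg = pd.DataFrame({'time': ['08:45:00', '09:30:00', '18:00:00',
                            '15:00:00', '07:22:00', '22:02:06']})

# Lexicographic comparison is safe only because the strings are zero-padded.
too_early = dg['time'] < '09:00:00'
too_late = dg['time'] > '17:00:00'

dg.loc[too_early, 'time'] = '09:00:00'
dg.loc[too_late, 'time'] = '17:00:00'
print(dg.to_string())
```

Both masks are computed before any assignment, so the second assignment does not depend on the first.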

I want percentage change in two time column (column format is hh:mm:ss) in pandas

import pandas as pd
import numpy as np
data = {'Name':['Si','Ov','Sp','Sa','An'],
'Time1':['02:00:00', '03:02:00', '04:00:30','01:02:30','0'],
'Time2':['03:00:00', '0', '05:00:30','02:02:30','02:00:00']}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print (df)
Output
Name Time1 Time2
0 Si 02:00:00 03:00:00
1 Ov 03:02:00 0
2 Sp 04:00:30 05:00:30
3 Sa 01:02:30 02:02:30
4 An 0 02:00:00
I want to add one more column and apply the formula
Time3=(Time1-Time2)/Time2
There are also 0 or NaN values.
Use to_timedelta to convert the times to timedeltas:
t1 = pd.to_timedelta(df['Time1'])
t2 = pd.to_timedelta(df['Time2'])
df['Time3'] = t1.sub(t2).div(t2)
print (df)
Name Time1 Time2 Time3
0 Si 02:00:00 03:00:00 -0.333333
1 Ov 03:02:00 0 inf
2 Sp 04:00:30 05:00:30 -0.199667
3 Sa 01:02:30 02:02:30 -0.489796
4 An 0 02:00:00 -1.000000
EDIT:
To add a new row and column, use:
def format_timedelta(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
t1 = pd.to_timedelta(df['Time1'])
t2 = pd.to_timedelta(df['Time2'])
df['Time3'] = t1.sub(t2).div(t2)
idx = len(df)
df.loc[idx] = (pd.concat([t1, t2], axis=1)
.sum()
.apply(format_timedelta))
df.loc[idx, ['Name','Time3']] = ['Total', df['Time3'].mask(np.isinf(df['Time3'])).sum()]
print (df)
Name Time1 Time2 Time3
0 Si 02:00:00 03:00:00 -0.333333
1 Ov 03:02:00 0 inf
2 Sp 04:00:30 05:00:30 -0.199667
3 Sa 01:02:30 02:02:30 -0.489796
4 An 0 02:00:00 -1.000000
5 Total 10:05:00 12:03:00 -2.022796
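The inf in Time3 comes from dividing by a Time2 of zero. If a missing value is preferable there, one option is to replace the infinities after the division. This is a small sketch on made-up rows (zeros are written as '00:00:00' here to keep the parsing unambiguous), not part of the answer above.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; zero durations are spelled out as '00:00:00'.
df = pd.DataFrame({
    'Time1': ['02:00:00', '03:02:00', '00:00:00'],
    'Time2': ['03:00:00', '00:00:00', '02:00:00'],
})

t1 = pd.to_timedelta(df['Time1'])
t2 = pd.to_timedelta(df['Time2'])

# Relative change; a zero Time2 produces inf, which is turned into NaN.
ratio = t1.sub(t2).div(t2)
df['Time3'] = ratio.replace([np.inf, -np.inf], np.nan)
print(df)
```

With NaN in place of inf, a plain sum() skips those rows, so the separate mask(np.isinf(...)) step would no longer be needed.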

How to create a new column in my data frame by conditions

I'm trying to create a new column in my data frame using the following conditions:
If the value in Date_of_basket_entry is NaN, respond 0.
If the value in Date_of_basket_entry is greater than the value in month_year (date still in the future), respond 1.
If the value in Date_of_basket_entry is lower than the value in month_year (date already in the past), respond 0.
month_year Date_of_basket_entry
0 03/2017 01.04.2005
1 02/2019 01.01.1995
2 07/2017 None
4 02/2017 None
5 04/2017 01.01.2020
it should be something like this:
month_year Date_of_basket_entry Date_of_basket_boolean
0 03/2017 01.04.2005 0
1 02/2019 01.01.1995 0
2 07/2017 None 0
4 02/2017 None 0
5 04/2017 01.01.2020 1
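@Danielhab I like np.where for this situation.
import numpy as np
import pandas as pd
# The comparison only works on real datetimes, so parse both columns first.
# The format strings are assumptions about the data; adjust them if needed.
month_year = pd.to_datetime(df['month_year'], format='%m/%Y')
entry = pd.to_datetime(df['Date_of_basket_entry'], format='%d.%m.%Y')
df['Date_of_basket_boolean'] = np.where(entry.isna() | (entry < month_year), 0, 1)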
I think this should work, just check your logic.
I think it may be difficult to compare month/year to month.day.year. I would start by converting columns to have the same structure. Then you can use numpy's np.where function.
import pandas as pd
import numpy as np
df = pd.DataFrame({'month_year':['03/2017','02/2019', '07/2017', '02/2017', '04/2017'],
'Date_of_basket_entry':['1.04.2005','01.01.1995', None, None, '01.01.2020']})
# Explicit formats: month/year and month.day.year (infer_datetime_format is deprecated in pandas 2.x)
df['new1'] = pd.to_datetime(df['month_year'], format='%m/%Y')
df['new2'] = pd.to_datetime(df['Date_of_basket_entry'], format='%m.%d.%Y')
print(df)
month_year Date_of_basket_entry new1 new2
0 03/2017 1.04.2005 2017-03-01 2005-01-04
1 02/2019 01.01.1995 2019-02-01 1995-01-01
2 07/2017 None 2017-07-01 NaT
3 02/2017 None 2017-02-01 NaT
4 04/2017 01.01.2020 2017-04-01 2020-01-01
df['Date_of_basket_boolean'] = np.where(df['new2']>df['new1'],1,0)
print(df)
month_year Date_of_basket_entry new1 new2 Date_of_basket_boolean
0 03/2017 1.04.2005 2017-03-01 2005-01-04 0
1 02/2019 01.01.1995 2019-02-01 1995-01-01 0
2 07/2017 None 2017-07-01 NaT 0
3 02/2017 None 2017-02-01 NaT 0
4 04/2017 01.01.2020 2017-04-01 2020-01-01 1
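The missing entries need no special handling in the comparison because any comparison against NaT evaluates to False, so those rows fall through to 0. A minimal sketch of that behaviour follows; the Series names and the day.month.year format are assumptions made for this example.

```python
import pandas as pd

# Assumed formats: month/year and day.month.year.
month_year = pd.to_datetime(pd.Series(['03/2017', '07/2017', '04/2017']), format='%m/%Y')
entry = pd.to_datetime(pd.Series(['01.04.2005', None, '01.01.2020']), format='%d.%m.%Y')

# NaT > anything is False, so the missing entry becomes 0 without an isna() check.
flag = (entry > month_year).astype(int)
print(flag.tolist())
```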

Pandas, is a date holiday?

I have the following pandas dataframe. The dates are with time:
from pandas.tseries.holiday import USFederalHolidayCalendar
import pandas as pd
df = pd.DataFrame([[6,0,"2016-01-02 01:00:00",0.0],
[7,0,"2016-07-04 02:00:00",0.0]])
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2014-01-01', end='2018-12-31')
I want to add a new boolean column that is True if the date is a holiday and False otherwise.
I tried df["hd"] = df[2].isin(holidays), but it doesn't work because of the time component.
Use Series.dt.floor or Series.dt.normalize to remove the times:
df[2] = pd.to_datetime(df[2])
df["hd"] = df[2].dt.floor('d').isin(holidays)
#alternative
df["hd"] = df[2].dt.normalize().isin(holidays)
print (df)
0 1 2 3 hd
0 6 0 2016-01-02 01:00:00 0.0 False
1 7 0 2016-07-04 02:00:00 0.0 True
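For reference, here is the same idea as a self-contained sketch with a named column instead of the integer column label 2. The column name when is hypothetical; the calendar and date range are the ones from the question.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# 'when' is a made-up column name standing in for column 2 of the question.
df = pd.DataFrame({'when': pd.to_datetime(['2016-01-02 01:00:00',
                                           '2016-07-04 02:00:00'])})

holidays = USFederalHolidayCalendar().holidays(start='2014-01-01', end='2018-12-31')

# Drop the time of day before testing membership in the holiday index.
df['hd'] = df['when'].dt.normalize().isin(holidays)
print(df)
```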

How to add time values manually to Pandas dataframe TimeStamp column?

Suppose I have a dataframe df looking like this
df
TimeStamp. Column1......Column n.
2017-01-01
2017-01-02
...
But I want it like this
TimeStamp. Column1......Column n.
2017-01-01 00:00:00
2017-01-02 00:00:00
...
How can I add this (00:00:00) to all TimeStamps in the dataframe? Thanks
Try the code below:
import pandas as pd
df=pd.DataFrame([{"Timestamp":"2017-01-01"},{"Timestamp":"2017-01-01"}],columns=['Timestamp'])
# the values are strings here, so the time part can simply be appended
df_new=df['Timestamp'].apply(lambda k:k+" 00:00:00")
Output:
df_new
0 2017-01-01 00:00:00
1 2017-01-01 00:00:00
Name: Timestamp, dtype: object
import pandas as pd
from datetime import datetime, timedelta
Name = ['a', 'b', 'c', 'd']
Age = [10, 20, 30, 40]
somedate = datetime.date(datetime.now())
DOB = [somedate] * 4
somelistdata = list(zip(Name, Age, DOB))
df = pd.DataFrame(somelistdata, columns = ['Name', 'Age', 'DOB'])
# problem statement
print(df)
# solution to your problem
df['DOB'] = pd.to_datetime(df['DOB']).dt.strftime('%Y-%m-%d %H:%M:%S')
print(df)
Problem statement
Name Age DOB
0 a 10 2019-09-19
1 b 20 2019-09-19
2 c 30 2019-09-19
3 d 40 2019-09-19
Solution
Name Age DOB
0 a 10 2019-09-19 00:00:00
1 b 20 2019-09-19 00:00:00
2 c 30 2019-09-19 00:00:00
3 d 40 2019-09-19 00:00:00
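It may help to separate the stored value from its display: a datetime64 column already carries a time of 00:00:00, and strftime only produces a string column that shows it. The sketch below uses the dates from the question; the column name TimeStamp_str is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'TimeStamp': ['2017-01-01', '2017-01-02']})

# Parsing gives real datetimes whose time component is midnight.
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Formatting yields strings with the time written out explicitly.
df['TimeStamp_str'] = df['TimeStamp'].dt.strftime('%Y-%m-%d %H:%M:%S')
print(df.dtypes)
print(df)
```

Keeping the datetime column alongside the formatted one preserves the ability to do date arithmetic later.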
