How to create a new column in my data frame by conditions - python

I'm trying to create a new column in my data frame based on the following conditions:
If the value in Date_of_basket_entry is NaN, then return 0.
If the value in Date_of_basket_entry is greater than month_year (the date is still in the future), then return 1.
If the value in Date_of_basket_entry is lower than month_year (the date is in the past), then return 0.
  month_year Date_of_basket_entry
0    03/2017           01.04.2005
1    02/2019           01.01.1995
2    07/2017                 None
4    02/2017                 None
5    04/2017           01.01.2020
it should be something like this:
  month_year Date_of_basket_entry  Date_of_basket_boolean
0    03/2017           01.04.2005                       0
1    02/2019           01.01.1995                       0
2    07/2017                 None                       0
4    02/2017                 None                       0
5    04/2017           01.01.2020                       1

@Danielhab I like np.where for this situation.
import pandas as pd
import numpy as np

# if the dtypes are wrong the comparison won't work correctly,
# so parse both columns as datetimes first
df['month_year'] = pd.to_datetime(df['month_year'])
df['Date_of_basket_entry'] = pd.to_datetime(df['Date_of_basket_entry'])

df.loc[:, 'Date_of_basket_boolean'] = np.where(
    df.Date_of_basket_entry.isna() | (df.Date_of_basket_entry < df.month_year), 0, 1)
I think this should work, just check your logic.
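For reference, here is a minimal self-contained sketch applying the same idea to the sample data from the question (treating Date_of_basket_entry as day-first is an assumption; the result is the same either way for this sample):
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'month_year': ['03/2017', '02/2019', '07/2017', '02/2017', '04/2017'],
    'Date_of_basket_entry': ['01.04.2005', '01.01.1995', None, None, '01.01.2020'],
})

# parse both columns, then compare
df['month_year'] = pd.to_datetime(df['month_year'], format='%m/%Y')
df['Date_of_basket_entry'] = pd.to_datetime(df['Date_of_basket_entry'], dayfirst=True)

df['Date_of_basket_boolean'] = np.where(
    df['Date_of_basket_entry'].isna() | (df['Date_of_basket_entry'] < df['month_year']), 0, 1)
print(df['Date_of_basket_boolean'].tolist())  # [0, 0, 0, 0, 1]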

I think it may be difficult to compare month/year with month.day.year, so I would start by converting the columns to the same structure. Then you can use numpy's np.where function.
import pandas as pd
import numpy as np
df = pd.DataFrame({'month_year': ['03/2017', '02/2019', '07/2017', '02/2017', '04/2017'],
                   'Date_of_basket_entry': ['1.04.2005', '01.01.1995', None, None, '01.01.2020']})
df['new1'] = pd.to_datetime(df['month_year'], infer_datetime_format=True)
df['new2'] = pd.to_datetime(df['Date_of_basket_entry'], infer_datetime_format=True)
print(df)
month_year Date_of_basket_entry new1 new2
0 03/2017 1.04.2005 2017-03-01 2005-01-04
1 02/2019 01.01.1995 2019-02-01 1995-01-01
2 07/2017 None 2017-07-01 NaT
3 02/2017 None 2017-02-01 NaT
4 04/2017 01.01.2020 2017-04-01 2020-01-01
Comparisons against NaT evaluate to False, so the rows where Date_of_basket_entry is None automatically get 0:
df['Date_of_basket_boolean'] = np.where(df['new2'] > df['new1'], 1, 0)
print(df)
month_year Date_of_basket_entry new1 new2 Date_of_basket_boolean
0 03/2017 1.04.2005 2017-03-01 2005-01-04 0
1 02/2019 01.01.1995 2019-02-01 1995-01-01 0
2 07/2017 None 2017-07-01 NaT 0
3 02/2017 None 2017-02-01 NaT 0
4 04/2017 01.01.2020 2017-04-01 2020-01-01 1

Related

Is there a way to optimize this date range transformation? Conditional merge in pandas?

I have Sales data like this as a DataFrame; the datatype of the columns is pandas datetime64:
Shop ID  Special Offer Start  Special Offer End
A        '2022-01-01'         '2022-01-03'
B        '2022-01-09'         '2022-01-11'
etc.
I want to transform the data into a new binary format that shows the date in one column and the special offer information as 0 and 1.
The resulting table should look like this:
Shop ID  Date          Special Offer?
A        '2022-01-01'  1
A        '2022-01-02'  1
A        '2022-01-03'  1
B        '2022-01-09'  1
B        '2022-01-10'  1
B        '2022-01-11'  1
I wrote a function which iterates over every row and creates a DataFrame from a pandas date_range plus the Special Offer information; these DataFrames are then concatenated. As you can imagine, the code runs very slowly.
I was thinking of appending a "Special Offer?" column to the Sales DataFrame and then joining it to a DataFrame containing all dates. Afterwards I could just fill the NaNs with dropna or fillna. But I couldn't find a function that lets me join on conditions in pandas.
See example below:
Shop ID  Special Offer Start  Special Offer End  Special Offer?
A        '2022-01-01'         '2022-01-03'       1
B        '2022-01-09'         '2022-01-11'       1
join with (the join condition being: if Date between Special Offer Start and Special Offer End):
Date
'2022-01-01'
'2022-01-02'
'2022-01-03'
'2022-01-04'
'2022-01-05'
'2022-01-06'
'2022-01-07'
'2022-01-08'
'2022-01-09'
'2022-01-10'
'2022-01-11'
creates:
Shop ID  Date          Special Offer?
A        '2022-01-01'  1
A        '2022-01-02'  1
A        '2022-01-03'  1
A        '2022-01-04'  NaN
A        '2022-01-05'  NaN
A        '2022-01-06'  NaN
A        '2022-01-07'  NaN
A        '2022-01-08'  NaN
A        '2022-01-09'  NaN
A        '2022-01-10'  NaN
A        '2022-01-11'  NaN
B        '2022-01-01'  NaN
B        '2022-01-02'  NaN
B        '2022-01-03'  NaN
B        '2022-01-04'  NaN
B        '2022-01-05'  NaN
B        '2022-01-06'  NaN
B        '2022-01-07'  NaN
B        '2022-01-08'  NaN
B        '2022-01-09'  1
B        '2022-01-10'  1
B        '2022-01-11'  1
EDIT:
here is the code I've written:
new_list = []
for i, row in sales_df.iterrows():
    df = pd.DataFrame(pd.date_range(start=row["Special Offer Start"], end=row["Special Offer End"]),
                      columns=['Date'])
    df['Shop ID'] = row['Shop ID']
    df["Special Offer?"] = 1
    new_list.append(df)
result = pd.concat(new_list).reset_index(drop=True)
Update: The Shop ID column is missing.
You can use date_range to expand the dates:
# Setup minimal reproducible example
data = [{'Shop ID': 'A', 'Special Offer Start': '2022-01-01', 'Special Offer End': '2022-01-03'},
        {'Shop ID': 'B', 'Special Offer Start': '2022-01-09', 'Special Offer End': '2022-01-11'}]
df = pd.DataFrame(data)
# Not mandatory if the columns are already datetimes
df['Special Offer Start'] = pd.to_datetime(df['Special Offer Start'])
df['Special Offer End'] = pd.to_datetime(df['Special Offer End'])
# create full date range
start = df['Special Offer Start'].min()
end = df['Special Offer End'].max()
dti = pd.date_range(start, end, freq='D', name='Date')
date_range = lambda x: pd.date_range(x['Special Offer Start'], x['Special Offer End'])

out = (df.assign(Offer=df.apply(date_range, axis=1), dummy=1)
         .explode('Offer')
         .pivot_table(index='Offer', columns='Shop ID', values='dummy', fill_value=0)
         .reindex(dti, fill_value=0)
         .unstack().rename('Special Offer?').reset_index())
>>> out
Shop ID Date Special Offer?
0 A 2022-01-01 1
1 A 2022-01-02 1
2 A 2022-01-03 1
3 A 2022-01-04 0
4 A 2022-01-05 0
5 A 2022-01-06 0
6 A 2022-01-07 0
7 A 2022-01-08 0
8 A 2022-01-09 0
9 A 2022-01-10 0
10 A 2022-01-11 0
11 B 2022-01-01 0
12 B 2022-01-02 0
13 B 2022-01-03 0
14 B 2022-01-04 0
15 B 2022-01-05 0
16 B 2022-01-06 0
17 B 2022-01-07 0
18 B 2022-01-08 0
19 B 2022-01-09 1
20 B 2022-01-10 1
21 B 2022-01-11 1
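As a side note on the conditional join the question asks about: it can be approximated with a cross merge followed by a between filter. A minimal sketch, assuming the column names above (how='cross' needs pandas 1.2+; the frame names here are made up):
import pandas as pd

sales_df = pd.DataFrame({'Shop ID': ['A', 'B'],
                         'Special Offer Start': pd.to_datetime(['2022-01-01', '2022-01-09']),
                         'Special Offer End': pd.to_datetime(['2022-01-03', '2022-01-11'])})
all_dates = pd.DataFrame({'Date': pd.date_range('2022-01-01', '2022-01-11', freq='D')})

# pair every shop with every date, then flag the dates inside the offer window
merged = sales_df.merge(all_dates, how='cross')
merged['Special Offer?'] = merged['Date'].between(
    merged['Special Offer Start'], merged['Special Offer End']).astype(int)
out = merged[['Shop ID', 'Date', 'Special Offer?']]
Because the flag is computed directly as 0/1, no fillna step is needed afterwards.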

Pandas df comparing two dates condition

I'd like to add 1 if date_ is more than 12 months after buy_date, else 0.
example df
customer_id       date_    buy_date
      34555  2019-01-01  2017-02-01
      24252  2019-01-01  2018-02-10
      96477  2019-01-01  2017-02-18
output df
customer_id       date_    buy_date  buy_date>_than_12_months
      34555  2019-01-01  2017-02-01                         1
      24252  2019-01-01  2018-02-10                         0
      96477  2019-01-01  2017-02-18                         1
Based on what I understand, you can try adding a year to buy_date, subtracting the result from date_, and then checking whether the number of days is positive or negative.
df['buy_date>_than_12_months'] = ((df['date_'] - (df['buy_date'] + pd.offsets.DateOffset(years=1)))
                                  .dt.days.gt(0).astype(int))
print(df)
customer_id date_ buy_date buy_date>_than_12_months
0 34555 2019-01-01 2017-02-01 1
1 24252 2019-01-01 2018-02-10 0
2 96477 2019-01-01 2017-02-18 1
import pandas as pd
import numpy as np

values = {'customer_id': [34555, 24252, 96477],
          'date_': ['2019-01-01', '2019-01-01', '2019-01-01'],
          'buy_date': ['2017-02-01', '2018-02-10', '2017-02-18'],
          }
df = pd.DataFrame(values, columns=['customer_id', 'date_', 'buy_date'])
df['date_'] = pd.to_datetime(df['date_'], format='%Y-%m-%d')
df['buy_date'] = pd.to_datetime(df['buy_date'], format='%Y-%m-%d')
print(df['date_'] - df['buy_date'])
df['buy_date>_than_12_months'] = pd.Series(
    [1 if (df['date_'] - df['buy_date'])[i] > np.timedelta64(1, 'Y') else 0 for i in range(3)])
print(df)
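One caveat: newer pandas versions no longer accept timedeltas with 'Y' units, so the np.timedelta64(1, 'Y') comparison above may raise there. A sketch of an offset-based alternative on the same frame:
# compare against buy_date shifted by one calendar year instead of using a 'Y' timedelta
df['buy_date>_than_12_months'] = (df['date_'] > df['buy_date'] + pd.DateOffset(years=1)).astype(int)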

Pandas, is a date holiday?

I have the following pandas dataframe; the dates include a time component:
from pandas.tseries.holiday import USFederalHolidayCalendar
import pandas as pd

df = pd.DataFrame([[6, 0, "2016-01-02 01:00:00", 0.0],
                   [7, 0, "2016-07-04 02:00:00", 0.0]])
cal = USFederalHolidayCalendar()
holidays = cal.holidays(start='2014-01-01', end='2018-12-31')
I want to add a new boolean column that is True/False depending on whether the date is a holiday.
I tried df["hd"] = df[2].isin(holidays), but it doesn't work because of the time component.
Use Series.dt.floor or Series.dt.normalize to remove the times:
df[2] = pd.to_datetime(df[2])
df["hd"] = df[2].dt.floor('d').isin(holidays)
#alternative
df["hd"] = df[2].dt.normalize().isin(holidays)
print (df)
0 1 2 3 hd
0 6 0 2016-01-02 01:00:00 0.0 False
1 7 0 2016-07-04 02:00:00 0.0 True

python compare date list to start and end date columns in dataframe

Problem: I have a dataframe with two columns: Start date and End date. I also have a list of dates. So let's say the data looks something like this:
data = [['1/1/2018', '3/1/2018'], ['2/1/2018', '3/1/2018'], ['4/1/2018', '6/1/2018']]
df = pd.DataFrame(data, columns=['startdate', 'enddate'])
dates = ['1/1/2018', '2/1/2018']
What I need to do is:
1) Create a new column for each date in the dates list.
2) For each row in the df, if the date for the newly created column is between the start and end date, assign a 1; if not, assign a 0.
I have tried to use zip, but then I realized that the df will have thousands of rows whereas the dates list will contain only about 24 items (spanning 2 years), so zip stops when the dates list is exhausted, i.e. at 24.
So below is what the original df looks like and how it should look like afterwards:
Before:
startdate enddate
0 2018-01-01 2018-03-01
1 2018-02-01 2018-03-01
2 2018-04-01 2018-06-01
After:
startdate enddate 1/1/2018 2/1/2018
0 1/1/2018 3/1/2018 1 1
1 2/1/2018 3/1/2018 0 1
2 4/1/2018 6/1/2018 0 0
Any help on this would be much appreciated, thanks!
Using numpy broadcast
# make sure the date columns are real datetimes before comparing
df['startdate'] = pd.to_datetime(df['startdate'])
df['enddate'] = pd.to_datetime(df['enddate'])

s1 = df.startdate.values
s2 = df.enddate.values
v = pd.to_datetime(pd.Series(dates)).values[:, None]

newdf = pd.DataFrame(((s1 <= v) & (s2 >= v)).T.astype(int), columns=dates, index=df.index)
pd.concat([df, newdf], axis=1)
startdate enddate 1/1/2018 2/1/2018
0 2018-01-01 2018-03-01 1 1
1 2018-02-01 2018-03-01 0 1
2 2018-04-01 2018-06-01 0 0
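For a handful of dates (about 24 here), a plain loop over the dates list is an equivalent, easier-to-read sketch, assuming startdate and enddate have already been parsed with pd.to_datetime as above:
for d in dates:
    dt = pd.to_datetime(d)
    # 1 if the date falls inside [startdate, enddate], else 0
    df[d] = ((df['startdate'] <= dt) & (df['enddate'] >= dt)).astype(int)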

finding time slots present between start time and end time in python

We have a CSV file containing predefined time slots.
Given a start time and an end time provided by the user, we want the time slots that fall between the start time and the end time.
e.g.
start time = 11:00:00
end time = 19:00:00
output: slot_no 2, 3, 4, 5
I think you need boolean indexing with loc and between to select the Slot_no column; all columns and values are converted with to_timedelta, and midnight is replaced with 24:00:00:
import pandas as pd

df = pd.DataFrame(
    {'Slot_no': [1, 2, 3, 4, 5, 6, 7],
     'start_time': ['0:01:00', '8:01:00', '10:01:01', '12:01:00', '14:01:00', '18:01:01', '20:01:00'],
     'end_time': ['8:00:00', '10:00:00', '12:00:00', '14:00:00', '18:00:00', '20:00:00', '0:00:00']})
# reindex_axis was removed in newer pandas; reindex with axis=1 does the same
df = df.reindex(['Slot_no', 'start_time', 'end_time'], axis=1)
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
print (df)
   Slot_no      start_time        end_time
0        1 0 days 00:01:00 0 days 08:00:00
1        2 0 days 08:01:00 0 days 10:00:00
2        3 0 days 10:01:01 0 days 12:00:00
3        4 0 days 12:01:00 0 days 14:00:00
4        5 0 days 14:01:00 0 days 18:00:00
5        6 0 days 18:01:01 0 days 20:00:00
6        7 0 days 20:01:00 1 days 00:00:00
start = pd.to_timedelta('11:00:00')
end = pd.to_timedelta('19:00:00')
mask = df['start_time'].between(start, end) | df['end_time'].between(start, end)
s = df.loc[mask, 'Slot_no']
print (s)
2 3
3 4
4 5
5 6
Name: Slot_no, dtype: int64
L = df.loc[mask, 'Slot_no'].tolist()
print (L)
[3, 4, 5, 6]
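To tie this back to the user-supplied start and end times from the question, the lookup can be wrapped in a small helper (a sketch; the function name is made up):
def slots_between(df, start_str, end_str):
    """Return the Slot_no values whose start or end time falls inside the requested window."""
    start = pd.to_timedelta(start_str)
    end = pd.to_timedelta(end_str)
    mask = df['start_time'].between(start, end) | df['end_time'].between(start, end)
    return df.loc[mask, 'Slot_no'].tolist()

print(slots_between(df, '11:00:00', '19:00:00'))  # [3, 4, 5, 6]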
