I have some data that looks like this:
Dates Delta
0 2022-10-01 10
1 2022-10-01 21
2 2022-10-01 34
I am trying to add a new column in which the number in the Delta column is subtracted, as a number of days, from the date in the Dates column. Ideally, the output will look like this (I did this by hand, so please excuse me if the dates are wrong).
Dates Delta CalculatedDate
0 2022-10-01 10 2022-09-21
1 2022-10-01 21 2022-09-10
2 2022-10-01 34 2022-08-23
I've tried various versions of this and I'm not having any luck.
# importing libraries to create and manipulate toy data
import pandas as pd
from datetime import datetime, timedelta
# create toy data
df = pd.DataFrame({'Dates': ['2022-10-01', '2022-10-01', '2022-10-01'],
'Delta': [10, 21, 34]})
# cast the `Dates` column as dates
df['Dates'] = pd.to_datetime(df['Dates'])
##### Need help here
# Create a new column, showing the calculated date
df['CalculatedDate'] = df['Dates'] - timedelta(days=df['Delta'])
df['CalculatedDate'] = pd.to_datetime(df['Dates']) - pd.to_timedelta(df['Delta'], unit='D')
df
Dates Delta CalculatedDate
0 2022-10-01 10 2022-09-21
1 2022-10-01 21 2022-09-10
2 2022-10-01 34 2022-08-28
Here is one way to do it
# for each row subtract the delta from the date in the row
# using Day offset
df['calculatedDate']= df.apply(lambda x: x['Dates'] - pd.offsets.Day(x['Delta']), axis=1)
df
Dates Delta calculatedDate
0 2022-10-01 10 2022-09-21
1 2022-10-01 21 2022-09-10
2 2022-10-01 34 2022-08-28
I see Naveed and Panda have fixes that work great. Here is the one I came up with as well:
for x in range(len(df)):
    df.loc[x,'CalculatedDate'] = df.loc[x, 'Dates'] - timedelta(days=int(df.loc[x, 'Delta']))
print(df)
Put it in a for loop so that you can index through each row and do each row individually. Also make df['CalculatedDate'] into df.loc[x,'CalculatedDate']. This way you will do each row individually. Hope this helps
Related
I have a table like below containing values for multiple IDs:
ID  value  date
1   20     2022-01-01 12:20
2   25     2022-01-04 18:20
1   10     2022-01-04 11:20
1   150    2022-01-06 16:20
2   200    2022-01-08 13:20
3   40     2022-01-04 21:20
1   75     2022-01-09 08:20
I would like to calculate the week-wise sum of values for each ID:
The start date is given (for example, 01-01-2022).
Weeks are calculated based on range:
every Saturday 00:00 to next Friday 23:59 (i.e. Week 1 is from 01-01-2022 00:00 to 07-01-2022 23:59)
ID  Week 1 sum  Week 2 sum  Week 3 sum  ...
1   180         75          --          --
2   25          200         --          --
3   40          --          --          --
There's a pandas class (pd.Grouper) that lets you specify a groupby instruction [1]. In this case, that specification is to "resample" date by a weekly frequency whose weeks end on Fridays [2]. Since you also need to group by ID, pass it to groupby alongside the grouper.
# convert to datetime
df['date'] = pd.to_datetime(df['date'])
# pivot the dataframe
df1 = (
    df.groupby(['ID', pd.Grouper(key='date', freq='W-FRI')])['value'].sum()
    .unstack(fill_value=0)
)
# rename columns
df1.columns = [f"Week {c} sum" for c in range(1, df1.shape[1]+1)]
df1 = df1.reset_index()
[1] What you actually need is a pivot_table result, but groupby + unstack is equivalent to pivot_table and is more convenient here.
[2] Because Jan 1, 2022 is a Saturday, you need to anchor the weeks on Friday.
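If the week numbering should be tied to the given start date rather than to whichever dates happen to appear in the data, one option is to compute the week number arithmetically. This is only a sketch: `start`, `week` and `weekly` are names introduced here for illustration, and it assumes no row falls before the start date.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 1, 1, 2, 3, 1],
    'value': [20, 25, 10, 150, 200, 40, 75],
    'date': ['2022-01-01 12:20', '2022-01-04 18:20', '2022-01-04 11:20',
             '2022-01-06 16:20', '2022-01-08 13:20', '2022-01-04 21:20',
             '2022-01-09 08:20'],
})
df['date'] = pd.to_datetime(df['date'])

start = pd.Timestamp('2022-01-01')  # hypothetical name for the given start date

# whole days elapsed since the start, integer-divided by 7, gives a 1-based week number
df['week'] = (df['date'] - start).dt.days // 7 + 1

# one row per ID, one column per week, missing combinations filled with 0
weekly = df.pivot_table(index='ID', columns='week', values='value',
                        aggfunc='sum', fill_value=0)
weekly.columns = [f"Week {w} sum" for w in weekly.columns]
weekly = weekly.reset_index()
print(weekly)
```

With the sample data this reproduces the sums in the question (180 and 75 for ID 1, 25 and 200 for ID 2, 40 for ID 3), with 0 where the question shows "--".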
You can compute a week column. If all your data falls within the same year, you can extract just the week number, but that is unlikely in real-world scenarios. If your data spans multiple years, it is wiser to derive a combination of year and week number.
df['Year-Week'] = df['date'].dt.strftime('%Y-%U')
In your case the dates 2022-01-01 and 2022-01-04 18:20 should both be converted to 2022-01 as per the scenario you described. Note that %U starts weeks on Sunday and numbers the days before the first Sunday as week 00, so 2022-01-01 (a Saturday) comes out as 2022-00 with this format.
To calculate your pivot table, you can use the pandas pivot_table. Example code:
pd.pivot_table(df, values='value', index=['ID'], columns=['Year-Week'], aggfunc='sum')
Let's define a formatting helper.
def fmt(row):
    return f"{row.year}-{row.week:02d}" # We ignore row.day
Now it's easy.
>>> df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
...                    dict(id=2, value=25, date="2022-01-04 18:20")])
>>> df['date'] = pd.to_datetime(df.date)
>>> df['iso'] = df.date.dt.isocalendar().apply(fmt, axis='columns')
>>> df
id value date iso
0 1 20 2022-01-01 12:20:00 2021-52
1 2 25 2022-01-04 18:20:00 2022-01
Then just group by the ISO week.
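The final grouping step might look like the following sketch. It builds the `iso` key column-wise from `isocalendar()` instead of using the row-wise helper, which is a variation on the answer rather than its exact code; the third row and the name `sums` are added here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame([dict(id=1, value=20, date="2022-01-01 12:20"),
                   dict(id=2, value=25, date="2022-01-04 18:20"),
                   dict(id=2, value=5, date="2022-01-05 09:00")])  # extra illustrative row
df['date'] = pd.to_datetime(df['date'])

# ISO year and zero-padded ISO week joined into one key such as "2022-01"
iso = df['date'].dt.isocalendar()
df['iso'] = iso['year'].astype(str) + '-' + iso['week'].astype(str).str.zfill(2)

# sum per id and ISO week, with one column per week
sums = df.groupby(['id', 'iso'])['value'].sum().unstack(fill_value=0)
print(sums)
```

Keep in mind that ISO weeks run Monday to Sunday, so this grouping does not reproduce the Saturday-to-Friday weeks from the question.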
I have a dataset with a date-time column in a specific format. I need to create new features from this column, which means adding new columns to the dataframe by extracting information from it. My sample input dataframe is shown below.
id datetime feature2
1 12/3/2020 0:56 1
2 11/25/2020 13:26 0
The expected output is:
id date hour mints feature2
1 12/3/2020 0 56 1
2 11/25/2020 13 26 0
Pandas apply() method may not work for this as new columns are added. What is the best way to do this?
Is there any way which I can apply a single function on each record of the column to do this by applying on the whole column?
pandas series .dt accessor
Your datetime data is coming from a pandas column (series), so use the .dt accessor
import pandas as pd
df = pd.DataFrame({'id': [1, 2],
'datetime': ['12/3/2020 0:56', '11/25/2020 13:26'],
'feature2': [1, 0]})
df['datetime'] = pd.to_datetime(df['datetime'])
id datetime feature2
1 2020-12-03 00:56:00 1
2 2020-11-25 13:26:00 0
# create columns
df['hour'] = df['datetime'].dt.hour
df['min'] = df['datetime'].dt.minute
df['date'] = df['datetime'].dt.date
# final
id datetime feature2 hour min date
1 2020-12-03 00:56:00 1 0 56 2020-12-03
2 2020-11-25 13:26:00 0 13 26 2020-11-25
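To end up with the column names and order from the question (id, date, hour, mints, feature2), the extracted pieces can be assembled into a new frame. This is a sketch under the assumption that the input always follows the month/day/year hour:minute pattern; `parsed` and `out` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'datetime': ['12/3/2020 0:56', '11/25/2020 13:26'],
                   'feature2': [1, 0]})

# parse once with an explicit month/day/year format, then reuse the parsed Series
parsed = pd.to_datetime(df['datetime'], format='%m/%d/%Y %H:%M')

out = pd.DataFrame({
    'id': df['id'],
    'date': parsed.dt.date,      # datetime.date objects, printed as YYYY-MM-DD
    'hour': parsed.dt.hour,
    'mints': parsed.dt.minute,
    'feature2': df['feature2'],
})
print(out)
```

Every `.dt` attribute operates on the whole column at once, so no per-row apply() is needed.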
IIUC
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
df['hour'] = df.index.hour
df['mints'] = df.index.minute
Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times; otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
['2017-02-15 16:33:00', 'Scott', '10'],
['2017-02-15 16:45:00', 'Steve', '5']],
columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns and the same thing happens for asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame column which contains all the required data and then left join with your data frame.
A working code example is the following
import datetime
import pandas as pd
df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()  # existing datetimes; the missing days are appended here
dates = [x.date() for x in datetimes]  # the same values reduced to calendar dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        # midnight timestamp for a day that has no row yet
        datetimes.append(pd.Timestamp(forward_date))
# create new dataframe (sorted so the filler days land in order), merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': sorted(datetimes)}).merge(df, on='Datetime', how='left')
df['Count'] = df['Count'].fillna(0)
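A loop-free sketch of the same idea: build the full daily range with pd.date_range, keep only the days that have no row, and append them. The names `present`, `all_days`, `missing`, `filler` and `result` are introduced here, and Count is kept as integers for simplicity (the question stores it as strings).

```python
import pandas as pd

df = pd.DataFrame({'Datetime': pd.to_datetime(['2017-02-12 20:25:00',
                                               '2017-02-15 16:33:00',
                                               '2017-02-15 16:45:00']),
                   'Sender': ['Sam', 'Scott', 'Steve'],
                   'Count': [8, 10, 5]})

# midnight timestamps of the days that already have at least one row
present = df['Datetime'].dt.normalize()
all_days = pd.date_range(present.min(), present.max(), freq='D')
missing = all_days.difference(pd.DatetimeIndex(present))

# one filler row per missing day; Sender is left out and becomes NaN after concat
filler = pd.DataFrame({'Datetime': missing, 'Count': 0})
result = (pd.concat([df, filler], ignore_index=True)
            .sort_values('Datetime')
            .reset_index(drop=True))
print(result)
```

Existing rows keep their original times, and only the days without any row get a midnight placeholder.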
I have tried to calculate the number of business days between two dates (stored in separate columns of a dataframe).
MonthBegin MonthEnd
0 2014-06-09 2014-06-30
1 2014-07-01 2014-07-31
2 2014-08-01 2014-08-31
3 2014-09-01 2014-09-30
4 2014-10-01 2014-10-31
I have tried to apply numpy.busday_count but I get the following error:
Iterator operand 0 dtype could not be cast from dtype('<M8[ns]') to dtype('<M8[D]') according to the rule 'safe'
I have tried to change the type into Timestamp as the following :
Timestamp('2014-08-31 00:00:00')
or datetime :
datetime.date(2014, 8, 31)
or to numpy.datetime64:
numpy.datetime64('2014-06-30T00:00:00.000000000')
Anyone knows how to fix it?
Note 1: I have tried np.busday_count in two ways:
1. Passing dataframe columns: t['Days']=np.busday_count(t.MonthBegin,t.MonthEnd)
2. Passing arrays: np.busday_count(dt1,dt2)
Note 2: My dataframe has over 150K rows, so I need an efficient algorithm.
You can use bdate_range. I also corrected your input, since most of the MonthEnd values are earlier than the MonthBegin values.
[len(pd.bdate_range(x,y))for x,y in zip(df['MonthBegin'],df['MonthEnd'])]
Out[519]: [16, 21, 22, 23, 20]
I think the best way to do this is
df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
For my dataframe df as below:
MBegin MEnd
0 2011-01-01 2011-02-01
1 2011-01-10 2011-02-10
2 2011-01-02 2011-02-02
doing :
df['MBegin'] = df['MBegin'].values.astype('datetime64[D]')
df['MEnd'] = df['MEnd'].values.astype('datetime64[D]')
df['busday'] = df.apply(lambda row : np.busday_count(row['MBegin'],row['MEnd']),axis=1)
>>df
MBegin MEnd busday
0 2011-01-01 2011-02-01 21
1 2011-01-10 2011-02-10 23
2 2011-01-02 2011-02-02 22
You need to provide the format in which your dates are written.
a = datetime.strptime('2014-06-9', '%Y-%m-%d')
Do the same for your end date:
b = datetime.strptime('2014-06-30', '%Y-%m-%d')
Now take their difference:
c = b-a
This gives you datetime.timedelta(21). To convert it into days, just use
c.days
which gives you the difference, 21 days. You can now use a list comprehension to get the difference between two dates as days.
You can modify your code to get the desired result as below:
df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01', '2014-10-01', '2014-11-01'],
'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30', '2014-10-31', '2014-11-30']})
df['MonthBegin'] = df['MonthBegin'].astype('datetime64[ns]')
df['MonthEnd'] = df['MonthEnd'].astype('datetime64[ns]')
df['BDays'] = np.busday_count(df['MonthBegin'].tolist(), df['MonthEnd'].tolist())
print(df)
MonthBegin MonthEnd BDays
0 2014-06-09 2014-06-30 15
1 2014-08-01 2014-08-31 21
2 2014-09-01 2014-09-30 21
3 2014-10-01 2014-10-31 22
4 2014-11-01 2014-11-30 20
Additionally, numpy.busday_count has a few other optional arguments, such as weekmask and holidays, which you can use according to your needs.
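The casting error quoted in the question arises because the columns hold nanosecond-resolution values while busday_count works at day resolution. One way to stay fully vectorized, sketched here on the first three rows of the frame above, is to cast both columns to datetime64[D] arrays before the call; `begin` and `end` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MonthBegin': ['2014-06-09', '2014-08-01', '2014-09-01'],
                   'MonthEnd': ['2014-06-30', '2014-08-31', '2014-09-30']})
df['MonthBegin'] = pd.to_datetime(df['MonthBegin'])
df['MonthEnd'] = pd.to_datetime(df['MonthEnd'])

# drop the time-of-day resolution so NumPy sees plain dates
begin = df['MonthBegin'].to_numpy().astype('datetime64[D]')
end = df['MonthEnd'].to_numpy().astype('datetime64[D]')

# busday_count includes the begin date and excludes the end date
df['BDays'] = np.busday_count(begin, end)
print(df)
```

Because the whole computation is a single array call, it should scale to a frame with 150K rows without a Python-level loop.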
My dataset has dates in the European format, and I'm struggling to convert them into the correct format before passing them through pd.to_datetime: for all days up to the 12th, the month and day get switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add format.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
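As a quick check on a few values shaped like those in the question (a sketch; `renewal` here is a small stand-in frame, not the asker's real data):

```python
import pandas as pd

renewal = pd.DataFrame({'Date': ['31/03/2018', '01/04/2018', '28/02/2018']})

# an explicit format removes any guessing about day/month order
renewal['Date'] = pd.to_datetime(renewal['Date'], format='%d/%m/%Y')
print(renewal)
```

With the format spelled out, 01/04/2018 is read as 1 April rather than 4 January, and a value that does not match the pattern raises an error instead of being silently misread.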
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30