I have two dataframes:
print (df1)
ID Birthday
0 A000 1990-01-01
1 A001 1991-05-05
2 A002 1970-10-01
3 A003 1980-07-07
4 A004 1945-08-15
print (df2)
ID Date from
0 A000 2010.01
1 A001 2012.01
2 A002 2010.01
3 A002 2010.01
4 A002 2010.11
5 A003 2009.05
6 A003 2010.01
7 A004 2010.01
8 A005 2007.11
9 A006 2017.01
df1 consists of IDs and birthdays, while df2 contains IDs and dates. Some of the values in df2.ID are not in df1.ID (i.e. A005 and A006).
What I am trying:
I'd like to calculate the difference between df1.Birthday and df2.Date if df2.ID exists in df1.ID.
What I have done so far:
df1['Birthday'] = pd.to_datetime(df1['Birthday'])
df2['Date from'] = pd.to_datetime(df2['Date from'])
x1 = df1.set_index(['ID'])['Birthday']
x2 = df2.set_index(['ID'])['Date from']
x3 = x2.sub(x1,fill_value=0)
print(x3)
ID
A000 -7305 days +00:00:00.000002
A001 -7794 days +00:00:00.000002
A002 -273 days +00:00:00.000002
A002 -273 days +00:00:00.000002
A002 -273 days +00:00:00.000002
A003 -3840 days +00:00:00.000002
A003 -3840 days +00:00:00.000002
A004 8905 days 00:00:00.000002
A005 0 days 00:00:00.000002
A006 0 days 00:00:00.000002
dtype: timedelta64[ns]
There is an error: ID A003 gets the same value in every row even though its rows have different dates. I am not sure how to proceed to the next step. Thank you in advance for any assistance you can provide.
First, I would left merge the dataframes to make sure things were lining up properly, then subtract the two date columns into a new column:
import pandas
from io import StringIO
data1 = StringIO("""\
ID Birthday
A000 1990-01-01
A001 1991-05-05
A002 1970-10-01
A003 1980-07-07
A004 1945-08-15
""")
data2 = StringIO("""\
ID Date_from
A000 2010.01
A001 2012.01
A002 2010.01
A002 2010.01
A002 2010.11
A003 2009.05
A003 2010.01
A004 2010.01
A005 2007.11
A006 2017.01
""")
x1 = pandas.read_csv(data1, sep=r'\s+', parse_dates=['Birthday'])
x2 = pandas.read_csv(data2, sep=r'\s+', parse_dates=['Date_from'])
data = (
x2.merge(right=x1, left_on='ID', right_on='ID', how='left')
.assign(Date_diff=lambda df: df['Date_from'] - df['Birthday'])
)
print(data)
And that gives me:
ID Date_from Birthday Date_diff
0 A000 2010-01-01 1990-01-01 7305 days
1 A001 2012-01-01 1991-05-05 7546 days
2 A002 2010-01-01 1970-10-01 14337 days
3 A002 2010-01-01 1970-10-01 14337 days
4 A002 2010-11-01 1970-10-01 14641 days
5 A003 2009-05-01 1980-07-07 10525 days
6 A003 2010-01-01 1980-07-07 10770 days
7 A004 2010-01-01 1945-08-15 23515 days
8 A005 2007-11-01 NaT NaT
9 A006 2017-01-01 NaT NaT
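If only the matched IDs matter (the question asks for the difference only when df2.ID exists in df1.ID), an inner merge drops the unmatched rows before subtracting; a small sketch with abbreviated versions of the two frames:

```python
import pandas as pd

# Abbreviated versions of df1/df2 from the question
df1 = pd.DataFrame({'ID': ['A000', 'A001'],
                    'Birthday': pd.to_datetime(['1990-01-01', '1991-05-05'])})
df2 = pd.DataFrame({'ID': ['A000', 'A001', 'A005'],
                    'Date_from': pd.to_datetime(['2010-01-01', '2012-01-01', '2007-11-01'])})

# how='inner' keeps only IDs present in both frames, so A005 is dropped
matched = df2.merge(df1, on='ID', how='inner')
matched['Date_diff'] = matched['Date_from'] - matched['Birthday']
print(matched)
```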
Use the dateutil package to get the difference in years, months, and days:
from dateutil import relativedelta as rdelta
from datetime import date
d1 = date(2010, 5, 1)
d2 = date(2012, 1, 1)
rd = rdelta.relativedelta(d2, d1)
print(rd.years, rd.months, rd.days)  # 1 8 0
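To apply this over the merged frame, relativedelta can be computed row-wise with apply; a sketch assuming unmatched rows were already dropped:

```python
import pandas as pd
from dateutil import relativedelta as rdelta

# Hypothetical merged frame with both dates present
data = pd.DataFrame({'ID': ['A000', 'A001'],
                     'Birthday': pd.to_datetime(['1990-01-01', '1991-05-05']),
                     'Date_from': pd.to_datetime(['2010-01-01', '2012-01-01'])})

# relativedelta is a plain-Python object, so compute it row by row
def ymd_diff(row):
    rd = rdelta.relativedelta(row['Date_from'], row['Birthday'])
    return f"{rd.years}y {rd.months}m {rd.days}d"

data['Diff'] = data.apply(ymd_diff, axis=1)
print(data)
```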
I have two tables
masterblok:
BLOCKID PLANTINGDATE PLANTED_HA
A001 01-JAN-08 13.86
A002 01-JAN-08 13.24
A002 31-MAR-18 1.99
A003 01-JAN-08 14.76
A003 31-MAR-18 2.48
pest_perperiod: (note that there is FIELDCODE other than A002)
FIELDCODE PERIOD
A002 2019-01-01
A002 2019-02-01
A002 2019-03-01
A002 2019-04-01
A002 2019-05-01
I want to join the two dataframes so that each row in pest_perperiod gets one or many corresponding PLANTINGDATE values (like a cross join in SQL), so that I can calculate the retention rate since the active month for each BLOCKID and PLANTINGDATE.
I tried (and also the reverse):
pest_perperiod.join(masterblok.set_index('BLOCKID'), on='FIELDCODE')
This returned an error because duplicate values still exist. How can I do this?
I think you just want merge:
masterblok.merge(pest_perperiod, left_on='BLOCKID', right_on='FIELDCODE')
(note the key order: BLOCKID belongs to masterblok and FIELDCODE to pest_perperiod, matching the column order in the output).
output:
BLOCKID PLANTINGDATE PLANTED_HA FIELDCODE PERIOD
0 A002 01-JAN-08 13.24 A002 2019-01-01
1 A002 01-JAN-08 13.24 A002 2019-02-01
2 A002 01-JAN-08 13.24 A002 2019-03-01
3 A002 01-JAN-08 13.24 A002 2019-04-01
4 A002 01-JAN-08 13.24 A002 2019-05-01
5 A002 31-MAR-18 1.99 A002 2019-01-01
6 A002 31-MAR-18 1.99 A002 2019-02-01
7 A002 31-MAR-18 1.99 A002 2019-03-01
8 A002 31-MAR-18 1.99 A002 2019-04-01
9 A002 31-MAR-18 1.99 A002 2019-05-01
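Since the question notes that FIELDCODEs other than A002 exist, it can help to check which rows actually matched; `merge` with `indicator=True` adds a `_merge` column for that. A small sketch with made-up rows:

```python
import pandas as pd

masterblok = pd.DataFrame({'BLOCKID': ['A001', 'A002'],
                           'PLANTINGDATE': ['01-JAN-08', '01-JAN-08'],
                           'PLANTED_HA': [13.86, 13.24]})
pest_perperiod = pd.DataFrame({'FIELDCODE': ['A002', 'A002'],
                               'PERIOD': ['2019-01-01', '2019-02-01']})

# indicator=True adds a _merge column showing where each row came from
out = masterblok.merge(pest_perperiod, left_on='BLOCKID',
                       right_on='FIELDCODE', how='left', indicator=True)
print(out['_merge'].value_counts())
```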
I have three columns in a data frame:
ID - A001
DoA - 15-03-2014 - Date of Admission
DoL - 17-08-2020 - Date of Leaving
Create three new columns:
Cal_Yr - Calendar Year
Str_Date - Start Date
End_Date - End Date
If the year of admission is less than 2015, then
Str_Date = 01-01-2015, else DoA
End_Date = 15-03-2015
I am dividing each year into two parts, one before the anniversary date (the start dd-mm of the year) and one after it, so that I can find the weight of both parts; any date before 01-01-2015 should be revalued as 01-01-2015.
I have to design a loop which creates the repeated rows shown in the figure.
input table is:
ID   DoA        status  DoL        Duration(years)  fee amt
A23  02-Jan-16  DH      18-Aug-18   2                2345
B23  01-Mar-09  IS      31-Dec-20  11                1000
C23  16-Sep-12  SU      12-Jul-19   7               14565
D23  01-Jun-20  LA      07-Sep-20   0                 123
E23  15-Sep-16  IS      31-Dec-20   4                6790
F23  01-Jan-19  IS      31-Dec-20   1                7272
This does what you want. It is not a hard job; like most similar tasks, you just have to take it step by step: "What do I know here?", "What information do I need here?" Note that I have converted the dates to datetime.date objects, assuming you will want to do some analyses based on them.
import pandas as pd
import datetime
data = [
    ["A001", "15-03-2014", "17-08-2020"],
    ["A002", "01-06-2018", "01-06-2020"]
]
rows = []
for id, stdate, endate in data:
    s = stdate.split('-')
    startdate = datetime.date(int(s[2]), int(s[1]), int(s[0]))
    s = endate.split('-')
    enddate = datetime.date(int(s[2]), int(s[1]), int(s[0]))
    for year in range(startdate.year, enddate.year + 1):
        start1 = datetime.date(year, 1, 1)
        anniv = datetime.date(year, startdate.month, startdate.day)
        end1 = datetime.date(year, 12, 31)
        if year != startdate.year:
            rows.append([id, year, start1, anniv])
        if anniv == enddate:
            break
        if year != enddate.year:
            rows.append([id, year, anniv, end1])
        elif anniv < enddate:
            rows.append([id, year, anniv, enddate])
df = pd.DataFrame(rows, columns=["ID", "Cal_Yr", "Str_date", "End_date"])
print(df)
Output:
ID Cal_Yr Str_date End_date
0 A001 2014 2014-03-15 2014-12-31
1 A001 2015 2015-01-01 2015-03-15
2 A001 2015 2015-03-15 2015-12-31
3 A001 2016 2016-01-01 2016-03-15
4 A001 2016 2016-03-15 2016-12-31
5 A001 2017 2017-01-01 2017-03-15
6 A001 2017 2017-03-15 2017-12-31
7 A001 2018 2018-01-01 2018-03-15
8 A001 2018 2018-03-15 2018-12-31
9 A001 2019 2019-01-01 2019-03-15
10 A001 2019 2019-03-15 2019-12-31
11 A001 2020 2020-01-01 2020-03-15
12 A001 2020 2020-03-15 2020-08-17
13 A002 2018 2018-06-01 2018-12-31
14 A002 2019 2019-01-01 2019-06-01
15 A002 2019 2019-06-01 2019-12-31
16 A002 2020 2020-01-01 2020-06-01
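The question mentions using these segments to weight the two parts of each year. One way to do that (my own sketch, not part of the original answer) is to divide each segment's length in days by the number of days in that calendar year:

```python
import pandas as pd

# Hypothetical rows in the same shape as the generated dataframe
df = pd.DataFrame(
    [['A001', 2015, '2015-01-01', '2015-03-15'],
     ['A001', 2015, '2015-03-15', '2015-12-31']],
    columns=['ID', 'Cal_Yr', 'Str_date', 'End_date'])

df['Str_date'] = pd.to_datetime(df['Str_date'])
df['End_date'] = pd.to_datetime(df['End_date'])

# Segment length in days, divided by the length of that calendar year
days_in_year = df['Str_date'].dt.is_leap_year.map({True: 366, False: 365})
df['weight'] = (df['End_date'] - df['Str_date']).dt.days / days_in_year
print(df)
```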
How do I add future dates to a data frame? The timedelta approach below only adds an offset to each existing row; it doesn't append new rows.
import pandas as pd
from datetime import timedelta
df = pd.DataFrame({
    'date': ['2001-02-01', '2001-02-02', '2001-02-03', '2001-02-04'],
    'Monthly Value': [100, 200, 300, 400]
})
df['date'] = pd.to_datetime(df['date'])  # needed so timedelta arithmetic works
df["future_date"] = df["date"] + timedelta(days=4)
print(df)
date future_date
0 2001-02-01 00:00:00 2001-02-05 00:00:00
1 2001-02-02 00:00:00 2001-02-06 00:00:00
2 2001-02-03 00:00:00 2001-02-07 00:00:00
3 2001-02-04 00:00:00 2001-02-08 00:00:00
Desired dataframe:
date future_date
0 2001-02-01 00:00:00 2001-02-01 00:00:00
1 2001-02-02 00:00:00 2001-02-02 00:00:00
2 2001-02-03 00:00:00 2001-02-03 00:00:00
3 2001-02-04 00:00:00 2001-02-04 00:00:00
4 2001-02-05 00:00:00
5 2001-02-06 00:00:00
6 2001-02-07 00:00:00
7 2001-02-08 00:00:00
You can do the following:
# set to timestamp
df['date'] = pd.to_datetime(df['date'])
# create a future date df
ftr = (df['date'] + pd.Timedelta(4, unit='days')).to_frame()
ftr['Monthly Value'] = None
# join the future data
df1 = pd.concat([df, ftr], ignore_index=True)
date Monthly Value
0 2001-02-01 100
1 2001-02-02 200
2 2001-02-03 300
3 2001-02-04 400
4 2001-02-05 None
5 2001-02-06 None
6 2001-02-07 None
7 2001-02-08 None
I found that appending a date range also works (DataFrame.append was removed in pandas 2.0, so pd.concat is the current spelling, and closed= was renamed inclusive= in 1.4; periods=5 with inclusive='right' yields the four future days):
pd.concat([df, pd.DataFrame({'date': pd.date_range(start=df.date.iloc[-1], periods=5, freq='d', inclusive='right')})], ignore_index=True)
If I understand you correctly, we can create a new dataframe using the min of your dates and the max + 4 days, then concat it back with axis=1.
df['date'] = pd.to_datetime(df['date'])
fdates = pd.DataFrame(
pd.date_range(df["date"].min(), df["date"].max() + pd.DateOffset(days=4))
,columns=['future_date'])
df_new = pd.concat([df,fdates],axis=1)
print(df_new[['date','future_date','Monthly Value']])
0 2001-02-01 2001-02-01 100.0
1 2001-02-02 2001-02-02 200.0
2 2001-02-03 2001-02-03 300.0
3 2001-02-04 2001-02-04 400.0
4 NaT 2001-02-05 NaN
5 NaT 2001-02-06 NaN
6 NaT 2001-02-07 NaN
7 NaT 2001-02-08 NaN
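Another route (my own sketch, not from the answers above) is to reindex on an extended date range, so the future days appear as new rows with NaN values:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2001-02-01', '2001-02-02', '2001-02-03', '2001-02-04']),
    'Monthly Value': [100, 200, 300, 400]
})

# Reindex on a range running 4 days past the last date; new rows get NaN
full = pd.date_range(df['date'].min(), df['date'].max() + pd.Timedelta(days=4))
out = df.set_index('date').reindex(full).rename_axis('date').reset_index()
print(out)
```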
I need to convert 0 days 08:00:00 to 08:00:00.
code:
import pandas as pd
df = pd.DataFrame({
'Slot_no':[1,2,3,4,5,6,7],
'start_time':['0:01:00','8:01:00','10:01:00','12:01:00','14:01:00','18:01:00','20:01:00'],
'end_time':['8:00:00','10:00:00','12:00:00','14:00:00','18:00:00','20:00:00','0:00:00'],
'location_type':['not considered','Food','Parks & Outdoors','Food',
'Arts & Entertainment','Parks & Outdoors','Food']})
df = df.reindex(columns=['Slot_no','start_time','end_time','location_type','loc_set'])
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
output:
print (df)
Slot_no start_time end_time location_type loc_set
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
You can use to_datetime with dt.time:
df['end_time_times'] = pd.to_datetime(df['end_time']).dt.time
print (df)
Slot_no start_time end_time location_type loc_set \
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
end_time_times
0 08:00:00
1 10:00:00
2 12:00:00
3 14:00:00
4 18:00:00
5 20:00:00
6 00:00:00
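Newer pandas versions may refuse timedelta input to `to_datetime`. An alternative that works either way (my suggestion, not from the original answer) is to add the timedeltas to a reference timestamp and then take `dt.time`:

```python
import pandas as pd

end_time = pd.Series(pd.to_timedelta(['8:00:00', '10:00:00', '24:00:00']))

# Adding the timedeltas to an epoch timestamp gives datetimes;
# .dt.time then discards the day component (so 24:00 wraps to 00:00)
times = (pd.Timestamp('1970-01-01') + end_time).dt.time
print(times)
```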
I'm trying to use pandas to group subscribers by subscription type for a given day and get the average price of a subscription type on that day. The data I have resembles:
Sub_Date Sub_Type Price
2011-03-31 00:00:00 12 Month 331.00
2012-04-16 00:00:00 12 Month 334.70
2013-08-06 00:00:00 12 Month 344.34
2014-08-21 00:00:00 12 Month 362.53
2015-08-31 00:00:00 6 Month 289.47
2016-09-03 00:00:00 6 Month 245.57
2013-04-10 00:00:00 4 Month 148.79
2014-03-13 00:00:00 12 Month 348.46
2015-03-15 00:00:00 12 Month 316.86
2011-02-09 00:00:00 12 Month 333.25
2012-03-09 00:00:00 12 Month 333.88
...
2013-04-03 00:00:00 12 Month 318.34
2014-04-15 00:00:00 12 Month 350.73
2015-04-19 00:00:00 6 Month 291.63
2016-04-19 00:00:00 6 Month 247.35
2011-02-14 00:00:00 12 Month 333.25
2012-05-23 00:00:00 12 Month 317.77
2013-05-28 00:00:00 12 Month 328.16
2014-05-31 00:00:00 12 Month 360.02
2011-07-11 00:00:00 12 Month 335.00
...
I'm looking to get something that resembles:
Sub_Date Sub_type Quantity Price
2011-03-31 00:00:00 3 Month 2 125.00
4 Month 0 0.00 # Promo not available this month
6 Month 1 250.78
12 Month 2 334.70
2011-04-01 00:00:00 3 Month 2 125.00
4 Month 2 145.00
6 Month 0 250.78
12 Month 0 334.70
2013-04-02 00:00:00 3 Month 1 125.00
4 Month 3 145.00
6 Month 0 250.78
12 Month 1 334.70
...
2015-06-23 00:00:00 3 Month 4 135.12
4 Month 0 0.00 # Promo not available this month
6 Month 0 272.71
12 Month 3 354.12
...
I'm only able to get the total number of Sub_Types for a given date.
df.Sub_Date.groupby([df.Sub_Date.values.astype('datetime64[D]')]).size()
This is somewhat of a good start, but not exactly what is needed. I've had a look at the groupby documentation on the pandas site but I can't get the output I desire.
I think you need to aggregate by mean and size and then add the missing values with unstack plus stack.
Also, if you need to change the order of the Sub_Type level, use an ordered categorical.
# generate all month labels ('1 Month', '2 Month', ..., '12 Month')
cat = [str(x) + ' Month' for x in range(1,13)]
df.Sub_Type = df.Sub_Type.astype(pd.CategoricalDtype(categories=cat, ordered=True))
df1 = (df.Price.groupby([df.Sub_Date.values.astype('datetime64[D]'), df.Sub_Type])
         .agg(['mean', 'size'])
         .rename(columns={'size':'Quantity','mean':'Price'})
         .unstack(fill_value=0)
         .stack())
print (df1)
Price Quantity
Sub_Type
2011-02-09 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-02-14 4 Month 0.00 0
6 Month 0.00 0
12 Month 333.25 1
2011-03-31 4 Month 0.00 0
6 Month 0.00 0
12 Month 331.00 1
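The same result can be written with named aggregation and `observed=False`, which emits every category level per date directly; a sketch with hypothetical sample rows, not the original data:

```python
import pandas as pd

# Hypothetical sample rows in the question's shape
df = pd.DataFrame({
    'Sub_Date': pd.to_datetime(['2011-02-09', '2011-03-31', '2011-03-31']),
    'Sub_Type': ['12 Month', '12 Month', '6 Month'],
    'Price': [333.25, 331.00, 289.47]})

# Ordered categorical so every month level appears, in order
cat = pd.CategoricalDtype([f'{x} Month' for x in range(1, 13)], ordered=True)
df['Sub_Type'] = df['Sub_Type'].astype(cat)

# observed=False keeps unobserved categories; named aggregation
# replaces the separate agg + rename step
out = (df.groupby(['Sub_Date', 'Sub_Type'], observed=False)['Price']
         .agg(Price='mean', Quantity='size')
         .fillna({'Price': 0}))
print(out)
```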