Join and sum on subset of rows in a dataframe - python

I have a pandas dataframe which stores date ranges and some associated columns:
date_start date_end ... lots of other columns ...
1 2016-07-01 2016-07-02
2 2016-07-01 2016-07-03
3 2016-07-01 2016-07-04
4 2016-07-02 2016-07-07
5 2016-07-05 2016-07-06
and another dataframe of Pikachu sightings indexed by date:
pikachu_sightings
date
2016-07-01 2
2016-07-02 4
2016-07-03 6
2016-07-04 8
2016-07-05 10
2016-07-06 12
2016-07-07 14
For each row in the first df I'd like to calculate the sum of pikachu_sightings within that date range (i.e., date_start to date_end) and store it in a new column. So I'd end up with a df like this (sums left unevaluated for clarity):
date_start date_end total_pikachu_sightings
1 2016-07-01 2016-07-02 2 + 4
2 2016-07-01 2016-07-03 2 + 4 + 6
3 2016-07-01 2016-07-04 2 + 4 + 6 + 8
4 2016-07-02 2016-07-07 4 + 6 + 8 + 10 + 12 + 14
5 2016-07-05 2016-07-06 10 + 12
If I was doing this iteratively I'd iterate over each row in the table of date ranges, select the subset of rows in the table of sightings that match the date range and perform a sum on it - but this is way too slow for my dataset:
for range in ranges.itertuples():
    sightings_in_range = sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)]
    sum_sightings_in_range = sightings_in_range["pikachu_sightings"].sum()
    ranges.set_value(range.Index, 'total_pikachu_sightings', sum_sightings_in_range)
This is my attempt at a vectorised pandas version, but it fails because the lengths of the two dataframes do not match (and even if they did, there's probably some other flaw in my approach):
ranges["total_pikachu_sightings"] = \
    sightings[(sightings.index >= ranges.date_start) &
              (sightings.index <= ranges.date_end)]["pikachu_sightings"].sum()
I'm trying to understand what the general approach/design should look like as I'd like to aggregate with other functions too, sum just seems like the easiest for an example. Sorry if this is an obvious question - I'm new to pandas!

A sketch of a vectorized solution:
Start with a series p as in piRSquared's answer below.
Make sure the date_start and date_end columns have datetime64 dtypes, e.g.:
df['date_start'] = pd.to_datetime(df.date_start)
df['date_end'] = pd.to_datetime(df.date_end)
Then calculate cumulative sums:
psums = p.cumsum()
and
result = psums.asof(df.date_end) - psums.asof(df.date_start)
That's not quite the end, though. asof returns the last available value, so sometimes it will pick up the exact start date and sometimes not (depending on your data), and you have to adjust for that. (If the date frequency is one day, then shifting the index of p back by an hour, e.g. by -pd.Timedelta(1, 'h'), and then adding p.asof(df.date_start) might do the trick.)
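A minimal sketch of that idea, assuming p is built and sorted as in piRSquared's answer below, df.date_start / df.date_end are datetime64, and every start and end date actually appears in p's index (the new column name is taken from the question):
psums = p.cumsum()
# asof(date) gives the cumulative total up to and including that date
end_totals = psums.asof(df['date_end']).to_numpy()
start_totals = psums.asof(df['date_start']).to_numpy()
start_day = p.asof(df['date_start']).to_numpy()  # add the start day back in
df['total_pikachu_sightings'] = end_totals - start_totals + start_day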

First make sure that pikachu_sightings has a datetime index and is sorted.
p = pikachu_sightings.squeeze() # force into a series
p.index = pd.to_datetime(p.index)
p = p.sort_index()
Then make sure your date_start and date_end are datetime.
df.date_start = pd.to_datetime(df.date_start)
df.date_end = pd.to_datetime(df.date_end)
Then it's simply
df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)
0 6
1 12
2 20
3 54
4 22
dtype: int64
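To store the result as a new column on the ranges frame (here assumed to be df, with p as above), you could assign it back:
df['total_pikachu_sightings'] = df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)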

Query for one dataframe row based on row in another dataframe & compare values

So I have two data frames. The first data frame contains numerical data that is used to "score" the second data frame which contains simulation data.
df1 = base records
df2 = simulation records
Part 1: What I am trying to accomplish is to query df1 'base records' to find the row whose timestamp is the most recent relative to each row in the df2 'simulation records', where the "Name" & "Time" columns match exactly.
Part 2: Then I want to use an if/then function to determine whether a value in the simulation record row falls within a range created from two values in the base record row, and return a boolean.
low range = df1['Po']-df1['Ref']
high range = df1['Po']+df1['Ref']
if df2['Sim'] falls in between the low range & high range of its most recent df1 base record then I want to return true in the new column "Sim Score"
otherwise return false
Part 3: I want to repeat Part 1 & Part 2 for each row in the simulation records.
helpful information:
df1 (base records) may have more or fewer rows than df2 (simulation records)
df1 has more columns than df2
some columns in df1 have the same name but different values in df2
ideally I want to be able to slice both dataframes so that the if/then function only sees the two rows used in the comparison
only need the most recent df1 base record to compare to the df2 simulation record
previously I accomplished this in Google Sheets with an IF/QUERY combination formula dragged down the entire sheet (I want to replace this with Python & pandas)
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
Use pandas.DataFrame.reindex; its method parameter offers 'nearest', which finds the nearest index value (this only works for indexes where a distance can be computed, e.g. not plain strings).
Or use merge_asof; its direction parameter offers 'nearest'.
Method 1:
reindex() with method='nearest'
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)
As you can see, this is df2.join(df1), which joins multiple DataFrame objects by index at once.
With method='nearest', it joins df2 and df1 on the nearest Timestamp index.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Method 2:
merge_asof() with direction='nearest'
This approach does not operate on the index, so we don't have to set the Timestamp column as the index. However, both objects must be sorted on the merge key (here, the Timestamp column).
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2, df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)
As you can see, this is pd.merge_asof(df2, df1), which is similar to a left join except that we match on the nearest key rather than equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame, a "nearest" search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Frankly speaking, working on the index is usually faster if you have a large dataset.
Method 2 (on multiple keys)
I modified df1, adding rows with different Name and Time values:
df1 = pd.DataFrame({'Timestamp': ['7/11/2022 11:30:00', '7/11/2022 11:30:00', '7/10/2022 04:00:00', '7/10/2022 04:00:00', '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
                    'Name': ['trial', 'trial', 'trial', 'non-trial', 'trial', 'trial'],
                    'Time': ['20 mins', '30 mins', '20 mins', '20 mins', '20 mins', '20 mins'],
                    'Po': [5, 6, 4, 1, 2, 3],
                    'Ref': [2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
5 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
4 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
3 2022-07-10 04:00:00 non-trial 20 mins 1 3 -2 4
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
1 2022-07-11 11:30:00 trial 30 mins 6 2 4 8
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
Important:
merge_asof can only merge on a single key; any additional key columns must be passed via by=.
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
Reference:
pandas.DataFrame.join
pandas.merge_asof
merging/join concept
Because you don't provide code to construct the dataframes, I will sketch a potential solution:
First, I will assume that you have only one timestamp per day (which is what your examples suggest). Accordingly, I would truncate or split the timestamp so that one column holds only the date. This lets us join the dataframes on that date, i.e. use set_index("date_column") on both dataframes and join them (use an inner join to keep only the rows whose date is present in both dataframes). Finally, you can use apply() to check your condition:
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
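A minimal sketch of the joining step described above, assuming the question's column names and at most one row per calendar date in each frame (the helper column name date is just illustrative):
df1['date'] = pd.to_datetime(df1['Timestamp']).dt.normalize()  # keep only the calendar date
df2['date'] = pd.to_datetime(df2['Timestamp']).dt.normalize()
# inner join keeps only the dates present in both frames
df_joined = df2.set_index('date').join(df1.set_index('date')[['Po', 'Ref']], how='inner')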
You can do it via pandasql:
But note that you should ideally add a unique key to one of the columns (e.g. a trial number).
from pandasql import sqldf
df3 = sqldf('''
SELECT df2.Timestamp AS Date, df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
ORDER BY Date ASC
''')
Also, to work with the datetime format in sqldf, you need to alias your Timestamp column as date in the query. If you only need, say, the first/earliest 5 results, add LIMIT 5 at the end of the query.
If you need to match each df2 date to the closest preceding df1 date, try this:
from pandasql import sqldf
df3 = sqldf('''
SELECT df1.Timestamp AS Date1, df2.Timestamp AS Date2,
df1.Name, df1.Time, df2.Sim,
CASE
WHEN Sim >= (df1.Po - df1.Ref) AND Sim <= (df1.Po + df1.Ref) THEN 'True'
WHEN Sim < (df1.Po - df1.Ref) OR Sim > (df1.Po + df1.Ref) THEN 'False'
END AS 'Sim Score'
FROM df1, df2
WHERE df2.Name == df1.Name AND df2.Time == df1.Time
and Date1 <= Date2
group by Date2
ORDER BY Date1 ASC
''')

Filling NaN values from another dataframe based on a condition

I need to populate NaN values for some columns in one dataframe based on a condition between two data frames.
DF1 has SOL (start of line) and EOL (end of line) columns and DF2 has UTC_TIME for each entry.
For every point in DF2 where the UTC_TIME is >= the SOL and is <= the EOL of each record in the DF1, that row in DF2 must be assigned the LINE, DEVICE and TAPE_FILE.
So, every one of the points will be assigned a LINE, DEVICE and TAPE_FILE based on the SOL/EOL time the UTC_TIME is between in DF1.
I'm trying to use the numpy where function for each column like this
df2['DEVICE'] = np.where(df2['UTC_TIME'] >= df1['SOL'] and <= df1['EOL'])
Or using a for loop to iterate through each row
for point in points:
    if df1['SOL'] >= df2['UTC_TIME'] and df1['EOL'] <= df2['UTC_TIME']:
        return df1['DEVICE']
Try with merge_asof:
#convert to datetime if needed
df1["SOL"] = pd.to_datetime(df1["SOL"])
df1["EOL"] = pd.to_datetime(df1["EOL"])
df2["UTC_TIME"] = pd.to_datetime(df2["UTC_TIME"])
output = pd.merge_asof(df2[["ID", "UTC_TIME"]], df1, left_on="UTC_TIME", right_on="SOL").drop(["SOL", "EOL"], axis=1)
>>> output
ID UTC_TIME LINE DEVICE TAPE_FILE
0 1 2022-04-25 06:50:00 1 Huntec 10
1 2 2022-04-25 07:15:00 2 Teledyne 11
2 3 2022-04-25 10:20:00 3 Huntec 12
3 4 2022-04-25 10:30:00 3 Huntec 12
4 5 2022-04-25 10:50:00 3 Huntec 12
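Note that merge_asof only enforces the SOL side here (it picks the last SOL at or before each UTC_TIME). If you also need to enforce the EOL bound from the question, one hedged variant keeps EOL and blanks out rows that fall outside it (column names taken from the question):
import numpy as np
merged = pd.merge_asof(df2[["ID", "UTC_TIME"]], df1, left_on="UTC_TIME", right_on="SOL")
outside = merged["UTC_TIME"] > merged["EOL"]
merged.loc[outside, ["LINE", "DEVICE", "TAPE_FILE"]] = np.nan  # no line covers this point
output = merged.drop(["SOL", "EOL"], axis=1)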

How to remove specific day timestamps from a big dataframe

I have a big dataframe consisting of 600 days' worth of data. Each day has 100 timestamps. I have a separate list of 30 days whose data I want to remove. How do I remove the data for these 30 days from the dataframe?
I tried a for loop, but it did not work. I know there is a simple method. But I don't know how to implement it.
df #is main dataframe which has many columns and rows. Index is a timestamp.
df['dates'] = df.index.strftime('%Y-%m-%d') # date part of timestamp is sliced and
#a new column is created. Instead of index, I want to use this column for comparing with bad list.
bad_list # it is a list of bad dates
for i in range(0, len(df)):
    for j in range(0, len(bad_list)):
        if str(df['dates'][i]) == bad_list[j]:
            df.drop(df[i].index, inplace=True)
You can do the following
df['dates'] = df.index.strftime('%Y-%m-%d')
# bad_list entries must match this string format
newdf = df[~df['dates'].isin(bad_list)]
# the ~ denotes "not in" the list:
# if Jan 1, 2000 is a bad date, it should appear in the list as '2000-01-01'
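A hedged variant of the same idea that skips the helper string column, assuming df has a DatetimeIndex and bad_list holds parseable dates:
bad_days = pd.to_datetime(bad_list)               # normalize the list to Timestamps
newdf = df[~df.index.normalize().isin(bad_days)]  # drop rows whose date is in the bad list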
You can perform a simple comparison (this example assumes from time import time and import numpy as np):
>>> dates = pd.Series(pd.to_datetime(np.random.randint(int(time()) - 60 * 60 * 24 * 5, int(time()), 12), unit='s'))
>>> dates
0 2019-03-19 05:25:32
1 2019-03-20 00:58:29
2 2019-03-19 01:03:36
3 2019-03-22 11:45:24
4 2019-03-19 08:14:29
5 2019-03-21 10:17:13
6 2019-03-18 09:09:15
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
9 2019-03-23 06:19:35
10 2019-03-23 05:42:34
11 2019-03-21 11:37:46
>>> start_date = pd.to_datetime('2019-03-20')
>>> end_date = pd.to_datetime('2019-03-22')
>>> dates[(dates > start_date) & (dates < end_date)]
1 2019-03-20 00:58:29
5 2019-03-21 10:17:13
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
11 2019-03-21 11:37:46
If your source Series is not in datetime format, then you will need to use pd.to_datetime to convert it.

Pandas : SQL SelfJoin With Date Criteria

One query I often do in SQL within a relational database is to join a table back to itself and summarize each row based on records for the same id either backwards or forward in time.
For example, assume table1 has columns 'ID', 'Date', 'Var1'.
In SQL I could sum var1 for the past 3 months for each record like this:
Select a.ID, a.Date, sum(b.Var1) as sum_var1
from table1 a
left outer join table1 b
on a.ID = b.ID
and months_between(a.date,b.date) <0
and months_between(a.date,b.date) > -3
Is there any way to do this in Pandas?
It seems you need GroupBy + rolling. Implementing the logic in precisely the same way it is written in SQL is likely to be expensive as it will involve repeated loops. Let's take an example dataframe:
Date ID Var1
0 2015-01-01 1 0
1 2015-02-01 1 1
2 2015-03-01 1 2
3 2015-04-01 1 3
4 2015-05-01 1 4
5 2015-01-01 2 5
6 2015-02-01 2 6
7 2015-03-01 2 7
8 2015-04-01 2 8
9 2015-05-01 2 9
You can add a column which, by group, looks back and sums a variable over a fixed period. First define a function utilizing pd.Series.rolling:
def lookbacker(x):
    """Sum over past 70 days"""
    return x.rolling('70D').sum().astype(int)
Then apply it on a GroupBy object and extract values for assignment:
df['Lookback_Sum'] = df.set_index('Date').groupby('ID')['Var1'].apply(lookbacker).values
print(df)
Date ID Var1 Lookback_Sum
0 2015-01-01 1 0 0
1 2015-02-01 1 1 1
2 2015-03-01 1 2 3
3 2015-04-01 1 3 6
4 2015-05-01 1 4 9
5 2015-01-01 2 5 5
6 2015-02-01 2 6 11
7 2015-03-01 2 7 18
8 2015-04-01 2 8 21
9 2015-05-01 2 9 24
It appears pd.Series.rolling does not work with months, e.g. using '2M' (2 months) instead of '70D' (70 days) gives ValueError: <2 * MonthEnds> is a non-fixed frequency. This makes sense since a "month" is ambiguous given months have different numbers of days.
Another point worth mentioning is that you can use GroupBy + rolling directly, and possibly more efficiently, by bypassing apply, but this requires ensuring your index is monotonic. For example, via sort_index:
df['Lookback_Sum'] = df.set_index('Date').sort_index()\
                       .groupby('ID')['Var1'].rolling('70D').sum()\
                       .astype(int).values
I don't think pandas.DataFrame.rolling() supports rolling-window aggregation by some number of months; currently, you must specify a fixed number of days, or other fixed-length time period.
But as #jpp mentioned, you can use python loops to perform rolling aggregation over a window size specified in calendar months, where the number of days in each window will vary, depending on what part of the calendar you're rolling over.
The following approach builds on this SO answer as well as #jpp's:
# Build some example data:
# 3 unique IDs, each with 365 samples, one sample per day throughout 2015
df = pd.DataFrame({'Date': pd.date_range('2015-01-01', '2015-12-31', freq='D'),
                   'Var1': list(range(365))})
df = pd.concat([df] * 3)
df['ID'] = [1]*365 + [2]*365 + [3]*365
df.head()
Date Var1 ID
0 2015-01-01 0 1
1 2015-01-02 1 1
2 2015-01-03 2 1
3 2015-01-04 3 1
4 2015-01-05 4 1
# Define a lookback function that mimics rolling aggregation,
# but uses DateOffset() slicing, rather than a window of fixed size.
# Use .count() here as a sanity check; you will need .sum()
def lookbacker(ser):
    return pd.Series([ser.loc[d - pd.offsets.DateOffset(months=3):d].count()
                      for d in ser.index])
# By default, groupby.agg output is sorted by key. So make sure to
# sort df by (ID, Date) before inserting the flattened groupby result
# into a new column
df.sort_values(['ID', 'Date'], inplace=True)
df.set_index('Date', inplace=True)
df['window_size'] = df.groupby('ID')['Var1'].apply(lookbacker).values
# Manually check the resulting window sizes
df.head()
Var1 ID window_size
Date
2015-01-01 0 1 1
2015-01-02 1 1 2
2015-01-03 2 1 3
2015-01-04 3 1 4
2015-01-05 4 1 5
df.tail()
Var1 ID window_size
Date
2015-12-27 360 3 92
2015-12-28 361 3 92
2015-12-29 362 3 92
2015-12-30 363 3 92
2015-12-31 364 3 93
df[df.ID == 1].loc['2015-05-25':'2015-06-05']
Var1 ID window_size
Date
2015-05-25 144 1 90
2015-05-26 145 1 90
2015-05-27 146 1 90
2015-05-28 147 1 90
2015-05-29 148 1 91
2015-05-30 149 1 92
2015-05-31 150 1 93
2015-06-01 151 1 93
2015-06-02 152 1 93
2015-06-03 153 1 93
2015-06-04 154 1 93
2015-06-05 155 1 93
The last column gives the lookback window size in days, looking back from that date, including both the start and end dates.
Looking "3 months" before 2016-05-31 would land you at 2015-02-31, but February has only 28 days in 2015. As you can see in the sequence 90, 91, 92, 93 in the above sanity check, This DateOffset approach maps the last four days in May to the last day in February:
pd.to_datetime('2015-05-31') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-30') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-29') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
pd.to_datetime('2015-05-28') - pd.offsets.DateOffset(months=3)
Timestamp('2015-02-28 00:00:00')
I don't know if this matches SQL's behaviour, but in any case, you'll want to test this and decide if this makes sense in your case.
You could use a lambda with apply to achieve it:
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
and write an equivalent function for SQL's months_between.
The complete example is:
from datetime import datetime
import datetime as dt
import pandas as pd
def months_between(date1, date2):
    if date1.day == date2.day:
        return (date1.year - date2.year) * 12 + date1.month - date2.month
    # if both are last days
    if date1.month != (date1 + dt.timedelta(days=1)).month:
        if date2.month != (date2 + dt.timedelta(days=1)).month:
            return date1.month - date2.month
    return (date1 - date2).days / 31

def findSum(cRow):
    table1['month_diff'] = table1['Date'].apply(months_between, date2=cRow['Date'])
    filtered_table = table1[(table1["month_diff"] < 0) & (table1["month_diff"] > -3) & (table1['ID'] == cRow['ID'])]
    if filtered_table.empty:
        return 0
    return filtered_table['Var1'].sum()
table1 = pd.DataFrame(columns = ['ID', 'Date', 'Var1'])
table1.loc[len(table1)] = [1, datetime.strptime('2015-01-01','%Y-%m-%d'), 0]
table1.loc[len(table1)] = [1, datetime.strptime('2015-02-01','%Y-%m-%d'), 1]
table1.loc[len(table1)] = [1, datetime.strptime('2015-03-01','%Y-%m-%d'), 2]
table1.loc[len(table1)] = [1, datetime.strptime('2015-04-01','%Y-%m-%d'), 3]
table1.loc[len(table1)] = [1, datetime.strptime('2015-05-01','%Y-%m-%d'), 4]
table1.loc[len(table1)] = [2, datetime.strptime('2015-01-01','%Y-%m-%d'), 5]
table1.loc[len(table1)] = [2, datetime.strptime('2015-02-01','%Y-%m-%d'), 6]
table1.loc[len(table1)] = [2, datetime.strptime('2015-03-01','%Y-%m-%d'), 7]
table1.loc[len(table1)] = [2, datetime.strptime('2015-04-01','%Y-%m-%d'), 8]
table1.loc[len(table1)] = [2, datetime.strptime('2015-05-01','%Y-%m-%d'), 9]
table1['sum_var1'] = table1.apply(lambda row: findSum(row), axis=1)
table1.drop(columns=['month_diff'], inplace=True)
print(table1)

Simplest way to find the difference between two dates in pandas

I'm trying to find the difference between two dates in a multi index data frame that is the result of a pivot table operation.
The data frame contains three columns. The first is a measurement, the second is the end date, and the third is the start date.
I've been able to successfully add a third multi-index column to the data frame, but only to make the result of each cell zero, i.e. Pt["min"]["start_date"] = 0. When I try to subtract the two dates I get a string error, and appending .Dt.Days to the end of each column results in an error as well.
What is the simplest way to find the difference in days between two dates in a multi index pandas data frame?
You can select MultiIndex columns with tuples and subtract the columns:
print (df)
a
meas end start
0 7 2015-04-05 2015-04-01
1 8 2015-04-07 2015-04-02
2 9 2015-04-14 2015-04-04
#if dtypes not datetime
df[('a','end')] = pd.to_datetime(df[('a','end')])
df[('a','start')] = pd.to_datetime(df[('a','start')])
df[('a','diff')] = df[('a','end')] - df[('a','start')]
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4 days
1 8 2015-04-07 2015-04-02 5 days
2 9 2015-04-14 2015-04-04 10 days
If need output in days:
df[('a','diff')] = (df[('a','end')] - df[('a','start')]).dt.days
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4
1 8 2015-04-07 2015-04-02 5
2 9 2015-04-14 2015-04-04 10
