Simplest way to find the difference between two dates in pandas - python

I'm trying to find the difference between two dates in a multi index data frame that is the result of a pivot table operation.
The data frame contains three columns. The first is a measurement the second is the end date and the third is the start date.
I've been able to successfully add a third multi index column to the data frame but only to make the result of reach cell zero
Pt["min"]["start_date"] = 0 but when I try to subtract the two dates I get a string error and appending .Dt.Days to the end of each column results in an error as well.
What is the simplest way to find the difference in days between two dates in a multi index pandas data frame?

You can select Multiindex in columns by tuples and subtract columns:
print (df)
a
meas end start
0 7 2015-04-05 2015-04-01
1 8 2015-04-07 2015-04-02
2 9 2015-04-14 2015-04-04
#if dtypes not datetime
df[('a','end')] = pd.to_datetime(df[('a','end')])
df[('a','start')] = pd.to_datetime(df[('a','start')])
df[('a','diff')] = df[('a','end')] - df[('a','start')]
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4 days
1 8 2015-04-07 2015-04-02 5 days
2 9 2015-04-14 2015-04-04 10 days
If need output in days:
df[('a','diff')] = (df[('a','end')] - df[('a','start')]).dt.days
print (df)
a
meas end start diff
0 7 2015-04-05 2015-04-01 4
1 8 2015-04-07 2015-04-02 5
2 9 2015-04-14 2015-04-04 10

Related

Filter dataframe for past 4 quarters

I have a DataFrame with start date and end date as a column. Firstly I need create a new column using End Date which consists of yearquarter (2022Q4) which I did using below code:
df['Quarter']=pd.PeriodIndex(d['End Date'],freq='Q')
Now I want to filter data for past 4 quarters using End Date, lets say curremtly its 2023Q1, so I want to filter so That I can get data for (2022Q1, 2022Q2, 2022Q3, 2022Q4).
How can I achieve this in python.
Use Serie.dt.to_period for quarters with generate actual quarter by Timestamp and Timestamp.to_period, last filter by boolean indexing with Series.between:
df = pd.DataFrame({'End Date':pd.date_range('2021-12-01', freq='25D', periods=20)})
df['Quarter'] = df['End Date'].dt.to_period('q')
now = pd.Timestamp.now().to_period('q')
out = df[df['Quarter'].between(now - 4, now - 1)]
print (out)
End Date Quarter
2 2022-01-20 2022Q1
3 2022-02-14 2022Q1
4 2022-03-11 2022Q1
5 2022-04-05 2022Q2
6 2022-04-30 2022Q2
7 2022-05-25 2022Q2
8 2022-06-19 2022Q2
9 2022-07-14 2022Q3
10 2022-08-08 2022Q3
11 2022-09-02 2022Q3
12 2022-09-27 2022Q3
13 2022-10-22 2022Q4
14 2022-11-16 2022Q4
15 2022-12-11 2022Q4

Filling NaN values from another dataframe based on a condition

I need to populate NaN values for some columns in one dataframe based on a condition between two data frames.
DF1 has SOL (start of line) and EOL (end of line) columns and DF2 has UTC_TIME for each entry.
For every point in DF2 where the UTC_TIME is >= the SOL and is <= the EOL of each record in the DF1, that row in DF2 must be assigned the LINE, DEVICE and TAPE_FILE.
So, every one of the points will be assigned a LINE, DEVICE and TAPE_FILE based on the SOL/EOL time the UTC_TIME is between in DF1.
I'm trying to use the numpy where function for each column like this
df2['DEVICE'] = np.where(df2['UTC_TIME'] >= df1['SOL'] and <= df1['EOL'])
Or using a for loop to iterate through each row
for point in points:
if df1['SOL'] >= df2['UTC_TIME'] and df1['EOL'] <= df2['UTC_TIME']
return df1['DEVICE']
Try with merge_asof:
#convert to datetime if needed
df1["SOL"] = pd.to_datetime(df1["SOL"])
df1["EOL"] = pd.to_datetime(df1["EOL"])
df2["UTC_TIME"] = pd.to_datetime(df2["UTC_TIME"])
output = pd.merge_asof(df2[["ID", "UTC_TIME"]],df1,left_on="UTC_TIME",right_on="SOL").drop(["SOL","EOL"],axis=1)
>>> output
ID UTC_TIME LINE DEVICE TAPE_FILE
0 1 2022-04-25 06:50:00 1 Huntec 10
1 2 2022-04-25 07:15:00 2 Teledyne 11
2 3 2022-04-25 10:20:00 3 Huntec 12
3 4 2022-04-25 10:30:00 3 Huntec 12
4 5 2022-04-25 10:50:00 3 Huntec 12

How to remove specific day timestamps from a big dataframe

I have a big dataframe consisting of 600 days worth of data. Each day has 100 timestamps. I have a separate list of 30 days from which I want to data. How do I remove data from these 30 days from the dataframe?
I tried a for loop, but it did not work. I know there is a simple method. But I don't know how to implement it.
df #is main dataframe which has many columns and rows. Index is a timestamp.
df['dates'] = df.index.strftime('%Y-%m-%d') # date part of timestamp is sliced and
#a new column is created. Instead of index, I want to use this column for comparing with bad list.
bad_list # it is a list of bad dates
for i in range(0,len(df)):
for j in range(0,len(bad_list)):
if str(df['dates'][i])== bad_list[j]:
df.drop(df[i].index,inplace=True)
You can do the following
df['dates'] = df.index.strftime('%Y-%m-%d')
#badlist should be in date format too.
newdf = df[~df['dates'].isin(badlist)]
# the ~ is used to denote "not in" the list.
#if Jan 1, 2000 is a bad date, it should be in the list as datetime(2000,1,1)
You can perform simple comparison:
>>> dates = pd.Series(pd.to_datetime(np.random.randint(int(time()) - 60 * 60 * 24 * 5, int(time()), 12), unit='s'))
>>> dates
0 2019-03-19 05:25:32
1 2019-03-20 00:58:29
2 2019-03-19 01:03:36
3 2019-03-22 11:45:24
4 2019-03-19 08:14:29
5 2019-03-21 10:17:13
6 2019-03-18 09:09:15
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
9 2019-03-23 06:19:35
10 2019-03-23 05:42:34
11 2019-03-21 11:37:46
>>> start_date = pd.to_datetime('2019-03-20')
>>> end_date = pd.to_datetime('2019-03-22')
>>> dates[(dates > start_date) & (dates < end_date)]
1 2019-03-20 00:58:29
5 2019-03-21 10:17:13
7 2019-03-20 00:14:16
8 2019-03-21 19:47:02
11 2019-03-21 11:37:46
If your source Series is not in datetime format, then you will need to use pd.to_datetime to convert it.

Prepare Data Frames to be compared. Index manipulation, datetime and beyond

Ok, this is a question in two steps.
Step one: I have a pandas DataFrame like this:
date time value
0 20100201 0 12
1 20100201 6 22
2 20100201 12 45
3 20100201 18 13
4 20100202 0 54
5 20100202 6 12
6 20100202 12 18
7 20100202 18 17
8 20100203 6 12
...
As you can see, for instance between rows 7 and 8 there is data missing (in this case, the value for the 0 time). Sometimes, several hours or even a full day could be missing.
I would like to convert this DataFrame to the format like this:
value
2010-02-01 00:00:00 12
2010-02-01 06:00:00 22
2010-02-01 12:00:00 45
2010-02-01 18:00:00 13
2010-02-02 00:00:00 54
2010-02-02 06:00:00 12
2010-02-02 12:00:00 18
2010-02-02 18:00:00 17
...
I want this because I have another DataFrame (let's call it "reliable DataFrame") in this format that I am sure it has no missing values.
EDIT 2016/07/28: Studying the problem it seems there were also duplicated data in the dataframe. See the solution to also address this problem.
Step two: With the previous step done I want to compare row by row the index in the "reliable DataFrame" with the index in the DataFrame with missing values.
I want to add a row with the value NaN where there are missing entries in the first DataFrame. The final check would be to be sure that both DataFrames have the same dimension.
I know this is a long question, but I am stacked. I have tried to manage the dates with the dateutil.parser.parse and to use set_index as the method to set a new index, but I have lots of errors in the code. I am afraid this is clearly above my pandas level.
Thank you in advance.
Step 1 Answer
df['DateTime'] = (df['date'].astype(str) + ' ' + df['time'].astype(str) +':'+'00'+':'+'00').apply(lambda x: pd.to_datetime(str(x)))
df.set_index('DateTime', drop=True, append=False, inplace=True, verify_integrity=False)
df.drop(['date', 'time'], axis=1, level=None, inplace=True, errors='raise')
If there are duplicates these can be removed by:
df = df.reset_index().drop_duplicates(subset='DateTime',keep='last').set_index('DateTime')
Step 2
df_join = df.join(df1, how='outer', lsuffix='x',sort=True)

Join and sum on subset of rows in a dataframe

I have a pandas dataframe which stores date ranges and some associated colums:
date_start date_end ... lots of other columns ...
1 2016-07-01 2016-07-02
2 2016-07-01 2016-07-03
3 2016-07-01 2016-07-04
4 2016-07-02 2016-07-07
5 2016-07-05 2016-07-06
and another dataframe of Pikachu sightings indexed by date:
pikachu_sightings
date
2016-07-01 2
2016-07-02 4
2016-07-03 6
2016-07-04 8
2016-07-05 10
2016-07-06 12
2016-07-07 14
For each row in the first df I'd like to calculate the sum of pikachu_sightings within that date range (i.e., date_start to date_end) and store that in a new column. So would end up with a df like this (numbers left in for clarity):
date_start date_end total_pikachu_sightings
1 2016-07-01 2016-07-02 2 + 4
2 2016-07-01 2016-07-03 2 + 4 + 6
3 2016-07-01 2016-07-04 2 + 4 + 6 + 8
4 2016-07-02 2016-07-07 4 + 6 + 8 + 10 + 12 + 14
5 2016-07-05 2016-07-06 10 + 12
If I was doing this iteratively I'd iterate over each row in the table of date ranges, select the subset of rows in the table of sightings that match the date range and perform a sum on it - but this is way too slow for my dataset:
for range in ranges.itertuples():
sightings_in_range = sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)]
sum_sightings_in_range = sightings_in_range["pikachu_sightings"].sum()
ranges.set_value(range.Index, 'total_pikachu_sightings', sum_sightings_in_range)
This is my attempt at using pandas, but fails because the length of the two dataframes does not match (and even if they did, there's probably some other flaw in my approach):
range["total_pikachu_sightings"] =
sightings[(sightings.index >= range.date_start) & (sightings.index <= range.date_end)
["pikachu_sightings"].sum()
I'm trying to understand what the general approach/design should look like as I'd like to aggregate with other functions too, sum just seems like the easiest for an example. Sorry if this is an obvious question - I'm new to pandas!
A sketch of a vectorized solution:
Start with a p as in piRSquared's answer.
Make sure date_ cols have datetime64 dtypes, i.e.:
df['date_start'] = pd.to_datetime(df.date_time)
Then calculate cumulative sums:
psums = p.cumsum()
and
result = psums.asof(df.date_end) - psums.asof(df.date_start)
It's not yet the end, though. asof returns the last good value, so it sometimes will take the exact start date and sometimes not (depending on your data). So, you have to adjust for that. (If the date frequency is day, then probably moving the index of p an hour backwards, e.g. -pd.Timedelta(1, 'h'), and then adding p.asof(df.start_date) might do the trick.)
First make sure that pikachu_sightings has a datetime index and is sorted.
p = pikachu_sightings.squeeze() # force into a series
p.index = pd.to_datetime(p.index)
p = p.sort_index()
Then make sure your date_start and date_end are datetime.
df.date_start = pd.to_datetime(df.date_start)
df.date_end = pd.to_datetime(df.date_end)
Then its simply
df.apply(lambda x: p[x.date_start:x.date_end].sum(), axis=1)
0 6
1 12
2 20
3 54
4 22
dtype: int64

Categories