Pandas: align two time series given an "overlapping" region? - python

I have two time series, A and B. They have different lengths, different dates, and different frequencies.
The core issue is that I want to find the dates between the two time series that overlap.
This is easy enough to do, but there's a twist.
Imagine time series A has a value on 01/01/2020 and series B has a value on 02/01/2020. These are only one day apart, and UNLESS there's a better candidate (e.g. if B ALSO has a value on 01/01/2020), I want to just include this one.
So, really, the twist consists of a "lag" of, say, 5 days. Then what I want to do is the following:
1. Consider all dates in time series A.
2. For each such date, find the corresponding date in time series B.
3. For each date for which a match in B was NOT found, set a lag = 1 day and search in B again.
4. Repeat step 3 with lag = 2, 3, ..., 5 days.
5. Return a new time series C, which consists of the dates of A and the values of B, found by matching using the above procedure.

You can achieve this by, at step 2, finding the nearest date in B using a simple min over the absolute date differences:
import random
from random import random as r
import pandas as pd

random.seed(123456)
ts_a = pd.Series([pd.Timestamp(f"{d+1}-jan-2021") for d in range(5)])
base = pd.Timestamp("01-jan-2021")
one_day = pd.Timedelta("1D")
ts_b = pd.Series([base + int(10*r())*one_day for _ in range(5)])
print(ts_a)
print(ts_b)
for t in ts_a:
    deltas = abs(ts_b - t)
    index_closest = deltas.idxmin()
    print(f"The closest day to {t} in ts_b is {ts_b.iloc[index_closest]}")
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
dtype: datetime64[ns]
0 2021-01-09
1 2021-01-08
2 2021-01-01
3 2021-01-02
4 2021-01-01
dtype: datetime64[ns]
The closest day to 2021-01-01 00:00:00 in ts_b is 2021-01-01 00:00:00
The closest day to 2021-01-02 00:00:00 in ts_b is 2021-01-02 00:00:00
The closest day to 2021-01-03 00:00:00 in ts_b is 2021-01-02 00:00:00
The closest day to 2021-01-04 00:00:00 in ts_b is 2021-01-02 00:00:00
The closest day to 2021-01-05 00:00:00 in ts_b is 2021-01-08 00:00:00
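For completeness, pandas also has a built-in for this kind of nearest-within-a-tolerance matching: pd.merge_asof. A minimal sketch, assuming A and B are dataframes with a sorted datetime column date plus a value column (the frame and column names here are just for illustration):

import pandas as pd

# a, b: frames with a sorted datetime column "date" and some value column
a = pd.DataFrame({"date": ts_a.sort_values().reset_index(drop=True), "value_a": range(5)})
b = pd.DataFrame({"date": ts_b.sort_values().reset_index(drop=True), "value_b": range(5)})

# for each date in A, take the nearest date in B that is at most 5 days away;
# rows of A with no match within the tolerance get NaN for B's columns
c = pd.merge_asof(a, b, on="date", direction="nearest", tolerance=pd.Timedelta("5D"))
print(c)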

Related

Is there a way to find hourly averages in pandas timeframes that do not start from even hours?

I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08:00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at data where the value column is consistently 1 through a full hour.
Find an hourly average of the data in those hours (these hours could e.g. run from 01:30:00 to 02:29:50, or from 11:16:30 to 12:16:20).
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was phrased a bit unclearly.
I added a third column, data, which is what I want to find the mean of. I am only interested in time intervals where value = 1 consistently through one hour; the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For data to be "approved" certain requirements are to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when this occurs). So in order to maximize the number of possible hours to include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly, you want a conditional mean: calculate the mean of the data column per hour, conditional on the value column being 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df['Timestamp'].dt.hour
Create condition
Now that we've got a column to identify groups, we can check which groups are eligible: only those with value consistently equal to 1. Since the data comes in 10s intervals and we group per hour, summing the value column within each group should give 360 for an eligible hour, as there are 360 10-second intervals per hour.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to adjust this so the grouping doesn't have to start on the full hour, e.g. 08:00:10-09:00:10 instead of 08:00:00-09:00:00, the solution is simple: change how the grouping column is built, but don't change anything else in the process.
To do this you can shift the timestamps forward or back by a timedelta so that .dt.hour can still be leveraged to keep things simple.
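For example, a minimal sketch, assuming Timestamp is a regular column as in the question's table and that you want hours running from HH:00:10 up to and including the next HH:00:00:

import pandas as pd

# shift every timestamp back by 10 seconds before extracting the grouping hour,
# so a group now spans HH:00:10 .. (HH+1):00:00 instead of HH:00:00 .. HH:59:50
df['hour'] = (df['Timestamp'] - pd.Timedelta(seconds=10)).dt.hour

The condition (sum == 360) and the groupby stay exactly the same.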
Infer grouping from data
One final idea: if you want to infer, on a rolling basis, which hours you have complete data for, you can do this with a rolling sum; this is even easier. You:
compute the rolling sum of value and mean of data
only select where value is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
    if frame.value.eq(1).all():
        return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
... ... ...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64

getting week number from date python

I have code as below. My questions:
Why is it assigning week 1 to both '2014-12-29' and '2014-1-1'? Why is it not assigning week 53 to 2014-12-29?
How could I assign a week number that is continuously increasing? I want '2014-12-29' and '2015-1-1' to have week 53, '2015-1-15' to have week 55, etc.
x=pd.DataFrame(data=['2014-1-1','2014-12-29','2015-1-1','2015-1-15'],columns=['date'])
x['week_number']=pd.DatetimeIndex(x['date']).week
As far as why the week number is 1 for 12/29/2014, see the question I linked to in the comments. For the second part of your question:
January 1, 2014 was a Wednesday. We can take the minimum date of your date column, get its day number, and subtract that from the cumulative day differences:
Solution
# x["date"] = pd.to_datetime(x["date"]) # if not already a datetime column
min_date = x["date"].min() + 1 # + 1 because they're zero-indexed
x["weeks_from_start"] = ((x["date"].diff().dt.days.cumsum() - min_date) // 7 + 1).fillna(1).astype(int)
Output:
date weeks_from_start
0 2014-01-01 1
1 2014-12-29 52
2 2015-01-01 52
3 2015-01-15 54
Step by step
The first step is to convert the date column to the datetime type, if you haven't already:
In [3]: x.dtypes
Out[3]:
date object
dtype: object
In [4]: x["date"] = pd.to_datetime(x["date"])
In [5]: x
Out[5]:
date
0 2014-01-01
1 2014-12-29
2 2015-01-01
3 2015-01-15
In [6]: x.dtypes
Out[6]:
date datetime64[ns]
dtype: object
Next, we need to find the minimum of your date column and use its day number as the starting offset for the week calculation (adding 1 because the day number starts at 0):
In [7]: x["date"].min().day + 1
Out[7]: 2
Next, use the built-in .diff() function to take the differences of adjacent rows:
In [8]: x["date"].diff()
Out[8]:
0 NaT
1 362 days
2 3 days
3 14 days
Name: date, dtype: timedelta64[ns]
Note that we get NaT ("not a time") for the first entry -- that's because the first row has nothing to compare to above it.
The way to interpret these values is that row 1 is 362 days after row 0, and row 2 is 3 days after row 1, etc.
If you take the cumulative sum and subtract the starting day number, you'll get the days since the starting date (2014-01-01 in this case), as if that Wednesday were day 0 of the first week. This compensates for the fact that Wednesday sits in the middle of its week when we count whole weeks from the starting date:
In [9]: x["date"].diff().dt.days.cumsum() - min_date
Out[9]:
0 NaN
1 360.0
2 363.0
3 377.0
Name: date, dtype: float64
Now when we take the floor division by 7, we'll get the correct number of weeks since the starting date:
In [10]: (x["date"].diff().dt.days.cumsum() - 2) // 7 + 1
Out[10]:
0 NaN
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Note that we add 1 because (I assume) you're counting from 1 -- i.e., 2014-01-01 is week 1 for you, and not week 0.
Finally, the .fillna is just to take care of that NaT (which turned into a NaN when we started doing arithmetic). You use .fillna(value) to fill NaNs with value:
In [11]: ((x["date"].diff().dt.days.cumsum() - 2) // 7 + 1).fillna(1)
Out[11]:
0 1.0
1 52.0
2 52.0
3 54.0
Name: date, dtype: float64
Lastly, use .astype() to convert the column to integers instead of floats.
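As a side note, the same weeks_from_start column can be computed without diff/cumsum by measuring the days elapsed from the start date directly; a minimal sketch (the clip plays the role of the fillna on the first row):

import pandas as pd

x = pd.DataFrame({"date": pd.to_datetime(["2014-1-1", "2014-12-29", "2015-1-1", "2015-1-15"])})
start_day = x["date"].min().day + 1                    # 2, as above
days_elapsed = (x["date"] - x["date"].min()).dt.days   # 0, 362, 365, 379
x["weeks_from_start"] = ((days_elapsed - start_day) // 7 + 1).clip(lower=1).astype(int)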

First week of year considering the first day last year

I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here, 'date' is Week/Year.
So, I tried to use the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But, return this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day. How can I get 2019-12-29 as the first day of the first week of 2020, and not 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: my original answer doesn't work for the input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer to add a custom function.
Here is a way with a custom function:
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
    date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date = datetime.strptime(date + '/0', "%U/%Y/%w")
    return date if date0 == date1 else date - timedelta(weeks=1)

df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use ISO week parsing directives, e.g.:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
you can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]
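Applied to the question's dataframe, a sketch assuming df['date'] holds the Week/Year strings as in the example:

import pandas as pd

# parse "week/year" as the Monday of the ISO week, then shift back one day for a Sunday start
df['date'] = pd.to_datetime(df['date'] + "/1", format="%V/%G/%u") - pd.Timedelta('1d')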

get minimum and maximum time for a particular date in pandas

I have separate columns for start (timestamp) and end (timestamp), and I need to get the earliest start time and the latest end time for each date.
number start end test time
0 1 2020-02-01 06:27:38 2020-02-01 08:29:42 1 02:02:04
1 1 2020-02-01 08:41:03 2020-02-01 11:05:30 2 02:24:27
2 1 2020-02-01 11:20:22 2020-02-01 13:03:49 1 01:43:27
3 1 2020-02-01 13:38:18 2020-02-01 16:04:31 2 02:26:13
4 1 2020-02-01 16:26:46 2020-02-01 17:42:49 1 01:16:03
5 1 2020-02-02 10:11:00 2020-02-02 12:11:00 1 02:00:00
I want the output for each date as: Date, Min, Max.
I'm fairly new to pandas, and most of the solutions I've come across find the min and max datetime of a single column, while what I want is the min and max datetime for each date, where the timestamps are spread over two columns.
expected output (ignore the date and time formats please)
date min max
1/2/2020 6:27 17:42
2/2/2020 10:11 12:11
I believe you need to start by creating a date column and then performing a groupby on that date column.
df['date'] = df['start'].dt.date
df['start_hm'] = df['start'].dt.strftime('%H:%M')
df['end_hm'] = df['end'].dt.strftime('%H:%M')
output = df.groupby('date').agg(min=pd.NamedAgg(column='start_hm', aggfunc='min'),
                                max=pd.NamedAgg(column='end_hm', aggfunc='max'))
Output:
min max
date
2020-02-01 06:27 17:42
2020-02-02 10:11 12:11
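A variant worth noting, as a sketch assuming start and end are already datetime64 columns: aggregate the raw timestamps first and format afterwards, which avoids relying on lexical comparison of the '%H:%M' strings.

import pandas as pd

out = (df.assign(date=df['start'].dt.date)
         .groupby('date')
         .agg(min=('start', 'min'), max=('end', 'max')))
# format only for display
out['min'] = out['min'].dt.strftime('%H:%M')
out['max'] = out['max'].dt.strftime('%H:%M')
print(out)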

Python Pandas - Computing stats on TimeSeriesIndexedData for each customer

UsageDate CustID1 CustID2 .... CustIDn
0 2018-01-01 00:00:00 1.095
1 2018-01-01 01:00:00 1.129
2 2018-01-01 02:00:00 1.165
3 2018-01-01 04:00:00 1.697
.
.
m 2018-31-01 23:00:00 1.835 (m,n)
The dataframe (df) has m rows and n columns. Its index is an hourly time-series index running from the first hour of the month to the last hour of the month.
The columns are the customers, of which there are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour usage - so basically average of 1st hour of every day in a month, 2nd hour of every day in a month etc.
2) Summation of usage of every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00", "2018-11-01 05:00:00", "2018-21-01 17:00:00"
4) Bottom 3 usage hours - Similar explanation as above
5) Mean of usage for every customer in the month
My main point of trouble is how to aggregate the data both per customer and per hour of day (or per day).
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
    df_temp[col, "TotalUsage"] = df[col].apply.sum()
However, this and many versions of it which I tried are not helping me solve the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we could talk about computational complexity and how to decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour'] = df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour = df.groupby('hour').mean()
2) Summation by user.
sum_by_users = df.sum()
3) Top 3 usage hours and 4) bottom 3 usage hours: similar explanation for both. I don't quite understand your desired output; you might be asking too many different questions in one question. If you want the hour and not the value, I think you may have to iterate through the columns (a short sketch follows at the end of this answer). Adding an example to the question may help.
5) Mean by customer.
mean_by_cust = df.mean()
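For 3) and 4), a minimal sketch of the per-column iteration mentioned above, assuming UsageDate is a column and the remaining columns are the customers (drop helper columns such as 'hour' first if you added them):

# top/bottom 3 usage hours per customer, collected as {customer: [timestamps]}
usage = df.drop(columns=['hour'], errors='ignore').set_index('UsageDate')
top3 = {c: usage[c].nlargest(3).index.tolist() for c in usage.columns}
bottom3 = {c: usage[c].nsmallest(3).index.tolist() for c in usage.columns}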
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For the top and bottom 3, if you want to find the minimum sum across rows instead:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
