Aggregating Values using Date Ranges in Another Dataframe - python

I need to sum values from maindata over the start/end windows in master_records. Many rows for an id should not be summed, even though they have timestamps and values, because they fall outside every window.
import pandas as pd
#Proxy reference dataframe
master_records = [['site a', '2021-03-05 02:00:00', '2021-03-05 03:00:00'],
['site a', '2021-03-05 06:00:00', '2021-03-05 08:00:00'],
['site b', '2021-04-08 10:00:00', '2021-04-08 13:00:00']]
mst_df = pd.DataFrame(master_records, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
#Proxy main high frequency dataframe
main_data = [['id a','2021-03-05 00:00:00', 10], #not aggregated
['id a','2021-03-05 01:00:00', 19], #not aggregated
['id a','2021-03-05 02:00:00', 9],
['id a','2021-03-05 03:00:00', 16],
['id a','2021-03-05 04:00:00', 16], #not aggregated
['id a','2021-03-05 05:00:00', 11], #not aggregated
['id a','2021-03-05 06:00:00', 16],
['id a','2021-03-05 07:00:00', 12],
['id a','2021-03-05 08:00:00', 9],
['id b','2021-04-08 10:00:00', 11],
['id b','2021-04-08 11:00:00', 10],
['id b','2021-04-08 12:00:00', 19],
['id b','2021-04-08 13:00:00', 10],
['id b','2021-04-08 14:00:00', 16]] #not aggregated
# Create the pandas DataFrame
maindata = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'value'])
maindata['timestamp'] = pd.to_datetime(maindata['timestamp'], infer_datetime_format=True)
The desired DataFrame looks like:
print(mst_df)
id start end sum(value)
0 site a 2021-03-05 02:00:00 2021-03-05 03:00:00 25
1 site a 2021-03-05 06:00:00 2021-03-05 08:00:00 37
2 site b 2021-04-08 10:00:00 2021-04-08 13:00:00 50

The "id"s don't match; so first we create a column in both DataFrames to get a matching ID; then merge on the matching "id"s; then filter the merged DataFrame on the rows where the timestamps are between "start" and "end". Finally groupby + sum will fetch the desired outcome:
# Extract the trailing letter ('a', 'b') as a shared key
maindata['id_letter'] = maindata['id'].str.split().str[-1]
mst_df['id_letter'] = mst_df['id'].str.split().str[-1]
# Inner-join every window to every reading with the same key
merged = mst_df.merge(maindata, on='id_letter', suffixes=('', '_'))
# Keep readings inside their window, then sum per window
out = (merged[merged['timestamp'].between(merged['start'], merged['end'])]
       .groupby(['id', 'start', 'end'], as_index=False)['value'].sum())
Output:
id start end value
0 site a 2021-03-05 02:00:00 2021-03-05 03:00:00 25
1 site a 2021-03-05 06:00:00 2021-03-05 08:00:00 37
2 site b 2021-04-08 10:00:00 2021-04-08 13:00:00 50
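If the many-to-many merge gets too large in memory on real data, a row-wise mask is a simple (if slower) alternative. A minimal sketch, assuming the id_letter columns created above:
sums = [maindata.loc[maindata['id_letter'].eq(row.id_letter)
                     & maindata['timestamp'].between(row.start, row.end),
                     'value'].sum()
        for row in mst_df.itertuples()]
mst_df['sum(value)'] = sums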

How to find the difference in a dataframe between a start date and an end date

I have to find the difference between the data provided at 00:00:00 and at 23:59:59 for each day, over seven days.
How can I find this difference between the start-of-day and end-of-day values in the data frame?
Sample Data
Date Data
2018-12-01 00:00:00 2
2018-12-01 12:00:00 5
2018-12-01 23:59:59 10
2018-12-02 00:00:00 12
2018-12-02 12:00:00 15
2018-12-02 23:59:59 22
Expected Output
Date Data
2018-12-01 8
2018-12-02 10
Example
import pandas as pd

data = {
    'Date': ['2018-12-01 00:00:00', '2018-12-01 12:00:00', '2018-12-01 23:59:59',
             '2018-12-02 00:00:00', '2018-12-02 12:00:00', '2018-12-02 23:59:59'],
    'Data': [2, 5, 10, 12, 15, 22]
}
df = pd.DataFrame(data)
Code
df['Date'] = pd.to_datetime(df['Date'])
out = (df.resample('D', on='Date')['Data']
         .agg(lambda x: x.iloc[-1] - x.iloc[0])   # last minus first per day
         .reset_index())
out
Date Data
0 2018-12-01 8
1 2018-12-02 10
Update
A more efficient way: you can get the same result with the following code:
g = df.resample('D', on='Date')['Data']
out = g.last().sub(g.first()).reset_index()
You can also use groupby on the calendar date and take the min-max range within each group.
import pandas as pd
df = pd.DataFrame({
'Date': ['2018-12-01 00:00:00', '2018-12-01 12:00:00', '2018-12-01 23:59:59',
'2018-12-02 00:00:00', '2018-12-02 12:00:00', '2018-12-02 23:59:59'],
'Data': [2, 5, 10, 12, 15, 22]
})
df['Date'] = pd.to_datetime(df['Date'])
df['Date_Only'] = df['Date'].dt.date
result = df.groupby('Date_Only')['Data'].agg(lambda x: x.max() - x.min())
print(result)
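Note that max - min only matches the expected output here because Data rises within each day. If values can dip below the opening reading, a hedged variant that takes last minus first per calendar day (assuming rows are sorted by time) may be safer:
df = df.sort_values('Date')
result = df.groupby(df['Date'].dt.date)['Data'].agg(lambda x: x.iloc[-1] - x.iloc[0])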

Create Event Indicator from 2 DataFrames of Timestamps

I have two dataframes. One is main_df, which holds high-frequency PI data with timestamps. The other is reference_df, a header-level dataframe with start and end timestamps. I need to create an indicator on main_df for when individual rows fall between the start and end timestamps in reference_df. Any input would be appreciated.
import pandas as pd
#Proxy reference dataframe
reference_data = [['site a', '2021-03-05 00:00:00', '2021-03-05 23:52:00'],
['site a', '2021-03-06 00:00:00', '2021-03-06 12:00:00'],
['site b', '2021-04-08 20:04:00', '2021-04-08 23:00:00'],
['site c', '2021-04-09 04:08:00', '2021-04-09 09:52:00']]
ref_df = pd.DataFrame(reference_data, columns = ['id', 'start', 'end'])
ref_df['start'] = pd.to_datetime(ref_df['start'], infer_datetime_format=True)
ref_df['end'] = pd.to_datetime(ref_df['end'], infer_datetime_format=True)
#Proxy main high frequency dataframe
main_data = [['site a', '2021-03-05 01:00:00', 10],
['site a', '2021-03-05 01:01:00', 11],
['site b', '2021-04-08 20:00:00', 9],
['site b', '2021-04-08 20:04:00', 10],
['site b', '2021-04-08 20:05:00', 11],
['site c', '2021-01-09 10:00:00', 7]]
# Create the pandas DataFrame
main_df = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'value'])
main_df['timestamp'] = pd.to_datetime(main_df['timestamp'], infer_datetime_format=True)
Desired DataFrame:
print(main_df)
id timestamp value event_indicator
0 site a 2021-03-05 01:00:00 10 1
1 site a 2021-03-05 01:01:00 11 1
2 site b 2021-04-08 20:00:00 9 0
3 site b 2021-04-08 20:04:00 10 1
4 site b 2021-04-08 20:05:00 11 1
5 site c 2021-01-09 10:00:00 7 0
Perform an inner join on the sites, then check whether each timestamp is between any of the ranges. reset_index before the merge so you can use the original row index to keep track of which row you're checking the range for.
s = main_df.reset_index().merge(ref_df, on='id')   # one row per (reading, window) pair
s['event_indicator'] = s['timestamp'].between(s['start'], s['end']).astype(int)
# Max checks for at least 1 overlapping window per original row.
s = s.groupby('index')['event_indicator'].max()
main_df['event_indicator'] = s                     # aligns on the original index
id timestamp value event_indicator
0 site a 2021-03-05 01:00:00 10 1
1 site a 2021-03-05 01:01:00 11 1
2 site b 2021-04-08 20:00:00 9 0
3 site b 2021-04-08 20:04:00 10 1
4 site b 2021-04-08 20:05:00 11 1
5 site c 2021-01-09 10:00:00 7 0
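One caveat: the inner join drops any id in main_df that has no row in ref_df, so the index-aligned assignment leaves NaN for such rows. A hedged one-liner to default them to 0, if that is the desired behavior:
main_df['event_indicator'] = main_df['event_indicator'].fillna(0).astype(int)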

Is there a faster way for finding the range of constant values in a dataframe?

I want to find the longest duration of the constant values in a dataframe. For example, given a dataframe below, the longest duration should be 30 minutes (when value = 2).
import pandas as pd
d = {'date_time': ['2016-01-01 12:00:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00',
'2016-01-01 13:00:00', '2016-01-01 13:15:00',
'2016-01-01 13:30:00', '2016-01-01 13:45:00'],
'value': [1,2,2,2,4,5,5,7]}
df = pd.DataFrame(data=d)
df['date_time'] = pd.to_datetime(df['date_time'])
print(df)
date_time value
0 2016-01-01 12:00:00 1
1 2016-01-01 12:15:00 2
2 2016-01-01 12:30:00 2
3 2016-01-01 12:45:00 2
4 2016-01-01 13:00:00 4
5 2016-01-01 13:15:00 5
6 2016-01-01 13:30:00 5
7 2016-01-01 13:45:00 7
(Note: the date_time interval is not always consistent.)
I managed to find it by locating the indexes where df.value.diff().abs() == 0, then building a complex function that iterates through that list and computes the range.
Since the actual dataframe is much larger than this example, is there a shortcut function or a faster way to get this without multiple iterations?
Thank you.
EDIT:
In my case, the same value can appear in other streaks. A more appropriate example would be
d = {'date_time': ['2016-01-01 12:00:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00',
'2016-01-01 13:00:00', '2016-01-01 13:15:00',
'2016-01-01 13:30:00', '2016-01-01 13:45:00',
'2016-01-01 14:00:00', '2016-01-01 14:05:00'],
'value': [1,2,2,2,4,5,5,7,5,5]}
df = pd.DataFrame(data=d)
df['date_time'] = pd.to_datetime(df['date_time'])
print(df)
date_time value
0 2016-01-01 12:00:00 1
1 2016-01-01 12:15:00 2
2 2016-01-01 12:30:00 2
3 2016-01-01 12:45:00 2
4 2016-01-01 13:00:00 4
5 2016-01-01 13:15:00 5
6 2016-01-01 13:30:00 5
7 2016-01-01 13:45:00 7
8 2016-01-01 14:00:00 5
9 2016-01-01 14:05:00 5
The longest duration, in this case, remains 30 minutes when value = 2.
groupby + nlargest
Create a grouping series that tracks changes.
groupr = df.value.ne(df.value.shift()).cumsum()
Create a mapping dictionary that can translate from the groupr key to the actual value in the df.value column.
mapper = dict(zip(groupr, df.value))
Now we group and use ptp and nlargest. Finally, we use rename and mapper to translate the index value, which is the groupr value, back to the value value (phew, that's a tad confusing).
import numpy as np

df.groupby(groupr).date_time.apply(np.ptp).nlargest(1).rename(mapper)
value
2 0 days 00:30:00
Name: date_time, dtype: timedelta64[ns]
The 2 in the index is the value with the longest duration. The 0 days 00:30:00 is the longest duration.
References
np.ptp
nlargest
You can groupby the value column and use .size() to get the size/length of each group.
>>> groups = df.groupby('value')
>>> groups.size()
value
1 1
2 3
4 1
5 2
7 1
dtype: int64
.idxmax() will give you the index of the largest group, which you can pass to .get_group()
>>> groups.get_group(groups.size().idxmax())
date_time value
1 2016-01-01 12:15:00 2
2 2016-01-01 12:30:00 2
3 2016-01-01 12:45:00 2
Then you can diff the last and first dates (assuming they are sorted - if not you can sort them)
>>> max_streak = groups.get_group(groups.size().idxmax())
>>> max_streak.iloc[-1].date_time - max_streak.iloc[0].date_time
Timedelta('0 days 00:30:00')
If value can repeat in other streaks you can groupby using:
groups = df.groupby((df.value != df.value.shift()).cumsum())
Update: Maximum duration of any streak
>>> groups = df.groupby((df.value != df.value.shift()).cumsum())
>>> last = groups.last()
>>> max_duration = (last.date_time - groups.first().date_time).nlargest(1)
>>> max_duration.iat[0]
Timedelta('0 days 00:30:00')
>>> last.loc[max_duration.index].value.iat[0]
2
You could use pd.pivot_table to get the minimum and maximum datetime value for each value, then calculate the duration between them and extract the longest.
import pandas as pd
d = {'date_time': ['2016-01-01 12:00:00', '2016-01-01 12:15:00',
'2016-01-01 12:30:00', '2016-01-01 12:45:00',
'2016-01-01 13:00:00', '2016-01-01 13:15:00',
'2016-01-01 13:30:00', '2016-01-01 13:45:00'],
'value': [1,2,2,2,4,5,5,7]}
df = pd.DataFrame(data=d)
df['date_time'] = pd.to_datetime(df['date_time'])
df_pivot = pd.pivot_table(df, index='value', values='date_time', aggfunc=['min', 'max'])
df_pivot['duration'] = df_pivot.iloc[:, 1] - df_pivot.iloc[:, 0]
print(df_pivot[df_pivot['duration'] == df_pivot['duration'].max()])
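As a hedged tweak for the EDIT case (the same value recurring in separate streaks), pivot on a run id built with the same shift/cumsum trick shown above, rather than on the raw value:
df['run'] = df['value'].ne(df['value'].shift()).cumsum()
df_pivot = pd.pivot_table(df, index='run', values='date_time', aggfunc=['min', 'max'])
df_pivot['duration'] = df_pivot.iloc[:, 1] - df_pivot.iloc[:, 0]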

Python Pandas: detecting frequency of time series

Assume I have loaded time series data from SQL or CSV (not created in Python); the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, freq is None. I am wondering how I can detect the frequency of this series and set it as the index's freq. If possible, I would like this to work for data that isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to turn that into a frequency format the Series can actually use.
It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix.freq = None   # drop the known freq to simulate loaded data
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If it is not continuous, pandas.infer_freq will return None. Similar to what has already been proposed, another alternative is the pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
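To address the "format readable by Series" part of the question: a minimal sketch (assuming the minimum Timedelta really is the sampling step) that converts it into an offset pandas accepts as a freq:
step = split_ix.to_series().diff().min()
offset = pd.tseries.frequencies.to_offset(step)   # e.g. <Hour>
regular = pd.Series(1, index=split_ix).asfreq(offset)   # gaps become NaN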
Maybe try taking the differences of the time index and using the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
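A possible follow-up, assuming the goal is to expose the gaps: reindex onto the reconstructed range so every missing hour becomes an explicit NaN row:
df_full = df.reindex(full_rng)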
The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of ns. To get a sampling frequency in Hz, assuming the step is in ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)  # e.g. a 1-hour step gives ~0.000278 Hz

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

How can I check whether a numpy datetime is between time1 and time2 (times of day, without dates)?
Say I have a series of datetimes; I want to check each one's weekday, and whether it falls between 13:00 and 13:30. For example,
2014-03-05 22:55:00
is a Wednesday and is not between 13:00 and 13:30.
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]
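For a single scalar value rather than a whole index, a minimal sketch (an assumption about the use case; pandas counts Monday as 0, so Wednesday is 2):
import datetime
import numpy as np
import pandas as pd

dt64 = np.datetime64('2014-03-05 22:55:00')
ts = pd.Timestamp(dt64)
is_wednesday = ts.weekday() == 2                                        # True
in_window = datetime.time(13, 0) <= ts.time() <= datetime.time(13, 30)  # False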
