Pandas: get observations by timestamp - python

I have a list of dynamic values (e.g. observations). It records all value changes of an entity (e.g. a display).
df
+----+---------------------+-----------------+---------+
| | time | display_index | value |
|----+---------------------+-----------------+---------|
| 0 | 2017-11-06 13:00:00 | 1 | val1 |
| 1 | 2017-11-06 14:00:00 | 1 | val2 |
| 2 | 2017-11-06 15:00:00 | 1 | val1 |
| 3 | 2017-11-06 13:30:00 | 2 | val3 |
| 4 | 2017-11-06 14:05:00 | 2 | val4 |
| 5 | 2017-11-06 15:30:00 | 2 | val1 |
+----+---------------------+-----------------+---------+
Now I have a second list of timestamps, and I'm interested in the value that each display was showing at that time. Note that the first timestamp (13:00) for display_index 2 is before any value is even known for that display (its first record is at 13:30).
df_times
+----+---------------------+-----------------+
| | time | display_index |
|----+---------------------+-----------------|
| 0 | 2017-11-06 13:20:00 | 1 |
| 1 | 2017-11-06 13:40:00 | 1 |
| 2 | 2017-11-06 13:00:00 | 2 |
| 3 | 2017-11-06 14:00:00 | 2 |
+----+---------------------+-----------------+
I tried calculating the time difference between the two timestamps and choosing the observation with the minimum value for that difference:
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))
+----+---------------------+-----------------+---------------------+---------+-----------+
| | time | display_index | time_measured | value | seconds |
|----+---------------------+-----------------+---------------------+---------+-----------|
| 0 | 2017-11-06 13:00:00 | 2 | 2017-11-06 13:30:00 | val3 | 1800 |
| 1 | 2017-11-06 13:20:00 | 1 | 2017-11-06 13:00:00 | val1 | 1200 |
| 2 | 2017-11-06 13:40:00 | 1 | 2017-11-06 14:00:00 | val2 | 1200 |
| 3 | 2017-11-06 14:00:00 | 2 | 2017-11-06 14:05:00 | val4 | 300 |
+----+---------------------+-----------------+---------------------+---------+-----------+
The problem is that the last value for each of display 1 and display 2 is wrong, since those displays were still showing a different value at that time. It should be val1 for display 1 and val3 for display 2. What I'm actually looking for is the observation that was last seen before the timestamp. How can I do this?
Here's the code that I used:
import pandas as pd
from tabulate import tabulate
import math
values = [("2017-11-06 13:00", 1, 'val1'),
("2017-11-06 14:00", 1, 'val2'),
("2017-11-06 15:00", 1, 'val1'),
("2017-11-06 13:30", 2, 'val3'),
("2017-11-06 14:05", 2, 'val4'),
("2017-11-06 15:30", 2, 'val1'),
]
labels = ['time', 'display_index', 'value']
df = pd.DataFrame.from_records(values, columns=labels)
df['time'] = pd.to_datetime(df['time'])
print(tabulate(df, headers='keys', tablefmt='psql'))
values = [("2017-11-06 13:20", 1),
("2017-11-06 13:40", 1),
("2017-11-06 13:00", 2),
("2017-11-06 14:00", 2),
]
labels = ['time', 'display_index']
df_times = pd.DataFrame.from_records(values, columns=labels)
df_times['time'] = pd.to_datetime(df_times['time'])
print(tabulate(df_times, headers='keys', tablefmt='psql'))
df_merged = df_times.merge(df, on='display_index', how='outer', suffixes=['','_measured'])
df_merged['seconds'] = (df_merged.time_measured - df_merged.time).astype('timedelta64[s]')
df_merged['seconds'] = df_merged['seconds'].apply(math.fabs)
df_merged = df_merged.sort_values('seconds').groupby(['time', 'display_index'], as_index=False).first()
print(tabulate(df_merged, headers='keys', tablefmt='psql'))

This is a perfect use case for pd.merge_asof
Note: I think you got the second row wrong.
# dataframes need to be sorted
df_times = df_times.sort_values(['time', 'display_index'])
df = df.sort_values(['time', 'display_index'])
pd.merge_asof(
    df_times, df.assign(time_measured=df.time),
    on='time', by='display_index', direction='forward'
).assign(seconds=lambda d: d.time_measured.sub(d.time).dt.total_seconds())
                 time  display_index value       time_measured  seconds
0 2017-11-06 13:00:00              2  val3 2017-11-06 13:30:00   1800.0
1 2017-11-06 13:20:00              1  val2 2017-11-06 14:00:00   2400.0
2 2017-11-06 13:40:00              1  val2 2017-11-06 14:00:00   1200.0
3 2017-11-06 14:00:00              2  val4 2017-11-06 14:05:00    300.0
Explanation
pd.merge_asof: for every row in the left argument, it attempts to locate a matching row in the right argument.
Since we passed direction='forward', it looks forward from the row in the left argument and finds the next value.
I needed a way to capture the time_measured column. Since merge_asof snags the time column, I assigned it as a different column that I can use as intended. The use of df.assign(time_measured=df.time) simply duplicates the column for later use.
I use assign again, this time to add a new column, seconds. When using assign, you can pass an array of the same length as the dataframe, a Series whose values will align on the index, or a callable that gets passed the dataframe that is calling assign. That is what I did here: the lambda takes the calling dataframe, finds the difference between the two date columns, and converts the resulting Series of timedeltas to seconds.
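If you instead want the observation that was last seen at or before each timestamp, as the question describes, the same call with direction='backward' (the default) should give it. A minimal sketch, reusing the sorted frames from above (not part of the answer's output):
pd.merge_asof(
    df_times, df.assign(time_measured=df.time),
    on='time', by='display_index', direction='backward'
).assign(seconds=lambda d: d.time.sub(d.time_measured).dt.total_seconds())
# the 13:00 row for display 2 comes back with NaT/NaN, since nothing was shown yet at that time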

Related

How to reindex a datetime-based multiindex in pandas

I have a dataframe that counts the number of times an event has occurred per user per day. Users may have 0 events per day and (since the table is an aggregate from a raw event log) rows with 0 events are missing from the dataframe. I would like to add these missing rows and group the data by week so that each user has one entry per week (including 0 if applicable).
Here is an example of my input:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({
    "person_id": np.arange(3).repeat(5),
    "date": pd.date_range("2022-01-01", "2022-01-15", freq="d"),
    "event_count": np.random.randint(1, 7, 15),
})
# end of each week
# Note: week 2022-01-23 is not in df, but should be part of the result
desired_index = pd.to_datetime(["2022-01-02", "2022-01-09", "2022-01-16", "2022-01-23"])
df
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-01 00:00:00 | 4 |
| 1 | 0 | 2022-01-02 00:00:00 | 5 |
| 2 | 0 | 2022-01-03 00:00:00 | 3 |
| 3 | 0 | 2022-01-04 00:00:00 | 5 |
| 4 | 0 | 2022-01-05 00:00:00 | 5 |
| 5 | 1 | 2022-01-06 00:00:00 | 2 |
| 6 | 1 | 2022-01-07 00:00:00 | 3 |
| 7 | 1 | 2022-01-08 00:00:00 | 3 |
| 8 | 1 | 2022-01-09 00:00:00 | 3 |
| 9 | 1 | 2022-01-10 00:00:00 | 5 |
| 10 | 2 | 2022-01-11 00:00:00 | 4 |
| 11 | 2 | 2022-01-12 00:00:00 | 3 |
| 12 | 2 | 2022-01-13 00:00:00 | 6 |
| 13 | 2 | 2022-01-14 00:00:00 | 5 |
| 14 | 2 | 2022-01-15 00:00:00 | 2 |
This is what my desired result looks like:
| | person_id | level_1 | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 0 | 2022-01-16 00:00:00 | 0 |
| 3 | 0 | 2022-01-23 00:00:00 | 0 |
| 4 | 1 | 2022-01-02 00:00:00 | 0 |
| 5 | 1 | 2022-01-09 00:00:00 | 11 |
| 6 | 1 | 2022-01-16 00:00:00 | 5 |
| 7 | 1 | 2022-01-23 00:00:00 | 0 |
| 8 | 2 | 2022-01-02 00:00:00 | 0 |
| 9 | 2 | 2022-01-09 00:00:00 | 0 |
| 10 | 2 | 2022-01-16 00:00:00 | 20 |
| 11 | 2 | 2022-01-23 00:00:00 | 0 |
I can produce it using:
(
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .groupby("person_id").apply(
        lambda df: (
            df
            .reset_index(drop=True, level=0)
            .reindex(desired_index, fill_value=0))
    )
    .reset_index()
)
However, according to the reindex docs, I should be able to pass level=1 as a kwarg directly, without having to do another groupby. But when I do this, I get an "inner join" of the two indices instead of an "outer join":
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(desired_index, level=1)
    .reset_index()
)
| | person_id | date | event_count |
|---:|------------:|:--------------------|--------------:|
| 0 | 0 | 2022-01-02 00:00:00 | 9 |
| 1 | 0 | 2022-01-09 00:00:00 | 13 |
| 2 | 1 | 2022-01-09 00:00:00 | 11 |
| 3 | 1 | 2022-01-16 00:00:00 | 5 |
| 4 | 2 | 2022-01-16 00:00:00 | 20 |
Why is that, and how am I supposed to use df.reindex correctly?
I have found a similar SO question on reindexing a multi-index level, but the accepted answer there uses df.unstack, which doesn't work for me, because not every level of my desired index occurs in my current index (and vice versa).
You need to reindex by both levels of the MultiIndex:
mux = pd.MultiIndex.from_product([df['person_id'].unique(), desired_index],
                                 names=['person_id', 'date'])
result = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .reindex(mux, fill_value=0)
    .reset_index()
)
print (result)
person_id date event_count
0 0 2022-01-02 9
1 0 2022-01-09 13
2 0 2022-01-16 0
3 0 2022-01-23 0
4 1 2022-01-02 0
5 1 2022-01-09 11
6 1 2022-01-16 5
7 1 2022-01-23 0
8 2 2022-01-02 0
9 2 2022-01-09 0
10 2 2022-01-16 20
11 2 2022-01-23 0
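For reference, the unstack approach from the linked question can also be made to work here if you reindex the rows after unstacking. A sketch (not part of the answer above), reusing df and desired_index from the question; note that event_count comes back as float because of the NaN fill:
result_alt = (
    df
    .groupby(["person_id", pd.Grouper(key="date", freq="w")]).sum()
    .unstack("person_id")                    # person_id moves to the columns
    .reindex(desired_index).rename_axis("date")  # add the missing weeks as NaN rows
    .fillna(0)
    .stack("person_id")                      # back to a (date, person_id) index
    .swaplevel().sort_index()
    .reset_index()
)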

Create a counter of date values for a given max-min interval

Be the following python pandas DataFrame:
| date | column_1 | column_2 |
| ---------- | -------- | -------- |
| 2022-02-01 | val | val2 |
| 2022-02-03 | val1 | val |
| 2022-02-01 | val | val3 |
| 2022-02-04 | val2 | val |
| 2022-02-27 | val2 | val4 |
I want to create a new DataFrame with one row for each date between the minimum and maximum date in the original DataFrame. The counter column contains the number of rows with that date.
| date | counter |
| ---------- | -------- |
| 2022-02-01 | 2 |
| 2022-02-02 | 0 |
| 2022-02-03 | 1 |
| 2022-02-04 | 1 |
| 2022-02-05 | 0 |
...
| 2022-02-26 | 0 |
| 2022-02-27 | 1 |
Count the dates first and remove duplicates using drop_duplicates. To fill in the intermediate dates, pandas has the asfreq function for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
df['counts'] = df['date'].map(df['date'].value_counts())
df = df.drop_duplicates(subset='date', keep="first")
df.date = pd.to_datetime(df.date)
df = df.set_index('date').asfreq('D').reset_index()
df = df.fillna(0)
print(df)
This gives:
date counts
0 2022-02-01 2.0
1 2022-02-02 0.0
2 2022-02-03 1.0
3 2022-02-04 1.0
4 2022-02-05 0.0
5 2022-02-06 0.0
6 2022-02-07 0.0
7 2022-02-08 0.0
8 2022-02-09 0.0
9 2022-02-10 0.0
10 2022-02-11 0.0
11 2022-02-12 0.0
12 2022-02-13 0.0
13 2022-02-14 0.0
14 2022-02-15 0.0
15 2022-02-16 0.0
16 2022-02-17 0.0
17 2022-02-18 0.0
18 2022-02-19 0.0
19 2022-02-20 0.0
20 2022-02-21 0.0
21 2022-02-22 0.0
22 2022-02-23 0.0
23 2022-02-24 0.0
24 2022-02-25 0.0
25 2022-02-26 0.0
26 2022-02-27 1.0
There are many ways to do this. Here is mine. It is probably not optimal, but at least I am not iterating over rows or using .apply, both of which are sure recipes for slow solutions.
import pandas as pd
import numpy as np
import datetime
# A minimal example (you should provide such an example next time)
df = pd.DataFrame({'date': pd.to_datetime(['2022-02-01', '2022-02-03', '2022-02-01', '2022-02-04', '2022-02-27']),
                   'c1': ['val', 'val1', 'val', 'val2', 'val2'],
                   'c2': range(5)})
# A delta of 1 day, to create the list of dates
dt = datetime.timedelta(days=1)
# Result dataframe, with a count of 0 for now
res = pd.DataFrame({'date': df.date.min() + dt * np.arange((df.date.max() - df.date.min()).days + 1),
                    'count': 0})
# Count dates
countDates = df[['date', 'c1']].groupby('date').agg('count')
# Merge the counted dates with the target array, filling missing values with 0
res['count'] = res.merge(countDates, on='date', how='left').fillna(0)['c1']
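A more compact variant of the same idea (a sketch, not from either answer above), assuming df['date'] is already datetime: count the dates with value_counts, then reindex over the full daily range.
counts = df['date'].value_counts()
full_range = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
res = (counts.reindex(full_range, fill_value=0)   # missing days get a count of 0
             .rename_axis('date')
             .reset_index(name='counter'))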

How can I set last rows of a dataframe based on condition in Python?

I have one DataFrame, df1, with two columns. The first column, 'col1', is a datetime column, and the second is an int column with only two possible values (0 or 1). Here is an example of the DataFrame:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 1 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
As you can see, the datetimes are sorted in ascending order. What I would like is: for each different date (in this example there are two different dates, 2020-01-01 and 2020-01-02, each with different times), keep the first 1 value and set the previous and following values on that date to 0. So the resulting dataframe would be:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
How can I do it in Python?
Use:
df['col1'] = pd.to_datetime(df.col1)
# the cumulative sum of col2 within each day equals 1 from the day's first 1
# up to (but not including) its second 1
mask = df.groupby(df.col1.dt.date)['col2'].cumsum().eq(1)
# keep col2 where the mask holds, set everything else to 0
df.col2.where(mask, 0, inplace = True)
Output:
>>> df
                  col1  col2
0  2020-01-01 10:00:00     0
1  2020-01-01 11:00:00     1
2  2020-01-01 12:00:00     0
3  2020-01-02 11:00:00     0
4  2020-01-02 12:00:00     1
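Note that under copy-on-write (the default in pandas 3.0), calling where(..., inplace=True) on the df.col2 accessor may no longer write back into the DataFrame. A sketch of the same logic with a plain assignment instead:
df['col1'] = pd.to_datetime(df['col1'])
mask = df.groupby(df['col1'].dt.date)['col2'].cumsum().eq(1)
df['col2'] = df['col2'].where(mask, 0)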

Iterate over each day, calculate average price of first and last 3 hours and take difference of those averages in Python

I have a data frame that looks like this,
+---------------------+------------+----------+-------+
| Timestamp | Date | Time | Price |
+---------------------+------------+----------+-------+
| 2017-01-01 00:00:00 | 2017-01-01 | 00:00:00 | 20 |
| 2017-01-01 00:01:00 | 2017-01-01 | 00:01:00 | 25 |
| 2017-01-01 00:02:00 | 2017-01-01 | 00:02:00 | 15 |
| 2017-01-01 00:03:00 | 2017-01-01 | 00:03:00 | 20 |
| ... | | | |
| 2017-01-01 00:20:00 | 2017-01-01 | 00:20:00 | 25 |
| 2017-01-01 00:21:00 | 2017-01-01 | 00:21:00 | 15 |
| 2017-01-01 00:22:00 | 2017-01-01 | 00:22:00 | 10 |
| 2017-01-01 00:23:00 | 2017-01-01 | 00:23:00 | 25 |
| 2017-02-01 00:00:00 | 2017-02-01 | 00:00:00 | 10 |
| 2017-02-01 00:01:00 | 2017-02-01 | 00:01:00 | 25 |
| 2017-02-01 00:02:00 | 2017-02-01 | 00:02:00 | 10 |
| 2017-02-01 00:03:00 | 2017-02-01 | 00:03:00 | 25 |
| ... | | | |
| 2017-02-01 00:20:00 | 2017-02-01 | 00:20:00 | 15 |
| 2017-02-01 00:21:00 | 2017-02-01 | 00:21:00 | 10 |
| 2017-02-01 00:22:00 | 2017-02-01 | 00:22:00 | 25 |
| 2017-02-01 00:23:00 | 2017-02-01 | 00:23:00 | 10 |
+---------------------+------------+----------+-------+
Timestamp datetime64[ns]
Date datetime64[ns]
Time object
Price float64
and I'm trying to calculate the difference between the average price of the first 3 hours and the last 3 hours of each day.
The design in my mind is to do something like this:
For every unique date in Date
    a = avg(price.first(3))
    b = avg(price.last(3))
    dif = a - b
    append to another dataset
---------EDIT----------
and the expected result is:
+------------+---------+
| Date | Diff |
+------------+---------+
| 2017-01-01 | 3.33334 |
| 2017-01-02 | 0 |
+------------+---------+
My real query will be in seconds rather than hours (I didn't want to put 120 rows in here just to show 2 minutes of the data, so hours stand in for seconds).
Also, there can be some missing rows in the dataset, so if I just take price.first(3600) it can overshoot for some days, right? If I can solve this using df.Timestamp.datetime.hour, that would be more precise, I think.
I really can't get my head around how to get the first and last 3 Price values for every day. Any help will be much appreciated! Thank you so much in advance!
As you showed, the hours are ordered, so you can group by day and get the list of prices for that day's hours; then you can apply a function to compute the difference. You could try something like this:
import pandas as pd
from statistics import mean

def getavg(ls):
    mean3first = mean(ls[:3])
    mean3last = mean(ls[len(ls)-3:])
    return mean3first - mean3last

diff_means = df.groupby(['Date']).agg(list)['Price'].apply(getavg).reset_index()
diff_means.columns = ['Date', 'Diff']
print(diff_means)
I'm not entirely sure what format you want the result in, but I found a solution that I find pretty elegant:
import numpy as np

unique_dates = df.Date.unique()
new_df = pd.DataFrame()
for u_date in unique_dates:
    first_3 = np.mean(df[df.Date == u_date].reset_index().head(3).Price)
    last_3 = np.mean(df[df.Date == u_date].reset_index().tail(3).Price)
    # note: DataFrame.append was removed in pandas 2.0; on newer versions,
    # collect these one-row frames in a list and pd.concat them instead
    new_df = new_df.append(
        pd.DataFrame([[u_date, last_3 - first_3]], columns=['Date', 'PriceDiff']))
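A loop-free variant (a sketch, not from either answer above), assuming the rows are already sorted by Timestamp within each day:
first3 = df.groupby('Date')['Price'].apply(lambda s: s.head(3).mean())
last3 = df.groupby('Date')['Price'].apply(lambda s: s.tail(3).mean())
# first-3 average minus last-3 average, one row per date
diff_df = (first3 - last3).reset_index(name='Diff')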

Pandas resampling with multi-index

I have a DataFrame containing information about stores. It looks like the following:
date | store_id | x
2019-01-01| 1 | 5
2019-01-01| 2 | 1
2019-01-05| 1 | 3
...
The multi-index is [date,store_id]. Note that the dates are not unique.
I want to resample the data at an hourly level, but only for the days in the date column i.e. I don't want to fill in every hour in between. Furthermore, I want to fill in the value of x for every hour that is created. So the desired result for the above example would be
date | store_id | x
2019-01-01 00:00:00| 1 | 5
2019-01-01 01:00:00| 1 | 5
2019-01-01 02:00:00| 1 | 5
...
2019-01-01 23:00:00| 1 | 5
2019-01-01 00:00:00| 2 | 1
2019-01-01 01:00:00| 2 | 1
2019-01-01 02:00:00| 2 | 1
...
2019-01-01 23:00:00| 2 | 1
2019-01-05 00:00:00| 1 | 3
2019-01-05 01:00:00| 1 | 3
2019-01-05 02:00:00| 1 | 3
...
2019-01-05 23:00:00| 1 | 3
Define the following "replication" function:
def repl(row):
    return pd.DataFrame({'date': pd.date_range(start=row.date, periods=24, freq='H'),
                         'store_id': row.store_id, 'x': row.x})
It "replicates" the source row (parameter), returning a sequence of rows
with the given date, for consecutive hours.
Then:
reset the index, to have all columns as "normal" columns,
apply this function (to each row),
convert the resulting Series of DataFrames into a list (of DataFrames),
concatenate the result.
The code to do it is:
pd.concat(df.reset_index().apply(repl, axis=1).tolist(), ignore_index=True)
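A vectorized sketch of the same idea (not from the answer above), assuming the date level holds midnight timestamps: repeat each row 24 times and add an hourly offset.
out = df.reset_index()                   # date and store_id become regular columns
out = out.loc[out.index.repeat(24)]      # 24 copies of every source row
hours = out.groupby(level=0).cumcount()  # 0..23 within each original row
out['date'] = out['date'] + pd.to_timedelta(hours, unit='h')
out = out.reset_index(drop=True)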
