Define start and end date of several DataFrames with pandas - python

I have many DataFrames with different period lengths. I am trying to create a for loop that applies a specific start and end date to all of those DataFrames.
Here is a simple example:
df1:
Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 0 0
2 2021-01-03 1 0
3 2021-01-04 2 2
4 2021-01-05 1 4
5 2021-01-06 -1 -2
df2:
Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 1 2
2 2021-01-03 -1 3
3 2021-01-04 1 -1
4 2021-01-05 4 2
I want to define a specific start and end date as:
start = pd.to_datetime('2021-01-02')
end = pd.to_datetime('2021-01-04')
So far, I only figured out how to define the period for one DataFrame:
df1.loc[(df1['Dates'] >= start) & (df1['Dates'] <= end)]
Is there an easy way to loop over all the DataFrames at once and apply the start and end dates?
For reproducibility:
import pandas as pd
df1 = pd.DataFrame({
    'Dates': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06'],
    'ID1': [0, 0, 1, 2, 1, -1],
    'ID2': [1, 0, 0, 2, 4, -2]})
df1['Dates'] = pd.to_datetime(df1['Dates'])
df2 = pd.DataFrame({
    'Dates': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05'],
    'ID1': [0, 1, -1, 1, 4],
    'ID2': [1, 2, 3, -1, 2]})
df2['Dates'] = pd.to_datetime(df2['Dates'])

You can store your dataframes in a list, apply your .loc filter to each of them with a list comprehension, and get back a new list of the filtered dataframes:
# Create a list with your dataframes
dfs = [df1 , df2]
# Thresholds
start = pd.to_datetime('2021-01-02')
end = pd.to_datetime('2021-01-04')
# Filter all of them and store back
filtered_dfs = [df.loc[(df['Dates'] >= start) & (df['Dates'] <= end)] for df in dfs]
Result:
>>> print(filtered_dfs)
[ Dates ID1 ID2
1 2021-01-02 0 0
2 2021-01-03 1 0
3 2021-01-04 2 2,
Dates ID1 ID2
1 2021-01-02 1 2
2 2021-01-03 -1 3
3 2021-01-04 1 -1]
>>> print(dfs)
[ Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 0 0
2 2021-01-03 1 0
3 2021-01-04 2 2
4 2021-01-05 1 4
5 2021-01-06 -1 -2,
Dates ID1 ID2
0 2021-01-01 0 1
1 2021-01-02 1 2
2 2021-01-03 -1 3
3 2021-01-04 1 -1
4 2021-01-05 4 2]
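If you would rather keep the results keyed by name than by position, a dict comprehension works the same way; a small variant sketch of the answer above:
# same filter, but keyed by name so each result is easy to look up
named = {'df1': df1, 'df2': df2}
filtered = {name: df.loc[(df['Dates'] >= start) & (df['Dates'] <= end)]
            for name, df in named.items()}
print(filtered['df1'])  # the filtered version of df1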

Related

Combine consecutive rows of unsorted dates (one day before, one day after, or the same day) into one [duplicate]

I would like to combine rows with the same id, consecutive dates, and the same feature values.
I have the following dataframe:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-15 1 1
1 A 2020-01-16 2020-01-30 1 1
2 A 2020-01-31 2020-02-15 0 1
3 A 2020-07-01 2020-07-15 0 1
4 B 2020-01-31 2020-02-15 0 0
5 B 2020-02-16 NaT 0 0
And the expected result is:
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
I have been trying answers from other posts, but they don't really match my use case.
Thanks in advance!
You can approach this by:
Getting the day diff of each pair of consecutive entries within the same group, by subtracting the current Start from the previous End within the group using GroupBy.shift().
Setting a group number group_no such that a new group number is issued whenever the day diff from the previous entry within the group is greater than 1.
Then grouping by Id and group_no and aggregating the Start and End dates for each group using .groupby() and .agg().
As there is NaT data within the grouping, we need to specify dropna=False during grouping. Furthermore, to get the last entry of End within each group, we use x.iloc[-1] instead of 'last'.
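For reproducibility, here is a reconstruction of the question's frame (my reading of the table above), so the snippet below runs standalone:
import pandas as pd

# reconstructed from the question's table; End of the last row is missing (NaT)
df = pd.DataFrame({
    'Id': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Start': ['2020-01-01', '2020-01-16', '2020-01-31', '2020-07-01',
              '2020-01-31', '2020-02-16'],
    'End': ['2020-01-15', '2020-01-30', '2020-02-15', '2020-07-15',
            '2020-02-15', None],
    'Feature1': [1, 1, 0, 0, 0, 0],
    'Feature2': [1, 1, 1, 1, 0, 0]})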
# convert to datetime format if not already in datetime
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])

# sort by columns `Id` and `Start` if not already in this sequence
df = df.sort_values(['Id', 'Start'])

# gap in days between the current Start and the previous End within the group
day_diff = (df['Start'] - df['End'].groupby([df['Id'], df['Feature1'], df['Feature2']]).shift()).dt.days

# start a new group when there is no previous entry or the gap exceeds one day
group_no = (day_diff.isna() | day_diff.gt(1)).cumsum()

df_out = (df.groupby(['Id', group_no], dropna=False, as_index=False)
            .agg({'Id': 'first',
                  'Start': 'first',
                  'End': lambda x: x.iloc[-1],
                  'Feature1': 'first',
                  'Feature2': 'first',
                  }))
Result:
print(df_out)
Id Start End Feature1 Feature2
0 A 2020-01-01 2020-01-30 1 1
1 A 2020-01-31 2020-02-15 0 1
2 A 2020-07-01 2020-07-15 0 1
3 B 2020-01-31 NaT 0 0
Extract months from both date columns:
df['sMonth'] = pd.to_datetime(df['Start']).dt.month
df['eMonth'] = pd.to_datetime(df['End']).dt.month
Now group the data frame by ['Id','Feature1','Feature2','sMonth','eMonth'] and aggregate:
(df.groupby(['Id', 'Feature1', 'Feature2', 'sMonth', 'eMonth'])
   .agg({'Start': 'min', 'End': 'max'})
   .reset_index()
   .drop(['sMonth', 'eMonth'], axis=1))
Result
Id Feature1 Feature2 Start End
0 A 0 1 2020-01-31 2020-02-15
1 A 0 1 2020-07-01 2020-07-15
2 A 1 1 2020-01-01 2020-01-30
3 B 0 0 2020-01-31 2020-02-15

Calculate time blocked within a timerange with pandas

I have a list of products produced or processes finished like this one:
Name       Timestamp Start      Timestamp Stop
Product 1  2021-01-01 15:15:00  2021-01-01 15:37:00
Product 1  2021-01-01 15:30:00  2021-01-01 15:55:00
Product 1  2021-01-02 15:05:00  2021-01-02 15:22:00
Product 1  2021-01-03 15:45:00  2021-01-03 15:55:00
...        ...                  ...
What I want to do is calculate the amount of time in which no product/process happened within a given timeframe, for example from 15:00 to 16:00, and more specifically per day.
The output could be "amount of idle minutes/time where nothing happened" or "percentage of idle time".
import pandas as pd
import datetime

df = pd.read_csv('example_data.csv')

# generate list of products
listOfProducts = df['NAME'].drop_duplicates().tolist()

# define timeframe for each day
startTime = datetime.time(15, 0)
stopTime = datetime.time(16, 0)

# define daterange to look for
startDay = datetime.datetime(2021, 1, 1)
stopDay = datetime.datetime(2021, 1, 5)

# do it for every product
for i in listOfProducts:
    # filter dataframe by product
    df_product = df[df['NAME'] == i]
    # sort dataframe by start
    df_product = df_product.sort_values(by='started')
    # ... how to proceed?
The wanted output should look like this or similar:
Day         Time idle
2021-01-01  00:20:00
2021-01-02  00:43:00
2021-01-03  00:50:00
...         ...
Here are some notes that are important:
Time ranges of products can overlap each other; in this case, the overlap should only count once.
Time ranges of products can cross the borders (15:00 or 16:00 in this case); in this case, only the time within the borders should be counted.
I struggle to implement this in a pandas way, because these border cases prevent me from simply adding up Timedeltas.
In the past, I solved this issue by iterating row by row and adding up the minutes or seconds. But I'm sure there is a more pandas-like way, maybe with the .groupby() function?
Input data:
>>> df
Name Start Stop
0 Product 1 2021-01-01 14:49:00 2021-01-01 15:04:00 # OK (overlap 4')
1 Product 1 2021-01-01 15:15:00 2021-01-01 15:37:00 # OK
2 Product 1 2021-01-01 15:30:00 2021-01-01 15:55:00 # OK
3 Product 1 2021-01-02 15:05:00 2021-01-02 15:22:00 # OK
4 Product 1 2021-01-03 15:45:00 2021-01-03 15:55:00 # OK
5 Product 1 2021-01-03 15:51:00 2021-01-03 16:23:00 # OK (overlap 9')
6 Product 1 2021-01-04 14:28:00 2021-01-04 17:12:00 # OK (overlap 60')
7 Product 1 2021-01-05 11:46:00 2021-01-05 13:40:00 # Out of bounds
8 Product 1 2021-01-05 17:20:00 2021-01-05 19:11:00 # Out of bounds
First, remove data out of bounds (7 & 8):
import datetime
START = datetime.time(15)
STOP = datetime.time(16)
df1 = df.loc[(df["Start"].dt.floor(freq="H").dt.time <= START)
             & (START <= df["Stop"].dt.floor(freq="H").dt.time),
             ["Start", "Stop"]]
Extract the minute of the Start and Stop datetimes. If the process began before 15:00, set the minute to 0, because we only want to keep the overlapping part. If the process ended after 16:00, set the minute to 59.
import numpy as np

df1["m1"] = np.where(df1["Start"].dt.time > START,
                     df1["Start"].sub(df1["Start"].dt.floor(freq="H"))
                                 .dt.seconds // 60, 0)
df1["m2"] = np.where(df1["Stop"].dt.time < STOP,
                     df1["Stop"].sub(df1["Stop"].dt.floor(freq="H"))
                                .dt.seconds // 60, 59)
>>> df1
Start Stop m1 m2
0 2021-01-01 14:49:00 2021-01-01 15:04:00 0 4
1 2021-01-01 15:15:00 2021-01-01 15:37:00 15 37
2 2021-01-01 15:30:00 2021-01-01 15:55:00 30 55
3 2021-01-02 15:05:00 2021-01-02 15:22:00 5 22
4 2021-01-03 15:45:00 2021-01-03 15:55:00 45 55
5 2021-01-03 15:51:00 2021-01-03 16:23:00 51 59
6 2021-01-04 14:28:00 2021-01-04 17:12:00 0 59
Create an empty len(df1) × 60 table to store per-minute process usage:
out = pd.DataFrame(0, index=df1.index, columns=pd.RangeIndex(60))
Fill the out dataframe:
for idx, (i1, i2) in df1[["m1", "m2"]].iterrows():
    out.loc[idx, i1:i2] = 1
>>> out
0 1 2 3 4 5 6 ... 53 54 55 56 57 58 59
0 1 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 # 4'
1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 ... 1 1 1 0 0 0 0
3 0 0 0 0 0 1 1 ... 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 ... 1 1 1 0 0 0 0
5 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 # full hour
[7 rows x 60 columns]
Finally, compute the idle minutes (a minute is busy when at least one process covers it, so overlapping processes count only once):
>>> 60 - out.groupby(df1["Start"].dt.date).sum().gt(0).sum(axis="columns")
Start
2021-01-01    14
2021-01-02    42
2021-01-03    45
2021-01-04     0
dtype: int64
Note: you have to decide whether the Stop datetime is inclusive or not.
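For reference, here is a minimal second-resolution sketch (same Start/Stop column names assumed) that clips each interval to the window and merges overlaps per day before summing; an alternative illustration, not the method above:
import pandas as pd

def idle_time(df, window_start="15:00:00", window_stop="16:00:00"):
    # clip every interval to the [15:00, 16:00] window of its own day
    day = df["Start"].dt.normalize()
    s = df["Start"].clip(day + pd.Timedelta(window_start),
                         day + pd.Timedelta(window_stop))
    e = df["Stop"].clip(day + pd.Timedelta(window_start),
                        day + pd.Timedelta(window_stop))
    tmp = (pd.DataFrame({"day": day.dt.date, "s": s, "e": e})
             .query("s < e")                # drop intervals fully out of bounds
             .sort_values(["day", "s"]))
    # a row opens a new block when it starts after the running max end so far
    run_end = tmp.groupby("day")["e"].transform(lambda x: x.cummax().shift())
    block = (run_end.isna() | tmp["s"].gt(run_end)).cumsum()
    busy = tmp.groupby(["day", block]).agg(s=("s", "min"), e=("e", "max"))
    busy_per_day = (busy["e"] - busy["s"]).groupby(level="day").sum()
    window = pd.Timedelta(window_stop) - pd.Timedelta(window_start)
    # days with no in-window activity are fully idle
    return (window - busy_per_day).reindex(sorted(day.dt.date.unique()),
                                           fill_value=window)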

Zeroing only positive values for a specific date in a MultiIndex pandas dataframe

Here's the code showing what I want to do:
import pandas as pd
from numpy.random import randint
index = pd.MultiIndex.from_product(
    [['a', 'b'], pd.date_range('2021-01-01', periods=3)], names=['area', 'date']
)
df = pd.DataFrame({n: randint(-5, 5, 6) for n in ('foo', 'bar')}, index=index)

def zero_positives_on_date(df, dt):
    for area in df.index.levels[0]:
        for col in df.columns:
            if df.loc[pd.IndexSlice[area, dt], col] > 0:
                df.loc[pd.IndexSlice[area, dt], col] = 0
    return df
print(df)
print(zero_positives_on_date(df, pd.to_datetime('2021-01-02')))
How can I implement zero_positives_on_date using masks/broadcasts/indexing/etc rather than evil nested for loops?
The output of the above looks like this:
foo bar
area date
a 2021-01-01 0 -5
2021-01-02 1 0
2021-01-03 -1 -1
b 2021-01-01 2 3
2021-01-02 4 1
2021-01-03 -3 3
foo bar
area date
a 2021-01-01 0 -5
2021-01-02 0 0
2021-01-03 -1 -1
b 2021-01-01 2 3
2021-01-02 0 0
2021-01-03 -3 3
Try accessing rows via df.index.get_level_values('date'):
dt = pd.to_datetime('2021-01-02')
idx = df.index.get_level_values('date')
x = df[idx == dt]
x[x>0] = 0
df[idx == dt] = x
Using clip:
df.loc[idx == dt] = df.loc[idx == dt].clip(upper=0)
foo bar
area date
a 2021-01-01 4 -1
2021-01-02 0 -2
2021-01-03 2 -3
b 2021-01-01 1 -5
2021-01-02 0 -4
2021-01-03 -4 -2
Try this, using where and get_level_values:
dt = pd.to_datetime('2021-01-02')
mask = df.index.get_level_values('date') == dt
df[mask] = df[mask].where(df < 0, 0)
You can use get_locs to get the positional index for your date and overwrite those rows with .clip to zero out the positive values.
def zero_positives_on_date(df, date):
    df = df.copy()
    date_row_idx = df.index.get_locs((slice(None), date))
    df.iloc[date_row_idx] = df.iloc[date_row_idx].clip(upper=0)
    return df
print(df)
foo bar
area date
a 2021-01-01 0 -5
2021-01-02 1 0
2021-01-03 -1 -1
b 2021-01-01 2 3
2021-01-02 4 1
2021-01-03 -3 3
new_df = zero_positives_on_date(df, "2021-01-02")
print(new_df)
foo bar
area date
a 2021-01-01 0 -5
2021-01-02 0 0
2021-01-03 -1 -1
b 2021-01-01 2 3
2021-01-02 0 0
2021-01-03 -3 3
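Since the MultiIndex built with from_product is lexsorted, a pd.IndexSlice variant also works; a minimal sketch along the same lines as the answers above:
# slice every area at the one date, then clip positives to zero in place
idx = pd.IndexSlice
dt = pd.to_datetime('2021-01-02')
df.loc[idx[:, dt], :] = df.loc[idx[:, dt], :].clip(upper=0)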

Pandas: return df with repeated date ranges for each element of a list

I want to create a df with date range values.
I can create the df like this:
def create_df(start='2021-01-01', end='2022-12-31'):
    df = pd.DataFrame({"Date": pd.date_range(start, end)})
    return df
df = create_df()
This gives the following df:
Date
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
Now I want to create a second column A with elements from a list. There should be one df with repeated date values for each element of the list.
This is what I want
Date A
0 2021-01-01 1
1 2021-01-02 1
2 2021-01-03 1
.............
729 2022-12-31 1
730 2021-01-01 2
How can I create one df with the repeated date range for every element in the list?
Use itertools.product to repeat the date range for each value of the list:
from itertools import product

def create_df(start='2021-01-01', end='2022-12-31', L=[1, 2, 3]):
    df = pd.DataFrame(product(L, pd.date_range(start, end)), columns=['A', 'Date'])
    return df[['Date', 'A']]
df = create_df()
print (df)
Date A
0 2021-01-01 1
1 2021-01-02 1
2 2021-01-03 1
3 2021-01-04 1
4 2021-01-05 1
... ..
2185 2022-12-27 3
2186 2022-12-28 3
2187 2022-12-29 3
2188 2022-12-30 3
2189 2022-12-31 3
[2190 rows x 2 columns]
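For what it's worth, the same frame can also be built with a cross join (pandas >= 1.2); a short sketch assuming the same defaults:
import pandas as pd

dates = pd.DataFrame({'Date': pd.date_range('2021-01-01', '2022-12-31')})
vals = pd.DataFrame({'A': [1, 2, 3]})
# pair every value of A with every date; A varies slowest, as above
df = vals.merge(dates, how='cross')[['Date', 'A']]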

Rolling Look Forward Sum with Datetime Index in Pandas

I have multivariate time-series/panel data in the following simplified format:
id,date,event_ind
1,2014-01-01,0
1,2014-01-02,1
1,2014-01-03,1
2,2014-01-01,1
2,2014-01-02,1
2,2014-01-03,1
3,2014-01-01,0
3,2014-01-02,0
3,2014-01-03,1
For this simplified example, I would like the future 2-day sum of event_ind, grouped by id.
For some reason, adapting this example still gives me the "index is not monotonic" error: how to do forward rolling sum in pandas?
Here is my approach, which worked for backward-looking rolling by group before I adapted it:
df.sort_values(['id','date'], ascending=[True,True], inplace=True)
df.reset_index(drop=True, inplace=True)
df['date'] = pd.DatetimeIndex(df['date'])
df.set_index(['date'], drop=True, inplace=True)
rolling_forward_2_day = lambda x: x.iloc[::-1].rolling('2D').sum().shift(1).iloc[::-1]
df['future_2_day_total'] = df.groupby(['id'], sort=False)['event_ind'].transform(rolling_forward_2_day)
df.reset_index(drop=False, inplace=True)
Here is the expected result:
id date event_ind future_2_day_total
0 1 2014-01-01 0 2
1 1 2014-01-02 1 1
2 1 2014-01-03 1 0
3 2 2014-01-01 1 2
4 2 2014-01-02 1 1
5 2 2014-01-03 1 0
6 3 2014-01-01 0 1
7 3 2014-01-02 0 1
8 3 2014-01-03 1 0
Any tips on what I might be doing wrong or high-performance alternatives would be great!
EDIT:
One quick clarification: this example is simplified, and valid solutions need to be able to handle unevenly spaced/irregular time series, which is why rolling with a time-based index is used.
You can still use rolling here, but use it with the flag win_type='boxcar' and shift your data around before and after you sum:
df['future_day_2_total'] = (
    df.groupby('id').event_ind.shift(-1)
      .fillna(0).groupby(df.id).rolling(2, win_type='boxcar')
      .sum().shift(-1).fillna(0)
      .values  # groupby-rolling yields a MultiIndex, so realign positionally
)
id date event_ind future_day_2_total
0 1 2014-01-01 0 2.0
1 1 2014-01-02 1 1.0
2 1 2014-01-03 1 0.0
3 2 2014-01-01 1 2.0
4 2 2014-01-02 1 1.0
5 2 2014-01-03 1 0.0
6 3 2014-01-01 0 1.0
7 3 2014-01-02 0 1.0
8 3 2014-01-03 1 0.0
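If the dates really are irregular and you want a literal 2-day forward window rather than two rows, a cumulative-sum plus searchsorted sketch per group sidesteps the non-monotonic-index problem entirely. This is an illustration under stated assumptions (dates sorted within each id, half-open window (d, d + 2 days]), not the answer above:
import numpy as np
import pandas as pd

def forward_window_sum(g, days=2):
    # sum event_ind over the half-open window (date, date + days] per row
    t = g['date'].values
    cum = np.concatenate(([0], np.cumsum(g['event_ind'].values)))
    hi = np.searchsorted(t, t + np.timedelta64(days, 'D'), side='right')
    lo = np.searchsorted(t, t, side='right')  # strictly after the current date
    return pd.Series(cum[hi] - cum[lo], index=g.index)

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id', 'date'])
df['future_2_day_total'] = df.groupby('id', group_keys=False).apply(forward_window_sum)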
