I have two dataframes (simple examples shown below):
df1:

time column
2022-01-01 00:00:00
2022-01-01 00:15:00
2022-01-01 00:30:00
2022-01-01 00:45:00
2022-01-02 00:00:00
2022-01-02 00:15:00
2022-01-02 00:30:00
2022-01-02 00:45:00

df2:

time column          ID column  Value
2022-01-01 00:00:00  1          10
2022-01-01 00:30:00  1          9
2022-01-02 00:30:00  1          5
2022-01-02 00:45:00  1          15
2022-01-01 00:00:00  2          6
2022-01-01 00:15:00  2          2
2022-01-02 00:45:00  2          7
df1 contains every timestamp I am interested in. df2 contains data sorted by timestamp and ID. What I need to do is, for each unique ID, add every timestamp from df1 that is missing from df2, with zero in the Value column.
This is the outcome I'm interested in:
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R.
Any help is greatly appreciated.
Try:
x = (
    df2.groupby("ID column")
    # Outer-merge each ID's rows with the full timestamp list; new rows get 0.
    .apply(lambda x: x.merge(df1, how="outer").fillna(0))
    # fillna also zeroed the merged "ID column", so drop it and recover the
    # real ID from the group key left in the index.
    .drop(columns="ID column")
    .droplevel(1)
    .reset_index()
    .sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
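For comparison, the same result can be built without groupby.apply by forming the full (timestamp, ID) grid first. A minimal sketch, assuming pandas >= 1.2 for how="cross" and the column names used above:

ids = df2[["ID column"]].drop_duplicates()
# Cross join: every timestamp paired with every ID.
grid = df1.merge(ids, how="cross")
df3 = (
    grid.merge(df2, on=["time column", "ID column"], how="left")
    .fillna({"Value": 0})  # combinations absent from df2 get zero
    .sort_values(by=["ID column", "time column"])
    .reset_index(drop=True)
)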
I have a pandas dataframe with a list of datetimes in it. I want to add 12 hours to any datetime that is not equal to 8am but is still in the morning. For example:
Datetime              A
2022-01-01 08:00:00   10
2022-01-01 09:00:00   10
2022-01-01 12:00:00   10
2022-01-01 24:00:00   10
Should become:
Datetime              A
2022-01-01 08:00:00   10
2022-01-01 21:00:00   10
2022-01-01 12:00:00   10
2022-01-01 24:00:00   10
I can do this by looping through the dataframe one element at a time and doing this conditional check. However, the dataset I am working with is large. Is it possible to do this without looping through the whole dataset, by filtering on this condition? So far I have not managed to find a way!
I just wrote some code. You can use .dt.hour and datetime.timedelta to solve this problem:
import datetime
import pandas as pd

data = """2022-01-01 08:00:00 10
2022-01-01 09:00:00 10
2022-01-01 12:00:00 10
2022-01-01 23:00:00 10"""  # 23:00 stands in for the unparseable 24:00:00
# Split each line into (timestamp, value); rsplit keeps date and time together.
data = [f.rsplit(maxsplit=1) for f in data.split("\n")]
df = pd.DataFrame(data=data, columns=['Datetime', 'A'])
df['Datetime'] = pd.to_datetime(df['Datetime'])
# Morning rows other than 8am: using < 12 leaves noon unchanged, matching the
# expected output above.
mask = (df['Datetime'].dt.hour != 8) & (df['Datetime'].dt.hour < 12)
df.loc[mask, "Datetime"] += datetime.timedelta(hours=12)
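Printing df then shows only the 9am row shifted, as a quick check of the mask above:

print(df)
#              Datetime   A
# 0 2022-01-01 08:00:00  10
# 1 2022-01-01 21:00:00  10
# 2 2022-01-01 12:00:00  10
# 3 2022-01-01 23:00:00  10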
I have a pandas dataframe as follows
df_sample = pd.DataFrame({
    'machine': [1, 1, 1, 2],
    'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
I want to check which of these [ts_start, ts_end] intervals overlap, for the same machine. I have seen some questions about finding overlaps, but none that group by another column; in my case I need to consider the overlaps for each machine separately.
I tried using piso, which seems very interesting:
df_sample['ts_start'] = pd.to_datetime(df_sample['ts_start'])
df_sample['ts_end'] = pd.to_datetime(df_sample['ts_end'])
ii = pd.IntervalIndex.from_arrays(df_sample["ts_start"], df_sample["ts_end"])
df_sample["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
I obtain something like this:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 1
However, it is considering all machines at the same time. Is there a way (using piso or not) to get the overlapping moments, for each machine, in a single dataframe?
piso can indeed be used. It will run fast on large datasets and is not limited by assumptions about the sampling rate of the timestamps. Modify your piso example to wrap the last two lines in a function:
def make_overlaps(df):
    ii = pd.IntervalIndex.from_arrays(df["ts_start"], df["ts_end"])
    df["isOverlap"] = piso.adjacency_matrix(ii).any(axis=1).astype(int).values
    return df
Then group df_sample on the machine column, and apply:
df_sample.groupby("machine").apply(make_overlaps)
This will give you:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
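If your version of pandas appends the machine group key to the result's index, passing group_keys=False keeps the original index; a minimal variant:

df_sample.groupby("machine", group_keys=False).apply(make_overlaps)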
Here's a way to do what your question asks:
import pandas as pd
df_sample = pd.DataFrame({
    'machine': [1, 1, 1, 2],
    'ts_start': ["2022-01-01 20:00:00", "2022-01-01 20:30:00", "2022-01-02 20:30:00", "2022-01-01 19:00:00"],
    'ts_end': ["2022-01-01 21:00:00", "2022-01-01 21:30:00", "2022-01-02 20:35:00", "2022-01-01 23:00:00"]
})
df_sample = df_sample.sort_values(['machine', 'ts_start', 'ts_end'])
print(df_sample)
def foo(x):
    # Walk one machine's intervals in sorted order, tracking the furthest end
    # seen so far ("reach"); an interval overlaps if it starts before that
    # reach. ISO-format timestamp strings compare correctly as strings.
    if len(x.index) > 1:
        iPrev, reachOfPrev = x.index[0], x.loc[x.index[0], 'ts_end']
        x.loc[iPrev, 'isOverlap'] = 0
        for i in x.index[1:]:
            if x.loc[i, 'ts_start'] < reachOfPrev:
                x.loc[iPrev, 'isOverlap'] = 1
                x.loc[i, 'isOverlap'] = 1
            else:
                x.loc[i, 'isOverlap'] = 0
            if x.loc[i, 'ts_end'] > reachOfPrev:
                iPrev, reachOfPrev = i, x.loc[i, 'ts_end']
    else:
        x['isOverlap'] = 0
    x.isOverlap = x.isOverlap.astype(int)
    return x
df_sample = df_sample.groupby('machine').apply(foo)
print(df_sample)
Input:
machine ts_start ts_end
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00
Output:
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
Assuming overlaps only need to be checked at minute resolution, you could try:
# Create date ranges at minute frequency.
df_sample["times"] = df_sample.apply(lambda row: pd.date_range(row["ts_start"], row["ts_end"], freq="1min"), axis=1)
# Explode to get one row per minute.
df_sample = df_sample.explode("times")
# Check whether times overlap by looking for duplicated (machine, minute) pairs.
df_sample["isOverlap"] = df_sample[["machine", "times"]].duplicated(keep=False)
# Group back to the original data structure.
output = df_sample.drop("times", axis=1).groupby(["machine", "ts_start", "ts_end"]).any().astype(int).reset_index()
>>> output
machine ts_start ts_end isOverlap
0 1 2022-01-01 20:00:00 2022-01-01 21:00:00 1
1 1 2022-01-01 20:30:00 2022-01-01 21:30:00 1
2 1 2022-01-02 20:30:00 2022-01-02 20:35:00 0
3 2 2022-01-01 19:00:00 2022-01-01 23:00:00 0
I have a dataframe as such
Date                 Value
2022-01-01 10:00:00  7
2022-01-01 10:30:00  5
2022-01-01 11:00:00  3
....
....
2022-02-15 21:00:00  8
I would like to convert it so that each day is a row and each time of day is a column; the Value column then fills the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot_table but without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis instead.
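For example, removing both labels (a quick usage sketch, reusing the out from above):

out = out.rename_axis(index=None, columns=None)
print(out)
#             10:00:00  10:30:00  11:00:00  21:00:00
# 2022-01-01         7         5         3         0
# 2022-02-15         0         0         0         8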
I was hoping someone could help me with this use case:
I want to generate dates between two dates, tag each date with a week number, then add both the generated dates and the week numbers as new columns to the original dataframe, mapped to user_id.
This is the existing dataframe:

user_id  start_dt    end_dt
1        2022-01-01  2022-02-01
2        2022-01-14  2022-03-14
3        2022-01-05  2022-02-05
4        2022-01-25  2022-02-25
Generating dates between the start and end dates, and tagging each date with its week number:

user_id  date        week_nbr
1        2022-01-01  w1
1        2022-01-02  w1
1        2022-01-03  w1
1        2022-01-04  w1
1        2022-01-05  w1
1        2022-01-06  w1
1        2022-01-07  w1
1        2022-01-08  w2
Finally, map the generated dates and week numbers back to the original table using user_id:

user_id  start_dt    end_dt      date        week_nbr
1        2022-01-01  2022-02-01  2022-01-01  w1
1        2022-01-01  2022-02-01  2022-01-02  w1
1        2022-01-01  2022-02-01  2022-01-03  w1
1        2022-01-01  2022-02-01  2022-01-04  w1
1        2022-01-01  2022-02-01  2022-01-05  w1
1        2022-01-01  2022-02-01  2022-01-06  w1
1        2022-01-01  2022-02-01  2022-01-07  w1
1        2022-01-01  2022-02-01  2022-01-08  w2
Any thoughts?
I believe this should give you what you are looking for:
(df.assign(
    # One row per calendar day between each user's start_dt and end_dt.
    date=[pd.date_range(i, j) for i, j in zip(df['start_dt'], df['end_dt'])])
 .explode('date')
 .assign(week_nbr=lambda x: x.groupby('user_id')['date']
         .diff()                  # 1-day gaps between consecutive dates
         .dt.days
         .cumsum()                # days elapsed since each user's start
         .floordiv(7)             # whole weeks elapsed
         .add(1, fill_value=0)    # the first row's NaN becomes week 1
         .astype(int)
         .map('w{}'.format))
 .reset_index(drop=True))
Output (top 5 rows):
user_id start_dt end_dt date week_nbr
0 1 2022-01-01 2022-02-01 2022-01-01 w1
1 1 2022-01-01 2022-02-01 2022-01-02 w1
2 1 2022-01-01 2022-02-01 2022-01-03 w1
3 1 2022-01-01 2022-02-01 2022-01-04 w1
4 1 2022-01-01 2022-02-01 2022-01-05 w1
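For reference, a minimal alternative sketch that derives the week number directly from each date's offset from the user's own start_dt (the names out and offset are illustrative; to_datetime conversions are needed because the sample columns are strings):

out = df.assign(
    date=[pd.date_range(s, e) for s, e in zip(df['start_dt'], df['end_dt'])]
).explode('date')
# Days since each user's start date, bucketed into 7-day weeks.
offset = (pd.to_datetime(out['date']) - pd.to_datetime(out['start_dt'])).dt.days
out['week_nbr'] = 'w' + (offset // 7 + 1).astype(str)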