I've been searching Google for a couple of hours to solve this problem. I have a column of datetime intervals (Interval) and another column of datetimes (Datetime). For each row in (Datetime), I want to count how many rows of (Interval) contain it and put that count in a new column (Number).
Original column:
Datetime             Interval
2021-12-01 09:00:00  (2021-12-01 08:24:42, 2021-12-01 09:03:41]
2021-12-01 08:01:00  (2021-12-01 08:03:42, 2021-12-01 09:03:41]
2021-12-01 09:03:50  (2021-12-01 09:03:43, 2021-12-01 10:03:42]
2021-12-01 08:03:42  (2021-12-01 08:03:42, 2021-12-01 09:03:41]
2021-12-01 08:00:12  (2021-12-01 08:03:42, 2021-12-01 09:03:41]
2021-12-01 09:03:43  (2021-12-01 09:03:43, 2021-12-01 10:03:42]
2021-12-01 08:03:42  (2021-12-01 09:03:43, 2021-12-01 10:03:42]
What I want:
Datetime             Interval                                    Total
2021-12-01 09:00:00  (2021-12-01 08:24:42, 2021-12-01 09:03:41]  7
2021-12-01 08:01:00  (2021-12-01 08:03:42, 2021-12-01 09:03:41]  0
2021-12-01 09:03:50  (2021-12-01 09:03:43, 2021-12-01 10:03:42]  3
2021-12-01 08:03:42  (2021-12-01 08:03:42, 2021-12-01 09:03:41]  3
2021-12-01 08:00:12  (2021-12-01 08:03:42, 2021-12-01 09:03:41]  0
2021-12-01 09:03:43  (2021-12-01 09:03:43, 2021-12-01 10:03:42]  3
2021-12-01 08:03:42  (2021-12-01 09:03:43, 2021-12-01 10:03:42]  3
Note that the 1st row of the Datetime column falls within every row of the Interval column, bringing its total to 7. The 3rd row of the Datetime column falls within the 3rd, 6th and 7th rows of the Interval column, bringing its total to 3, and so on.
This is my attempt in Python:
for ii in dt.index:
    if dt['Datetime'].notnull():
        dt['Total'][ii] = sum(dt['Datetime'] in (dt['Interval']))
You could iterate over each row, and within that iterate over intervals:
intervals = df['Interval'].values
df['Total'] = [sum([x in y for y in intervals]) for x in df['Datetime']]
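To try the approach above without the original data, here is a minimal self-contained sketch. The three sample rows and the use of right-closed pd.Interval objects are assumptions made for illustration; the counting line is the same list comprehension as in the answer.

```python
import pandas as pd

# Hypothetical sample data: three timestamps and three right-closed intervals.
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2021-12-01 09:00:00",
        "2021-12-01 08:01:00",
        "2021-12-01 09:03:50",
    ]),
    "Interval": [
        pd.Interval(pd.Timestamp("2021-12-01 08:24:42"), pd.Timestamp("2021-12-01 09:03:41")),
        pd.Interval(pd.Timestamp("2021-12-01 08:03:42"), pd.Timestamp("2021-12-01 09:03:41")),
        pd.Interval(pd.Timestamp("2021-12-01 09:03:43"), pd.Timestamp("2021-12-01 10:03:42")),
    ],
})

# For every timestamp, count how many intervals contain it.
intervals = df["Interval"].values
df["Total"] = [sum(x in y for y in intervals) for x in df["Datetime"]]

print(df["Total"].tolist())  # [2, 0, 1]
```

This compares every timestamp with every interval, so the cost grows with the square of the row count; that is fine for small frames but may be slow for large ones.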
Here is my code and the datetime column.
import pandas as pd
xcel_file=pd.read_excel('data.xlsx',usecols=['datetime'])
date=[]
time=[]
date.append((xcel_file['datetime']).dt.date)
time.append((xcel_file['datetime']).dt.time)
new_file=pd.DataFrame({'a':len(xcel_file['datetime'])},index=xcel_file['datetime'])
day=new_file.between_time('9:00','16:00')
day.reset_index(inplace=True)
day=day.drop(columns={'a'})
day['time']=pd.to_datetime(day['datetime']).dt.date
model_list=day['time'].drop_duplicates()
data_set=[]
i=0
for n in day['datetime']:
    data_2=max(day['datetime'][day['time']==model_list[i]])
    i+=1
    data_set.append(data_2)
datetime column
0 2022-01-10 09:30:00
1 2022-01-10 10:30:00
2 2022-01-11 10:30:00
3 2022-01-11 15:30:00
4 2022-01-11 11:00:00
5 2022-01-11 12:00:00
6 2022-01-12 13:00:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
10 2022-01-14 16:00:00
11 2022-01-14 16:30:00
expected result
1 2022-01-10 10:30:00
3 2022-01-11 15:30:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
I'm trying to get the max value for each date from the datetime column, restricted to times between 9am and 4pm.
Is there any way of doing this? Truly thankful for any kind of help.
Use DataFrame.between_time, then aggregate by day with Grouper to get the maximal datetimes:
df = pd.read_excel('data.xlsx',usecols=['datetime'])
df = df.set_index('datetime', drop=False)
df = (df.between_time('9:00','16:00')
.groupby(pd.Grouper(freq='d'))[['datetime']]
.max()
.reset_index(drop=True))
print (df)
datetime
0 2022-01-10 10:30:00
1 2022-01-11 15:30:00
2 2022-01-12 15:30:00
3 2022-01-13 14:00:00
4 2022-01-14 16:00:00
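For anyone without the Excel file, the same pipeline can be sketched on an in-memory frame. The sample timestamps below are a subset of the question's data, and the dropna() call is an addition here because the sample skips some days.

```python
import pandas as pd

# Subset of the question's timestamps, built in memory instead of read from Excel.
df = pd.DataFrame({"datetime": pd.to_datetime([
    "2022-01-10 09:30:00", "2022-01-10 10:30:00",
    "2022-01-11 10:30:00", "2022-01-11 15:30:00",
    "2022-01-14 15:00:00", "2022-01-14 16:00:00", "2022-01-14 16:30:00",
])})

out = (df.set_index("datetime", drop=False)       # between_time needs a DatetimeIndex
         .between_time("9:00", "16:00")           # keep rows from 09:00 to 16:00
         .groupby(pd.Grouper(freq="D"))[["datetime"]]
         .max()                                   # latest timestamp per calendar day
         .dropna()                                # drop days with no rows in the window
         .reset_index(drop=True))

print(out)
```

The row for 2022-01-14 comes out as 16:00:00 because between_time includes both endpoints by default. If the end time should be excluded, recent pandas versions accept an inclusive argument on between_time; check the documentation for the version in use.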
EDIT: If some day has no rows inside the time window, the Grouper adds a missing value for that day, so DataFrame.dropna solves this problem.
print (df)
datetime
0 2022-01-10 17:40:00
1 2022-01-10 19:30:00
2 2022-01-11 19:30:00
3 2022-01-11 15:30:00
4 2022-01-12 19:30:00
5 2022-01-12 15:30:00
6 2022-01-14 18:30:00
7 2022-01-14 16:30:00
df = df.set_index('datetime', drop=False)
df = (df.between_time('17:00','19:30')
.groupby(pd.Grouper(freq='d'))[['datetime']]
.max()
.dropna()
.reset_index(drop=True))
print (df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
Added alternative solution:
df = df.set_index('datetime', drop=False)
df = (df.between_time('17:00','19:30')
.sort_index()
.assign(d = lambda x: x['datetime'].dt.date)
.drop_duplicates('d', keep='last')
.drop('d', axis=1)
.reset_index(drop=True)
)
print (df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
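The alternative can be reproduced with a small in-memory frame. The sample rows below are a subset of the data printed earlier; the variable names in_window and out are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"datetime": pd.to_datetime([
    "2022-01-10 17:40:00", "2022-01-10 19:30:00",
    "2022-01-11 19:30:00", "2022-01-11 15:30:00",
    "2022-01-14 18:30:00", "2022-01-14 16:30:00",
])})

# Keep only rows whose time of day lies between 17:00 and 19:30.
in_window = df.set_index("datetime", drop=False).between_time("17:00", "19:30")

out = (in_window.sort_index()                               # oldest to newest
                .assign(d=lambda x: x["datetime"].dt.date)  # helper column: calendar date
                .drop_duplicates("d", keep="last")          # last (latest) row per date
                .drop(columns="d")
                .reset_index(drop=True))

print(out)
```

Because no per-day bins are created here, days without matching rows simply do not appear, so no dropna step is needed.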
EDIT: Solution that filters first by the datetime column, then by datetime2, and finally keeps the last row per date of the datetime column:
print (df)
datetime datetime2
0 2022-01-10 09:30:00 2022-01-10 17:40:00
1 2022-01-10 10:30:00 2022-01-10 19:30:00
2 2022-01-11 10:30:00 2022-01-11 19:30:00
3 2022-01-11 15:30:00 2022-01-11 15:30:00
4 2022-01-11 11:00:00 2022-01-12 15:30:00
5 2022-01-11 12:00:00 2022-01-14 18:30:00
6 2022-01-12 13:00:00 2022-01-14 16:30:00
7 2022-01-12 15:30:00 2022-01-14 17:30:00
df = (df.set_index('datetime', drop=False)
.between_time('9:00','16:00')
.sort_index()
.set_index('datetime2', drop=False)
.between_time('17:00','19:30')
.assign(d = lambda x: x['datetime'].dt.date)
.drop_duplicates('d', keep='last')
.drop('d', axis=1)
.reset_index(drop=True)
)
print (df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 12:00:00 2022-01-14 18:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00
If the rows are deduplicated by the dates of datetime2 instead, the output is different:
df = (df.set_index('datetime', drop=False)
.between_time('9:00','16:00')
.sort_index()
.set_index('datetime2', drop=False)
.between_time('17:00','19:30')
.assign(d = lambda x: x['datetime2'].dt.date)
.drop_duplicates('d', keep='last')
.drop('d', axis=1)
.reset_index(drop=True)
)
print (df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 10:30:00 2022-01-11 19:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00
I have a Pandas dataframe that looks like this :
# date
--- -------------------
0 2022-01-01 08:00:00
1 2022-01-01 08:01:00
2 2022-01-01 08:52:00
My goal is to add a new column that contains a datetime object with the value of the next hour. I looked at the documentation of the ceil function, and it works pretty well in most cases.
Issue
The problem concerns hours that are perfectly round (like the one at #0) :
df["next"] = (df["date"]).dt.ceil("H")
# date next
--- ------------------- -------------------
0 2022-01-01 08:00:00 2022-01-01 08:00:00 <--- wrong, expected 09:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00 <--- correct
2 2022-01-01 08:52:00 2022-01-01 09:00:00 <--- correct
Sub-optimal solution
I have come up with the following workaround, but I find it really clumsy :
def nextHour(current):
    return pd.date_range(start=current, periods=2, freq="H")[1]
df["next"] = (df["date"]).apply(lambda x: nextHour(x))
I have around 1-2 million rows in my dataset and I find this solution extremely slow compared to the native dt.ceil(). Is there a better way of doing it ?
This is the way ceil works, it won't jump to the next hour.
What you want seems more like a floor + 1h using pandas.Timedelta:
df['next'] = df['date'].dt.floor('H')+pd.Timedelta('1h')
output:
date next
0 2022-01-01 08:00:00 2022-01-01 09:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00
2 2022-01-01 08:52:00 2022-01-01 09:00:00
difference of bounds behavior between floor and ceil:
date ceil floor
0 2022-01-01 08:00:00 2022-01-01 08:00:00 2022-01-01 08:00:00
1 2022-01-01 08:01:00 2022-01-01 09:00:00 2022-01-01 08:00:00
2 2022-01-01 08:52:00 2022-01-01 09:00:00 2022-01-01 08:00:00
3 2022-01-01 09:00:00 2022-01-01 09:00:00 2022-01-01 09:00:00
4 2022-01-01 09:01:00 2022-01-01 10:00:00 2022-01-01 09:00:00
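A self-contained sketch of the floor + 1h idea, using the three rows from the question. The lowercase "h" alias is a choice made here, not part of the answer.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2022-01-01 08:00:00",
    "2022-01-01 08:01:00",
    "2022-01-01 08:52:00",
])})

# Round down to the start of the hour, then add one hour.
# This is vectorized, so it avoids the per-row apply from the question.
df["next"] = df["date"].dt.floor("h") + pd.Timedelta(hours=1)

print(df)
```

All three rows, including the exact 08:00:00, map to 2022-01-01 09:00:00.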
I have a table as below and want to fill down the Stage within the same Category based on a condition:
if Stage = "Delivered", fill "Delivered" down to all the following rows; else if Stage = "Paid", fill "Paid" down to all the following rows.
Category  Date        Stage
A         2021-11-01  Ordered
A         2021-12-01  Paid
A         2022-01-01
B         2021-08-01
B         2021-09-01  Ordered
B         2021-10-01  Paid
B         2021-11-01  Ordered
B         2021-12-01  Delivered
The result should look like:
Category  Date        Stage
A         2021-11-01  Ordered
A         2021-12-01  Paid
A         2022-01-01  Paid
B         2021-08-01
B         2021-09-01  Ordered
B         2021-10-01  Paid
B         2021-11-01  Paid
B         2021-12-01  Delivered
Could anyone help? I would really appreciate it!
You can use mask and combine_first:
Assuming your dataframe is already sorted by Date column.
df['Stage'] = df['Stage'].mask(~df['Stage'].isin(['Paid', 'Delivered'])) \
.groupby(df['Category']).ffill() \
.combine_first(df['Stage'])
print(df)
# Output
Category Date Stage
0 A 2021-11-01 Ordered
1 A 2021-12-01 Paid
2 A 2022-01-01 Paid
3 B 2021-08-01
4 B 2021-09-01 Ordered
5 B 2021-10-01 Paid
6 B 2021-11-01 Paid
7 B 2021-12-01 Delivered
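The same three steps can be run on a frame rebuilt from the question's table. Representing the empty Stage cells as None (missing values) is an assumption; the intermediate names keep and filled are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "A", "A", "B", "B", "B", "B", "B"],
    "Date": pd.to_datetime([
        "2021-11-01", "2021-12-01", "2022-01-01",
        "2021-08-01", "2021-09-01", "2021-10-01", "2021-11-01", "2021-12-01",
    ]),
    # Assumption: the blank cells in the question are missing values.
    "Stage": ["Ordered", "Paid", None, None, "Ordered", "Paid", "Ordered", "Delivered"],
})

# 1. Blank out every stage that should not be propagated.
keep = df["Stage"].isin(["Paid", "Delivered"])
# 2. Forward-fill the remaining stages within each category.
filled = df["Stage"].mask(~keep).groupby(df["Category"]).ffill()
# 3. Where nothing was filled, fall back to the original value.
df["Stage"] = filled.combine_first(df["Stage"])

print(df)
```

Row 3 (category B, 2021-08-01) stays empty because no "Paid" or "Delivered" row precedes it within its category.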
I have a dataset that shows who is booking which room at which timing and it looks like this.
email room Start Date End Date
abc#corp.com L11M2 2021-02-01 08:00:00 2021-02-01 11:00:00
xyz#corp.com L12M4 2021-02-01 08:00:00 2021-02-01 10:00:00
I want to split this up into different hours such that one row only contains one hour of data.
This is the dataframe that I want to get.
email room Start Date End Date
abc#corp.com L11M2 2021-02-01 08:00:00 2021-02-01 09:00:00
abc#corp.com L11M2 2021-02-01 09:00:00 2021-02-01 10:00:00
abc#corp.com L11M2 2021-02-01 10:00:00 2021-02-01 11:00:00
xyz#corp.com L12M4 2021-02-01 08:00:00 2021-02-01 09:00:00
xyz#corp.com L12M4 2021-02-01 09:00:00 2021-02-01 10:00:00
Is there any way that I can do this in python?
Here is a simple solution using pandas.date_range and explode:
df['Start Date'] = df.apply(lambda d: pd.date_range(d['Start Date'],
d['End Date'],
freq='h')[:-1],
axis=1)
df = df.explode('Start Date')
df['End Date'] = df['Start Date'] + pd.Timedelta('1h')
output:
email room Start Date End Date
0 abc#corp.com L11M2 2021-02-01 08:00:00 2021-02-01 09:00:00
0 abc#corp.com L11M2 2021-02-01 09:00:00 2021-02-01 10:00:00
0 abc#corp.com L11M2 2021-02-01 10:00:00 2021-02-01 11:00:00
1 xyz#corp.com L12M4 2021-02-01 08:00:00 2021-02-01 09:00:00
1 xyz#corp.com L12M4 2021-02-01 09:00:00 2021-02-01 10:00:00
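Here is the same approach as a runnable sketch with the question's two bookings built in memory. The pd.to_datetime call after explode and the final reset_index are additions made here: explode leaves the column with object dtype, and the repeated index labels are often not wanted.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["abc#corp.com", "xyz#corp.com"],
    "room": ["L11M2", "L12M4"],
    "Start Date": pd.to_datetime(["2021-02-01 08:00:00", "2021-02-01 08:00:00"]),
    "End Date": pd.to_datetime(["2021-02-01 11:00:00", "2021-02-01 10:00:00"]),
})

# One hourly range per booking; [:-1] drops the end time so it is not a start.
df["Start Date"] = df.apply(
    lambda d: pd.date_range(d["Start Date"], d["End Date"], freq="h")[:-1],
    axis=1,
)

# One row per hour, then restore a proper datetime dtype.
out = df.explode("Start Date")
out["Start Date"] = pd.to_datetime(out["Start Date"])
out["End Date"] = out["Start Date"] + pd.Timedelta(hours=1)
out = out.reset_index(drop=True)

print(out)
```

A booking whose start equals its end would produce an empty range; how explode should treat such rows depends on the data, so it is worth checking before relying on this.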
A combination of pandas melt, with pyjanitor's complete could help transform the data:
# pip install pyjanitor
import pandas as pd
import janitor
(df.melt(['email', 'room'], value_name = 'Start_Date')
.reindex([3,1,2,0])
# complete is a wrapper around pandas functions
# to expose missing values ... in this case it exposes the
# missing dates for each group in by
.complete([{'Start_Date':lambda df: pd.date_range(df.min(), df.max(),freq='H')}],
by=['email', 'room'])
.assign(End_Date = lambda df: df.Start_Date.add(pd.Timedelta('1 hour')))
.query('variable != "End Date"').drop(columns='variable'))
email room Start_Date End_Date
0 abc#corp.com L11M2 2021-02-01 08:00:00 2021-02-01 09:00:00
1 abc#corp.com L11M2 2021-02-01 09:00:00 2021-02-01 10:00:00
2 abc#corp.com L11M2 2021-02-01 10:00:00 2021-02-01 11:00:00
4 xyz#corp.com L12M4 2021-02-01 08:00:00 2021-02-01 09:00:00
5 xyz#corp.com L12M4 2021-02-01 09:00:00 2021-02-01 10:00:00
Let's create some sample data
import pandas as pd
from datetime import datetime, timedelta
ref = datetime.now().replace(minute=0, second=0, microsecond=0)
def shifted(i): return ref + timedelta(hours=i)
df = pd.DataFrame([
    ('A', 'B', shifted(1), shifted(10)),
    ('C', 'D', shifted(-5), shifted(-1))],
    columns=['name', 'email', 'start', 'end'])
The data looks like this
name email start end
0 A B 2021-08-27 12:00:00 2021-08-27 21:00:00
1 C D 2021-08-27 06:00:00 2021-08-27 10:00:00
You can split up each row with the apply function, making sure you return a pd.Series.
new_start = df.apply(lambda row: pd.Series(pd.date_range(row.start, row.end, freq='H')), axis=1).stack()
After this, new_start holds the start of every hour with a two-level index: the first level is the original row index and the second is the position of the hour within that row's block, which can be useful as well.
0 0 2021-08-27 12:00:00
1 2021-08-27 13:00:00
2 2021-08-27 14:00:00
3 2021-08-27 15:00:00
4 2021-08-27 16:00:00
5 2021-08-27 17:00:00
6 2021-08-27 18:00:00
7 2021-08-27 19:00:00
8 2021-08-27 20:00:00
9 2021-08-27 21:00:00
1 0 2021-08-27 06:00:00
1 2021-08-27 07:00:00
2 2021-08-27 08:00:00
3 2021-08-27 09:00:00
4 2021-08-27 10:00:00
dtype: datetime64[ns]
Now just join this to the original frame.
res = df[["name", "email"]].join(
new_start.reset_index(1, drop=True).rename("start"))
And you can add the end column like this
res["end"] = res.start + timedelta(hours=1)
I've got a dataframe with strange hourly timestamps: it has both 00:00:00 and 24:00:00. It looks as follows:
TIMESTAMP RECORD
2 2021-08-01 00:01:00 85878
3 2021-08-01 00:02:00 85879
4 2021-08-01 00:03:00 85880
5 2021-08-01 00:04:00 85881
6 2021-08-01 00:05:00 85882
... ... ...
1437 2021-08-01 23:56:00 87313
1438 2021-08-01 23:57:00 87314
1439 2021-08-01 23:58:00 87315
1440 2021-08-01 23:59:00 87316
1441 2021-08-01 24:00:00 87317
What I would like to do is: if the hour is 24, change it to 00 and move the date to the next day.
I've tried replacing all the 24 hours with replace, but I can't get it to work, and it would only tackle the hour issue anyway. The code I've tried is as follows:
data['TIMESTAMP'][10:11]=data['TIMESTAMP'][10:11].str.replace("24","00", case= False)
Split the date and time, parse the date to datetime, and add the time as a timedelta:
import pandas as pd
# split date and time
date_time = df['TIMESTAMP'].str.split(' ', expand=True)
# parse date to datetime and time to timedelta and combine
df['TIMESTAMP'] = pd.to_datetime(date_time[0]) + pd.to_timedelta(date_time[1])
df['TIMESTAMP']
0 2021-08-01 00:01:00
1 2021-08-01 00:02:00
2 2021-08-01 00:03:00
3 2021-08-01 00:04:00
4 2021-08-01 00:05:00
5 2021-08-01 23:56:00
6 2021-08-01 23:57:00
7 2021-08-01 23:58:00
8 2021-08-01 23:59:00
9 2021-08-02 00:00:00
Name: TIMESTAMP, dtype: datetime64[ns]
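A compact runnable sketch of the split-and-add idea. The three sample strings are made up for illustration, including a month-end value to show that the rollover also crosses month boundaries.

```python
import pandas as pd

# Hypothetical sample: one normal time, and two "24:00:00" values.
df = pd.DataFrame({"TIMESTAMP": [
    "2021-08-01 23:59:00",
    "2021-08-01 24:00:00",
    "2021-08-31 24:00:00",
]})

# Column 0 holds the date part, column 1 the time part.
parts = df["TIMESTAMP"].str.split(" ", expand=True)

# "24:00:00" parsed as a timedelta equals one full day,
# so adding it to the date rolls over to midnight of the next day.
df["TIMESTAMP"] = pd.to_datetime(parts[0]) + pd.to_timedelta(parts[1])

print(df["TIMESTAMP"].tolist())
```

The last row becomes 2021-09-01 00:00:00, which a plain string replace of "24" with "00" could not produce.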