Get the max value of dates in Pandas - python

Here is my code and the datetime column.
import pandas as pd

xcel_file = pd.read_excel('data.xlsx', usecols=['datetime'])

date = []
time = []
date.append(xcel_file['datetime'].dt.date)
time.append(xcel_file['datetime'].dt.time)

new_file = pd.DataFrame({'a': len(xcel_file['datetime'])}, index=xcel_file['datetime'])
day = new_file.between_time('9:00', '16:00')
day.reset_index(inplace=True)
day = day.drop(columns=['a'])
day['time'] = pd.to_datetime(day['datetime']).dt.date
model_list = day['time'].drop_duplicates()

data_set = []
i = 0
for n in day['datetime']:
    data_2 = max(day['datetime'][day['time'] == model_list[i]])
    i += 1
    data_set.append(data_2)
datetime column
0 2022-01-10 09:30:00
1 2022-01-10 10:30:00
2 2022-01-11 10:30:00
3 2022-01-11 15:30:00
4 2022-01-11 11:00:00
5 2022-01-11 12:00:00
6 2022-01-12 13:00:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
10 2022-01-14 16:00:00
11 2022-01-14 16:30:00
expected result
1 2022-01-10 10:30:00
3 2022-01-11 15:30:00
7 2022-01-12 15:30:00
8 2022-01-13 14:00:00
9 2022-01-14 15:00:00
I'm trying to get the max value for each date in the datetime column, considering only times between 9am and 4pm.
Is there any way of doing this? Truly thankful for any kind of help.

Use DataFrame.between_time, then aggregate by day with pd.Grouper to get the maximal datetimes:
df = pd.read_excel('data.xlsx', usecols=['datetime'])
df = df.set_index('datetime', drop=False)

df = (df.between_time('9:00', '16:00')
        .groupby(pd.Grouper(freq='d'))[['datetime']]
        .max()
        .reset_index(drop=True))
print(df)
datetime
0 2022-01-10 10:30:00
1 2022-01-11 15:30:00
2 2022-01-12 15:30:00
3 2022-01-13 14:00:00
4 2022-01-14 16:00:00
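Note that between_time includes both endpoints by default, which is why 16:00:00 is kept for 2022-01-14 even though the question's expected result shows 15:00:00. A sketch, assuming pandas 1.4+ where between_time accepts the inclusive parameter, that drops the right endpoint:
df = pd.read_excel('data.xlsx', usecols=['datetime'])
df = df.set_index('datetime', drop=False)

# inclusive='left' keeps rows at 9:00 but drops rows at exactly 16:00,
# so the maximum for 2022-01-14 becomes 15:00:00.
df = (df.between_time('9:00', '16:00', inclusive='left')
        .groupby(pd.Grouper(freq='d'))[['datetime']]
        .max()
        .reset_index(drop=True))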
EDIT: If some days have no rows inside the selected time window, the aggregation produces missing values, so DataFrame.dropna solves this problem:
print(df)
datetime
0 2022-01-10 17:40:00
1 2022-01-10 19:30:00
2 2022-01-11 19:30:00
3 2022-01-11 15:30:00
4 2022-01-12 19:30:00
5 2022-01-12 15:30:00
6 2022-01-14 18:30:00
7 2022-01-14 16:30:00
df = df.set_index('datetime', drop=False)

df = (df.between_time('17:00', '19:30')
        .groupby(pd.Grouper(freq='d'))[['datetime']]
        .max()
        .dropna()
        .reset_index(drop=True))
print(df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
Added alternative solution:
df = df.set_index('datetime', drop=False)

df = (df.between_time('17:00', '19:30')
        .sort_index()
        .assign(d=lambda x: x['datetime'].dt.date)
        .drop_duplicates('d', keep='last')
        .drop('d', axis=1)
        .reset_index(drop=True))
print(df)
datetime
0 2022-01-10 19:30:00
1 2022-01-11 19:30:00
2 2022-01-12 19:30:00
3 2022-01-14 18:30:00
EDIT: a solution that filters first by the datetime column, then by datetime2, and finally drops duplicate dates taken from the datetime column:
print(df)
datetime datetime2
0 2022-01-10 09:30:00 2022-01-10 17:40:00
1 2022-01-10 10:30:00 2022-01-10 19:30:00
2 2022-01-11 10:30:00 2022-01-11 19:30:00
3 2022-01-11 15:30:00 2022-01-11 15:30:00
4 2022-01-11 11:00:00 2022-01-12 15:30:00
5 2022-01-11 12:00:00 2022-01-14 18:30:00
6 2022-01-12 13:00:00 2022-01-14 16:30:00
7 2022-01-12 15:30:00 2022-01-14 17:30:00
df = (df.set_index('datetime', drop=False)
        .between_time('9:00', '16:00')
        .sort_index()
        .set_index('datetime2', drop=False)
        .between_time('17:00', '19:30')
        .assign(d=lambda x: x['datetime'].dt.date)
        .drop_duplicates('d', keep='last')
        .drop('d', axis=1)
        .reset_index(drop=True))
print(df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 12:00:00 2022-01-14 18:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00
If filtering dates by datetime2 instead, the output is different:
df = (df.set_index('datetime', drop=False)
        .between_time('9:00', '16:00')
        .sort_index()
        .set_index('datetime2', drop=False)
        .between_time('17:00', '19:30')
        .assign(d=lambda x: x['datetime2'].dt.date)
        .drop_duplicates('d', keep='last')
        .drop('d', axis=1)
        .reset_index(drop=True))
print(df)
datetime datetime2
0 2022-01-10 10:30:00 2022-01-10 19:30:00
1 2022-01-11 10:30:00 2022-01-11 19:30:00
2 2022-01-12 15:30:00 2022-01-14 17:30:00
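For the original single-column question, a transform-based variant is also possible. This is only a sketch, assuming df again holds just the datetime column from the question:
# Keep each row whose datetime equals the maximum of its calendar day.
s = df.set_index('datetime', drop=False).between_time('9:00', '16:00')['datetime']
out = s[s == s.groupby(s.dt.date).transform('max')].reset_index(drop=True)
print(out)
This avoids the helper column d at the cost of computing a per-group transform.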

Related

Group consecutive rises and falls using Pandas Series

I want to group consecutive rises and falls in a pandas Series. I have tried this, but it doesn't seem to work:
consec_rises = self.df_dataset.diff().cumsum()
group_consec = consec_rises.groupby(consec_rises)
My dataset:
date
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
2022-01-11 31.791339
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
I want to get a result like the following:
Group #1 (consecutive growth)
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
Group #2 (consecutive fall)
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
If I understand you correctly:
mask = df["date"].diff().bfill() >= 0
for _, g in df.groupby((mask != mask.shift(1)).cumsum()):
print(g)
print("-" * 80)
Prints:
date
2022-01-07 25.847718
2022-01-08 29.310294
2022-01-09 31.791339
2022-01-10 33.382136
--------------------------------------------------------------------------------
date
2022-01-11 31.791339
2022-01-12 29.310294
2022-01-13 25.847718
2022-01-14 21.523483
2022-01-15 16.691068
2022-01-16 11.858653
2022-01-17 7.534418
--------------------------------------------------------------------------------
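To also label each run as growth or fall, as in the expected output, the same mask can be reused; a minimal sketch, assuming the value column is df["date"] as above:
mask = df["date"].diff().bfill() >= 0        # True while values are rising
groups = (mask != mask.shift(1)).cumsum()    # new id at each direction change
for gid, g in df.groupby(groups):
    direction = "growth" if mask.loc[g.index[0]] else "fall"
    print(f"Group #{gid} (consecutive {direction})")
    print(g)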

Measure difference between timestamps using conditions - python

I'm trying to measure the difference between timestamps using certain conditions. Using the data below, for each unique ID, I'm hoping to subtract the End Time where Item == A from the Start Time where Item == D.
So the timestamps are actually located on separate rows.
At the moment my process is returning an error. I'm also hoping to drop the .shift() for something more robust, as each unique ID will have different combinations, for example A,B,C,D - A,B,D - A,D etc.
df = pd.DataFrame({'ID': [10, 10, 10, 20, 20, 30],
                   'Start Time': ['2019-08-02 09:00:00', '2019-08-03 10:50:00', '2019-08-05 16:00:00', '2019-08-04 08:00:00', '2019-08-04 15:30:00', '2019-08-06 11:00:00'],
                   'End Time': ['2019-08-04 15:00:00', '2019-08-04 16:00:00', '2019-08-05 16:00:00', '2019-08-04 14:00:00', '2019-08-05 20:30:00', '2019-08-07 10:00:00'],
                   'Item': ['A', 'B', 'D', 'A', 'D', 'A'],
                   })

df['Start Time'] = pd.to_datetime(df['Start Time'])
df['End Time'] = pd.to_datetime(df['End Time'])

df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                .reset_index(drop=True))
Intended Output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-04 15:00:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-04 16:00:00 B NaT
2 10 2019-08-05 16:00:00 2019-08-05 16:00:00 D 1 days 01:00:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 A NaT
4 20 2019-08-04 15:30:00 2019-08-05 20:30:00 D 0 days 01:30:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
df2 = df.set_index('ID')
df2.query('Item == "D"')['Start Time'] - df2.query('Item == "A"')['End Time']
output:
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
older answer
The issue is your fillna: you can't have strings in a timedelta column:
df['diff'] = (df.groupby('ID')
                .apply(lambda x: x['End Time'].shift(1) - x['Start Time'].shift(1))
                #.fillna('-') # the issue is here
                .reset_index(drop=True))
output:
ID Start Time End Time Item diff
0 10 2019-08-02 09:00:00 2019-08-02 09:30:00 A NaT
1 10 2019-08-03 10:50:00 2019-08-03 11:00:00 B 0 days 00:30:00
2 10 2019-08-04 15:00:00 2019-08-05 16:00:00 C 0 days 00:10:00
3 20 2019-08-04 08:00:00 2019-08-04 14:00:00 B NaT
4 20 2019-08-05 10:30:00 2019-08-05 20:30:00 C 0 days 06:00:00
5 30 2019-08-06 11:00:00 2019-08-07 10:00:00 A NaT
IIUC use:
df1 = df.pivot(index='ID', columns='Item')
print(df1)
Start Time \
Item A B D
ID
10 2019-08-02 09:00:00 2019-08-03 10:50:00 2019-08-04 15:00:00
20 2019-08-04 08:00:00 NaT 2019-08-05 10:30:00
30 2019-08-06 11:00:00 NaT NaT
End Time
Item A B D
ID
10 2019-08-02 09:30:00 2019-08-03 11:00:00 2019-08-05 16:00:00
20 2019-08-04 14:00:00 NaT 2019-08-05 20:30:00
30 2019-08-07 10:00:00 NaT NaT
a = df1[('Start Time','D')].sub(df1[('End Time','A')])
print(a)
ID
10 2 days 05:30:00
20 0 days 20:30:00
30 NaT
dtype: timedelta64[ns]
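If the per-ID result should be attached back onto the D rows, as in the intended output's layout, one possible sketch (assuming df and the per-ID series a from above):
# Map each row's ID to its timedelta, keeping it only on "D" rows.
df['diff'] = df['ID'].map(a).where(df['Item'] == 'D')
print(df)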

Pandas fill missing Time-Series data. Only if more than one day is missing

I have two time series with different frequencies and would like to fill in values using the lower-frequency data.
Here is what I mean; hope it is clear this way:
from datetime import datetime

index = [datetime(2022, 1, 10, 1),
         datetime(2022, 1, 10, 2),
         datetime(2022, 1, 12, 7),
         datetime(2022, 1, 14, 12)]
df1 = pd.DataFrame([1, 2, 3, 4], index=index)
2022-01-10 01:00:00 1
2022-01-10 02:00:00 2
2022-01-12 07:00:00 3
2022-01-14 12:00:00 4
index = pd.date_range(start=datetime(2022, 1, 9),
                      end=datetime(2022, 1, 15),
                      freq='D')
df2 = pd.DataFrame([n + 99 for n in range(len(index))], index=index)
2022-01-09 99
2022-01-10 100
2022-01-11 101
2022-01-12 102
2022-01-13 103
2022-01-14 104
2022-01-15 105
The final df should only fill in values for days that are entirely missing from df1. So the result should be:
2022-01-09 00:00:00 99
2022-01-10 01:00:00 1
2022-01-10 02:00:00 2
2022-01-11 00:00:00 101
2022-01-12 07:00:00 3
2022-01-13 00:00:00 103
2022-01-14 12:00:00 4
2022-01-15 00:00:00 105
Any idea how to do this?
You can filter df2 to keep only the new dates and concat to df1:
import numpy as np
idx1 = pd.to_datetime(df1.index).date
idx2 = pd.to_datetime(df2.index).date
df3 = pd.concat([df1, df2[~np.isin(idx2, idx1)]]).sort_index()
Output:
0
2022-01-09 00:00:00 99
2022-01-10 01:00:00 1
2022-01-10 02:00:00 2
2022-01-11 00:00:00 101
2022-01-12 07:00:00 3
2022-01-13 00:00:00 103
2022-01-14 12:00:00 4
2022-01-15 00:00:00 105
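An equivalent sketch using DatetimeIndex.normalize() instead of converting to date objects (assuming df1 and df2 as defined above):
# Keep only df2 rows whose calendar day is absent from df1, then concat.
new_days = ~df2.index.normalize().isin(df1.index.normalize())
df3 = pd.concat([df1, df2[new_days]]).sort_index()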

Set the time to 0 where a specific column has a value - python

I have a dataset with three inputs X1, X2, X3, including date and time.
The X3 column contains the values 0 and 5. What I want to code: the time of the first 5 value in the X3 column should be taken as the start time and treated as time 0.
The other times where X3 contains 5 should not change; I only want the first such time of every day set as time 0.
date time x3
10/3/2018 6:15:00 0
10/3/2018 6:45:00 5
10/3/2018 7:45:00 0
10/3/2018 9:00:00 0
10/3/2018 9:25:00 0
10/3/2018 9:30:00 0
10/3/2018 11:00:00 0
10/3/2018 11:30:00 0
10/3/2018 13:30:00 0
10/3/2018 13:50:00 5
10/3/2018 15:00:00 0
10/3/2018 15:25:00 0
10/3/2018 16:25:00 0
10/3/2018 18:00:00 0
10/3/2018 19:00:00 0
10/3/2018 19:30:00 0
10/3/2018 20:00:00 0
10/3/2018 22:05:00 0
10/3/2018 22:15:00 5
10/3/2018 23:40:00 0
10/4/2018 6:58:00 5
10/4/2018 13:00:00 0
10/4/2018 16:00:00 0
10/4/2018 17:00:00 0
As you can see, the X3 column holds values 0 and 5, along with date and time.
Only the first 5 value of each day is taken.
desired output
10/3/2018 6:45:00 5 start time 6:45:00 convert 00:00:00
10/3/2018 13:50:00 5 Not taking
10/3/2018 22:15:00 5 Not taking
10/4/2018 6:58:00 5 start time 6:58:00 convert 00:00:00
I just want to code it like this. Can anyone help me solve this problem?
When I used the code below, it computed the time difference for each row. I don't want the difference for every row; I just want to read the start time and have it converted to time 0.
I tried this code, and it also gave the time difference of each row:
df['time_diff'] = pd.to_datetime(df['date'] + " " + df['time'],
                                 format='%d/%m/%Y %H:%M:%S', dayfirst=True)
mask = df['x3'].ne(0)
df['Duration'] = df[mask].groupby(['date', 'x3'])['time_diff'].transform('first')
df['Duration'] = df['time_diff'].sub(df['Duration']).dt.total_seconds().div(3600)
This gave me the duration for each of the 5 values.
Here is what I exactly want:
To filter only the first 5 value per group, add DataFrame.drop_duplicates:
df['time_diff'] = pd.to_datetime(df['date'] + " " + df['time'],
                                 format='%d/%m/%Y %H:%M:%S', dayfirst=True)
mask = df['x3'].eq(5)
df['Duration'] = (df[mask].drop_duplicates(['date', 'x3'])
                          .groupby(['date', 'x3'])['time_diff']
                          .transform('first'))
df['Duration'] = df['time_diff'].sub(df['Duration']).dt.total_seconds().div(3600)
print(df)
date time x3 time_diff Duration
0 10/3/2018 6:15:00 0 2018-03-10 06:15:00 NaN
1 10/3/2018 6:45:00 5 2018-03-10 06:45:00 0.0
2 10/3/2018 7:45:00 0 2018-03-10 07:45:00 NaN
3 10/3/2018 9:00:00 0 2018-03-10 09:00:00 NaN
4 10/3/2018 9:25:00 0 2018-03-10 09:25:00 NaN
5 10/3/2018 9:30:00 0 2018-03-10 09:30:00 NaN
6 10/3/2018 11:00:00 0 2018-03-10 11:00:00 NaN
7 10/3/2018 11:30:00 0 2018-03-10 11:30:00 NaN
8 10/3/2018 13:30:00 0 2018-03-10 13:30:00 NaN
9 10/3/2018 13:50:00 5 2018-03-10 13:50:00 NaN
10 10/3/2018 15:00:00 0 2018-03-10 15:00:00 NaN
11 10/3/2018 15:25:00 0 2018-03-10 15:25:00 NaN
12 10/3/2018 16:25:00 0 2018-03-10 16:25:00 NaN
13 10/3/2018 18:00:00 0 2018-03-10 18:00:00 NaN
14 10/3/2018 19:00:00 0 2018-03-10 19:00:00 NaN
15 10/3/2018 19:30:00 0 2018-03-10 19:30:00 NaN
16 10/3/2018 20:00:00 0 2018-03-10 20:00:00 NaN
17 10/3/2018 22:05:00 0 2018-03-10 22:05:00 NaN
18 10/3/2018 22:15:00 5 2018-03-10 22:15:00 NaN
19 10/3/2018 23:40:00 0 2018-03-10 23:40:00 NaN
20 10/4/2018 6:58:00 5 2018-04-10 06:58:00 0.0
21 10/4/2018 13:00:00 0 2018-04-10 13:00:00 NaN
22 10/4/2018 16:00:00 0 2018-04-10 16:00:00 NaN
23 10/4/2018 17:00:00 0 2018-04-10 17:00:00 NaN
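If only the start rows themselves are needed, a small sketch using GroupBy.head to take the first 5-row of each day (assuming the same df as above):
# First x3 == 5 row per date; these are the per-day start times.
starts = df[df['x3'].eq(5)].groupby('date').head(1)
print(starts[['date', 'time', 'x3']])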

Resample python list with pandas

Fairly new to python and pandas here.
I make a query that's giving me back a timeseries. I'm never sure how many data points I receive from the query (run for a single day), but what I do know is that I need to resample them to contain 24 points (one for each hour in the day).
Printing m3hstream gives
[(1479218009000L, 109), (1479287368000L, 84)]
Then I try to make a dataframe df with
df = pd.DataFrame(data = list(m3hstream), columns=['Timestamp', 'Value'])
and this gives me an output of
Timestamp Value
0 1479218009000 109
1 1479287368000 84
Then I do this
daily_summary = pd.DataFrame()
daily_summary['value'] = df['Value'].resample('H').mean()
daily_summary = daily_summary.truncate(before=start, after=end)
print "Now daily summary"
print daily_summary
But this is giving me a TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Could anyone please let me know how to resample it so I have 1 point for each hour in the 24 hour period that I'm querying for?
Thanks.
The first thing you need to do is convert that 'Timestamp' to an actual pd.Timestamp; it looks like those are milliseconds.
Then resample with the on parameter set to 'Timestamp':
df = df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index()
Timestamp Value
0 2016-11-15 13:00:00 109.0
1 2016-11-15 14:00:00 NaN
2 2016-11-15 15:00:00 NaN
3 2016-11-15 16:00:00 NaN
4 2016-11-15 17:00:00 NaN
5 2016-11-15 18:00:00 NaN
6 2016-11-15 19:00:00 NaN
7 2016-11-15 20:00:00 NaN
8 2016-11-15 21:00:00 NaN
9 2016-11-15 22:00:00 NaN
10 2016-11-15 23:00:00 NaN
11 2016-11-16 00:00:00 NaN
12 2016-11-16 01:00:00 NaN
13 2016-11-16 02:00:00 NaN
14 2016-11-16 03:00:00 NaN
15 2016-11-16 04:00:00 NaN
16 2016-11-16 05:00:00 NaN
17 2016-11-16 06:00:00 NaN
18 2016-11-16 07:00:00 NaN
19 2016-11-16 08:00:00 NaN
20 2016-11-16 09:00:00 84.0
If you want to fill those NaN values, use ffill, bfill, or interpolate
df.assign(
    Timestamp=pd.to_datetime(df.Timestamp, unit='ms')
).resample('H', on='Timestamp').mean().reset_index().interpolate()
Timestamp Value
0 2016-11-15 13:00:00 109.00
1 2016-11-15 14:00:00 107.75
2 2016-11-15 15:00:00 106.50
3 2016-11-15 16:00:00 105.25
4 2016-11-15 17:00:00 104.00
5 2016-11-15 18:00:00 102.75
6 2016-11-15 19:00:00 101.50
7 2016-11-15 20:00:00 100.25
8 2016-11-15 21:00:00 99.00
9 2016-11-15 22:00:00 97.75
10 2016-11-15 23:00:00 96.50
11 2016-11-16 00:00:00 95.25
12 2016-11-16 01:00:00 94.00
13 2016-11-16 02:00:00 92.75
14 2016-11-16 03:00:00 91.50
15 2016-11-16 04:00:00 90.25
16 2016-11-16 05:00:00 89.00
17 2016-11-16 06:00:00 87.75
18 2016-11-16 07:00:00 86.50
19 2016-11-16 08:00:00 85.25
20 2016-11-16 09:00:00 84.00
Let's try:
daily_summary = daily_summary.set_index('Timestamp')
daily_summary.index = pd.to_datetime(daily_summary.index, unit='ms')
For once an hour:
daily_summary.resample('H').mean()
or for once a day:
daily_summary.resample('D').mean()
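Note that resample only spans from the first to the last timestamp present, so short days produce fewer than 24 rows. To guarantee exactly one point per hour, a sketch that reindexes onto a full-day hourly range (the date here is only an example of the day being queried):
# Hypothetical queried day; reindex guarantees all 24 hourly slots exist.
hours = pd.date_range('2016-11-15', periods=24, freq='H')
daily_summary = daily_summary.resample('H').mean().reindex(hours)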
