I have a time series of visibility data containing half-hourly measurements. A fog event starts when the visibility falls below 1 km and ends when the visibility exceeds 1 km. Please find the code attached below. I intend to find the number of such fog events and the duration of each one.
from IPython.display import display
import pandas as pd
import matplotlib.pyplot as plt
from google.colab import files
uploaded = files.upload()
import io
df = pd.read_csv(io.BytesIO(uploaded['visibility.csv']))
df.set_index('Unnamed: 0',inplace=True)
df.index = pd.to_datetime(df.index)
df=df.interpolate(method='linear', limit_direction='forward')
display(df)
Unnamed: 0 visibility_km
2016-01-01 00:00:00 0.595456
2016-01-01 00:30:00 0.595456
2016-01-01 01:00:00 0.595456
2016-01-01 01:30:00 0.595456
2016-01-01 02:00:00 0.595456
... ...
2020-12-31 21:30:00 0.925370
2020-12-31 22:00:00 0.901230
2020-12-31 22:30:00 0.804670
2020-12-31 23:00:00 0.804670
2020-12-31 23:30:00 0.692016
# FOG Events
fog_events=df[df<1.0].count()
print('no. of fog events',fog_events)
no. of fog events 10318
But it simply gives the number of times the visibility drops below 1 km and not the number of fog events.
You can create sample time series data like this:
import pandas as pd
tdf = pd.DataFrame({'Time':pd.date_range(start='1/1/2016', periods=11, freq='30s'),
'Visibility_km': [0.56, 0.75, 0.99, 1.01, 1.1, 1.3, 0.5, 0.6, 0.7, 1.2, 1.3]})
Data in this format makes it easier to copy and paste your problem. To get the total number of fog events and their durations, start by creating a column for the events and another to mark where each event starts and ends:
# Create column to mark duration of events
tdf['fog_event'] = (tdf['Visibility_km'] < 1.).astype(int)
# Create column to mark event start and end
tdf['event_diff'] = tdf['fog_event'] != tdf['fog_event'].shift(1)
print(tdf)
Time Visibility_km fog_event event_diff
0 2016-01-01 00:00:00 0.56 1 True
1 2016-01-01 00:00:30 0.75 1 False
2 2016-01-01 00:01:00 0.99 1 False
3 2016-01-01 00:01:30 1.01 0 True
4 2016-01-01 00:02:00 1.10 0 False
5 2016-01-01 00:02:30 1.30 0 False
6 2016-01-01 00:03:00 0.50 1 True
7 2016-01-01 00:03:30 0.60 1 False
8 2016-01-01 00:04:00 0.70 1 False
9 2016-01-01 00:04:30 1.20 0 True
10 2016-01-01 00:05:00 1.30 0 False
Now you can get the events in two ways:
The first way doesn't use Pandas grouping; it was my original approach to grouping the events.
import numpy as np
from itertools import groupby
groups = [list(g) for _, g in groupby(tdf.fog_event.values)]
fog_durations = np.array([sum(g) for g in groups])
duration_each_event = fog_durations[fog_durations != 0]
total_fog_events = sum(fog_durations != 0)
print(duration_each_event)
array([3, 3])
print(total_fog_events)
2
To do it using Pandas, you can group by the cumulative sum of the event difference:
fdf = tdf.groupby([tdf['event_diff'].cumsum(), 'fog_event']).size()
fdf = fdf.reset_index(name = 'duration').rename(columns = {'event_diff': 'index'})
duration_each_event = fdf.loc[fdf['fog_event'] == 1, 'duration'].values
total_fog_events = fdf.loc[fdf['fog_event'] == 1, 'fog_event'].sum()
print(duration_each_event)
[3, 3]
print(total_fog_events)
2
Assuming the time interval between measurements doesn't change (i.e. always measured 30 seconds apart), you can multiply duration_each_event by 30 (for seconds) or 0.5 (for minutes) to get the duration in time units.
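For instance (a small sketch, not part of the answer's code above), the counts can be turned into actual durations:
import pandas as pd
# Convert event lengths (in samples) into time durations, assuming a fixed
# 30-second spacing as in the sample data above. For the original
# half-hourly data, use unit='m' with a factor of 30 instead.
durations = pd.to_timedelta(duration_each_event * 30, unit='s')
print(durations)
# TimedeltaIndex(['0 days 00:01:30', '0 days 00:01:30'], dtype='timedelta64[ns]', freq=None)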
The context
I am looking to apply a ufunc (cumsum in this case) to blocks of contiguous rows in a time series stored in a pandas DataFrame.
The time series is sorted by its DatetimeIndex.
Blocks are defined by a custom DatetimeIndex.
To do so, I came up with this (OK) code.
import random
import pandas as pd
# input dataset
length = 10
ts = pd.date_range(start='2021/01/01 00:00', periods=length, freq='1h')
random.seed(1)
val = random.sample(range(1, 10+length), length)
df = pd.DataFrame({'val': val}, index=ts)
# groupby custom datetimeindex
key_ts = [ts[i] for i in [1, 3, 7]]
df.loc[key_ts, 'id'] = range(len(key_ts))
df['id'] = df['id'].ffill()
# cumsum
df['cumsum'] = df.groupby('id')['val'].cumsum()
# initial dataset
In [13]: df
Out[13]:
val
2021-01-01 00:00:00 5
2021-01-01 01:00:00 3
2021-01-01 02:00:00 9
2021-01-01 03:00:00 4
2021-01-01 04:00:00 8
2021-01-01 05:00:00 13
2021-01-01 06:00:00 15
2021-01-01 07:00:00 14
2021-01-01 08:00:00 11
2021-01-01 09:00:00 7
# DatetimeIndex defining custom time intervals for 'resampling'.
In [14]: key_ts
Out[14]:
[Timestamp('2021-01-01 01:00:00', freq='H'),
Timestamp('2021-01-01 03:00:00', freq='H'),
Timestamp('2021-01-01 07:00:00', freq='H')]
# result
In [16]: df
Out[16]:
val id cumsum
2021-01-01 00:00:00 5 NaN -1
2021-01-01 01:00:00 3 0.0 3
2021-01-01 02:00:00 9 0.0 12
2021-01-01 03:00:00 4 1.0 4
2021-01-01 04:00:00 8 1.0 12
2021-01-01 05:00:00 13 1.0 25
2021-01-01 06:00:00 15 1.0 40
2021-01-01 07:00:00 14 2.0 14
2021-01-01 08:00:00 11 2.0 25
2021-01-01 09:00:00 7 2.0 32
The question
Is groupby the most efficient in terms of CPU and memory in this case, where blocks are made of contiguous rows?
I would think that with groupby, a first pass over the full dataset is made to identify all the rows to group together.
Knowing the rows are contiguous in my case, I don't need to read the full dataset to know I have gathered all the rows of the current group.
As soon as I hit the row of the next group, I know the calculations are done for the previous group.
When rows are contiguous, the sorting step is also lighter.
Hence the question: is there a way to tell pandas this, to save some CPU?
Thanks in advance for your feedback,
Best
groupby is clearly not the fastest solution here because it has to use either a (slow) sort or (slow) hashing operations to group the values.
What you want to implement is called a segmented cumulative sum. You can implement it reasonably efficiently using NumPy, but it is a bit tricky to write (especially due to the NaN values) and not the fastest solution, because it needs multiple passes over the id/val columns. The fastest solution is to use something like Numba to do this very quickly in one step.
Here is an implementation:
import numpy as np
import numba as nb
# To avoid the compilation cost at runtime, use:
# @nb.njit('int64[:](float64[:],int64[:])')
@nb.njit
def segmentedCumSum(ids, values):
    size = len(ids)
    res = np.empty(size, dtype=values.dtype)
    if size == 0:
        return res
    zero = values.dtype.type(0)
    curValue = zero
    for i in range(size):
        if not np.isnan(ids[i]):
            # reset the running sum whenever the id changes
            if i > 0 and ids[i-1] != ids[i]:
                curValue = zero
            curValue += values[i]
            res[i] = curValue
        else:
            # rows with a NaN id are marked with -1
            res[i] = -1
            curValue = zero
    return res
df['cumsum'] = segmentedCumSum(df['id'].to_numpy(), df['val'].to_numpy())
Note that ids[i-1] != ids[i] might fail with big floats because of their imprecision. The best solution is to use integers and -1 to replace the NaN value. If you do want to keep the float values, you can use the expression np.abs(ids[i-1]-ids[i]) > epsilon with a very small epsilon. See this for more information.
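For reference, here is a rough NumPy-only sketch of the segmented cumulative sum described above (an illustration under the assumption that the id column is non-decreasing where it is not NaN, which the ffill in the question guarantees; it is not the recommended Numba path):
import numpy as np
ids = df['id'].to_numpy()
vals = df['val'].to_numpy(dtype=np.float64)
valid = ~np.isnan(ids)
masked = np.where(valid, vals, 0.0)   # treat NaN-id rows as 0
cs = np.cumsum(masked)                # global cumulative sum
# a segment starts at the first valid row or wherever the id changes
starts = valid & np.r_[True, ids[1:] != ids[:-1]]
# index of the start of the segment each row belongs to
start_idx = np.maximum.accumulate(np.where(starts, np.arange(len(ids)), 0))
# cumulative total accumulated just before each segment started
before = cs[start_idx] - masked[start_idx]
df['cumsum_np'] = np.where(valid, cs - before, -1.0)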
I need to find the timeframe from the master based on the input time.
cust_id starttime
0 1 2000-01-01 09:00:03
1 2 2000-01-01 18:01:03
The output I need is:
cust_id starttime timeframe
0 1 2000-01-01 09:00:03 morning
1 2 2000-01-01 18:01:03 evening
Code for creating the master timeframe details:
mastdf = {'timeframe': ['morning', 'latemorning', 'midnoon', 'evening'],
          'start_time': ['8:00:00', '11:00:00', '13:00:00', '17:00:00'],
          'end_time': ['10:59:59', '13:59:59', '16:59:59', '7:59:59']}
Code for creating the input dataframe:
inputdf = {'cust_id': [1, 2], 'starttime': ['2000-01-01 09:00:03', '2000-01-01 18:01:03']}
Use cut for binning, but first convert the values to timedeltas with to_timedelta, create the bins by adding a 24H endpoint, and for times between 00:00:00 and 08:00:00 use fillna with the last value of the timeframe column:
mastdf={'timeframe':['morning','latemorning','midnoon','evening'],
'start_time':['8:00:00','11:00:00','13:00:00','17:00:00'],
'end_time':['10:59:59','13:59:59','16:59:59','7:59:59']}
mastdf = pd.DataFrame(mastdf)
print (mastdf)
timeframe start_time end_time
0 morning 8:00:00 10:59:59
1 latemorning 11:00:00 13:59:59
2 midnoon 13:00:00 16:59:59
3 evening 17:00:00 7:59:59
inputdf={'cust_id':[1,2],'starttime':['2000-01-01 09:00:03', '2000-01-01 18:01:03']}
inputdf = pd.DataFrame(inputdf)
inputdf['starttime'] = pd.to_datetime(inputdf['starttime'])
start = pd.to_timedelta(mastdf['start_time']).tolist() + [pd.Timedelta(24, unit='h')]
s = pd.to_timedelta(inputdf['starttime'].dt.strftime('%H:%M:%S'))
last = mastdf['timeframe'].iat[-1]
inputdf['timeframe'] = pd.cut(s,
bins=start,
labels=mastdf['timeframe'], right=False).fillna(last)
print (inputdf)
cust_id starttime timeframe
0 1 2000-01-01 09:00:03 morning
1 2 2000-01-01 18:01:03 evening
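As a quick sanity check (a hypothetical extra start time, reusing start, last and mastdf from the code above): a time earlier than 08:00:00 falls outside every bin, so the fillna step maps it to the last timeframe, 'evening'.
# Hypothetical start time before 08:00; it falls outside all bins,
# so fillna assigns the last timeframe ('evening')
extra = pd.Series(pd.to_timedelta(['02:30:00']))
print(pd.cut(extra, bins=start, labels=mastdf['timeframe'], right=False).fillna(last))
# 0    evening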
I have a dataset of samples covering multiple days, all with a timestamp.
I want to select rows within a specific time window, e.g. all rows that were generated between 1 pm and 3 pm on any day.
This is a sample of my data in a pandas dataframe:
22 22 2018-04-12T20:14:23Z 2018-04-12T21:14:23Z 0 6370.1
23 23 2018-04-12T21:14:23Z 2018-04-12T21:14:23Z 0 6368.8
24 24 2018-04-12T22:14:22Z 2018-04-13T01:14:23Z 0 6367.4
25 25 2018-04-12T23:14:22Z 2018-04-13T01:14:23Z 0 6365.8
26 26 2018-04-13T00:14:22Z 2018-04-13T01:14:23Z 0 6364.4
27 27 2018-04-13T01:14:22Z 2018-04-13T01:14:23Z 0 6362.7
28 28 2018-04-13T02:14:22Z 2018-04-13T05:14:22Z 0 6361.0
29 29 2018-04-13T03:14:22Z 2018-04-13T05:14:22Z 0 6359.3
.. ... ... ... ... ...
562 562 2018-05-05T08:13:21Z 2018-05-05T09:13:21Z 0 6300.9
563 563 2018-05-05T09:13:21Z 2018-05-05T09:13:21Z 0 6300.7
564 564 2018-05-05T10:13:14Z 2018-05-05T13:13:14Z 0 6300.2
565 565 2018-05-05T11:13:14Z 2018-05-05T13:13:14Z 0 6299.9
566 566 2018-05-05T12:13:14Z 2018-05-05T13:13:14Z 0 6299.6
How do I achieve that? I need to ignore the date and just evaluate the time component. I could traverse the dataframe in a loop and evaluate the datetime that way, but there must be a simpler way to do it.
I converted the messageDate column, which was read as a string, to a datetime with
df["messageDate"] = pd.to_datetime(df["messageDate"])
But after that I got stuck on how to filter on time only.
Any input appreciated.
Datetime columns have a DatetimeProperties accessor (.dt), from which you can extract components such as the hour or the datetime.time and filter on them:
import datetime
df = pd.DataFrame(
[
'2018-04-12T12:00:00Z', '2018-04-12T14:00:00Z','2018-04-12T20:00:00Z',
'2018-04-13T12:00:00Z', '2018-04-13T14:00:00Z', '2018-04-13T20:00:00Z'
],
columns=['messageDate']
)
df
messageDate
# 0 2018-04-12 12:00:00
# 1 2018-04-12 14:00:00
# 2 2018-04-12 20:00:00
# 3 2018-04-13 12:00:00
# 4 2018-04-13 14:00:00
# 5 2018-04-13 20:00:00
df["messageDate"] = pd.to_datetime(df["messageDate"])
time_mask = (df['messageDate'].dt.hour >= 13) & \
(df['messageDate'].dt.hour <= 15)
df[time_mask]
# messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00
I hope the code is self explanatory. You can always ask questions.
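As an aside (a sketch, not part of the answer above), the same filter can be written against the actual time-of-day objects via .dt.time, which bounds the window at exactly 3 pm:
import datetime
# compare the time of day directly instead of the hour component
time_mask = (df['messageDate'].dt.time >= datetime.time(13, 0)) & \
            (df['messageDate'].dt.time <= datetime.time(15, 0))
print(df[time_mask])
#           messageDate
# 1 2018-04-12 14:00:00
# 4 2018-04-13 14:00:00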
import pandas as pd
# Prepping data for example
dates = pd.date_range('1/1/2018', periods=7, freq='H')
data = {'A' : range(7)}
df = pd.DataFrame(index = dates, data = data)
print(df)
# A
# 2018-01-01 00:00:00 0
# 2018-01-01 01:00:00 1
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
# 2018-01-01 05:00:00 5
# 2018-01-01 06:00:00 6
# Creating a mask to filter the values we wish to keep or not.
# Here, we use df.index because the index is our datetime.
# If the datetime is a column, you can always say df['column_name']
mask = (df.index > '2018-1-1 01:00:00') & (df.index < '2018-1-1 05:00:00')
print(mask)
# [False False True True True False False]
df_with_good_dates = df.loc[mask]
print(df_with_good_dates)
# A
# 2018-01-01 02:00:00 2
# 2018-01-01 03:00:00 3
# 2018-01-01 04:00:00 4
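Side note (an addition, not part of the answer above): since the datetime is the index here, DataFrame.between_time filters on the time of day directly, which matches the original question more closely.
# between_time keeps rows whose time of day falls in the window,
# regardless of the date (requires a DatetimeIndex)
print(df.between_time('01:30', '04:30'))
#                      A
# 2018-01-01 02:00:00  2
# 2018-01-01 03:00:00  3
# 2018-01-01 04:00:00  4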
df=df[(df["messageDate"].apply(lambda x : x.hour)>13) & (df["messageDate"].apply(lambda x : x.hour)<15)]
You can use x.minute, x.second similarly.
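For example (a small sketch in the same apply/lambda style, not from the original answer), keeping only rows from the first half of each hour:
# filter on the minute component in the same way
first_half = df[df["messageDate"].apply(lambda x: x.minute) < 30]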
Try this after ensuring messageDate is indeed in datetime format, as you have done:
df.set_index('messageDate',inplace=True)
choseInd = [ind for ind in df.index if (ind.hour>=13)&(ind.hour<=15)]
df_select = df.loc[choseInd]
You can do the same even without making the datetime column the index, as the apply/lambda answer shows.
Having the datetime as your index rather than a numerical one just makes the dataframe 'better looking'.
I'm trying to aggregate from the end of a date range instead of from the beginning. I would have thought that adding closed='right' to the grouper would solve the issue, but it doesn't. Please let me know how I can achieve my desired output shown at the bottom, thanks.
import pandas as pd
df = pd.DataFrame(columns=['date','number'])
df['date'] = pd.date_range('1/1/2000', periods=8, freq='T')
df['number'] = pd.Series(range(8))
df
date number
0 2000-01-01 00:00:00 0
1 2000-01-01 00:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 00:07:00 7
With the groupby and aggregation of the date I get the following. Since I have 8 dates and I'm grouping by periods of 3, it must choose whether to truncate the earliest date group or the latest date group, and it chooses to truncate the latest one (which has a count of 2):
df.groupby(pd.Grouper(key='date', freq='3T')).agg('count')
date number
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
My desired output is to instead truncate the earliest date group:
date number
2000-01-01 00:00:00 2
2000-01-01 00:02:00 3
2000-01-01 00:05:00 3
Please let me know how this can be achieved, I'm hopeful there's just a parameter that can be set that I've overlooked. Note that this is similar to this question, but my question is specific to the date truncation.
EDIT: To reframe the question (thanks Alexdor): the default behavior in pandas is to bin by periods [0, 3), [3, 6), [6, 9), but instead I'd like to bin by (-1, 2], (2, 5], (5, 8].
It seems like the grouper builds up the bins starting from the earliest time in the series that you pass to it. I couldn't see a way to make it build up the bins from the newest time, but it's fairly easy to construct the bins from scratch.
freq = '3min'
minTime = df.date.min()
maxTime = df.date.max()
deltaT = pd.Timedelta(freq)
minTime -= deltaT - (maxTime - minTime) % deltaT # adjust min time to start of first bin
r = pd.date_range(start=minTime, end=maxTime, freq=freq)
df.groupby(pd.cut(df["date"], r)).agg('count')
Gives
date date number
(1999-12-31 23:58:00, 2000-01-01 00:01:00] 2 2
(2000-01-01 00:01:00, 2000-01-01 00:04:00] 3 3
(2000-01-01 00:04:00, 2000-01-01 00:07:00] 3 3
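Side note (an assumption about newer pandas, not part of the answer above): pandas 1.1 added origin/offset to Grouper/resample, and pandas 1.3 added origin='end', which anchors the bins on the latest timestamp instead. The counts should then match the desired output, although the labels come out as the right bin edges rather than each group's first timestamp:
# Sketch: anchor the 3-minute bins on the last timestamp (pandas >= 1.3)
# expected: counts 2, 3, 3 labelled by the right bin edges (00:01, 00:04, 00:07)
df.groupby(pd.Grouper(key='date', freq='3T', origin='end')).agg('count')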
This is one hack which lets you group by a constant group size, counting from the bottom up.
from itertools import chain

def grouper(x, k=3):
    # the first (possibly shorter) group takes the remainder, so the
    # remaining groups of size k are counted from the bottom up
    n = len(x.index)
    return list(chain.from_iterable([[0] * (n % k)] +
                                    [[i] * k for i in range(1, n // k + 1)]))

df['grouper'] = grouper(df, 3)
res = df.groupby('grouper', as_index=False)\
        .agg({'date': 'first', 'number': 'count'})\
        .drop(columns='grouper')
# date number
# 0 2000-01-01 00:00:00 2
# 1 2000-01-01 00:02:00 3
# 2 2000-01-01 00:05:00 3
I have days like this:
eventday_idxs
2005-01-07 00:00:00
2005-01-31 00:00:00
2005-02-15 00:00:00
2005-04-18 00:00:00
2005-05-11 00:00:00
2005-08-12 00:00:00
2005-08-15 00:00:00
2005-09-06 00:00:00
2005-09-19 00:00:00
2005-10-12 00:00:00
2005-10-13 00:00:00
2005-10-20 00:00:00
2006-01-10 00:00:00
2006-01-30 00:00:00
2006-02-10 00:00:00
2006-03-29 00:00:00
I want to do calculations on From:To ranges of it like this, on an AAPL stock dataset.
As I am a beginner in Pandas, I use a loop and do it like this:
import datetime
from datetime import timedelta
import pandas as pd

aap1_10_years = pd.io.data.get_data_yahoo('AAPL',
                                          start=datetime.datetime(2004, 12, 10),
                                          end=datetime.datetime(2014, 12, 10))
one_day = timedelta(days=1)
for i, ind in enumerate(eventday_idxs):
    try:
        do_calculations(aap1_10_years[ind: eventday_idxs[i+1] - one_day]['High'])
    except IndexError:
        do_calculations(aap1_10_years[ind:]['High'])
How can I apply do_calculations without loops like this? Loops like this are discouraged in pandas because they are slow, right?
The time spans between the events are not regular:
In [141]: eventday_idxs.diff().head()
Out[141]:
0 NaT
1 24 days
2 15 days
3 62 days
4 23 days
Name: 0, dtype: timedelta64[ns]
so we can't express the calculation using rolling_apply. However, if we could
assign an "event number" to each of the rows in aap1_10_years, then we could
groupby these event numbers and apply do_calculations to each group.
If we define:
# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()
then aap1_10_years['event'] equals 1 for these rows:
In [144]: aap1_10_years.loc[aap1_10_years['event'] == 1, ['Close', 'event']]
Out[144]:
Close event
Date
2005-01-07 69.25 1
2005-01-10 68.96 1
2005-01-11 64.56 1
2005-01-12 65.46 1
2005-01-13 69.80 1
2005-01-14 70.20 1
2005-01-18 70.65 1
2005-01-19 69.88 1
2005-01-20 70.46 1
2005-01-21 70.49 1
2005-01-24 70.76 1
2005-01-25 72.05 1
2005-01-26 72.25 1
2005-01-27 72.64 1
2005-01-28 73.98 1
Thus event number 1 has been assigned to all the rows with dates between
2005-01-07 and 2005-01-28. Similarly, each of the other event ranges has been assigned a unique event number.
import datetime as DT
import pandas as pd
import pandas.io.data as pdata
eventday_idxs = pd.to_datetime(pd.read_table('data', header=None)[0])
aap1_10_years = pdata.get_data_yahoo(
    'AAPL',
    start=DT.datetime(2004, 12, 10),
    end=DT.datetime(2014, 12, 10))
# mark each event day with a 1
aap1_10_years.loc[eventday_idxs, 'event'] = 1
# use cumsum to assign an event number to each event range
aap1_10_years['event'] = aap1_10_years['event'].fillna(0).cumsum()
mask = aap1_10_years['event'] > 0
aap1_10_years.loc[mask].groupby(['event'])['High'].apply(do_calculations)
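For illustration (do_calculations is left unspecified in the question, so this is a hypothetical stand-in), a per-event summary plugged into the last line above could be:
# Hypothetical stand-in for do_calculations: the range of the 'High'
# price within each event window.
def do_calculations(high):
    return high.max() - high.min()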