I have a dataframe with multiple rows and a date column that contains both date and time. The time is not incremented by the same amount in every row, so for each row I want to calculate the time difference, in seconds, between the current and the previous date.
import pandas as pd
data = pd.date_range('1/1/2011', periods=10, freq='H')
In the above snippet the time difference after each step is 1 hour, which means 3600 seconds, so I want a list of tuples of the form [(<prev_datetime>, <current_datetime>, <time_difference>), ...].
In this case, use list with zip and compute the time difference with total_seconds:
data = pd.date_range("1/1/2011", periods = 10, freq ="H")
L = list(zip(data.shift(), # <- previous time
data, # <- current time
(data.shift() - data).total_seconds())) # <- time diff
NB: if your dates live in a DataFrame column, replace data with pd.DatetimeIndex(df["date_column"]).
Output:
print(L)
[(Timestamp('2011-01-01 00:00:00', freq='H'),
  Timestamp('2011-01-01 01:00:00', freq='H'),
  3600.0),
 (Timestamp('2011-01-01 01:00:00', freq='H'),
  Timestamp('2011-01-01 02:00:00', freq='H'),
  3600.0),
 (Timestamp('2011-01-01 02:00:00', freq='H'),
  Timestamp('2011-01-01 03:00:00', freq='H'),
  3600.0),
 (Timestamp('2011-01-01 03:00:00', freq='H'),
  Timestamp('2011-01-01 04:00:00', freq='H'),
  3600.0),
 (Timestamp('2011-01-01 04:00:00', freq='H'),
  Timestamp('2011-01-01 05:00:00', freq='H'),
  3600.0),
 ...
You can achieve this by using the diff function in pandas to calculate the time difference between consecutive rows in the date column. Here's an example:
df = pd.DataFrame({"date": pd.date_range("1/1/2011", periods=10, freq="H")})
# Calculate the time difference between consecutive rows in seconds
df["time_diff"] = df["date"].diff().dt.total_seconds()
# Create a list of tuples
result = [(df.iloc[i-1]["date"], row["date"], row["time_diff"]) for i, row in df[1:].iterrows()]
df:
date time_diff
0 2011-01-01 00:00:00 NaN
1 2011-01-01 01:00:00 3600.0
2 2011-01-01 02:00:00 3600.0
3 2011-01-01 03:00:00 3600.0
4 2011-01-01 04:00:00 3600.0
5 2011-01-01 05:00:00 3600.0
6 2011-01-01 06:00:00 3600.0
7 2011-01-01 07:00:00 3600.0
8 2011-01-01 08:00:00 3600.0
9 2011-01-01 09:00:00 3600.0
result:
[(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-01-01 01:00:00'), 3600.0),
(Timestamp('2011-01-01 01:00:00'), Timestamp('2011-01-01 02:00:00'), 3600.0),
(Timestamp('2011-01-01 02:00:00'), Timestamp('2011-01-01 03:00:00'), 3600.0),
(Timestamp('2011-01-01 03:00:00'), Timestamp('2011-01-01 04:00:00'), 3600.0),
(Timestamp('2011-01-01 04:00:00'), Timestamp('2011-01-01 05:00:00'), 3600.0),
(Timestamp('2011-01-01 05:00:00'), Timestamp('2011-01-01 06:00:00'), 3600.0),
(Timestamp('2011-01-01 06:00:00'), Timestamp('2011-01-01 07:00:00'), 3600.0),
(Timestamp('2011-01-01 07:00:00'), Timestamp('2011-01-01 08:00:00'), 3600.0),
(Timestamp('2011-01-01 08:00:00'), Timestamp('2011-01-01 09:00:00'), 3600.0)]
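For larger frames, a vectorized sketch of the same idea (an addition, not part of the original answer) avoids the Python-level iterrows loop; it builds the previous-timestamp column with shift and pulls plain tuples out with itertuples:

import pandas as pd

df = pd.DataFrame({"date": pd.date_range("1/1/2011", periods=10, freq="H")})
df["prev"] = df["date"].shift()                          # previous timestamp
df["time_diff"] = df["date"].diff().dt.total_seconds()   # gap in seconds
# plain (previous, current, diff) tuples, skipping the first (NaT) row
result = list(df.loc[1:, ["prev", "date", "time_diff"]]
                .itertuples(index=False, name=None))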
It's possible to do this with a list comprehension. The [:-1] is required because zipping data with data.shift(1) yields 10 pairs, but there are only N-1 intervals between N points.
result = [(i[0],
           i[1],
           (i[1] - i[0]).total_seconds())
          for i in list(zip(data, data.shift(1)))[:-1]]
print(result)
[(Timestamp('2011-01-01 00:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 05:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 06:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 06:00:00', freq='H'),
Timestamp('2011-01-01 07:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 07:00:00', freq='H'),
Timestamp('2011-01-01 08:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 08:00:00', freq='H'),
Timestamp('2011-01-01 09:00:00', freq='H'),
3600.0)]
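One caveat worth adding to the shift-based approach above: DatetimeIndex.shift() shifts the values by the index's freq and raises when freq is None, so for an arbitrary datetime column a positional Series-based sketch (an addition, not part of the original answer) is safer:

import pandas as pd

s = pd.Series(pd.date_range("1/1/2011", periods=10, freq="H"))
result = list(zip(s[:-1],                            # previous time
                  s[1:],                             # current time
                  s.diff().dt.total_seconds()[1:]))  # gap in seconds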
Related
I have the below data frame, which is time-series data, and I process this information as input to my prediction models.
df = pd.DataFrame({"timestamp": [pd.Timestamp('2019-01-01 01:00:00', tz=None),
pd.Timestamp('2019-01-01 01:00:00', tz=None),
pd.Timestamp('2019-01-01 01:00:00', tz=None),
pd.Timestamp('2019-01-01 02:00:00', tz=None),
pd.Timestamp('2019-01-01 02:00:00', tz=None),
pd.Timestamp('2019-01-01 02:00:00', tz=None),
pd.Timestamp('2019-01-01 03:00:00', tz=None),
pd.Timestamp('2019-01-01 03:00:00', tz=None),
pd.Timestamp('2019-01-01 03:00:00', tz=None)],
"value":[5.4,5.1,100.8,20.12,21.5,80.08,150.09,160.12,20.06]
})
From this, I take the mean of the value for each timestamp and send it as input to the predictor. But currently I am using just thresholds to filter out the outliers, and those seem to filter out real values while letting some outliers through.
For example, I kept
df[(df['value'] > 3) & (df['value'] < 120)]
and then this does not filter out
2019-01-01 01:00:00 100.8
which is an outlier for that timestamp and does filter out
2019-01-01 03:00:00 150.09
2019-01-01 03:00:00 160.12
which are not outliers for that timestamp.
So how do I filter out, for each timestamp group, the values that do not fit that group?
Any help is appreciated.
OK, let's assume you are searching for a confidence interval to detect outliers.
Then you have to get the mean and the confidence intervals for each timestamp group. Therefore you can run:
import math

stats = df.groupby(['timestamp'])['value'].agg(['mean', 'count', 'std'])

ci95_hi = []
ci95_lo = []
for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96 * s / math.sqrt(c))
    ci95_lo.append(m - 1.96 * s / math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo
df = pd.merge(df, stats, how='left', on='timestamp')
which merges the per-group statistics onto every row. Then you can add a filter column:
import numpy as np

df['Outlier'] = np.where(df['value'] >= df['ci95_hi'], 1,
                         np.where(df['value'] <= df['ci95_lo'], 1, 0))
Then everything with a 1 in the Outlier column is an outlier. You can play with the 1.96 multiplier to adjust the sensitivity.
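For reference, a more compact sketch of the same confidence-interval idea, using groupby().transform() so no merge is needed; the 1.96 multiplier and the column names are carried over from the answer above:

import numpy as np

grp = df.groupby('timestamp')['value']
mean = grp.transform('mean')
half_width = 1.96 * grp.transform('std') / np.sqrt(grp.transform('count'))
df['Outlier'] = ((df['value'] >= mean + half_width) |
                 (df['value'] <= mean - half_width)).astype(int)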
When I use pandas.date_range(), I sometimes get timestamps with lots of sub-second digits that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To drop the sub-second part, I am forced to do this.
>>> t = []
>>> for item in dr:
...     idx = str(item).find('.')
...     if idx != -1:
...         item = str(item)[:idx]
...     t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way ?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = (dr.to_series()).apply(lambda x: x.replace(microsecond=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664
I believe you need DatetimeIndex.floor:
print(dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
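The same works on a DataFrame column through the .dt accessor; a minimal sketch, assuming the timestamps live in a hypothetical column named ts:

import pandas as pd

df = pd.DataFrame({'ts': pd.date_range('2011-01-01', '2011-01-03', periods=15)})
df['ts'] = df['ts'].dt.floor('S')   # Series counterpart of DatetimeIndex.floor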
I have a use case where I always need the non-leap calendar, whether or not the year is a leap year. I want to construct a 6-hourly datetime list for the year 2000, for example:
import datetime
import pandas as pd

tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000, 1, 1, 0)
ts = [dt + i * tdelta for i in range(1460)]
pd.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However, I want February to have 28 days, so the last member of the output should be '2000-12-31 18:00:00'. Is there a way to do this with Python? Thanks!
All you need to do is check the .month and .day attributes of each datetime instance: when month == 2 and day == 29, don't add it to the list. (You also need to iterate a few extra steps, 1464 in total for the leap year 2000, so that the list still ends at 2000-12-31 18:00:00.)
To make it more descriptive:
ts = []
for i in range(1464):   # 366 days * 4 six-hour steps in the leap year 2000
    x = dt + i * tdelta
    if x.month == 2 and x.day == 29:
        continue        # skip the leap day
    ts.append(x)        # leaves 1460 timestamps, ending 2000-12-31 18:00:00
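An equivalent sketch using pandas alone, under the same 6-hourly assumption: the leap year 2000 has 366 * 4 = 1464 stamps, and dropping February 29 leaves the 1460 the question asks for:

import pandas as pd

idx = pd.date_range('2000-01-01 00:00', '2000-12-31 18:00', freq='6H')
idx = idx[~((idx.month == 2) & (idx.day == 29))]   # drop the leap day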
Assume I have loaded time series data from SQL or CSV (not created in Python), the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how I can detect the frequency of this series and set its freq accordingly. If possible, I would like this to work on data that isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a frequency that the Series accepts.
It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix.freq = None   # clear the freq so it has to be re-inferred
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If it is not continuous, pandas.infer_freq will return None. Similarly to what has already been proposed, another alternative is using the pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
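Putting the two together, a small helper sketch (guess_freq is a made-up name, not a pandas API) that tries infer_freq first and falls back to the modal gap between consecutive timestamps:

import pandas as pd

def guess_freq(ix: pd.DatetimeIndex):
    freq = pd.infer_freq(ix)   # None when the index has gaps
    if freq is not None:
        return freq
    # fall back to the most common gap between consecutive timestamps
    return ix.to_series().diff().mode()[0]

guess_freq(split_ix)
Out[5]: Timedelta('0 days 01:00:00')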
Maybe try taking the difference of the time index and using the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of ns. To get a frequency in Hz, assuming ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)
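If you want a pandas offset rather than a rate in Hz, to_offset converts the modal gap back into something date_range accepts; a sketch, assuming a pandas version where to_offset takes a Timedelta:

import pandas as pd

gap = df.index.to_series().diff().mode()[0]      # e.g. Timedelta('0 days 01:00:00')
offset = pd.tseries.frequencies.to_offset(gap)   # <Hour>
full_rng = pd.date_range(df.index[0], df.index[-1], freq=offset)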
How to check if a numpy datetime is between time1 and time2 (ignoring the date)?
Say I have a series of datetimes; I want to check their weekday and whether they fall between 13:00 and 13:30. For example,
2014-03-05 22:55:00
is a Wednesday, and it's not between 13:00 and 13:30.
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]
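For a single timestamp rather than a whole index, a minimal sketch with the standard library's datetime.time does the same check:

import datetime
import pandas as pd

t = pd.Timestamp('2014-03-05 22:55:00')
is_wednesday = t.weekday() == 2                                        # Monday == 0
in_window = datetime.time(13, 0) <= t.time() <= datetime.time(13, 30)
print(is_wednesday, in_window)   # True False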