Verifying timestamps in a time series - python

I am working with time series data and I would like to know if there is an efficient and pythonic way to verify whether the sequence of timestamps associated with the series is valid. In other words, I would like to know whether the sequence of timestamps is in the correct ascending order, without missing or duplicated values.
I suppose that verifying the correct order and detecting duplicated values should be fairly straightforward, but I am not so sure about the detection of missing timestamps.
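For reference, if the timestamps live in a pandas DatetimeIndex, each of the three checks is essentially a one-liner. A minimal sketch, assuming pandas is available and the expected frequency is known and fixed (here "5min"):
import pandas as pd

def check_index(idx, freq="5min"):
    in_order = idx.is_monotonic_increasing   # correct ascending order
    has_dupes = idx.duplicated().any()       # duplicated timestamps
    # missing timestamps: diff against a complete range at the expected frequency
    missing = pd.date_range(idx.min(), idx.max(), freq=freq).difference(idx)
    return in_order, has_dupes, missing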

numpy.diff can be used to find the difference between subsequent timestamps. These diffs can then be evaluated to determine whether the timestamps look as expected:
import numpy as np
import datetime as dt

def errant_timestamps(ts, expected_time_step=None, tolerance=0.02):
    # get the time delta between subsequent timestamps
    ts_diffs = np.array([tsd.total_seconds() for tsd in np.diff(ts)])
    # get the expected delta
    if expected_time_step is None:
        expected_time_step = np.median(ts_diffs)
    # find the index of timestamps that don't match the spacing of the rest
    ts_slow_idx = np.where(ts_diffs < expected_time_step * (1 - tolerance))[0] + 1
    ts_fast_idx = np.where(ts_diffs > expected_time_step * (1 + tolerance))[0] + 1
    # find the errant timestamps
    ts_slow = ts[ts_slow_idx]
    ts_fast = ts[ts_fast_idx]
    # if the timestamps appear valid, return None
    if len(ts_slow) == 0 and len(ts_fast) == 0:
        return None
    # return any errant timestamps
    return ts_slow, ts_fast

sample_timestamps = np.array(
    [dt.datetime.strptime(sts, "%d%b%Y %H:%M:%S") for sts in (
        "05Jan2017 12:45:00",
        "05Jan2017 12:50:00",
        "05Jan2017 12:55:00",
        "05Jan2017 13:05:00",
        "05Jan2017 13:10:00",
        "05Jan2017 13:00:00",
    )]
)
print(errant_timestamps(sample_timestamps))
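For the sample above, the out-of-order 13:00:00 entry makes the 12:55 to 13:05 step too wide and the 13:10 to 13:00 step negative, so both arrays come back non-empty. Tracing the logic by hand (illustrative, not verified output):
(array([datetime.datetime(2017, 1, 5, 13, 0)], dtype=object), array([datetime.datetime(2017, 1, 5, 13, 5)], dtype=object))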

Related

How to transform a for loop into a lambda function

I have written this function:
from datetime import timedelta
import pandas as pd

def time_to_unix(df, dateToday):
    '''this function creates the timestamp column for the dataframe. it also gets today's date (ex: 2022-8-8 0:0:0)
    and then it adds the seconds that were originally in the timestamp column.
    input: dataframe, dateToday (type: pandas.core.series.Series)
    output: list of times
    '''
    dateTime = dateToday[0]
    times = []
    for i in range(0, len(df['timestamp'])):
        dateAndTime = dateTime + timedelta(seconds=float(df['timestamp'][i]))
        unix = pd.to_datetime([dateAndTime]).astype(int) / 10**9
        times.append(unix[0])
    return times
So it takes a dataframe, gets today's date, takes each value of the timestamp column in the dataframe (which is in seconds, like 10, 20, ...), applies the function, and returns the times in Unix time.
However, because I have approximately 2 million rows in my dataframe, this code takes a long time to run.
How can I use a lambda function, or something else, to speed up my code and the process?
Something along the lines of:
df['unix'] = df.apply(lambda row: something in here, axis=1)
What I think you'll find is that most of the time is spent creating and manipulating the datetime / timestamp objects in the dataframe (see here for more info). I also try to avoid using lambdas like this on large dataframes, since apply goes row by row, which should be avoided. What I've done when dealing with datetimes / timestamps / timezone changes in the past is to build a dictionary of the possible datetime combinations and then use map to apply them. Something like this:
import datetime as dt
import pandas as pd

# Make a time key column out of your date and timestamp fields
df['time_key'] = df['date'].astype(str) + '#' + df['timestamp'].astype(str)

# Build a dictionary from the unique time keys in the dataframe
time_dict = dict()
for time_key in df['time_key'].unique():
    time_split = time_key.split('#')
    # Create the Unix timestamp based on the values in the key and store it in the
    # dictionary so it can be mapped later (.value is nanoseconds since the epoch;
    # a single Timestamp has no .astype method)
    time_dict[time_key] = (pd.to_datetime(time_split[0])
                           + dt.timedelta(seconds=float(time_split[1]))).value / 10**9

# Now map the time_key to the unix column in the dataframe from the dictionary
df['unix'] = df['time_key'].map(time_dict)
Note if all the datetime combinations are unique in the dataframe, this likely won't help.
I'm not exactly sure what type dateTime[0] has. But you could try a more vectorized approach:
import pandas as pd

df["unix"] = (
    (pd.Timestamp(dateTime[0]) + pd.to_timedelta(df["timestamp"], unit="seconds"))
    .astype("int").div(10**9)
)
or
df["unix"] = (
    (dateTime[0] + pd.to_timedelta(df["timestamp"], unit="seconds"))
    .astype("int").div(10**9)
)
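To make that concrete, here is a self-contained toy version of the vectorized approach (my sketch; it assumes dateTime[0] is something pd.Timestamp can parse, and uses a hypothetical three-row frame):
import pandas as pd

df = pd.DataFrame({"timestamp": [10.0, 20.0, 30.0]})   # seconds, as in the question
base = pd.Timestamp("2022-08-08")                      # stand-in for dateTime[0]
df["unix"] = (
    (base + pd.to_timedelta(df["timestamp"], unit="seconds"))
    .astype("int64") // 10**9
)
print(df)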

How to find the time interval between two unix timestamps?

I want to check whether the time difference between two Unix timestamps is close to a given interval; if it is, print that timestamp and use it as the new reference for comparing further timestamps against the interval. The timestamps are in a NumPy array.
This is my attempt:
from math import isclose

def check_time_interval(now, update, interval):
    if isclose(update - now, interval):
        # do something
        print(update)
        return update
    else:
        return now

interval = 60.0
now = timestamps[0]
for timestamp in timestamps:
    now = check_time_interval(now, timestamp, interval)
This code doesn't print any timestamps, although the difference is close to the interval. What am I doing wrong? Is there a better, more efficient way to do this?
Edit:
sample input:
timestamps = [1632267861.212 + i for i in range(100)]
You're using isclose incorrectly. Try this:
from math import isclose

def check_time_interval(now, update, interval):
    if isclose(update, now, abs_tol=interval):
        # do something
        print(update)
        return update
    else:
        return now

interval = 60.0
timestamps = [1632267861.212 + i for i in range(100)]
now = timestamps[0]
for timestamp in timestamps:
    check_time_interval(now, timestamp, interval)
As for a more efficient way, you could vectorize and check all the intervals at once using numpy, something like this:
import numpy as np

timestamps = np.array(timestamps)
is_close = np.abs(timestamps - now) <= 60
print(timestamps[is_close])
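One caveat (my note, not the answerer's): the vectorized check compares every timestamp against a fixed now, while the loop advances now whenever a match is found, so the two are not strictly equivalent. If what you actually want is consecutive gaps close to the interval, a diff-based sketch would be:
import numpy as np

timestamps = np.array([1632267861.212 + i for i in range(100)])
gaps = np.diff(timestamps)
close_mask = np.isclose(gaps, 60.0)   # gaps that are (almost) exactly the interval
print(timestamps[1:][close_mask])     # the timestamps ending such gaps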

Modifying the date index of pandas dataframe

I am trying to write a highly efficient function that takes an average-size dataframe (~5000 rows) and returns a dataframe with a column of the latest year (and the same index) such that, for each date index of the original dataframe, the month containing that date falls between some pre-specified start date (st_d) and end date (end_d). I wrote code in which the year is decremented until the month for a particular date index is within the desired range, but it is really slow: for a dataframe with only 366 entries it takes ~0.2 s. I need to make it at least an order of magnitude faster so that I can repeatedly apply it to tens of thousands of dataframes. I would very much appreciate any suggestions.
import pandas as pd
import numpy as np
import time
from pandas.tseries.offsets import MonthEnd

def year_replace(st_d, end_d, x):
    tmp = time.perf_counter()
    def prior_year(d):
        # 100 is the number of years back, more than enough.
        for i_t in range(100):
            # The month should have been fully seen in one of the data years.
            t_start = pd.to_datetime(str(d.month) + '/' + str(end_d.year - i_t), format="%m/%Y")
            t_end = t_start + MonthEnd(1)
            if t_start <= end_d and t_start >= st_d and t_end <= end_d and t_end >= st_d:
                break
        if i_t < 99:
            return t_start.year
        else:
            # BadDataException is assumed to be defined elsewhere in the project
            raise BadDataException("Not enough data for Gradient Boosted tree.")
    output = pd.Series(index=x.index, data=x.index.map(lambda tt: prior_year(tt)), name='year')
    print("time for single dataframe replacement = ", time.perf_counter() - tmp)
    return output

i = pd.date_range('01-01-2019', '01-01-2020')
x = pd.DataFrame(index=i, data=np.full(len(i), 0))
st_d = pd.to_datetime('01/2016', format="%m/%Y")
end_d = pd.to_datetime('01/2018', format="%m/%Y")
year_replace(st_d, end_d, x)
My advice is: avoid loops whenever you can, and check whether an easier way is available.
If I understand correctly, what you aim to do is:
For given start and stop timestamps, find the latest (highest) timestamp t whose month is taken from the index and which satisfies start <= t <= stop.
I believe this can be formalized as follows (I kept your function signature for convenience):
def f(start, stop, x):
    assert start < stop
    tmp = time.perf_counter()
    def y(d):
        # Check current year:
        if start <= d.replace(day=1, year=stop.year) <= stop:
            return stop.year
        # Check previous year:
        if start <= d.replace(day=1, year=stop.year - 1) <= stop:
            return stop.year - 1
        # Otherwise fail:
        raise TypeError("Ooops")
    # Apply to index:
    df = pd.Series(index=x.index, data=x.index.map(lambda t: y(t)), name='year')
    print("Tick: ", time.perf_counter() - tmp)
    return df
It seems to execute faster as requested, by almost two orders of magnitude (we should benchmark to be sure, e.g. with timeit):
Tick: 0.004744200000004639
There is no need to iterate: you can just check the current and the previous year. If both checks fail, no timestamp fulfilling your requirements can exist (see also the month-level sketch after the variant below).
If the day must be kept, just remove day=1 in the replace method. If you require strict cut criteria, modify the inequalities accordingly. The following function:
def y(d):
    if start < d.replace(year=stop.year) < stop:
        return stop.year
    if start < d.replace(year=stop.year - 1) < stop:
        return stop.year - 1
    raise TypeError("Ooops")
returns the same dataframe as yours.
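Since y(d) only depends on the month, a further squeeze (my sketch, reusing the same current-or-previous-year logic; not part of the answer above) is to compute the year once per distinct month and map it, instead of calling a lambda per row:
import pandas as pd

def year_for_month(month, start, stop):
    for year in (stop.year, stop.year - 1):
        if start <= pd.Timestamp(year=year, month=month, day=1) <= stop:
            return year
    raise TypeError("Ooops")

month_to_year = {m: year_for_month(m, st_d, end_d) for m in range(1, 13)}
out = pd.Series(x.index.month, index=x.index).map(month_to_year).rename('year')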

Binning custom data based on a unix timestamp

I have a dict of data entries with a UNIX epoch timestamp as the key and some value (which could be a Boolean, an int, a float, or an enumerated string). I'm trying to set up a method that takes a start time, an end time, and a bin size (x minutes, x hours or x days), and puts each value in the dict into the array of the appropriate bin between these times.
Essentially, I'm trying to convert data from the real world measured at a certain time to data occurring on a time-step, starting at time=0 and going until time=T, where the length of the time step can be set when calling the method.
I'm trying to make something along the lines of:
import math

def binTimeSeries(data, startTime, endTime, timeStep):
    # floor the start time and ceil the end time to timeStep increments
    floorStartTime = math.floor(startTime / timeStep) * timeStep
    ceilEndTime = math.ceil(endTime / timeStep) * timeStep
    bins = [[] for _ in range(int((ceilEndTime - floorStartTime) / timeStep))]
    for key in data.keys():
        if floorStartTime <= key < ceilEndTime:
            timeDiff = key - floorStartTime
            binIndex = math.floor(timeDiff / timeStep)
            bins[binIndex].append(data[key])
    return bins
I'm having trouble working out which time format is suitable for the conversion from a UNIX epoch timestamp, one that can handle the floor, ceil and modulo operations given a variable timeStep interval, and how to actually perform those operations. I've searched for this, but I'm getting confused between the formalisms for datetime and pandas, and which of them might be more suitable for this.
Maybe something like this? Instead of asking for a bin size (the interval of each bin), I think it makes more sense to ask how many bins you'd like. That way you're guaranteed each bin will be the same size (cover the same interval).
In my example below, I generated some fake data, which I called data. The start and end timestamps I picked arbitrarily, as well as the number of bins. I calculate the difference between the end and start timestamps, which I'm calling the duration; this yields the total duration between the two timestamps (I realize it's a bit silly to recalculate this value, seeing as how I hardcoded it earlier in the end_time_stamp definition, but it's just there for completeness). The bin_interval (in seconds) can be calculated by dividing the duration by the number of bins.
I ended up doing everything just using plain old UNIX / POSIX timestamps, without any conversion. However, I will mention that datetime.datetime has a method called fromtimestamp, which accepts a POSIX timestamp and returns a datetime object populated with the year, month, seconds, etc.
In addition, in my example, all I end up adding to the bins are the keys - just for demonstration - you'll have to modify it to suit your needs.
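As a quick illustration of fromtimestamp (my addition; the result is in local time, so the exact output depends on your timezone):
import datetime as dt

print(dt.datetime.fromtimestamp(1573170955.187))
# the local-time equivalent of 2019-11-07 23:55:55.187 UTC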
def main():
    import time
    values = ["A", "B", "C", "D", "E", "F", "G"]
    data = {time.time() + (offset * 32): value for offset, value in enumerate(values)}
    start_time_stamp = time.time() + 60
    end_time_stamp = start_time_stamp + 75
    number_of_bins = 12
    assert end_time_stamp > start_time_stamp
    duration = end_time_stamp - start_time_stamp
    bin_interval = duration / number_of_bins
    bins = [[] for _ in range(number_of_bins)]
    for key, value in data.items():
        if not (start_time_stamp <= key <= end_time_stamp):
            continue
        for bin_index, current_bin in enumerate(bins):
            if start_time_stamp + (bin_index * bin_interval) <= key < start_time_stamp + ((bin_index + 1) * bin_interval):
                current_bin.append(key)
                break
    print("Original data:")
    for key, value in data.items():
        print(key, value)
    print(f"\nStart time stamp: {start_time_stamp}")
    print(f"End time stamp: {end_time_stamp}\n")
    print(f"Bin interval: {bin_interval}")
    print("Bins:")
    for current_bin in bins:
        print(current_bin)
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Original data:
1573170895.1871762 A
1573170927.1871762 B
1573170959.1871762 C
1573170991.1871762 D
1573171023.1871762 E
1573171055.1871762 F
1573171087.1871762 G
Start time stamp: 1573170955.1871762
End time stamp: 1573171030.1871762
Bin interval: 6.25
Bins:
[1573170959.1871762]
[]
[]
[]
[]
[1573170991.1871762]
[]
[]
[]
[]
[1573171023.1871762]
[]
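A small design note (mine, not the answerer's): the inner loop over the bins can be replaced with direct index arithmetic, turning the binning step from O(n * bins) into O(n). Assuming the same variables as in the answer above:
for key in data:
    if start_time_stamp <= key <= end_time_stamp:
        # min() keeps a key equal to end_time_stamp inside the last bin
        bin_index = min(int((key - start_time_stamp) // bin_interval), number_of_bins - 1)
        bins[bin_index].append(key)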

Convert timestamps of "yyyy-MM-dd'T'HH:mm:ss.SSSZ" format in Python

I have a log file with timestamps like "2012-05-12T13:04:35.347-07:00". I want to convert each timestamp into a number so that I can sort them in ascending order by time.
How can I do this in Python? In Java I found that I can parse this format with SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ"), but for Python I couldn't find anything.
As Python 2.x has issues with the %z directive, you have to do something like this:
from datetime import timedelta, datetime

strs = "2012-05-12T13:04:35.347-07:00"
# replace the last ':' with an empty string, as the Python UTC offset format is +HHMM
strs = strs[::-1].replace(':', '', 1)[::-1]
Since datetime.strptime doesn't support %z (UTC offset), at least not in Python 2.x, you need a workaround:
# Snippet taken from http://stackoverflow.com/a/526450/846892
try:
    offset = int(strs[-5:])
except ValueError:
    print "Error"
delta = timedelta(hours=offset / 100)
Now apply the format to '2012-05-12T13:04:35.347':
time = datetime.strptime(strs[:-5], "%Y-%m-%dT%H:%M:%S.%f")
time -= delta  # subtract the delta from this time object to convert to UTC
print time
# 2012-05-12 20:04:35.347000
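On Python 3.7+ none of this juggling is needed (my addition, not part of the original answer): datetime.fromisoformat parses this exact format, and strptime's %z accepts a colon in the offset as well.
from datetime import datetime

ts = datetime.fromisoformat("2012-05-12T13:04:35.347-07:00")
print(ts.timestamp())  # a sortable float: seconds since the Unix epoch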
