I'm wondering how I can make these three statements a single statement that I loop through:
minute_dt_array = np.arange(start, end, dt.timedelta(minutes=1)).astype(dt.datetime)
hour_dt_array = np.arange(start, end, dt.timedelta(hours=1)).astype(dt.datetime)
day_dt_array = np.arange(start, end, dt.timedelta(days=1)).astype(dt.datetime)
I'd like to create a list such as [minutes, days, hours] so I can iterate through a single statement instead of writing it three times. How do I do that?
For example, I'm looking to write a loop that does something like this:
timeunits = ['day','hour','minute']
for interval in timeunits:
    arrays['%s_array' % interval] = np.arange(start, end, dt.timedelta(**interval**=1)).astype(dt.datetime)
But I don't know what to put in the time delta function.
If you want to be able to call the array by name, what about a zip to fill a dict?
from datetime import datetime, timedelta
import numpy as np
start, end = datetime(2020,11,20), datetime(2020,11,22)
arrays = dict()
for k, i in zip(('days','hours','minutes'), (1, 1/24, 1/1440)):
    arrays[k] = np.arange(start, end, timedelta(i)).astype(datetime)
# one-liner:
# arrays = {k: np.arange(start, end, timedelta(i)).astype(datetime) for k, i in zip(('days','hours','minutes'), (1, 1/24, 1/1440))}
Or, if it is sufficient for arrays to be a list, simply iterate over the time intervals:
arrays = []
for i in (1, 1/24, 1/1440):
    arrays.append(np.arange(start, end, timedelta(i)).astype(datetime))
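Since the question asks specifically what to put inside timedelta, another option is keyword unpacking: build the keyword argument from the unit name. This is a minimal sketch, assuming the unit names are spelled the way timedelta expects them (plural: 'days', 'hours', 'minutes') and reusing the start and end values from the answer above:

```python
from datetime import datetime, timedelta
import numpy as np

start, end = datetime(2020, 11, 20), datetime(2020, 11, 22)

arrays = {}
for unit in ('days', 'hours', 'minutes'):
    # timedelta(**{'hours': 1}) is the same call as timedelta(hours=1)
    step = timedelta(**{unit: 1})
    arrays[unit] = np.arange(start, end, step).astype(datetime)

# Show how many points each resolution produced.
print({unit: len(values) for unit, values in arrays.items()})
```

With this form the step sizes are exact timedelta objects rather than fractions of a day, so no float arithmetic is involved in choosing the interval.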
There is no need to do that, because your intervals are multiples of one another:
1d = 24h = 1440 minutes
So every 60 minutes you have 1 hour, and every 24 hours you have 1 day.
If you combine the intervals as you suggest, you get duplicate data points at the hourly resolution, and even triplicates at the daily resolution. It should be enough to use the first (finest) granularity and compute the number of whole days and hours elapsed at each minute:
minute_dt_array = np.arange(start, end, dt.timedelta(minutes=1)).astype(dt.datetime)
the_list = []
for n, dtt in enumerate(minute_dt_array):
    the_list.append([n // 1440, n // 60, n]) # days, hours, minutes
the_list is what you need.
Related
Given a value of type np.datetime64, how can I get a period of time around it?
For example, if np.datetime64('2020-04-01T21:32') is given to a function, I want the function to return np.datetime64('2020-04-01T21:30') and np.datetime64('2020-04-01T21:39') - 10 minutes around the given time.
Is there any way to do this with numpy?
NumPy does not have a built-in time period type like pandas does.
If all you want are two timestamps, the following function should work.
def ten_minutes_around(t):
    t = ((t.astype('<M8[m]')  # Truncate time to whole-minute resolution
          .astype('int')      # Convert to an integer count of minutes
          // 10) * 10         # Round down to a multiple of 10 minutes
         ).astype('<M8[m]')   # Convert back to a minute timestamp
    return np.array([t, t + np.timedelta64(10, 'm')]).T
For example:
for t in [np.datetime64('2020-04-01T21:32'), np.datetime64('2652-02-03T13:56:03.172')]:
    s, e = ten_minutes_around(t)
    print(s, t, e)
gives:
2020-04-01T21:30 2020-04-01T21:32 2020-04-01T21:40
2652-02-03T13:50 2652-02-03T13:56:03.172 2652-02-03T14:00
and
ten_minutes_around(np.array([
    np.datetime64('2020-04-01T21:32'),
    np.datetime64('2652-02-03T13:56:03.172'),
    np.datetime64('1970-04-01'),
]))
gives
array([['2020-04-01T21:30', '2020-04-01T21:40'],
       ['2652-02-03T13:50', '2652-02-03T14:00'],
       ['1970-04-01T00:00', '1970-04-01T00:10']], dtype='datetime64[m]')
To do so, we can take the minute of the given time modulo 10 and subtract that from the given time to get the start of the period, then add 9 minutes to get the end of the period.
import numpy as np
time = '2020-04-01T21:32'
dt = np.datetime64(time)
base = (dt.tolist().time().minute) % 10 # base would be 2 in this case
start = dt - np.timedelta64(base,'m')
end = start + np.timedelta64(9,'m')
print(start,end,sep='\n')
I hope this helps.
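Both answers above hard-code a 10-minute window. The following is a rough sketch of the same integer-flooring idea with the window width as a parameter. The function name window_around and its minutes argument are made up for illustration. Windows are aligned to the Unix epoch, so they only line up with the start of the hour when the width divides 60 evenly.

```python
import numpy as np

def window_around(t, minutes=10):
    """Return (first, last) minute of the `minutes`-wide window containing t.

    Hypothetical helper: accepts a single np.datetime64 or an array of them.
    """
    t_min = np.asarray(t).astype('datetime64[m]')   # truncate to whole minutes
    ticks = t_min.astype('int64')                    # minutes since 1970-01-01T00:00
    # Floor to a multiple of the window width, then convert back to a timestamp.
    first = (ticks // minutes * minutes).astype('datetime64[m]')
    last = first + np.timedelta64(minutes - 1, 'm')
    return first, last

first, last = window_around(np.datetime64('2020-04-01T21:32'))
print(first, last)
```

Here last is the final minute inside the window (21:39 for the example in the question); add one more minute if you want the exclusive end instead.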
I have a dict of data entries with a UNIX epoch timestamp as the key, and some value (this could be Boolean, int, float, enumerated string). I'm trying to set up a method that takes a start time, end time, and bin size (x minutes, x hours or x days), puts the values in the dict into the array of one of the bins between these times.
Essentially, I'm trying to convert data from the real world measured at a certain time to data occurring on a time-step, starting at time=0 and going until time=T, where the length of the time step can be set when calling the method.
I'm trying to make something along the lines of:
def binTimeSeries(dict, startTime, endTime, timeStep):
    bins = []
    # floor begin time to a timeStep increment
    # ceil end time to a timeStep increment
    for key in dict.keys():
        if key > floorStartTime and key < ceilEndTime:
            timeDiff = (key - floorStartTime)
            binIndex = floor(timeDiff/timeStep)
            bins[binIndex].append(dict[key])
I'm having trouble working out which time format is suitable to convert the UNIX epoch timestamps to, one that can handle the floor, ceil and modulo operations for a variable timeStep interval, and then how to actually perform those operations. I've searched for this, but am getting confused by the formalisms of datetime and pandas, and which might be more suitable here.
Maybe something like this? Instead of asking for a bin size (the interval of each bin), I think it makes more sense to ask how many bins you'd like instead. That way you're guaranteed each bin will be the same size (cover the same interval).
In my example below, I generated some fake data, which I called data. The start- and end-timestamps I picked arbitrarily, as well as the number of bins. I calculate the difference between the end- and start-timestamps, which I'm calling the duration - this yields the total duration between the two timestamps (I realize it's a bit silly to recalculate this value, seeing as how I hardcoded it earlier in the end_time_stamp definition, but it's just there for completeness). The bin_interval (in seconds) can be calculated by dividing the duration by the number of bins.
I ended up doing everything just using plain old UNIX / POSIX timestamps, without any conversion. However, I will mention that datetime.datetime has a method called fromtimestamp, which accepts a POSIX timestamp and returns a datetime object populated with the year, month, seconds, etc.
In addition, in my example, all I end up adding to the bins are the keys - just for demonstration - you'll have to modify it to suit your needs.
def main():
    import time
    values = ["A", "B", "C", "D", "E", "F", "G"]
    data = {time.time() + (offset * 32): value for offset, value in enumerate(values)}
    start_time_stamp = time.time() + 60
    end_time_stamp = start_time_stamp + 75
    number_of_bins = 12
    assert end_time_stamp > start_time_stamp
    duration = end_time_stamp - start_time_stamp
    bin_interval = duration / number_of_bins
    bins = [[] for _ in range(number_of_bins)]
    for key, value in data.items():
        if not (start_time_stamp <= key <= end_time_stamp):
            continue
        for bin_index, current_bin in enumerate(bins):
            if start_time_stamp + (bin_index * bin_interval) <= key < start_time_stamp + ((bin_index + 1) * bin_interval):
                current_bin.append(key)
                break
    print("Original data:")
    for key, value in data.items():
        print(key, value)
    print(f"\nStart time stamp: {start_time_stamp}")
    print(f"End time stamp: {end_time_stamp}\n")
    print(f"Bin interval: {bin_interval}")
    print("Bins:")
    for current_bin in bins:
        print(current_bin)
    return 0
if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Original data:
1573170895.1871762 A
1573170927.1871762 B
1573170959.1871762 C
1573170991.1871762 D
1573171023.1871762 E
1573171055.1871762 F
1573171087.1871762 G
Start time stamp: 1573170955.1871762
End time stamp: 1573171030.1871762
Bin interval: 6.25
Bins:
[1573170959.1871762]
[]
[]
[]
[]
[1573170991.1871762]
[]
[]
[]
[]
[1573171023.1871762]
[]
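If a fixed bin width is preferred, as in the question's timeStep, the bin index can be computed directly with floor division on the raw timestamps instead of scanning every bin. This is a minimal sketch under the assumption that keys and time_step are both in seconds (so x minutes is 60 * x, x hours is 3600 * x, and so on); bin_time_series and its parameter names are illustrative, not taken from the question.

```python
import math

def bin_time_series(data, start_time, end_time, time_step):
    """Group the values of `data` (keyed by UNIX timestamp in seconds)
    into consecutive bins that are `time_step` seconds wide."""
    # Snap the requested range outward to whole multiples of time_step.
    floor_start = math.floor(start_time / time_step) * time_step
    ceil_end = math.ceil(end_time / time_step) * time_step
    number_of_bins = int(round((ceil_end - floor_start) / time_step))
    bins = [[] for _ in range(number_of_bins)]
    for key, value in data.items():
        if floor_start <= key < ceil_end:
            # Floor division maps the offset from the first bin to a bin index.
            bins[int((key - floor_start) // time_step)].append(value)
    return floor_start, bins

data = {1000: "A", 1032: "B", 1064: "C", 1096: "D", 1128: "E"}
first_bin_start, bins = bin_time_series(data, start_time=1010, end_time=1100, time_step=30)
print(first_bin_start, bins)
```

Bin i then covers the half-open interval from first_bin_start + i * time_step up to, but not including, first_bin_start + (i + 1) * time_step.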
I have two lists.
The list times is a list of datetimes from 2018-04-10 00:00 to
2018-04-10 23:59.
For each item in times I have a corresponding label of 0 or 1 recorded in the list labels.
My goal is to get the mean label value (between 0 and 1) for every minute interval.
times = [Timestamp('2018-04-10 00:00:00.118000'),
Timestamp('2018-04-10 00:00:00.547000'),
Timestamp('2018-04-10 00:00:00.569000'),
Timestamp('2018-04-10 00:00:00.690000'),
.
.
.
Timestamp('2018-04-10 23:59:59.999000') ]
labels = [0,1,1,0,1,0,....1]
where len(times) == len(labels)
For every minute interval between 2018-04-10 00:00 and 2018-04-10 23:59, the min and max times in the list respectively, I am trying to get two lists:
1) The start time of the minute interval.
2) The mean average label value of all the datetimes in that interval.
In particular I am having trouble with (2).
Note: the times list is not necessarily chronologically ordered
First, here is how I generated data in the format above:
import numpy as np
import pandas as pd
from datetime import datetime
size = int(1e6)
timestamp_a_day = np.linspace(datetime.now().timestamp(), datetime.now().timestamp()+24*60*60, size)
dummy_sec = np.random.rand(size)
timestamp_series = pd.Series(timestamp_a_day + dummy_sec)\
    .sort_values().reset_index(drop=True)\
    .apply(lambda x: datetime.fromtimestamp(x))
data = pd.DataFrame(timestamp_series, columns=['timestamp'])
data['label'] = np.random.randint(0, 2, size)
Let's solve this problem !!!
(I hope I understand your question precisely hahaha)
1) data['start_interval'] = data['timestamp'].dt.floor('min')  # floor each timestamp to the start of its minute
2) data.groupby('start_interval')['label'].mean()
zip times and labels then sort;
Write a function that returns the date, hour, minute of a Timestamp;
groupby that function;
sum and average the labels for each group
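The four steps above could be sketched with the standard library alone, roughly as follows. The name minute_means is made up for illustration, and the sketch assumes the items in times behave like datetime objects (pandas Timestamp is a datetime subclass, so it should fit).

```python
from datetime import datetime
from itertools import groupby

def minute_means(times, labels):
    """Return (interval starts, mean label per interval) for one-minute intervals."""
    def minute_of(pair):
        # Truncate the timestamp to the start of its minute.
        return pair[0].replace(second=0, microsecond=0)

    # groupby only merges adjacent items, so sort chronologically first.
    pairs = sorted(zip(times, labels), key=lambda pair: pair[0])
    starts, means = [], []
    for start, group in groupby(pairs, key=minute_of):
        values = [label for _, label in group]
        starts.append(start)
        means.append(sum(values) / len(values))
    return starts, means

times = [
    datetime(2018, 4, 10, 0, 1, 5),
    datetime(2018, 4, 10, 0, 0, 0, 118000),
    datetime(2018, 4, 10, 0, 0, 30),
    datetime(2018, 4, 10, 0, 1, 59),
]
labels = [1, 0, 1, 1]
starts, means = minute_means(times, labels)
print(list(zip(starts, means)))
```

Minutes that contain no timestamps simply do not appear in the result; fill them in separately if every minute of the day is needed.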
For my football data analysis, to use the pandas between_time function, I need to convert a list of strings representing fractional seconds from measurement onset into the pandas date_time index. The time data looks as follows:
In order to achieve this I tried the following:
df['Time'] = df['Timestamp']*(1/freq)
df.index = pd.to_datetime(df['Time'], unit='s')
In which freq=600 and Timestamp is the frame number counting up from 0.
I was expecting the new index to show the following format:
%y%m%d-%h%m%s%f
But unfortunately, the to_datetime doesn't know how to handle my type of time data (namely counting up till 4750s after the start).
My question is, therefore, how do I convert my time sample data into a date_time index.
Based on this topic I now created the following function:
def timeDelta2DateTime(self, time_delta_list):
    '''This method converts a list containing the time since measurement onset [seconds] into a
    list containing dateTime objects counting up from 00:00:00.
    Args:
        time_delta_list (list): List containing the times since the measurement has started.
    Returns:
        list: A list with the time in the DateTime format.
    '''
    ### Use divmod to convert seconds to h, m, s and fractional seconds ###
    s, fs = list(zip(*[divmod(item, 1) for item in time_delta_list]))
    m, s = list(zip(*[divmod(item, 60) for item in s]))
    h, m = list(zip(*[divmod(item, 60) for item in m]))
    ### Create DateTime list ###
    us = [item*1000000 for item in fs] # Convert fractional seconds to microseconds (datetime's 7th argument is microseconds, not milliseconds)
    time_list_int = list(zip(*[list(map(int,h)), list(map(int,m)), list(map(int,s)), list(map(int,us))])) # Combine h,m,s,us in one list
    ### Return dateTime object list ###
    return [datetime(2018,1,1,item[0],item[1],item[2],item[3]) for item in time_list_int]
As it seems to be very slow, feel free to suggest a better option.
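One simpler possibility, sketched here rather than benchmarked, is to let timedelta do the hour/minute/second arithmetic by adding each offset to a fixed origin. The function name seconds_to_datetimes and the origin default are assumptions for illustration, and the inputs are assumed to be numbers already (strings would need float() first).

```python
from datetime import datetime, timedelta

def seconds_to_datetimes(seconds_since_onset, origin=datetime(2018, 1, 1)):
    """Convert offsets in (fractional) seconds into datetimes counted from `origin`."""
    # timedelta accepts float seconds and handles the carry into minutes and hours.
    return [origin + timedelta(seconds=s) for s in seconds_since_onset]

stamps = seconds_to_datetimes([0, 1.5, 4750.25])
print(stamps[-1])
```

When the data already lives in a DataFrame, something like pd.Timestamp('2018-01-01') + pd.to_timedelta(df['Time'], unit='s') should give a vectorized equivalent, which is likely to be faster than a Python-level loop.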
Here is what I am using to group items by time frame when parsing a CSV file. It works fine, but this particular code only works for slices by the hour; I would now like to create 4-hour slices (or 30-minute slices).
tf = "%d-%b-%Y-%H"
lmb = lambda d: datetime.datetime.strptime(d["Date[G]"]+"-"+d["Time[G]"], "%d-%b-%Y-%H:%M:%S.%f").strftime(tf)
for k, g in itertools.groupby(csvReader, key = lmb):
    for i in g:
        "do something"
Thanks!
The general best approach is to have the groupby key return a tuple which groups items into the appropriate bucket.
For example, for 4h slices:
def by_4h(d):
    dt = datetime.datetime.strptime(d["Date[G]"]+"-"+d["Time[G]"], "%d-%b-%Y-%H:%M:%S.%f")
    return (dt.year, dt.month, dt.day, dt.hour // 4)
You now know that if two times are in the same 4 hour slice (starting from midnight) then hour // 4 will give the same result for those times, so you end the tuple there.
Or for 30m slices:
def by_30m(d):
    dt = datetime.datetime.strptime(d["Date[G]"]+"-"+d["Time[G]"], "%d-%b-%Y-%H:%M:%S.%f")
    return (dt.year, dt.month, dt.day, dt.hour, dt.minute // 30)
This is using // integer division for Python 3 compatibility, but it also works in Python 2.x and makes it clear that you want integer division.
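The two key functions differ only in the slice width, so they could be folded into one factory that counts minutes since midnight. This is a sketch, not part of the original answer: by_minutes and the sample rows are invented for illustration, and the column names and date format are the ones from the question.

```python
import datetime
import itertools

FORMAT = "%d-%b-%Y-%H:%M:%S.%f"

def by_minutes(width):
    """Build a groupby key for slices that are `width` minutes wide, starting at midnight."""
    def key(d):
        dt = datetime.datetime.strptime(d["Date[G]"] + "-" + d["Time[G]"], FORMAT)
        minutes_since_midnight = dt.hour * 60 + dt.minute
        return (dt.year, dt.month, dt.day, minutes_since_midnight // width)
    return key

# Made-up rows standing in for what csv.DictReader would yield.
rows = [
    {"Date[G]": "10-Apr-2018", "Time[G]": "00:05:00.000"},
    {"Date[G]": "10-Apr-2018", "Time[G]": "03:59:59.999"},
    {"Date[G]": "10-Apr-2018", "Time[G]": "04:00:00.000"},
    {"Date[G]": "10-Apr-2018", "Time[G]": "04:29:00.000"},
]
sizes_4h = [len(list(g)) for _, g in itertools.groupby(rows, key=by_minutes(240))]
sizes_30m = [len(list(g)) for _, g in itertools.groupby(rows, key=by_minutes(30))]
print(sizes_4h, sizes_30m)
```

As with the original code, itertools.groupby only merges adjacent rows, so this relies on the CSV being in chronological order.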