So I have a deltatime array dt = [(20,6), (20,7), (20,9), (20,10), (20,11), (20,13)] and the issue I have is that I can't allow any data to be more than one second apart from the next value in the list. I wrote out a little if statement that goes
for k in range(len(dt) - 1):
    if dt[k+1].seconds - dt[k].seconds > 1:
        gj.append(dt[k])
        gj.append(dt[k+1])
and I end up with (20,7)(20,9)(20,11)(20,13), so I know which times are greater than one second apart, but I can't figure out how to delete the values from a deltatime array. I tried numpy.delete but that didn't work because it's in an unusable format. The end goal is having a new array [(20,6), (20,10)] with only data that is one second apart.
Why not check for a difference of at most 1 second and append those to a list?
Code
from datetime import time
dt = [(20,6), (20,7), (20,9), (20,10), (20,11), (20,13)]
dt = [time(0, m, s) for m, s in dt]
left = []
for i in range(len(dt) - 1):
    if dt[i + 1].second - dt[i].second <= 1:
        left.append(dt[i])
print(left)
Result
>>> [datetime.time(0, 20, 6), datetime.time(0, 20, 9), datetime.time(0, 20, 10)]
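One caveat (my addition, not part of the original answer): comparing only the .second fields breaks across minute boundaries, since e.g. 20:59 and 21:00 are one second apart but 0 - 59 = -59. A safer sketch converts each time to a total number of seconds first:
from datetime import time

dt = [time(0, 20, 59), time(0, 21, 0), time(0, 21, 3)]
secs = [t.hour * 3600 + t.minute * 60 + t.second for t in dt]
left = [dt[i] for i in range(len(dt) - 1) if secs[i + 1] - secs[i] <= 1]
print(left)  # [datetime.time(0, 20, 59)]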
Related
I have a series of lists (np.arrays, actually), of which the elements are dates.
id
0a0fe3ed-d788-4427-8820-8b7b696a6033 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8 [2019-01-29, 2019-01-30, 2019-01-31, 2019-02-0...
Name: startDate, dtype: object
For each element in the series (i.e. for each list of dates), I want to retain the longest sublist in which all dates are consecutive. I'm struggling to approach this in a pythonic (simple/efficient) way. The only approach that I can think of is to use multiple loops: loop over the series values (the lists), and loop over each element in the list. I would then store the first date and the number of consecutive days, and use temporary values to overwrite the results if a longer sequence of consecutive days is encountered. This seems highly inefficient though. Is there a better way of doing this?
Since you mention you are using numpy arrays of dates it makes sense to stick to numpy types instead of converting to the built-in type. I'm assuming here that your arrays have dtype 'datetime64[D]'. In that case you could do something like
import numpy as np
date_list = np.array(['2005-02-01', '2005-02-02', '2005-02-03',
                      '2005-02-05', '2005-02-06', '2005-02-07', '2005-02-08', '2005-02-09',
                      '2005-02-11', '2005-02-12',
                      '2005-02-14', '2005-02-15', '2005-02-16', '2005-02-17',
                      '2005-02-19', '2005-02-20',
                      '2005-02-22', '2005-02-23', '2005-02-24',
                      '2005-02-25', '2005-02-26', '2005-02-27', '2005-02-28'],
                     dtype='datetime64[D]')
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
    if date - date_list[i0] != np.timedelta64(i1 - i0, 'D'):
        if i1 - i0 > i1max - i0max:
            i0max, i1max = i0, i1
        i0 = i1
# the loop only records a run when the date that breaks it is reached,
# so the final run needs one more comparison after the loop
if len(date_list) - i0 > i1max - i0max:
    i0max, i1max = i0, len(date_list)
print(date_list[i0max:i1max])
# output: ['2005-02-22' '2005-02-23' '2005-02-24' '2005-02-25' '2005-02-26'
#          '2005-02-27' '2005-02-28']
Here, i0 and i1 indicate the start and stop indices of the current sub-array of consecutive dates, and i0max and i1max the start and stop indices of the longest sub-array found so far. The solution uses the fact that the difference between the i-th and zeroth entry in a list of consecutive dates is exactly i days. Note that the loop only records a run once the date that breaks it is reached, which is why the final run gets one more check after the loop.
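If you prefer to avoid the explicit loop, the same idea can be vectorised (a sketch of mine, assuming the same dtype='datetime64[D]' array as above): split the array at every gap that isn't exactly one day, and keep the longest piece.
gaps = np.flatnonzero(np.diff(date_list) != np.timedelta64(1, 'D')) + 1
runs = np.split(date_list, gaps)
print(max(runs, key=len))
# ['2005-02-22' '2005-02-23' '2005-02-24' '2005-02-25' '2005-02-26'
#  '2005-02-27' '2005-02-28']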
You can convert the list to ordinals, which increase by exactly one for consecutive dates, i.e. next_date = previous_date + 1 (see the documentation for datetime.toordinal).
Then find the longest consecutive sub-array.
This process takes O(n) time (a single loop), which is the most efficient way to get this.
CODE
from datetime import datetime
def get_consecutive(date_list):
    # convert to ordinals
    v = [datetime.strptime(d, "%Y-%m-%d").toordinal() for d in date_list]
    consecutive = []
    run = []
    dates = []
    # get consecutive ordinal sequence
    for i in range(1, len(v) + 1):
        run.append(v[i-1])
        dates.append(date_list[i-1])
        if i == len(v) or v[i-1] + 1 != v[i]:
            if len(consecutive) < len(run):
                consecutive = dates
            dates = []
            run = []
    return consecutive
EXAMPLE:
date_list = ['2019-01-29', '2019-01-30', '2019-01-31', '2019-02-05']
get_consecutive(date_list)
# ordinals will be -> v = [737088, 737089, 737090, 737095]
OUTPUT:
['2019-01-29', '2019-01-30', '2019-01-31']
Now use df.column.apply(get_consecutive); it will give you the longest consecutive date list for each row. Or you can call the function on each list if you are using some other data structure.
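For example, with a Series like the one in the question (a minimal sketch using get_consecutive from above; the ids are shortened):
import pandas as pd

s = pd.Series({'0a0fe3ed': ['2019-01-30', '2019-01-31', '2019-02-01', '2019-02-04'],
               '0a48d1e8': ['2019-01-29', '2019-01-30', '2019-01-31', '2019-02-05']})
print(s.apply(get_consecutive))
# 0a0fe3ed    [2019-01-30, 2019-01-31, 2019-02-01]
# 0a48d1e8    [2019-01-29, 2019-01-30, 2019-01-31]
# dtype: object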
I'm going to reduce this problem to finding consecutive days in a single list. There are a few tricks that make it more Pythonic, as you asked. The following script should run as-is; I've documented how it works inline:
from datetime import timedelta, date
# example input
days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]
# store the longest interval and the current consecutive interval
# as we iterate through a list
longest_interval_index = current_interval_index = 0
longest_interval_length = current_interval_length = 1
# using zip here to reduce the number of indexing operations
# this will turn the days list into [(2020-01-01, 2020-01-02), (2020-01-02, 2020-01-04), ...]
# use enumerate to get the index of the current day
for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
    if current_day - previous_day == timedelta(days=+1):
        # we've found a consecutive day! increase the interval length
        current_interval_length += 1
    else:
        # nope, not a consecutive day! start from this day and start
        # counting from 1
        current_interval_index = i
        current_interval_length = 1
    if current_interval_length > longest_interval_length:
        # we broke the record! record it as the longest interval
        longest_interval_index = current_interval_index
        longest_interval_length = current_interval_length
print("Longest interval index:", longest_interval_index)
print("Longest interval: ", days[longest_interval_index:longest_interval_index + longest_interval_length])
It should be easy enough to turn this into a reusable function.
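For instance, a minimal way to package it (my sketch; the function name is my own):
from datetime import date, timedelta

def longest_consecutive_run(days):
    # returns the longest slice of `days` in which each day follows the previous
    best_index = current_index = 0
    best_length = current_length = 1
    for i in range(1, len(days)):
        if days[i] - days[i - 1] == timedelta(days=1):
            current_length += 1
        else:
            current_index, current_length = i, 1
        if current_length > best_length:
            best_index, best_length = current_index, current_length
    return days[best_index:best_index + best_length]

print(longest_consecutive_run([date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
                               date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8)]))
# [datetime.date(2020, 1, 4), datetime.date(2020, 1, 5), datetime.date(2020, 1, 6)]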
I need some help.
I have a dictionary which represents data measured every 15 minutes over 8 days, but some measurements are missing. Keys are datetime objects:
datetime(year, month, date, hour, minute)
and the values are the measured parameter. My aim is to obtain a new dictionary whose keys represent only the time of day, e.g.:
time(hour, minute)
and whose values are lists of the measurements taken at that same time each day. So, instead of the first dictionary of length L, I want to obtain a new dictionary of length L/8, where every value is a list of 8 numbers (sometimes fewer, if a value is missing on one or two days). It would be a very simple problem if there were no missing values, but with missing values my program returns some strange result. If somebody can provide an idea, it would be great! Here is my code (tec is my initial dictionary):
time_of_day = []
date = datetime(2020, 1, 13, 0, 0)
while date < datetime(2020, 1, 14, 0, 0):
    time_of_day.append(date)
    date = date + td(minutes=15)
day_tec = dict.fromkeys(time_of_day, [])
for i in day_tec.keys():
    j = 0
    while j < 8:
        try:
            day_tec[i].append(tec[i + j * timedelta(days=1)])
        except Exception as e:
            print(e)
            pass
        j = j + 1
print(day_tec)
print(day_tec) returns a dictionary with datetime objects as keys, from datetime(2020, 1, 13, 0, 0) to datetime(2020, 1, 14, 0, 0) every 15 minutes, but its values are lists with the same length as the initial dictionary.
You can use a defaultdict to get what you want. The code below will give you a dictionary where each key is a tuple of hour and minute, and each value is a list of the values recorded at that time of day in the tec dict.
from collections import defaultdict
day_tec = defaultdict(list)
for dt, value in tec.items():
    tm = (dt.hour, dt.minute)
    day_tec[tm].append(value)
print(day_tec)
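As an aside, the strange result in the original code almost certainly comes from dict.fromkeys(time_of_day, []): it reuses one single list as the value for every key, so every append shows up under all keys at once. A quick demonstration:
day_tec = dict.fromkeys([1, 2], [])
day_tec[1].append('x')
print(day_tec)  # {1: ['x'], 2: ['x']} -- both keys share the same list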
I have a Numpy array A of shape nX2, representing n different events. The first column holds the starting times of the events, and the second holds the respective durations of each event.
For some time duration [0, T] and N different equidistant time points, I would like a count of how many events are ongoing at each time point. (i.e. an integer array of length N, each entry has the number of events that started before that time and lasted till after)
What is the most efficient way to achieve this in Python?
*I know what I'm asking for isn't really a histogram. If someone has a better term feel free to edit the title
You can try something like this. The idea is: for each bin, determine which events have started before the end of the bin but end after the start of the bin.
import numpy as np

A = np.array([[1, 5, 6, 10], [5, 4, 1, 1]]).T
start = A[:, 0]
end = A.sum(axis=1)
lower = 0
upper = 100
N = 10
bins = np.linspace(lower, upper, num=N+1)
[((end > bins[n]) & (start < bins[n+1])).sum() for n in range(N)]
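The per-bin list comprehension can also be collapsed into one broadcast comparison (a sketch of mine with the same logic; start, end and bins are as defined above):
import numpy as np

A = np.array([[1, 5, 6, 10], [5, 4, 1, 1]]).T
start, end = A[:, 0], A.sum(axis=1)
bins = np.linspace(0, 100, num=11)
counts = ((end[None, :] > bins[:-1, None]) & (start[None, :] < bins[1:, None])).sum(axis=1)
print(counts)  # [4 1 0 0 0 0 0 0 0 0]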
Given a datetime array of the shape (n, 2):
x = np.array([['2017-10-02T00:00:00.000000000', '2017-10-12T00:00:00.000000000']], dtype='datetime64[ns]')
x has shape (1, 2), but in reality it could be (n, 2), n >= 1. In each pair, the first date is always smaller than (or equal to) the second. I want to get a list of all date ranges between each pair of dates in x. This is what I'm doing basically:
np.concatenate([pd.date_range(*y, closed='right') for y in x])
And it works, giving
array(['2017-10-03T00:00:00.000000000', '2017-10-04T00:00:00.000000000',
'2017-10-05T00:00:00.000000000', '2017-10-06T00:00:00.000000000',
'2017-10-07T00:00:00.000000000', '2017-10-08T00:00:00.000000000',
'2017-10-09T00:00:00.000000000', '2017-10-10T00:00:00.000000000',
'2017-10-11T00:00:00.000000000', '2017-10-12T00:00:00.000000000'], dtype='datetime64[ns]')
But this is pretty slow because of the list comp - it isn't exactly vectorised as I'd like. I'm wondering if there's a better way to obtain date ranges for multiple pairs of dates?
I'll provide as much clarification as needed. Thanks.
It's a tad convoluted...
But
d = np.array(1, dtype='timedelta64[D]')
x = x.astype('datetime64[D]')
deltas = np.diff(x, axis=1) / d
np.concatenate([
    i + np.arange(j + 1) for i, j in zip(x[:, 0], deltas[:, 0].astype(int))
]).astype('datetime64[ns]')
array(['2017-10-02T00:00:00.000000000', '2017-10-03T00:00:00.000000000',
'2017-10-04T00:00:00.000000000', '2017-10-05T00:00:00.000000000',
'2017-10-06T00:00:00.000000000', '2017-10-07T00:00:00.000000000',
'2017-10-08T00:00:00.000000000', '2017-10-09T00:00:00.000000000',
'2017-10-10T00:00:00.000000000', '2017-10-11T00:00:00.000000000',
'2017-10-12T00:00:00.000000000'], dtype='datetime64[ns]')
How it works
d represents one day
x is turned into dates with no timestamps
diff gets me the number of days difference... but in timedelta space
I divide by my d, which is also in timedelta space, and the dimensions disappear... leaving me with floats, which I cast to int
When I add the first column of the pairs, x[:, 0], to an array of integers, broadcasting adds one unit of x's dtype per integer, and that dtype is datetime64[D]. So I'm adding whole days.
Derived from / Inspired by #hpaulj
Will remove if they post an answer
d = np.array(1, dtype='timedelta64[D]')
np.concatenate([np.arange(row[0], row[1] + 1, d) for row in x])
array(['2017-10-02T00:00:00.000000000', '2017-10-03T00:00:00.000000000',
'2017-10-04T00:00:00.000000000', '2017-10-05T00:00:00.000000000',
'2017-10-06T00:00:00.000000000', '2017-10-07T00:00:00.000000000',
'2017-10-08T00:00:00.000000000', '2017-10-09T00:00:00.000000000',
'2017-10-10T00:00:00.000000000', '2017-10-11T00:00:00.000000000',
'2017-10-12T00:00:00.000000000'], dtype='datetime64[ns]')
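For reference (my note): what makes this variant work is that np.arange accepts datetime64 endpoints with a timedelta64 step, and row[1] + 1 broadcasts the integer 1 as one unit of the array's dtype, i.e. one day for datetime64[D]:
import numpy as np

d = np.array(1, dtype='timedelta64[D]')
print(np.arange(np.datetime64('2017-10-02'), np.datetime64('2017-10-05') + 1, d))
# ['2017-10-02' '2017-10-03' '2017-10-04' '2017-10-05']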
Is there any efficient way in Python to count how many values in an array of numbers fall between certain intervals? The number of intervals I will be using may get quite large. Like:
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
some_function(mylist, startpoints):
# startpoints = [0,10,20]
count values in range [0,9]
count values in range [10,19]
output = [10,9]
You will have to iterate the list at least once.
The solution below works with any sequence/interval that implements comparison (<, >, etc.) and uses the bisect algorithm to find the correct point in the interval, so it is very fast.
It will work with floats, text, or whatever. Just pass a sequence and a list of the intervals.
from collections import defaultdict
from bisect import bisect_left
def count_intervals(sequence, intervals):
    count = defaultdict(int)
    intervals.sort()
    for item in sequence:
        pos = bisect_left(intervals, item)
        if pos == len(intervals):
            count[None] += 1
        else:
            count[intervals[pos]] += 1
    return count
data = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
print(count_intervals(data, [10, 20]))
Will print
defaultdict(<class 'int'>, {10: 10, 20: 9})
Meaning that you have 10 values <10 and 9 values <20.
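One boundary detail worth knowing (my note): with bisect_left, an item exactly equal to a boundary is counted under that boundary's key, so the buckets are effectively "item <= 10", "10 < item <= 20". If you want strict "<" semantics, use bisect_right instead:
from bisect import bisect_left, bisect_right

print(bisect_left([10, 20], 10))   # 0 -> counted under key 10
print(bisect_right([10, 20], 10))  # 1 -> counted under key 20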
I don't know how large your list will get but here's another approach.
import numpy as np
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
np.histogram(mylist, bins=[0,9,19])
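This returns the counts together with the bin edges. Note that NumPy's bins are half-open except for the last one, i.e. [0, 9) and [9, 19] here; for this particular data (which contains no exact 9) the counts match the intervals in the question:
counts, edges = np.histogram(mylist, bins=[0, 9, 19])
print(counts)  # [10  9]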
You can also use a combination of value_counts() and pd.cut() to help you get the job done.
import pandas as pd
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
split_mylist = pd.cut(mylist, [0, 10, 20]).value_counts(sort=False)
print(split_mylist)
This piece of code will return this:
(0, 10] 10
(10, 20] 9
dtype: int64
Then you can utilise the tolist() function to get what you want
split_mylist = split_mylist.tolist()
print(split_mylist)
Output: [10, 9]
If the numbers are integers, as in your example, representing the intervals as frozensets can perhaps be fastest (worth trying). I'm not sure if the intervals are guaranteed to be mutually exclusive -- if not, then:
intervals = [frozenset(range(10)), frozenset(range(10, 20))]
counts = [0] * len(intervals)
for n in mylist:
    for i, inter in enumerate(intervals):
        if n in inter:
            counts[i] += 1
If the intervals are mutually exclusive, this code could be sped up a bit by breaking out of the inner loop right after the increment. However, for mutually exclusive intervals of integers >= 0, there's an even more attractive option: first, prepare an auxiliary index, e.g. given your startpoints data structure that could be
indices = [sum(i >= x for x in startpoints) - 1 for i in range(max(startpoints))]
(note the >=, so that a value exactly equal to a startpoint falls into the interval that startpoint opens)
and then
counts = [0] * len(intervals)
for n in mylist:
    if 0 <= n < len(indices):
        counts[indices[n]] += 1
This can be adjusted if the intervals can include negative numbers (everything needs to be offset by -min(startpoints) in that case).
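Assembled end to end (my own assembly of the fragments above), this reproduces the expected counts:
mylist = [4, 4, 1, 18, 2, 15, 6, 14, 2, 16, 2, 17, 12, 3, 12, 4, 15, 5, 17]
startpoints = [0, 10, 20]
# indices[n] maps an integer n to the interval it belongs to
indices = [sum(i >= x for x in startpoints) - 1 for i in range(max(startpoints))]
counts = [0] * (len(startpoints) - 1)
for n in mylist:
    if 0 <= n < len(indices):
        counts[indices[n]] += 1
print(counts)  # [10, 9]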
If the "numbers" can be arbitrary floats (or decimal.Decimals, etc), not just integer, the possibilities for optimization are more restricted. Is that the case...?