I have a series of lists (np.arrays, actually), of which the elements are dates.
id
0a0fe3ed-d788-4427-8820-8b7b696a6033 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8 [2019-01-29, 2019-01-30, 2019-01-31, 2019-02-0...
Name: startDate, dtype: object
For each element in the series (i.e. for each list of dates), I want to retain the longest sublist in which all dates are consecutive. I'm struggling to approach this in a Pythonic (simple/efficient) way. The only approach that I can think of is to use multiple loops: loop over the series values (the lists), and loop over each element in the list. I would then store the first date and the number of consecutive days, and use temporary values to overwrite the results if a longer sequence of consecutive days is encountered. This seems highly inefficient though. Is there a better way of doing this?
Since you mention you are using numpy arrays of dates, it makes sense to stick to numpy types instead of converting to the built-in type. I'm assuming here that your arrays have dtype 'datetime64[D]'. In that case you could do something like:
import numpy as np
date_list = np.array(['2005-02-01', '2005-02-02', '2005-02-03',
                      '2005-02-05', '2005-02-06', '2005-02-07', '2005-02-08', '2005-02-09',
                      '2005-02-11', '2005-02-12',
                      '2005-02-14', '2005-02-15', '2005-02-16', '2005-02-17',
                      '2005-02-19', '2005-02-20',
                      '2005-02-22', '2005-02-23', '2005-02-24',
                      '2005-02-25', '2005-02-26', '2005-02-27', '2005-02-28'],
                     dtype='datetime64[D]')
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
    if date - date_list[i0] != np.timedelta64(i1 - i0, 'D'):
        if i1 - i0 > i1max - i0max:
            i0max, i1max = i0, i1
        i0 = i1
# check the final run as well, otherwise a streak that reaches
# the end of the array is never compared
if len(date_list) - i0 > i1max - i0max:
    i0max, i1max = i0, len(date_list)
print(date_list[i0max:i1max])
# output: ['2005-02-22' '2005-02-23' '2005-02-24' '2005-02-25' '2005-02-26'
#          '2005-02-27' '2005-02-28']
Here, i0 and i1 indicate the start and stop indices of the current sub-array of consecutive dates, and i0max and i1max the start and stop indices of the longest sub-array found so far. The solution uses the fact that the difference between the i-th and zeroth entry in a list of consecutive dates is exactly i days.
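To run this over every element of the Series from the question, you can wrap the scan in a function and use apply. A minimal sketch, assuming each element is already a 'datetime64[D]' array and that df holds the startDate column shown in the question:

def longest_consecutive(date_list):
    # same scan as above, wrapped up for Series.apply
    i0max, i1max = 0, 0
    i0 = 0
    for i1, date in enumerate(date_list):
        if date - date_list[i0] != np.timedelta64(i1 - i0, 'D'):
            if i1 - i0 > i1max - i0max:
                i0max, i1max = i0, i1
            i0 = i1
    if len(date_list) - i0 > i1max - i0max:
        i0max, i1max = i0, len(date_list)
    return date_list[i0max:i1max]

longest = df['startDate'].apply(longest_consecutive)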
You can convert the dates to ordinals, which increase by one for consecutive dates, i.e. next_date = previous_date + 1 (see datetime.date.toordinal for details).
Then find the longest consecutive sub-array.
This takes O(n) time, a single pass over the list, which is the most efficient way to do this.
CODE
from datetime import datetime

def get_consecutive(date_list):
    # convert to ordinals
    v = [datetime.strptime(d, "%Y-%m-%d").toordinal() for d in date_list]
    consecutive = []
    run = []
    dates = []
    # get consecutive ordinal sequence
    for i in range(1, len(v) + 1):
        run.append(v[i-1])
        dates.append(date_list[i-1])
        if i == len(v) or v[i-1] + 1 != v[i]:
            if len(consecutive) < len(run):
                consecutive = dates
            dates = []
            run = []
    return consecutive
EXAMPLE:
date_list = ['2019-01-29', '2019-01-30', '2019-01-31', '2019-02-05']
get_consecutive(date_list)
# ordinals will be -> v = [737088, 737089, 737090, 737095]
OUTPUT:
['2019-01-29', '2019-01-30', '2019-01-31']
Now use get_consecutive with df.column.apply(get_consecutive); it will give you the longest consecutive date list for each row. Or you can call the function on each list separately if you are using some other data structure.
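For example (a sketch; the column name startDate is taken from the question, and this assumes its elements are lists of 'YYYY-MM-DD' strings):

# keep only the longest consecutive run per row
df['longest_run'] = df['startDate'].apply(get_consecutive)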
I'm going to reduce this problem to finding consecutive days in a single list. There are a few tricks that make it more Pythonic, as you asked. The following script should run as-is; I've documented how it works inline:
from datetime import timedelta, date

# example input
days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]

# store the longest interval and the current consecutive interval
# as we iterate through the list
longest_interval_index = current_interval_index = 0
longest_interval_length = current_interval_length = 1

# using zip here to reduce the number of indexing operations;
# it pairs each day with its successor:
# [(2020-01-01, 2020-01-02), (2020-01-02, 2020-01-04), ...]
# use enumerate to get the index of the current day
for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
    if current_day - previous_day == timedelta(days=+1):
        # we've found a consecutive day! increase the interval length
        current_interval_length += 1
    else:
        # nope, not a consecutive day! start from this day and start
        # counting from 1
        current_interval_index = i
        current_interval_length = 1
    if current_interval_length > longest_interval_length:
        # we broke the record! record it as the longest interval
        longest_interval_index = current_interval_index
        longest_interval_length = current_interval_length

print("Longest interval index:", longest_interval_index)
print("Longest interval: ",
      days[longest_interval_index:longest_interval_index + longest_interval_length])
It should be easy enough to turn this into a reusable function.
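For instance, a minimal wrapper (my sketch; the function name is made up):

def longest_consecutive_run(days):
    # returns the longest run of consecutive days in a sorted list of dates
    longest_index = current_index = 0
    longest_length = current_length = 1
    for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
        if current_day - previous_day == timedelta(days=1):
            current_length += 1
        else:
            current_index, current_length = i, 1
        if current_length > longest_length:
            longest_index, longest_length = current_index, current_length
    return days[longest_index:longest_index + longest_length]

print(longest_consecutive_run(days))
# [datetime.date(2020, 1, 4), datetime.date(2020, 1, 5), datetime.date(2020, 1, 6)]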
Related
I need some help.
I have a dictionary, which represents measured data during 8 days every 15 minutes, but some measurements are missing. Keys are datetime objects:
datetime(year, month, day, hour, minute)
and the values represent the measured parameter. My aim is to obtain a new dictionary whose keys represent only the time of day, e.g.:
time(hour, minute)
and its values are lists of the measurements taken at that same time on each day. So, instead of the first dictionary of length L, I want to obtain a new dictionary of length L/8 where every value is a list of 8 numbers (sometimes fewer, if there is a missing value on one or two days). This would be a very simple problem if there were no missing values, but with missing values my program returns some strange result. If somebody can provide an idea, it would be great! Here is my code (tec is my initial dictionary):
time_of_day = []
date = datetime(2020, 1, 13, 0, 0)
while date < datetime(2020, 1, 14, 0, 0):
    time_of_day.append(date)
    date = date + td(minutes=15)
day_tec = dict.fromkeys(time_of_day, [])
for i in day_tec.keys():
    j = 0
    while j < 8:
        try:
            day_tec[i].append(tec[i + j * timedelta(days=1)])
        except Exception as e:
            print(e)
        j = j + 1
print(day_tec)
print(day_tec) returns a dictionary with datetime objects as keys, from datetime(2020, 1, 13, 0, 0) to datetime(2020, 1, 14, 0, 0) in 15-minute steps, but its values are all lists of the same length as the whole initial dictionary.
The problem in your code is that dict.fromkeys(time_of_day, []) makes every key share one and the same list object, so every append shows up under every key. You can use a defaultdict instead to get what you want. The code below will give you a dictionary where each key is a tuple of hour and minute, and each value is a list of the values recorded at that time of day in the tec dict.
from collections import defaultdict

day_tec = defaultdict(list)
for dt, value in tec.items():
    tm = (dt.hour, dt.minute)
    day_tec[tm].append(value)
print(day_tec)
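A quick check with a made-up tec (my example data):

from datetime import datetime

tec = {
    datetime(2020, 1, 13, 0, 0): 1.0,
    datetime(2020, 1, 13, 0, 15): 2.0,
    datetime(2020, 1, 14, 0, 0): 3.0,  # the 0:15 reading on the 14th is missing
}
# running the loop above then gives:
# day_tec == {(0, 0): [1.0, 3.0], (0, 15): [2.0]}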
Given a list of indices that match a condition, where there will be many spans in the list that are sequentially adjacent, how can I easily select only the first of each span?
such that
magicallySelect([1,2,3,10,11,12,100,101,102]) == [1,10,100]
but -- importantly -- this should also work for other indices, like dates (which is the case in my data). The actual code I'm hoping to get working is:
original.reset_index(inplace=True)
predict = {}
for app in apps:
    reg = linear_model.LinearRegression()
    reg.fit(original.index.values.reshape(-1, 1), original[app].values)
    slope = reg.coef_.tolist()[0]
    delta = original[app].apply(lambda x: abs(slope - x))
    forecast['test_delta'] = forecast[app].apply(lambda x: abs(slope - x))
    tdm = forecast['test_delta'].mean()
    tds = forecast['test_delta'].std(ddof=0)
    # identify moments that are σ>2 abnormal
    forecast['z'] = forecast['test_delta'].apply(lambda x: abs(x - tdm / tds))
    sig = forecast.index[forecast[forecast['z'] > 2]].tolist()
    predict[app] = FIRST_INDEX_IN_EACH_SPAN_OF(sig)
l = [1, 2, 3, 10, 11, 12, 100, 101, 102]
# keep each element whose predecessor is not exactly one less; at i == 0
# this compares against l[-1], which is not consecutive here, so the
# first element is kept as well
indices = [l[i] for i in range(len(l)) if l[i-1] != l[i] - 1]
Reordering this slightly to work for datetimes, this would give you all items in the list where the gap from the previous item is greater than one day, plus the first item explicitly:
indices = [l[0]] + [l[i] for i in range(1, len(l)) if (l[i] - l[i-1]).days > 1]
For a difference in time measured in minutes, compare the gap in seconds instead; timedelta.total_seconds() is safer here than the .seconds attribute, which only holds the sub-day part of the difference. E.g. for 15 minutes (900 seconds) you can do:
indices = [l[0]] + [l[i] for i in range(1, len(l)) if (l[i] - l[i-1]).total_seconds() > 900]
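A quick demonstration with dates (my example):

from datetime import date

l = [date(2019, 1, 29), date(2019, 1, 30), date(2019, 1, 31), date(2019, 2, 5)]
indices = [l[0]] + [l[i] for i in range(1, len(l)) if (l[i] - l[i-1]).days > 1]
print(indices)  # [datetime.date(2019, 1, 29), datetime.date(2019, 2, 5)]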
So I have a deltatime array dt=[(20,6), (20,7), (20,9), (20,10), (20,11), (20,13)] and the issue I have is that I can't allow any data to be more than one second apart from the next value in the list. I wrote out a little if statement that goes
for k in range(len(dt) - 15):
    if dt[k+1].seconds - dt[k].seconds > 1:
        gj.append(dt[k])
        gj.append(dt[k+1])
and I end up with (20,7), (20,9), (20,11), (20,13), so I know which times are greater than one second apart, but I can't figure out how to delete the values from a deltatime array. I tried numpy.delete, but that didn't work because it's in a non-usable format. The end goal is having a new array [(20,6), (20,10)] with only data that is one second apart.
Why not check for a difference of at most one second and append those to a list?
Code
from datetime import time

dt = [(20, 6), (20, 7), (20, 9), (20, 10), (20, 11), (20, 13)]
dt = [time(0, m, s) for m, s in dt]

left = []
for i in range(len(dt) - 1):
    if dt[i + 1].second - dt[i].second <= 1:
        left.append(dt[i])
print(left)
Result
>>> [datetime.time(0, 20, 6), datetime.time(0, 20, 9), datetime.time(0, 20, 10)]
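Note that comparing the .second attribute only works while both times fall within the same minute; a variant using full datetimes (my adaptation) also handles minute boundaries:

from datetime import datetime, timedelta

# anchor the (minute, second) pairs to an arbitrary date so they can be subtracted
pairs = [(20, 6), (20, 7), (20, 9), (20, 10), (20, 11), (20, 13)]
stamps = [datetime(2000, 1, 1, 0, m, s) for m, s in pairs]
left = [stamps[i] for i in range(len(stamps) - 1)
        if stamps[i + 1] - stamps[i] <= timedelta(seconds=1)]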
Is there any efficient way in Python to count the times an array of numbers falls between certain intervals? The number of intervals I will be using may get quite large.
Like:
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
some_function(mylist, startpoints):
# startpoints = [0, 10, 20]
# count values in range [0, 9]
# count values in range [10, 19]
output = [10, 9]
You will have to iterate the list at least once.
The solution below works with any sequence/interval that implements comparison (<, >, etc.) and uses the bisect algorithm to find the correct point in the interval, so it is very fast.
It will work with floats, text, or whatever. Just pass a sequence and a list of the intervals.
from collections import defaultdict
from bisect import bisect_left

def count_intervals(sequence, intervals):
    count = defaultdict(int)
    intervals.sort()
    for item in sequence:
        pos = bisect_left(intervals, item)
        if pos == len(intervals):
            count[None] += 1
        else:
            count[intervals[pos]] += 1
    return count

data = [4, 4, 1, 18, 2, 15, 6, 14, 2, 16, 2, 17, 12, 3, 12, 4, 15, 5, 17]
print(count_intervals(data, [10, 20]))
Will print
defaultdict(<class 'int'>, {10: 10, 20: 9})
Meaning that you have 10 values at or below 10 and 9 values above 10 up to 20.
I don't know how large your list will get, but here's another approach:
import numpy as np

mylist = [4, 4, 1, 18, 2, 15, 6, 14, 2, 16, 2, 17, 12, 3, 12, 4, 15, 5, 17]
np.histogram(mylist, bins=[0, 9, 19])
# (array([10,  9]), array([ 0,  9, 19]))
You can also use a combination of value_counts() and pd.cut() to help you get the job done.
import pandas as pd
mylist = [4,4,1,18,2,15,6,14,2,16,2,17,12,3,12,4,15,5,17]
split_mylist = pd.cut(mylist, [0, 9, 19]).value_counts(sort = False)
print(split_mylist)
This piece of code will return this:
(0, 9]     10
(9, 19]     9
dtype: int64
Then you can utilise the tolist() function to get what you want:
split_mylist = split_mylist.tolist()
print(split_mylist)
Output: [10, 9]
If the numbers are integers, as in your example, representing the intervals as frozensets can perhaps be fastest (worth trying). Not sure if the intervals are guaranteed to be mutually exclusive -- if not, then
intervals = [frozenset(range(10)), frozenset(range(10, 20))]
counts = [0] * len(intervals)
for n in mylist:
    for i, inter in enumerate(intervals):
        if n in inter:
            counts[i] += 1
If the intervals are mutually exclusive, this code could be sped up a bit by breaking out of the inner loop right after the increment. However, for mutually exclusive intervals of integers >= 0, there's an even more attractive option: first, prepare an auxiliary index, e.g. given your startpoints data structure that could be
indices = [sum(i >= x for x in startpoints) - 1 for i in range(max(startpoints))]
and then
counts = [0] * len(intervals)
for n in mylist:
    if 0 <= n < len(indices):
        counts[indices[n]] += 1
This can be adjusted if the intervals can be < 0 (everything needs to be offset by -min(startpoints) in that case).
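A quick check of the auxiliary index with the startpoints from the question (my worked example):

startpoints = [0, 10, 20]
indices = [sum(i >= x for x in startpoints) - 1 for i in range(max(startpoints))]
# indices == [0]*10 + [1]*10: values 0-9 map to bucket 0, values 10-19 to bucket 1

mylist = [4, 4, 1, 18, 2, 15, 6, 14, 2, 16, 2, 17, 12, 3, 12, 4, 15, 5, 17]
counts = [0, 0]
for n in mylist:
    if 0 <= n < len(indices):
        counts[indices[n]] += 1
print(counts)  # [10, 9]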
If the "numbers" can be arbitrary floats (or decimal.Decimals, etc), not just integer, the possibilities for optimization are more restricted. Is that the case...?
I have a number of nodes in a network. The nodes send status information every hour to indicate that they are alive. So I have a list of nodes and the times when they were last alive. I want to graph the number of alive nodes over time.
The list of nodes is sorted by the time they were last alive, but I can't figure out a nice way to count how many are alive at each date.
from datetime import datetime, timedelta
seen = [ n.last_seen for n in c.nodes ] # a list of datetimes
seen.sort()
start = seen[0]
end = seen[-1]
diff = end - start
num_points = 100
step = diff / num_points
num = len( c.nodes )
dates = [ start + i * step for i in range( num_points ) ]
What I want is basically
alive = [ len([ s for s in seen if s > date]) for date in dates ]
but that's not really efficient. The solution should use the fact that the seen list is sorted and not loop over the whole list for every date.
This generator traverses the list only once:
def get_alive(seen, dates):
    c = len(seen)
    for date in dates:
        for s in seen[-c:]:
            if s >= date:  # replaced your > with >= as it seems to make more sense
                yield c
                break
            else:
                c -= 1
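A usage sketch with made-up numbers (mine; note that nothing is yielded for a date later than every seen value):

seen = [-1, 10, 12, 15, 20, 75]
dates = [5, 15, 25, 50, 100]
print(list(get_alive(seen, dates)))  # [5, 3, 1, 1]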
The Python bisect module will find the correct index for you, and you can deduce the number of items before and after it.
If I'm understanding right, that would be O(len(dates) * log(len(seen))).
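A minimal sketch of that idea (my code; it assumes seen is sorted ascending and counts nodes with last_seen >= date, matching the question's definition of alive):

from bisect import bisect_left

def alive_counts(seen, dates):
    # everything from bisect_left(seen, date) onward is >= date
    return [len(seen) - bisect_left(seen, date) for date in dates]

print(alive_counts([-1, 10, 12, 15, 20, 75], [5, 15, 25, 50, 100]))
# [5, 3, 1, 1, 0]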
Edit 1
It should be possible to do this in one pass, just like SilentGhost demonstrates. However, itertools.groupby works fine with sorted data, so it should be able to do something here, perhaps like this (this is more than O(n) but could be improved):
import itertools

# numbers are easier to make up now
seen = [-1, 10, 12, 15, 20, 75]
dates = [5, 15, 25, 50, 100]

def finddate(s, dates):
    """Find the first date in dates larger than s"""
    for date in dates:
        if s < date:
            break
    return date

for date, group in itertools.groupby(seen, key=lambda s: finddate(s, dates)):
    print(date, list(group))
I took SilentGhost's generator solution a bit further using explicit iterators. This is the linear-time solution I was thinking of.
def splitter(items, breaks):
    """ assuming `items` and `breaks` are sorted """
    c = len(items)
    items = iter(items)
    item = next(items)
    breaks = iter(breaks)
    breaker = next(breaks)
    while True:
        if breaker > item:
            for it in items:
                c -= 1
                if it >= breaker:
                    item = it
                    yield c
                    break
            else:  # no item left that is >= the current breaker
                yield 0  # 0 items left for the current breaker
                # and 0 items left for all other breaks, since they are > the current
                for _ in breaks:
                    yield 0
                break  # and done
        else:
            yield c
        for br in breaks:
            if br > item:
                breaker = br
                break
            yield c
        else:
            # there is no break > any item in the list
            break
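A usage sketch with the numbers from the groupby example above (my check):

seen = [-1, 10, 12, 15, 20, 75]
dates = [5, 15, 25, 50, 100]
print(list(splitter(seen, dates)))  # [5, 3, 1, 1, 0]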