Split a list of dates by another list of dates - python

I have a number of nodes in a network. The nodes send status information every hour to indicate that they are alive. So I have a list of nodes and the time each one was last seen alive. I want to graph the number of alive nodes over time.
The list of nodes is sorted by the time they were last alive, but I can't figure out a nice way to count how many are alive at each date.
from datetime import datetime, timedelta
seen = [ n.last_seen for n in c.nodes ] # a list of datetimes
seen.sort()
start = seen[0]
end = seen[-1]
diff = end - start
num_points = 100
step = diff / num_points
num = len( c.nodes )
dates = [ start + i * step for i in range( num_points ) ]
What I want is basically
alive = [ len([ s for s in seen if s > date ]) for date in dates ]
but that's not really efficient. The solution should use the fact that the seen list is sorted and not loop over the whole list for every date.

This generator traverses the list only once:
def get_alive(seen, dates):
    c = len(seen)
    for date in dates:
        for s in seen[-c:]:
            if s >= date:  # replaced your > with >= as it seems to make more sense
                yield c
                break
            else:
                c -= 1

The Python bisect module will find the correct index for you, and you can deduce the number of items before and after it.
If I'm understanding right, that would be O(len(dates) * log(len(seen))).
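A minimal sketch of that idea (my own illustration, assuming the sorted seen and dates lists from the question, and counting s >= date the way the generator above does):
from bisect import bisect_left
# bisect_left gives the number of entries strictly before `date`,
# so the rest of the sorted list is the "alive" count for that date
alive = [ len(seen) - bisect_left(seen, date) for date in dates ]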
Edit 1
It should be possible to do in one pass, just like SilentGhost demonstrates. However, itertools.groupby works fine with sorted data, so it should be able to do something here, perhaps like this (this is more than O(n) but could be improved):
import itertools
# numbers are easier to make up now
seen = [-1, 10, 12, 15, 20, 75]
dates = [5, 15, 25, 50, 100]
def finddate(s, dates):
    """Find the first date in dates larger than s"""
    for date in dates:
        if s < date:
            break
    return date

for date, group in itertools.groupby(seen, key=lambda s: finddate(s, dates)):
    print date, list(group)

I took SilentGhost's generator solution a bit further using explicit iterators. This is the linear-time solution I was thinking of.
def splitter(items, breaks):
    """ assuming `items` and `breaks` are sorted """
    c = len(items)
    items = iter(items)
    item = items.next()
    breaks = iter(breaks)
    breaker = breaks.next()
    while True:
        if breaker > item:
            for it in items:
                c -= 1
                if it >= breaker:
                    item = it
                    yield c
                    break
            else:  # no item left that is > the current breaker
                yield 0  # 0 items left for the current breaker
                # and 0 items left for all other breaks, since they are > the current
                for _ in breaks:
                    yield 0
                break  # and done
        else:
            yield c
        for br in breaks:
            if br > item:
                breaker = br
                break
            yield c
        else:
            # there is no break > any item in the list
            break
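A quick check with the numeric seen/dates lists from the edit above (my own usage example; the expected counts are what the generator should yield, i.e. the number of items >= each break):
seen = [-1, 10, 12, 15, 20, 75]
dates = [5, 15, 25, 50, 100]
# each value is the number of entries in `seen` that are >= the matching date
print(list(splitter(seen, dates)))
# expected: [5, 3, 1, 1, 0]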

Related

Python: Selecting longest consecutive series of dates in list

I have a series of lists (np.arrays, actually), of which the elements are dates.
id
0a0fe3ed-d788-4427-8820-8b7b696a6033 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a48d1e8-ead2-404a-a5a2-6b05371200b1 [2019-01-30, 2019-01-31, 2019-02-01, 2019-02-0...
0a9edba1-14e3-466a-8d0c-f8a8170cefc8 [2019-01-29, 2019-01-30, 2019-01-31, 2019-02-0...
Name: startDate, dtype: object
For each element in the series (i.e. for each list of dates), I want to retain the longest sublist in which all dates are consecutive. I'm struggling to approach this in a pythonic (simple/efficient) way. The only approach that I can think of is to use multiple loops: loop over the series values (the lists), and loop over each element in the list. I would then store the first date and the number of consecutive days, and use temporary values to overwrite the results if a longer sequence of consecutive days is encountered. This seems highly inefficient though. Is there a better way of doing this?
Since you mention you are using numpy arrays of dates it makes sense to stick to numpy types instead of converting to the built-in type. I'm assuming here that your arrays have dtype 'datetime64[D]'. In that case you could do something like
import numpy as np
date_list = np.array(['2005-02-01', '2005-02-02', '2005-02-03',
                      '2005-02-05', '2005-02-06', '2005-02-07', '2005-02-08', '2005-02-09',
                      '2005-02-11', '2005-02-12',
                      '2005-02-14', '2005-02-15', '2005-02-16', '2005-02-17',
                      '2005-02-19', '2005-02-20',
                      '2005-02-22', '2005-02-23', '2005-02-24',
                      '2005-02-25', '2005-02-26', '2005-02-27', '2005-02-28'],
                     dtype='datetime64[D]')
i0max, i1max = 0, 0
i0 = 0
for i1, date in enumerate(date_list):
    if date - date_list[i0] != np.timedelta64(i1-i0, 'D'):
        if i1 - i0 > i1max - i0max:
            i0max, i1max = i0, i1
        i0 = i1
# also consider the run that reaches the end of the list
if len(date_list) - i0 > i1max - i0max:
    i0max, i1max = i0, len(date_list)
print(date_list[i0max:i1max])
# output: ['2005-02-22' '2005-02-23' '2005-02-24' '2005-02-25' '2005-02-26' '2005-02-27' '2005-02-28']
Here, i0 and i1 indicate the start and stop indices of the current sub-array of consecutive dates, and i0max and i1max the start and stop indices of the longest sub-array found so far; the extra check after the loop covers the case where the longest run sits at the very end of the list. The solution uses the fact that the difference between the i-th and zeroth entry in a list of consecutive dates is exactly i days.
You can convert the list to ordinals, which increase by exactly one for consecutive dates, i.e. next_date = previous_date + 1.
Then find the longest consecutive sub-array.
This takes a single O(n) pass over the list, which is the most efficient way to do it.
CODE
from datetime import datetime
def get_consecutive(date_list):
    # convert to ordinals
    v = [datetime.strptime(d, "%Y-%m-%d").toordinal() for d in date_list]
    consecutive = []
    run = []
    dates = []
    # get consecutive ordinal sequence
    for i in range(1, len(v) + 1):
        run.append(v[i-1])
        dates.append(date_list[i-1])
        if i == len(v) or v[i-1] + 1 != v[i]:
            if len(consecutive) < len(run):
                consecutive = dates
            dates = []
            run = []
    return consecutive
OUTPUT:
date_list = ['2019-01-29', '2019-01-30', '2019-01-31','2019-02-05']
get_consecutive(date_list )
# ordinals will be -> v = [737088, 737089, 737090, 737095]
OUTPUT:
['2019-01-29', '2019-01-30', '2019-01-31']
Now apply get_consecutive to the column with df.column.apply(get_consecutive); it will give you the longest consecutive date list for each row. Or you can call the function on each list if you are using some other data structure.
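A small, self-contained illustration of that (my own example data, not the question's; get_consecutive expects each element to be a list of 'YYYY-MM-DD' strings):
import pandas as pd

s = pd.Series({
    'id-1': ['2019-01-29', '2019-01-30', '2019-01-31', '2019-02-05'],
    'id-2': ['2019-01-30', '2019-02-01', '2019-02-02', '2019-02-03'],
}, name='startDate')
longest = s.apply(get_consecutive)
# 'id-1' -> ['2019-01-29', '2019-01-30', '2019-01-31']
# 'id-2' -> ['2019-02-01', '2019-02-02', '2019-02-03']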
I'm going to reduce this problem to finding consecutive days in a single list. There are a few tricks that make it more Pythonic as you ask. The following script should run as-is. I've documented how it works inline:
from datetime import timedelta, date
# example input
days = [
    date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
    date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8),
]
# store the longest interval and the current consecutive interval
# as we iterate through a list
longest_interval_index = current_interval_index = 0
longest_interval_length = current_interval_length = 1
# using zip here to reduce the number of indexing operations
# this will turn the days list into [(2020-01-01, 2020-01-02), (2020-01-02, 2020-01-04), ...]
# use enumerate to get the index of the current day
for i, (previous_day, current_day) in enumerate(zip(days, days[1:]), start=1):
    if current_day - previous_day == timedelta(days=+1):
        # we've found a consecutive day! increase the interval length
        current_interval_length += 1
    else:
        # nope, not a consecutive day! start from this day and start
        # counting from 1
        current_interval_index = i
        current_interval_length = 1
    if current_interval_length > longest_interval_length:
        # we broke the record! record it as the longest interval
        longest_interval_index = current_interval_index
        longest_interval_length = current_interval_length
print("Longest interval index:", longest_interval_index)
print("Longest interval: ", days[longest_interval_index:longest_interval_index + longest_interval_length])
It should be easy enough to turn this into a reusable function.
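For instance, a compact wrapper of the same logic could look like this (my own sketch, not part of the original answer):
from datetime import timedelta, date

def longest_consecutive(days):
    """Return the longest run of consecutive dates in an already-sorted list."""
    if not days:
        return []
    best_start = cur_start = 0
    best_len = cur_len = 1
    for i, (prev_day, day) in enumerate(zip(days, days[1:]), start=1):
        if day - prev_day == timedelta(days=1):
            cur_len += 1
        else:
            cur_start, cur_len = i, 1
        if cur_len > best_len:
            best_start, best_len = cur_start, cur_len
    return days[best_start:best_start + best_len]

longest_consecutive([date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 4),
                     date(2020, 1, 5), date(2020, 1, 6), date(2020, 1, 8)])
# -> [date(2020, 1, 4), date(2020, 1, 5), date(2020, 1, 6)]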

Rolling a roulette with chances and counting the maximum same-side sequences of a number of rolls

I am trying to generate a random sequence of numbers, with each "result" having a chance of {a} 48.6% / {b} 48.6% / {c} 2.8%.
I am counting how many times a sequence of 6 or more {a} occurred, and the same for {b}.
{c} counts as neutral, meaning that if an {a} sequence is happening, {c} will count as {a}; likewise, if a {b} sequence is happening, {c} will count as {b}.
The thing is that the results seem right, but every iteration seems to give results that are "weighted" either to the {a} side or the {b} side, and I can't figure out why.
I would expect for example to have a result of :
{a:6, b:7, a:8, a:7, b:9} but what I am getting is {a:7, a:9, a:6, a:8} OR {b:7, b:8, b:6} etc.
Any ideas?
import sys
import random
from random import seed
from random import randint
from datetime import datetime
import time
loopRange = 8
flips = 500
median = 0
for j in range(loopRange):
    random.seed(datetime.now())
    sequenceArray = []
    maxN = 0
    flag1 = -1
    flag2 = -1
    for i in range(flips):
        number = randint(1, 1000)
        if(number <= 486):
            flag1 = 0
            sequenceArray.append(number)
        elif(number > 486 and number <= 972):
            flag1 = 1
            sequenceArray.append(number)
        elif(number > 972):
            sequenceArray.append(number)
        if(flag1 != -1 and flag2 == -1):
            flag2 = flag1
        if(flag1 != flag2):
            sequenceArray.pop()
            if(len(sequenceArray) > maxN):
                maxN = len(sequenceArray)
            if(len(sequenceArray) >= 6):
                print(len(sequenceArray))
                # print(sequenceArray)
            # print(sequenceArray)
            sequenceArray = []
    median += maxN
    print("Maximum sequence is %d " % maxN)
    print("\n")
    time.sleep(random.uniform(0.1, 1))
median = float(median/loopRange)
print("\n")
print(median)
I would implement something with two cursors, prev (previous) and curr (current), since you need to detect a change between the current and the previous state.
I'll just write the code of the inner loop over i, since the external loop adds complexity without being related to the source of the problem. You can then include this in your piece of code. It seems to work for me, but I am not sure I understand perfectly how you want to manage all the behaviours (especially at the start).
prev = -1
curr = -1
seq = []
maxN = 0
for i in range(flips):
    number = randint(1, 1000)
    if number <= 486:
        curr = 0  # case 'a'
    elif number <= 972:
        curr = 1  # case 'b'
    else:
        curr = 2  # case 'c', the joker
    if (prev == -1) and (curr == 2):
        # If we start with the joker, don't do anything
        # You can add code here to change this behavior
        pass
    else:
        if prev == -1:
            # At start, it is like the previous element is the current one
            prev = curr
        if (prev == curr) or (curr == 2):
            # We continue the sequence
            seq.append(number)
        else:
            # Break the sequence
            maxN = max(maxN, len(seq))  # save maximum length
            if len(seq) >= 6:
                print("%u: %r" % (prev, seq))
            seq = []  # reset sequence
            seq.append(number)  # don't forget to append the new number! It breaks the sequence, but starts a new one
            prev = curr  # We switch case 'a' for 'b', or 'b' for 'a'
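One detail to watch when embedding this (my own note, not part of the answer above): the last sequence is still open when the loop over flips finishes, so give it a final check afterwards, for example:
# after the loop over `flips` (hypothetical continuation)
maxN = max(maxN, len(seq))
if len(seq) >= 6:
    print("%u: %r" % (prev, seq))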

Grouping images by time interval python

I have a script that extracts EXIF data from images and puts it into a list. I sort the list, and what I end up with is a list of lists: the first position is the image time in seconds and the second is the image path. This is my list:
[[32372, 'F:\rubish\VOL1\cam\G0013025.JPG'], [32373, 'F:\rubish\VOL1\cam\G0013026.JPG'], [32373, 'F:\rubish\VOL1\cam\G0013027.JPG'],.... etc etc etc
This is the script that groups my images, made by @blhsing, which works great, but I want to start my grouping not from the first image but from given positions.
This is the script:
groups = []
for r in img:
    if groups and r[0] - groups[-1][-1][0] <= 5:
        groups[-1].append(r)
    else:
        groups.append([r])
for g in groups:
    print(g[0][1], g[0][0], g[-1][0], g[-1][1])
And this is what I have; it does not work well, it only takes one image and does not create a group. Can somebody please help me fix it?
groups = []
print(iii, "iii")
#print(min_list, " my min list ")
img.sort()
cnt = 0
mili = [32372, 34880]
for n in min_list:
    #print(n, "mili")
    for i in img:
        #print(i[0])
        if n == i[0]:
            if groups and i[0] - groups[-1][-1][0] <= 5:
                groups[-1].append(i)
            else:
                groups.append([i])
for ii in groups:
    print(ii[0][1], ii[0][0], ii[-1][0], ii[-1][1])
Here my min_list has two positions, meaning I want to create only two groups, and classify only the images starting from those two positions, with the same 5-second interval as before.
Since your img list is sorted by time already, you can iterate through the records and append them to the last sub-list of the output list (named groups in my example code) if the time difference to the last entry is no more than 5 seconds; otherwise put the record into a new sub-list of the output list. Keep in mind that in Python a subscript of -1 means the last item in a list.
groups = []
for r in img:
    if groups and r[0] - groups[-1][-1][0] <= 5:
        groups[-1].append(r)
    else:
        groups.append([r])
for g in groups:
    print(g[0][1], g[0][0], g[-1][0], g[-1][1])
Sure! I just actually wrote this same algorithm the other day, but for JavaScript. Easy to port to Python...
import pprint
def group_seq(data, predicate):
    groups = []
    current_group = None
    for datum in data:
        if current_group:
            if not predicate(current_group[-1], datum):  # Abandon the group
                current_group = None
        if not current_group:  # Need to start a new group
            current_group = []
            groups.append(current_group)
        current_group.append(datum)
    return groups
data = [
    [32372, r'F:\rubish\VOL1\cam\G0013025.JPG'],
    [32373, r'F:\rubish\VOL1\cam\G0013026.JPG'],
    [32373, r'F:\rubish\VOL1\cam\G0013027.JPG'],
    [32380, r'F:\rubish\VOL1\cam\G0064646.JPG'],
    [32381, r'F:\rubish\VOL1\cam\G0064646.JPG'],
]
groups = group_seq(
    data=data,
    predicate=lambda a, b: abs(a[0] - b[0]) <= 5,
)
pprint.pprint(groups)
outputs
[[[32372, 'F:\\rubish\\VOL1\\cam\\G0013025.JPG'],
[32373, 'F:\\rubish\\VOL1\\cam\\G0013026.JPG'],
[32373, 'F:\\rubish\\VOL1\\cam\\G0013027.JPG']],
[[32380, 'F:\\rubish\\VOL1\\cam\\G0064646.JPG'],
[32381, 'F:\\rubish\\VOL1\\cam\\G0064646.JPG']]]
Basically the predicate is a function that should return True if b belongs in the same group as a; for this use case, it checks that the (absolute) difference of the first items in the tuples/lists, i.e. the timestamps, is no more than 5 seconds.
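The same helper works for any other "belongs together" rule; for example (my own illustration, not from the original answer), grouping datetime stamps that are at most a minute apart:
from datetime import datetime, timedelta

events = [
    datetime(2020, 1, 1, 10, 0, 0),
    datetime(2020, 1, 1, 10, 0, 30),
    datetime(2020, 1, 1, 12, 0, 0),
]
by_minute = group_seq(events, predicate=lambda a, b: b - a <= timedelta(minutes=1))
# -> [[10:00:00, 10:00:30], [12:00:00]]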

How to reduce a collection of ranges to a minimal set of ranges [duplicate]

This question already has answers here: Union of multiple ranges (closed as a duplicate).
I'm trying to remove overlapping values from a collection of ranges.
The ranges are represented by a string like this:
499-505 100-115 80-119 113-140 500-550
I want the above to be reduced to two ranges: 80-140 499-550. That covers all the values without overlap.
Currently I have the following code.
cr = "100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250".split(" ")
ar = []
br = []
for i in cr:
(left,right) = i.split("-")
ar.append(left);
br.append(right);
inc = 0
for f in br:
i = int(f)
vac = []
jnc = 0
for g in ar:
j = int(g)
if(i >= j):
vac.append(j)
del br[jnc]
jnc += jnc
print vac
inc += inc
I split each entry on - and store the range limits in ar and br. I iterate over these limits pairwise, and if i is at least as great as j, I want to delete the element. But the program doesn't work. I expect it to produce this result: 80-125 500-550 200-250 180-185
For a quick and short solution,
from operator import itemgetter
from itertools import groupby

cr = "499-505 100-115 80-119 113-140 500-550".split(" ")
fullNumbers = []
for i in cr:
    a = int(i.split("-")[0])
    b = int(i.split("-")[1])
    fullNumbers += range(a, b+1)
# Remove duplicates and sort it
fullNumbers = sorted(list(set(fullNumbers)))

# Taken From http://stackoverflow.com/questions/2154249
def convertToRanges(data):
    result = []
    for k, g in groupby(enumerate(data), lambda (i, x): i - x):
        group = map(itemgetter(1), g)
        result.append(str(group[0]) + "-" + str(group[-1]))
    return result

print convertToRanges(fullNumbers)
#Output: ['80-140', '499-550']
For the set given in your program, the output is ['80-125', '180-185', '200-250', '500-550'].
The main possible drawback of this solution: it may not scale well, since it expands every range into the full list of numbers it contains.
Let me offer another solution whose running time does not grow with the sum of the range sizes. Aside from the initial sort, its running time is linearly proportional to the number of ranges.
def reduce(range_text):
    parts = range_text.split()
    if parts == []:
        return ''
    ranges = [ tuple(map(int, part.split('-'))) for part in parts ]
    ranges.sort()
    new_ranges = []
    left, right = ranges[0]
    for range in ranges[1:]:
        next_left, next_right = range
        if right + 1 < next_left:                # Is the next range to the right?
            new_ranges.append((left, right))     # Close the current range.
            left, right = range                  # Start a new range.
        else:
            right = max(right, next_right)       # Extend the current range.
    new_ranges.append((left, right))             # Close the last range.
    return ' '.join([ '-'.join(map(str, range)) for range in new_ranges ])
This function works by sorting the ranges, then looking at them in order and merging consecutive ranges that intersect.
Examples:
print(reduce('499-505 100-115 80-119 113-140 500-550'))
# => 80-140 499-550
print(reduce('100-115 115-119 113-125 80-114 180-185 500-550 109-120 95-114 200-250'))
# => 80-125 180-185 200-250 500-550

Python interval intersection

My problem is as follows:
having file with list of intervals:
1 5
2 8
9 12
20 30
And a range of
0 200
I would like to do such an intersection that will report the positions [start end] between my intervals inside the given range.
For example:
8 9
12 20
30 200
Besides any ideas on how to approach this, it would also be nice to read some thoughts on optimization, since as always the input files are going to be huge.
This solution works as long as the intervals are ordered by their start point, and it does not require creating a list as big as the total range.
code
with open("0.txt") as f:
t=[x.rstrip("\n").split("\t") for x in f.readlines()]
intervals=[(int(x[0]),int(x[1])) for x in t]
def find_ints(intervals, mn, mx):
next_start = mn
for x in intervals:
if next_start < x[0]:
yield next_start,x[0]
next_start = x[1]
elif next_start < x[1]:
next_start = x[1]
if next_start < mx:
yield next_start, mx
print list(find_ints(intervals, 0, 200))
output:
(in the case of the example you gave)
[(0, 1), (8, 9), (12, 20), (30, 200)]
Rough algorithm:
Create an array of booleans, all set to False: seen = [False]*200.
Iterate over the input file; for each line start end, set seen[start] .. seen[end] to True.
Once done, you can trivially walk the array to find the unused intervals.
In terms of optimisations, if the list of input ranges is sorted on start number, then you can track the highest seen number and use that to filter ranges as they are processed -
e.g. something like
for (start, end) in input:
    if end <= lowest_unseen:
        continue
    if start < lowest_unseen:
        start = lowest_unseen
    ...
which (ignoring the cost of the original sort) should make the whole thing O(n) - you go through the array once to tag seen/unseen and once to output unseens.
Seems I'm feeling nice. Here is the (unoptimised) code, assuming your input file is called input
seen = [False]*200
file = open('input','r')
rows = file.readlines()
for row in rows:
    (start, end) = row.split(' ')
    print "%s %s" % (start, end)
    for x in range(int(start)-1, int(end)-1):
        seen[x] = True
print seen[0:10]

in_unseen_block = False
start = 1
for x in range(1, 200):
    val = seen[x-1]
    if val and not in_unseen_block:
        continue
    if not val and in_unseen_block:
        continue
    # Must be at a change point.
    if val:
        # we have reached the end of the block
        print "%s %s" % (start, x)
        in_unseen_block = False
    else:
        # start of new block
        start = x
        in_unseen_block = True
# Handle end block
if in_unseen_block:
    print "%s %s" % (start, 200)
I'm leaving the optimizations as an exercise for the reader.
If you make a note every time one of your input intervals either opens or closes, you can put the keys of opens and closes together into an ordered set. Each adjacent pair of numbers in that set then forms an interval, and you can focus all of your logic on these intervals as discrete chunks.
myRange = range(201)
intervals = [(1, 5), (2, 8), (9, 12), (20, 30)]
opens = {}
closes = {}

def open(index):
    if index not in opens:
        opens[index] = 0
    opens[index] += 1

def close(index):
    if index not in closes:
        closes[index] = 0
    closes[index] += 1

for start, end in intervals:
    if end > start:  # Making sure to exclude empty intervals, which can be problematic later
        open(start)
        close(end)

# Sort all the interval-endpoints that we really need to look at
oset = {0: None, 200: None}
for k in opens.keys():
    oset[k] = None
for k in closes.keys():
    oset[k] = None
relevant_indices = sorted(oset.keys())

# Find the clear ranges
state = 0
results = []
for i in range(len(relevant_indices) - 1):
    start = relevant_indices[i]
    end = relevant_indices[i+1]
    start_state = state
    if start in opens:
        start_state += opens[start]
    if start in closes:
        start_state -= closes[start]
    end_state = start_state
    if end in opens:
        end_state += opens[end]
    if end in closes:
        end_state -= closes[end]
    state = end_state
    if start_state == 0:
        result_start = start
        result_end = end
        results.append((result_start, result_end))

for start, end in results:
    print(str(start) + " " + str(end))
This outputs:
0 1
8 9
12 20
30 200
The intervals don't need to be sorted.
This question seems to be a duplicate of Merging intervals in Python.
If I understood well the problem, you have a list of intervals (1 5; 2 8; 9 12; 20 30) and a range (0 200), and you want to get the positions outside your intervals, but inside given range. Right?
There's a Python library that can help you on that: python-intervals (also available from PyPI using pip). Disclaimer: I'm the maintainer of that library.
Assuming you import this library as follows:
import intervals as I
It's quite easy to get your answer. Basically, you first want to create a disjunction of intervals based on the ones you provide:
inters = I.closed(1, 5) | I.closed(2, 8) | I.closed(9, 12) | I.closed(20, 30)
Then you compute the complement of these intervals, to get everything that is "outside":
compl = ~inters
Then you take the intersection with [0, 200], as you want to restrict the points to that interval:
print(compl & I.closed(0, 200))
This results in:
[0,1) | (8,9) | (12,20) | (30,200]
