Find intersecting time ranges in a numpy array (Python)

I have a Python numpy array with two rows. One row holds the start times of events and the other one the end times (here the times are epoch integers). In the code example below, the event at index=0 starts at time=1 and ends at time=7.
import numpy as np

start = [1, 8, 15, 30]
end = [7, 16, 20, 40]
timeranges = np.array([start, end])
I want to know whether any of the time ranges intersect. That means I need a function/algorithm that detects, for example, that the time range from 8 to 16 intersects the time range from 15 to 20.
My solution is to use two nested loops and check whether any start time or end time falls within another time range. But in IPython it takes very long, because my array contains nearly 10000 events.
Is there an elegant solution to get the result in a "short" time (e.g. below one minute)?

Store the data as a collection of (time,index_in_list,start_or_end). For example, if the input data is:
start = [1, 8, 15, 30]
end = [7, 16, 20, 40]
Transform the input data to a list of tuples as follows:
def extract_times(times, is_start):
    return [(times[i], i, is_start) for i in range(len(times))]
Which yields:
extract_times(start, True) == [(1, 0, True), (8, 1, True), (15, 2, True), (30, 3, True)]
extract_times(end, False) == [(7, 0, False), (16, 1, False), (20, 2, False), (40, 3, False)]
Now, merge the two lists and sort them by time.
Then traverse the sorted list from beginning to end, keeping track of the currently open intervals and updating that state depending on whether each new tuple marks the beginning or the end of an interval. This way you'll find all overlaps.
The complexity is O(n log n) for the sorting, plus some overhead if there are lots of intersections.
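A minimal sketch of that sweep (the function name is mine; intervals are treated as half-open, so an end that exactly equals another start does not count as an overlap):

start = [1, 8, 15, 30]
end = [7, 16, 20, 40]

def find_overlaps(start, end):
    # Build (time, index, is_start) tuples; sorting puts ends before starts
    # at equal times because False < True, so touching intervals don't overlap.
    events = sorted(
        [(t, i, True) for i, t in enumerate(start)] +
        [(t, i, False) for i, t in enumerate(end)],
        key=lambda e: (e[0], e[2])
    )
    open_intervals = set()   # indices of intervals currently "open"
    overlaps = []
    for t, i, is_start in events:
        if is_start:
            overlaps.extend((j, i) for j in open_intervals)
            open_intervals.add(i)
        else:
            open_intervals.discard(i)
    return overlaps

print(find_overlaps(start, end))   # [(1, 2)]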

Given that the input lists might not be sorted and to handle cases where we might see time ranges with multiple intersections, here's a brute-force, comparison-based method using broadcasting -
np.argwhere(np.triu(timeranges[1][:,None] > timeranges[0],1))
Sample runs
Original sample case :
In [81]: timeranges
Out[81]:
array([[ 1,  8, 15, 30],
       [ 7, 16, 20, 40]])
In [82]: np.argwhere(np.triu(timeranges[1][:,None] > timeranges[0],1))
Out[82]: array([[1, 2]])
Multiple intersections case :
In [77]: timeranges
Out[77]:
array([[ 5,  7, 18, 12, 19],
       [11, 17, 28, 19, 28]])
In [78]: np.argwhere(np.triu(timeranges[1][:,None] > timeranges[0],1))
Out[78]:
array([[0, 1],
       [1, 3],
       [2, 3],
       [2, 4]])
If by "within" in "if any start time or end time is within an other timerange" you meant that the boundaries are inclusive, change the > comparison to >= in the solution code.


How to use random.sample() within a for-loop to generate multiple, *non-identical* sample lists?

I would like to know how to use the python random.sample() function within a for-loop to generate multiple sample lists that are not identical.
For example, right now I have:
for i in range(3):
    sample = random.sample(range(10), k=2)
This will generate 3 sample lists containing two numbers each, but I would like to make sure none of those sample lists are identical. (It is okay if there are repeating values, i.e., (2,1), (3,2), (3,7) would be okay, but (2,1), (1,2), (5,4) would not.)
If you specifically need to "use random.sample() within a for-loop", then you could keep track of samples that you've seen, and check that new ones haven't been seen yet.
import random

seen = set()
for i in range(3):
    while True:
        sample = random.sample(range(10), k=2)
        print(f'TESTING: {sample = }')  # For demo
        fr = frozenset(sample)
        if fr not in seen:
            seen.add(fr)
            break
    print(sample)
Example output:
TESTING: sample = [0, 7]
[0, 7]
TESTING: sample = [0, 7]
TESTING: sample = [1, 5]
[1, 5]
TESTING: sample = [7, 0]
TESTING: sample = [3, 5]
[3, 5]
Here I made seen a set to allow fast lookups, and I converted sample to a frozenset so that order doesn't matter in comparisons. It has to be frozen because a set can't contain another set.
However, this could be very slow with different inputs, especially a larger number of iterations or a smaller range to draw samples from. In theory its runtime is unbounded, but in practice random's number generator is finite.
Alternatives
There are other ways to do the same thing that could be much more performant. For example, you could take a big random sample, then chunk it into the desired size:
n = 3
k = 2
upper = 10

sample = random.sample(range(upper), k=k*n)
for chunk in chunks(sample, k):
    print(chunk)
Example output:
[6, 5]
[3, 0]
[1, 8]
With this approach, you'll never get any duplicate numbers like [[2,1], [3,2], [3,7]] because the sample contains all unique numbers.
This approach was inspired by Sven Marnach's answer on "Non-repetitive random number in numpy", which I coincidentally just read today.
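The chunks helper isn't defined in the snippet above; a minimal version (simply slicing the flat sample into consecutive pieces of length k) could look like this:

def chunks(seq, size):
    # Yield consecutive size-length slices of seq
    for i in range(0, len(seq), size):
        yield seq[i:i + size]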
It looks like you are trying to make a nested list of items drawn without repetition from the original list; you can try the code below.
import random

mylist = list(range(50))

def randomlist(mylist, k):
    length = lambda: len(mylist)
    newlist = []
    while length() >= k:
        newlist.append([mylist.pop(random.randint(0, length() - 1)) for _ in range(k)])
    newlist.append(mylist)  # whatever is left over (fewer than k items)
    return newlist

randomlist(mylist, 6)
[[2, 20, 36, 46, 14, 30],
[4, 12, 13, 3, 28, 5],
[45, 37, 18, 9, 34, 24],
[31, 48, 11, 6, 19, 17],
[40, 38, 0, 7, 22, 42],
[23, 25, 47, 41, 16, 39],
[8, 33, 10, 43, 15, 26],
[1, 49, 35, 44, 27, 21],
[29, 32]]
This should do the trick.
import random
import math

# create set to store samples
a = set()
# number of distinct elements in the population
m = 10
# sample size
k = 2
# number of samples
n = 3

# this protects against an infinite loop (see Safety Note)
if n > math.comb(m, k):
    print(
        f"Error: {math.comb(m, k)} is the number of {k}-combinations "
        f"from a set of {m} distinct elements."
    )
    exit()

# the meat
while len(a) < n:
    a.add(tuple(sorted(random.sample(range(m), k=k))))

print(a)
With a set you are guaranteed to get a collection with no duplicate elements. In a set, you would be allowed to have (1, 2) and (2, 1) inside, which is why sorted is applied. So if [1, 2] is drawn, sorted([1, 2]) returns [1, 2]. And if [2, 1] is subsequently drawn, sorted([2, 1]) returns [1, 2], which won't be added to the set because (1, 2) is already in the set. We use tuple because objects in a set have to be hashable and list objects are not.
I hope this helps. Any questions, please let me know.
Safety Note
To avoid an infinite loop when you change 3 to some large number, you need to know the maximum number of possible samples of the type that you desire.
The relevant mathematical concept for this is a combination.
Suppose your first argument of random.sample() is range(m), where m is some arbitrary positive integer. Note that this means that the sample will be drawn from a population of m distinct members without replacement.
Suppose that you wish to have n samples of length k in total.
The number of possible k-combinations from the set of m distinct elements is
m! / (k! * (m - k)!)
You can get this value via
from math import comb
num_comb = comb(m, k)
comb(m, k) gives the number of different ways to choose k elements from m elements without repetition and without order, which is exactly what we want.
So in the example above, m = 10, k = 2, n = 3.
With these m and k, the number of possible k-combinations from the set of m distinct elements is 45.
You need to ensure that n does not exceed 45 if you want to use those specific m and k and avoid an infinite loop.

Quickly remove outliers from list in Python?

I have many long lists of time and temperature values, which have the following structure:
list1 = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]
Some of the time/temperature pairs are incorrect spikes in the data. For example, at time 8, it spiked to 92 degrees. I would like to get rid of these sudden jumps or dips in the temperature values.
To do this, I wrote the following code (I removed the stuff that isn't necessary and only copied the part that removes the spikes/outliers):
outlierpercent = 3
for i in values:
    temperature = i[1]
    index = values.index(i)
    if index > 0:
        prevtemp = values[index-1][1]
        pctdiff = (temperature/prevtemp - 1) * 100
        if abs(pctdiff) > outlierpercent:
            outliers.append(i)
While this works (where I can set the minimum percentage difference required for it to be considered a spike as outlierpercent), it takes a super long time (5-10 minutes per list). My lists are extremely long (around 5 million data points each), and I have hundreds of lists.
I was wondering if there was a much quicker way of doing this? My main concern here is time. There are other similar questions like this, however, they don't seem to be efficient enough for super long lists of this structure, so I'm not sure how to do it! Thanks!
outlierpercent = 3
for index in range(1, len(values)):
    temperature = values[index][1]
    prevtemp = values[index-1][1]
    pctdiff = (temperature/prevtemp - 1) * 100
    if abs(pctdiff) > outlierpercent:
        outliers.append(index)
This should do a lot better with time
Update:
The issue of only the first outlier being removed is that, after we flag an outlier, in the next iteration we are still comparing against the temp from that flagged outlier (prevtemp = values[index-1][1]).
I believe you can avoid that by handling the previous temp better. Something like this:
outlierpercent = 3
prevtemp = values[0][1]
for index in range(1, len(values)):
    temperature = values[index][1]
    pctdiff = (temperature/prevtemp - 1) * 100
    # outlier - add to list and don't update prev temp
    if abs(pctdiff) > outlierpercent:
        outliers.append(index)
    # valid temp, update prev temp to the current temperature
    else:
        prevtemp = temperature
Using Numpy to speed computations
With
values = [[1, 72], [2, 72], [3, 73], [4, 72], [5, 74], [6, 73], [7, 71], [8, 92], [9, 73]]
Numpy Code
import numpy as np

# Convert list to Numpy array
a = np.array(values)
# Calculate percent difference between consecutive temperatures
b = np.diff(a[:, 1]) * 100 / a[:-1, 1]
# Indices of outliers (np.where returns a tuple of index arrays)
outlier_indices = np.where(np.abs(b) > outlierpercent)
if outlier_indices[0].size:
    # add one since b is one element short due to computing the difference
    print(a[outlier_indices[0] + 1])
# Output: list of outliers, same as original code
[[ 8 92]
 [ 9 73]]
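If you want the cleaned data back rather than just printing the outliers, one option (continuing the snippet above, not part of the original answer) is to drop the flagged rows:

# Remove the flagged rows; everything else keeps its order
cleaned = np.delete(a, outlier_indices[0] + 1, axis=0)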
This should make two lists, valid and outliers.
I tried to keep math operations to a minimum for speed.
Pardon any typos, this was keyboard composed, untested.
lolim = None
outliers = []
outlierpercent = 3.0
lower_mult = (100.0 - outlierpercent) / 100.0
upper_mult = (100.0 + outlierpercent) / 100.0
for index, temp in values:
    if lolim is None:
        valids = [[index, temp]]                              # start the valid list
        lolim, hilim = lower_mult * temp, upper_mult * temp   # create initial range
    else:
        if lolim <= temp <= hilim:
            valids.append([index, temp])                          # new valid entry
            lolim, hilim = lower_mult * temp, upper_mult * temp   # update range
        else:
            outliers.append([index, temp])                        # save outliers, keep old range

Raise Elements of Array to Series of Exponents

Suppose I have a numpy array such as:
a = np.arange(9)
>> array([0, 1, 2, 3, 4, 5, 6, 7, 8])
If I want to raise each element to succeeding powers of two, I can do it this way:
power_2 = np.power(a,2)
power_4 = np.power(a,4)
Then I can combine the arrays by:
np.c_[power_2, power_4]
>> array([[ 0,    0],
          [ 1,    1],
          [ 4,   16],
          [ 9,   81],
          [16,  256],
          [25,  625],
          [36, 1296],
          [49, 2401],
          [64, 4096]])
What's an efficient way to do this if I don't know the degree of the even monomial (highest multiple of 2) in advance?
One thing to observe is that x^(2^n) = (...(((x^2)^2)^2)...^2)
meaning that you can compute each column from the previous by taking the square.
If you know the number of columns in advance you can do something like:
import functools as ft
import numpy as np
a = np.arange(5)
n = 4
out = np.empty((*a.shape,n),a.dtype)
out[:,0] = a
# Note: this works by side-effect!
# The optional second argument of np.square is "out", i.e. an
# array to write the result to (nonetheless the result is also
# returned directly)
ft.reduce(np.square,out.T)
out
# array([[ 0, 0, 0, 0],
# [ 1, 1, 1, 1],
# [ 2, 4, 16, 256],
# [ 3, 9, 81, 6561],
# [ 4, 16, 256, 65536]])
If the number of columns is not known in advance then the most efficient method is to make a list of columns, append as needed and only in the end use np.column_stack or np.c_ (if using np.c_ do not forget to cast the list to tuple first).
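A minimal sketch of that append-as-needed variant, assuming a hypothetical stopping criterion (here: stop squaring once any value reaches some limit, which is not part of the original question):

import numpy as np

a = np.arange(5, dtype=np.int64)
limit = 10**6   # hypothetical cutoff

cols = [a]
while np.all(np.abs(cols[-1]) < limit):
    cols.append(np.square(cols[-1]))   # each new column is the square of the previous one
out = np.column_stack(cols)            # columns are a, a**2, a**4, a**8, ...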
The straightforward approach is:
exponents = [2**n for n in a]
[a**e for e in exponents]
This works fine for relatively small numbers, but I see what looks like numerical overflow on the larger numbers. (Although I can compute those high powers just fine using scalars.)
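One way to sidestep that overflow (at the cost of speed) is to use an object-dtype array, so the elements are Python integers with arbitrary precision; a small sketch:

import numpy as np

a = np.arange(9, dtype=object)   # object dtype -> Python ints, no fixed-width overflow
print((a ** 64)[-1])             # 8**64, computed exactly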
The most elegant way I could think of is to not calculate the exponents beforehand. Since your exponents follow a very easy pattern, you can express everything using one list comprehension.
result = [item**2*index for index,item in enumerate(a)]
If you are working with quite large datasets, this will cause some serious overhead. This statement will do all calculations immediately and save all calculated elements in one large list. To mitigate this problem, you could use a generator expression, which will generate the data on the fly.
result = (item**2*index for index,item in enumerate(a))
See here for more details.

Numpy-vectorized function to repeat blocks of consecutive elements

Numpy has a repeat function that repeats each element of an array a given (per-element) number of times.
I want to implement a function that does a similar thing, but repeats variably sized blocks of consecutive elements rather than individual elements. Essentially I want the following function:
import numpy as np
def repeat_blocks(a, sizes, repeats):
    b = []
    start = 0
    for i, size in enumerate(sizes):
        end = start + size
        b.extend([a[start:end]] * repeats[i])
        start = end
    return np.concatenate(b)
For example, given
a = np.arange(20)
sizes = np.array([3, 5, 2, 6, 4])
repeats = np.array([2, 3, 2, 1, 3])
then
repeat_blocks(a, sizes, repeats)
returns
array([ 0, 1, 2,
0, 1, 2,
3, 4, 5, 6, 7,
3, 4, 5, 6, 7,
3, 4, 5, 6, 7,
8, 9,
8, 9,
10, 11, 12, 13, 14, 15,
16, 17, 18, 19,
16, 17, 18, 19,
16, 17, 18, 19 ])
I want to push these loops into numpy in the name of performance. Is this possible? If so, how?
Here's one vectorized approach using cumsum -
# Get repeats for each group using group lengths/sizes
r1 = np.repeat(np.arange(len(sizes)), repeats)
# Get total size of output array, as needed to initialize output indexing array
N = (sizes*repeats).sum() # or np.dot(sizes, repeats)
# Initialize indexing array with ones as we need to setup incremental indexing
# within each group when cumulatively summed at the final stage.
# Two steps here:
# 1. Within each group, we have multiple sequences, so set up the offsetting
#    at the start of each sequence by the sequence lengths preceding it.
id_ar = np.ones(N, dtype=int)
id_ar[0] = 0
insert_index = sizes[r1[:-1]].cumsum()
insert_val = (1-sizes)[r1[:-1]]
# 2. For each group, make sure the indexing starts from the next group's
# first element. So, simply assign 1s there.
insert_val[r1[1:] != r1[:-1]] = 1
# Assign index-offseting values
id_ar[insert_index] = insert_val
# Finally index into input array for the group repeated o/p
out = a[id_ar.cumsum()]
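Wrapped into a function (the name is mine, not from the answer) and checked against repeat_blocks from the question:

import numpy as np

def repeat_blocks_vectorized(a, sizes, repeats):
    # Same steps as above, just packaged for reuse
    r1 = np.repeat(np.arange(len(sizes)), repeats)
    N = (sizes * repeats).sum()
    id_ar = np.ones(N, dtype=int)
    id_ar[0] = 0
    insert_index = sizes[r1[:-1]].cumsum()
    insert_val = (1 - sizes)[r1[:-1]]
    insert_val[r1[1:] != r1[:-1]] = 1
    id_ar[insert_index] = insert_val
    return a[id_ar.cumsum()]

a = np.arange(20)
sizes = np.array([3, 5, 2, 6, 4])
repeats = np.array([2, 3, 2, 1, 3])
assert np.array_equal(repeat_blocks_vectorized(a, sizes, repeats),
                      repeat_blocks(a, sizes, repeats))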
This function is a great candidate to speed up using Numba:
import numba
import numpy as np

@numba.njit
def repeat_blocks_jit(a, sizes, repeats):
    out = np.empty((sizes * repeats).sum(), a.dtype)
    start = 0
    oi = 0
    for i, size in enumerate(sizes):
        end = start + size
        for rep in range(repeats[i]):
            oe = oi + size
            out[oi:oe] = a[start:end]
            oi = oe
        start = end
    return out
This is significantly faster than Divakar's pure NumPy solution, and a lot closer to your original code. I made no effort at all to optimize it. Note that np.dot() and np.repeat() can't be used here, but that doesn't matter when all the code gets compiled.
Plus, since it is njit, meaning "nopython" mode, you can even use @numba.njit(nogil=True) and get a multicore speedup if you have many of these calls to make.
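For example, assuming the decorator is changed to @numba.njit(nogil=True), a thread pool can run many such calls in parallel because the compiled function releases the GIL while it executes (the batch of tasks below is hypothetical):

from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Hypothetical batch of (a, sizes, repeats) tasks
tasks = [(np.arange(20), np.array([3, 5, 2, 6, 4]), np.array([2, 3, 2, 1, 3]))] * 100

with ThreadPoolExecutor() as pool:
    results = list(pool.map(lambda t: repeat_blocks_jit(*t), tasks))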

logical arrays and mapping in python

I'm trying to vectorize some element calculations but am having difficulty doing so without resorting to list comprehensions to map local information to global information. I was told that I can accomplish what I want using logical arrays, but so far the examples I've found have not been helpful. While I can accomplish this with list comprehensions, speed is a main concern for my code.
I have a set of values that indicate indices in the "global" calculation that should not be adjusted.
For example, these "fixed" indices are
1 2 6
If my global calculation has ten elements, I would be able to set all the "free" values by creating a list of the set of the global indices and subtracting the fixed indices.
free = list(set(range(len(global))) - set(fixed))
[0, 3, 4, 5, 7, 8, 9]
in the global calculation, I would be able to adjust the "free" elements as shown in the following code snippet
global = np.ones(10)
global[free] = global[free] * 10
which should produce:
global = [10, 1, 1, 10, 10, 10, 1, 10, 10, 10]
my "local" calculation is a subset of the global one, where the local map indicates the corresponding indices in the global calculation.
local_map = [4, 2, 1, 8, 6]
local_values = [40, 40, 40, 40, 40]
but I need the values associated with the local map to retain their order for calculation purposes.
What would the equivalent of global[free] be on the local level?
the desired output would be something like this:
local_free = list(set(range(len(local))) - set(fixed))
local_values[local_free] *= 10
OUTPUT: local_values = [400, 40, 40, 400, 40]
I apologize if the question formatting is off, the code block formatting doesn't seem to be working in my browser, so please let me know if you need clarification.
For such comparison-related operations, NumPy has tools like np.setdiff1d and np.in1d among others. To solve our case, these two would be enough. I would assume that the inputs are NumPy arrays, as then we could use vectorized indexing methods supported by NumPy.
On the first case, we have -
In [97]: fixed = np.array([1,2,6])
...: global_arr = np.array([10, 1, 1, 10, 10, 10, 1, 10, 10, 10])
...:
To get the equivalent of list(set(range(len(global_arr))) - set(fixed)) in NumPy, we could make use of np.setdiff1d -
In [98]: np.setdiff1d(np.arange(len(global_arr)),fixed)
Out[98]: array([0, 3, 4, 5, 7, 8, 9])
Next up, we have -
In [99]: local_map = np.array([4, 2, 1, 8, 6])
...: local_values = np.array([42, 40, 48, 41, 43])
...:
We were trying to get -
local_free = list(set(range(len(local)) - set(fixed))
local_values[local_free] *= 10
Here, we can use np.in1d to get a mask to be an equivalent for local_free that could be used to index and assign into local_values with NumPy's boolean-indexing method -
In [100]: local_free = ~np.in1d(local_map,fixed)
...: local_values[local_free] *= 10
...:
In [101]: local_values
Out[101]: array([420, 40, 48, 410, 43])
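On recent NumPy versions, np.isin is the documented successor to np.in1d and reads the same way here; a minimal sketch:

import numpy as np

fixed = np.array([1, 2, 6])
local_map = np.array([4, 2, 1, 8, 6])
local_values = np.array([42, 40, 48, 41, 43])

local_free = ~np.isin(local_map, fixed)   # True where the mapped global index is not fixed
local_values[local_free] *= 10
print(local_values)                       # [420  40  48 410  43]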
