Histogram of events with different durations (Given their start and end times) - python

I have a NumPy array A of shape (n, 2), representing n different events. The first column holds the starting times of the events, and the second holds the respective durations of each event.
For some time interval [0, T] and N different equidistant time points, I would like a count of how many events are ongoing at each time point (i.e. an integer array of length N, where each entry is the number of events that started before that time and lasted until after it).
What is the most efficient way to achieve this in Python?
*I know what I'm asking for isn't really a histogram. If someone has a better term feel free to edit the title

You can try something like this. The idea is: for each bin, determine which events have started before the end of the bin but end after the start of the bin.
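import numpy as np

# Column 0: start times, column 1: durations
A = np.array([[1, 5, 6, 10], [5, 4, 1, 1]]).T
start = A[:, 0]
end = A.sum(axis=1)  # end time = start + duration
lower = 0
upper = 100
N = 10
bins = np.linspace(lower, upper, num=N+1)  # N bins need N+1 edges
# For each bin, count the events that overlap it
[((end > bins[n]) & (start < bins[n+1])).sum() for n in range(N)]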
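If the count is wanted at the N time points themselves rather than per bin, one possible sketch sorts the start and end times and uses np.searchsorted. It assumes non-negative durations and treats an event as ongoing at time t when start <= t < end. The function name ongoing_counts and the time grid (0 to 12) are made up for illustration.

```python
import numpy as np

def ongoing_counts(A, times):
    """Count events with start <= t < end for every t in times.

    A has start times in column 0 and (non-negative) durations in column 1.
    """
    starts = np.sort(A[:, 0])
    ends = np.sort(A[:, 0] + A[:, 1])
    # Number of events that have started by t, minus those that have ended by t.
    started = np.searchsorted(starts, times, side='right')
    ended = np.searchsorted(ends, times, side='right')
    return started - ended

A = np.array([[1, 5, 6, 10], [5, 4, 1, 1]]).T
times = np.linspace(0, 12, num=13)  # illustrative grid: 0, 1, ..., 12
counts = ongoing_counts(A, times)
print(counts)
```

This costs O((n + N) log n) instead of comparing every event against every time point, which may matter when both n and N are large.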

Related

Recursive python function to make two arrays equal?

I'm attempting to write Python code to solve a transportation problem using the Least Cost method. I have a 2D numpy array that I iterate through to find the minimum, perform calculations with that minimum, and then replace it with a 0, so that the loop stops when values matches constantarray, an array of the same shape containing only 0s. The values array contains distances from points in supply to points in demand. I'm currently using a while loop to do so, but the loop isn't running because values.all() != constantarray.all() evaluates to False.
I also need the process to repeat once the arrays have been edited to move onto the next lowest number in values.
constarray = np.zeros((len(supply),len(demand)) #create array of 0s
sandmoved = np.zeros((len(supply),len(demand)) #used to store information needed for later
totalcost = 0
while values.all() != constantarray.all(): #iterate until `values` only contains 0s
    m = np.argmin(values,axis = 0)[0] #find coordinates of minimum value
    n = np.argmin(values,axis = 1)[0]
    if supply[m] > abs(demand[m]): #all demand numbers are negative
        supply[m]+=demand[n] #subtract demand from supply
        totalcost +=abs(demand[n])*values[m,n]
        sandmoved[m,n] = demand[n] #add amount of 'sand' moved to an empty array
        values[m,0:-1] = 0 #replace entire m row with 0s since demand has been filled
        demand[n]=0 #replace demand value with 0
    elif supply[m]< abs(demand[n]):
        demand[n]+=supply[m] #combine positive supply with negative demand
        sandmoved[m,n]=supply[m]
        totalcost +=supply[m]*values[m,n]
        values[:-1,n]=0 #replace entire column with 0s since supply has been depleted
        supply[m] = 0
There is an additional if statement for when supply[m]==demand[n], but I feel that isn't necessary. I've already tried using nested for loops and many different syntax combinations for a while loop, but I just can't get it to work the way I want it to. Even when running the code block over and over by itself, m and n stay the same and the function removes one value from values but doesn't add it to sandmoved. Any ideas are greatly appreciated!!
Well, here is an example from an old implementation of mine:
import numpy as np
values = np.array([[3, 1, 7, 4],
[2, 6, 5, 9],
[8, 3, 3, 2]])
demand = np.array([250, 350, 400, 200])
supply = np.array([300, 400, 500])
totCost = 0
MAX_VAL = 2 * np.max(values) # choose MAX_VAL higher than all values
while np.any(values.ravel() < MAX_VAL):
    # find row and col indices of min
    m, n = np.unravel_index(np.argmin(values), values.shape)
    if supply[m] < demand[n]:
        totCost += supply[m] * values[m,n]
        demand[n] -= supply[m]
        values[m,:] = MAX_VAL # set all row to MAX_VAL
    else:
        totCost += demand[n] * values[m,n]
        supply[m] -= demand[n]
        values[:,n] = MAX_VAL # set all col to MAX_VAL
Solution:
print(totCost)
# 2850
Basically, start by choosing a MAX_VAL higher than all given values and a totCost = 0. Then follow the standard steps of the algorithm. Find row and column indices of the smallest cell, say m, n. Select the m-th supply or the n-th demand whichever is smaller, then add what you selected multiplied by values[m,n] to the totCost, and set all entries of the selected row or column to MAX_VAL to avoid it in the next iterations. Update the greater value by subtracting the selected one and repeat until all values are equal to MAX_VAL.
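As for why the while loop in the question never runs: .all() reduces each array to a single boolean, so the condition compares two booleans instead of comparing the arrays element by element. A small sketch with made-up data (the names zeros and the 2x2 values array are only for illustration):

```python
import numpy as np

values = np.array([[3, 0], [0, 2]])   # made-up data: some zeros, some not
zeros = np.zeros_like(values)

# values.all() is False (it contains zeros) and zeros.all() is also False,
# so the original condition compares False with False.
original_condition = values.all() != zeros.all()
print(original_condition)             # False: the loop body would never run

# Element-wise alternatives for "values still has non-zero entries".
has_nonzero = np.any(values != 0)
differs = not np.array_equal(values, zeros)
print(has_nonzero, differs)           # True True
```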

Improving the solution to Codejam's 'Infinite House of Pancakes'

I'm trying to solve the Codejam 2015's Infinite House of Pancakes problem in the most efficient way. My current solution is similar to the one given in the analysis (but in Python instead of C++):
def solve():
    T = int(input()) # the number of test cases
    for case in range(1, T+1):
        input() # the number of diners with non-empty plates, ignored
        diners = [int(x) for x in input().split()]
        minutes = max(diners) # the max stack of pancakes (= the max time)
        # try to arrange all pancakes to stacks of equal height
        for ncakes in range(1, minutes):
            s = sum([(d - 1) // ncakes for d in diners if d > ncakes]) # number of special minutes
            if s + ncakes < minutes:
                minutes = s + ncakes
        print(f'Case #{case}: {minutes}')
The time complexity of this solution is O(D*M), where D is the number of diners and M is the maximum number of pancakes.
However, the analysis also mentions another solution which is O(D*sqrt(M) + M):
Although the algorithm above is fast enough to
solve our problem, I have an even faster algorithm. Notice that the
list of ceil(a/1), ceil(a/2), ... only changes values at most
2*sqrt(a) times. For example, if a=10, the list is: 10, 5, 3, 3, 2, 2,
2, 2, 2, 1, 1, .... That list only changes value 4 times which is less
than 2 * sqrt(10)! Therefore, we can precompute when the list changes
value for every diner in only O(D*sqrt(M)). We can keep track these
value changes in a table. For example, if Pi=10, we can have a table
Ti: 10, -5, -2, 0, -1, 0, 0, 0, 0, -1, 0, .... Notice that the prefix
sum of this table is actually: 10, 5, 3, 3, 2, 2, 2, .... More
importantly, this table is sparse, i.e. it has only O(sqrt(M))
non-zeroes. If we do vector addition on all Ti, we can get a table
where every entry at index x of the prefix sum contains sum of
ceil(Pi/x). Then, we can calculate sum of ceil(Pi/x)-1 in the code
above by subtracting the xth index of the prefix sum with the number
of diners. Hence, only another O(M) pass is needed to calculate
candidate answers, which gives us O(D*sqrt(M) + M) running time. A
much faster solution!
Can anyone give me a hint how to translate this to Python?
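One way the quoted idea might be translated is sketched below. It is an interpretation of the analysis, not the reference solution, and the function name min_minutes is made up. It relies on the identity ceil(p/x) == (p - 1) // x + 1 for p >= 1 and jumps directly from one block of equal quotients to the next, so each diner only touches O(sqrt(M)) table entries.

```python
def min_minutes(diners):
    """Sketch of the O(D*sqrt(M) + M) approach using a difference table."""
    m = max(diners)
    # diff[x] holds the change in sum(ceil(p / x)) when moving from x-1 to x.
    diff = [0] * (m + 2)
    for p in diners:
        q = p - 1              # ceil(p / x) == q // x + 1
        x, prev = 1, 0
        while x <= m:
            v = q // x + 1     # ceil(p / x)
            diff[x] += v - prev
            prev = v
            if v == 1:
                break          # the value stays 1 for every larger x
            # First x at which ceil(p / x) drops below v.
            x = q // (v - 1) + 1
    best = m
    total = 0                  # running prefix sum = sum of ceil(p / x)
    for x in range(1, m + 1):
        total += diff[x]
        special = total - len(diners)   # sum of (ceil(p / x) - 1)
        best = min(best, special + x)
    return best

print(min_minutes([4]))  # 3
```

Inside solve(), the inner loop over ncakes would then be replaced by minutes = min_minutes(diners).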

Value in an array between two numbers in python

Writing a title that actually explains what I want is harder than I thought, so here is a fuller explanation.
I have an array filled with zeros that gets a new row of values every time a condition is met, so after one time-step iteration I get something like this (minus the headers):
current_array =
bubble_size y_coord
14040 42
3943 71
6345 11
0 0
0 0
....
After each time step, current_array is saved as previous_array and then reset to zeros, because the number of entries is not guaranteed to be the same each time.
The real question: I want to check every row in the first column of previous_array to see whether the current bubble size is within, say, 5% either side of it. If so, I want to take the difference between the current y position and the y coordinate stored alongside the matching bubble size (the second column of previous_array).
Currently I have something like:
if bubble_size in current_array[:, 0]:
    do_whatever
But I don't know how to pull out the associated y_coord without using a loop. I am fine with a loop, but would like to avoid one: the array has about 100 rows and there are at least 1000 time steps, so I want this to be as efficient as possible.
I have included my thoughts on the for loop (note that current_array and previous_array are actually called current_frame and previous_frame):
for y in range(array_size):
    # size match: within 5% either side of the previous bubble size
    if previous_frame[y, 0] * 0.95 < bubble_size < previous_frame[y, 0] * 1.05:
        distance_travelled = current_y_coord - previous_frame[y, 1]
Any help is greatly appreciated :)
I may not have fully understood your issue, but if you first want to check, row by row, whether the current bubble size is within 5% of the previous one, you can use the following:
import numpy as np
def apply(p, c): # For each element check the bubblesize grow
    if(p*0.95 < c < p*1.05):
        return 1
    else:
        return 0
def dist(p, c): # Calculate the distance
    return c-p
def update(prev, cur):
    assert isinstance(
        cur, np.ndarray), 'Current array is not a valid numpy array'
    assert isinstance(
        prev, np.ndarray), 'Previous array is not a valid numpy array'
    assert prev.shape == cur.shape, 'Arrays size mismatch'
    applyvec = np.vectorize(apply)
    toapply = applyvec(prev[:, 0], cur[:, 0])
    print(toapply)
    distvec = np.vectorize(dist)
    distance = distvec(prev[:, 1], cur[:, 1])
    print(distance)
current = np.array([[14040, 42],
                    [3943,71],
                    [6345,11],
                    [0,0],
                    [0,0]])
previous = np.array([[14039, 32],
                     [3942,61],
                     [6344,1],
                     [0,0],
                     [0,0]])
update(previous,current)
PS: Please, could you tell us what is the final array you look for based on my examples?
As I understand it (correct me if I'm wrong):
You have a current bubble size (integer) and a current y value (integer)
You have a 2D array (prev_array) that contains bubble sizes and y coords
You want to check whether your current bubble size is within 5% (either way) of each stored bubble size in prev_array
If they are within range, subtract your current y value from the stored y coord
This will result in a new array, containing only bubble sizes that are within range, and the newly subtracted y value
You want to do this without an explicit loop
You can do that using boolean indexing in numpy...
Setup the previous array:
prev_array = np.array([[14040, 42], [3943, 71], [6345, 11], [3945,0], [0,0]])
prev_array
array([[14040, 42],
[ 3943, 71],
[ 6345, 11],
[ 3945, 0],
[ 0, 0]])
You have your stored bubble size you want to use for comparison, and a current y coord value:
bubble_size = 3750
cur_y = 10
Next we can create a boolean mask that selects only the rows of prev_array that meet the 5% criterion:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
# ind is a boolean array that looks like this: [False, True, False, True, False]
Then we use ind to index prev_array, and calculate the new (subtracted) y coords:
new_array = prev_array[ind]
new_array[:,1] = cur_y - new_array[:,1]
Giving your final output array:
array([[3943, -61],
[3945, 10]])
As it's not clear what you want your output to actually look like, instead of creating a new array, you can also just update prev_array with the new y values:
ind = (bubble_size > prev_array[:,0]*.95) & (bubble_size < prev_array[:,0]*1.05)
prev_array[ind,1] = cur_y - prev_array[ind,1]
Which gives:
array([[14040, 42],
[ 3943, -61],
[ 6345, 11],
[ 3945, 10],
[ 0, 0]])
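If a frame holds several current bubbles, the same 5% test can be applied to all of them at once with broadcasting. The following is only a sketch: it reuses the example frames from the first answer and assumes every matching (current, previous) pair is of interest.

```python
import numpy as np

previous = np.array([[14039, 32], [3942, 61], [6344, 1], [0, 0], [0, 0]])
current = np.array([[14040, 42], [3943, 71], [6345, 11], [0, 0], [0, 0]])

prev_size = previous[:, 0]
cur_size = current[:, 0]

# match[i, j] is True when current bubble i is within 5% of previous bubble j.
# Empty (all-zero) rows never match because the comparisons are strict.
match = ((cur_size[:, None] > prev_size[None, :] * 0.95)
         & (cur_size[:, None] < prev_size[None, :] * 1.05))

# Indices of every matching (current, previous) pair.
cur_idx, prev_idx = np.nonzero(match)
distance_travelled = current[cur_idx, 1] - previous[prev_idx, 1]
print(distance_travelled)
```

With about 100 rows per frame, the intermediate mask is only 100 x 100 booleans, so the memory cost per time step is small.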

Numpy Dynamic Indexing With Both Slicing and Advanced Indexing

I am looking to index an (N, M) array by taking certain rows (given by an array of row indices) and, from each, a slice of columns with a fixed start and a varying stop index.
population = np.full((N, M), 0)  # population of N guys with genome length M
# Choose some guys and change parts of their genome (cols)
rows_indices = [0, 1, 5, 6]  # four guys 0, 1, 5, 6 will be changed
# all selected guys will have a start of 10
# the ends will be different for each guy
slice_lengths = np.random.geometric(p=0.8, size=4)  # vector of 4
What I imagine is something like:
population[0, 10:10 + slice_lengths[0]] = 100
population[1, 10:10 + slice_lengths[1]] = 100
population[5, 10:10 + slice_lengths[2]] = 100
population[6, 10:10 + slice_lengths[3]] = 100
Except vectorized, without hardcoding each value:
# pseudocode, not valid NumPy
population[rows_indices, start:start + slice_lengths] = 100
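NumPy slices cannot take a different stop per row, but a boolean mask built with broadcasting can express the same selection. A minimal sketch, assuming N = 8 and M = 20 (sizes the question does not give) and a seeded generator so the result is repeatable:

```python
import numpy as np

N, M = 8, 20                      # assumed sizes, for illustration only
start = 10
rng = np.random.default_rng(0)    # seeded so the example is repeatable

population = np.zeros((N, M), dtype=int)
rows_indices = np.array([0, 1, 5, 6])
slice_lengths = rng.geometric(p=0.8, size=rows_indices.size)

cols = np.arange(M)
# mask[i, j] is True when column j lies in [start, start + slice_lengths[i]).
mask = (cols >= start) & (cols < start + slice_lengths[:, None])

sub = population[rows_indices]    # advanced indexing returns a copy
sub[mask] = 100
population[rows_indices] = sub    # write the modified rows back
```

Like a slice, the mask simply stops at the last column if start + slice_lengths[i] exceeds M.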

How do I determine the high and low values in a series of cyclic data?

I've got some data that represents periodic motion. So, it goes from a high to a low and back again; if you were to plot it, it would look like a sine wave. However, the amplitude varies slightly in each cycle. I would like to make a list of each maximum and minimum in the entire sequence. If there were 10 complete cycles, I would end up with 20 numbers, 10 positive (high) and 10 negative (low).
It seems like this is a job for time series analysis, but I'm not familiar with statistics enough to know for sure.
I'm working in python.
Can anybody give me some guidance as far as possible code libraries and terminology?
This isn't an overly complicated problem if you don't want to use a library; something like this should do what you want. Basically, as you iterate through the data, going from ascending to descending marks a high, and going from descending to ascending marks a low.
def get_highs_and_lows(data):
    prev = data[0]
    high = []
    low = []
    asc = None  # None until the first change of value is seen
    for value in data[1:]:
        if not asc and value > prev:
            asc = True
            low.append(prev)
        elif (asc is None or asc) and value < prev:
            asc = False
            high.append(prev)
        prev = value
    # the last point closes whichever run is still in progress
    if asc:
        high.append(data[-1])
    else:
        low.append(data[-1])
    return (high, low)
>>> data = [0, 1, 2, 1, 0, -2, 0, 2, 4, 2, 6, 8, 4, 0, 2, 4]
>>> print(get_highs_and_lows(data))
([2, 4, 8, 4], [0, -2, 2, 0])
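The same ascending/descending idea can also be written with NumPy by looking at sign changes in the first difference. This is a sketch rather than a drop-in replacement: the function name local_extrema is made up, it reports interior turning points only (the first and last samples are never included), and it ignores flat plateaus.

```python
import numpy as np

def local_extrema(data):
    """Return (highs, lows) at interior turning points of a 1-D sequence."""
    x = np.asarray(data)
    slope = np.sign(np.diff(x))
    # A high is a rise followed by a fall; a low is a fall followed by a rise.
    high_idx = np.nonzero((slope[:-1] > 0) & (slope[1:] < 0))[0] + 1
    low_idx = np.nonzero((slope[:-1] < 0) & (slope[1:] > 0))[0] + 1
    return x[high_idx], x[low_idx]

data = [0, 1, 2, 1, 0, -2, 0, 2, 4, 2, 6, 8, 4, 0, 2, 4]
highs, lows = local_extrema(data)
print(highs, lows)  # [2 4 8] [-2  2  0]
```

On noisy measurements every small wiggle counts as a turning point, so some smoothing beforehand is likely needed.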
You'll probably need to familiarize yourself with some of the popular python science/statistics libraries. numpy comes to mind.
And here's an item from the SciPy mailing list discussing how to do what you want using numpy.
If x is a list of your data, and you happen to know the cycle length, T, try this:
import numpy as np

# Create 10 1000-sample cycles of a noisy sine wave.
T = 1000
x = np.sin(2*np.pi*np.arange(10*T)/T) + 0.1*np.random.randn(10*T)
# Find the minimum and maximum of each cycle.
[(min(x[i:i+T]), max(x[i:i+T])) for i in range(0, len(x), T)]
# prints something like the following (the noise is random):
[(-1.2234858463372265, 1.2508648231644286),
(-1.2272859833650591, 1.2339382830978067),
(-1.2348835727451217, 1.2554960382962332),
(-1.2354184224872098, 1.2305636540601534),
(-1.2367724101594981, 1.2384651681019756),
(-1.2239698560399894, 1.2665865375358363),
(-1.2211500568892304, 1.1687268390393153),
(-1.2471220836642811, 1.296787070454136),
(-1.3047322264307399, 1.1917835644190464),
(-1.3015059337968433, 1.1726658435644288)]
Note that this should work regardless of the phase offset of the sinusoid (with high probability).
