It is a short problem. There is a list of intervals, for example:
[1,4],[2,5],[3,6]
I want to get the coverage, meaning for each position from the smallest to the largest number (1 to 6), the number of intervals that overlap that position (like a line sweep from 1 to 6, counting the number of hits at each position). The result should be:
[1,2,3,3,2,1]
Is there any way to find this quickly? I can iterate over all positions and check whether each one lies in each interval, but that is too slow; in practice there can be millions of intervals. I was thinking of representing each interval as a bit array but still cannot figure it out. If anyone has an idea, please let me know. Thanks!
You could map your set of intervals to a set of number pairs (time; delta). Since the intervals are closed (position 4 is still covered by [1,4]), use +1 at each start and -1 at each end + 1:
[1,4],[2,5],[3,6]
becomes
(1;+1), (5;-1), (2;+1), (6;-1), (3;+1), (7;-1)
Then sort the set of pairs in ascending order of their time entries:
(1;+1), (2;+1), (3;+1), (5;-1), (6;-1), (7;-1)
Finally, sweep the positions from 1 to 6, incrementing/decrementing a coverage count (initialized at zero) at each event and carrying it forward between events:
(1;1), (2;2), (3;3), (4;3), (5;2), (6;1)
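A minimal sketch of this sweep in Python, assuming closed integer intervals and a dense position range (names are illustrative):
from collections import defaultdict

def coverage(intervals):
    # Record +1 at each start and -1 just past each end, then prefix-sum.
    deltas = defaultdict(int)
    for start, end in intervals:
        deltas[start] += 1
        deltas[end + 1] -= 1          # closed interval: the end itself is covered
    lo = min(s for s, _ in intervals)
    hi = max(e for _, e in intervals)
    result, running = [], 0
    for pos in range(lo, hi + 1):
        running += deltas[pos]
        result.append(running)
    return result

print(coverage([(1, 4), (2, 5), (3, 6)]))   # [1, 2, 3, 3, 2, 1]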
My array is time, so it is sorted and increasing.
I have to pull out the beginning/end pairs where the difference between consecutive values in the array is greater than 30. The problem, which other solutions don't cover, is that the array has thousands of values, so looping through it seems inefficient.
hugeArr = np.array([0, 2.072, 50.0, 90.0, 91.1])
My desired output for the above array would be something like: (2.072,50) (50,90).
Is there a way to accomplish this?
You can use np.diff and np.where to find the correct indices:
>>> idxs = np.where(np.diff(hugeArr) > 30)[0]
>>> list(zip(hugeArr[idxs], hugeArr[idxs + 1]))
[(2.072, 50.0), (50.0, 90.0)]
(Assuming you require only consecutive values)
And as @not_speshal mentioned, you can use np.column_stack instead of list(zip(...)) to stay within NumPy:
>>> np.column_stack((hugeArr[idxs], hugeArr[idxs+1]))
array([[ 2.072, 50. ],
[50. , 90. ]])
Try to think about what you're trying to do: for each value in the array, if the next value is larger by more than 30, you'd like to save the pair as a tuple.
The key words here are for each. This is a classic O(n) complexity algorithm, so decreasing its time complexity seems impossible to me.
However, you can make changes specific to your array to make the algorithm faster.
For example, if you're looking for a difference of 30 and you know that the average difference is 1, you might be better off looking, for index i, at
difference = hugeArr[i+15] - hugeArr[i]
and checking whether this is bigger than 30. If it isn't (and it probably won't be), you can skip these 15 indices, because the array is sorted, so no gap between two consecutive values inside that window can be larger than the window's total span.
If this works for you, run tests: 15 is completely arbitrary, and maybe your magic number is 25. Change it a bit and time how long your function takes to run.
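A rough sketch of this windowed scan (the stride of 15 and all names here are illustrative assumptions, not code from the answer):
def find_gaps(arr, gap=30, stride=15):
    # If a whole window spans less than `gap`, no consecutive pair inside it
    # can differ by more than `gap` (the array is sorted), so skip the window.
    pairs = []
    i, n = 0, len(arr)
    while i < n - 1:
        j = min(i + stride, n - 1)
        if arr[j] - arr[i] < gap:
            i = j
            continue
        for k in range(i, j):         # fall back to checking consecutive pairs
            if arr[k + 1] - arr[k] > gap:
                pairs.append((arr[k], arr[k + 1]))
        i = j
    return pairs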
A strategy that comes to mind: we don't have to check anything between two numbers whose distance is smaller than 30, and we can do this because the array is sorted. For example, if abs(hugeArr[0] - hugeArr[-1]) < 30, we don't have to check anything at all, because no pair inside can have a distance of over 30.
We would start at the ends and work our way inwards. So check the starting and ending numbers first. Then we go halfway, hugeArr[len(hugeArr)//2], and check that number's distance against hugeArr[0] and hugeArr[-1]. Then we recurse into the two halves (hugeArr[0:len(hugeArr)//2] and hugeArr[len(hugeArr)//2:]). We break those two ranges in half again, and wherever a range's end-to-end distance is smaller than 30 we skip it. We can make this a recursive algorithm.
Worst case you'll have a distance over 30 everywhere and end up with O(n), but it could give you some advantage.
Something like this; however, you might want to refactor it to NumPy.
def check(arr):
    pairs = []

    def check_range(hugeArr):
        # If the whole slice spans less than 30, no consecutive pair inside it
        # can differ by more than 30 (the array is sorted), so skip it.
        difference = abs(hugeArr[0] - hugeArr[-1])
        if difference < 30:
            return
        if len(hugeArr) == 2:
            pairs.append((hugeArr[0], hugeArr[1]))
            return
        # Otherwise split in half (overlapping by one element) and recurse.
        halfway = len(hugeArr) // 2
        check_range(hugeArr[:halfway + 1])
        check_range(hugeArr[halfway:])

    check_range(arr)
    return pairs
I have a dataframe with an index of Magic card names. The columns use the same index, resulting in a 1081 x 1081 dataframe pairing each card in my collection with each other card in my collection.
I have code that identifies combos of cards that go well together. For example "Whenever you draw a card" pairs well with "Draw a card" cards. I find the junction of those two cards and increase its value by 1.
Now, I need to find the combination of 36 cards with the maximum total value.
But, how?
Randomly selecting cards is useless; there are about 1.717391336E+74 potential combinations. I've tried pulling out the lowest values, and that reduces the set of potential combinations, but even at 100 cards you're talking about 1.977204582E+27 potentials.
This has to have been solved by someone smarter than me. Can y'all point me in the right direction?
As you pointed out already, the combinatorics are not on your side here. There are 1081 choose 36 possible sets (binomial coefficient), so it is out of the question to check all of them.
I am not aware of any practicable solution to find the optimal set for the general problem, that is, without knowing the 1081x1081 matrix.
For an approximate solution for the general problem, you might want to try a greedy approach, while keeping a history of n sets after each step, with e.g. n = 1000.
So you would start by going through all sets of 2 cards (1081 * 1080 / 2 combinations), look up the value in the matrix for each, and keep the n best ones.
In the second step, for each of the n kept sets, go through all possible combinations with a third card (and check for duplicate sets), i.e. checking n * 1079 sets, and keep the n max ones.
In the third step, check n * 1078 sets with a fourth card, and so on, and so forth.
Of course, this won't give you the optimal solution for the general case, but maybe it's good enough for your given situation. You can also take a look at the history, to get a feeling for how often it happens that the best set from step x is caught up by another set in a step y > x. Depending on your matrix, it might not happen that often or even never.
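A minimal sketch of this beam-style greedy search, assuming the synergy matrix is available as a NumPy array (e.g., via df.to_numpy()); all names here are illustrative:
import numpy as np
from itertools import combinations

def greedy_beam(matrix, target_size=36, beam=1000):
    n = matrix.shape[0]
    # Step 1: score every 2-card set (1081 * 1080 / 2 of them), keep the best `beam`.
    scored_pairs = sorted(
        ((matrix[i, j], frozenset((i, j))) for i, j in combinations(range(n), 2)),
        key=lambda t: t[0], reverse=True)
    kept = scored_pairs[:beam]
    # Later steps: extend each kept set by one card, drop duplicates, keep the best `beam`.
    for _ in range(target_size - 2):
        candidates = {}
        for score, cards in kept:
            members = list(cards)
            for c in range(n):
                if c in cards:
                    continue
                new_score = score + matrix[c, members].sum()
                new_set = cards | {c}
                if candidates.get(new_set, float("-inf")) < new_score:
                    candidates[new_set] = new_score
        kept = sorted(((s, cs) for cs, s in candidates.items()),
                      key=lambda t: t[0], reverse=True)[:beam]
    return kept[0]   # (total pairwise synergy, frozenset of card indices)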
I'm retrieving the arrays holding the power levels and frequencies, respectively, of a signal from the plt.psd() method:
Pxx, freqs = plt.psd(signals[0], NFFT=2048, Fs=sdr.sample_rate/1e6, Fc=sdr.center_freq/1e6, scale_by_freq=True, color="green")
Please ignore the green and red signals. Just the blue one is relevant for this question.
I'm able to have the peakutils.peak.indexes() method return the X and Y coordinates of a number of the most significant peaks (of the blue signal):
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))
indexes = peakutils.peak.indexes(np.array(power_lvls), thres=0.6/max(power_lvls), min_dist=120)
print("\nX: {}\n\nY: {}\n".format(freqs[indexes], np.array(power_lvls)[indexes]))
As can be seen, the coordinates fit the blue peaks quite nicely.
What I'm not satisfied with is the number of peak coordinates I receive from the peak.indexes() method. I'd like to have only the coordinates of peaks above a certain power level returned, e.g., -25 (which would then be exactly 5 peaks for the blue signal). According to the documentation of the peak.indexes() method, this is done by providing the desired value as the thres parameter.
But no matter what I try as thres, the method seems to entirely ignore my value and instead solely rely on the min_dist parameter to determine the number of returned peaks.
What is wrong with my threshold value (which I believe means "peaks above the lower 60% of the plot" in my code now) and how do I correctly specify a certain power level (instead of a percentage value)?
[EDIT]
I figured out that apparently the thres parameter can only take float values between 0. and 1.
So, by changing my line slightly as follows I can now influence the number of returned peaks as desired:
indexes = peakutils.peak.indexes(np.array(power_lvls), thres=0.4, min_dist=1)
But that still leaves me with the question whether it's possible to somehow limit the result to the five highest peaks (provided num_of_peaks above thres >= 5).
I believe something like the following would return the five highest values:
print(power_lvls[np.argsort(power_lvls[indexes])[-5:]])
Unfortunately, though, negative values seem to be interpreted as the highest values in my power_lvls array. Can this line be changed such that (+)10 would be considered higher than, e.g., -40? Or is there another (better?) solution?
[EDIT 2]
These are the values I get as the six "highest" peaks:
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))+10*log10(8/3)
indexes = peakutils.indexes(power_lvls, thres=0.35, min_dist=1)
power_lvls_max = power_lvls[np.argsort(power_lvls[indexes])[-6:]]
print("Highest Peaks in Signal:\nX: \n\nY: {}\n".format(power_lvls_max))
After trying various things for hours without any improvement, I'm starting to think that these are neither valleys nor peaks, just some "random" values, which leads me to believe that there is a problem with my argsort line that I have to figure out first.
[EDIT 3]
The bottleneck.partition() method seems to return the correct values (even if apparently it does so in random order, not from leftmost peak to rightmost peak):
import bottleneck as bn
power_lvls_max = -bn.partition(-power_lvls[indexes], 6)[:6]
Luckily, the order of the peaks is not important for what I have planned to do with the coordinates. I do, however, have to figure out yet how to match the Y values I have now to their corresponding X values ...
Also, while I do have a solution now, for learning purposes it would still be interesting to know what was wrong with my argsort attempt.
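A possible explanation for the argsort attempt (just a guess from the posted code): np.argsort handles negative dB values fine, but its output indexes into power_lvls[indexes], not into power_lvls, so the positions would need to be mapped back through indexes before being used on power_lvls or freqs, roughly like this:
order = np.argsort(power_lvls[indexes])[-5:]   # positions within the indexes array
top_idx = indexes[order]                       # positions within power_lvls / freqs
print(freqs[top_idx], power_lvls[top_idx])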
A simple way to solve this would be to add a constant (for example +50 dB) to your Pxx vector before the processing. That way you would avoid the negative-valued peaks. After the processing is done, you can subtract the constant again to get the right peak values.
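A small sketch of that idea, applied to the dB values (the 50 dB offset is arbitrary, as stated above):
offset = 50.0
shifted = power_lvls + offset                                # make all levels positive
indexes = peakutils.indexes(shifted, thres=0.35, min_dist=1)
peak_y = shifted[indexes] - offset                           # undo the shift afterwards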
I figured out how to find the corresponding X values and get the full coordinates of the six highest peaks:
power_lvls = 10*log10(Pxx/(sdr.sample_rate/1e6))+10*log10(8/3)
indexes = peakutils.indexes(power_lvls, thres=0.35, min_dist=1)
print("Peaks in Signal 1\nX: {}\n\nY: {}\n".format(freqs[indexes], power_lvls[indexes]))
power_lvls_max = -bn.partition(-power_lvls[indexes], 6)[:6]
check = np.isin(power_lvls, power_lvls_max)
indexes_max = np.where(check)
print("Highest Peaks in Signal 1:\nX: {}\n\nY: {}\n".format(freqs[indexes_max], power_lvls[indexes_max]))
Now I have my "peak filtering" (kind of), which I originally tried to achieve by messing around with the thres value of peakutils.peak.indexes(). The code above gives me exactly the desired result.
Setup:
The question is a more complex form of a classic probability question:
70 colored balls are placed in an urn, 10 for each of the seven rainbow colors.
What is the expected number of distinct colors in 20 randomly picked balls?
My attempted solution uses Python's itertools library:
combos = itertools.combinations(urn, 20)
print(sum(1 for x in combos))
(where urn is a list of the 70 balls in the urn).
I can unpack the iterator up to combinations(urn, 8); past that, my computer can't handle it.
Note: I know this wouldn't give me the answer by itself; it is only the roadblock in my script. In other words, if this worked, my script would work.
Question: How could I find the expected number of colors accurately without the world's fastest supercomputer? Is my way even computationally feasible?
Since a couple of people have asked to see the mathematical solution, I'll give it. This is one of the Project Euler problems that can be done in a reasonable amount of time with pencil and paper. The answer is
7(1 - (60 choose 20)/(70 choose 20))
To get this, write X, the count of colors present, as a sum X0+X1+X2+...+X6, where Xi is 1 if the ith color is present and 0 if it is not.
E(X)
= E(X0+X1+...+X6)
= E(X0) + E(X1) + ... + E(X6) by linearity of expectation
= 7E(X0) by symmetry
= 7 * probability that a particular color is present
= 7 * (1- probability that a particular color is absent)
= 7 * (1 - (# ways to pick 20 avoiding a color)/(# ways to pick 20))
= 7 * (1 - (60 choose 20)/(70 choose 20))
Expectation is always linear. So, when you are asked to find the average value of some random quantity, it often helps to try to rewrite the quantity as a sum of simpler pieces such as indicator (0-1) random variables.
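For a quick numerical check of this closed form (assuming Python 3.8+ for math.comb):
from math import comb
print(7 * (1 - comb(60, 20) / comb(70, 20)))   # ≈ 6.82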
This does not say how to make the OP's approach work. Although there is a direct mathematical solution, it is good to know how to iterate through the cases in an organized and practicable fashion. This could help if you next wanted a more complicated function of the set of colors present than the count. Duffymo's answer suggested something that I'll make more explicit:
You can break up the ways to draw 20 balls from 70 into categories indexed by the counts of colors. For example, the index (5,5,10,0,0,0,0) means we drew 5 of the first color, 5 of the second color, 10 of the third color, and none of the other colors.
The set of possible indices is contained in the collection of 7-tuples of nonnegative integers with sum 20. Some of these are impossible, such as (11,9,0,0,0,0,0) by the problem's assumption that there are only 10 balls of each color, but we can deal with that. The set of 7-tuples of nonnegative numbers adding up to 20 has size (26 choose 6)=230230, and it has a natural correspondence with the ways of choosing 6 dividers among 26 spaces for dividers or objects. So, if you have a way to iterate through the 6 element subsets of a 26 element set, you can convert these to iterate through all indices.
You still have to weight the cases by the number of ways to draw 20 balls from 70 that produce that case. The weight of (a0,a1,a2,...,a6) is (10 choose a0) * (10 choose a1) * ... * (10 choose a6). This handles the case of impossible indices gracefully, since 10 choose 11 is 0, so the product is 0.
So, if you didn't know about the mathematical solution by the linearity of expectation, you could iterate through 230230 cases and compute a weighted average of the number of nonzero coordinates of the index vector, weighted by a product of small binomial terms.
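A sketch of that enumeration (the stars-and-bars decoding and the names are mine):
from math import comb
from itertools import combinations

def expected_colors_by_enumeration():
    total_weight = 0          # should come out to (70 choose 20)
    weighted_colors = 0
    # Stars and bars: choosing 6 divider positions among 26 slots encodes a
    # 7-tuple of nonnegative counts summing to 20.
    for dividers in combinations(range(26), 6):
        counts, prev = [], -1
        for d in dividers:
            counts.append(d - prev - 1)
            prev = d
        counts.append(25 - prev)
        # Weight = number of ways to draw exactly these counts; comb(10, a) is 0
        # whenever a > 10, so impossible indices drop out automatically.
        weight = 1
        for a in counts:
            weight *= comb(10, a)
        total_weight += weight
        weighted_colors += weight * sum(1 for a in counts if a > 0)
    return weighted_colors / total_weight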
Wouldn't it just be combinations with repetition?
http://www.mathsisfun.com/combinatorics/combinations-permutations.html
Make an urn with 10 of each color.
Decide on the number of trials you want.
Make a container to hold the result of each trial.
For each trial, pick a random sample of twenty items from the urn, make a set of those items, and add the length of that set to the results.
Find the average of the results.
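A sketch of that simulation in Python (the trial count and names are placeholders):
import random

def simulate(num_trials=100_000):
    urn = [color for color in range(7) for _ in range(10)]   # 10 balls of each of 7 colors
    total = 0
    for _ in range(num_trials):
        sample = random.sample(urn, 20)      # one trial: draw 20 balls without replacement
        total += len(set(sample))            # count distinct colors in the draw
    return total / num_trials

print(simulate())   # should land near the exact answer of about 6.82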
I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 of being drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item

nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
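For reference, here is a rough, NumPy-free sketch of the alias method as described above (structure and names are mine, not the linked implementation):
import random

def build_alias_table(probs):
    # O(n) preprocessing: build the "darts board" columns.
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]            # keep this share of column s for item s
        alias[s] = l                   # the rest of column s belongs to item l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:            # leftovers are full columns (rounding)
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # O(1) per sample: throw a dart at a random column, keep it or jump to its alias.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
prob, alias = build_alias_table(MyProba)
draws = [aList[alias_draw(prob, alias)] for _ in range(len(aList))]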
Here's my lazy method: build a list containing each value the expected number of times for the desired distribution, and use random.choice() to pick a value from that list.
>>> import random
>>>
>>> values = [3,4,2,1,4,3,5,7,6,4]
>>> probs = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
>>> expected_dist = sum([[v] * int(p * 100) for v, p in zip(values, probs)], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and build a tree from these intervals. Then you will get logarithmic complexity for looking up the element corresponding to the generated probability, instead of the linear complexity you have now.
You're calculating cumulative_proba every time you call random_pick. I suggest calculating it once, outside the method, and storing it in a better data structure, like a binary search tree, which will reduce the per-draw time complexity from O(n) to O(log n).
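A small sketch combining those two suggestions, using a precomputed cumulative array and binary search via the standard bisect module (names are mine):
import bisect
import random
from itertools import accumulate

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]

cumulative = list(accumulate(MyProba))        # O(n) preprocessing, done once

def fast_pick(values, cumulative):
    x = random.random() * cumulative[-1]
    return values[bisect.bisect_right(cumulative, x)]   # O(log n) per draw

list_of_drawn_elements = [fast_pick(aList, cumulative) for _ in range(len(aList))]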