"Running" weighted average - python

I'm constantly adding tuples to and removing them from a list in Python, and I'm interested in the weighted average (not the list itself). Since this part is computationally quite expensive compared to the rest, I want to optimise it. What's the best way of keeping track of the weighted average? I can think of two methods:
1. keep the list and calculate the weighted average every time it gets accessed/changed (my current approach)
2. just keep track of the current weighted average and the sum of all weights, and update both for every add/remove action
I would prefer the 2nd option, but I am worried about "floating point errors" induced by constant addition/subtraction. What's the best way of dealing with this?

Try doing it in integers? Python bignums should make a rational argument for rational numbers (sorry, it's late... really sorry actually).
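For example, a minimal sketch of the second option using fractions.Fraction, so the running sums stay exact instead of drifting (the class and method names here are just illustrative):

from fractions import Fraction

class RunningWeightedAverage:
    # keep a running weighted sum and a running weight total; divide only on read
    def __init__(self):
        self.weighted_sum = Fraction(0)
        self.weight_total = Fraction(0)

    def add(self, value, weight):
        self.weighted_sum += Fraction(value) * Fraction(weight)
        self.weight_total += Fraction(weight)

    def remove(self, value, weight):
        self.weighted_sum -= Fraction(value) * Fraction(weight)
        self.weight_total -= Fraction(weight)

    def average(self):
        return self.weighted_sum / self.weight_total   # exact; wrap in float() if needed

Passing weights as strings (e.g. Fraction("0.6")) keeps the decimal values exact; passing floats still works, it just locks in the usual binary rounding of that float once instead of accumulating it.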
It really depends on how many terms you are using and what your weighting coefficient is as to whether you will experience much floating-point drift. You only get 53 bits of precision, but you might not need that much.
If your weighting factor is less than 1, then your error should be bounded, since you are constantly decreasing it. Let's say your weight is 0.6 (awkward, because you cannot represent it exactly in binary). In binary that is 0.1001100110011..., which gets rounded off in the last bit when stored. Any error you introduce from that rounding will then be decreased each time you multiply by the weight again, so the error in the most recent term will dominate.
Don't do the final division until you need to. Once again, given 0.6 as your weight and 10 terms, your term weights (0.6**-t) will be 99.22903012752124 for the first term, all the way down to 1 for the last term. Multiply your new term by 99.22..., add it to your running sum and subtract the trailing term out, then divide by 246.5725753188031 (sum([0.6**-x for x in range(0,10)])).
If you really want to adjust for that, you can add a ULP to the term you are about to remove, but this will just underestimate intentionally, I think.

Here is an answer that retains floating point for keeping a running total - I think a weighted average requires only two running totals:
Allocate an array to store your numbers in, so that inserting a number means finding an empty space in the array and setting it to that value, and deleting a number means setting its value in the array to zero and declaring that space empty - you can use a linked list of free entries to find empty entries in O(1) time.
Now you need to work out the sum of an array of size N. Treat the array as a full binary tree, as in heapsort, so offset 0 is the root, 1 and 2 are its children, 3 and 4 are the children of 1, 5 and 6 are the children of 2, and so on - the children of i are at 2i+1 and 2i+2.
For each internal node, keep the sum of all entries at or below that node in the tree. Now when you modify an entry you can recalculate the sum of the values in the array by working your way from that entry up to the root of the tree, correcting the partial sums as you go - this costs you O(log N) where N is the length of the array.
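A rough sketch of that scheme, assuming a fixed-size array and keeping the partial sums in a parallel array (names are illustrative, not from any particular library):

class PartialSumArray:
    # array treated as a full binary tree: children of i are at 2i+1 and 2i+2;
    # sums[i] holds the sum of all values at or below node i
    def __init__(self, n):
        self.values = [0.0] * n
        self.sums = [0.0] * n

    def set(self, i, value):
        self.values[i] = value
        while True:                        # fix partial sums from i up to the root
            left, right = 2 * i + 1, 2 * i + 2
            self.sums[i] = self.values[i]
            if left < len(self.values):
                self.sums[i] += self.sums[left]
            if right < len(self.values):
                self.sums[i] += self.sums[right]
            if i == 0:
                break
            i = (i - 1) // 2               # parent of i

    def total(self):
        return self.sums[0]                # root sum = sum of the whole array

For the weighted average you would keep two of these, one holding weight*value and one holding the weights, and divide the two totals whenever the average is read.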

Related

How do you take randomness into account when finding the complexity of a function?

I've been trying to understand complexity analysis, but all the online material has left me confused, especially the part where they create actual mathematical functions. I have a for loop and a while loop. My confusion arises from the while loop. I know the complexity of the for loop is O(n), but the while loop is based on randomness: a random number is picked, and if this number is not in the list, it is added and the while loop broken. My confusion arises because the while loop may run, in the worst case (in my thinking), some m number of times until it is done. So I was thinking the complexity would then be O(n*m)?
I'm just really lost, and need some help.
Technically the worst-case complexity is O(inf): random.randint, if we consider it a true random generator (it isn't, of course), can produce an arbitrarily long sequence of equal elements. However, we can estimate an "average-case" complexity. It isn't the real average-case complexity (best, worst and average cases must be defined by the input, not by randomness), but it can show how many iterations the program will do if we run it for a fixed n multiple times and take the average of the results.
Let's note that the list works as a set here (you never add a repeated number), so I'd stick with a not in set comparison instead, which is O(1) (while not in list is O(i)), to remove that source of complexity and simplify things a bit: now the count of iterations and the complexity can be estimated with the same big-O limits. A single trial here is choosing from the uniform integer distribution on [1; n]. Success is choosing a number that is not in the set yet.
Then what's the expected number of trials before getting an item that is not in the set? The set size before each step is i in your code, so we can pick any of n-i numbers, and the probability of success is p_i = (n-i)/n (as the distribution is uniform). Every outer iteration is an instance of the geometric distribution: the count of trials before the first success. So the expected count of while iterations is n_i = 1 / p_i = n / (n-i). To get the final complexity we should sum these counts over the for iterations: sum(n_i for i in range(n)). This is equal to n * Harmonic(n), where Harmonic(n) is the n-th harmonic number (the sum of the reciprocals of the first n natural numbers). Harmonic(n) ~ O(log n), thus the "average-case" complexity of this code is O(n log n).
For a list it will be sum(i*n / (n-i) for i in range(n)) ~ O(n^2 log(n)) (the proof of this bound is a little longer).
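If you want to sanity-check the n * Harmonic(n) estimate, a quick simulation of the loop described in the question (draw uniformly from 1..n until an unseen value appears, repeat until all n are collected) should land close to the formula; this is just an illustrative experiment, not part of the original code:

import random

def count_trials(n):
    seen = set()
    trials = 0
    while len(seen) < n:
        trials += 1
        x = random.randint(1, n)
        if x not in seen:
            seen.add(x)
    return trials

n, runs = 500, 50
simulated = sum(count_trials(n) for _ in range(runs)) / runs
predicted = n * sum(1.0 / k for k in range(1, n + 1))   # n * Harmonic(n)
print(simulated, predicted)   # the two numbers should be close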
Big O notation is used for worst-case scenarios only.
Figure out what the worst case for the given loop could be.
Make a function in n, take the highest power of n, and ignore any constants; that gives you the time complexity.

Collapsing set of strings based on a given hamming distance

Given a set of strings (first column) along with counts (second column), e.g.:
aaaa 10
aaab 5
abbb 3
cbbb 2
dbbb 1
cccc 8
Are there any algorithms or even implementations (ideally as a Unix executable, R or Python) which collapse this set into a new set based on a given Hamming distance?
Collapsing implies adding the count
Strings with a lower count are collapsed into strings with higher counts.
For example, say for Hamming distance 1: the above set would collapse the second string aaab into aaaa, since they are one Hamming distance apart and aaaa has the higher count.
The collapsed entry would have the combined count, here aaaa 15.
For this set we would therefore get the following collapsed set:
aaaa 15
abbb 6
cccc 8
Ideally, an implementation should be efficient, so even heuristics which do not guarantee an optimal solution would be appreciated.
Further background and motivation
Calculating the Hamming distance between 2 strings (a pair) has been implemented in most programming languages. A brute-force solution would compute the distance between all pairs. Maybe there is no way around it. However, I'd imagine an efficient solution would avoid calculating the distance for all pairs. There may be clever ways to save some calculations based on metric theory (since Hamming distance is a metric), e.g. if the Hamming distances between x and z and between x and y are known, the triangle inequality might let me avoid calculating the distance between y and z. Maybe there is a clever k-mer approach, or maybe an efficient solution for a constant distance (say d=1).
Even if there is only a brute-force solution, I'd be curious whether this has been implemented before and how to use it (ideally without me having to implement it myself).
I thought up the following:
It reports the item with the highest score together with the sum of its score and the scores of its nearby neighbors. Once a neighbor is used, it is not reported separately.
I suggest using a Vantage-point tree as the metric index.
The algorithm would look like this:
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. for the string with the highest score in the max heap:
4. use the metric index to find the nearby strings
5. print the string, and the sum of its score and its nearby strings' scores
6. remove the string and each of the nearby strings from the metric index
7. remove the string and each of the nearby strings from the max heap
8. repeat 3-7 until the max heap is empty
Perhaps this could be simplified by using a used table rather than removing anything. The metric index would then not need efficient deletion, nor would the max heap need to support deletion by value. But this would be slower if the neighborhoods are large and overlap frequently, so efficient deletion might be a necessary difficulty. A rough sketch of this used-table variant follows the list below.
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. construct the used table as an empty set
4. for the string with the highest score in the max heap:
5. if this string is in the used table: start over with the next string
6. use the metric index to find the nearby strings
7. remove any nearby strings that are in the used table
8. print the string, and the sum of its score and its nearby strings' scores
9. add the string and the nearby strings to the used table
10. repeat 4-9 until the max heap is empty
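A rough Python sketch of this used-table variant, with one simplification: a brute-force scan over all strings stands in for the vantage-point tree, so each neighborhood lookup is O(N) rather than what a real metric index would give you (function names are illustrative):

from heapq import heapify, heappop

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def collapse(counts, d=1):
    # counts: dict mapping string -> count
    heap = [(-c, s) for s, c in counts.items()]   # max heap via negated counts
    heapify(heap)
    used = set()
    collapsed = {}
    while heap:
        neg_count, s = heappop(heap)
        if s in used:
            continue           # already swallowed by a higher-count string
        used.add(s)            # mark the reported string itself as used too
        # brute-force stand-in for the metric index lookup
        neighbors = [t for t in counts
                     if t not in used and hamming(s, t) <= d]
        collapsed[s] = -neg_count + sum(counts[t] for t in neighbors)
        used.update(neighbors)
    return collapsed

# collapse({"aaaa": 10, "aaab": 5, "abbb": 3, "cbbb": 2, "dbbb": 1, "cccc": 8})
# -> {"aaaa": 15, "cccc": 8, "abbb": 6}

Swapping the list comprehension for a vantage-point tree query is what would restore the sub-quadratic behaviour discussed below.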
I cannot provide a complexity analysis.
I was thinking about the second algorithm. The part that I thought was slow was checking the neighborhood against the used table. This is not needed, as deletion from a vantage-point tree can be done in linear time. When searching for the neighbors, remember where they were found and then remove them later using these locations. If a neighbor is used as a vantage point, mark it as removed so that a search will not return it, but leave it alone otherwise. This, I think, restores it to below quadratic; otherwise it would be something like the number of items times the size of the neighborhood.
In response to the comment: the problem statement was "Strings with a lower count are collapsed into strings with higher counts", and that is what this computes. It is not a greedy approximation that could produce a non-optimal result, as there was nothing to maximize or minimize; it is an exact algorithm. It returns the item with the highest score combined with the scores of its neighborhood.
This can be viewed as assigning a leader to each neighborhood such that each item has at most one leader and that leader has the largest overall score so far. It can also be viewed as a directed graph.
The specification wasn't for a dynamic programming or optimization problem. For that you would ask for the item with the highest score in the highest total-scoring neighborhood. That can also be solved in a similar way, by changing the ranking function for a string from its score alone to the pair (sum of its score and its neighborhood's scores, its score).
It does mean that it can't be solved with a max heap over the scores, as removing items affects the neighbors of the neighborhood, and one would have to recalculate their neighborhood scores before again finding the item with the highest total-scoring neighborhood.

An algorithm to stochastically pick the top element of a set with some noise

I'd like to find a method (e.g. in Python) which, given a sorted list, picks the top element with some error epsilon.
One way would be to pick the top element with probability p < 1 and then the 2nd with p' < p and so on with an exponential decay.
Ideally, though, I'd like a method that takes into account the winning margin of the top element, with some noise. That is:
Given a list [a,b,c,d,e,....] in which a is the largest element, b the second largest and so on,
Pick the top element with probability p < 1, where p depends on the value of a-b, and p' on the value of b-c and so on.
You can't do exactly that, since if you have n elements you will only have n-1 differences between consecutive elements. The standard method of doing something similar is fitness proportionate selection (the link provides code in Java and Ruby; it should be fairly easy to translate to other languages).
For other variants of the idea, look up selection operators for genetic algorithms (there are various).
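For reference, a minimal sketch of fitness proportionate (roulette-wheel) selection in Python, assuming non-negative weights (this is a generic version, not the code from the linked page):

import random

def roulette_select(weights):
    # pick an index with probability proportional to its weight
    total = sum(weights)
    r = random.uniform(0, total)
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if r <= running:
            return i
    return len(weights) - 1   # guard against floating-point round-off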
One way to do that is to select element k with probability proportional to exp(-(x[k] - x[0])/T) where x[0] is the least element and T is a free parameter, analogous to temperature. This is inspired by an analogy to thermodynamics, in which low-energy (small x[k]) states are more probable, and high-energy (large x[k]) states are possible, but less probable; the effect of temperature is to focus on just the most probable states (T near zero) or to select from all the elements with nearly equal probability (large T).
The method of simulated annealing is based on this analogy, perhaps you can get some inspiration from that.
EDIT: Note that this method gives nearly-equal probability to elements which have nearly-equal values; from your description, it sounds like that's something you want.
SECOND EDIT: I have it backwards; what I wrote above makes lesser values more probable. Probability proportional to exp(-(x[n - 1] - x[k])/T) where x[n - 1] is the greatest value makes greater values more probable instead.
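A small sketch of that corrected version, with T as the temperature knob (it just feeds the exp weights into random.choices; names are illustrative):

import math
import random

def pick_with_temperature(xs, T=1.0):
    # weight each element by exp(-(max - x) / T): the largest element gets
    # weight 1 and the rest fall off the further they are behind it
    x_max = max(xs)
    weights = [math.exp(-(x_max - x) / T) for x in xs]
    return random.choices(range(len(xs)), weights=weights, k=1)[0]

Small T makes the pick almost deterministic; large T makes it nearly uniform.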

Generate non-uniform random numbers [duplicate]

This question already has an answer here:
Fast way to obtain a random index from an array of weights in python
(1 answer)
Closed 4 years ago.
Algo (Source: Elements of Programming Interviews, 5.16)
You are given n numbers as well as probabilities p0, p1, ..., pn-1,
which sum up to 1. Given a random number generator that produces values in
[0,1] uniformly, how would you generate one of the n numbers according
to their specified probabilities?
Example
If numbers are 3, 5, 7, 11, and the probabilities are 9/18, 6/18,
2/18, 1/18, then in 1000000 calls to the program, 3 should appear
500000 times, 7 should appear 111111 times, etc.
The book says to create intervals p0, p0 + p1, p0 + p1 + p2, etc., so in the example above the intervals are [0.0, 0.5), [0.5, 0.8333...), etc., and combining these intervals into a sorted array of endpoints could look something like [1/18, 3/18, 9/18, 18/18]. Then run the random number generator, and find the smallest element that is larger than the generated value - the array index that it corresponds to maps to an index in the given n numbers.
This would require O(N) pre-processing time and then O(log N) to binary search for the value.
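A minimal sketch of that interval/binary-search approach, using itertools.accumulate and bisect (names are illustrative):

import bisect
import random
from itertools import accumulate

def make_sampler(numbers, probabilities):
    endpoints = list(accumulate(probabilities))    # O(N) preprocessing
    def sample():
        r = random.random()
        # O(log N): smallest endpoint strictly greater than r
        i = bisect.bisect_right(endpoints, r)
        return numbers[min(i, len(numbers) - 1)]   # clamp against rounding at 1.0
    return sample

# sample = make_sampler([3, 5, 7, 11], [9/18, 6/18, 2/18, 1/18])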
I have an alternate solution that requires O(N) pre-processing time and O(1) execution time, and am wondering what may be wrong with it.
Why can't we iterate through each of the n numbers, building a small array [n] * (probability * 100) for each, e.g. [3] * (9/18 * 100), i.e. fifty 3s? Concatenate all these arrays to get, at the end, a list of 100 elements, where the number of copies of each value maps to how likely it is to occur. Then run the random number function, scale it to an index into the array, and return the value found there.
Wouldn't this be more efficient than the provided solution?
Your number 100 is not independent of the input; it depends on the given p values. Any parameter that depends on the magnitude of the input values is really exponential in the input size, meaning you are actually using exponential space. Just constructing that array would thus take exponential time, even if it was structured to allow constant lookup time after generating the random number.
Consider two p values, 0.01 and 0.99. 100 values are sufficient to implement your scheme. Now consider 0.001 and 0.999. Now you need an array of 1,000 values to model the probability distribution. The amount of space grows with (I believe) the ratio of the largest p value to the smallest, not with the number of p values given.
If you have rational probabilities, you can make that work. Rather than 100, you must use a common denominator of the rational proportions. Insisting on 100 items will not fulfill the specs of your assigned example, let alone more diabolical ones.
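A sketch of that rational-probability variant, sizing the table by the least common denominator rather than a fixed 100 (math.lcm needs Python 3.9+; names are illustrative):

import random
from fractions import Fraction
from math import lcm

def build_table(numbers, probs):
    fracs = [Fraction(p) for p in probs]            # e.g. Fraction("9/18")
    denom = lcm(*(f.denominator for f in fracs))    # common denominator
    table = []
    for n, f in zip(numbers, fracs):
        table.extend([n] * (f.numerator * denom // f.denominator))
    return table                                    # len(table) == denom

# table = build_table([3, 5, 7, 11], ["9/18", "6/18", "2/18", "1/18"])
# random.choice(table)   # O(1) per draw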

Algorithm to calculate point at which to round values in an array up or down in order to least affect the mean

Consider a random array of values between 0 and 1, such as:
[0.1,0.2,0.8,0.9]
Is there a way to calculate the point at which the values should be rounded down or up to an integer in order to match the mean of the un-rounded array as closely as possible? (In the above case the point would be at the mean, but that is purely a coincidence.)
Or is it just trial and error?
I'm coding in Python.
Thanks for any help.
Add them up, then round the sum. That's how many 1s you want. Round so you get that many 1s.
def rounding_point(l):
    # values at or above the returned threshold round up to 1,
    # everything below it rounds down to 0
    # if the input is sorted, you don't need the following line
    l = sorted(l)
    ones_needed = int(round(sum(l)))
    if ones_needed == 0:
        return 1.0  # nothing should round up, so put the threshold above every value
    # this may require adjustment if there are duplicates in the input
    return l[-ones_needed]
If sorting the list turns out to be too expensive, you can use a selection algorithm like quickselect. Python doesn't come with a quickselect function built in, though, so don't bother unless your inputs are big enough that the asymptotic advantage of quickselect outweighs the constant factor advantage of the highly-optimized C sorting algorithm.
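For the array from the question, the function above gives:

l = [0.1, 0.2, 0.8, 0.9]      # sum = 2.0, so two values should become 1
print(rounding_point(l))      # 0.8: rounding 0.8 and 0.9 up and the rest down
                              # keeps the mean at 0.5, same as the original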
