Calculating the mean of a list split into sub-lists - Python

If I have a big list, numpy array, etc. that I need to split into sub-lists, how can I efficiently calculate the statistics (mean, standard deviation, etc.) for the whole list?
As a simple example, let's say that I have this small list:
l = [2,1,4,1,2,1,3,2,1,5]
>>> mean(l)
2.2000000000000002
But, if for some reason I need to split into sub-lists:
l1 = [2,1,4,1]
l2 = [2,1,3,2]
l3 = [1,5]
Of course, you don't need to know a lot about mathematics to know that this is NOT TRUE:
mean(l) = mean(mean(l1), mean(l2), mean(l3))
This would only be true if every sub-list had the same length, which is not the case here.
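A quick numeric check with the sub-lists above:
>>> (mean(l1) + mean(l2) + mean(l3)) / 3    # (2.0 + 2.0 + 3.0) / 3
2.3333333333333335
which is clearly not the 2.2 obtained for the whole list.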
The background of this question is the case where you have a very big dataset that does not fit into memory and therefore has to be split into chunks.

In general, you need to keep the so-called sufficient statistics for each subset. For the mean and standard deviation, the sufficient statistics are the number of data, their sum, and their sum of squares. Given those 3 quantities for each subset, you can compute the mean and standard deviation for the whole set.
The sufficient statistics are not necessarily any smaller than the subset itself. But for mean and standard deviation, the sufficient statistics are just a few numbers.
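For example, here is a minimal sketch (the helper names chunk_stats and combine are mine) that reproduces the whole-list mean and population standard deviation from per-chunk counts, sums, and sums of squares:
import math

def chunk_stats(chunk):
    # sufficient statistics for one chunk: (count, sum, sum of squares)
    return len(chunk), sum(chunk), sum(x * x for x in chunk)

def combine(stats):
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    variance = total_sq / n - mean ** 2    # population variance
    return mean, math.sqrt(variance)

chunks = [[2, 1, 4, 1], [2, 1, 3, 2], [1, 5]]
print(combine([chunk_stats(c) for c in chunks]))   # (2.2, 1.3266...), same as for the full list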

I assume you know the total number of data points, i.e., len(l)? Then you could just calculate a sum for each list individually (i.e., map-reduce) or keep a running sum (e.g., if you are reading with readline()), and then divide by len(l) at the very end?
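A rough sketch of the running-sum idea, assuming the data sits in a hypothetical file data.txt with one value per line:
total = 0.0
count = 0
with open("data.txt") as f:     # hypothetical file, read one line at a time
    for line in f:
        total += float(line)
        count += 1
print(total / count)            # mean computed in a single pass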

Generate non-uniform random numbers [duplicate]

This question already has an answer here: Fast way to obtain a random index from an array of weights in python (1 answer). Closed 4 years ago.
Algo (Source: Elements of Programming Interviews, 5.16)
You are given n numbers as well as probabilities p0, p1, ..., pn-1,
which sum up to 1. Given a random number generator that produces values in
[0,1] uniformly, how would you generate one of the n numbers according
to their specified probabilities?
Example
If numbers are 3, 5, 7, 11, and the probabilities are 9/18, 6/18,
2/18, 1/18, then in 1000000 calls to the program, 3 should appear
500000 times, 7 should appear 111111 times, etc.
The book says to create intervals p0, p0 + p1, p0 + p1 + p2, etc., so in the example above the intervals are [0.0, 0.5), [0.5, 0.8333), etc. Combining these intervals into a sorted array of endpoints could look something like [1/18, 3/18, 9/18, 18/18]. Then run the random number generator and find the smallest endpoint that is larger than the generated value - the array index it corresponds to maps to an index in the given n numbers.
This would require O(N) pre-processing time and then O(log N) to binary search for the value.
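For reference, a small sketch of that prefix-sum + binary-search approach (the function names are mine):
import bisect
import itertools
import random

def make_sampler(numbers, probs):
    cumulative = list(itertools.accumulate(probs))      # e.g. [0.5, 0.8333, 0.9444, 1.0]
    def sample():
        u = random.random()                             # uniform value in [0, 1)
        i = bisect.bisect_right(cumulative, u)          # O(log N) binary search
        return numbers[min(i, len(numbers) - 1)]        # clamp in case of float round-off
    return sample

sample = make_sampler([3, 5, 7, 11], [9/18, 6/18, 2/18, 1/18])
counts = {}
for _ in range(1_000_000):
    x = sample()
    counts[x] = counts.get(x, 0) + 1
print(counts)    # roughly {3: 500000, 5: 333333, 7: 111111, 11: 55556}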
I have an alternate solution that requires O(N) pre-processing time and O(1) execution time, and am wondering what may be wrong with it.
Why can't we iterate through each of the n numbers, building [x] * int(100 * probability of x) for each number x? E.g. [3] * int((9/18) * 100). Concatenate all these lists to get, at the end, a list of 100 elements, with the count of each number mapping to how likely it is to occur. Then run the random number function, index into the array, and return the value.
Wouldn't this be more efficient than the provided solution?
Your number 100 is not independent of the input; it depends on the given p values. Any parameter that depends on the magnitude of the input values is really exponential in the input size, meaning you are actually using exponential space. Just constructing that array would thus take exponential time, even if it was structured to allow constant lookup time after generating the random number.
Consider two p values, 0.01 and 0.99. 100 values is sufficient to implement your scheme. Now consider 0.001 and 0.999. Now you need an array of 1,000 values to model the probability distribution. The amount of space grows with (I believe) the ratio of the largest p value and the smallest, not in the number of p values given.
If you have rational probabilities, you can make that work. Rather than 100, you must use a common denominator of the rational proportions. Insisting on 100 items will not fulfill the specs of your assigned example, let alone more diabolical ones.
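A sketch of that common-denominator idea, using the question's numbers (the helper name build_table is mine; math.lcm needs Python 3.9+):
from fractions import Fraction
from math import lcm
import random

def build_table(numbers, probs):
    probs = [Fraction(p) for p in probs]
    denom = lcm(*(p.denominator for p in probs))        # the common denominator sets the table size
    table = []
    for n, p in zip(numbers, probs):
        table.extend([n] * (p.numerator * denom // p.denominator))
    return table

table = build_table([3, 5, 7, 11], [Fraction(9, 18), Fraction(6, 18), Fraction(2, 18), Fraction(1, 18)])
print(len(table))                            # 18 entries, not 100
print(table[random.randrange(len(table))])   # O(1) lookup per draw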

use np.random.multinomial() in python

I have a task to randomly choose 100 elements from a population list [a,b,c,d] with corresponding frequencies (probabilities) [0.1, 0.3, 0.2, 0.4].
There are many different ways to do it. But here I want the function (suppose there is one) to return a list of the number of elements chosen per item. Say it returns (20,20,30,30); that means 20 copies of element a were chosen, 20 of b, 30 of c, and 30 of d.
I figured that np.random.multinomial is the way to go. Following the above example, I will need to call np.random.multinomial(100, [0.1,0.3,0.2,0.4], 1). Is this right? Thanks.
Related:
fast way to uniformly remove 10% of all the elements in a given list of python
Yes, np.random.multinomial(100, [0.1,0.3,0.2,0.4], 1) is correct. But since you are doing only one draw you might prefer the simpler np.random.multinomial(100, [0.1,0.3,0.2,0.4]) (without the ,1), which returns a plain array instead of an array containing one array.
I agree with JulienD. The word "choose" and the given probabilities just don't go together.
When we say "choose", we usually mean selecting specific items, without regard to order.
When probabilities are given, we mean they are constant probabilities (unless stated to be conditional), so the items are "assigned" to categories with the given probabilities.
Of course, the counts in the categories are not exactly 100 * probabilities. That would be the expected value over the long run. Just like when you toss a fair coin, you don't expect it to come out HTHTHT...HT, but over the long run the count of H will be half the total tosses.
import numpy.random as npr
npr.seed(123)
npr.multinomial(100, [0.1,0.3,0.2,0.4], 1)
# Out: array([[11, 27, 18, 44]])
As the number of simulations increases, the observed proportions will converge to the given probabilities.
simulations = 1000
sum(npr.multinomial(100, [0.1,0.3,0.2,0.4], simulations))/simulations/100
# Out: array([ 0.09995, 0.29991, 0.19804, 0.4021 ])

"Running" weighted average

I'm constantly adding tuples to and removing tuples from a list in Python, and I'm interested in the weighted average (not the list itself). Since this part is computationally quite expensive compared to the rest, I want to optimise it. What's the best way of keeping track of the weighted average? I can think of two methods:
keeping the list and calculating the weighted average every time it gets accessed/changed (my current approach)
just keep track of current weighted average and the sum of all weights and change weight and current weighted average for every add/remove action
I would prefer the 2nd option, but I am worried about "floating point errors" induced by constant addition/subtraction. What's the best way of dealing with this?
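A minimal sketch of the second option - keep only the weighted sum and the sum of weights and divide when the average is read (the class name is mine); occasionally recomputing both totals from the underlying list would bound any accumulated floating point error:
class RunningWeightedAverage:
    def __init__(self):
        self.weighted_sum = 0.0    # running sum of value * weight
        self.weight_total = 0.0    # running sum of weights

    def add(self, value, weight):
        self.weighted_sum += value * weight
        self.weight_total += weight

    def remove(self, value, weight):
        self.weighted_sum -= value * weight
        self.weight_total -= weight

    def average(self):
        return self.weighted_sum / self.weight_total

avg = RunningWeightedAverage()
avg.add(3.0, 2.0)
avg.add(5.0, 1.0)
print(avg.average())    # (3*2 + 5*1) / 3 = 3.666...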
Try doing it in integers? Python bignums should make a rational argument for rational numbers (sorry, it's late... really sorry actually).
It really depends on how many terms you are using and what your weighting coefficient is as to whether you will experience much floating point drift. You only get 53 bits of precision, but you might not need that much.
If your weighting factor is less than 1, then your error should be bounded since you are constantly decreasing it. Let's say your weight is 0.6 (horrible, because you cannot represent it exactly in binary). In binary that is 0.100110011..., repeating, so it has to be rounded off when stored in a double. Any error you introduce from that rounding will then be decreased each time you multiply again; the error in the most recent term will dominate.
Don't do the final division until you need to. Once again, given 0.6 as your weight and 10 terms, your term weights will be 99.22903012752124 for the first term all the way down to 1 for the last term (0.6**-t). Multiply your new term by 99.22..., add it to your running sum and subtract the trailing term out, then divide by 246.5725753188031 (sum([0.6**-x for x in range(0,10)])).
If you really want to adjust for that, you can add a ULP to the term you are about to remove, but this will just underestimate intentionally, I think.
Here is an answer that retains floating point for keeping a running total - I think a weighted average requires only two running totals:
Allocate an array to store your numbers in, so that inserting a number means finding an empty slot in the array and setting it to that value, and deleting a number means setting its value in the array to zero and declaring that slot empty - you can use a linked list of free entries to find empty slots in O(1) time.
Now you need to work out the sum of an array of size N. Treat the array as a full binary tree, as in heapsort, so offset 0 is the root, 1 and 2 are its children, 3 and 4 are the children of 1, 5 and 6 are the children of 2, and so on - the children of i are at 2i+1 and 2i+2.
For each internal node, keep the sum of all entries at or below that node in the tree. Now when you modify an entry you can recalculate the sum of the values in the array by working your way from that entry up to the root of the tree, correcting the partial sums as you go - this costs you O(log N) where N is the length of the array.
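A minimal sketch of that tree-of-partial-sums idea (class and method names are mine); for a weighted average you would keep two of these, one for weight*value and one for the weights:
class SubtreeSums:
    # the array is treated as a heap-ordered binary tree: children of i are 2i+1 and 2i+2
    def __init__(self, n):
        self.value = [0.0] * n
        self.subtotal = [0.0] * n    # sum of value[j] over the subtree rooted at i

    def update(self, i, new_value):
        self.value[i] = new_value
        while True:                              # walk from i up to the root: O(log N)
            left, right = 2 * i + 1, 2 * i + 2
            s = self.value[i]
            if left < len(self.value):
                s += self.subtotal[left]
            if right < len(self.value):
                s += self.subtotal[right]
            self.subtotal[i] = s
            if i == 0:
                break
            i = (i - 1) // 2

    def total(self):
        return self.subtotal[0]

t = SubtreeSums(8)
t.update(3, 2.5)      # "insert" 2.5 at slot 3
t.update(5, 4.0)
t.update(3, 0.0)      # "delete" the entry at slot 3
print(t.total())      # 4.0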

Algorithm to calculate point at which to round values in an array up or down in order to least affect the mean

Consider a random array of values between 0 and 1, such as:
[0.1,0.2,0.8,0.9]
Is there a way to calculate the point at which the values should be rounded down or up to an integer in order to match the mean of the un-rounded array as closely as possible? (In the above case the point happens to be at the mean, but that is purely a coincidence.)
Or is it just trial and error?
I'm coding in Python.
Thanks for any help.
Add them up, then round the sum. That's how many 1s you want. Round so you get that many 1s.
def rounding_point(l):
    # if the input is sorted, you don't need the following line
    l = sorted(l)
    ones_needed = int(round(sum(l)))
    # this may require adjustment if there are duplicates in the input
    return 1.0 if ones_needed == len(l) else l[-ones_needed]
If sorting the list turns out to be too expensive, you can use a selection algorithm like quickselect. Python doesn't come with a quickselect function built in, though, so don't bother unless your inputs are big enough that the asymptotic advantage of quickselect outweighs the constant factor advantage of the highly-optimized C sorting algorithm.
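For instance, on the array from the question the sum rounds to 2, so the threshold lands on the second-largest value:
>>> rounding_point([0.1, 0.2, 0.8, 0.9])
0.8
Rounding 0.8 and 0.9 up and the rest down gives [0, 0, 1, 1], whose mean of 0.5 matches the original mean exactly.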

Divide set into subsets with equal number of elements

For the purpose of conducting a psychological experiment I have to divide a set of pictures (240) described by 4 features (real numbers) into 3 subsets with an equal number of elements in each subset (240/3 = 80), in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).
Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?
If I understand your problem correctly, you might use random.sample() in Python:
import random
pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) // 3 # we need 3 subsets
set1 = set(random.sample(list(pool), slen)) # 1st random subset (sample() needs a sequence in Python 3.11+)
pool -= set1
set2 = set(random.sample(list(pool), slen)) # 2nd random subset
pool -= set2
set3 = pool # 3rd random subset
I would tackle this as follows:
Divide into 3 equal subsets.
Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
Compare each pair of elements in different subsets; if swapping them would reduce the "unevenness", swap them. Continue until there are either no more pairs to compare, or the total unevenness is below some arbitrary "good enough" threshold.
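A rough sketch of that swap heuristic, assuming each item is a tuple of 4 feature values; the "unevenness" measure used here (the spread of per-subset means and standard deviations) is just one possible choice, and the function names are mine:
import random
import statistics

def unevenness(subsets, n_features=4):
    # spread of the per-subset means and standard deviations, summed over features
    score = 0.0
    for f in range(n_features):
        means = [statistics.mean(item[f] for item in s) for s in subsets]
        sds = [statistics.pstdev(item[f] for item in s) for s in subsets]
        score += max(means) - min(means) + max(sds) - min(sds)
    return score

def balance(items, k=3, sweeps=10):
    items = list(items)
    random.shuffle(items)
    subsets = [items[i::k] for i in range(k)]            # step 1: k equal subsets
    for _ in range(sweeps):                              # step 3: keep swapping while it helps
        improved = False
        for a in range(k):
            for b in range(a + 1, k):
                for i in range(len(subsets[a])):
                    for j in range(len(subsets[b])):
                        before = unevenness(subsets)
                        subsets[a][i], subsets[b][j] = subsets[b][j], subsets[a][i]
                        if unevenness(subsets) < before:
                            improved = True              # keep the swap
                        else:
                            subsets[a][i], subsets[b][j] = subsets[b][j], subsets[a][i]  # undo
        if not improved:
            break
    return subsets
# usage: groups = balance(list_of_240_feature_tuples) - note each sweep is quadratic, so this is only a sketch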
You can easily do this using the plyr library in R. Here is the code.
require(plyr)
# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))
# SPLIT BY FEATURE AND DIVIDE INTO THREE SUBSETS EQUALLY
ddply(mydf, .(feature), summarize, sub = sample(1:3, length(feature), replace = TRUE)) # one subset label per row in each group
In case you are still interested in the exhaustive search: you have 240 choose 80 possibilities for the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:
120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
Clearly, this is not an option :)
Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.
Assign X[3*i], X[3*i+1], X[3*i+2] to the subsets A, B, C, choosing for each i the ordering of A/B/C that minimizes your mismatch measure.
Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.
The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
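A rough numpy sketch of that ordering-and-dealing procedure; the particular "mismatch" measure here (the spread of the running per-subset feature sums) is my own stand-in, not the answerer's:
import numpy as np
from itertools import permutations

def split_by_mahalanobis(X, k=3):
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    centered = X - mu
    d2 = np.einsum('ij,jk,ik->i', centered, cov_inv, centered)   # squared Mahalanobis distances
    order = np.argsort(-d2)                                      # most extraordinary items first
    subsets = [[] for _ in range(k)]
    sums = np.zeros((k, X.shape[1]))                             # running feature sums per subset
    for start in range(0, len(order), k):
        chunk = order[start:start + k]
        best_perm, best_cost = None, None
        for perm in permutations(range(k), len(chunk)):          # try every assignment of this group
            trial = sums.copy()
            for item, s in zip(chunk, perm):
                trial[s] += X[item]
            cost = np.ptp(trial, axis=0).sum()                   # how uneven the running sums are
            if best_cost is None or cost < best_cost:
                best_perm, best_cost = perm, cost
        for item, s in zip(chunk, best_perm):
            subsets[s].append(int(item))
            sums[s] += X[item]
    return subsets                                               # lists of row indices into X

pictures = np.random.rand(240, 4)
groups = split_by_mahalanobis(pictures)
print([len(g) for g in groups])    # [80, 80, 80]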
