compute mean in python for a generator - python

I'm doing some statistics work, I have a (large) collection of random numbers to compute the mean of, I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers.
The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?
It would be nice if I could say "sum(values)/len(values)", but len doesn't work for generators, and sum has already consumed the values.
here's an example:
import numpy

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration:
        pass
    return float(Sum)/n

X = [k for k in range(1,7)]
Y = (k for k in range(1,7))
print numpy.mean(X)
print my_mean(Y)
these both give the same, correct, answer, but my_mean doesn't work for lists, and numpy.mean doesn't work for generators.
I really like the idea of working with generators, but details like this seem to spoil things.

In general if you're doing a streaming mean calculation of floating point numbers, you're probably better off using a more numerically stable algorithm than simply summing the generator and dividing by the length.
The simplest of these (that I know) is usually credited to Knuth, and also calculates variance. The link contains a python implementation, but just the mean portion is copied here for completeness.
def mean(data):
    n = 0
    mean = 0.0
    for x in data:
        n += 1
        mean += (x - mean)/n
    if n < 1:
        return float('nan')
    else:
        return mean
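For reference, here is a sketch of the fuller Welford-style update that also tracks the variance mentioned above (my extension, not copied from the linked post verbatim):
def mean_and_variance(data):
    # single-pass (streaming) mean and sample variance, Welford-style
    n = 0
    mean = 0.0
    m2 = 0.0                       # sum of squared deviations from the running mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # uses the updated mean
    if n < 2:
        return mean, float('nan')
    return mean, m2 / (n - 1)      # (mean, sample variance)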
I know this question is super old, but it's still the first hit on google, so it seemed appropriate to post. I'm still sad that the python standard library doesn't contain this simple piece of code.

Just one simple change to your code would let you use both. Generators were meant to be used interchangeably with lists in a for-loop.
def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n

def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print my_mean(X)
print my_mean(Y)
There is statistics.mean() in Python 3.4 but it calls list() on the input:
def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n
where _sum() returns an accurate sum (a math.fsum()-like function that, in addition to float, also supports Fraction and Decimal).
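So on Python 3.4+ you can simply pass the generator to statistics.mean(), at the cost of the temporary list it builds internally:
import statistics

Y = (k for k in range(1, 7))
print(statistics.mean(Y))  # 3.5; the generator is materialized as a list first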

The old-fashioned way to do it:
def my_mean(values):
    sum, n = 0, 0
    for x in values:
        sum += x
        n += 1
    return float(sum)/n

One way would be
numpy.fromiter(Y, int).mean()
but this actually temporarily stores the numbers.

Your approach is a good one, but you should use the for x in y idiom instead of repeatedly calling next until you get a StopIteration. This works for both lists and generators:
def my_mean(values):
    n = 0
    Sum = 0.0
    for value in values:
        Sum += value
        n += 1
    return float(Sum)/n

You can use reduce without knowing the size of the array:
from itertools import izip, count
reduce(lambda c, i: (c*(i[1]-1) + float(i[0]))/i[1], izip(values, count(1)), 0)
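On Python 3, itertools.izip is gone (the built-in zip is already lazy) and reduce lives in functools; a rough translation of the same running-mean idea (my sketch):
from functools import reduce

def my_mean(values):
    # enumerate(values, 1) yields (index, value), so the update is mean*(i-1)/i + value/i
    return reduce(lambda c, iv: (c * (iv[0] - 1) + iv[1]) / iv[0],
                  enumerate(values, 1), 0)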

def my_mean(values):
    n = 0
    sum = 0
    for v in values:
        sum += v
        n += 1
    return sum/n
The above is very similar to your code, except that by using for to iterate over values it works whether you get a list or an iterator.
The Python sum function is however very optimized, so unless the list is really, really long, you might be happier temporarily storing the data.
(Also note that on Python 3 you don't need float(sum)/n, since true division is the default.)

If you know the length of the generator in advance and you want to avoid storing the full list in memory, you can use:
reduce(np.add, generator)/length
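A small sketch of that (my example; on Python 3 reduce must be imported from functools):
from functools import reduce
import numpy as np

length = 1000
generator = (float(k) for k in range(length))
mean = reduce(np.add, generator) / length   # sums pairwise without building a list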

Try:
import itertools

def mean(i):
    (i1, i2) = itertools.tee(i, 2)
    return sum(i1) / sum(1 for _ in i2)

print mean([1,2,3,4,5])
tee will duplicate your iterator for any iterable i (e.g. a generator, a list, etc.), allowing you to use one duplicate for summing and the other for counting.
(Note that 'tee' will still use intermediate storage).


How to make itertools combinations faster in python?

I tried a lot of things and still don't know why it isn't fast. How do I fix it?
It is a CodeWars 6 kyu task:
Given a set of elements (integers or string characters, characters only in RISC-V), where any element may occur more than once, return the number of subsets that do not contain a repeated element.
import itertools

def est_subsets(a):
    counter = 0
    a = list(set(a))
    p = itertools.chain.from_iterable(itertools.combinations(a, r) for r in range(1, len(a) + 1))
    for b in p:
        counter += 1
    return counter
itertools.combinations needs to generate all the values. But you could just compute the number of values that would be generated directly, instead of generating them at all. Just use math.comb (added in 3.8) with the number of distinct elements in your input, and you'll get the same result in a tiny fraction of the time.
Please take a look at the manual:
https://docs.python.org/3/library/itertools.html#itertools.combinations
The number of items returned is n! / r! / (n-r)! when 0 <= r <= n or zero when r > n.
Which means that you can calculate the number of items it should return.
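For instance, a sketch along those lines (my code, not the OP's; math.comb requires Python 3.8+):
import math

def est_subsets(a):
    n = len(set(a))                                       # number of distinct elements
    return sum(math.comb(n, r) for r in range(1, n + 1))  # same as 2**n - 1
Since every non-empty subset of a set of n distinct elements is counted exactly once, the result is simply 2**n - 1.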

My program can't run that fast even with memoization

I tried a problem on Project Euler where I needed to find the sum of all the Fibonacci terms under 4 million. It took me a long time, but then I found out that I can use memoization to do it, yet it still seems to take a long time. After a lot of research, I found out that I can use lru_cache from the built-in functools module. My question is: why isn't my memoization as fast as lru_cache?
Here's my code:
from functools import lru_cache

#lru_cache(maxsize=1000000)
def fibonacci_memo(input_value):
    global value
    fibonacci_cache = {}
    if input_value in fibonacci_cache:
        return fibonacci_cache[input_value]
    if input_value == 0:
        value = 1
    elif input_value == 1:
        value = 1
    elif input_value > 1:
        value = fibonacci_memo(input_value - 1) + fibonacci_memo(input_value - 2)
    fibonacci_cache[input_value] = value
    return value

def sumOfFib():
    SUM = 0
    for n in range(500):
        if fibonacci_memo(n) < 4000000:
            if fibonacci_memo(n) % 2 == 0:
                SUM += fibonacci_memo(n)
    return SUM

print(sumOfFib())
The code works by the way. It takes less than a second to run it when I use the lru_cache module.
The other answer shows the correct way to calculate the Fibonacci sequence, indeed, but you should also know why your memoization wasn't working. To be specific:
fibonacci_cache = {}
This line being inside the function means you were emptying your cache every time fibonacci_memo was called.
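A minimal sketch of that fix (my variant of the question's code): move the cache outside the function so it persists across calls.
fibonacci_cache = {}   # module level, so it survives between calls

def fibonacci_memo(input_value):
    if input_value in fibonacci_cache:
        return fibonacci_cache[input_value]
    if input_value < 2:
        value = 1
    else:
        value = fibonacci_memo(input_value - 1) + fibonacci_memo(input_value - 2)
    fibonacci_cache[input_value] = value
    return value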
You shouldn't be computing the Fibonacci sequence at all, not even by dynamic programming. Since the Fibonacci sequence satisfies a linear recurrence relation with constant coefficients and constant order, the sequence of its partial sums does too.
Definitely don't cache all the values. That is an unnecessary use of memory. When the recurrence has constant order, you only need to remember as many previous terms as the order of the recurrence.
Furthermore, there is a way to turn recurrences of constant order into systems of recurrences of order one. The solution of the latter is given by a power of a matrix. This gives a faster algorithm for large values of n, although each step is more expensive. So, the best method uses a combination of the two: the first method for small values of n and the matrix method for large inputs.
O(n) using the recurrence for the sum
Denote S_n=F_0+F_1+...+F_n the sum of the first Fibonacci numbers F_0,F_1,...,F_n.
Observe that
S_{n+1}-S_n=F_{n+1}
S_{n+2}-S_{n+1}=F_{n+2}
S_{n+3}-S_{n+2}=F_{n+3}
Since F_{n+3}=F_{n+2}+F_{n+1} we get that S_{n+3}-S_{n+2}=S_{n+2}-S_n. So
S_{n+3}=2S_{n+2}-S_n
with the initial conditions S_0=F_0=1, S_1=F_0+F_1=1+1=2, and S_2=S_1+F_2=2+2=4.
One thing that you can do is compute S_n bottom up, remembering the values of only the previous three terms at each step. You don't need to remember all of the values of S_k, from k=0 to k=n. This gives you an O(n) algorithm with O(1) amount of memory.
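A minimal sketch of that bottom-up approach (my code, using the same F_0 = F_1 = 1 convention as the question):
def sum_fib_upto(n):
    # S_n = F_0 + ... + F_n via S_{k+3} = 2*S_{k+2} - S_k, keeping only three terms
    if n < 3:
        return (1, 2, 4)[n]
    s0, s1, s2 = 1, 2, 4          # S_0, S_1, S_2
    for _ in range(n - 2):
        s0, s1, s2 = s1, s2, 2*s2 - s0
    return s2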
O(ln(n)) by matrix exponentiation
You can also get an O(ln(n)) algorithm in the following way:
Call X_n to be the column vector with components S_{n+2},S_{n+1},S_{n}
So, the recurrence above gives the recurrence
X_{n+1}=AX_n
where A is the matrix
[
    [2, 0, -1],
    [1, 0,  0],
    [0, 1,  0],
]
Therefore, X_n=A^nX_0. We have X_0. To multiply by A^n we can do exponentiation by squaring.
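For concreteness, here is a rough pure-Python sketch of that idea (names and helpers are mine), applying exponentiation by squaring to the 3x3 matrix A above:
def mat_mul(A, B):
    # 3x3 integer matrix product
    return [[sum(A[i][k]*B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def mat_pow(A, e):
    # exponentiation by squaring
    R = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    while e:
        if e & 1:
            R = mat_mul(R, A)
        A = mat_mul(A, A)
        e >>= 1
    return R

def sum_fib_logn(n):
    A = [[2, 0, -1], [1, 0, 0], [0, 1, 0]]
    An = mat_pow(A, n)
    # X_0 = (S_2, S_1, S_0) = (4, 2, 1); S_n is the last component of A^n X_0
    return An[2][0]*4 + An[2][1]*2 + An[2][2]*1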
For the sake of completeness, here are implementations of the general ideas described in @NotDijkstra's answer plus my humble optimizations, including the "closed form" solution implemented in integer arithmetic.
We can see that the "smart" methods are not only an order of magnitude faster but also seem to scale better, consistent with the fact (thanks @NotDijkstra) that Python big ints use better-than-naive multiplication.
import numpy as np
import operator as op
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

def pow(b, e, mul=op.mul, unit=1):
    if e == 0:
        return unit
    res = b
    for bit in bin(e)[3:]:
        res = mul(res, res)
        if bit == "1":
            res = mul(res, b)
    return res

def mul_fib(a, b):
    return (a[0]*b[0] + 5*a[1]*b[1]) >> 1, (a[0]*b[1] + a[1]*b[0]) >> 1

def fib_closed(n):
    return pow((1, 1), n+1, mul_fib)[1]

def fib_mat(n):
    return pow(np.array([[1, 1], [1, 0]], 'O'), n, op.matmul)[0, 0]

def fib_sequential(n):
    t1, t2 = 1, 1
    for i in range(n-1):
        t1, t2 = t2, t1+t2
    return t2

def sum_fib_direct(n):
    t1, t2, res = 1, 1, 1
    for i in range(n):
        t1, t2, res = t2, t1+t2, res+t2
    return res

def sum_fib(n, method="closed"):
    if method == "direct":
        return sum_fib_direct(n)
    return globals()[f"fib_{method}"](n+2) - 1

methods = "closed mat sequential direct".split()

def f(method):
    def f(n):
        return sum_fib(n, method)
    f.__name__ = method
    return f

for method in methods:
    B.add_function(method)(f(method))
B.add_arguments('N')(lambda: (2*(1<<k,) for k in range(23)))

r = B.run()
r.plot()

import matplotlib.pylab as P
P.savefig("fib.png")
I am not sure how you are taking anywhere near a second. Here is the memoized version without fanciness:
class fibs(object):
    def __init__(self):
        self.thefibs = {0: 0, 1: 1}
    def __call__(self, n):
        if n not in self.thefibs:
            self.thefibs[n] = self(n-1) + self(n-2)
        return self.thefibs[n]

dog = fibs()
sum([dog(i) for i in range(40) if dog(i) < 4000000])

This code is too inefficient, how can I increase memory and execution efficiency?

I'm trying to complete the following challenge: https://app.codesignal.com/challenge/ZGBMLJXrFfomwYiPs.
I have written code that appears to work; however, it is so inefficient that it fails the test (it takes too long to execute and uses too much memory). Are there any ways I can make this more efficient? I'm quite new to writing efficient scripts. Someone mentioned "map()" can be used in lieu of "for i in range(1, n)". Thank you Xero Smith and others for the suggestions for optimising it this far:
from functools import reduce
from operator import mul
from itertools import combinations

# Starting from the maximum, we can divide our bag combinations to see the total number of integer factors
def prime_factors(n):
    p = 2
    dct = {}
    while n != 1:
        if n % p:
            p += 1
        else:
            dct[p] = dct.get(p, 0) + 1
            n = n//p
    return dct

def number_of_factors(n):
    return reduce(mul, (i+1 for i in prime_factors(n).values()), 1)

def kinderLevon(bags):
    candies = list()
    for x in (combinations(bags, i) for i in range(1, len(bags)+1)):
        for j in x:
            candies.append(sum(j))
    satisfied_kids = [number_of_factors(i) for i in candies]
    return candies[satisfied_kids.index(max(satisfied_kids))]
Any help would be greatly appreciated.
Thanks,
Aaron
Following my comment, I can already identify a memory and complexity improvement. In your factors function, since you only need the number of factors, you can simply count them instead of storing them.
def factors(n):
    k = 2
    for i in range(2, n//2 + 1):
        if n % i == 0:
            k += 1
    return k
EDIT: as suggested in the comments, stop the loop earlier (at n//2 + 1 rather than n).
This actually reduces time complexity for huge numbers, but not really for smaller ones.
This is a much better improvement than the one using list comprehensions (which still allocates memory).
Moreover, it is pointless to allocate your combinations list twice. You're doing
x = list(combinations(bags, i));
for j in list(x):
    ...
On the first line you convert the iterator returned by combinations into a list, duplicating the data. On the second line, list(x) re-allocates a copy of the list, taking even more memory! There you should really just write:
for j in combinations(bags, i):
    ...
As a matter of style, please don't use semicolons (;) in Python!
First things first, combinations are iterable. This means you do not have to convert them into lists before you iterate over them; in fact, it is terribly inefficient to do so.
Next thing that can be improved significantly is your factors procedure. Currently it is linear. We can do better. We can get the number of factors of an integer N via the following algorithm:
get the prime factorisation of N such that N = p1^n1 * p2^n2 * ...
the number of factors of N is (1+n1) * (1+n2) * ...
see https://www.wikihow.com/Find-How-Many-Factors-Are-in-a-Number for details.
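For example, 12 = 2^2 * 3^1, so it has (2+1)*(1+1) = 6 factors: 1, 2, 3, 4, 6 and 12.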
Something else, your current solution has a lot of variables and computations that are not used. Get rid of them.
With these, we get the following which should work:
from functools import reduce
from operator import mul
from itertools import combinations

# Starting from the maximum, we can divide our bag combinations to see the total number of integer factors
def prime_factors(n):
    p = 2
    dct = {}
    while n != 1:
        if n % p:
            p += 1
        else:
            dct[p] = dct.get(p, 0) + 1
            n = n//p
    return dct

def number_of_factors(n):
    return reduce(mul, (i+1 for i in prime_factors(n).values()), 1)

def kinderLevon(bags):
    candies = list()
    for x in (combinations(bags, i) for i in range(1, len(bags)+1)):
        for j in x:
            candies.append(sum(j))
    satisfied_kids = [number_of_factors(i) for i in candies]
    return candies[satisfied_kids.index(max(satisfied_kids))]
Use list comprehensions. The factors function can be transformed like this:
def factors(n):
    return len([i for i in range(1, n + 1) if n % i == 0])

How do I return the product of a while loop

I don't get the concept of loops yet. I got the following code:
x = 0
while x < n:
    x = x + 1
    print x
which prints 1, 2, 3, 4, 5 (for n = 5).
That's fine, but how do I access the computation that was done in the loop? E.g., how do I return the product of the loop (5*4*3*2*1)?
Thanks.
Edit:
That was my final code:
def factorial(n):
    result = 1
    while n >= 1:
        result = result * n
        n = n - 1
    return result
You want to introduce one more variable (total) which accumulates the result over the iterations:
total = 1
x = 1
while x <= 5:
    total *= x
    x += 1
    print x, total
print 'total:', total
Actually, a more pythonic way:
total = 1
n = 5
for x in xrange(1, n + 1):
    total *= x
print total
Note that the initial value of total must be 1 and not 0, since in the latter case you would always get 0 as a result (0*1*... always equals 0).
By storing that product and returning that result:
def calculate_product(n):
    product = 1
    for x in range(n):
        product *= x + 1
    return product
Now we have a function that produces your calculation, and it returns the result:
print calculate_product(5)
A "one-liner"
>>> import operator
>>> reduce(operator.mul, xrange(1, n + 1))
120
>>>
Alternatively, you could use the yield keyword, which will return values from within the loop. For instance:
def yield_example(n):
    current_answer = 1
    for i in range(1, n+1):
        current_answer *= i
        yield current_answer
This will lazily evaluate the answers for you. If you just want to iterate over everything once, this is probably the way to go; if you know you want to store the results, you should probably use return as in the other answers, but the lazy approach is nice for a lot of other applications.
This is called a generator function with the idea behind it being that it is a function that will "generate" answers when asked. In contrast to a standard function that will generate everything at once, this allows you to only perform calculations when you need to and will generally be more memory efficient, though performance is best evaluated on a case-by-case basis. As always.
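For instance, with the generator above (my example):
>>> list(yield_example(5))
[1, 2, 6, 24, 120]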
Edit: This is not quite the question the OP is asking, but I think it's a good introduction to some of the really neat and flexible things about Python.
use a for loop:
sum_ = 1
for i in range(1, 6):
    sum_ *= i
print sum_
If you prefer to keep your while-loop structure, you could do it like this (there are a thousand and one ways to do it ...):
x = 1
result = 1
while x <= n:
    result *= x
    x += 1
Where result will store the factorial. You can then just return or print out result, or whatever you want to do with it.
To access the computation done in the loop, you need an accumulator variable (with a useful, understandable name) in which you store the result of the computation. After the loop you just return or use that variable as the product of the loop.
sum_counter = 0
x = 0
while x < 10:
    sum_counter += x
    x += 1
print sum_counter

Running average in Python

Is there a pythonic way to build up a list that contains a running average of some function?
After reading a fun little piece about Martians, black boxes, and the Cauchy distribution, I thought it would be fun to calculate a running average of the Cauchy distribution myself:
import math
import random

def cauchy(location, scale):
    p = 0.0
    while p == 0.0:
        p = random.random()
    return location + scale*math.tan(math.pi*(p - 0.5))

# is this next block of code a good way to populate running_avg?
sum = 0
count = 0
max = 10
running_avg = []
while count < max:
    num = cauchy(3,1)
    sum += num
    count += 1
    running_avg.append(sum/count)

print running_avg # or do something else with it, besides printing
I think that this approach works, but I'm curious if there might be a more elegant approach to building up that running_avg list than using loops and counters (e.g. list comprehensions).
There are some related questions, but they address more complicated problems (small window size, exponential weighting) or aren't specific to Python:
calculate exponential moving average in python
How to efficiently calculate a running standard deviation?
Calculating the Moving Average of a List
You could write a generator:
def running_average():
    sum = 0
    count = 0
    while True:
        sum += cauchy(3,1)
        count += 1
        yield sum/count
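For example, to take just the first ten running averages (my usage sketch, assuming cauchy() from the question is in scope):
import itertools
print list(itertools.islice(running_average(), 10))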
Or, given a generator for Cauchy numbers and a utility function for a running sum generator, you can have a neat generator expression:
# Cauchy numbers generator
def cauchy_numbers():
    while True:
        yield cauchy(3,1)

# running sum utility function
def running_sum(iterable):
    sum = 0
    for x in iterable:
        sum += x
        yield sum

# Running averages generator expression (** the neat part **)
running_avgs = (sum/(i+1) for (i,sum) in enumerate(running_sum(cauchy_numbers())))

# goes on forever
for avg in running_avgs:
    print avg

# alternatively, take just the first 10
import itertools
for avg in itertools.islice(running_avgs, 10):
    print avg
You could use coroutines. They are similar to generators, but allow you to send in values. Coroutines were added in Python 2.5, so this won't work in versions before that.
def running_average():
    sum = 0.0
    count = 0
    value = yield(float('nan'))
    while True:
        sum += value
        count += 1
        value = yield(sum/count)

ravg = running_average()
next(ravg) # advance the coroutine to the first yield
for i in xrange(10):
    avg = ravg.send(cauchy(3,1))
    print 'Running average: %.6f' % (avg,)
As a list comprehension:
ravg = running_average()
next(ravg)
ravg_list = [ravg.send(cauchy(3,1)) for i in xrange(10)]
Edits:
Using the next() function instead of the it.next() method. This is so it also will work with Python 3. The next() function has also been back-ported to Python 2.6+.
In Python 2.5, you can either replace the calls with it.next(), or define a next function yourself.
(Thanks Adam Parkin)
I've got two possible solutions here for you. Both are just generic running average functions that work on any list of numbers. (could be made to work with any iterable)
Generator based:
nums = [cauchy(3,1) for x in xrange(10)]
def running_avg(numbers):
for count in xrange(1, len(nums)+1):
yield sum(numbers[:count])/count
print list(running_avg(nums))
List Comprehension based (really the same code as the earlier):
nums = [cauchy(3,1) for x in xrange(10)]
print [sum(nums[:count])/count for count in xrange(1, len(nums)+1)]
Generator-compatible generator based:
Edit: This one I just tested to see if I could make my solution compatible with generators easily and what its performance would be. This is what I came up with.
def running_avg(numbers):
    sum = 0
    for count, number in enumerate(numbers):
        sum += number
        yield sum/(count+1)
See the performance stats below, well worth it.
Performance characteristics:
Edit: I also decided to test Orip's interesting use of multiple generators to see the impact on performance.
Using timeit and the following (1,000,000 iterations 3 times):
print "Generator based:", ', '.join(str(x) for x in Timer('list(running_avg(nums))', 'from __main__ import nums, running_avg').repeat())
print "LC based:", ', '.join(str(x) for x in Timer('[sum(nums[:count])/count for count in xrange(1, len(nums)+1)]', 'from __main__ import nums').repeat())
print "Orip's:", ', '.join(str(x) for x in Timer('list(itertools.islice(running_avgs, 10))', 'from __main__ import itertools, running_avgs').repeat())
print "Generator-compatabile Generator based:", ', '.join(str(x) for x in Timer('list(running_avg(nums))', 'from __main__ import nums, running_avg').repeat())
I get the following results:
Generator based: 17.653908968, 17.8027219772, 18.0342400074
LC based: 14.3925321102, 14.4613749981, 14.4277560711
Orip's: 30.8035550117, 30.3142540455, 30.5146529675
Generator-compatabile Generator based: 3.55352187157, 3.54164409637, 3.59098005295
See comments for code:
Orip's genEx based: 4.31488609314, 4.29926609993, 4.30518198013
Results are in seconds, and show the new generator-compatible generator method to be consistently fastest, though your results may vary. I expect the massive difference between my original generator and the new one is the fact that the original recomputes the sum from scratch each iteration instead of accumulating it on the fly.
