Running average in Python

Is there a pythonic way to build up a list that contains a running average of some function?
After reading a fun little piece about Martians, black boxes, and the Cauchy distribution, I thought it would be fun to calculate a running average of the Cauchy distribution myself:
import math
import random

def cauchy(location, scale):
    p = 0.0
    while p == 0.0:
        p = random.random()
    return location + scale*math.tan(math.pi*(p - 0.5))
# is this next block of code a good way to populate running_avg?
sum = 0
count = 0
max = 10
running_avg = []
while count < max:
    num = cauchy(3,1)
    sum += num
    count += 1
    running_avg.append(sum/count)
print running_avg # or do something else with it, besides printing
I think that this approach works, but I'm curious if there might be a more elegant approach to building up that running_avg list than using loops and counters (e.g. list comprehensions).
There are some related questions, but they address more complicated problems (small window size, exponential weighting) or aren't specific to Python:
calculate exponential moving average in python
How to efficiently calculate a running standard deviation?
Calculating the Moving Average of a List

You could write a generator:
def running_average():
    sum = 0
    count = 0
    while True:
        sum += cauchy(3,1)
        count += 1
        yield sum/count
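Since this generator runs forever, you would typically pull a finite number of values from it, for example with itertools.islice (a small usage sketch):
import itertools

for avg in itertools.islice(running_average(), 10):
    print avg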
Or, given a generator for Cauchy numbers and a utility function for a running sum generator, you can have a neat generator expression:
# Cauchy numbers generator
def cauchy_numbers():
    while True:
        yield cauchy(3,1)

# running sum utility function
def running_sum(iterable):
    sum = 0
    for x in iterable:
        sum += x
        yield sum

# Running averages generator expression (** the neat part **)
running_avgs = (sum/(i+1) for (i,sum) in enumerate(running_sum(cauchy_numbers())))

# goes on forever
for avg in running_avgs:
    print avg

# alternatively, take just the first 10
import itertools
for avg in itertools.islice(running_avgs, 10):
    print avg

You could use coroutines. They are similar to generators, but allow you to send values in. Coroutines were added in Python 2.5, so this won't work in earlier versions.
def running_average():
    sum = 0.0
    count = 0
    value = yield(float('nan'))
    while True:
        sum += value
        count += 1
        value = yield(sum/count)
ravg = running_average()
next(ravg) # advance the coroutine to the first yield
for i in xrange(10):
    avg = ravg.send(cauchy(3,1))
    print 'Running average: %.6f' % (avg,)
As a list comprehension:
ravg = running_average()
next(ravg)
ravg_list = [ravg.send(cauchy(3,1)) for i in xrange(10)]
Edits:
Using the next() function instead of the it.next() method, so the code also works with Python 3. The next() function has also been back-ported to Python 2.6+.
In Python 2.5, you can either replace the calls with it.next(), or define a next function yourself.
(Thanks Adam Parkin)
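If you do define next() yourself for Python 2.5, a minimal shim might look like this (a sketch; it only covers the one- and two-argument forms of the later built-in):
def next(iterator, *default):
    # fall back to the .next() method; optionally return a default on exhaustion
    try:
        return iterator.next()
    except StopIteration:
        if default:
            return default[0]
        raise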

I've got two possible solutions here for you. Both are just generic running average functions that work on any list of numbers. (could be made to work with any iterable)
Generator based:
nums = [cauchy(3,1) for x in xrange(10)]

def running_avg(numbers):
    for count in xrange(1, len(numbers)+1):
        yield sum(numbers[:count])/count

print list(running_avg(nums))
List comprehension based (really the same code as the generator above):
nums = [cauchy(3,1) for x in xrange(10)]
print [sum(nums[:count])/count for count in xrange(1, len(nums)+1)]
Generator-compatible generator based:
Edit: I just tested this one to see if I could easily make my solution compatible with generators, and what its performance would be. This is what I came up with.
def running_avg(numbers):
    sum = 0
    for count, number in enumerate(numbers):
        sum += number
        yield sum/(count+1)
See the performance stats below, well worth it.
Performance characteristics:
Edit: I also decided to test Orip's interesting use of multiple generators to see the impact on performance.
Using timeit and the following (1,000,000 iterations 3 times):
print "Generator based:", ', '.join(str(x) for x in Timer('list(running_avg(nums))', 'from __main__ import nums, running_avg').repeat())
print "LC based:", ', '.join(str(x) for x in Timer('[sum(nums[:count])/count for count in xrange(1, len(nums)+1)]', 'from __main__ import nums').repeat())
print "Orip's:", ', '.join(str(x) for x in Timer('list(itertools.islice(running_avgs, 10))', 'from __main__ import itertools, running_avgs').repeat())
print "Generator-compatabile Generator based:", ', '.join(str(x) for x in Timer('list(running_avg(nums))', 'from __main__ import nums, running_avg').repeat())
I get the following results:
Generator based: 17.653908968, 17.8027219772, 18.0342400074
LC based: 14.3925321102, 14.4613749981, 14.4277560711
Orip's: 30.8035550117, 30.3142540455, 30.5146529675
Generator-compatible generator based: 3.55352187157, 3.54164409637, 3.59098005295
See comments for code:
Orip's genEx based: 4.31488609314, 4.29926609993, 4.30518198013
Results are in seconds, and show the new generator-compatible generator method to be consistently fastest, though your results may vary. I expect the massive difference between my original generator and the new one is that the original recomputes the sum from scratch for every element instead of keeping a running total.

Related

Hash multiple iterations of a value over itself

I am trying to write a function which calculates multiple hash iterations of a specific value (and outputs each iteration along the way).
However, I can't get my head around how to apply, for instance, the md5 hash function to its own output multiple times. For instance:
a = hashlib.md5('fun').hexdigest()
b = hashlib.md5(a).hexdigest()
c = hashlib.md5(b).hexdigest()
d = hashlib.md5(c).hexdigest()
.......
I think the recursion is the solution, but I just can't seem to implement it properly. This is the general factorial recursion example, but how do I adapt it to hashes:
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
This is a classic application of generators. Python limits recursion depth (the default limit is 1000 calls) because its stack frames are relatively heavy, and for anything that might run anywhere near that many times, iteration will usually be faster anyway. Using a generator lets you stop after any desired number of steps and keeps the logic flat in your code. The following example prints the output of 10 such iterations.
import hashlib
from itertools import islice

def hashes(n):
    while True:
        n = hashlib.md5(n).hexdigest()
        yield n

for h in islice(hashes('fun'), 10):
    print(h)
In general, you are looking for a loop like
while True:
    x = f(x)
where you repeatedly replace the input with the result of the most recent application.
For your specific example,
def iterated_hash(x):
    while True:
        x = hashlib.md5(x).hexdigest()
    return x
However, since you don't really want to do this an infinite number of times, you need to supply a count:
def iterated_hash(x, n):
    while True:
        if n == 0:
            return x
        x = hashlib.md5(x).hexdigest()
        n -= 1
or with a for loop,
def iterated_hash(x, n):
    for _ in range(n):
        x = hashlib.md5(x).hexdigest()
    return x
(Practically speaking, you want to use the for loop, but it's nice to see how the for loop is just a finite special case of the more general infinite loop.)
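As a quick sanity check of iterated_hash (a sketch assuming Python 2, where md5 accepts str; in Python 3 you would hash bytes such as b'fun'), four iterations should match chaining md5 by hand as in the question:
import hashlib

a = hashlib.md5('fun').hexdigest()
b = hashlib.md5(a).hexdigest()
c = hashlib.md5(b).hexdigest()
d = hashlib.md5(c).hexdigest()
assert iterated_hash('fun', 4) == d  # four applications, same result as chaining by hand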
Just iterate as many times as needed:
import hashlib

def make_hash(text, iterations):
    a = text
    for _ in range(iterations):
        a = hashlib.md5(a).hexdigest()
    return a

a = make_hash('fun', 5) # 5 iterations

Using python how do I repeatedly divide a number by 2 until it is less than 1.0?

I am unsure of how to create the loop to keep dividing the number by two. I know you can divide a number by 2; I just don't know how to create the loop that keeps dividing until the result is less than 1.0.
It depends on what exactly you're after, as it isn't clear from the question. A function that just divides a number by two until it is less than 1.0 would look like this:
def dividingBy2(x):
    while x >= 1.0:
        x = x/2
But this serves no purpose other than understanding while loops, as it gives you no information. If you wanted to see how many times you can divide by 2 before a number is less than 1.0, then you could always add a counter:
def dividingBy2Counter(x):
    count = 0
    while x >= 1.0:
        x = x/2
        count = count + 1
    return count
Or if you wanted to see each number as x becomes increasingly small:
def dividingBy2Printer(x):
    while x >= 1.0:
        x = x/2
        print(x)
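If all you need is the count, there is also a closed form you can cross-check against (a sketch assuming x > 0 and ignoring floating-point edge cases): the number of halvings needed to drop below 1.0 is floor(log2(x)) + 1 for x >= 1, and 0 otherwise.
import math

def halvings(x):
    # closed-form count of halvings until the value drops below 1.0
    if x < 1.0:
        return 0
    return int(math.floor(math.log(x, 2))) + 1

print(halvings(1000))             # 10
print(dividingBy2Counter(1000))   # 10 as well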
b = [] # initiate a list to store the result of each division

# creating a recursive function replaces the while loop
# this enables the non-technical user to call the function easily
def recursive_func(a=0): # recursive since it will call itself later
    if a >= 1: # specify the condition that will make the function run again
        a = a/2 # perform the desired calculation(s)
        recursive_func(a) # function calls itself
        b.append(a) # records the result of each division in a list

# this is how the user calls the function as an example
recursive_func(1024)
print(b)

Random prime Number in python

I currently have ↓ set as my randprime(p,q) function. Is there any way to condense this, via something like a genexp or listcomp? Here's my function:
n = randint(p, q)
while not isPrime(n):
    n = randint(p, q)
It's better to just generate the list of primes, and then choose from that list.
As it is, your code has a slim chance of hitting an infinite loop: if there are no primes in the interval, or if randint keeps picking non-primes, the while loop will never end.
So this is probably shorter and less troublesome:
import random
primes = [i for i in range(p,q) if isPrime(i)]
n = random.choice(primes)
The other advantage of this is that if there are no primes in the interval you get an immediate error (random.choice of an empty list) rather than an endless loop. As stated this can be slow depending on the range, so it would be quicker if you cached the primes ahead of time:
# initialising primes
minPrime = 0
maxPrime = 1000
cached_primes = [i for i in range(minPrime,maxPrime) if isPrime(i)]
#elsewhere in the code
import random
n = random.choice([i for i in cached_primes if p<i<q])
Again, further optimisations are possible, but are very much dependent on your actual code... and you know what they say about premature optimisations.
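All of the snippets here assume you already have an isPrime function; for completeness, a minimal trial-division sketch (any correct implementation will do):
def isPrime(n):
    # simple trial division up to sqrt(n); fine for modest ranges
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True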
Here is a script written in Python to generate n random prime integers between two given integers:
import numpy as np

def getRandomPrimeInteger(bounds):
    for i in range(len(bounds) - 1):
        if bounds[i + 1] > bounds[i]:
            x = bounds[i] + np.random.randint(bounds[i+1] - bounds[i])
            if isPrime(x):
                return x
        else:
            if isPrime(bounds[i]):
                return bounds[i]
            if isPrime(bounds[i + 1]):
                return bounds[i + 1]

    newBounds = [0 for i in range(2*len(bounds) - 1)]
    newBounds[0] = bounds[0]
    for i in range(1, len(bounds)):
        newBounds[2*i-1] = int((bounds[i-1] + bounds[i])/2)
        newBounds[2*i] = bounds[i]

    return getRandomPrimeInteger(newBounds)

def isPrime(x):
    count = 0
    for i in range(int(x/2)):
        if x % (i+1) == 0:
            count = count + 1
    return count == 1

# ex: get 50 random prime integers between 100 and 10000:
bounds = [100, 10000]
for i in range(50):
    x = getRandomPrimeInteger(bounds)
    print(x)
So it would be great if you could use an iterator to give the integers from p to q in random order (without replacement). I haven't been able to find a way to do that. The following will give random integers in that range and will skip anything it has already tested.
import random

fail = False
tested = set([])
n = random.randint(p, q)
while not isPrime(n):
    tested.add(n)
    if len(tested) == q - p + 1:
        fail = True
        break
    while n in tested:
        n = random.randint(p, q)

if fail:
    print 'I failed'
else:
    print n, ' is prime'
The big advantage of this is that if say the range you're testing is just (14,15), your code would run forever. This code is guaranteed to produce an answer if such a prime exists, and tell you there isn't one if such a prime does not exist. You can obviously make this more compact, but I'm trying to show the logic.
next(i for i in itertools.imap(lambda x: random.randint(p,q)|1,itertools.count()) if isPrime(i))
This starts with itertools.count() - this gives an infinite sequence of numbers.
Each number is mapped to a new random number in the range by itertools.imap(). imap is like map, but returns an iterator rather than a list - we don't want to generate a list of infinite random numbers!
Then, the first matching number is found, and returned.
Works efficiently, even if p and q are very far apart - e.g. 1 and 10**30, which generating a full list won't do!
By the way, this is not more efficient than your code above, and is a lot more difficult to understand at a glance - please have some consideration for the next programmer to have to read your code, and just do it as you did above. That programmer might be you in six months, when you've forgotten what this code was supposed to do!
P.S - in practice, you might want to replace count() with xrange (NOT range!), e.g. xrange(int((q-p)**1.5)+20), to do no more than that number of attempts (balanced between limiting tests for small ranges and large ranges, with no more than about a 1/2% chance of failing when it could succeed); otherwise, as was suggested in another post, you might loop forever.
PPS - improvement: replaced random.randint(p,q) with random.randint(p,q)|1 - this makes the code twice as efficient, but eliminates the possibility that the result will be 2.
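To make the limited-attempts idea from the P.S concrete, a bounded variant of the one-liner might look like this (a sketch; it assumes isPrime, p and q are already defined, and yields None when no prime turns up within the attempt budget):
import random

attempts = int((q - p) ** 1.5) + 20
candidates = (random.randint(p, q) | 1 for _ in xrange(attempts))
n = next((i for i in candidates if isPrime(i)), None)  # None means no prime was found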

Finding the index of an element in a sorted list efficiently

I have a sorted list l (of around 20,000 elements), and would like to find the first element in l that exceeds a given value t_min. Currently, my code is as follows.
def find_index(l, t_min):
    first = next((t for t in l if t > t_min), None)
    if first is None:
        return None
    else:
        return l.index(first)
To benchmark the code, I used cProfile to run a testing loop, and stripped out the time required to randomly generate lists by comparing the time to a control loop:
import numpy
import cProfile

def test_loop(n):
    for _ in range(n):
        test_l = sorted(numpy.random.random_sample(20000))
        find_index(test_l, 0.5)

def control_loop(n):
    for _ in range(n):
        test_l = sorted(numpy.random.random_sample(20000))

# cProfile.run('test_loop(1000)') takes 10.810 seconds
# cProfile.run('control_loop(1000)') takes 9.650 seconds
Each function call for find_index takes about 1.16 ms. Is there a way to improve the code to make it more efficient, given that we know the list is sorted?
The standard library bisect module is useful for this, and the docs contain an example of exactly this use case.
def find_gt(a, x):
    'Find leftmost value greater than x'
    i = bisect_right(a, x)
    if i != len(a):
        return a[i]
    raise ValueError
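Since the question actually wants the index of the first element exceeding t_min rather than the value itself, a small adaptation of the recipe might look like this (a sketch returning None when every element is <= t_min, to match the original find_index):
from bisect import bisect_right

def find_index(l, t_min):
    # index of the leftmost element strictly greater than t_min, or None
    i = bisect_right(l, t_min)
    return i if i != len(l) else None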

compute mean in python for a generator

I'm doing some statistics work. I have a (large) collection of random numbers to compute the mean of, and I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers.
The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?
It would be nice if I could say "sum(values)/len(values)", but len doesn't work for generators, and sum will already have consumed the values.
here's an example:
import numpy

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration:
        pass
    return float(Sum)/n
X = [k for k in range(1,7)]
Y = (k for k in range(1,7))
print numpy.mean(X)
print my_mean(Y)
These both give the same, correct, answer, but my_mean doesn't work for lists, and numpy.mean doesn't work for generators.
I really like the idea of working with generators, but details like this seem to spoil things.
In general if you're doing a streaming mean calculation of floating point numbers, you're probably better off using a more numerically stable algorithm than simply summing the generator and dividing by the length.
The simplest of these (that I know) is usually credited to Knuth, and also calculates variance. The link contains a python implementation, but just the mean portion is copied here for completeness.
def mean(data):
    n = 0
    mean = 0.0
    for x in data:
        n += 1
        mean += (x - mean)/n
    if n < 1:
        return float('nan')
    else:
        return mean
I know this question is super old, but it's still the first hit on google, so it seemed appropriate to post. I'm still sad that the python standard library doesn't contain this simple piece of code.
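For reference, the same streaming pattern extended to track variance as well (a sketch of the full Welford recurrence; the mean-only version above is exactly this with the M2 bookkeeping dropped):
def mean_and_variance(data):
    n, mean, M2 = 0, 0.0, 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)
    if n == 0:
        return float('nan'), float('nan')
    if n == 1:
        return mean, float('nan')
    return mean, M2 / (n - 1)  # sample variance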
Just one simple change to your code would let you use both. Generators were meant to be used interchangeably with lists in a for-loop.
def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n
def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print my_mean(X)
print my_mean(Y)
There is statistics.mean() in Python 3.4 but it calls list() on the input:
def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n
where _sum() returns an accurate sum (math.fsum()-like function that in addition to float also supports Fraction, Decimal).
The old-fashioned way to do it:
def my_mean(values):
    sum, n = 0, 0
    for x in values:
        sum += x
        n += 1
    return float(sum)/n
One way would be
numpy.fromiter(Y, int).mean()
but this actually temporarily stores the numbers.
Your approach is a good one, but you should instead use the for x in y idiom instead of repeatedly calling next until you get a StopIteration. This works for both lists and generators:
def my_mean(values):
    n = 0
    Sum = 0.0
    for value in values:
        Sum += value
        n += 1
    return float(Sum)/n
You can use reduce without knowing the size of the array:
from itertools import izip, count
reduce(lambda c,i: (c*(i[1]-1) + float(i[0]))/i[1], izip(values,count(1)),0)
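The lambda is compact but cryptic; written out, it is just the incremental-mean recurrence new_mean = (old_mean*(i-1) + x)/i applied to the i-th value (a sketch equivalent to the one-liner above):
def incremental_mean(values):
    mean = 0.0
    for i, x in enumerate(values, 1):
        mean = (mean * (i - 1) + x) / i
    return mean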
def my_mean(values):
    n = 0
    sum = 0
    for v in values:
        sum += v
        n += 1
    return sum/n
The above is very similar to your code, except that by using for to iterate over values you are fine whether you get a list or an iterator.
The Python sum function is, however, very optimized, so unless the list is really, really long, you might be happier temporarily storing the data.
(Also notice that since you are using python3, you don't need float(sum)/n)
If you know the length of the generator in advance and you want to avoid storing the full list in memory, you can use:
reduce(np.add, generator)/length
Try:
import itertools

def mean(i):
    (i1, i2) = itertools.tee(i, 2)
    return sum(i1) / sum(1 for _ in i2)

print mean([1,2,3,4,5])
tee will duplicate your iterator for any iterable i (e.g. a generator, a list, etc.), allowing you to use one duplicate for summing and the other for counting.
(Note that 'tee' will still use intermediate storage).
