Python iterate through unique commutative equations - python

Inspired by the math of this video, I was curious how long it takes to get a number, n, when I can only use the number 1 and a few operations like addition and multiplication, working in binary. Currently, I have things coded up like this:
import itertools as it

def brute(m):
    out = set()
    for combo in it.product(['+', '*', '|'], repeat=m):
        x = parse(combo)
        if type(x) == int and 0 < x - 1:
            out.add(x)
    return out

def parse(ops):
    eq = ""
    last = 1
    for op in ops:
        if op == "|":
            last *= 2
            last += 1
        else:
            eq += str(last)
            eq += op
            last = 1
    eq += str(last)
    return eval(eq)
Here, "|" refers to concatenation, thus combo = ("+","|","*","|","|") would parse as 1+11*111 = 1+3*7 = 22. I'm using this to build a table of how many numbers are m operations away and study how this grows. Currently I am using the itertools product function to search all possible operation combos, despite this being a bit repetitive, as "+" and "*" are both commutative operations. Is there a clean way to only generate the commutatively unique expressions?
Currently, my best idea is to have a second brute force program, brute2(k) which only uses the operations "*" and "|", and then have:
def brute(m):
    out = set()
    for k in range(m):
        for a, b in it.product(brute(k), brute2(m - k)):
            out.add(a + b)
    return out
If I memoize things, this would be pretty decent, but I'm not sure if there's a more pythonic or efficient way than this. Furthermore, this fails to cleanly generalize if I decide to add in more operations like subtraction.
What I'm hoping to achieve is insight on whether there is some module or simple method which will efficiently iterate through the operation combos. Currently, itertools.product has complexity O(3^m). However, with memoizing and using the brute2 method, it seems I could get things down to O(|brute(m-1)|), which seems to be asymptotically ~O(1.5^m). While I am moderately happy with this second method, it would be nice if there were a more generalizable method which would extend to arbitrary sets of operations, some of which are commutative.
Update:
I've now gotten my second idea coded up. With this, I was able to quickly get all numbers reachable in 42 operations, when my old code got stuck for hours after 20 operations. Here is the new code:
memo1 = {0: {1}}
def brute1(m):
    if m not in memo1:
        out = set(brute2(m + 1))
        for i in range(m):
            for a, b in it.product(brute1(i), brute2(m - i)):
                out.add(a + b)
        memo1[m] = set(out)
    return memo1[m]

memo2 = {0: {1}}
def brute2(m):
    if m not in memo2:
        out = set()
        for i in range(m):
            for x in brute2(i):
                out.add(x * (2**(m - i) - 1))
        memo2[m] = set(out)
    return memo2[m]
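As a quick sanity check (a sketch, not part of the original code): the memoized version should agree with the first brute above, whose only difference is that it filters out the value 1.

# compare against brute as defined in the very first snippet
for m in range(1, 8):
    assert brute(m) == brute1(m) - {1}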
If I were to generalize this, you would order all your commutative operations [op_1, op_2, ..., op_x], have brute(n,0) return all numbers reachable with n non-commutative operations, and then have:
memo = {}
def brute(n, i):
    if (n, i) not in memo:
        out = set(brute(n, i - 1))  # in case I don't use op_i (copied so the memoized set isn't mutated)
        for x in range(1, n):
            # x is the number of operations before my last use of op_i;
            # if combo = (op_i,"|","+","+","*","+",op_i,"*","|"), this case is covered when x = 6
            for a, b in it.product(brute(x, i), brute(n - x - 1, i - 1)):
                out.add(a op_i b)  # pseudocode: apply op_i to a and b
        memo[(n, i)] = out
    return memo[(n, i)]
However, you have to be careful about the order of operations. If addition were applied before multiplication, the expressions would evaluate completely differently, and the method would answer a different question. If there's a hierarchy like PEMDAS and you don't want to consider parentheses, I believe you just list the operations in decreasing order of priority, i.e. op_1 is the first operation you apply, etc. If you allow general parenthesization, then I believe you should have something like this:
memo = {0: {1}}
def brute(m, ops):
    if m not in memo:
        out = set()
        for i in range(m):
            for a, b in it.product(brute(i, ops), brute(m - i - 1, ops)):
                for op in ops:
                    out.add(op(a, b))
        memo[m] = out
    return memo[m]
I'm more interested in the case where we have some sort of PEMDAS system in place, and the parentheses are implied by the sequence of operations, but feedback on the validity/efficiency of either case is still welcome.

Related

My program can't run that fast even with memoization

I tried a problem on Project Euler where I needed to find the sum of the even-valued Fibonacci terms under 4 million. It took me a long time, but then I found out that I can use memoization to do it; however, it still seems to take a long time. After a lot of research, I found out that I can use the lru_cache decorator from functools. My question is: why isn't my own memoization as fast as lru_cache?
Here's my code:
from functools import lru_cache
#lru_cache(maxsize=1000000)
def fibonacci_memo(input_value):
    global value
    fibonacci_cache = {}
    if input_value in fibonacci_cache:
        return fibonacci_cache[input_value]
    if input_value == 0:
        value = 1
    elif input_value == 1:
        value = 1
    elif input_value > 1:
        value = fibonacci_memo(input_value - 1) + fibonacci_memo(input_value - 2)
    fibonacci_cache[input_value] = value
    return value

def sumOfFib():
    SUM = 0
    for n in range(500):
        if fibonacci_memo(n) < 4000000:
            if fibonacci_memo(n) % 2 == 0:
                SUM += fibonacci_memo(n)
    return SUM

print(sumOfFib())
The code works by the way. It takes less than a second to run it when I use the lru_cache module.
The other answer is indeed the correct way to calculate the Fibonacci sequence, but you should also know why your memoization wasn't working. To be specific:
fibonacci_cache = {}
This line being inside the function means you were emptying your cache every time fibonacci_memo was called.
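A minimal fix, keeping the question's naming and its F(0) = F(1) = 1 convention, is to move the cache outside the function so it persists between calls (a sketch, not the only way to do it):

fibonacci_cache = {0: 1, 1: 1}   # module-level, so it survives across calls

def fibonacci_memo(input_value):
    if input_value not in fibonacci_cache:
        fibonacci_cache[input_value] = (fibonacci_memo(input_value - 1)
                                        + fibonacci_memo(input_value - 2))
    return fibonacci_cache[input_value]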
You shouldn't be computing the Fibonacci sequence term by term, not even by dynamic programming. Since the Fibonacci sequence satisfies a linear recurrence relation with constant coefficients and constant order, so does the sequence of its partial sums.
Definitely don't cache all the values; that is an unnecessary consumption of memory. When the recurrence has constant order, you only need to remember as many previous terms as the order of the recurrence.
Furthermore, there is a way to turn recurrences of constant order into systems of recurrences of order one. The solution of the latter is given by a power of a matrix. This gives a faster algorithm for large values of n, although each step is more expensive. So the best method uses a combination of the two: the direct method for small values of n and the matrix method for large inputs.
O(n) using the recurrence for the sum
Denote S_n=F_0+F_1+...+F_n the sum of the first Fibonacci numbers F_0,F_1,...,F_n.
Observe that
S_{n+1}-S_n=F_{n+1}
S_{n+2}-S_{n+1}=F_{n+2}
S_{n+3}-S_{n+2}=F_{n+3}
Since F_{n+3}=F_{n+2}+F_{n+1} we get that S_{n+3}-S_{n+2}=S_{n+2}-S_n. So
S_{n+3}=2S_{n+2}-S_n
with the initial conditions S_0=F_0=1, S_1=F_0+F_1=1+1=2, and S_2=S_1+F_2=2+2=4.
One thing that you can do is compute S_n bottom up, remembering the values of only the previous three terms at each step. You don't need to remember all of the values of S_k, from k=0 to k=n. This gives you an O(n) algorithm with O(1) amount of memory.
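A sketch of that bottom-up approach, using the conventions above (S_0 = 1, S_1 = 2, S_2 = 4); the name sum_fib_upto is just for illustration:

def sum_fib_upto(n):
    # S_n via S_{n+3} = 2*S_{n+2} - S_n, keeping only the last three values
    if n < 3:
        return (1, 2, 4)[n]
    s0, s1, s2 = 1, 2, 4
    for _ in range(n - 2):
        s0, s1, s2 = s1, s2, 2*s2 - s0
    return s2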
O(ln(n)) by matrix exponentiation
You can also get an O(ln(n)) algorithm in the following way:
Let X_n be the column vector with components S_{n+2}, S_{n+1}, S_n.
The recurrence above then becomes
X_{n+1}=AX_n
where A is the matrix
[
[2,0,-1],
[1,0,0],
[0,1,0],
]
Therefore, X_n=A^nX_0. We have X_0. To multiply by A^n we can do exponentiation by squaring.
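As a sketch of this, with the squaring done on object-dtype NumPy arrays so the big integers stay exact (this assumes a NumPy version whose matmul accepts object arrays, as the benchmark code below also does):

import numpy as np

A = np.array([[2, 0, -1],
              [1, 0,  0],
              [0, 1,  0]], dtype=object)
X0 = np.array([4, 2, 1], dtype=object)   # (S_2, S_1, S_0)

def sum_fib_matpow(n):
    # Computes X_n = A**n @ X_0 by repeated squaring; S_n is the last component.
    result = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=object)
    base, e = A, n
    while e:
        if e & 1:
            result = result @ base
        base = base @ base
        e >>= 1
    return (result @ X0)[2]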
For the sake of completeness, here are implementations of the general ideas described in #NotDijkstra's answer, plus my humble optimizations, including the "closed form" solution implemented in integer arithmetic.
We can see that the "smart" methods are not only an order of magnitude faster but also seem to scale better, consistent with the fact (thanks #NotDijkstra) that Python big ints use better-than-naive multiplication.
import numpy as np
import operator as op
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

def pow(b, e, mul=op.mul, unit=1):
    if e == 0:
        return unit
    res = b
    for bit in bin(e)[3:]:
        res = mul(res, res)
        if bit == "1":
            res = mul(res, b)
    return res

def mul_fib(a, b):
    return (a[0]*b[0] + 5*a[1]*b[1]) >> 1, (a[0]*b[1] + a[1]*b[0]) >> 1

def fib_closed(n):
    return pow((1, 1), n + 1, mul_fib)[1]

def fib_mat(n):
    return pow(np.array([[1, 1], [1, 0]], 'O'), n, op.matmul)[0, 0]

def fib_sequential(n):
    t1, t2 = 1, 1
    for i in range(n - 1):
        t1, t2 = t2, t1 + t2
    return t2

def sum_fib_direct(n):
    t1, t2, res = 1, 1, 1
    for i in range(n):
        t1, t2, res = t2, t1 + t2, res + t2
    return res

def sum_fib(n, method="closed"):
    if method == "direct":
        return sum_fib_direct(n)
    return globals()[f"fib_{method}"](n + 2) - 1

methods = "closed mat sequential direct".split()

def f(method):
    def f(n):
        return sum_fib(n, method)
    f.__name__ = method
    return f

for method in methods:
    B.add_function(method)(f(method))
B.add_arguments('N')(lambda: (2*(1 << k,) for k in range(23)))

r = B.run()
r.plot()

import matplotlib.pylab as P
P.savefig("fib.png")
I am not sure how you are taking anything near a second. Here is the memoized version without fanciness:
class fibs(object):
    def __init__(self):
        self.thefibs = {0: 0, 1: 1}
    def __call__(self, n):
        if n not in self.thefibs:
            self.thefibs[n] = self(n - 1) + self(n - 2)
        return self.thefibs[n]

dog = fibs()
sum([dog(i) for i in range(40) if dog(i) < 4000000])

How do I not repeat my list comprehension in a lambda

I am trying to be unnecessarily fancy with a challenge from codesignal. The problem: "Given a number and a range, find the largest integer within the given range that's divisible by the given number."
I have l for left boundary, r for right boundary, and d for the divisor. If none of the numbers within the boundary are divisible, then the function must return a -1. Otherwise, return the largest divisible number.
Is there a way to avoid repeating the list comprehension?
Is there a better way to do this altogether? (that is equally unreadable and unnecessary of course)
These receive a NameError: name '_' is not defined, which makes sense.
maxDivisor = lambda l,r,d: _[0] if [i for i in range(l,r+1)[::-1] if i%d==0] else -1
maxDivisor = lambda l,r,d: [i for i in range(l,r+1)[::-1] if i%d==0][0] if _ else -1
This works, but I don't want to repeat myself:
maxDivisor = lambda l,r,d: [i for i in range(l,r+1)[::-1] if i%d==0][0] if [i for i in range(l,r+1)[::-1] if i%d==0] else -1
This works, but is too readable:
def maxDivisor(left, right, divisor):
    for i in range(left, right+1)[::-1]:
        if i % divisor == 0:
            return i
    return -1
Just to reiterate:
maxDivisor(-99,-96,5) should return -1 and
maxDivisor(1,10,3) should return 9.
Thank you for your help with my unnecessary request.
Do not write bad and unreadable code just for the sake of writing bad and unreadable code.1)
Instead, I'd suggest using max with a generator expression and a default, which does, and reads, exactly what you want: Get the max number in this range which is a divisor, or -1 if no such thing exists.
res = max((x for x in range(l, r+1) if x%d==0), default=-1)
Similar, but maybe closer in spirit to what you were trying, you could use next on the filtered reversed range to get the largest such element, or -1 as default.
res = next((x for x in range(r, l-1, -1) if x%d==0), -1)
If you really want to be "fancy", though, how about this: Instead of testing all the numbers, just get the result directly in O(1):
res = r - (r % d) if (r - (r % d) >= l) else -1
(All of the parens are unnecessary here, but IMHO make it more readable, so this even fulfills part of your requirement.)
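To see the arithmetic on the question's own test cases (just a sketch of the check, relying on Python's floor-style %):

# r - r % d is the largest multiple of d that is <= r
print(10 - 10 % 3)       # 9, and 9 >= 1, so maxDivisor(1, 10, 3) == 9
print(-96 - (-96 % 5))   # -100 (since -96 % 5 == 4), below l = -99, so the result is -1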
From your comment, it seems like you are trying "Code Golf", where the goal is to have the shortest code possible. In this case, you might go with the third approach, but use this variant without the ternary ... if ... else .... This should also fully qualify for your "unnecessary and unreadable" requirement:
x=[r-r%d,-1][r-r%d<l] # for code-golf only!
I will not tell you how it works, though, you have to find this out for yourself.
1) Unless this is some sort of obfuscated-code-challenge, maybe.

What's a fast and pythonic/clean way of removing a sorted list from another sorted list in python?

I am creating a fast method of generating a list of primes in the range(0, limit+1). In the function I end up removing all integers in the list named removable from the list named primes. I am looking for a fast and pythonic way of removing the integers, knowing that both lists are always sorted.
I might be wrong, but I believe list.remove(n) iterates over the list comparing each element with n, meaning that the following code runs in O(n^2) time.
# removable and primes are both sorted lists of integers
for composite in removable:
    primes.remove(composite)
Based off my assumption (which could be wrong and please confirm whether or not this is correct) and the fact that both lists are always sorted, I would think that the following code runs faster, since it only loops over the list once for a O(n) time. However, it is not at all pythonic or clean.
i = 0
j = 0
while i < len(primes) and j < len(removable):
    if primes[i] == removable[j]:
        primes = primes[:i] + primes[i+1:]
        j += 1
    else:
        i += 1
Is there perhaps a built in function or simpler way of doing this? And what is the fastest way?
Side notes: I have not actually timed the functions or code above. Also, it doesn't matter if the list removable is changed/destroyed in the process.
For anyone interested the full functions is below:
import math

# returns a list of primes in range(0, limit+1)
def fastPrimeList(limit):
    if limit < 2:
        return list()
    sqrtLimit = int(math.ceil(math.sqrt(limit)))
    primes = [2] + range(3, limit+1, 2)
    index = 1
    while primes[index] <= sqrtLimit:
        removable = list()
        index2 = index
        while primes[index] * primes[index2] <= limit:
            composite = primes[index] * primes[index2]
            removable.append(composite)
            index2 += 1
        for composite in removable:
            primes.remove(composite)
        index += 1
    return primes
This is quite fast and clean: it does O(n) set membership checks and runs in O(n) amortized time (the first line is O(n) amortized, the second is O(n * 1) amortized, because a membership check is O(1) amortized):
removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]
Here is the modification of your 2nd solution. It does O(n) basic operations (worst case):
tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])
        i += 1
    elif primes[i] == removable[j]:
        i += 1
    else:
        j += 1
primes[:i] = tmp
del tmp
Please note that constants also matter. The Python interpreter is quite slow (i.e. with a large constant) to execute Python code. The 2nd solution has lots of Python code, and it can indeed be slower for small practical values of n than the solution with sets, because the set operations are implemented in C, thus they are fast (i.e. with a small constant).
If you have multiple working solutions, run them on typical input sizes, and measure the time. You may get surprised about their relative speed, often it is not what you would predict.
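For example, a rough timeit comparison might look like this (a sketch only; remove_with_set and remove_merge are placeholder names for the two snippets above wrapped as functions, and primes/removable are your test data):

import timeit

setup = "from __main__ import remove_with_set, remove_merge, primes, removable"
print(timeit.timeit("remove_with_set(list(primes), removable)", setup=setup, number=100))
print(timeit.timeit("remove_merge(list(primes), removable)", setup=setup, number=100))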
The most important thing here is to remove the quadratic behavior. You have this for two reasons.
First, calling remove searches the entire list for values to remove. Doing this takes linear time, and you're doing it once for each element in removable, so your total time is O(NM) (where N is the length of primes and M is the length of removable).
Second, removing elements from the middle of a list forces you to shift the whole rest of the list up one slot. So, each one takes linear time, and again you're doing it M times, so again it's O(NM).
How can you avoid these?
For the first, you either need to take advantage of the sorting, or just use something that allows you to do constant-time lookups instead of linear-time, like a set.
For the second, you either need to create a list of indices to delete and then do a second pass to move each element up the appropriate number of indices all at once, or just build a new list instead of trying to mutate the original in-place.
So, there are a variety of options here. Which one is best? It almost certainly doesn't matter; changing your O(NM) time to just O(N+M) will probably be more than enough of an optimization that you're happy with the results. But if you need to squeeze out more performance, then you'll have to implement all of them and test them on realistic data.
The only one of these that I think isn't obvious is how to "use the sorting". The idea is to use the same kind of staggered-zip iteration that you'd use in a merge sort, like this:
def sorted_subtract(seq1, seq2):
    # Merge-style walk over both sorted lists: yield the elements of seq1
    # that do not appear in seq2.
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if i2 == len(seq2):
            yield from seq1[i1:]
            return
        if seq1[i1] < seq2[i2]:
            yield seq1[i1]
            i1 += 1
        elif seq1[i1] == seq2[i2]:
            i1 += 1
        else:
            i2 += 1
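In use, the generator would be materialized by the caller, e.g. primes = list(sorted_subtract(primes, removable)).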

Subset sum for large sums

The subset sum problem is well-known for being NP-complete, but there are various tricks to solve versions of the problem somewhat quickly.
The usual dynamic programming algorithm requires space that grows with the target sum. My question is: can we reduce this space requirement?
I am trying to solve a subset sum problem with a modest number of elements but a very large target sum. The number of elements is too large for the exponential time algorithm (and shortcut method) and the target sum is too large for the usual dynamic programming method.
Consider this toy problem that illustrates the issue. Given the set A = [2, 3, 6, 8], find the number of subsets that sum to target = 11. Enumerating all subsets, we see the answer is 2: (3, 8) and (2, 3, 6).
The dynamic programming solution gives the same result, of course - ways[11] returns 2:
def subset_sum(A, target):
    ways = [0] * (target + 1)
    ways[0] = 1
    ways_next = ways[:]
    for x in A:
        for j in range(x, target + 1):
            ways_next[j] += ways[j - x]
        ways = ways_next[:]
    return ways[target]
Now consider targeting the sum target = 1100 with the set A = [200, 300, 600, 800]. Clearly there are still 2 solutions: (300, 800) and (200, 300, 600). However, the ways array has grown by a factor of 100.
Is it possible to skip over certain weights when filling out the dynamic programming storage array? For my example problem I could compute the greatest common divisor of the input set and then divide everything by that constant, but this won't work for my real application.
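For what it's worth, that gcd reduction is a small transformation on its own (a sketch; reduce_by_gcd is just an illustrative name):

from math import gcd
from functools import reduce

def reduce_by_gcd(A, target):
    # Divide everything by the gcd of the inputs; if target isn't divisible
    # by it, no subset can sum to target at all.
    g = reduce(gcd, A)
    if target % g:
        return None
    return [x // g for x in A], target // g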
This SO question is related, but those answers don't use the approach I have in mind. The second comment by Akshay on this page says:
...in the cases where n is very small (eg. 6) and sum is very large
(eg. 1 million) then the space complexity will be too large. To avoid
large space complexity n HASHTABLES can be used.
This seems closer to what I'm looking for, but I can't seem to actually implement the idea. Is this really possible?
Edited to add: A smaller example of a problem to solve. There is 1 solution.
target = 5213096522073683233230240000
A = [2316931787588303659213440000,
1303274130518420808307560000,
834095443531789317316838400,
579232946897075914803360000,
425558899761116998631040000,
325818532629605202076890000,
257436865287589295468160000,
208523860882947329329209600,
172333769324749858949760000,
144808236724268978700840000,
123386899930738064691840000,
106389724940279249657760000,
92677271503532146368537600,
81454633157401300519222500,
72153585080604612224640000,
64359216321897323867040000,
57762842349846905631360000,
52130965220736832332302400,
47284322195679666514560000,
43083442331187464737440000,
39418499221729173786240000,
36202059181067244675210000,
33363817741271572692673536,
30846724982684516172960000,
28604096143065477274240000,
26597431235069812414440000,
24794751591313594450560000,
23169317875883036592134400,
21698632766175580575360000,
20363658289350325129805625,
19148196591638873216640000,
18038396270151153056160000,
17022355990444679945241600]
A real problem is:
target = 262988806539946324131984661067039976436265064677212251086885351040000
A = [116883914017753921836437627140906656193895584300983222705282378240000,
65747201634986581032996165266759994109066266169303062771721337760000,
42078209046391411861117545770726396229802410348353960173901656166400,
29220978504438480459109406785226664048473896075245805676320594560000,
21468474003260924418937523352411426647858372626711204170357987840000,
16436800408746645258249041316689998527266566542325765692930334440000,
12987101557528213537381958571211850688210620477887024745031375360000,
10519552261597852965279386442681599057450602587088490043475414041600,
8693844844295746252297013588993057072273225278585528961549928960000,
7305244626109620114777351696306666012118474018811451419080148640000,
6224587137040149683597270084426981690799173128454727836375984640000,
5367118500815231104734380838102856661964593156677801042589496960000,
4675356560710156873457505085636266247755823372039328908211295129600,
4109200102186661314562260329172499631816641635581441423232583610000,
3639983481521748430892521260443459881470796742937193786669693440000,
3246775389382053384345489642802962672052655119471756186257843840000,
2914003396564502206448583502127866774917064428556368433095682560000,
2629888065399463241319846610670399764362650646772122510868853510400,
2385386000362324935437502594712380738650930291856800463373109760000,
2173461211073936563074253397248264268068306319646382240387482240000,
1988573206351200938616141104476672789688204647842814753019927040000,
1826311156527405028694337924076666503029618504702862854770037160000,
1683128361855656474444701830829055849192096413934158406956066246656,
1556146784260037420899317521106745422699793282113681959093996160000,
1443011284169801504153550952356872298690068941987447193892375040000,
1341779625203807776183595209525714165491148289169450260647374240000,
1250838556670374906691960338012080744048823137584838292922165760000,
1168839140177539218364376271409066561938955843009832227052823782400,
1094646437211014876720019400903392201607763016346356924399106560000,
1027300025546665328640565082293124907954160408895360355808145902500,
965982760477305139144112620999228563585913919842836551283325440000,
909995870380437107723130315110864970367699185734298446667423360000,
858738960130436976757500934096457065914334905068448166814319513600,
811693847345513346086372410700740668013163779867939046564460960000,
768411414287644482489363509326632509674989232073666182868912640000,
728500849141125551612145875531966693729266107139092108273920640000,
691620793004461075955252231602997965644352569828303092930664960000,
657472016349865810329961652667599941090662661693030627717213377600,
625791330255672395317036671188673352614551016483550865168079360000,
596346500090581233859375648678095184662732572964200115843277440000,
568931977371436071675467087219123799753953628290345594563299840000,
543365302768484140768563349312066067017076579911595560096870560000,
519484062301128541495278342848474027528424819115480989801255014400,
497143301587800234654035276119168197422051161960703688254981760000,
476213321032044045508347054897310957784092466595223632570186240000,
456577789131851257173584481019166625757404626175715713692509290000,
438132122515529069774235170457376054037925971973698044293020160000,
420782090463914118611175457707263962298024103483539601739016561664,
404442609057972047876946806715939986830088526993021531852188160000,
389036696065009355224829380276686355674948320528420489773499040000,
374494562534633427030238036407319297168052779889230688624970240000,
360752821042450376038387738089218074672517235496861798473093760000,
347753793771829850091880543559722282890929011143421158461997158400,
335444906300951944045898802381428541372787072292362565161843560000,
323778155173833578494287055791985197213007158728485381455075840000,
312709639167593726672990084503020186012205784396209573230541440000,
302199145693704480473409550206308504954053507241841138853071360000,
292209785044384804591094067852266640484738960752458056763205945600,
282707666261699891568916593460940582033071824431295083135592960000,
273661609302753719180004850225848050401940754086589231099776640000,
265042888929147215048611399412486748738992254650755607041456640000,
256825006386666332160141270573281226988540102223840088952036475625,
248983485481605987343890803377079267631966925138189113455039385600,
241495690119326284786028155249807140896478479960709137820831360000,
234340660761814501342824380545368657996226388663143017230461440000,
227498967595109276930782578777716242591924796433574611666855840000,
220952578483466770957349011608519198854244960871423861446658560000,
214684740032609244189375233524114266478583726267112041703579878400,
208679870295533683104133831435857945991878646837700655494453760000,
202923461836378336521593102675185167003290944966984761641115240000,
197401994025105141026072179446079922264038329650750423033879040000,
192102853571911120622340877331658127418747308018416545717228160000,
187014262428406274938300203425450649910232934881573156328451805184,
182125212285281387903036468882991673432316526784773027068480160000,
177425404985627474536673746714144021883127046501745489011223040000,
172905198251115268988813057900749491411088142457075773232666240000,
168555556186474170249629649778586749838977769381324948621621760000,
164368004087466452582490413166899985272665665423257656929303344400]
In the particular comment you linked to, the suggestion is to use a hashtable to store only the values which actually arise as a sum of some subset. In the worst case this is exponential in the number of elements, so it is basically equivalent to the brute-force approach you already mentioned and ruled out.
In general, there are two parameters to the problem - the number of elements in the set and the size of the target sum. Naive brute force is exponential in the first, while the standard dynamic programming solution is exponential in the second. This works well when one of the parameters is small, but you already indicated that both parameters are too big for an exponential solution. Therefore, you are stuck with the "hard" general case of the problem.
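For reference, that hashtable-keyed variant of the DP from the question might look like this (a sketch; it stores only sums that are actually reachable, so memory tracks the number of distinct reachable sums rather than the target value):

from collections import defaultdict

def subset_sum_dict(A, target):
    ways = defaultdict(int)
    ways[0] = 1                       # one way to reach sum 0: the empty subset
    for x in A:
        nxt = defaultdict(int, ways)  # copy: the "don't take x" cases
        for s, c in ways.items():
            if s + x <= target:
                nxt[s + x] += c       # the "take x" cases
        ways = nxt
    return ways[target]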
Most NP-complete problems have some underlying graph, whether implicit or explicit. Using graph partitioning and DP, they can be solved in time exponential in the treewidth of the graph but only polynomial in the size of the graph with the treewidth held constant. Of course, without access to your data, it is impossible to say what the underlying graph might look like, or whether it is in one of the classes of graphs with bounded treewidth that can hence be solved efficiently.
Edit: I just wrote the following code to show what I meant by reducing it mod small numbers. The following code solves your first problem in less than a second, but it doesn't work on the larger problem (though it does reduce it to n=57, log(t)=68).
# (target and A as given in the smaller example above)
import itertools, time
from fractions import gcd

def gcd_r(seq):
    return reduce(gcd, seq)

def miniSolve(t, vals):
    vals = [x for x in vals if x and x <= t]
    for k in range(len(vals)):
        for sub in itertools.combinations(vals, k):
            if sum(sub) == t:
                return sub
    return None

def tryMod(n, state, answer):
    t, vals, mult = state
    mods = [x%n for x in vals if x%n]
    if (t%n or mods) and sum(mods) < n:
        print 'Filtering with', n
        print t.bit_length(), len(vals)
    else:
        return state

    newvals = list(vals)
    tmod = t%n
    if not tmod:
        for x in vals:
            if x%n:
                newvals.remove(x)
    else:
        if len(set(mods)) != len(mods):
            #don't want to deal with the complexity of multisets for now
            print 'skipping', n
        else:
            mini = miniSolve(tmod, mods)
            if mini is None:
                return None
            mini = set(mini)
            for x in vals:
                mod = x%n
                if mod:
                    if mod in mini:
                        t -= x
                        answer.add(x*mult)
                    newvals.remove(x)
    g = gcd_r(newvals + [t])
    t = t//g
    newvals = [x//g for x in newvals]
    mult *= g
    return (t, newvals, mult)

def solve(t, vals):
    answer = set()
    mult = 1
    for d in itertools.count(2):
        if not t:
            return answer
        elif not vals or t < min(vals):
            return None  #no solution
        res = tryMod(d, (t, vals, mult), answer)
        if res is None:
            return None
        t, vals, mult = res
        if len(vals) < 23:
            break
        if (d % 10000) == 0:
            print 'd', d
    #don't want to deal with the complexity of multisets for now
    assert(len(set(vals)) == len(vals))
    rest = miniSolve(t, vals)
    if rest is None:
        return None
    answer.update(x*mult for x in rest)
    return answer

start_t = time.time()
answer = solve(target, A)
assert(answer <= set(A) and sum(answer) == target)
print answer

compute mean in python for a generator

I'm doing some statistics work. I have a (large) collection of random numbers to compute the mean of, and I'd like to work with generators, because I just need to compute the mean, so I don't need to store the numbers.
The problem is that numpy.mean breaks if you pass it a generator. I can write a simple function to do what I want, but I'm wondering if there's a proper, built-in way to do this?
It would be nice if I could say "sum(values)/len(values)", but len doesn't work for generators, and sum has already consumed values.
here's an example:
import numpy

def my_mean(values):
    n = 0
    Sum = 0.0
    try:
        while True:
            Sum += next(values)
            n += 1
    except StopIteration:
        pass
    return float(Sum)/n

X = [k for k in range(1,7)]
Y = (k for k in range(1,7))
print numpy.mean(X)
print my_mean(Y)
These both give the same, correct, answer, but my_mean doesn't work for lists and numpy.mean doesn't work for generators.
I really like the idea of working with generators, but details like this seem to spoil things.
In general, if you're doing a streaming mean calculation of floating point numbers, you're probably better off using a more numerically stable algorithm than simply summing the generator and dividing by the length.
The simplest of these (that I know of) is usually credited to Knuth, and it also calculates variance. The link contains a Python implementation, but just the mean portion is copied here for completeness.
def mean(data):
    n = 0
    mean = 0.0
    for x in data:
        n += 1
        mean += (x - mean)/n
    if n < 1:
        return float('nan')
    else:
        return mean
I know this question is super old, but it's still the first hit on google, so it seemed appropriate to post. I'm still sad that the python standard library doesn't contain this simple piece of code.
Just one simple change to your code would let you use both. Generators were meant to be used interchangeably with lists in a for-loop.
def my_mean(values):
    n = 0
    Sum = 0.0
    for v in values:
        Sum += v
        n += 1
    return Sum / n

def my_mean(values):
    total = 0
    for n, v in enumerate(values, 1):
        total += v
    return total / n

print my_mean(X)
print my_mean(Y)
There is statistics.mean() in Python 3.4 but it calls list() on the input:
def mean(data):
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    return _sum(data)/n
where _sum() returns an accurate sum (math.fsum()-like function that in addition to float also supports Fraction, Decimal).
The old-fashioned way to do it:
def my_mean(values):
    sum, n = 0, 0
    for x in values:
        sum += x
        n += 1
    return float(sum)/n
One way would be
numpy.fromiter(Y, int).mean()
but this actually temporarily stores the numbers.
Your approach is a good one, but you should use the for x in y idiom instead of repeatedly calling next until you get a StopIteration. This works for both lists and generators:
def my_mean(values):
    n = 0
    Sum = 0.0
    for value in values:
        Sum += value
        n += 1
    return float(Sum)/n
You can use reduce without knowing the size of the array:
from itertools import izip, count
reduce(lambda c,i: (c*(i[1]-1) + float(i[0]))/i[1], izip(values,count(1)),0)
def my_mean(values):
    n = 0
    sum = 0
    for v in values:
        sum += v
        n += 1
    return sum/n
The above is very similar to your code, except that by using for to iterate over values you are fine whether you get a list or an iterator.
The Python sum method is, however, very optimized, so unless the list is really, really long you might be happier temporarily storing the data.
(Also notice that since you are using Python 3, you don't need float(sum)/n.)
If you know the length of the generator in advance and you want to avoid storing the full list in memory, you can use:
reduce(np.add, generator)/length
Try:
import itertools

def mean(i):
    (i1, i2) = itertools.tee(i, 2)
    return sum(i1) / sum(1 for _ in i2)

print mean([1,2,3,4,5])
tee will duplicate your iterator for any iterable i (e.g. a generator, a list, etc.), allowing you to use one duplicate for summing and the other for counting.
(Note that 'tee' will still use intermediate storage).
