How to stop a loop from running out of memory? - python

I've come back to programming after a long hiatus, so please forgive any stupid errors/inefficient code.
I am creating an encryption program that uses the RSA method of encryption, which involves finding the coprimes of numbers to generate a key. I am using the Euclidean algorithm to compute highest common factors, adding i to the list of coprimes whenever the HCF == 1. I generate two lists of coprimes for different numbers, then compare them to find the coprimes common to both sets. The basic code is below:
def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

def coprimes(n):
    cp = []
    for i in range(1, n):
        if gcd(i, n) == 1:
            cp.append(i)
    return cp  # compare() below expects the list back

def compare(n, m):
    a = coprimes(n)
    b = coprimes(m)
    c = []
    for i in a:
        if i in b:
            c.append(i)
    print(c)
This code works perfectly for small numbers and gives me what I want, but execution takes forever and is finally Killed when computing with extremely large numbers in the billions range, which is necessary for even a moderate level of security.
I assume this is a memory issue, but I can't work out how to do this in a non-memory-intensive way. I tried multiprocessing, but that just made my computer unusable due to the number of processes running.
How can I calculate the coprimes of large numbers, and then compare two sets of coprimes, in an efficient and workable way?

If the only thing you're worried about is running out of memory here, you could use generators.
def coprimes(n):
    for i in range(1, n):
        if gcd(i, n) == 1:
            yield i
This way you can use each coprime value and then discard it once you don't need it. However, nothing is going to change the fact that your code is O(N^2) and will always perform slowly for large numbers. And that assumes Euclid's algorithm is constant time, which it is not.
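For illustration, a sketch (mine, not from the original answer) of how compare() could consume that generator without building both full lists; only the smaller side is materialized as a set for fast membership tests:
def compare(n, m):
    small = set(coprimes(min(n, m)))   # only the smaller side is held in memory
    for c in coprimes(max(n, m)):      # the larger side is streamed lazily
        if c in small:
            yield c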

You could change the strategy and approach this from the perspective of common prime factors. The common coprimes between n and m will be all numbers that are not divisible by any of their common prime factors.
def primeFactors(N):
    p = 2
    while p*p <= N:
        count = 0
        while N % p == 0:
            count += 1
            N //= p
        if count: yield p
        p += 1 + (p & 1)   # 2 -> 3, then advance by 2 (odd candidates only)
    if N > 1: yield N      # whatever remains is itself a prime factor
import math

def compare2(n, m):
    # skip list for multiples of common prime factors
    skip = {p: p for p in primeFactors(math.gcd(n, m))}
    for x in range(1, min(m, n)):
        if x in skip:
            p = skip[x]                   # x is a multiple of common prime p
            nxt = x + p                   # determine next multiple to skip
            while nxt in skip: nxt += p   # for that prime
            skip[nxt] = p
        else:
            yield x                       # common coprime of n and m
The performance is considerably better than matching lists of coprimes, especially on larger numbers:
from timeit import timeit
timeit(lambda:list(compare2(10**5,2*10**5)),number=1)
# 0.025 second
timeit(lambda:list(compare2(10**6,2*10**6)),number=1)
# 0.196 second
timeit(lambda:list(compare2(10**7,2*10**7)),number=1)
# 2.18 seconds
timeit(lambda:list(compare2(10**8,2*10**8)),number=1)
# 20.3 seconds
timeit(lambda:list(compare2(10**9,2*10**9)),number=1)
# 504 seconds
At some point, building lists of all the coprimes becomes a bottleneck and you should just use/process them as they come out of the generator (for example to count how many there are):
timeit(lambda:sum(1 for _ in compare2(10**9,2*10**9)),number=1)
# 341 seconds
Another way to approach this, which is somewhat slower than the prime factor approach but much simpler to code, would be to list coprimes of the gcd between n and m:
import math

def compare3(n, m):
    d = math.gcd(n, m)
    for c in range(1, min(n, m)):
        if math.gcd(c, d) == 1:
            yield c
timeit(lambda:list(compare3(10**6,2*10**6)),number=1)
# 0.28 second
timeit(lambda:list(compare3(10**7,2*10**7)),number=1)
# 2.84 seconds
timeit(lambda:list(compare3(10**8,2*10**8)),number=1)
# 30.8 seconds
Given that it uses almost no memory, it can be advantageous in some cases:
timeit(lambda:sum(1 for _ in compare3(10**9,2*10**9)),number=1)
# 326 seconds

Related

Why does this 'optimized' prime checker run at the same speed as the regular version?

Take this plain is_prime1 function, which checks all the divisors from 2 up to sqrt(p), with some bit-playing to avoid even numbers, which are of course not primes.
import time

def is_prime1(p):
    if p & 1 == 0:
        return False
    # if the LSD is 5 then it is divisible by 5 (i.e. not a prime)
    elif p % 10 == 5:
        return False
    for k in range(2, int(p ** 0.5) + 1):
        if p % k == 0:
            return False
    return True
Versus this "optimized" version. The idea is to save all the primes we have found up to a certain number p, then iterate over those primes (using the basic arithmetic fact that every number is a product of primes), so we don't iterate through all the numbers up to sqrt(p) but only over the primes we found, which is supposed to be a small fraction of the numbers up to sqrt(p). We also iterate over only half the elements, because the largest primes almost certainly won't "fit" in the number p.
import time

global mem
global lenMem
mem = [2]
lenMem = 1

def is_prime2(p):
    global mem
    global lenMem
    # if p is even then the LSD is off
    if p & 1 == 0:
        return False
    # if the LSD is 5 then it is divisible by 5 (i.e. not a prime)
    elif p % 10 == 5:
        return False
    for div in mem[0: int(p ** 0.5) + 1]:
        if p % div == 0:
            return False
    mem.append(p)
    lenMem += 1
    return True
The only idea I have in mind is that "global variables are expensive and time consuming", but I don't know if there is another way, and if there is, whether it will really help.
On average, when running this same program:
start = time.perf_counter()
for p in range(2, 100000):
    print(f'{p} is a prime? {is_prime2(p)}') # change to is_prime1 or is_prime2
end = time.perf_counter()
I get that for is_prime1 the average time for checking the numbers 1-100K is ~0.99 seconds, and likewise for is_prime2 (maybe a difference of +0.01s on average; as I said, perhaps the use of global variables ruins some performance?)
The difference is a combination of three things:
You're just not doing that much less work. Your test case includes testing a ton of small numbers, where the distinction between testing "all numbers from 2 to the square root" and testing "all primes from 2 to the square root" just isn't that big. Your "average case" is roughly the midpoint of the range, 50,000, whose square root is 223.6: that means testing 48 primes, versus testing 222 numbers if the number is prime. But most numbers aren't prime, and most numbers have at least one small factor (proof left as an exercise), so you short-circuit and don't actually test most of the numbers in either set (if there's a factor below 8, which applies to ~77% of all numbers, you've saved maybe two tests by limiting yourself to primes).
You're slicing mem every time, which is performed eagerly and completely, even if you don't use all the values (and as noted, you almost never do for the non-primes). This isn't a huge cost, but then, you weren't getting huge savings from skipping non-primes either, so it likely eats what little savings you got from the other optimization.
(You found this one, good show) Your slice of primes took a number of primes equal to the square root of the number under test, not all primes less than or equal to that square root. So you actually performed the same number of tests, just with different numbers (many of them primes larger than the square root that definitely don't need to be tested); a sketch of a fix follows below.
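A minimal sketch of a fix for that third point, assuming mem stays sorted (which it does when candidates are tested in increasing order): select primes by value with bisect rather than by count:
import bisect
def candidate_divisors(p, mem):
    # all primes <= sqrt(p), chosen by value rather than by count
    return mem[:bisect.bisect_right(mem, int(p ** 0.5))]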
A side-note:
Your up-front tests aren't actually saving you much work; you redo both tests in the loop, so they're wasted effort when the number is prime (you test them both twice). And your test for divisibility by five is pointless: % 10 is no faster than % 5 (computers don't operate in base 10 anyway), and if not p % 5: is a slightly faster, more direct, and more complete way to test for divisibility (your test doesn't recognize multiples of 10, just multiples of 5 that aren't multiples of 10).
The tests are also wrong, because they don't exclude the base case (they say 2 and 5 are not prime, because they're divisible by 2 and 5 respectively).
First of all, you should remove the print call; it is very time consuming.
You should time just your function, not the printing, so you could do it like this:
start = time.perf_counter()
for p in range(2, 100000):
    is_prime1(p)
end = time.perf_counter()
print("prime1", end - start)

start = time.perf_counter()
for p in range(2, 100000):
    is_prime2(p)
end = time.perf_counter()
print("prime2", end - start)
is_prime1 is still faster for me.
If you want to hold primes in global memory to accelerate multiple calls, you need to ensure that the primes list is properly populated even when the function is called with numbers in random order. The way is_prime2() stores and uses the primes assumes that, for example, it is called with 7 before being called with 343. If not, 343 will be treated as a prime because 7 is not yet in the primes list.
So the function must compute and store all primes up to √343 before it can respond to the is_prime(343) call.
In order to quickly build a primes list, the Sieve of Eratosthenes is one of the fastest methods. But since you don't know in advance how many primes you'll need, you can't allocate the sieve's bit flags in advance. What you can do is use a rolling window of the sieve, moving forward by chunks (of, let's say, 1,000,000 bits at a time). When a number beyond your maximum prime is requested, you just generate more primes chunk by chunk until you have enough to respond.
Also, since you're going to build a list of primes, you might as well make it a set and check if the requested number is in it to respond to the function call. This will require generating more primes than strictly needed for divisions, but in the spirit of accelerating subsequent calls, that should not be an issue.
Here's an example of an isPrime() function that uses that approach:
primes = {3}
sieveMax = 3
sieveChunk = 1000000 # must be an even number

def isPrime(n):
    if not n & 1: return n == 2
    global primes, sieveMax, sieveChunk
    while n > sieveMax:
        base, sieveMax = sieveMax, sieveMax + sieveChunk
        sieve = [True] * sieveChunk
        for p in primes:                   # cross off multiples of known primes
            i = (p - base % p) % p         # first multiple of p in this window
            sieve[i::p] = [False] * len(sieve[i::p])
        for i in range(0, sieveChunk, 2):  # base is odd, so even offsets are odd numbers
            if not sieve[i]: continue
            p = i + base
            primes.add(p)
            sieve[i::p] = [False] * len(sieve[i::p])
    return n in primes
On the first call to an unknown prime, it will perform slower than the divisions approach but as the prime list builds up, it will provide much better response time.
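A hypothetical usage pattern (not from the original answer) showing that amortization:
print(isPrime(104729))   # 10000th prime: first call sieves a chunk, returns True
print(isPrime(104723))   # now below sieveMax: answered by a fast set lookup
print(isPrime(104725))   # divisible by 5: False, also a fast set lookup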

Using dictionaries to improve algorithm efficiency

The nth triangle number is defined as the sum 1+2+...+n. I'm working on Project Euler problem 12 which asks to find the smallest triangle number that has over 500 divisors, so (in Python) I wrote two functions, mytri(n) and mydiv(n), to compute the nth triangle number, and the number of divisors of n, respectively. Then, I used a while loop that iterates until mydiv(mytri(n)) is greater than or equal to 500:
import math

def mytri(n):
    return n*(n+1)/2

def mydivs(n):
    num = 0
    max = math.floor(n/2)
    for k in range(1, max+1):
        if n%k == 0:
            num += 1
    return num+1

n = 1
while (mydivs(mytri(n)) <= 500): n += 1
print(mytri(n))
I thought I wrote mytri() and mydivs() pretty efficiently, but based on some tests, it seems like this program gets unwieldy very quickly. Computing the first number with over 100 divisors takes less than a second, but computing the first number with over 150 divisors takes about 8-9 seconds, which suggests it's probably exponential in time? I don't have much experience with computational complexity or writing efficient algorithms, but I once saw an example of using dictionaries (memoization, I think?) to greatly improve a recursive algorithm for computing the Fibonacci numbers, and I was wondering if a similar idea could be used here.
For example, the nth triangle number can be expressed as n(n+1)/2, so without loss of generality it's the product of an odd and even number, say n and (n+1)/2 respectively. If you could store the divisors for each number up to n in a dictionary, then you wouldn't have to redo the computations in mydiv(), and instead you could just reference the dictionary. The only issue is finding out which divisors between n and (n+1)/2 overlap to get the right number of them. Is this a reasonable line of attack? Or am I missing something here?
Additionally, what is the time complexity of my algorithm and how would I calculate it?
mytri(n)'s time complexity is O(1). mydivs(n)'s time complexity is O(n/2), which is O(n). The loop while (mydivs(mytri(n)) <= 500) has overall time complexity O(n^3), since it is a loop inside a loop: the outer loop runs N times, and each mydivs(mytri(n)) call loops over roughly N^2 values, because the triangle number grows as n^2. You can reduce mydivs(n)'s time complexity to O(sqrt(n)).
def new_mydivs(n):
    res = set()
    for i in range(1, int(n**0.5)+1):
        if n%i == 0:
            res.update([i, n//i])
    return len(res)  # returns the number of divisors
The time complexity of new_mydivs(n) is O(sqrt(n)).
Your code's performance for finding a number with 250 divisors:
import time
import math

def mytri(n):
    return n*(n+1)/2

def mydivs(n):
    num = 0
    max = math.floor(n/2)
    for k in range(1, max+1):
        if n%k == 0:
            num += 1
    return num+1

def main():
    n = 1
    while (mydivs(mytri(n)) <= 250): n += 1
    print(mytri(n))

startTime = time.time()
main()
print(time.time()-startTime)
output:
2162160.0
100.24735450744629
My code's performance for 250 divisors:
import time
import math

def mytri(n):
    return n*(n+1)/2

def mydivs(n):
    res = set()
    for i in range(1, int(n**0.5)+1):
        if n%i == 0:
            res.update([i, n//i])
    return len(res)  # returns the number of divisors

def main():
    n = 1
    while (mydivs(mytri(n)) <= 250): n += 1
    print(mytri(n))

startTime = time.time()
main()
print(time.time()-startTime)
output:
2162160.0
0.22459840774536133
for 500 divisors:
76576500.0
5.7917985916137695
for 750 divisors:
236215980.0
17.126375198364258
As you can see, the performance increased drastically.
RE: time complexity. You have two loops, one inside another: one runs up to N, the other up to N^2. This gives the O(N^3) time complexity.
You may use dictionaries to save partial results, but the overall complexity will still be O(N^3), just with a smaller constant factor, because you still have to loop over the remaining values.
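One such partial result follows directly from the observation in the question: n and (n+1)/2 (or n/2 and n+1) are coprime, and the divisor-count function is multiplicative over coprime factors. A sketch (mine, reusing new_mydivs() from above) that avoids factoring the full triangle number:
def tri_divs(n):
    # T(n) = n*(n+1)/2 splits into two coprime halves, so divisor counts multiply
    if n % 2 == 0:
        return new_mydivs(n // 2) * new_mydivs(n + 1)
    else:
        return new_mydivs(n) * new_mydivs((n + 1) // 2)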

Handling a List of 1000000+ Elements

I am trying to find a way to find the distribution of prime gaps for primes less than 100,000,000.
My method:
Step 1: Start with a TXT file "primes.txt" which has a list of primes (up to 10,000,000).
Step 2: Have the program read the file and then insert each number into a list, p1.
Step 3: Find the square root of the upper bound (10 times the upper bound of the primes in the TXT file, in this case, 100,000,000) and create another list, p2, with all the primes less than or equal to that square root, rounded up.
Step 4: Define an isPrime() method that checks whether the input is a prime number (N.B.: because I know that the numbers that will be checked are all less than 100,000,000, I only have to check whether the number is divisible by all the primes less than or equal to the square root of 100,000,000, which is 10,000)
Step 5: Add a list l which collects all the prime gaps, then iterate from 1 to 100,000,000, checking the primality of each number. If the number IS prime, then record the gap between it and the last prime number before it, and also write it to another document "primes2.txt".
Step 6: Output the list l.
The problem:
The program seems to take a LONG time to run. I have a feeling that the problem has to do with how I am managing the list due to its size (the Prime Number Theorem estimates about 620,420 elements in that list from "primes.txt"). Is there a way to reduce the runtime for this program by handling the list differently?
I've attached my code below.
import math
import sys

f = open("primes.txt", "r")
p1 = []
for i in f:
    p1.append(int(i))
f.close()

ml = 10000000
ms = math.sqrt(10*ml)
p2 = []
x1 = 0
while x1 < len(p1) and p1[x1] <= int(ms+0.5):
    p2.append(p1[x1])
    x1 += 1

def isPrime(n):
    for i in p2:
        if n%i == 0:
            if n/i == 1:
                return True
            return False
    return True

def main():
    l = [0]*1001  # 1,2,4,6,8,10,12,...,2000 (1, and then all evens up to 2000)
    lastprime = -1
    diff = 0
    fileobject = open("primes2.txt", 'w')
    for i in xrange(1, 10*ml):
        if isPrime(i):
            if i > 2:
                diff = i - lastprime
                if diff == 1:
                    l[0] += 1
                else:
                    l[diff/2] += 1
            lastprime = i
            fileobject.write(str(i)+"\n")
        if i%(ml/100) == 0:
            print i/float(ml/10), "% complete"
    fileobject.close()
    print l

main()
EDIT: Made a change in how the program reads from the file
Tips:
For the first few X million, you can generate primes as fast as you can read them from a file, but you need to do it efficiently; see below. (Generating primes up to 10 million on my recent MacBook Pro takes about 7 seconds in Python; generating primes up to 100,000,000 takes about 4 minutes. It will be faster in PyPy, and way faster in C or Swift or in Python with NumPy.)
You are careless with memory. Example: ps = f.read().split('\n') followed by p1 = [int(i) for i in ps] leaves ps sitting in memory unused. Wasteful. Use a loop that reads the file line by line so the memory can be used more efficiently; when you read line by line, the whole file does not sit idly in memory after conversion.
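A minimal sketch of the line-by-line style, which never holds the raw file contents alongside the parsed list:
p1 = []
with open("primes.txt") as f:   # same file name as in the question
    for line in f:
        p1.append(int(line))    # int() tolerates the trailing newline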
There are very good reasons that big primes are useful for cryptography; they take a long time to generate. Python is not the most efficient language to tackle this with.
Here is a sieve of Eratosthenes to try:
def sieve_of_e(n):
    """ Returns a list of primes < n """
    sieve = [True] * n
    for i in xrange(3, int(n**0.5)+1, 2):
        if sieve[i]:
            sieve[i*i::2*i] = [False]*((n-i*i-1)/(2*i)+1)
    return [2] + [i for i in xrange(3, n, 2) if sieve[i]]
Use the Sieve of Eratosthenes to upgrade the prime function. Here's a link:
Sieve of Eratosthenes - Finding Primes Python
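For the original prime-gap question, a sketch (mine) of how the sieve's output could be consumed: sieve once, then count the differences between consecutive primes. Note that sieving to 100,000,000 takes a fair amount of time and memory:
from collections import Counter
ps = sieve_of_e(100000000)
gaps = Counter(q - p for p, q in zip(ps, ps[1:]))
print(gaps.most_common(10))    # the ten most frequent gaps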

Most efficient way to find all factors with GMPY2 (or GMP)?

I know there's already a question similar to this, but I want to speed it up using GMPY2 (or something similar with GMP).
Here is my current code, it's decent but can it be better?
Edit: new code, checks divisors 2 and 3
import gmpy2
from gmpy2 import mpz

def factors(n):
    result = set()
    result |= {mpz(1), mpz(n)}

    def all_multiples(result, n, factor):
        z = mpz(n)
        while gmpy2.f_mod(mpz(z), factor) == 0:
            z = gmpy2.divexact(z, factor)
            result |= {mpz(factor), z}
        return result

    result = all_multiples(result, n, 2)
    result = all_multiples(result, n, 3)
    for i in range(1, gmpy2.isqrt(n) + 1, 6):
        i1 = mpz(i) + 1
        i2 = mpz(i) + 5
        div1, mod1 = gmpy2.f_divmod(n, i1)
        div2, mod2 = gmpy2.f_divmod(n, i2)
        if mod1 == 0:
            result |= {i1, div1}
        if mod2 == 0:
            result |= {i2, div2}
    return result
If it's possible, I'm also interested in an implementation with divisors only within n^(1/3) and 2^(2/3)*n^(1/3).
As an example, Mathematica's factor() is much faster than the Python code. I want to factor numbers between 20 and 50 decimal digits. I know ggnfs can factor these in less than 5 seconds.
I am also interested in whether any module implementing fast factorization exists in Python.
I just made some quick changes to your code to eliminate redundant name lookups. The algorithm is still the same but it is about twice as fast on my computer.
import gmpy2
from gmpy2 import mpz

def factors(n):
    result = set()
    n = mpz(n)
    for i in range(1, gmpy2.isqrt(n) + 1):
        div, mod = divmod(n, i)
        if not mod:
            result |= {mpz(i), div}
    return result

print(factors(12345678901234567))
Other suggestions will need more information about the size of the numbers, etc. For example, if you need all the possible factors, it may be faster to construct those from all the prime factors. That approach will let you decrease the limit of the range statement as you proceed and also will let you increment by 2 (after removing all the factors of 2).
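A sketch of that construction; prime_exponent_pairs is a hypothetical input of (prime, exponent) pairs from any factorization routine:
def divisors_from_primes(prime_exponent_pairs):
    divs = [1]
    for p, e in prime_exponent_pairs:
        # extend every known divisor by each power of p
        divs = [d * p**k for d in divs for k in range(e + 1)]
    return sorted(divs)

print(divisors_from_primes([(2, 2), (3, 1)]))   # divisors of 12: [1, 2, 3, 4, 6, 12]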
Update 1
I've made some additional changes to your code. I don't think your all_multiples() function is correct. Your range() statement isn't optimal either, since 2 is checked again, but my first fix made it worse.
The new code delays computing the co-factor until it knows the remainder is 0. I also tried to use the built-in functions as much as possible. For example, mpz % integer is faster than gmpy2.f_mod(mpz, integer) or gmpy2.f_mod(integer, mpz) where integer is a normal Python integer.
import gmpy2
from gmpy2 import mpz, isqrt

def factors(n):
    n = mpz(n)
    result = set()
    result |= {mpz(1), n}

    def all_multiples(result, n, factor):
        z = n
        f = mpz(factor)
        while z % f == 0:
            result |= {f, z // f}
            f += factor
        return result

    result = all_multiples(result, n, 2)
    result = all_multiples(result, n, 3)
    for i in range(1, isqrt(n) + 1, 6):
        i1 = i + 1
        i2 = i + 5
        if not n % i1:
            result |= {mpz(i1), n // i1}
        if not n % i2:
            result |= {mpz(i2), n // i2}
    return result

print(factors(12345678901234567))
I would change your program to just find all the prime factors less than the square root of n and then construct all the co-factors later. Then you decrease n each time you find a factor, check if n is prime, and only look for more factors if n isn't prime.
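A minimal sketch of that strategy (my illustration, not the answer's code; gmpy2.is_prime() is an existing gmpy2 primality test):
import gmpy2
from gmpy2 import mpz

def prime_factors(n):
    n = mpz(n)
    out = []
    p = mpz(2)
    while p * p <= n:
        while n % p == 0:
            out.append(p)
            n //= p                          # decrease n each time a factor is found
        if n > 1 and gmpy2.is_prime(n):
            break                            # the remainder is prime: stop trial division
        p += 1
    if n > 1:
        out.append(n)                        # whatever remains is a prime factor
    return out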
Update 2
The pyecm module should be able to factor the size numbers you are trying to factor. The following example completes in about a second.
>>> import pyecm
>>> list(pyecm.factors(12345678901234567890123456789012345678901, False, True, 10, 1))
[mpz(29), mpz(43), mpz(43), mpz(55202177), mpz(2928109491677), mpz(1424415039563189)]
There exist different Python factoring modules on the Internet. But if you want to implement factoring yourself (without using external libraries), then I can suggest the quite fast and very easy to implement Pollard-Rho algorithm. I implemented it fully in my code below; just scroll down to my code (at the bottom of the answer) if you don't want to read.
With great probability, the Pollard-Rho algorithm finds the smallest non-trivial factor P (not equal to 1 or N) within O(sqrt(P)) time. By comparison, the Trial Division algorithm that you implemented in your question takes O(P) time to find a factor P. This means, for example, that if a prime factor is P = 1 000 003, then trial division will find it after 1 000 003 division operations, while Pollard-Rho on average will find it after just 1 000 operations (sqrt(1 000 003) ≈ 1 000), which is much, much faster.
To make the Pollard-Rho algorithm much faster, we should be able to detect prime numbers, to exclude them from factoring and not waste time on them; for that, my code uses the Fermat primality test, which is very fast and easy to implement within just 7-9 lines of code.
The Pollard-Rho algorithm itself is very short, 13-15 lines of code; you can see it at the very bottom of my pollard_rho_factor() function. The remaining lines of code are supplementary helper functions.
I implemented all algorithms from scratch without using extra libraries (except the random module). That's why you can see my gcd() function there, although you can use Python's built-in math.gcd() instead (which finds the greatest common divisor).
You can see the function Int() in my code; it is used just to convert Python's integers to GMPY2 integers. GMPY2 ints make the algorithm faster; you can just use Python's int(x) instead. I didn't use any specific GMPY2 functions, just converted all ints to GMPY2 ints to get around a 50% speedup.
As an example, I factor the first 190 digits of Pi!!! It takes 3-15 seconds to factor them. The Pollard-Rho algorithm is randomized, hence it takes a different time to factor the same number on each run. You can restart the program and see that it will print a different running time.
Of course, factoring time depends greatly on the size of the prime divisors. Some 50-200 digit numbers can be factored within a fraction of a second; some will take months. My example of 190 digits of Pi has quite small prime factors, except the largest one; that's why it is fast. Other digits of Pi may not be that fast to factor. So the digit-size of the number doesn't matter very much; only the size of its prime factors matters.
I intentionally implemented pollard_rho_factor() as one standalone function, without breaking it into smaller separate functions, although this breaks Python's style guide, which (as I remember) suggests not having nested functions and placing all possible functions at global scope. The style guide also suggests doing all imports at global scope, in the first lines of the script. I made it a single function intentionally so that it is easily copy-pastable and fully ready to use in your code. The Fermat primality test sub-function is_fermat_probable_prime() is also copy-pastable and works without extra dependencies.
In very rare cases the Pollard-Rho algorithm may fail to find a non-trivial prime factor, especially for very small factors; for example, you can replace n inside test() with the small number 4 and see that Pollard-Rho fails. For such small failed factors you can easily use the Trial Division algorithm that you implemented in your question.
Try it online!
def pollard_rho_factor(N, *, trials = 16):
    # https://en.wikipedia.org/wiki/Pollard%27s_rho_algorithm
    import math, random

    def Int(x):
        import gmpy2
        return gmpy2.mpz(x) # int(x)

    def is_fermat_probable_prime(n, *, trials = 32):
        # https://en.wikipedia.org/wiki/Fermat_primality_test
        import random
        if n <= 16:
            return n in (2, 3, 5, 7, 11, 13)
        for i in range(trials):
            if pow(random.randint(2, n - 2), n - 1, n) != 1:
                return False
        return True

    def gcd(a, b):
        # https://en.wikipedia.org/wiki/Greatest_common_divisor
        # https://en.wikipedia.org/wiki/Euclidean_algorithm
        while b != 0:
            a, b = b, a % b
        return a

    def found(f, prime):
        print(f'Found {("composite", "prime")[prime]} factor, {math.log2(f):>7.03f} bits... {("Pollard-Rho failed to fully factor it!", "")[prime]}')
        return f

    N = Int(N)
    if N <= 1:
        return []
    if is_fermat_probable_prime(N):
        return [found(N, True)]
    for j in range(trials):
        i, stage, y, x = 0, 2, Int(1), Int(random.randint(1, N - 2))
        while True:
            r = gcd(N, abs(x - y))
            if r != 1:
                break
            if i == stage:
                y = x
                stage <<= 1
            x = (x * x + 1) % N
            i += 1
        if r != N:
            return sorted(pollard_rho_factor(r) + pollard_rho_factor(N // r))
    return [found(N, False)] # Pollard-Rho failed
def test():
    import time
    # http://www.math.com/tables/constants/pi.htm
    # pi = 3.
    # 1415926535 8979323846 2643383279 5028841971 6939937510 5820974944 5923078164 0628620899 8628034825 3421170679
    # 8214808651 3282306647 0938446095 5058223172 5359408128 4811174502 8410270193 8521105559 6446229489 5493038196
    # n = first 190 fractional digits of Pi
    n = 1415926535_8979323846_2643383279_5028841971_6939937510_5820974944_5923078164_0628620899_8628034825_3421170679_8214808651_3282306647_0938446095_5058223172_5359408128_4811174502_8410270193_8521105559_6446229489
    tb = time.time()
    print('N:', n)
    print('Factors:', pollard_rho_factor(n))
    print(f'Time: {time.time() - tb:.03f} sec')

test()
Output:
N: 1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489
Found prime factor, 1.585 bits...
Found prime factor, 6.150 bits...
Found prime factor, 20.020 bits...
Found prime factor, 27.193 bits...
Found prime factor, 28.311 bits...
Found prime factor, 545.087 bits...
Factors: [mpz(3), mpz(71), mpz(1063541), mpz(153422959), mpz(332958319), mpz(122356390229851897378935483485536580757336676443481705501726535578690975860555141829117483263572548187951860901335596150415443615382488933330968669408906073630300473)]
Time: 2.963 sec

Optimizing Prime Number Python Code

I'm relatively new to the Python world, and the coding world in general, so I'm not really sure how to go about optimizing my Python script. The script that I have is as follows:
import math

z = 1
x = 0
while z != 0:
    x = x+1
    if x == 500:
        z = 0
    calculated = open('Prime_Numbers.txt', 'r')
    readlines = calculated.readlines()
    calculated.close()
    a = len(readlines)
    b = readlines[(a-1)]
    b = int(b) + 1
    for num in range(b, (b+1000)):
        prime = True
        calculated = open('Prime_Numbers.txt', 'r')
        for i in calculated:
            i = int(i)
            q = math.ceil(num/2)
            if (q%i == 0):
                prime = False
        if prime:
            calculated.close()
            writeto = open('Prime_Numbers.txt', 'a')
            num = str(num)
            writeto.write("\n" + num)
            writeto.close()
            print(num)
As some of you can probably guess, I'm calculating prime numbers. The external file that it calls on contains all the prime numbers between 2 and 20.
The reason that I've got the while loop in there is that I wanted to be able to control how long it runs for.
If you have any suggestions for cutting out any clutter in there, could you please respond and let me know? Thanks.
Reading and writing to files is very, very slow compared to operations with integers. Your algorithm can be sped up 100-fold by just ripping out all the file I/O:
import itertools

primes = {2}                   # A set containing only 2
for n in itertools.count(3):   # Start counting from 3, by 1
    for prime in primes:       # For every prime less than n
        if n % prime == 0:     # If it divides n
            break              # Then n is composite
    else:
        primes.add(n)          # Otherwise, it is prime
        print(n)
A much faster prime-generating algorithm would be a sieve. Here's the Sieve of Eratosthenes, in Python 3:
end = int(input('Generate primes up to: '))
numbers = {n: True for n in range(2, end)}  # Assume every number is prime, and then
for n, is_prime in numbers.items():         # (Python 3 only)
    if not is_prime:
        continue                            # For every prime number
    for i in range(n ** 2, end, n):         # Cross off its multiples
        numbers[i] = False
    print(n)
It is very inefficient to keep storing and loading all primes from a file; in general, file access is very slow. Instead, save the primes to a list or deque: initialize calculated = deque(), then simply add new primes with calculated.append(num). At the same time, output your primes with print(num) and pipe the result to a file, as in the sketch below.
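A minimal sketch of that in-memory variant (the trial-division test here is only illustrative):
from collections import deque

calculated = deque([2])
for num in range(3, 100, 2):
    if all(num % p for p in calculated):   # crude trial division by known primes
        calculated.append(num)
        print(num)                         # pipe stdout to a file to keep a copy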
When you find out that num is not a prime, you do not have to keep checking all the other divisors. So break from the inner loop:
if q%i == 0:
    prime = False
    break
You do not need to go through all previous primes to check a new prime. Since each non-prime factorizes into two integers, at least one of the factors has to be smaller than or equal to sqrt(num). So limit your search to those divisors, as sketched below.
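A sketch of such a sqrt-limited test (mine; math.isqrt needs Python 3.8+, and primes must be sorted and cover everything up to sqrt(num)):
import math

def is_prime(num, primes):
    if num < 2:
        return False
    limit = math.isqrt(num)
    for p in primes:
        if p > limit:        # no divisor found up to sqrt(num)
            return True
        if num % p == 0:
            return False
    return True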
Also the first part of your code irritates me.
z = 1
x = 0
while z != 0:
    x = x+1
    if x == 500:
        z = 0
This part seems to do the same as:
for x in range(500):
Also, you limit with x to 500 iterations; why don't you simply use a counter instead, which you increase when a prime is found and check at the same time, breaking when the limit is reached? That would be more readable, in my opinion; see the sketch below.
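A sketch of that counter-based loop, with an inline primality check just for illustration:
count = 0
num = 1
while count < 500:
    num += 1
    if all(num % d for d in range(2, int(num ** 0.5) + 1)):   # simple primality check
        count += 1
        print(num)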
In general you do not need to introduce a limit. You can simply abort the program at any point in time by hitting Ctrl+C.
However, as others have already pointed out, your chosen algorithm will perform very poorly for medium or large primes. There are more efficient algorithms to find prime numbers: https://en.wikipedia.org/wiki/Generating_primes, especially https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes.
You're writing a blank line to your file, which makes int() raise an exception (traceback). Also, I'm guessing you need to rstrip() off your newlines.
I'd suggest using two different files - one for initial values, and one for all values - initial and recently computed.
If you can keep your values in memory a while, that'd be a lot faster than going through a file repeatedly. But of course, this will limit the size of the primes you can compute, so for larger values you might return to the iterate-through-the-file method if you want.
For computing primes of modest size, a sieve is actually quite good, and worth a google.
When you get into larger primes, trial division by the first n primes is good, followed by m rounds of Miller-Rabin. If Miller-Rabin probabilistically indicates the number is probably prime, then you do complete trial division, or AKS, or similar. Miller-Rabin can say "this is probably a prime" or "this is definitely composite"; AKS gives a definitive answer, but it's slower. A sketch of Miller-Rabin follows below.
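A compact sketch of Miller-Rabin with random bases (a deterministic variant would use a fixed, proven base set instead):
import random

def miller_rabin(n, rounds=40):
    # answers "definitely composite" (False) or "probably prime" (True)
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:                # write n-1 as d * 2^s with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        x = pow(random.randrange(2, n - 1), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False             # a witness of compositeness was found
    return True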
FWIW, I've got a bunch of prime-related code collected together at http://stromberg.dnsalias.org/~dstromberg/primes/
